Processing Multiple RSS Feeds in PHP

I have a table of more than 15,000 feeds and it's expected to grow. What I am trying to do is fetch new articles using SimplePie, synchronously, and store them in a DB.
Now I have run into a problem: since the number of feeds is high, my server stops responding and I am not able to fetch feeds any longer. I have also implemented some caching, and I fetch odd and even feeds at different time intervals.
What I want to know is: is there any way of improving this process? Maybe fetching feeds in parallel? Or maybe someone can give me a pseudo-algorithm for it.

15,000 Feeds? You must be mad!
Anyway, a few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but ensuring you have a decent amount of time to work in is a start.
Track Last Check against Feed URLs
Maybe add a field for each feed, last_check and have that field set to the date/time of the last successful pull for that feed.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set a cronjob and process, say 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased feed volumes, and it forms a strong foundation for any further maintenance tasks you might be looking at down the track.
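Tying the time-limit, last_check and batching ideas above together, here is only a minimal sketch of what the cron-driven batch script could look like. The feeds table, its last_check column and the store_item() helper are assumptions rather than your actual schema, and the SimplePie calls follow its classic API.

<?php
// run via cron, e.g. every 2 minutes: php fetch_batch.php
set_time_limit(300); // generous, but not unlimited

require 'vendor/autoload.php'; // assumes SimplePie is installed via Composer

$db = new PDO('mysql:host=localhost;dbname=feeds_db', 'user', 'pass');

// Pick the 100 feeds that have waited longest, but no more often than every 30 minutes each.
$stmt = $db->query(
    "SELECT id, url FROM feeds
     WHERE last_check < NOW() - INTERVAL 30 MINUTE
     ORDER BY last_check ASC
     LIMIT 100"
);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $feed) {
    $sp = new SimplePie();
    $sp->set_feed_url($feed['url']);
    $sp->enable_cache(true);
    $sp->init();

    foreach ($sp->get_items() as $item) {
        store_item($db, $feed['id'], $item); // hypothetical helper that INSERTs new articles
    }

    $db->prepare("UPDATE feeds SET last_check = NOW() WHERE id = ?")
       ->execute([$feed['id']]);
}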

fetch new articles using SimplePie, synchronously
What do you mean by "synchronously"? Do you mean consecutively in the same process? If so, this is a very dumb approach.
You need a way of sharding the data to run across multiple processes. Doing this statically based on, say, the modulus of the feed ID or a hash of the URL is not a good solution - one slow URL would hold up all the other feeds assigned to the same shard.
A better solution would be to start up multiple threads/processes which would each:
lock list of URL feeds
identify the feed with the oldest expiry date in the past which is not flagged as reserved
flag this record as reserved
unlock the list of URL feeds
fetch the feed and store it
remove the reserved flag on the list for this feed and update the expiry time
Note that if there are no expired records at step 2, the table should still be unlocked. What happens next depends on how you run the workers: if they run as daemons, implement an exponential back-off (e.g. sleep for 10 seconds, doubling up to 320 seconds on consecutive empty iterations); if they run as batches, simply exit.
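A minimal sketch of one such worker is below. It assumes a hypothetical feeds table with reserved and next_poll columns and a hypothetical fetch_and_store() helper, and it uses a transaction with SELECT ... FOR UPDATE in place of an explicit table lock, which gives the same reservation effect.

<?php
// One worker process; start several of these in parallel.
$db = new PDO('mysql:host=localhost;dbname=feeds_db', 'user', 'pass');
$backoff = 10;

while (true) {
    // Steps 1-4: reserve the most overdue, unreserved feed.
    $db->beginTransaction();
    $stmt = $db->query(
        "SELECT id, url FROM feeds
         WHERE reserved = 0 AND next_poll < NOW()
         ORDER BY next_poll ASC
         LIMIT 1
         FOR UPDATE"
    );
    $feed = $stmt->fetch(PDO::FETCH_ASSOC);

    if (!$feed) {
        $db->rollBack();
        sleep($backoff);                   // exponential back-off when nothing is due
        $backoff = min($backoff * 2, 320);
        continue;
    }

    $db->prepare("UPDATE feeds SET reserved = 1 WHERE id = ?")->execute([$feed['id']]);
    $db->commit();
    $backoff = 10;

    // Step 5: fetch and store outside the lock, so one slow URL doesn't block other workers.
    fetch_and_store($db, $feed);           // hypothetical helper

    // Step 6: release the reservation and schedule the next poll.
    $db->prepare("UPDATE feeds SET reserved = 0, next_poll = NOW() + INTERVAL 1 HOUR WHERE id = ?")
       ->execute([$feed['id']]);
}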

Thank you for your responses. I apologize for replying a little late; I got busy with this problem and later forgot about this post.
I have been researching this a lot and faced a lot of problems. You see, 15,000 feeds every day is not easy.
Maybe I am MAD! :) But I did solve it.
How?
I wrote my own algorithm. And YES! It's written in PHP/MySQL. I basically implemented a simple weighted machine-learning algorithm: it learns the posting pattern of each feed and then estimates the next polling time for that feed, which I save in my DB.
And since it's a learning algorithm, it improves with time. Of course, there are 'misses', but those misses are at least better than crashing servers. :)
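The answer doesn't spell out the formula, so the following is only a minimal sketch of one way such a weighted estimate could look: an exponentially weighted moving average of the gaps between a feed's recent items, clamped to sane bounds and used as the next polling interval. The function name and the next_poll column are hypothetical.

<?php
/**
 * Estimate the next poll time for a feed from its recent item timestamps.
 * $postTimes: unix timestamps of the feed's recent items, oldest first.
 */
function estimateNextPoll(array $postTimes, float $alpha = 0.3): int
{
    $avgGap = 3600; // start with a one-hour guess for new feeds
    for ($i = 1; $i < count($postTimes); $i++) {
        $gap = $postTimes[$i] - $postTimes[$i - 1];
        // Exponentially weighted moving average: recent gaps count more.
        $avgGap = $alpha * $gap + (1 - $alpha) * $avgGap;
    }
    // Never poll more often than every 10 minutes, or less often than daily.
    $avgGap = max(600, min($avgGap, 86400));
    return time() + (int) $avgGap;
}

// Usage: after parsing a feed, store the estimate back to the DB, e.g.
// UPDATE feeds SET next_poll = FROM_UNIXTIME(?) WHERE id = ?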
I have also written a paper on this, which got published in a local computer science journal.
Also, regarding the performance gain, I am getting a 500% to 700% improvement in speed compared to sequential polling.
How is it going so far?
I have a DB that has grown to TBs in size. I am using MySQL. Yes, I am facing performance issues with MySQL, but they're not severe. Most probably, I will be moving to some other DB or implementing sharding on my existing DB.
Why did I choose PHP?
Simple, because I wanted to show people that PHP and MySQL are capable of such things! :)

Related

Long iterative calculations in PHP/MySQL/Apache web-app: fight with timeouts

We have a PHP/MySQL/Apache web app which holds a rating system. From time to time we do full recalculations of the ratings, which means about 500 iterations of calculation, each taking 4-6 minutes and depending on the results of the previous iteration (i.e., parallel solutions are not possible). The time is mostly taken by MySQL queries and loops over each rated player (about 100,000 players per iteration, but the complex logic linking players leaves no possibility for parallelization there either).
The problem is that when we start the recalculation in the plain old way (one PHP POST request), it dies about 30-40 minutes after starting (which gives only 10-15 completed iterations). The question of "why it dies" and other optimization issues are out of scope for now - the logic is too complex and needs to be refactored, maybe even rewritten in another language/infrastructure, but we have no resources (time/people) for that right now. We just need to make things work in the least annoying way.
So, the question: what is the best way to organize such a recalculation so that, ideally, a site admin can start it with one click, forget about it for a day, and it still gets the job done?
I found a few pieces of advice on the web for similar problems, but no silver bullet:
move the iterations (and, therefore, the timeouts) from server to client by using AJAX requests instead of one plain old PHP request - this could freeze the browser (and AJAX's async nature is a poor fit for dependent iterations);
make PHP start a backend service which does the work (like advised here) - it sounds like a lot of work and I have no idea how to implement it.
So, I humbly ask for any advice possible in this situation.
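The "backend service" option in the last bullet is usually the least invasive: run the loop from the command line, where the web server's timeouts don't apply, and record progress so the admin can check on it. Below is only a minimal sketch under those assumptions; runIteration() and the recalc_progress table are hypothetical.

<?php
// recalc.php - run from the command line, e.g.:
//   nohup php recalc.php > recalc.log 2>&1 &
// CLI PHP normally has no time limit, but make it explicit:
set_time_limit(0);

$db = new PDO('mysql:host=localhost;dbname=ratings', 'user', 'pass');

$totalIterations = 500;
for ($i = 1; $i <= $totalIterations; $i++) {
    runIteration($db, $i); // hypothetical: one 4-6 minute rating iteration

    // Record progress so an admin page can display "iteration X of 500".
    $db->prepare("REPLACE INTO recalc_progress (id, iteration, updated_at) VALUES (1, ?, NOW())")
       ->execute([$i]);
}

The admin's one-click button can then simply launch this script in the background (e.g. via exec()), and a status page can read recalc_progress to show how far it has got.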

PHP, Calculating and Displaying Daily, Weekly and Monthly Stats

I am a newbie with PHP and therefore this is more of a conceptual question or maybe even a question about 'best practices'.
Often, I see websites with stats drawn from their database. For example, let's say it is a sales lead website. It may have stats at the top of the page like:
NEW SALES LEADS YESTERDAY: 123
NEW SALES LEADS THIS MONTH: 556
NEW SALES LEADS THIS YEAR: 3870
Obviously, this should not be calculated every time the page is displayed, right? That would potentially be a large burden on the server. How do people cache this type of data? Any best practices? I thought of writing a cron job that would calculate it on a daily basis and insert it into a database. What are your ideas? Thank you!
You can calculate it once and then store it in XCache. Here, however, there doesn't seem to be a need for a cron job: the query can run once and store the result in XCache. The important thing is to set the expiration time of the stored value according to your use case. For example, if you need to store daily stats like the above, set the expiration time to a few hours. For data which gets updated every minute, set the expiration time to a few minutes.
Something like this:
$ttl = 3600; // cache lifetime in seconds; tune to your use case
if (xcache_isset("newSalesLeadYest")) {
    $newSalesLeadYest = xcache_get("newSalesLeadYest");
} else {
    $newSalesLeadYest = runQueryToFetchStat(); // your existing query wrapper
    // Cache the value for $ttl seconds
    xcache_set("newSalesLeadYest", $newSalesLeadYest, $ttl);
}
What you need is to come up with a caching strategy.
Some factors to help you decide:
How frequent does the data change?
How important are the current values - is it OK if they're 1 min, 1 hr, 1 day old?
How expensive, time wise, is loading fresh data?
How much traffic are you getting? 10s, 100s, millions?
There are a few ways you can achieve the result.
You can use something like memcached to persist the data to avoid it being generated each request.
You can use http caching and load the data client side using javascript from an api.
You can have a background worker (eg. run by cron), which generates the latest figures and persists to a lookup database table.
You could improve the queries and indexes so that getting live data is fast enough to do every request
You could alter your database schema so that you have more static data
From the 3 examples you gave, 3 simple counts should not be expensive enough to warrant a complex caching system. If you can paste the SQL queries, we can help optimise them.
The data sounds like it will only get updated once per day, so a simple nightly cron "flatten" query would be a nice fit.
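As a minimal sketch of that nightly "flatten" job, assuming a hypothetical leads table with a created_at column and a daily_stats summary table with a unique key on stat_date:

-- run nightly from cron, e.g.: mysql stats_db < flatten.sql
INSERT INTO daily_stats (stat_date, new_leads)
SELECT DATE(created_at), COUNT(*)
FROM leads
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at)
ON DUPLICATE KEY UPDATE new_leads = VALUES(new_leads);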

Should I be using message queuing for this?

I have a PHP application that currently has 5k users, and that number will keep increasing for the foreseeable future. Once a week I run a script that:
fetches all the users from the database
loops through the users, and performs some upkeep for each one (this includes adding new DB records)
The last time this script ran, it only processed 1,400 users before dying due to a 30-second maximum execution time error. One solution I thought of was to have the main script still fetch all the users, but instead of performing the upkeep process itself, it would make an asynchronous cURL call (one for each user) to a new script that performs the upkeep for that particular user.
My concern here is that 5k+ cURL calls could bring down the server. Is this something that could be remedied by using a message queue instead of cURL calls? I have no experience using one, but from what I've read it seems like it might help. If so, which message queuing system would you recommend?
Some background info:
this is a Symfony project, using Doctrine as my ORM and MySQL as my DB
the server is a Windows machine, and I'm using Windows' task scheduler and wget to run this script automatically once per week.
Any advice and help is greatly appreciated.
If it's possible, I would make a scheduled task (cron job) that runs more often and uses LIMIT 100 (or some other number) to process a limited number of users at a time.
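For example, the batch could be selected like this, assuming a last_check column on the users table and a hypothetical doUpkeep() helper:

<?php
// Run every few minutes from the task scheduler / cron.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Grab the 100 users whose upkeep is most overdue (at most once a week each).
$stmt = $db->query(
    "SELECT id FROM users
     WHERE last_check < NOW() - INTERVAL 7 DAY
     ORDER BY last_check ASC
     LIMIT 100"
);

foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $userId) {
    doUpkeep($db, $userId); // hypothetical: the per-user upkeep work
    $db->prepare("UPDATE users SET last_check = NOW() WHERE id = ?")->execute([$userId]);
}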
A few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but more than 30 seconds would be a start.
Track Upkeep against Users
Maybe add a field for each user, last_check and have that field set to the date/time of the last successful "Upkeep" action performed against that user.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set a cronjob and process, say 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased user volumes, and it forms a strong foundation for any further maintenance tasks you might be looking at down the track.
Why don't you still use the cURL idea, but instead of processing only one user per call, send a batch of users to each call by splitting them into groups of 1,000 or so?
Have you considered changing your logic to commit changes as you process each user? It sounds like you may be running a single transaction to process all users, which may not be necessary.
How about just increasing the execution time limit of PHP?
Also, looking into whether you can make your upkeep procedure faster would help too. Depending on what exactly you are doing, you could also look into spreading it out a bit - do a few users at a time rather than everyone at once. But that depends on what exactly you're doing, of course.

Most effiecient way to display some stats using PHP/MySQL

I need to show some basic stats on the front page of our site like the number of blogs, members, and some counts - all of which are basic queries.
I'd prefer to find a method to run these queries, say, every 30 minutes and store the output, but I'm not sure of the best approach and I don't really want to use a cron job. Basically, I don't want to make thousands of queries per day just to display these results.
Any ideas on the best method for this type of function?
Thanks in advance
Unfortunately, cron is the better and more reliable solution.
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.
If you store the output in a file on disk, you can always check whether its filemtime() is less than 30 minutes old before re-running the expensive queries.
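A minimal sketch of that file-based check, with a hypothetical cache path and fetchStats() helper:

<?php
$cacheFile = '/tmp/site_stats.json';
$maxAge    = 30 * 60; // 30 minutes

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    $stats = json_decode(file_get_contents($cacheFile), true);
} else {
    $stats = fetchStats(); // hypothetical: runs the COUNT(*) queries
    file_put_contents($cacheFile, json_encode($stats));
}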
There is nothing at all wrong with using a cron to store this kind of stuff somewhere.
If you're looking for a bit more sophisticated caching methods, I suggest reading into memcached or APC, which could both provide a solution for your problem.
A cron job is the best approach; nothing else I've seen is as feasible.
There are many ways to do this. One approach (good, though perhaps not the best) is to store your data in a table and refresh it every 30 minutes using a long-running script that calls sleep() between runs.
I recommend you take a look at the WordPress blog system, and especially at the BuddyPress plugin.
I did the same thing some time ago, and every time someone loaded the page, the query did the job and retrieved the information from the database. I remember it was something like
SELECT COUNT(*) FROM my_table
and I got the number of posts in my case.
Anyway, there are many approaches. Good luck.
Don't forget: cron is always your best friend.
Using cron is the simplest way to solve the problem.
One good reason for not using cron: you'll be generating the stats even if nobody requests them.
Depending on how long it takes to generate the data (you might want to keep track of the previous counts and only add rows where the timestamp is greater than the previous run - with appropriate indexes!), you could instead trigger the regeneration when a request comes in and the data looks stale.
Note that you should keep the stats in the database, and think about how to implement a mutex to avoid multiple requests trying to update the cache at the same time.
However, the right solution would be to update the stats every time a record is added. Unless you've got very large traffic volumes, the overhead would be minimal. While 'SELECT COUNT(*) FROM some_table' will run very quickly, you'll obviously run into problems if you don't simply want to count all the rows in a table (e.g. if blogs and replies are held in the same table). Indeed, if you were to implement the stats update as a trigger on the relevant tables, you wouldn't need to make any changes to your PHP code.
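As a minimal sketch, a MySQL trigger for that last suggestion could look something like this, assuming hypothetical blogs and site_stats tables (one row per counter in site_stats):

CREATE TRIGGER blogs_after_insert
AFTER INSERT ON blogs
FOR EACH ROW
  -- keep the pre-computed counter in step with the data
  UPDATE site_stats SET stat_value = stat_value + 1 WHERE stat_name = 'blog_count';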

Cron job for big data

I'm working on a social network like FriendFeed. When a user adds his feed links, I use a cron job to parse each user's feed. Is this possible with a large number of users, e.g. parsing 10,000 links each hour, or will that cause problems? If it isn't possible, what do FriendFeed or RSS readers use to do that?
You might consider adding some information about your hardware to your question; it makes a big difference to anyone trying to advise you on how easily your implementation will scale.
If you end up parsing millions of links, one big cron job is going to become problematic. I am assuming you are doing the following (if not, you probably should):
Realizing when users subscribe to the same feed, to avoid fetching it twice (see the schema sketch at the end of this answer).
When fetching a new feed, checking for the existence of a sitemap that tells you how often the feed is likely to change, and re-visiting that value at a sensible interval.
Checking system load and memory usage to know when to 'back off' and go to sleep for a while.
This reduces the amount of sweat that an hourly cron would produce.
If you are harvesting millions of feeds, you'll probably want to distribute that work, something that you might want to keep in mind while you're still designing your database.
Again, please update your question with details on the hardware you are using and how big your solution needs to scale. Nothing scales 'infinitely', so please be realistic :)
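For the de-duplication point above, a common schema shape is to separate feeds from subscriptions so each distinct URL is fetched once per cycle regardless of how many users follow it. The table and column names here are only illustrative:

-- One row per distinct feed, no matter how many users subscribe to it.
CREATE TABLE feeds (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    url        VARCHAR(2048) NOT NULL,
    url_hash   CHAR(40) NOT NULL,          -- e.g. SHA1(url), for a compact unique key
    last_fetch DATETIME NULL,
    UNIQUE KEY uq_url_hash (url_hash)
);

-- Which users follow which feed.
CREATE TABLE user_feeds (
    user_id INT UNSIGNED NOT NULL,
    feed_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, feed_id)
);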
I don't have quite enough information to judge whether this design is good or not, but to answer the basic question: unless you are doing some very intensive processing on those 10k feeds, that should be trivial for an hourly cron job to handle.
More information on how you process the feeds, and in particular how the process scales with respect to number of users who have feeds and number of feeds per user, would be useful in giving you further advice.
Your limiting factor will be the network access to these 10,000 feeds. You could process the feeds serially and likely do 10,000 in an hour (you'd need to average about 360 ms per feed).
Of course you'd want to have more than one process doing the work simultaneously to speed things up.
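A minimal sketch of fetching one batch of feeds concurrently with PHP's curl_multi API; the URL list and the parse step are placeholders:

<?php
// Fetch a batch of feed URLs concurrently instead of one at a time.
$urls = ['https://example.com/feed1.xml', 'https://example.com/feed2.xml']; // placeholder list

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15); // don't let one slow feed stall the batch
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they complete.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch);
    // parseFeed($url, $body); // hypothetical: hand the XML to your parser
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);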
Whatever solution you select, if you meet with success (which I hope you do), you will have performance issues.
As the founder of FriendFeed has said many times: the only way to select the best solution is to profile and measure. With numbers, the choice will be obvious.
So: build a test architecture close to your expected (= realistic) situation a few months out, and profile/measure.
You might want to consider checking out IronWorker for big data jobs like this. It's made for them, and since it's a service you don't need to deal with servers or scaling. It has scheduling built in, so you could schedule a worker task to run each hour, and that task can then queue up 10,000 other jobs and run them all in parallel.
