Cron job for big data - PHP

I'm working on a social network like FriendFeed. When a user adds his feed links, I use a cron job to parse each user's feeds. Is this feasible with a large number of users, say parsing 10,000 links each hour, or will that cause problems? If it isn't feasible, what do FriendFeed or RSS readers use to do this?

You might consider adding some information about your hardware to your question; this makes a big difference to anyone trying to advise you on how easily your implementation will scale.
If you end up parsing millions of links, one big cron job is going to become problematic. I am assuming you are doing the following (if not, you probably should):
Realizing when users subscribe to the same feed, to avoid fetching it twice.
When fetching a new feed, checking for the existence of a sitemap that tells you how often the feed is likely to change, and revisiting that value at a sensible interval.
Checking system load and memory usage so you know when to 'back off' and sleep for a while (sketched below).
This reduces the amount of sweat that an hourly cron would produce.
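As a rough illustration of the back-off item above, here is a minimal sketch using PHP's built-in sys_getloadavg() and memory_get_usage(); the thresholds and the fetch_next_feed()/process_feed() helpers are made-up placeholders:

<?php
// Hypothetical thresholds - tune them for your hardware.
const MAX_LOAD   = 4.0;                   // 1-minute load average
const MAX_MEMORY = 256 * 1024 * 1024;     // bytes

function should_back_off(): bool
{
    $load = sys_getloadavg();             // [1 min, 5 min, 15 min]
    return $load[0] > MAX_LOAD || memory_get_usage(true) > MAX_MEMORY;
}

while (true) {
    if (should_back_off()) {
        sleep(30);                        // let the box recover before continuing
        continue;
    }
    $feed = fetch_next_feed();            // assumed helper: returns null when the queue is empty
    if ($feed === null) {
        break;
    }
    process_feed($feed);                  // assumed helper: fetch + parse + store
}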
If you are harvesting millions of feeds, you'll probably want to distribute that work, something that you might want to keep in mind while you're still designing your database.
Again, please update your question with details on the hardware you are using and how big your solution needs to scale. Nothing scales 'infinitely', so please be realistic :)

There isn't quite enough information here to judge whether this design is good or not, but to answer the basic question: unless you are doing some very intensive processing on those 10k feeds, it should be trivial for an hourly cron job to handle.
More information on how you process the feeds, and in particular how the process scales with respect to number of users who have feeds and number of feeds per user, would be useful in giving you further advice.

Your limiting factor will be the network access to these 10,000 feeds. You could process the feeds serially and likely do 10,000 in an hour (you'd need to average about 360 ms per feed).
Of course you'd want to have more than one process doing the work simultaneously to speed things up.
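To give an idea of how "more than one in flight at a time" might look, here is a minimal sketch using PHP's curl_multi API; the URLs, timeout and parsing step are placeholders:

<?php
// Placeholder list - in practice you'd pull these from your feeds table.
$urls = ['http://example.com/feed1.xml', 'http://example.com/feed2.xml'];

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // don't let one slow feed stall the batch
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they are finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);                  // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch);          // raw feed XML, ready for parsing
    // ... parse $body and store the items ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);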

Whatever solution you select, if you meet with success (which I hope you do), you will run into performance issues.
As the founder of FriendFeed has said many times: the only way to pick the best solution is to profile/measure. With numbers, the choice will be obvious.
So: build a test architecture close to your expected (i.e. realistic) situation a few months from now, and profile/measure.

You might want to consider checking out IronWorker for big data jobs like this. It's made for this and, since it's a service, you don't need to deal with servers or scaling. It has scheduling built in, so you would schedule a worker task to run each hour, and that task can then queue up 10,000 other jobs and run them all in parallel.

Related

Long iterative calculations in PHP/MySQL/Apache web-app: fight with timeouts

We have a PHP/MySQL/Apache web app that holds a rating system. From time to time we do full recalculations of the ratings, which means about 500 iterations of calculation, each taking 4-6 minutes and depending on the results of the previous iteration (i.e., parallel execution is not possible). Most of the time is spent in MySQL queries and in loops over each rated player (about 100,000 players per iteration; the complex logic linking players rules out parallelization here as well).
The problem is that when we start the recalculation in the plain old way (one PHP POST request), it dies about 30-40 minutes after starting (which gives only 10-15 completed iterations). The question of why it dies, and other optimization issues, are out of scope for now - the logic is too complex, needs refactoring, and maybe even a rewrite in another language/infrastructure, but we have no resources (time/people) for that right now. We just need to make things work in the least annoying way.
So, the question: what is the best way to organize such a recalculation so that the site admin can start it with one click, forget about it for a day, and still have it finish?
I found a few pieces of advice on the web for similar problems, but no silver bullet:
move the iterations (and therefore the timeouts) from server to client using AJAX requests instead of a plain old PHP request - this could freeze the browser (and AJAX's asynchronous nature is a poor fit for sequential iterations);
have PHP start a backend service that does the work (as advised here) - this would take a lot of work and I have no idea how to implement it (a rough sketch of the idea follows below).
So, I humbly ask for any advice possible in such a situation.
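For the "backend service" option in the last bullet, one common pattern - sketched here with made-up file names (recalc.php, the log and progress files) and an assumed run_single_iteration() helper - is to have the admin's click launch a long-running CLI script in the background, detached from the web request:

<?php
// admin_start_recalc.php - triggered by the admin's single click.
// Launch the CLI worker in the background and return immediately;
// nohup + output redirection detaches it from the Apache request.
exec('nohup php /path/to/recalc.php > /tmp/recalc.log 2>&1 &');
echo "Recalculation started - check /tmp/recalc.log for progress.";

<?php
// recalc.php - run from the CLI, so the web server's timeouts do not apply.
set_time_limit(0);                 // no execution time limit
for ($i = 1; $i <= 500; $i++) {    // ~500 iterations, as described in the question
    run_single_iteration($i);      // assumed: the existing rating-calculation logic
    file_put_contents('/tmp/recalc.progress', "iteration $i done\n", FILE_APPEND);
}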

How can I scale a database/CPU intensive script?

I currently have a PHP script that collects similar data from various sources; each data source is scraped and parsed every 120 seconds. At the moment I have 20 data sources, but I expect to integrate another 100 over the coming weeks.
Currently each data source is scraped in its own thread; there is one main PHP script that executes other scripts to perform the scraping work. This method allows all sources to be scraped at the same time, but it also puts a strain on the server and creates a bottleneck at the database (MySQL).
I'm looking for a way to scale my current application. Could I do something like this with AWS? Perhaps each of these scraping scripts could run in its own small server instance; each of these instances would be automatically created by a "main" instance and then die once the script has finished. I don't have any experience with AWS, so I'm not entirely sure if this is possible, or maybe it's just a bad idea.
The main question here is: How can I scale my current scraping script to allow for many new data sources? I'm interested in any solution even if I need to buy additional services.
You need a queueing system
You're describing a sort of worker/queue pattern, with your main server performing both the enqueueing and the worker execution, which of course puts a huge strain on your server.
First and foremost, your workers need to be asynchronous: you shouldn't be waiting for something that may or may not come back. You really should take a look at ZeroMQ which, I might add, contains some of the best documentation on the planet. If you're willing to learn, take a look at how this works and follow some tutorials; there are plenty out there. Have your queue, hosted on your main server, take on new jobs and dispatch them elsewhere (i.e. to other boxes).
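To give a flavour of the queue side with ZeroMQ, here is a minimal PUSH/PULL sketch assuming the php-zmq extension; the port, hostname, payload format and the get_pending_sources()/scrape_and_store() helpers are placeholders:

<?php
// queue.php - runs on the main server and hands out scrape jobs.
$context = new ZMQContext();
$push = $context->getSocket(ZMQ::SOCKET_PUSH);
$push->bind('tcp://*:5557');

foreach (get_pending_sources() as $source) {   // assumed helper: reads your sources table
    $push->send(json_encode(['id' => $source['id'], 'url' => $source['url']]));
}

<?php
// worker.php - runs on each worker box, pulls jobs and scrapes them.
$context = new ZMQContext();
$pull = $context->getSocket(ZMQ::SOCKET_PULL);
$pull->connect('tcp://main-server:5557');      // hostname is a placeholder

while (true) {
    $job = json_decode($pull->recv(), true);   // blocks until a job arrives
    scrape_and_store($job['url']);             // assumed helper: your existing scrape code
}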
Horizontal Scaling
You can create some sort of Instance Controller to handle AWS instances. You really just need to sit down and think about your logic (when do I want this many boxes, when do I want to shut them down). The API is pretty simple to use once you get your head around it. Here's some code I wrote a while back to wrap Amazon's SDK for PHP. I'm not sure it still works 100% with the latest version (I used it around a year ago), but the concepts are there - you have simple methods like startBox() or stopBox() that you call from your queue, and your box automatically starts doing its work once it boots.
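The linked wrapper isn't reproduced here, but the startBox()/stopBox() idea can be sketched against the current AWS SDK for PHP (v3); the AMI ID, region and instance type below are placeholders:

<?php
require 'vendor/autoload.php';

use Aws\Ec2\Ec2Client;

class InstanceController
{
    private Ec2Client $ec2;

    public function __construct()
    {
        $this->ec2 = new Ec2Client(['region' => 'us-east-1', 'version' => 'latest']);
    }

    // Launch one worker box and return its instance ID.
    public function startBox(): string
    {
        $result = $this->ec2->runInstances([
            'ImageId'      => 'ami-xxxxxxxx',   // placeholder: an AMI with your worker code baked in
            'InstanceType' => 't2.micro',
            'MinCount'     => 1,
            'MaxCount'     => 1,
        ]);
        return $result['Instances'][0]['InstanceId'];
    }

    // Terminate a worker box once its jobs are done.
    public function stopBox(string $instanceId): void
    {
        $this->ec2->terminateInstances(['InstanceIds' => [$instanceId]]);
    }
}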
You could use Amazon's t1.micro instances (see their pricing page), which fall under the free tier up to a certain limit.
Get this working properly, with a loop on your main server deciding how many boxes you need running at any one time given certain conditions (the number of jobs in your database table, for example), and you'll have theoretically infinite scaling. Here's how I did it for my code (a rough sketch of the loop follows the tiers below):
Tier 1: > 5 jobs, < 10 jobs = 1 box
Tier 2: > 10 jobs, < 20 jobs = 2 boxes
etc. etc.
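That tiering rule is easy to express in code. A sketch of the control loop on the main server, assuming the jobs table, the PDO credentials and the InstanceController sketched above:

<?php
// Decide how many worker boxes should be running for a given backlog.
function boxesNeeded(int $pendingJobs): int
{
    if ($pendingJobs <= 5) {
        return 0;
    }
    return (int) ceil($pendingJobs / 10);   // roughly: 6-10 jobs => 1 box, 11-20 => 2, etc.
}

$pdo        = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$controller = new InstanceController();     // from the sketch above
$running    = [];                           // instance IDs we have started

while (true) {
    $pending = (int) $pdo->query('SELECT COUNT(*) FROM jobs WHERE done = 0')->fetchColumn();
    $target  = boxesNeeded($pending);

    while (count($running) < $target) {
        $running[] = $controller->startBox();
    }
    while (count($running) > $target) {
        $controller->stopBox(array_pop($running));
    }
    sleep(60);                              // re-evaluate once a minute
}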
Advice
Log everything. Log every box coming up and every box going down. Calculate your costs in your code and store them, maybe in a database, or log them, so you know exactly how much you're spending - you don't want things to get out of hand.
Make sure you open up your DB ports so your instances can talk to your DB to report when a job is done, or anything else you need to pass between your "master" box and your "slave" boxes.
Also, AWS bills by the hour, so record the time you start each box and, when it's time to shut one down, only actually shut it down once 55 minutes or so have passed - you might as well get the extra minutes you're paying for.
I can't really think of anything else. Do your research, figure out the best way to build a queueing system, and build it with scalability in mind (it can react and change to numbers that you control).
Split your scraping up across multiple instances (say 5 per server) and have them talk to a central DB like Amazon RDS.
No need to kill the instances after you have finished scraping if you're doing this every 120 seconds.

Processing Multiple RSS Feeds in PHP

I have a table of more than 15,000 feeds, and it's expected to grow. What I am trying to do is fetch new articles using SimplePie, synchronously, and store them in a DB.
Now I have run into a problem: since the number of feeds is high, my server stops responding and I am no longer able to fetch feeds. I have also implemented some caching, and I fetch odd and even feeds at different time intervals.
What I want to know is whether there is any way of improving this process - maybe fetching feeds in parallel. Or perhaps someone can give me a pseudo-algorithm for it.
15,000 Feeds? You must be mad!
Anyway, a few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but ensuring you have a decent amount of time to work in is a start.
Track Last Check against Feed URLs
Maybe add a field to each feed, last_check, and set it to the date/time of the last successful pull for that feed.
Process Smaller Batches
It's better to run smaller batches more often. Think of it as the PHP equivalent of putting your eggs in more than one basket. With the last_check field above, it is easy to identify the feeds that have waited longest since their last update, and also to set a threshold for how often to process them.
Run More Often
Set up a cron job and process, say, 100 records every 2 minutes or so.
Log and Review your Performance
Keep log files and record stats: how many records were processed, how long it had been since they were last processed, how long the script took. These metrics will allow you to tweak the batch sizes, cron settings, time limits, etc. to ensure that the maximum number of checks is performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will let you handle increased user volumes and will form a strong foundation for any further maintenance tasks you might be looking at down the track.
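Putting the last_check field and small batches together, here is a sketch of what the cron-invoked script could look like; the table/column names, DB credentials and store_item() helper are assumptions, and SimplePie is used as in the question:

<?php
require 'vendor/autoload.php';

set_time_limit(110);   // stay just under the 2-minute cron interval

$pdo = new PDO('mysql:host=localhost;dbname=feeds_db', 'user', 'pass');

// Grab the 100 feeds that have waited longest since their last successful check.
$stmt   = $pdo->query('SELECT id, url FROM feeds ORDER BY last_check ASC LIMIT 100');
$update = $pdo->prepare('UPDATE feeds SET last_check = NOW() WHERE id = ?');

foreach ($stmt as $row) {
    $feed = new SimplePie();
    $feed->set_feed_url($row['url']);
    $feed->enable_cache(false);        // bookkeeping happens in our own DB
    $feed->init();

    if ($feed->error()) {
        continue;                      // leave last_check alone so it is retried soon
    }
    foreach ($feed->get_items() as $item) {
        store_item($row['id'], $item); // assumed helper: INSERT IGNORE into your articles table
    }
    $update->execute([$row['id']]);
}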
fetch new articles using simplepie, synchronously
What do you mean by "synchronously"? Do you mean consecutively in the same process? If so, this is a very dumb approach.
You need a way of sharding the data to run across multiple processes. Doing this declaratively based on, say, the modulus of the feed ID or the hash of the URL is not a good solution - one slow URL would cause multiple feeds to be held up.
A better solution would be to start up multiple threads/processes which would each:
lock list of URL feeds
identify the feed with the oldest expiry date in the past which is not flagged as reserved
flag this record as reserved
unlock the list of URL feeds
fetch the feed and store it
remove the reserved flag on the list for this feed and update the expiry time
Note that if there are no expired records at step 2, the table should be unlocked; what happens next depends on whether you run the threads as daemons (in which case they should implement an exponential back-off, e.g. sleeping for 10 seconds and doubling up to 320 seconds over consecutive empty iterations) or as batches (in which case they should simply exit).
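A minimal sketch of steps 1-6 using MySQL row locking through PDO; the table and column names are assumptions, and InnoDB's SELECT ... FOR UPDATE stands in for an explicit list lock:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=feeds_db', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

function claim_feed(PDO $pdo): ?array
{
    $pdo->beginTransaction();
    // Steps 1-2: lock and pick the most-overdue, unreserved feed.
    $row = $pdo->query(
        'SELECT id, url FROM feeds
          WHERE expires_at < NOW() AND reserved = 0
          ORDER BY expires_at ASC LIMIT 1 FOR UPDATE'
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {               // nothing expired: unlock and let the caller back off
        $pdo->commit();
        return null;
    }
    // Steps 3-4: flag the record as reserved and release the lock.
    $pdo->prepare('UPDATE feeds SET reserved = 1 WHERE id = ?')->execute([$row['id']]);
    $pdo->commit();
    return $row;
}

// Steps 5-6, run by each worker process:
while ($feed = claim_feed($pdo)) {
    fetch_and_store($feed['url']);      // assumed helper: download + parse + save
    $pdo->prepare(
        'UPDATE feeds SET reserved = 0, expires_at = DATE_ADD(NOW(), INTERVAL 1 HOUR) WHERE id = ?'
    )->execute([$feed['id']]);
}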
Thank you for your responses. I apologize for replying a little late - I got busy with this problem and later forgot about this post.
I have been researching this a lot and faced plenty of problems. You see, 15,000 feeds every day is not easy.
Maybe I am MAD! :) But I did solve it.
How?
I wrote my own algorithm. And YES, it's written in PHP/MySQL. I basically implemented a simple weighted machine-learning algorithm: it learns the posting times of each feed and then estimates the next polling time for that feed, which I save in my DB.
And since it's a learning algorithm, it improves over time. Of course, there are 'misses', but those misses are at least better than crashing servers. :)
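For illustration only - this is a crude sketch, not the algorithm described above - an exponential moving average of the gaps between posts is one way such a next-poll estimate can be computed:

<?php
// Given the Unix timestamps of a feed's recent items, estimate when to poll it next.
// This is NOT the author's algorithm, just a simple moving-average illustration.
function estimate_next_poll(array $itemTimestamps, float $alpha = 0.3): int
{
    sort($itemTimestamps);                          // oldest first
    $avgGap = 3600.0;                               // start by assuming one post per hour
    for ($i = 1; $i < count($itemTimestamps); $i++) {
        $gap    = $itemTimestamps[$i] - $itemTimestamps[$i - 1];
        $avgGap = $alpha * $gap + (1 - $alpha) * $avgGap;   // recent gaps weigh more
    }
    // Poll somewhat sooner than the average gap, but leave at least 5 minutes between polls.
    return time() + (int) max(300, $avgGap * 0.75);
}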
I have also written a paper on this, which was published in a local computer science journal.
As for the performance gain, I am seeing a 500% to 700% improvement in speed compared with sequential polling.
How is it going so far?
I have a DB that has grown to TBs in size. I am using MySQL and, yes, I am facing performance issues with it, but nothing major. Most probably I will move to another DB or add sharding to my existing one.
Why did I choose PHP?
Simple, because I wanted to show people that PHP and MySQL are capable of such things! :)

Most efficient way to display some stats using PHP/MySQL

I need to show some basic stats on the front page of our site, like the number of blogs, members, and a few other counts - all of which are basic queries.
I'd prefer to find a method that runs these queries, say, every 30 minutes and stores the output, but I'm not sure of the best approach and I don't really want to use a cron job. Basically, I don't want to make thousands of queries per day just to display these results.
Any ideas on the best method for this type of function?
Thanks in advance
Unfortunately, cron is the better and more reliable solution.
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.
If you store the output in a disk file, you can always check whether its filemtime is less than 30 minutes old before re-running the expensive queries.
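A minimal sketch of that filemtime check; the cache path, DB credentials and queries are placeholders:

<?php
$cacheFile = '/tmp/front_page_stats.json';   // placeholder path

if (!file_exists($cacheFile) || time() - filemtime($cacheFile) > 30 * 60) {
    // Cache is missing or older than 30 minutes: re-run the expensive queries.
    $pdo   = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
    $stats = [
        'blogs'   => (int) $pdo->query('SELECT COUNT(*) FROM blogs')->fetchColumn(),
        'members' => (int) $pdo->query('SELECT COUNT(*) FROM members')->fetchColumn(),
    ];
    file_put_contents($cacheFile, json_encode($stats));
} else {
    $stats = json_decode(file_get_contents($cacheFile), true);
}

echo "Blogs: {$stats['blogs']}, Members: {$stats['members']}";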
There is nothing at all wrong with using a cron to store this kind of stuff somewhere.
If you're looking for a bit more sophisticated caching methods, I suggest reading into memcached or APC, which could both provide a solution for your problem.
A cron job is the best approach; I haven't seen anything else that is feasible.
There are many ways to do this. One good (though maybe not the best) option: store your data in a table and refresh it every 30 minutes using the sleep() function.
I recommend you take a look at the WordPress blog system, and especially at the BuddyPress plugin.
I did the same thing some time ago: every time someone loads the page, the query does the job and retrieves the information from the database. I remember it was something like
SELECT COUNT(*) FROM my_table
and that gave me the number of posts in my case.
Anyway, there are many approaches. Good luck.
Don't forget: cron is always your best friend.
Using cron is the simplest way to solve the problem.
One good reason for not using cron: you'll be generating the stats even if nobody requests them.
Depending on how long it takes to generate the data (you might want to keep track of the previous counts and just add the rows whose timestamp is greater than the previous run - with appropriate indexes!), you could trigger the recalculation when a request comes in and the data looks stale.
Note that you should keep the stats in the database and think about how to implement a mutex to avoid multiple requests trying to update the cache at the same time.
However, the right solution would be to update the stats every time a record is added. Unless you have very large traffic volumes, the overhead would be minimal. While SELECT COUNT(*) FROM some_table runs very quickly, you'll obviously run into problems if you don't simply want to count all the rows in a table (e.g. if blogs and replies are held in the same table). Indeed, if you implemented the stats update as a trigger on the relevant tables, you wouldn't need to make any changes to your PHP code.
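To make the trigger idea concrete, here is a sketch of what such a trigger could look like, assuming a one-row stats table and a blogs table:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

// Assumed one-row counter table: CREATE TABLE stats (blog_count INT NOT NULL DEFAULT 0);
$pdo->exec("
    CREATE TRIGGER blogs_after_insert
    AFTER INSERT ON blogs
    FOR EACH ROW
        UPDATE stats SET blog_count = blog_count + 1
");

// The front page then reads the pre-computed value instead of counting rows:
$blogCount = (int) $pdo->query('SELECT blog_count FROM stats')->fetchColumn();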

Need advice on cron job'ing a very large process

I have a PHP script that grabs data from an external service and saves it to my database. I need this script to run once every minute for every user in the system (of which I expect thousands). My question is: what's the most efficient way to run this per user, per minute? At first I thought I would have a function that grabs all the user IDs from my database, iterates over the IDs and performs the task for each one, but I think that as the number of users grows, this will take longer and no longer fit within 1-minute intervals. Perhaps I should queue the user IDs and perform the task individually for each one? In which case, I'm actually unsure how to proceed.
Thanks in advance for any advice.
Edit
To answer Oddthinking's question:
I would like to start the processes for each user at the same time. When the process for each user completes, I want to wait 1 minute, then begin the process again. So I suppose each process for each user should be asynchronous - the process for user 1 shouldn't care about the process for user 2.
To answer sims' question:
I have no control over the external service, and the users of the external service are not the same as the users in my database. I'm afraid I don't know any other scripting languages, so I need to use PHP to do this.
Am I summarising correctly?
You want to do thousands of tasks per minute, but you are not sure if you can finish them all in time?
You need to decide what to do when you start running over your schedule.
Do you keep going until you finish, and then immediately start over?
Do you keep going until you finish, then wait one minute, and then start over?
Do you abort the process, wherever it got to, and then start over?
Do you slow down the frequency (e.g. from now on, just every 2 minutes)?
Do you have two processes running at the same time, and hope that the next run will be faster (this might work if you are clearing up a backlog the first time, so the second run will run quickly.)
The answers to these questions depend on the application. Cron might not be the right tool for you, depending on the answer. You might be better off having a process permanently running and scheduling itself.
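A sketch of that "permanently running, scheduling itself" approach; process_all_users() is an assumed stand-in for the per-user work:

<?php
// Started once (e.g. via nohup or a process supervisor) instead of from cron.
set_time_limit(0);

$interval = 60;                         // desired seconds between runs

while (true) {
    $start = time();
    process_all_users();                // assumed: your existing per-user fetch/store logic
    $elapsed = time() - $start;

    if ($elapsed < $interval) {
        sleep($interval - $elapsed);    // finished early: wait out the rest of the minute
    }
    // If $elapsed >= $interval, the next run simply starts immediately -
    // one of the "what do you do when you overrun" policies listed above.
}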
So, let me get this straight: you are querying an external service (what? SOAP? MySQL?) every minute for every user in the database and storing the results in the same database. Is that correct?
It seems like a design problem.
If the users on the external service are the same as the users in your database, perhaps the two should be more closely coupled. I don't know if PHP is the way to go for syncing this data. If you give more detail, we could think about another solution. If you are in control of the external service, you may want to have that service dump its data or even write directly to the database. Some other syncing mechanism might be better.
EDIT
It seems that you are making an application that stores data for a user that can then be viewed chronologically. Otherwise you may as well just fetch the data when the user requests it.
Fetch all the user IDs in one go.
Iterate over them one by one (assuming the data being fetched is unique to each user) and - you'll have to be creative here, as PHP threads do not exist AFAIK - spawn a process for each request, since you want them all to execute at the same time rather than be delayed if one user doesn't return data.
Each such process should insert the data it gets back into the DB as soon as it is returned.
As for cron being right for the job: As long as you have a powerful enough server that can handle thousands of the above cron jobs running simultaneously, you should be fine.
You could get creative with several PHP scripts. I'm not sure, but if every CLI call to PHP starts a new PHP process, then you could do it like that.
foreach ($users as $user) {
    // Escape the argument and run each fetch in the background so the
    // processes run concurrently instead of blocking one after another.
    $arg = escapeshellarg($user);
    shell_exec("php fetchdata.php $arg > /dev/null 2>&1 &");
}
This is all very heavy, and you should not expect it to be snappy with PHP. Do some tests; don't take my word for it.
Databases are made to process records in BULK. If you're processing them one by one, you're looking for trouble. You need to find a way to batch up your "every minute" task so that a SINGLE (possibly complicated) query retrieves all of the affected users' info; then you do the PHP processing on the whole result; then, in another single query, you PUSH the results back into the DB.
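As a sketch of that batch approach (table names, column names and the fetch_from_external_service() helper are assumptions): one query pulls every user due for an update, PHP processes the whole set, and one multi-row statement writes the results back:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// One query: fetch every user due for an update.
$users = $pdo->query(
    'SELECT id, external_key FROM users WHERE last_fetch < NOW() - INTERVAL 1 MINUTE'
)->fetchAll(PDO::FETCH_ASSOC);

// PHP processing on the whole result set.
$rows = [];
foreach ($users as $user) {
    $data   = fetch_from_external_service($user['external_key']); // assumed helper
    $rows[] = [$user['id'], json_encode($data)];
}

// One multi-row write back to the DB.
if ($rows) {
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, NOW())'));
    $stmt = $pdo->prepare(
        "INSERT INTO user_data (user_id, payload, fetched_at) VALUES $placeholders"
    );
    $stmt->execute(array_merge(...$rows));
}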
Based on your big-picture description it sounds like you have a dead-end design. If you are able to get it working right now, it'll most likely be very fragile and it won't scale at all.
I'm guessing that if you have no control over the external service, then that external service might not be happy about getting hammered by your script like this. Have you approached them with your general plan?
Do you really need to do all users every time? Is there any sort of timestamp you can use to be more selective about which users need "updates"? Perhaps if you could describe the goal a little better we might be able to give more specific advice.
Given your clarification of wanting to run the processing of users simultaneously...
The simplest solution that jumps to mind is to have one thread per user. On Windows, threads are significantly cheaper than processes.
However, whether you use threads or processes, having thousands running at the same time is almost certainly unworkable.
Instead, have a pool of threads. The size of the pool is determined by how many threads your machine can comfortably handle at a time. I would expect numbers in the range of 30-150 to be about as far as you might want to go, but it depends very much on the hardware's capacity, and I might be off by an order of magnitude.
Each thread would grab the next user due to be processed from a shared queue, process it, and put it back at the end of the queue, perhaps with a date before which it shouldn't be processed.
(Depending on the amount and type of processing, this might be done on a separate box to the database, to ensure the database isn't overloaded by non-database-related processing.)
This solution ensures that you are always processing as many users as you can, without overloading the machine. As the number of users increases, they are processed less frequently, but always as quickly as the hardware will allow.
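PHP has no native threads in a typical build, but the pooled-worker idea translates to processes. A sketch using pcntl_fork, where the pool size, queue helpers and DB credentials are assumptions:

<?php
// Run from the CLI. Each child claims users from a shared queue table until it is empty.
const POOL_SIZE = 30;   // tune to what the hardware can comfortably handle

$children = [];
for ($i = 0; $i < POOL_SIZE; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {
        // Child: open its own DB connection (connections must not be shared across forks).
        $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
        while ($userId = claim_next_user($pdo)) {   // assumed helper: like the reserve pattern above
            process_user($userId);                  // assumed helper: fetch + store for one user
            requeue_user($pdo, $userId);            // assumed helper: push back with a "not before" time
        }
        exit(0);
    }
    $children[] = $pid;
}

// Parent: wait for the pool to drain.
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}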
