I need to run some tasks continuously. These tasks consist mainly of retrieving specific records from the DB, analyzing them and saving them. This is a non-trivial analysis, which might take several seconds (perhaps more than a minute).
I do not know how frequently new records will be saved in the DB waiting for analysis (there's another cronjob for that).
Should I retrieve records one by one calling the same analysis function again once it finishes (recursively) and try to keep the cronjob running until there are no more unanalyzed records?
Or should I retrieve a fixed amount of new records on each cronjob run and call the cronjob every certain amount of minutes?
A job queue server may work well for this scenario (see ActiveMQ or MemcacheQ, for example). Rather than adding the un-analyzed records directly to the database, send them to a queue for processing. Then your cron job could retrieve some items from the queue for processing, and if one job takes so long that the cron job is triggered again, the next run will simply grab the next items in the queue.
Personally, I would have the cron job retrieve a fixed number of records for processing, just to make sure the script doesn't get stuck processing for a very long time if new records keep getting added faster than the processor can keep up. It would probably finish everything eventually, but a single run could end up going on for a very long time.
You may also consider creating a lock file that the job can check to see if the task processor is already running. For example, when the cron job starts, check for the existence of a file (e.g. processor.lock); if it exists, exit; if not, create the file, process some records, and delete the file when done.
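A minimal sketch of that approach, combining the lock file with a fixed batch size (the records table, the analyzed column and the analyze() function are placeholders for your own schema and analysis code):

<?php
// Bail out if a previous run is still busy.
$lockFile = '/tmp/processor.lock';
if (file_exists($lockFile)) {
    exit;
}
touch($lockFile);

// Grab a fixed batch of unanalyzed records (placeholder table/column names).
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query('SELECT * FROM records WHERE analyzed = 0 LIMIT 50')->fetchAll();

foreach ($rows as $row) {
    analyze($row); // your long-running analysis
    $pdo->prepare('UPDATE records SET analyzed = 1 WHERE id = ?')->execute([$row['id']]);
}

unlink($lockFile);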
Hope that helps.
Or should I retrieve a fixed amount of new records on each cronjob run and call the cronjob every certain amount of minutes?
That. And you'll have to do some trial-and-error measurements first to decide on an optimal fixed amount.
Of course it heavily depends on what you are actually doing, how many DB-intensive cron jobs you are running simultaneously and what kind of setup you have. I recently spent a day hunting a Heisenbug in a very intensive script that migrated images from the DB to S3 (and created a few thumbnails while migrating). The problem was that, due to undocumented behaviour in our ORM, the connection to the database was lost at some point, because posting to S3 plus thumbnail generation for certain images took a little longer than the connection time limit. It was an ugly situation that would probably have cost more than a day to identify in a recursive do-it-all scheme.
You'd be better off with the safe approach, even if it means a little time lost between cron executions.
Instead of using a cron job, I would use The Fat Controller to run and repeat tasks. It is basically a daemon which can run any script or application and restart it after it finishes, optionally with a delay between runs.
You can additionally specify a timeout so that long-running scripts will be stopped. This way you don't need to care about locking, long-running processes, error handling and so on. It will help to keep your business logic clean.
There's more examples and use cases on the website:
http://fat-controller.sourceforge.net/
I have a cron job script that runs every 60 seconds to process and store results in a database. That’s a maximum of 1,440 new database entries per day.
I need to have many many millions of database entries, so doing this with just one instance of this script is really impractical. I’m looking for a minimum of a 50x speed up, and ideally 300x to 500x if the cost is reasonable.
It seems like I need a server farm, but I have to use Amazon Web Services to process this data. How can I set this script up to run many simultaneous instances, while storing the data in a single, unified database?
Do I need to create completely separate server instances for every time I want to run this script, multiplying the cost?
Thank you for your help!
A serverless approach, using a remote Lambda function triggered by a queue system to execute your job, solves your problem both technically and at the pricing level.
https://aws.amazon.com/lambda/
For example, you can trigger Lambda function executions from a local centralized script (e.g. a single cron) which enqueues one message per entry you need to compute, in an asynchronous/concurrent way.
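As a rough sketch of the enqueueing side using the AWS SDK for PHP (the queue URL and the shape of $entriesToProcess are assumptions; adapt them to your data):

<?php
require 'vendor/autoload.php';

use Aws\Sqs\SqsClient;

// One message per entry to process; a Lambda function subscribed to the queue
// then does the actual work concurrently.
$sqs = new SqsClient(['region' => 'us-east-1', 'version' => 'latest']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/work-queue'; // placeholder URL

foreach ($entriesToProcess as $entry) {
    $sqs->sendMessage([
        'QueueUrl'    => $queueUrl,
        'MessageBody' => json_encode($entry),
    ]);
}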
The serverless framework can help you to avoid AWS lock-in:
https://serverless.com/
I have a database where I have to save a lot of data.
All this data is recovered from an unofficial API that retrieves data from the official site.
The unofficial API is the only one I can use, because the official API is barely developed, but the unofficial API is really, really slow.
I couldn't do anything to speed up this process.
To automate data updates, I've created PHP pages that recover data from the unofficial API, combine it and store it in the database.
These pages refresh themselves every 30 seconds, continuing the process, in order to avoid the server time limit, which is set to 30 seconds and cannot be modified.
Then I've created multiple cron jobs.
But there's a problem.
Cron jobs run for at most 30/60/90/120 seconds, but the whole data update process can go on for 20/30 minutes.
Do you have any good idea on how to solve this problem?
I'll be more clear, sorry.
The problem is this:
- the script runs for about 20/30 minutes in total, divided into multiple auto-loads, and recovers and loads all the data into the database;
- a cron job runs for a maximum of 120 seconds, then it returns a time-out error.
I need, if possible, to find a way to have all the data loaded into the database four times a day.
Do you have any idea?
Second update.
This is not a question about the SERVER TIME LIMIT; I've overcome that problem by cutting my script into pieces.
This is a matter of the CRON JOB TIME LIMIT.
I cannot do anything (that I know of) to accelerate script execution, so my data recovery takes a long time and I need time to execute it all, but cron jobs time out too quickly.
I have a cron job running that executes a PHP file which checks a MySQL database for a change. I have this script running every minute, but it's a very simple query.
Still, is running a cron job like this every minute going to be too hard on the server? Is there a better approach to what I'm doing?
Depending on the query, if it's SQL-only work, consider a MySQL event. But it depends on what it does. If PHP code is required to interact with it, events won't help. If it just does some updates in MySQL (like expiring user sessions and removing unconfirmed user accounts), an event will do.
Running every minute is not hard on the server; it depends on what the script does. Time its execution with microtime() and log it to a text file from register_shutdown_function() (you can also log memory_get_peak_usage()). See how long it takes and how much memory it consumes; that will tell you how hard it is on the server.
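Something along these lines (the log path is just an example):

<?php
$start = microtime(true);

// Log runtime and peak memory when the script finishes, however it exits.
register_shutdown_function(function () use ($start) {
    $line = sprintf(
        "%s took %.3fs, peak memory %d bytes\n",
        date('c'),
        microtime(true) - $start,
        memory_get_peak_usage()
    );
    file_put_contents('/var/log/my-cron.log', $line, FILE_APPEND);
});

// ... the actual cron work goes here ...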
A simple query like that shouldn't be an issue... most PHP pages make many more calls, and the server can be processing many requests at once. Unless it's a beast of a query, you'll be fine.
I have a cron job with PHP which I want to set up on my webhost, but at the moment the script takes about 20 seconds to run with only 3 users' data being refreshed. If I get 1,000 users it's going to take ages. Is there an alternative to cron jobs? Will my web host let me run a cron job which takes, for example, 10 minutes to run?
Your cron job can be as long as you want.
The main problem for you is that you must ensure the next cron job execution does not occur while the first one is still running. You have a lot of solutions to avoid that; basically, use a semaphore.
It can be a lock file or a record in the database. Your cron job should check whether the previous one has finished or not. It's also a good idea to send yourself an email if it cannot run because of a long previous job (that way you'll get a notice alerting you that something may be going wrong). By default, cron jobs that exit with a bad error status send all their standard output to the email of the account running the job; depending on how the platform is configured you could use this behaviour, or build an SMTP connection in the job (or store the alert in a database table).
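As one possible variant of the semaphore idea, here is a sketch using flock() instead of a plain "does the file exist" check, plus a mail() alert (the lock path and email address are placeholders); an advisory lock has the nice property of being released automatically if the process dies:

<?php
$fp = fopen('/tmp/my-cron.lock', 'c');
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    // Previous run still going: warn yourself and bail out.
    mail('you@example.com', 'Cron overlap', 'Previous job still running, skipping this run.');
    exit;
}

// ... do the work ...

flock($fp, LOCK_UN);
fclose($fp);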
If you want some alternatives to cron jobs you should have a look at work queues. You can mix work queues with a cron job, or use a work queue in an Apache/PHP environment; there are lots of solutions, but the main idea is to make one single queue of things that should be done and execute them one after the other (but be careful: if you handle these tasks very slowly you'll end up with a big fat waiting queue).
A cron job shouldn't have any bearing on how long its job takes to complete. If your jobs are taking 20 seconds to complete, that's PHP's doing, not cron's.
Will my web host let me run a cron job which takes, for example, 10 minutes to run?
Ask your webhost.
If you want to learn about optimizing php scripts, take a look at Profiling PHP Code.
I have a PHP script, run as a cron job, that executes a set of simple tasks looping over each user in the database and takes about 30 minutes to complete. This process starts over every hour and needs to be as fast and efficient as possible. The problem I'm having is that, like with any server script, execution time varies and I need to figure out the best cron time settings.
If I run cron every minute, I need to stop the last loop of the script 20 seconds before the end of the minute to make sure that the current loop finishes in time. Over the course of the hour this adds up to a lot of wasted time.
I'm wondering if it's a bad idea to simply remove the PHP execution time limit, run the script once an hour and let it run to completion... is this a bad idea?
Instead of setting the max_execution_time you could also use set_time_limit() to reset the counter on every loop. This will ensure your script never runs out of time unless there is something seriously hanging within the current loop (and taking longer than the max_execution_time).
Basically this should make your script run as long as it needs while giving it a 30-second timeout between two set_time_limit() calls.
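A sketch of what that looks like in practice (the $users loop and processUser() are placeholders for whatever the hourly job iterates over):

<?php
foreach ($users as $user) {
    set_time_limit(30);   // each iteration gets a fresh 30-second budget
    processUser($user);   // the real per-user work
}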
Assuming you'd like the work done ASAP, don't use cron. Cron is good for things that need to happen at specific times. It's often abused to simulate a background process that would ideally process work as soon as work appears. You should probably write a daemon that runs continuously. (Note: you could also look at a message/work-queue type system, there are nice libraries out there to do this too)
You can write a daemon from scratch using the pcntl functions (since you don't care about multiple worker processes, it's super-easy to get a process running in the background), or cheat and just make a script that runs forever and run it via screen, or leverage some solid library code like PEAR's System_Daemon or nanoserv.
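For the pcntl route, getting the process into the background really is just a few lines (this sketch assumes the pcntl and posix extensions are enabled):

<?php
$pid = pcntl_fork();
if ($pid === -1) {
    exit(1);      // fork failed
} elseif ($pid > 0) {
    exit(0);      // parent exits; the child keeps running in the background
}
posix_setsid();   // detach the child from the controlling terminal
// ...then drop into a work loop like the one below.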
Once the daemonization stuff is taken care of, all you really care about is having a loop that runs forever. You'll want to take care that your script doesn't leak memory, or consume too many resources.
Generally, you can do something like:
<?php
// some setup code
while (true) {
    $todo = figureOutIfIHaveWorkToDo();
    foreach ($todo as $something) {
        // do stuff with $something
        // remember to clean up resources so you don't leak memory!
        usleep(/* some integer */);
    }
    usleep(/* some other integer */);
}
And it'll work pretty well.
Setting the time limit to 0 and letting it do its thing is fairly typical of PHP based cronjobs (in my experience), but this is also the point when you should ask yourself a few important questions, such as "Should I rewrite this job in a compiled language?" and "Am I using all of my tools (database, etc) to their maximum efficiency?"
That said, maybe better than completely removing the time limit would be to set it to the upper limit you actually want. If that means 48 minutes, then set_time_limit(48 * 60);
I really think you shouldn't set the timeout to 0; that is just asking for trouble. At most, set it to 59*60 seconds. Setting it to 0 might cause security problems: if a script hangs, it will hang almost forever, until the server host stops the execution. It is considered bad practice to do so.
I have used the PHP command-line interface for similar long-running tasks in the past. You probably do not want to remove the execution time limit for any web request.
Sounds like a great idea if there's little chance that it will take more than an hour. Note, however, that the wrong bug can be a really good way of making it take longer than expected.
To avoid all sorts of nasty problems, you should have a guard file containing the process ID of the script. On startup, check that the file doesn't exist, or, if it does, that the process with the PID in the file is no longer running (via a kill(pid, 0) call). If these conditions are met, create a new file with the script's PID and delete the file when you're done.
This is the same trick that many daemons use to ensure they aren't already running. If the daemon was killed suddenly, the file will still exist, but the PID recorded in it is unlikely to belong to a running process.
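A sketch of that guard (posix_kill($pid, 0) is PHP's equivalent of the kill(pid, 0) liveness check; the file path is just an example):

<?php
$pidFile = '/tmp/my-script.pid';
if (file_exists($pidFile)) {
    $oldPid = (int) file_get_contents($pidFile);
    if ($oldPid > 0 && posix_kill($oldPid, 0)) {
        exit; // that process is still alive, so bail out
    }
    // Otherwise the file is stale (a killed run); fall through and take over.
}
file_put_contents($pidFile, getmypid());

// ... do the work ...

unlink($pidFile);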
Depending on what your script does, removing the time limit can lead to problems. If, for example, you are polling an external server that is unresponsive while the job is running, and your cron job takes 2 hours instead of 30 minutes to complete, you may get a stack of PHP processes being fired up even though the previous ones haven't completed yet. This can cause system instability and crashes.
You probably have two options:
- Make sure that no other instance of your script is running beforehand, otherwise exit() on start.
- Consider changing your cronjob into a daemon.
Does it have to run hourly like clockwork?
If not, split the job (you mentioned it was more than one simple task) and do each task every hour?
Or split it per user: do A-M one hour, then N-Z the next?