Prevent parallel execution using a table lock (MySQL) - php

I have a MySQL table called cronjobs which holds entries for every cronjob needed (e.g. delete old emails, update profile age, and so on). For every cronjob there is a defined code block which gets executed if the cronjob is due (I have different intervals for different cronjobs).
For the execution of the due cronjobs, I have a PHP script which is executed by UNIX crontab every minute (crontab calls execute_cronjobs_due.sh, which calls "php -f /path/to/file/execute_cronjobs_due.php").
When execute_cronjobs_due.php runs, all due cronjobs get marked as being executed, so that another call of execute_cronjobs_due.php won't cause a parallel execution of a cronjob that is already running.
Now the problem: sometimes the execution takes more than 60 seconds, but crontab does not call execute_cronjobs_due.sh again after those 60 seconds. What actually happens is that execute_cronjobs_due.sh is called right after the previous execution finishes. And if an execution takes more than 120 seconds, the next two executions are initialized simultaneously.
Timeline:
2015-06-15 10:00:00: execution of execute_cronjobs_due.sh (takes 140 seconds)
2015-06-15 10:02:20: two simultaneous executions of execute_cronjobs_due.sh
Since they are executed exactly simultaneously, marking the cronjobs as being executed is useless: the SELECTs (which should exclude the marked ones) run at the exact same time, so the UPDATE occurs only after both have already selected the due cronjobs.
How can I solve this problem, so that there are no simultaneous executions of cronjobs? Can I use MySQL table locks?
Thank you very much for your help in advance,
Frederic

Yes, you could use MySQL table locks, but this may be overkill for your situation. Anyway, to do that in the most generic way:
Make sure that you have autocommit off
LOCK TABLES cronjobs WRITE;
do your stuff
UNLOCK TABLES;
For exact syntax and details read the docs, obviously: https://dev.mysql.com/doc/refman/5.0/en/lock-tables.html . I personally never used table-level locking, so maybe there are some catches involved I am not aware of.
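For illustration, here is a minimal PHP/PDO sketch of that pattern; the connection details and the due_at/in_progress columns are assumptions, adapt them to your schema:

<?php
// Minimal sketch of the LOCK TABLES approach (assumed columns: due_at, in_progress).
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Note: LOCK TABLES implicitly commits any open transaction, and this
// connection may only touch the locked table until UNLOCK TABLES.
$pdo->exec('LOCK TABLES cronjobs WRITE');
try {
    // While we hold the write lock, no other connection can read or write
    // cronjobs, so select-and-mark is effectively atomic.
    $jobs = $pdo->query("SELECT * FROM cronjobs WHERE due_at <= NOW() AND in_progress = 0")
                ->fetchAll(PDO::FETCH_ASSOC);
    $pdo->exec("UPDATE cronjobs SET in_progress = 1 WHERE due_at <= NOW() AND in_progress = 0");
} finally {
    $pdo->exec('UNLOCK TABLES');  // release even if something above throws
}

foreach ($jobs as $job) {
    // ... execute the job's code block here, after the lock is released ...
}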
What I would do, if you use the InnoDB table engine, is go with optimistic locking:
start transaction as a first thing in your script
get some ID for the script instance; it might be the process PID (getmypid()) or a combination of host + PID. Or just generate a GUID if you are not sure either will be unique enough
do something like UPDATE cronjobs SET executed_by = my_id WHERE executed_by IS NULL AND /* whatever condition selects the jobs to run */
then SELECT * FROM cronjobs WHERE executed_by = my_id
do your stuff on whatever the above SELECT returned
UPDATE cronjobs SET executed_by = NULL WHERE executed_by = my_id
This should be just as easy to do, easier to trace what happens, and it will scale in the future (i.e. you can have a few instances running in parallel as long as they execute different scripts).
With this solution a second script will not fail (technically); it will just run 0 jobs.
The minus is that you will have to clean up jobs that were claimed but never marked as finished because the script failed, but you probably have to do that anyway with your current solution. The easiest way would be to add a timestamp column that tracks when the job was last claimed, and expire the claim after e.g. 15 minutes or an hour, depending on business requirements (short pseudocode: the first UPDATE becomes SET executed_by = my_id, started_at = NOW() WHERE executed_by IS NULL OR (executed_by IS NOT NULL AND started_at < NOW() - 1 hour)).
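Putting the whole approach together, a hedged end-to-end sketch (the executed_by, started_at and due_at columns and the one-hour expiry are illustrative assumptions):

<?php
// Optimistic claim: the single UPDATE is atomic, so two concurrent runs
// can never claim the same row.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$myId = gethostname() . ':' . getmypid();  // host + PID as the worker ID

// Claim due jobs; also re-claim jobs whose previous claimant apparently died.
$claim = $pdo->prepare(
    "UPDATE cronjobs
        SET executed_by = :id, started_at = NOW()
      WHERE due_at <= NOW()
        AND (executed_by IS NULL OR started_at < NOW() - INTERVAL 1 HOUR)"
);
$claim->execute([':id' => $myId]);

// Only rows this process actually won come back here.
$sel = $pdo->prepare("SELECT * FROM cronjobs WHERE executed_by = :id");
$sel->execute([':id' => $myId]);

foreach ($sel->fetchAll(PDO::FETCH_ASSOC) as $job) {
    // ... run the job's code block ...
}

// Release the claim when everything is done.
$pdo->prepare("UPDATE cronjobs SET executed_by = NULL WHERE executed_by = :id")
    ->execute([':id' => $myId]);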

How can I solve this problem, so that there are no simultaneous executions of cronjobs?
There are multiple ways to solve this. The following might be helpful as well:
My suggestion is to keep it simple and use either a file-locking or a file-exists checking approach.
file_exists() + PID based CronHelper class
http://abhinavsingh.com/how-to-use-locks-in-php-cron-jobs-to-avoid-cron-overlaps/
flock() based: https://stackoverflow.com/a/5428665/1163786 (see the sketch after this list)
when you want to avoid IO, store the locking-state into memcache
database transactions: see below and #sakfa's answer
lock cronjobs across a distributed system by using Redis as the central lock store: https://github.com/kvz/cronlock & http://kvz.io/blog/2012/12/31/lock-your-cronjobs/
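To illustrate the flock() option from the list above, a minimal sketch (the lock file path is an arbitrary assumption):

<?php
// Minimal flock() guard: only one instance of this script can hold the lock.
$fp = fopen('/tmp/execute_cronjobs_due.lock', 'c');  // 'c' creates the file if missing

if (!flock($fp, LOCK_EX | LOCK_NB)) {
    // A previous run still holds the lock; bail out instead of overlapping.
    exit("Already running\n");
}

// ... process the due cronjobs here ...

flock($fp, LOCK_UN);  // released automatically on exit, but be explicit
fclose($fp);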
Can I use MySQL table locks?
Yes, but it's a bit overkill.
You would use a "cronjob processing table" with a cronjob status column ("ToDo, Started, Complete" or "Todo, Running, Done") and a PID column.
Then you select jobs and mark their state by using transactions.
That makes sure that "selecting a job from Todo" and "marking it as running/started" are done in one step. In the end, you might still have multiple executions of your "central cronjob processing script", but jobs are NOT selected multiple times for processing.
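A hedged sketch of that select-and-mark step as one transaction (requires InnoDB; the table and column names are placeholders, not an existing schema):

<?php
// "Select from Todo" and "mark as Started" as one atomic step.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->beginTransaction();

// FOR UPDATE row-locks the selected rows until COMMIT, so a second script
// instance blocks here instead of grabbing the same jobs.
$ids = $pdo->query("SELECT id FROM cronjob_processing WHERE status = 'Todo' FOR UPDATE")
           ->fetchAll(PDO::FETCH_COLUMN);

if ($ids) {
    $in = implode(',', array_map('intval', $ids));
    $pdo->exec("UPDATE cronjob_processing
                   SET status = 'Started', pid = " . (int) getmypid() . "
                 WHERE id IN ($in)");
}
$pdo->commit();  // other instances now see status = 'Started' and skip these rows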

Related

Can we schedule cron job in the php script?

I'm new to cronjobs and I want a few emails (2k to 3k mails to be more precise) to be sent at a specific time and date, which are stored in a database table. Currently, to achieve this, I'm calling my mail function file (sendmail.php) every minute using a cron job and comparing the current time with the time that comes from the db table; if they match, the mail is sent. By doing this I'm afraid there will be some effect on the performance.
Can we schedule a cronjob right after the insert query in the PHP script, so that I can pass those time and date variables to it?
Is calling the file every minute via a cron job good practice? Will the performance get affected, given that my application will be used by 25 users at a time?
Although calling the file every minute achieves my task, I still want to know if there are better ways.
Thank you in advance.
for every minute using cron
If you're firing off cron jobs every minute then you're doing something wrong. There are problems with jitter and concurrency.
comparing the current time and the time which comes from the db table
Does that mean you are doing the time check outside of the DBMS? That would be very silly.
Can we schedule cronjob right after the insert query in the php script
Yes, although you'd need to use sudo to create privilege separation. However, having (potentially) thousands of cron jobs is a very bad idea.
While there is a lot missing from your problem statement, based on what you have said, I'd suggest having a cron job running once every (say) 15 minutes, polling the database for the emails to be sent in that time window - with the time comparison and concurrency locking done in the database.
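A hedged sketch of that suggestion, with both the time comparison and the claiming done in MySQL (the emails table and its status/send_at/batch columns are assumptions):

<?php
// Poll run by cron every 15 minutes; the time check and the claim both
// happen inside MySQL, so overlapping runs cannot grab the same rows.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$batch = uniqid('batch', true);  // tag for the rows this run claims

// Atomically claim everything due right now.
$claim = $pdo->prepare(
    "UPDATE emails SET status = 'sending', batch = :b
      WHERE status = 'pending' AND send_at <= NOW()"
);
$claim->execute([':b' => $batch]);

// Only the rows this run claimed are processed here.
$sel = $pdo->prepare("SELECT * FROM emails WHERE batch = :b");
$sel->execute([':b' => $batch]);

foreach ($sel->fetchAll(PDO::FETCH_ASSOC) as $email) {
    // ... send the mail, then mark this row as 'sent' ...
}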

Is handling concurrency in cron jobs important?

I have a cron job in my application which does the following:
It gets entries from a database table whose status is set to 'pending'. The column list is: id, name, title, ip, status
For each such entry it does a REST (web service) call, gets the response, processes it and stores the data in the database.
I presently set the cron job interval to 1 min.
But sometimes its execution might take as much as 5-10 mins (rare cases).
For my case, is it important to handle concurrency of the cron job, using lock files etc.?
Presently, when an entry is being processed, I change the value of the entry's status column to 'processing', so that it is not processed again by the next call of the cron job.
Using lock files has the extra advantage that if the same script is executed twice (by mistake), only the first one will be executed.
Though you are updating the status column, I think it will still be problematic if the script is executed twice at the same time, unless you lock the rows/table.
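Besides lock files, MySQL's named locks give the same "only one instance runs" guarantee without touching the filesystem. A hedged sketch (the lock name is arbitrary):

<?php
// GET_LOCK gives a server-side named mutex: only one connection can hold
// 'cron_worker' at a time, so a second run exits immediately.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$got = $pdo->query("SELECT GET_LOCK('cron_worker', 0)")->fetchColumn();
if (!$got) {
    exit("Another run is still in progress\n");  // lock held by another connection
}

// ... fetch the 'pending' entries, do the REST calls, store the responses ...

$pdo->query("SELECT RELEASE_LOCK('cron_worker')");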

PHP scripts in cron jobs are double processing

I have 5 cron jobs running a PHP file. The PHP file checks the MySQL database for items that require processing. Since cron launches the scripts all at the same time, it seems that some of the items are processed twice, or even sometimes up to five times.
Upon SELECTing an item in one of the scripts, it immediately sends an UPDATE query so that the other jobs shouldn't process it again. But it looks like it's still double processing.
What can I do to prevent the other scripts from processing an item that was previously selected by the other cron jobs?
This issue is called a "race condition". In this case it happens because the SELECT and UPDATE, though called one after another, are not a single atomic operation. Therefore, there is a chance that two workers SELECT the same job, then the first does its UPDATE, then the second does its UPDATE, and both proceed to run the job simultaneously.
There is a workaround, however.
You could add a field to your table containing the ID of the current cron job worker (if you run them all on one machine, it may be the PID). In the worker you do the UPDATE first, trying to reserve a job for it:
UPDATE jobs
SET worker = $PID, status = 'processing'
WHERE worker IS NULL AND status = 'awaiting' LIMIT 1
Then you verify you successfully reserved a job for this worker:
SELECT * FROM jobs WHERE worker = $PID
If it did not return a row, it means another worker was first to reserve it. You can try again from step 1 to acquire another job. If it did return a row, you do all your processing, and then the final UPDATE in the end:
UPDATE jobs
SET status = 'done', worker = NULL
WHERE id = $JOB_ID
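Tying the three statements together, a hedged worker-loop sketch (connection details are placeholders; the schema follows this answer):

<?php
// Worker loop around the reserve / verify / finish statements above.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

while (true) {
    // 1. Try to reserve exactly one awaiting job for this worker.
    $claim = $pdo->prepare(
        "UPDATE jobs SET worker = :pid, status = 'processing'
          WHERE worker IS NULL AND status = 'awaiting' LIMIT 1"
    );
    $claim->execute([':pid' => $pid]);
    if ($claim->rowCount() === 0) {
        break;  // nothing left to do, or another worker won the last job
    }

    // 2. Fetch the job we actually reserved.
    $sel = $pdo->prepare("SELECT * FROM jobs WHERE worker = :pid AND status = 'processing'");
    $sel->execute([':pid' => $pid]);
    $job = $sel->fetch(PDO::FETCH_ASSOC);

    // ... process $job here ...

    // 3. Mark it done and free the worker column.
    $pdo->prepare("UPDATE jobs SET status = 'done', worker = NULL WHERE id = :id")
        ->execute([':id' => $job['id']]);
}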
I think you have a typical problem that semaphores can solve. Take a look at this article:
http://www.re-cycledair.com/php-dark-arts-semaphores
The idea would be: at the start of each script, acquire the same semaphore and wait until it is free. Then SELECT and UPDATE the DB as you do now, free the semaphore, and start the processing. This is the only way to be sure that no more than one script is reading the DB while another one is about to write to it.
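A minimal sketch of that idea with PHP's System V semaphore functions (requires the sysvsem extension, Unix only; the ftok key choice is arbitrary):

<?php
// One-permit semaphore shared by every instance of this script.
$sem = sem_get(ftok(__FILE__, 'c'), 1);

sem_acquire($sem);  // blocks until no other instance holds the semaphore
try {
    // SELECT the pending rows and UPDATE their status here;
    // only one script instance can be inside this section at a time.
} finally {
    sem_release($sem);
}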
I would start again. This train of thought:
it takes time to process one item. about 30 seconds. if i have five cron jobs, five items are processed in 30 seconds
This is just plain wrong and you should not write your code with this in mind.
By that logic why not make 100 cron jobs and do 100 per 30 seconds? Answer, because your server is not RoadRunner and it will fall over and fail.
You should:
1. Optimise your code so that it does not take 30 seconds.
2. Segment your code so that each job is only doing one task at a time, which will make it quicker and also ensure that you do not get this 'double processing' effect.
3. Rethink your problem; this is the most important, as it will help with 1 and 2.
EDIT
Even with the new knowledge of this being on a third-party server, my logic still stands: do not start multiple calls that you are not in control of; in fact, this is now even more important.
If you do not know what they are doing with the calls, then you cannot be sure they are processed in the right order, when, or whether they are processed at all. So just make one call to ensure you do not get double processing.
A technical solution would be for them to improve the processing time or for you to cache the responses - but that may not be relevant to your situation.

How can I get MySQL to run queries on an interval?

I'm creating a web application where every row of a table needs to be processed. I'm spawning one child PHP process per table row. I'm implementing a safety mechanism, so if a PHP process is interrupted processing a row, a new PHP process will be spawned to process said row. To do this I'm going to create a new table where all PHP processes check in every 10 seconds or so. I need MySQL to delete all rows that haven't been checked into for 5 minutes or more, so my application will know to create a new PHP child to process that row.
I know it's possible to get MySQL to run queries on an interval, but I don't know how.
~Enter stackoverflow~
Edit: I was hoping to learn how to do this 100% MySQL. Is there no way to set MySQL to run a query every hour, or at a specific time each day or such?
Crontab. You can run the query directly using the mysql client (mysql -uusername -ppassword dbname -e 'query here') or schedule a PHP script which runs the query.
DELETE FROM table WHERE checked_into < CURRENT_TIMESTAMP - INTERVAL 5 MINUTE
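For completeness, a hedged example of such a crontab entry (credentials and the checkins table name are placeholders):

# m h dom mon dow  command
*/5 * * * * mysql -uusername -ppassword dbname -e "DELETE FROM checkins WHERE checked_into < CURRENT_TIMESTAMP - INTERVAL 5 MINUTE"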
MySQL Events are tasks that run according to a schedule. Therefore, we sometimes refer to them as scheduled events. ... Conceptually, this is similar to the idea of the Unix crontab (also known as a “cron job”) or the Windows Task Scheduler.
http://dev.mysql.com/doc/refman/5.1/en/events-overview.html
And here is the lovely syntax: http://dev.mysql.com/doc/refman/5.1/en/create-event.html
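A hedged sketch of such an event (the event scheduler must be enabled with SET GLOBAL event_scheduler = ON, and you need the EVENT privilege; the checkins table name is a placeholder):

-- Runs the cleanup inside MySQL itself, once per minute.
CREATE EVENT purge_stale_checkins
    ON SCHEDULE EVERY 1 MINUTE
    DO
        DELETE FROM checkins
         WHERE checked_into < CURRENT_TIMESTAMP - INTERVAL 5 MINUTE;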
One way to run MySQL queries on a certain interval would be to set up a cron job. Assuming you've got full access to your webserver, this should be doable. You'd just make a PHP page that does the SQL operations you want to occur every X time interval, and then set the script to run on that interval via cron jobs. More specifics: http://en.wikipedia.org/wiki/Cron
I think what you are looking for is an event scheduler, first introduced in MySQL 5.1.
On a side note, maybe you should redesign your program a little to avoid the extra layer of event scheduler:
Instead of deleting a row where a process has not checked in for a while, just have a column with a check-in timestamp. Then, if some row has a very old check-in timestamp, you can spawn a new PHP process for it.

Should I be using message queuing for this?

I have a PHP application that currently has 5k users and will keep increasing for the foreseeable future. Once a week I run a script that:
fetches all the users from the database
loops through the users, and performs some upkeep for each one (this includes adding new DB records)
The last time this script ran, it only processed 1400 users before dying due to a 30-second maximum execution time error. One solution I thought of was to have the main script still fetch all the users, but instead of performing the upkeep process itself, it would make an asynchronous cURL call (one for each user) to a new script that would perform the upkeep for that particular user.
My concern here is that 5k+ cURL calls could bring down the server. Is this something that could be remedied by using a messaging queue instead of cURL calls? I have no experience using one, but from what I've read it seems like this might help. If so, which message queuing system would you recommend?
Some background info:
this is a Symfony project, using Doctrine as my ORM and MySQL as my DB
the server is a Windows machine, and I'm using Windows' task scheduler and wget to run this script automatically once per week.
Any advice and help is greatly appreciated.
If it's possible, I would make a scheduled task (cron job) that would run more often and use LIMIT 100 (or some other number) to process a limited number of users at a time.
A few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but more than 30 seconds would be a start.
Track Upkeep against Users
Maybe add a field for each user, last_check and have that field set to the date/time of the last successful "Upkeep" action performed against that user.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set up a cronjob and process, say, 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats: how many records were processed, how long it has been since they were last processed, and how long the script took. These metrics will allow you to tweak the batch sizes, cronjob settings, time limits, etc., to ensure that the maximum number of checks is performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased user volumes and will form a strong foundation for any further maintenance tasks you might be looking at down the track.
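Put together, a hedged sketch of the batch approach (the users table, the last_check column and the batch size of 100 are assumptions):

<?php
// Process the 100 most-overdue users, recording progress per user.
set_time_limit(300);  // roomier than the default 30 seconds, but still bounded

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Oldest last_check first: the most neglected users get handled soonest.
$userIds = $pdo->query("SELECT id FROM users ORDER BY last_check ASC LIMIT 100")
               ->fetchAll(PDO::FETCH_COLUMN);

$start = microtime(true);
foreach ($userIds as $id) {
    // ... perform the weekly upkeep for this user ...

    // Record success immediately so a crash doesn't redo finished users.
    $pdo->prepare("UPDATE users SET last_check = NOW() WHERE id = :id")
        ->execute([':id' => $id]);
}

// Simple stats for the log file, as suggested above.
error_log(sprintf("upkeep: %d users in %.1fs", count($userIds), microtime(true) - $start));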
Why don't you still use the cURL idea, but instead of processing only one user per call, send a batch of users to each call by splitting them into groups of 1000 or something?
Have you considered changing your logic to commit changes as you process each user? It sounds like you may be running a single transaction to process all users, which may not be necessary.
How about just increasing the execution time limit of PHP?
Also, looking into whether you can improve your upkeep procedure to make it faster can help too. Depending on what exactly you are doing, you could also look into spreading it out a bit: do a couple once in a while rather than everyone at once. But that depends on what exactly you're doing, of course.
