PHP scripts in cron jobs are double processing - php

I have 5 cron jobs running a PHP file. The PHP file checks the MySQL database for items that require processing. Since cron launches the scripts all at the same time, it seems that some of the items are processed twice, or even sometimes up to five times.
As soon as one of the scripts SELECTs an item, it immediately sends an UPDATE query so that the other jobs shouldn't pick it up again. But it looks like items are still being processed twice.
What can I do to prevent the other scripts from processing an item that was already selected by another cron job?

This issue is called a "race condition". In this case it happens because the SELECT and the UPDATE, although called one after another, are not a single atomic operation. There is therefore a chance that two jobs SELECT the same item, then the first does its UPDATE, then the second does its UPDATE, and both proceed to process the item simultaneously.
There is a workaround, however.
You could add a field to your table containing the ID of the current cron job worker (if you run everything on one machine, it can be the PID). In the worker you do the UPDATE first, trying to reserve a job for it:
UPDATE jobs
SET worker = $PID, status = 'processing'
WHERE worker IS NULL AND status = 'awaiting' LIMIT 1
Then you verify you successfully reserved a job for this worker:
SELECT * FROM jobs WHERE worker = $PID
If it did not return a row, it means another worker was faster to reserve that job. You can go back to step 1 and try to acquire another one. If it did return a row, you do all your processing, and then a final UPDATE at the end:
UPDATE jobs
SET status = 'done', worker = NULL
WHERE id = $JOB_ID
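
Put together in PHP, that loop might look roughly like the sketch below. The PDO connection details and the process_job() helper are assumptions for illustration, not part of the original answer:

<?php
// Minimal sketch of the reserve-then-verify loop described above.
// Assumes a `jobs` table with id/worker/status columns and a PDO connection.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

while (true) {
    // Step 1: try to reserve one awaiting job for this worker.
    $reserve = $pdo->prepare(
        "UPDATE jobs SET worker = ?, status = 'processing'
         WHERE worker IS NULL AND status = 'awaiting' LIMIT 1"
    );
    $reserve->execute([$pid]);

    // Step 2: verify the reservation actually belongs to this worker.
    $check = $pdo->prepare("SELECT * FROM jobs WHERE worker = ?");
    $check->execute([$pid]);
    $job = $check->fetch(PDO::FETCH_ASSOC);

    if ($job === false) {
        break; // nothing was reserved: no awaiting jobs are left
    }

    process_job($job); // hypothetical helper with the actual processing

    // Step 3: mark the job done and release the worker column.
    $done = $pdo->prepare("UPDATE jobs SET status = 'done', worker = NULL WHERE id = ?");
    $done->execute([$job['id']]);
}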

I think you have a typical use case for semaphores. Take a look at this article:
http://www.re-cycledair.com/php-dark-arts-semaphores
The idea is that at the start of each script you request the same semaphore and wait until it is free. Then you SELECT and UPDATE the DB as you do now, release the semaphore, and start processing. This is the only way you can be sure that no more than one script is reading the DB while another one is about to write to it.
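
As a rough sketch with PHP's SysV semaphore functions (this requires the sysvsem extension; the key value is arbitrary and just has to be shared by all the cron scripts):

<?php
// Rough sketch of the semaphore idea (requires the sysvsem extension).
$semKey = 123456;              // arbitrary key, identical in every cron script
$sem = sem_get($semKey, 1);    // at most one script may hold it at a time

if (sem_acquire($sem)) {       // blocks until the semaphore is free
    // SELECT the next item and UPDATE it as "taken" here,
    // exactly as the scripts already do.
    sem_release($sem);         // let the next script reserve its own item
}

// ...then process the item that was reserved above.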

I would start again. This train of thought:
it takes time to process one item. about 30 seconds. if i have five cron jobs, five items are processed in 30 seconds
This is just plain wrong and you should not write your code with this in mind.
By that logic, why not make 100 cron jobs and do 100 per 30 seconds? Answer: because your server is not RoadRunner and it will fall over and fail.
You should:
1. Optimise your code so that it does not take 30 seconds.
2. Segment your code so that each job is only doing one task at a time, which will make it quicker and also ensure that you do not get this 'double processing' effect.
3. Rethink your problem; this is the most important point, as it will help with 1 and 2.
EDIT
Even with the new knowledge that this runs against a third-party server, my logic still stands: do not start multiple calls that you are not in control of. In fact, this is now even more important.
If you do not know what they are doing with the calls, then you cannot be sure they are handled in the right order, or when or whether they are processed at all. So just make one call to ensure you do not get double processing.
A technical solution would be for them to improve the processing time or for you to cache the responses - but that may not be relevant to your situation.

Related

PHP, MySQL, Cron Job - Efficient method to maintain current/live data in large tables?

This is mostly theory, so I apologize if it gets wordy.
Background
The project I'm working on pulls information from other websites (external, not hosted by us). We would like to have as-close-to-live information as possible, so that our users are presented with immediately pertinent information. This means monitoring and updating the table constantly.
It is difficult to show my previous work on this, but I have searched high and low for the last couple of weeks, for "maintaining live data in databases," and "instantly updating database when external changes made," and similar. But all to no avail. I imagine the problem of maintaining up-to-date records is common, so I am unsure why thorough solutions for it seem to be so uncommon.
To keep with the guidelines for SO, I am not looking for opinions, but rather for current best practices and most commonly used/accepted, efficient methods in the industry.
Currently, with a cron job, the best we can do is run a process every minute.
* * * * * cd /home/.../public_html/.../ && /usr/bin/php .../robot.php >/dev/null 2>&1
The thing is, we are pulling data from multiple thousands of other sites (each row is a site), and sometimes an update can take a couple minutes or more. Calling the function only once a minute is not good enough. Ideally, we want near-instant resolution.
Checking if a row needs to be updated is quick. Essentially just your simple hash comparison:
if (hash('md5', $current) !== hash('md5', $previous)) {
    // ... update row ...
}
Using processes fired exclusively by the cron job means that if a row does end up getting updated, the process is held up until that update is done, or until the cron job fires a new process a minute later.
No bueno! Pas bien! If, by some horrible twist of fate, every row needed to be updated, then it could potentially take hours (or longer) before all records are current. And in that time, rows that had already been passed over would be out of date.
Note: The DB is set up in such a way that rows currently being updated are inaccessible to new processes. The function essentially crawls down the table, finds the next available row that has not been read/updated, and dives in. Once finished with the update, it continues down to the next available row.
Each process is killed when it reaches the end of the table, or when all the rows in the table are marked as read. At this point, all rows are reset to unread, and the process starts over.
With the amount of data being collected, the only way to improve resolution is to have multiple processes running at once.
But how many is too many?
Possible Solution (method)
The best method I've come up with so far, to get through all rows as quickly as possible, is this:
Cron Job calls first process (P1)
P1 skims the table until it finds a row that is unread and requires updating, and dives in
As soon as P1 enters the row, it calls a second identical process (P2) to continue from that point
P2 skims the table until it finds a row that is unread and requires updating, and dives in
As soon as P2 enters the row, it calls a third identical process (P3) to continue from that point
... and so on.
Essentially, every time a process enters a row to update it, a new process is called to continue on.
BUT... the parent processes are not dead. This means that as soon as they are finished with their updates, they begin to crawl the table again, looking for the next available row.
AND... on top of this all, a new cron job is still fired every minute.
What this means is that potentially thousands of identical processes could be running at the same time. The number of processes cannot exceed the number of records in the table. Worst-case scenario is that every row is being updated simultaneously, and a cron job or two are fired before any updates are finished. The cron jobs will immediately die, since no rows are available to update. As each process finishes with its updates, it would also immediately die for the same reason.
The scenario above is worst-case. It is unlikely that more than 5 or 10 rows will ever need to be updated each pass, but theoretically it is possible to have every row being updated simultaneously.
Possible Improvements (primarily on resources, not speed or resolution)
Monitor and limit the number of live processes allowed, and kill any new ones that are fired. But then this begs questions like "how many is too many?", and "what is the minimum number required to achieve a certain resolution?"
Have each process mark multiple rows at a time (5-10), and not continue until all rows in the set have been dealt with. This would have the effect of decreasing the maximum number of simultaneous processes by a factor of however many rows get marked at a time.
Like I said at the beginning, surely this is a common problem for database architects. Is there a better/faster/more efficient method than what I've laid out, for maintaining current records?
Thanks for keeping with me!
First of all, I read it all! Just had to pat myself on the back for that :)
What you are probably looking for is a worker queue. A queue is basically a line like the one you would find in a supermarket, and a worker is the person at the counter receiving the money and doing everything for each customer. When there is no customer, she doesn't do any work, and when there is, she does.
When there are a lot of customers in the mall, more of the workers go on the empty counters, and the people buying groceries get distributed amongst all of them.
I have written a lot about queues recently, and the one I most recommend is Beanstalk. It's simple to use, and there is the Pheanstalk API if you are planning to create queues and workers in PHP (and from there control what happens in your MySQL database).
An example of a queue script and a worker script would look similar to the following (obviously you would add your own code to adapt it to your specific needs, and you would create as many workers as you want. You could even have the number of workers vary depending on how much demand there is in your queue):
Adding jobs to the queue
<?php
// Requires the Pheanstalk library (e.g. installed via Composer).
require_once 'vendor/autoload.php';

$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk
    ->useTube("my_queue")
    ->put("UPDATE mytable SET price = price + 4 WHERE stock = 'GOOG'"); // SQL query, for instance
?>
From your description, it seems you are using transactions, which prevents some updates from taking place while others are being applied. This is actually a great reason to use a queue, because if a queue job times out, it is sent back to the top of the queue (at least in the Pheanstalk queue I am describing), which means it won't be lost in the event of a timeout.
Worker script:
<?php
// Requires the Pheanstalk library (e.g. installed via Composer).
require_once 'vendor/autoload.php';

$pheanstalk = new Pheanstalk('127.0.0.1:11300');

if ($job = $pheanstalk
    ->watch('my_queue')
    ->ignore('default')
    ->reserve()) // retrieves the job if there is one in the queue
{
    echo $job->getData(); // instead of echoing you would
                          // have your query execute at this point
    $pheanstalk->delete($job); // deletes the job from the queue
}
?>
You would have to make some changes, such as deciding how many workers you will have. You might put one worker in a while loop, obtaining all the jobs and executing them one by one, and then call other worker scripts to help if you see that you have executed three and more are coming. There are many ways of managing the queue, but this is what is often used in situations like the one you described.
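
For example, a single long-running worker could look roughly like this sketch (same assumed Pheanstalk setup and tube name as above; running the payload against MySQL is left as a comment):

<?php
// Sketch of one worker in a loop, processing jobs as they arrive.
require_once 'vendor/autoload.php';

$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk->watch('my_queue')->ignore('default');

while (true) {
    $job = $pheanstalk->reserve();   // blocks until a job is available
    // run the job's payload against MySQL here, e.g. with PDO
    $pheanstalk->delete($job);       // remove it once it has succeeded
}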
Another great benefit of queues from a library as well-established as Pheanstalk is that it is very versatile. If in the future you decide you want to organize your workers differently, you can do so easily, and there are many functions that make your job easier. No reason to reinvent the wheel.

Prevent parallel execution using a table lock (MySQL)

I have a MySQL table called cronjobs which holds entries for every cronjob needed (e.g. delete old emails, update profile age, and so on). For every cronjob there is a defined code block which gets executed if the cronjob is due (I have different intervals for different cronjobs).
For the execution of the due cronjobs, I have a PHP script which is executed by the UNIX crontab every minute (it calls execute_cronjobs_due.sh, which calls "php -f /path/to/file/execute_cronjobs_due.php").
When execute_cronjobs_due.php runs, all due cronjobs get marked as being executed, so that another call of execute_cronjobs_due.php shouldn't cause parallel execution of a cronjob that is already running.
Now the problem: sometimes the execution takes more than 60 seconds, but the crontab program does not call execute_cronjobs_due.sh again after those 60 seconds. What actually happens is that execute_cronjobs_due.sh is called right after the previous crontab execution finishes. And if an execution takes more than 120 seconds, the next two executions are initialized simultaneously.
Timeline:
2015-06-15 10:00:00: execution execute_cronjobs_due.sh (takes 140 seconds)
2015-06-15 10:02:20: two simultaneous executions of execute_cronjobs_due.sh
Since they are executed exactly simultaneously, there is no use in marking the cronjobs as being executed, because the SELECTs (which should exclude the marked ones) run at the exact same time. So the UPDATE occurs only after both have already selected the due cronjobs.
How can I solve this problem, so that there are no simultaneous executions of cronjobs? Can I use MySQL table locks?
Thank you very much for your help in advance,
Frederic
Yes, you could use MySQL table locks, but this may be overkill for your situation. Anyway, to do it in the most generic way:
Make sure that you have autocommit off
LOCK TABLES cronjobs WRITE;
do your stuff
UNLOCK TABLES
For the exact syntax and details read the docs, obviously: https://dev.mysql.com/doc/refman/5.0/en/lock-tables.html . I personally never used table-level locking, so maybe there are some catches involved I am not aware of.
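
A rough sketch of those steps from PHP (assuming PDO and that everything happens on the same connection, which is required for the lock to be meaningful):

<?php
// Rough sketch of the table-lock variant (PDO assumed).
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$pdo->exec('SET autocommit = 0');
$pdo->exec('LOCK TABLES cronjobs WRITE');   // other sessions now wait here

// select the due cronjobs, mark them, run them ...

$pdo->exec('COMMIT');
$pdo->exec('UNLOCK TABLES');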
What I would do, if you use the InnoDB table engine, is go with optimistic locking:
start transaction as a first thing in your script
get some ID for this script run; it might be the process PID (getmypid()) or a combination of host + PID, or just generate a GUID if you are not sure either will be unique
do something like UPDATE cronjobs SET executed_by = my_id WHERE executed_by is null and /* whatever condition to get jobs to run */
then SELECT * FROM cronjobs where executed_by = my_pid
do your stuff on whatever above select returned
UPDATE cronjobs set executed_by = null where executed_by = my_pid
This should be just as easy to do, easier to track what happens, and easier to scale in the future (i.e. you can have a few instances running in parallel as long as they execute different scripts)
With this solution second script will not fail (technically), it will just run 0 jobs.
The minus is that you will have to clean up jobs that were claimed but that the script failed to mark as finished; you probably have to do that anyway with your current solution. The easiest way would be to add a timestamp column that tracks when the job was last claimed, and expire the claim after, say, 15 minutes or an hour, depending on business requirements (short pseudocode: the first UPDATE becomes SET executed_by = my_id, started_at = NOW() WHERE executed_by IS NULL OR (executed_by IS NOT NULL AND started_at < NOW() - INTERVAL 1 HOUR))
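
Written out, that claiming UPDATE might look roughly like the sketch below (PDO assumed; the condition that selects which jobs are actually due is omitted, as in the pseudocode, and claims older than one hour are treated as stale):

<?php
// Sketch of the claim/expiry UPDATE from the pseudocode above (PDO assumed).
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$myId = gethostname() . ':' . getmypid();   // host + PID as the worker ID

$claim = $pdo->prepare(
    "UPDATE cronjobs
     SET executed_by = :id, started_at = NOW()
     WHERE executed_by IS NULL
        OR started_at < NOW() - INTERVAL 1 HOUR"
);
$claim->execute([':id' => $myId]);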
How can I solve this problem, so that there are no simultaneous executions of cronjobs?
There are multiple ways to solve this; the following might be helpful as well:
My suggestion is to keep it simple and use either a file-locking or a file-exists checking approach (see the small flock() sketch after the list below).
file_exists() + PID based CronHelper class
http://abhinavsingh.com/how-to-use-locks-in-php-cron-jobs-to-avoid-cron-overlaps/
flock() based: https://stackoverflow.com/a/5428665/1163786
when you want to avoid I/O, store the locking state in memcache
database transactions: see below and #sakfa's answer
lock cronjobs across a distributed system using Redis as central: https://github.com/kvz/cronlock & http://kvz.io/blog/2012/12/31/lock-your-cronjobs/
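
As a minimal sketch of the flock() variant (the lock-file path is an assumption):

<?php
// Minimal flock() based guard for the per-minute cron script.
$lock = fopen('/tmp/execute_cronjobs_due.lock', 'c');   // path is an assumption

if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0);   // a previous run is still going; skip this minute
}

// ... run the due cronjobs here ...

flock($lock, LOCK_UN);
fclose($lock);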
Can I use MySQL table locks?
Yes, but it's a bit overkill.
You would use a "cronjob processing table" with a cronjob status column ("ToDo, Started, Complete" or "Todo, Running, Done") and a PID column.
Then you select jobs and mark their state by using transactions.
That makes sure that "Selecting a job from Todo" and "marking it as running/started" is done in one step. In the end, you might still have multiple exec's of your "central cronjob processing script", but jobs are NOT selected multiple times for processing.
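
A rough sketch of that transactional select-and-mark step; the table and column names here are assumptions, and it requires InnoDB so that SELECT ... FOR UPDATE actually locks the row:

<?php
// Sketch of selecting a "Todo" job and marking it started in one transaction.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->beginTransaction();

// Lock one "Todo" row so a concurrent run cannot pick the same job.
$stmt = $pdo->query(
    "SELECT id FROM cronjob_processing WHERE status = 'Todo' LIMIT 1 FOR UPDATE"
);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row) {
    $mark = $pdo->prepare(
        "UPDATE cronjob_processing SET status = 'Started', pid = ? WHERE id = ?"
    );
    $mark->execute([getmypid(), $row['id']]);
}

$pdo->commit();   // the job is now marked; process it outside the transaction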

Endless loop in cron job php

I have to send bulk emails to the users. I am thinking of having an endless loop in a cron job, where I fetch a few dozen or a few hundred users and send the emails one by one, updating the table to record that each email was sent. I should also add a sleep interval after each packet of a dozen (or a hundred) users has received the email. Basically it looks like:
while (1 != 0) {
    $notifications = // fetch notifications where the email is not yet sent
    foreach ($notifications as $notification) {
        // 1) send email
        // 2) update table - email was sent
    }
    sleep(5);
}
Now, is this all right to use, or is it considered bad practice?
I know I can also use multiple crons, let's say every minute, but to prevent overlapping by using a lock file, as soon as a cron starts and the lock file already exists (so another cron is still running), it should either:
a) wait for some time for the first cron to finish before starting,
or
b) just return empty, allowing the next cron to do the job as soon as the ongoing one is done.
The problem with a) is: what if the crons take a lot more time than expected? Then after some time I will have a bunch of crons in a "waiting" state. For case b), what if the first cron ends immediately after the second cron is done (returning empty)? Then I will have a gap of roughly one minute, and I need to send emails to users as soon as possible.
Also, question 2: which is better performance-wise, one cron in a loop or multiple crons?
Thanks
What you are describing is a daemon, not a cron task.
There are lots of daemons that run continuously, so no, it's not a bad practice to do that.
If you want the daemon automatically restarted if it crashes, you could have a watchdog task, which continuously checks that the daemon is running, and starts a daemon process if one isn't running.
Another alternative (as you describe) is to have a cron task that occasionally attempts to start the daemon; the startup should detect whether the daemon process is already running. If it's already running, leave it be and just exit. If it's not running, then start another one (in the background, as a detached process). Either way, the cron task completes quickly.
(And it doesn't matter one whit whether the daemon connects to MySQL.)
Personally, I dislike endless loops. I prefer a cron job running every 5 minutes, for example.
And you can optimize your script to send the maximum number of emails within the cron job's time window.
You need to estimate how many emails you will send per minute. I'll assume 1 email per second.
So my idea is:
Query for 290 notifications [leaving a 10-second margin to fetch and update notifications] and mark them with a "sending" status (so the next cron does not pick them up).
Send the emails and save the results in an array (for a later update).
When finished, update the notifications' status (sent or error).
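
A rough PHP sketch of that plan; the table and column names, and the send_email() helper, are assumptions used only for illustration:

<?php
// Rough sketch of the batched cron approach above (PDO assumed;
// send_email() is a placeholder for the real mailer call).
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Reserve a batch so the next cron run does not pick the same rows.
$pdo->exec("UPDATE notifications SET status = 'sending'
            WHERE status = 'pending' LIMIT 290");

$batch = $pdo->query("SELECT * FROM notifications WHERE status = 'sending'")
             ->fetchAll(PDO::FETCH_ASSOC);

$results = [];
foreach ($batch as $n) {
    $results[$n['id']] = send_email($n) ? 'sent' : 'error';
}

// Write the results back in one pass.
$update = $pdo->prepare("UPDATE notifications SET status = ? WHERE id = ?");
foreach ($results as $id => $status) {
    $update->execute([$status, $id]);
}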
Just my 2 cents.

Is handling concurrency in cron jobs important?

I have a cron job in my application which does the following:
It gets entries from a database table whose status is set to 'pending'. The column list is: id, name, title, ip, status.
For each such entry it does a REST (Web service) call, get response, process it and store the data in database.
I presently set the cron job interval to 1 min.
But sometimes its execution might take as much as 5-10 minutes (in rare cases).
For my case, is it important to handle concurrency of cron job? Using lock files etc?
Presently, when an entry is being processed, I change the value of the entry's status column to 'processing' so that it is not processed again by the next call of the cron job.
Using lock files has the extra advantage that if the same script is executed twice (by mistake), only the first one will run.
Though you are setting the status column to 'pending', I think it will still be problematic if the script is executed twice at the same time, unless you lock the rows/table.

Running a PHP script or function at an exact point in the future

I'm currently working on a browser game with a PHP backend that needs to perform certain checks at specific, changing points in the future. Cron jobs don't really cut it for me as I need precision at the level of seconds. Here's some background information:
The game is multiplayer and turn-based
On creation of a game room the game creator can specify the maximum amount of time taken per action (30 seconds - 24 hours)
Once a player performs an action, they should only have the specified amount of time to perform the next, or the turn goes to the player next in line.
For obvious reasons I can't just keep track of time through Javascript, as this would be far too easy to manipulate. I also can't schedule a cron job every minute as it may be up to 30 seconds late.
What would be the most efficient way to tackle this problem? I can't imagine querying a database every second would be very server-friendly, but it is the direction I am currently leaning towards[1].
Any help or feedback would be much appreciated!
[1]:
A user makes a move
A PHP function is called that sets 'switchTurnTime' in the MySQL table's game row to 'TIMESTAMP'
A PHP script that is always running in the background queries the table for any games where the 'switchTurnTime' has passed, switches the turn and resets the time.
You can always use a queue or daemon. This only works if you have shell access to the server.
https://stackoverflow.com/a/858924/890975
Every time you need an action to occur at a specific time, add it to a queue with a delay. I've used beanstalkd with varying levels of success.
You have lots of options this way. Here are two examples with 6-second intervals (a small Pheanstalk sketch follows the list):
Use a cron job every minute to add 10 jobs, each with a delay of 6 seconds
Write a simple PHP script that runs in the background (a daemon) and adds a new job to the queue every 6 seconds
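
For instance, putting a job with a 6-second delay via Pheanstalk could look roughly like this (the tube name and payload are assumptions):

<?php
// Sketch of putting a delayed job with Pheanstalk.
require_once 'vendor/autoload.php';

$pheanstalk = new Pheanstalk('127.0.0.1:11300');

$pheanstalk->useTube('turn_timeouts')->put(
    json_encode(['game_id' => 42, 'player_id' => 7]),  // hypothetical payload
    1024,                                               // default priority
    6                                                   // delay in seconds
);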
I'm going with the following approach for now, since it seems to be the easiest to implement and test, as well as to deploy on different kinds of servers/hosting, while still acting reliably.
Set up a cron job to run a PHP script every minute.
Within that script, first do a query to find candidates that will have their endtime within this minute.
Start a while-loop, that runs until 59 seconds have passed.
Inside this loop, check the remaining time for each candidate.
If the time limit has passed, do another query on that specific candidate to ensure the endtime hasn't changed.
If it has, re-add it to the candidates queue as necessary. If not, act accordingly (in my case: switch the turn to the next player).
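
In code, that plan might look roughly like the sketch below (table and column names follow the footnote above; switch_turn() is a hypothetical helper for the game logic):

<?php
// Sketch of the per-minute cron approach described above (PDO assumed).
$pdo   = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$start = time();

// Candidates whose turn should end within this minute.
$candidates = $pdo->query(
    "SELECT id, switchTurnTime FROM games
     WHERE switchTurnTime < NOW() + INTERVAL 60 SECOND"
)->fetchAll(PDO::FETCH_ASSOC);

while (time() - $start < 59) {
    foreach ($candidates as $i => $game) {
        if (strtotime($game['switchTurnTime']) > time()) {
            continue;   // not due yet
        }
        // Re-check: the player may have moved since the first query.
        $check = $pdo->prepare("SELECT switchTurnTime FROM games WHERE id = ?");
        $check->execute([$game['id']]);
        $current = $check->fetchColumn();

        if ($current && strtotime($current) <= time()) {
            switch_turn($game['id']);   // hypothetical game-logic helper
        }
        unset($candidates[$i]);         // handled, or picked up by the next cron run
    }
    if (!$candidates) {
        break;
    }
    sleep(1);
}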
Hope this will help somebody in the future, cheers!
