I have a cron job in my application which does the following:
It gets entries from a database table whose status is set to 'pending'. The column list is: id, name, title, ip, status.
For each such entry it makes a REST (web service) call, gets the response, processes it, and stores the data in the database.
I presently set the cron job interval to 1 min.
But sometimes its execution might take as much as 5-10 minutes (rare cases).
For my case, is it important to handle concurrency of the cron job, e.g. using lock files?
Presently, when an entry is being processed, I change the value of the entry's status column to 'processing', so that it is not processed again by the next run of the cron job.
Using lock files has the extra advantage that if the same script is executed twice (by mistake), only the first one will actually run.
Even though you are setting the status column ('pending'/'processing'), I think it will still be problematic if the script is executed twice at the same time, unless you lock the rows/table.
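A minimal lock-file sketch with PHP's flock(), just to illustrate; the lock file path is an arbitrary choice for this example:

<?php
// Minimal lock-file sketch using flock(); the lock file path is an
// illustrative assumption, not part of the original setup.
$lockFile = fopen('/tmp/process_pending.lock', 'c');
if ($lockFile === false) {
    exit(1);
}

// LOCK_NB makes the call non-blocking: if another run already holds
// the lock, this run simply exits instead of piling up.
if (!flock($lockFile, LOCK_EX | LOCK_NB)) {
    echo "Another instance is already running, exiting.\n";
    exit(0);
}

// ... fetch 'pending' rows, call the web service, store the results ...

// The lock is released automatically when the script ends, but being
// explicit does not hurt.
flock($lockFile, LOCK_UN);
fclose($lockFile);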
I have an application in Symfony that needs to send emails/notifications from the app.
Since sending emails/notifications takes time, I decided to put them in a queue and process the queue periodically. That way I can decrease the response time for requests that involve dispatching an email/notification.
The cron job (a PHP script behind a Symfony route) that processes the queue runs every 30 seconds and checks whether there are any unsent emails/notifications. If it finds any, it fetches all of them from the queue table and starts sending them. When an email/notification is sent, the row's status flag in the queue table is updated to show that it has been sent.
Now, when there are enough emails in the queue that sending takes more than 30 seconds, another cron job also starts running and begins sending emails from the queue, resulting in duplicate emails/notifications being dispatched.
My table structure for the email queue is as follows:
|-------------------------------------|
| id | email | body | status | sentat |
|-------------------------------------|
My ideas to resolve this issue are as follows:
Set a flag in the database that a cron job is running, so that no other cron job proceeds if it finds the flag set.
Update the status to 'sent' for all records first, and then start sending the emails/notifications.
So my question is: is there an efficient approach to processing such queues? Is there any Symfony bundle/feature for this specific task?
You can use enqueue-bundle plus the Doctrine DBAL transport.
It already takes care of race conditions and other stuff.
Regarding your suggestions:
What if the cronjob process dies (for whatever reason) and cannot clean up the flag? A flag is not a good idea, I think. If you would like to follow this approach, you should not use a boolean, but rather either a process ID or a timestamp, so that you can check if the process is still alive or if it started a suspiciously long time ago without cleaning up.
Same question: what if the process dies? You don’t want to mark the mails as sent before they are sent.
I guess I’d probably use two fields: one for marking a record as “sending in progress” (thus telling other processes to skip this record) and another one for marking it as “sending successfully completed”. I’d write a timestamp to both, so that I can (automatically or manually) find those records where the “sending in progress” is > X seconds in the past, which would be an indicator for a died process.
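A rough PHP sketch of that two-column idea: "sentat" is the existing column from the queue table above, while "sending_started_at", "claimed_by", the table name email_queue, the 15-minute stale window, and the connection details are assumptions for illustration:

<?php
// Sketch of the two-timestamp idea from above. "sentat" exists in the
// question's table; "sending_started_at" and "claimed_by" are assumed
// extra columns, and the 15-minute stale window is an arbitrary choice.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$workerId = gethostname() . ':' . getmypid();

// Claim a batch: only rows that are unclaimed, or whose claim looks stale.
$claim = $pdo->prepare(
    "UPDATE email_queue
        SET claimed_by = :worker, sending_started_at = NOW()
      WHERE sentat IS NULL
        AND (sending_started_at IS NULL
             OR sending_started_at < NOW() - INTERVAL 15 MINUTE)
      LIMIT 50"
);
$claim->execute(['worker' => $workerId]);

// Work only on the rows this process actually claimed.
$rows = $pdo->prepare(
    "SELECT id, email, body FROM email_queue
      WHERE claimed_by = :worker AND sentat IS NULL"
);
$rows->execute(['worker' => $workerId]);

foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // ... send the email/notification here ...
    $done = $pdo->prepare("UPDATE email_queue SET sentat = NOW() WHERE id = :id");
    $done->execute(['id' => $row['id']]);
}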
You can use database transactions here. The rest will be handled by the database's locking mechanism and concurrency control. Generally, whatever DML/DCL/DDL commands you issue are treated as isolated transactions. In your question, if the 2nd cron job reads the rows before the 1st cron job has updated them as sent, it will find the emails unsent and try to send them again; and if the 3rd job finds them unsent before the 2nd job has updated them as sent, it will do the same. This can cause a big problem for you.
Whatever approach you take, there will be a race condition, so let the database handle it. There are many concurrency control methods you can refer to.
START TRANSACTION;
/* Perform your actions here: any number of reads/writes */
COMMIT;
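As a rough PHP illustration of the transaction idea above, using SELECT ... FOR UPDATE so a concurrent cron run blocks instead of reading the same rows; the table name, status values, and connection details are assumptions based on the queue table from the question:

<?php
// Rough illustration of wrapping the read-and-mark step in one transaction.
// Table/column names loosely follow the queue table above; everything else
// (status values, connection details) is an assumption.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();
try {
    // FOR UPDATE row-locks the selected rows, so a concurrent cron run
    // waits here instead of picking up the same emails.
    $stmt = $pdo->query(
        "SELECT id, email, body FROM email_queue
          WHERE status = 'unsent'
          LIMIT 50
            FOR UPDATE"
    );
    $batch = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($batch as $row) {
        // ... send the email here ...
        $pdo->prepare("UPDATE email_queue SET status = 'sent', sentat = NOW() WHERE id = ?")
            ->execute([$row['id']]);
    }

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}

Keep the batch small with this pattern, since the selected rows stay locked until COMMIT.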
Still, there is one problem with this solution: you will find at some stage that, as the number of read/write operations increases, some inconsistency still remains.
Here the isolation level of the database comes in. It is the factor that defines how isolated two transactions are from each other and how they are scheduled to run concurrently.
You can set the isolation level as per your requirements. Remember that concurrency is inversely proportional to the isolation level, so analyse your read/write statements, figure out which level you need, and do not use a higher level than that. I am giving some links which may help you:
http://www.ibm.com/developerworks/data/zones/informix/library/techarticle/db_isolevels.html
Difference between read commit and repeatable read
http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.htm
If you can post your database operations here, I can suggest a suitable isolation level.
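For example, the session isolation level can be set before starting the transaction; READ COMMITTED here is only an illustration, not a recommendation for any particular workload:

<?php
// Illustration only: setting the isolation level for the session before
// starting the transaction. Which level is right depends on your statements.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->exec("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED");

$pdo->beginTransaction();
// ... reads and writes ...
$pdo->commit();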
I have a MySQL table called cronjobs which holds entries for every cron job needed (e.g. delete old emails, update profile age, and so on). For every cron job there is a defined code block which gets executed if the cron job is due (I have different intervals for different cron jobs).
For the execution of the due cron jobs, I have a PHP script which is executed by the UNIX crontab every minute (it calls execute_cronjobs_due.sh, which calls "php -f /path/to/file/execute_cronjobs_due.php").
When execute_cronjobs_due.php runs, all cron jobs that are going to be executed get marked accordingly, so that another call of execute_cronjobs_due.php wouldn't cause a parallel execution of the same cron job.
Now the problem: sometimes the execution takes more than 60 seconds, but crontab does not call execute_cronjobs_due.sh again after those 60 seconds. What actually happens is that execute_cronjobs_due.sh is called right after the previous execution finishes. And if an execution takes more than 120 seconds, the next two executions are initiated simultaneously.
Timeline:
2015-06-15 10:00:00: execution execute_cronjobs_due.sh (takes 140 seconds)
2015-06-15 10:02:20: two simultaneous executions of execute_cronjobs_due.sh
Since they are executed exactly simultaneously, there is no use in marking the cron jobs as being executed, because the SELECTs (which should actually exclude the marked ones) are executed at the exact same time. So the UPDATE only happens after both have already selected the due cron jobs.
How can I solve this problem, so that there are no simultaneous executions of cronjobs? Can I use MySQL table locks?
Thank you very much for your help in advance,
Frederic
Yes, you could use MySQL table locks, but this may be overkill for your situation. Anyway, to do that in the most generic way:
Make sure that you have autocommit off
LOCK TABLES cronjobs WRITE;
do your stuff
UNLOCK TABLES;
For exact syntax and details read the docs, obviously: https://dev.mysql.com/doc/refman/5.0/en/lock-tables.html . I personally have never used table-level locking, so maybe there are some catches involved I am not aware of.
What I would do, if you use the InnoDB table engine, is go with optimistic locking:
start a transaction as the first thing in your script
get some ID for this script run, e.g. the process PID (getmypid()) or a combination of host + PID, or just generate a GUID if you are not sure which fits best
do something like UPDATE cronjobs SET executed_by = my_id WHERE executed_by IS NULL AND /* whatever condition to get jobs to run */
then SELECT * FROM cronjobs WHERE executed_by = my_id
do your stuff on whatever the above SELECT returned
UPDATE cronjobs SET executed_by = NULL WHERE executed_by = my_id
This should be easy to do, easier to track what happens, and it scales in the future (i.e. you can have a few instances running in parallel as long as they execute different scripts).
With this solution the second script will not fail (technically), it will just run 0 jobs.
The minus is that you will have to clean up jobs that were claimed but that the script failed to mark as finished, but you probably have to do that anyway with your current solution. The easiest way would be to add a timestamp column that tracks when the job was last claimed and to expire the claim after e.g. 15 minutes or an hour, depending on business requirements (short pseudocode: the first UPDATE would do SET executed_by = my_id, started_at = NOW() WHERE executed_by IS NULL OR (executed_by IS NOT NULL AND started_at < NOW() - 1 hour)). A rough PHP sketch of the whole approach follows.
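A sketch only: the columns executed_by and started_at, the 1-hour expiry, the due-job condition, and the connection details are assumptions; the atomic UPDATE does the claiming here, so no explicit transaction is shown.

<?php
// Sketch of the claim-then-select approach described above; column names
// (executed_by, started_at) and the 1-hour expiry are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$myId = gethostname() . ':' . getmypid();

// 1. Claim due jobs, including ones whose previous claim looks stale.
$claim = $pdo->prepare(
    "UPDATE cronjobs
        SET executed_by = :me, started_at = NOW()
      WHERE (executed_by IS NULL
             OR started_at < NOW() - INTERVAL 1 HOUR)
        /* plus whatever condition selects the jobs that are due */"
);
$claim->execute(['me' => $myId]);

// 2. Work only on what this process actually claimed.
$jobs = $pdo->prepare("SELECT * FROM cronjobs WHERE executed_by = :me");
$jobs->execute(['me' => $myId]);

foreach ($jobs->fetchAll(PDO::FETCH_ASSOC) as $job) {
    // ... run the job's code block here ...
}

// 3. Release the claim when done.
$release = $pdo->prepare("UPDATE cronjobs SET executed_by = NULL WHERE executed_by = :me");
$release->execute(['me' => $myId]);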
How can I solve this problem, so that there are no simultaneous executions of cronjobs?
There are multiple ways to solve this. The following might be helpful as well:
My suggestion is to keep it simple and use either a file-locking or a file-existence-checking approach.
file_exists() + PID based CronHelper class
http://abhinavsingh.com/how-to-use-locks-in-php-cron-jobs-to-avoid-cron-overlaps/
flock() based: https://stackoverflow.com/a/5428665/1163786
when you want to avoid I/O, store the locking state in memcache (see the sketch after this list)
database transactions: see below and #sakfa's answer
lock cronjobs across a distributed system using Redis as central: https://github.com/kvz/cronlock & http://kvz.io/blog/2012/12/31/lock-your-cronjobs/
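For the memcache variant, a small sketch using the Memcached extension; the key name, TTL, and server address are arbitrary choices, and add() works as a lock because it fails if the key already exists:

<?php
// Rough sketch of a lock in Memcached (requires the memcached PECL extension).
// add() is atomic and fails if the key already exists, which is what makes
// it usable as a lock. Key name, TTL and server address are assumptions.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$lockKey = 'cron:process_queue:lock';
$ttl     = 300; // seconds; acts as an automatic expiry if the job dies

if (!$memcached->add($lockKey, getmypid(), $ttl)) {
    // Another run holds the lock; skip this round.
    exit(0);
}

try {
    // ... process the queue here ...
} finally {
    $memcached->delete($lockKey);
}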
Can I use MySQL table locks?
Yes, but it's a bit overkill.
You would use a "cronjob processing table" with a cronjob status column ("ToDo, Started, Complete" or "Todo, Running, Done") and a PID column.
Then you select jobs and mark their state by using transactions.
That makes sure that "selecting a job from Todo" and "marking it as running/started" are done in one step. In the end, you might still have multiple executions of your "central cronjob processing script", but jobs are NOT selected multiple times for processing.
When a user submits a form on my site, I have to do a job based on the form, which is essentially:
Check for user locks (in Redis; this prevents the user from doing naughty things); if there are no locks, continue and put a job queue lock in place, otherwise quit the job and give an error to the user
Update row/s in a MySQL table, potentially delete some rows in the same table, and do at least 1 insert (potentially across different tables)
remove the job queue lock
I would like to queue these jobs up as they come in, with the queue always processing new jobs that get put into it.
I am using PHP and MySQL. I have looked at Gearman and also Resque for PHP. Gearman seems like it might be overkill. I also want to potentially be able to handle thousands of these jobs per second, so speed is important.
It's crucial that these jobs in the queue occur sequentially and in the order they come in. It would also be a bonus if, every half a second, I could insert a job at the front of the queue (it's a different job, but kind of related).
I've never done anything like this before.
Since you're already into PHP & Redis, it looks like Resque may work for you.
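A minimal sketch of how that might look with the chrisboulton/php-resque package; the queue and class names are made up, and the worker process is started separately as described in that package's README:

<?php
// Rough php-resque sketch; assumes the chrisboulton/php-resque package is
// installed via Composer. Queue and class names are illustrative only.
require 'vendor/autoload.php';

Resque::setBackend('127.0.0.1:6379');

// Enqueued when the form is submitted; a worker process picks it up later.
Resque::enqueue('form_jobs', 'ProcessFormJob', ['user_id' => 42]);

// The job class executed by the worker (started separately; with a single
// worker on a single queue, jobs run in the order they were enqueued).
class ProcessFormJob
{
    public function perform()
    {
        $args = $this->args;
        // 1. check the user locks in Redis, bail out if locked
        // 2. update/delete/insert the MySQL rows
        // 3. remove the job queue lock
    }
}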
I have 5 cron jobs running a PHP file. The PHP file checks the MySQL database for items that require processing. Since cron launches the scripts all at the same time, it seems that some of the items are processed twice, or even sometimes up to five times.
Upon SELECTing an item in one of the scripts, it immediately sends an UPDATE query so that the other jobs shouldn't run it again. But it looks like items are still being double-processed.
What can I do to prevent the other scripts from processing an item that was previously selected by the other cron jobs?
This issue is called a "race condition". In this case it happens because SELECT and UPDATE, though called one after another, are not a single atomic operation. Therefore, there is a chance that two jobs SELECT the same item, then the first does its UPDATE, then the second does its UPDATE, and so both proceed to run the same job simultaneously.
There is a workaround, however.
You could add a field to your table containing the ID of the current cron job worker (if you run them all on one machine, it can be the PID). In the worker, you do the UPDATE first, trying to reserve a job for it:
UPDATE jobs
SET worker = $PID, status = 'processing'
WHERE worker IS NULL AND status = 'awaiting' LIMIT 1
Then you verify you successfully reserved a job for this worker:
SELECT * FROM jobs WHERE worker = $PID
If it did not return a row, it means another worker was first to reserve it. You can try again from step 1 to acquire another job. If it did return a row, you do all your processing and then the final UPDATE at the end:
UPDATE jobs
SET status = 'done', worker = NULL
WHERE id = $JOB_ID
I think you have a typical problem that semaphores can solve. Take a look at this article:
http://www.re-cycledair.com/php-dark-arts-semaphores
The idea is that at the start of each script you ask for the same semaphore and wait until it is free. Then you SELECT and UPDATE the DB as you do now, free the semaphore, and start processing. This is the only way you can be sure that no more than one script is reading the DB while another one is about to write to it.
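A small sketch of the semaphore idea with PHP's SysV functions; the key file passed to ftok() is an arbitrary choice for this example:

<?php
// Small sketch of the semaphore approach using PHP's SysV extension.
// The key file used by ftok() is an arbitrary choice for this example.
$key = ftok(__FILE__, 'q');
$sem = sem_get($key, 1);      // at most one holder at a time

if ($sem === false) {
    exit(1);
}

sem_acquire($sem);            // blocks until the semaphore is free
try {
    // SELECT the next unprocessed item and immediately UPDATE its status
    // here, while no other script can get between its SELECT and UPDATE.
} finally {
    sem_release($sem);
}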
I would start again. This train of thought:
it takes time to process one item. about 30 seconds. if i have five cron jobs, five items are processed in 30 seconds
This is just plain wrong and you should not write your code with this in mind.
By that logic, why not make 100 cron jobs and do 100 per 30 seconds? Answer: because your server is not RoadRunner and it will fall over and fail.
You should
Rethink your problem; this is the most important point, as it will help with the next two.
Optimise your code so that it does not take 30 seconds.
Segment your code so that each job only does one task at a time, which will make it quicker and also ensure that you do not get this 'double processing' effect.
EDIT
Even with the new knowledge that this is on a third-party server, my logic still stands: do not start multiple calls that you are not in control of; in fact, this is now even more important.
If you do not know what they are doing with the calls, then you cannot be sure the calls are in the right order, when they are processed, or whether they are processed at all. So just make one call to ensure you do not get double processing.
A technical solution would be for them to improve the processing time, or for you to cache the responses, but that may not be relevant to your situation.
I'm trying to build a service that will collect some data from the web at certain intervals, then parse that data and, depending on the result of the parsing, execute dedicated procedures. A typical schematic of a service run:
Request the list of items to be updated
Download data for the listed items
Check what's not updated yet
Update the database
Filter the data that contains updates (get only the highest-priority updates)
Perform some procedures to parse the updates
Filter the data that contains updates (get only the medium-priority updates)
Perform some procedures to parse ...
...
...
Everything would be simple if there were not so much data to be updated.
There is so much data to be updated that at every step from 1 to 8 (maybe except step 1) the scripts will fail due to the restriction of a 60-second max execution time. Even if there were an option to increase it, this would not be optimal, as the primary goal of the project is to deliver the highest-priority data first. Unfortunately, determining the priority level of a piece of information requires getting the majority of all the data and doing a lot of comparisons between the already stored data and the incoming (update) data.
I could sacrifice the speed of the service in exchange for getting at least the high-priority updates, and wait longer for all the others.
I thought about writing some parent script (a manager) to control every step (1-8) of the service, maybe by executing other scripts?
The manager should be able to resume an unfinished step (script) to get it completed. It is possible to write every step in such a way that it does some small portion of work and, after finishing it, marks this small portion as done in e.g. an SQL DB. After the manager resumes it, the step (script) will continue from the point where it was terminated by the server for exceeding the max execution time.
Known platform restrictions:
remote server, unchangeable max execution time, usually a limit of one script running at the same time, lack of access to many Apache features, and all the other restrictions typical of remote servers
Requirements:
Some kind of manager is mandatory, as besides calling particular scripts this parent process must write some notes about the scripts that were activated.
The manager can be called by curl; a one-minute interval is enough. Unfortunately, giving curl a list of calls to every step of the service is not an option here.
I also considered getting a new remote host for every step of the service and controlling them from another remote host that could call them and ask them to do their job using e.g. SOAP, but this scenario is at the bottom of my list of preferred solutions, because it does not solve the max execution time problem and it involves a lot of data exchange over the global net, which is the slowest way to work on data.
Any thoughts on how to implement a solution?
I don't see how steps 2 and 3 by themselves can take over 60 seconds. If you use curl_multi_exec for step 2, it will run in seconds. And if your script were spending over 60 seconds at step 3, you would hit "memory limit exceeded" instead, and a lot earlier.
All that leads me to the conclusion that the script is very unoptimized. The solution would be to:
break the task into (a) deciding what to update and saving that in the database (say, flag 1 for what to update, 0 for what not to); and (b) cycling through the rows that need an update and updating them, setting the flag back to 0. At ~50 seconds just shut down (assuming that the script is run every few minutes, that will work); see the sketch below.
get a second server and set it up with a proper execution time so it can run your script for hours. Since it will have access to your first database directly (and not via HTTP calls), it won't be a major traffic increase.
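A rough sketch of the time-boxed loop from point (b) above; the table name, the needs_update flag, the priority ordering, and the connection details are assumptions:

<?php
// Rough sketch of the time-boxed approach: process flagged rows until
// roughly 50 seconds have passed, then stop and let the next cron run
// continue. Table and column names are assumptions; the priority ordering
// reflects the "highest-priority data first" goal from the question.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$start = time();

while (time() - $start < 50) {
    $row = $pdo->query(
        "SELECT * FROM items WHERE needs_update = 1 ORDER BY priority DESC LIMIT 1"
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        break; // nothing left to update
    }

    // ... download/parse/update this one item ...

    $pdo->prepare("UPDATE items SET needs_update = 0 WHERE id = ?")
        ->execute([$row['id']]);
}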