Handle Queue race condition in PHP Symfony with MySQL Database

I have an application in Symfony that needs to send emails/notifications from the app.
Since the email/notification sending process takes time, I decided to put the messages in a queue and process the queue periodically. That way I can reduce the response time for requests that involve dispatching an email/notification.
The cron job (a PHP script behind a Symfony route) that processes the queue runs every 30 seconds. It checks whether there are any unsent emails/notifications; if so, it fetches all the pending rows from the queue table and starts sending them. When an email/notification is sent, the row's status flag is updated to mark it as sent.
Now, when there are enough emails in the queue that sending takes more than 30 seconds, the next cron job also starts running and begins sending emails from the same queue, resulting in duplicate email/notification dispatch.
My table structure for the email queue is as follows:
|-------------------------------------|
| id | email | body | status | sentat |
|-------------------------------------|
My ideas to resolve this issue are as follows:
1. Set a flag in the database indicating that a cron job is running, and have other cron jobs skip processing while the flag is set.
2. Update the status to 'sent' for all records first, and then start sending the emails/notifications.
So my question is: is there a more efficient approach to processing queues? Is there a Symfony bundle or feature for this specific task?

So my question is: is there a more efficient approach to processing queues? Is there a Symfony bundle or feature for this specific task?
You can use the enqueue-bundle together with its Doctrine DBAL transport.
It already takes care of race conditions and related concerns.

Regarding your suggestions:
1. What if the cron job process dies (for whatever reason) and cannot clean up the flag? A plain flag is not a good idea, I think. If you would like to follow this approach, you should not use a boolean, but rather a process ID or a timestamp, so that you can check whether the process is still alive or whether it started suspiciously long ago without cleaning up.
2. Same question: what if the process dies? You don't want to mark the mails as sent before they are actually sent.
I guess I'd probably use two fields: one for marking a record as "sending in progress" (thus telling other processes to skip this record) and another for marking it as "sending successfully completed". I'd write a timestamp to both, so that I can (automatically or manually) find records whose "sending in progress" timestamp is more than X seconds in the past, which would indicate a dead process.
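A sketch of that two-timestamp scheme, with SQLite standing in for MySQL so the example is self-contained (the column names sending_started_at / sent_at and the 300-second threshold are illustrative): a worker claims rows, and a reaper releases claims older than the threshold.

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE email_queue (
    id INTEGER PRIMARY KEY,
    email TEXT,
    sending_started_at REAL,  -- set when a worker picks the row up
    sent_at REAL              -- set only after the mail actually went out
)""")
conn.executemany("INSERT INTO email_queue (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",)])

STALE_AFTER = 300  # seconds; claims older than this belong to a dead process

def claim_batch(now, limit=10):
    """Mark unclaimed, unsent rows as in-progress and return their ids."""
    rows = conn.execute(
        "SELECT id FROM email_queue "
        "WHERE sent_at IS NULL AND sending_started_at IS NULL LIMIT ?",
        (limit,)).fetchall()
    ids = [r[0] for r in rows]
    conn.executemany("UPDATE email_queue SET sending_started_at=? WHERE id=?",
                     [(now, i) for i in ids])
    return ids

def mark_sent(row_id, now):
    conn.execute("UPDATE email_queue SET sent_at=? WHERE id=?", (now, row_id))

def requeue_stale(now):
    """Release claims from workers that died mid-send."""
    conn.execute(
        "UPDATE email_queue SET sending_started_at=NULL "
        "WHERE sent_at IS NULL AND sending_started_at IS NOT NULL "
        "AND sending_started_at < ?",
        (now - STALE_AFTER,))

t0 = time.time()
ids = claim_batch(t0)          # both rows claimed
mark_sent(ids[0], t0)          # first mail went out; pretend the worker then died
requeue_stale(t0 + 600)        # 10 minutes later, the reaper frees the second row
```

After the reaper runs, row 2 is claimable again while row 1 stays marked as sent.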

You can use database transactions here; the database's locking mechanism and concurrency control will handle the rest. Generally, every DML/DCL/DDL statement you issue is treated as an isolated transaction. In your question, if the 2nd cron job reads the rows before the 1st cron job has updated them as sent, it will find the emails unsent and try to send them again; and if the 3rd job finds them unsent before the 2nd job updates them, it will do the same. That can cause a big problem for you.
Whatever approach you take, there will be a race condition, so let the database handle it. There are many concurrency control methods you can refer to.
START TRANSACTION;
-- Perform your actions here: any number of reads/writes
COMMIT;
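Note that a transaction alone does not stop two workers from reading the same unsent rows; you also need to lock the rows while flagging them, which in MySQL/InnoDB is SELECT ... FOR UPDATE (a technique I'm naming here; the answer above only mentions transactions). A sketch of the claim step, with SQLite standing in for MySQL (SQLite has no FOR UPDATE, so BEGIN IMMEDIATE plays the role of taking the write lock up front):

```python
import sqlite3

# SQLite stand-in for the MySQL table; with InnoDB the SELECT below would be
# "SELECT ... FOR UPDATE" between START TRANSACTION and COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE email_queue "
             "(id INTEGER PRIMARY KEY, email TEXT, status TEXT DEFAULT 'pending')")
conn.execute("INSERT INTO email_queue (email) "
             "VALUES ('a@example.com'), ('b@example.com')")

def claim_unsent(limit=10):
    """Atomically pick pending rows and flag them, so a second worker skips them."""
    conn.execute("BEGIN IMMEDIATE")          # take the write lock up front
    rows = conn.execute(
        "SELECT id, email FROM email_queue WHERE status='pending' LIMIT ?",
        (limit,)).fetchall()
    conn.executemany("UPDATE email_queue SET status='sending' WHERE id=?",
                     [(r[0],) for r in rows])
    conn.execute("COMMIT")                   # lock released; rows are now flagged
    return rows

first = claim_unsent()    # claims both rows
second = claim_unsent()   # a "second cron job" finds nothing left
```

Because the read and the flagging happen inside one locked section, the second caller sees only already-flagged rows.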
There is still one problem with this solution: as the number of read/write operations increases, you may find that some inconsistency remains.
This is where the database's isolation level comes in. It is the factor that defines how strongly two transactions are isolated from each other and how they are scheduled to run concurrently.
You can set the isolation level to match your requirements. Remember that concurrency is inversely proportional to the isolation level, so analyse your read/write statements and figure out which level you need; do not use a higher level than that. Here are some links which may help:
http://www.ibm.com/developerworks/data/zones/informix/library/techarticle/db_isolevels.html
Difference between read commit and repeatable read
http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html
If you post your database operations here, I can suggest a suitable isolation level.

Related

What would be the best way to queue MySQL queries, in terms of speed?

When a user submits a form on my site, I have to run a job based on the form, which is essentially:
1. Check for user locks (in Redis; this prevents users from doing naughty things). If there are no locks, continue and put a job-queue lock in place; otherwise abort the job and show the user an error.
2. Update row(s) in a MySQL table, potentially delete some rows in the same table, and do at least one insert (potentially across different tables).
3. Remove the job-queue lock.
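The lock → work → unlock flow in those steps could be sketched roughly as follows. A plain dict stands in for Redis here, and acquire/release are made-up names; a real Redis lock would use SET with NX and an expiry, and a check-and-delete on release:

```python
locks = {}  # stand-in for Redis; key -> owner token

def acquire(key, token):
    """Mimics SET key token NX: succeeds only if nobody holds the lock."""
    if key in locks:
        return False
    locks[key] = token
    return True

def release(key, token):
    """Delete only if we still own the lock (mirrors the check-and-del pattern)."""
    if locks.get(key) == token:
        del locks[key]

def run_job(user_id, token, do_work):
    if not acquire(f"user:{user_id}", token):
        return "error: user is locked"     # step 1: bail out, tell the user
    try:
        do_work()                          # step 2: the MySQL updates/deletes/inserts
    finally:
        release(f"user:{user_id}", token)  # step 3: always drop the lock
    return "ok"

print(run_job(42, "tok-a", lambda: None))  # ok
```

The try/finally guarantees step 3 runs even if the database work throws, which avoids the stuck-lock problem discussed in the first question above.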
I would like to queue these jobs up as they come in, with the queue always processing new jobs that get put into it.
I am using php and mysql. I have looked at gearman and also resque for php. Gearman seems like it might be overkill. And also I want to potentially be able to handle thousands of these jobs per second. So speed is important.
It's crucial that these jobs in the queue occur sequentially and in the order they come in. It would also be a bonus if every half a second I could insert a job to the front of the queue (it's a different job but kind of related).
I've never done anything like this before.
Since you're already using PHP and Redis, it looks like Resque may work for you.

Best solution for running multiple intensive jobs at specific times

We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' is stored in a MySQL DB with a timestamp for when it should be run (possibly months into the future). Jobs can be cancelled at any time by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a system where we have cron script running every minute or so that gets all the jobs from the DB that need delivering in the next X minutes. It then splits them up into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (in order to achieve 'fake' multithreading). The server then processes each batch of Z jobs that come in via HTTP.
The reason we use an async HTTP POST for multithreading and not something like pnctl_fork is so that we can add other servers and have them POST the data to those instead, and have them run the jobs rather than the current server.
So my question is - is there a better way to do this?
I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with messages that become reservable, and runnable, in X seconds (the time you want the job to run minus the time now).
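The put-with-delay arithmetic is just "run time minus now", clamped at zero so overdue jobs run immediately. A small helper (the actual beanstalkd client call is shown only as a comment, since client libraries differ):

```python
def put_delay_seconds(run_at_ts, now_ts):
    """Seconds beanstalkd should hold the job before making it reservable."""
    return max(0, int(run_at_ts - now_ts))

# With a client such as Pheanstalk (PHP) or greenstalk (Python) you would then
# do something like: queue.put(payload, delay=put_delay_seconds(run_at, now))
print(put_delay_seconds(1_000_300, 1_000_000))  # 300
print(put_delay_seconds(1_000_000, 1_000_300))  # overdue: 0
```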
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to reserve. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP); that could barely do 20 QPS at the very most, while beanstalkd accepted over a thousand per second with barely any effort.
Edited to add: you can't delete a job without knowing its ID, though you could store the ID outside the queue. On the other hand, do users really have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance; you could have a single DB reader that runs every, say, 1 to 5 minutes to put just the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they bring.
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.

Should I be using message queuing for this?

I have a PHP application that currently has 5k users and will keep growing for the foreseeable future. Once a week I run a script that:
fetches all the users from the database
loops through the users, and performs some upkeep for each one (this includes adding new DB records)
The last time this script ran, it only processed 1400 users before dying due to a 30-second maximum execution time error. One solution I thought of was to have the main script still fetch all the users, but instead of performing the upkeep itself, it would make an asynchronous cURL call (one per user) to a new script that performs the upkeep for that particular user.
My concern here is that 5k+ cURL calls could bring down the server. Is this something that could be remedied by using a messaging queue instead of cURL calls? I have no experience using one, but from what I've read it seems like this might help. If so, which message queuing system would you recommend?
Some background info:
this is a Symfony project, using Doctrine as my ORM and MySQL as my DB
the server is a Windows machine, and I'm using the Windows Task Scheduler and wget to run this script automatically once per week.
Any advice and help is greatly appreciated.
If it's possible, I would make a scheduled task (cron job) that runs more often and uses LIMIT 100 (or some other number) to process a limited number of users at a time.
A few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but more than 30 seconds would be a start.
Track Upkeep against Users
Maybe add a field for each user, last_check and have that field set to the date/time of the last successful "Upkeep" action performed against that user.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set a cronjob and process, say 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased user volumes and will form a strong foundation for any further maintenance tasks you might look at down the track.
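Putting the last_check idea together, one run of the batch job boils down to "take the N users checked longest ago, do the upkeep, stamp them". A self-contained sketch with SQLite standing in for MySQL/Doctrine (table and column names follow the answer; the batch size is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users "
             "(id INTEGER PRIMARY KEY, name TEXT, last_check REAL NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"user{i}",) for i in range(5)])

BATCH_SIZE = 2  # tune so one cron run finishes well inside the time limit

def run_batch(now, do_upkeep):
    """Process the users whose last successful upkeep is oldest."""
    batch = conn.execute(
        "SELECT id FROM users ORDER BY last_check ASC, id ASC LIMIT ?",
        (BATCH_SIZE,)).fetchall()
    for (user_id,) in batch:
        do_upkeep(user_id)  # the actual maintenance work goes here
        conn.execute("UPDATE users SET last_check=? WHERE id=?", (now, user_id))
    return [user_id for (user_id,) in batch]

processed = []
for minute in range(3):  # simulate three cron runs
    processed += run_batch(now=minute + 1, do_upkeep=lambda uid: None)
print(processed)  # every user gets covered across runs: [1, 2, 3, 4, 5, 1]
```

Because the ORDER BY always favours the stalest users, no user is starved even if a run is skipped or killed.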
Why not keep the cURL idea, but instead of processing only one user per call, send a batch of users to each call by splitting them into groups of 1000 or so?
Have you considered changing your logic to commit changes as you process each user? It sounds like you may be running a single transaction to process all users, which may not be necessary.
How about just increasing the execution time limit of PHP?
Also, look into whether you can make your upkeep procedure itself faster. Depending on what exactly you are doing, you could also spread the work out a bit: do a few users at a time rather than everyone at once. But that depends on what exactly you're doing, of course.

Postpone Job Queue with Gearman

I want to move some time-consuming work into a queue. For this, Gearman seems to be the most widely used option, but I don't know if it is the right tool for me.
One of the tasks we want to queue is sending emails, and we want to give users the ability to cancel a mail for 1 minute after submitting it. So the worker should not pick up the job right away, but execute it at now + 1 minute. That way I can cancel the job before then and the mail never gets sent.
Is there a way to do this?
It will run on Debian and should be usable from PHP. The only thing I found so far was "Schedule a job in Gearman for a specific date and time", but that relies on something not widely deployed :(
There are two parts to your question: (1) scheduling in the future and (2) being able to cancel the job until that time.
For (1), the at command should work just fine as specified in that question, and the poster even shared his wrapper code. Have you tried it?
If you don't want to use that, consider this scenario:
Insert an email record for the email to be sent into a database, including a timeSent column set to 1 minute in the future.
Have a single Gearman worker (I'll explain why it must be single) look in the database for emails that have not been sent (e.g. some status column = 0) and whose timeSent has already passed, and send those.
So, for (2): if you want to cancel an email before it's sent, just update its status column to something else.
Your Gearman worker has to be a single one, because if you have multiple they might fetch and try to send the same email record. If you need multiple workers, make sure the one that gets an email record first locks it immediately, before any time-consuming operation like actually emailing it (say, by updating that status column to something else).
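The scheme in this answer reduces to one query per worker pass: "pending and due". A sketch with SQLite standing in for MySQL (the status values 0 = pending, 1 = cancelled, 2 = sent are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails "
             "(id INTEGER PRIMARY KEY, body TEXT, time_sent REAL, status INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO emails (body, time_sent) VALUES (?, ?)",
                 [("hello", 100), ("world", 100), ("later", 900)])

def cancel(email_id):
    """The user changed their mind inside the grace minute."""
    conn.execute("UPDATE emails SET status=1 WHERE id=? AND status=0", (email_id,))

def worker_pass(now, send):
    """Single worker: send everything pending whose scheduled time has passed."""
    due = conn.execute(
        "SELECT id, body FROM emails WHERE status=0 AND time_sent <= ?",
        (now,)).fetchall()
    for email_id, body in due:
        send(body)
        conn.execute("UPDATE emails SET status=2 WHERE id=?", (email_id,))
    return [e for e, _ in due]

cancel(2)                                        # cancelled before its minute was up
sent = worker_pass(now=150, send=lambda b: None)
print(sent)  # only the due, uncancelled mail: [1]
```

Cancelling is just a status flip, so it races safely with the worker: whichever statement runs first wins, and the mail is either sent or cancelled, never both.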

what are queue access concurrency solutions?

I am trying to find out how difficult it would be to implement a queue system. I know how to implement a basic queue, so I'll explain a little about what I'm after, with some background:
I will be implementing a queue into which messages will be placed. These will come from several users, and the messages will be scheduled to be posted at user-defined times (multiple occurrences are allowed, with a precision of minutes; from a UI perspective I will restrict occurrences to "every minute" or "every hour", but I'd like the system to be able to handle finer granularity).
Here is where my question comes in:
Eventually I may be in a situation (or maybe not) where MANY messages need to be posted at the same time. I'd like to have several processes (multiple instances of a script) running, each fetching some number of messages (say 10 or 25) from the queue at a time and processing them. The problem is: how do I do this so that each instance processes unique messages, without picking up something that is already being processed by another instance? I'm worried about concurrent connections, how to lock records, and anything else I may not be thinking of.
Technologies I will be using are PHP and MySQL. I am looking for solutions, terms I should use in my searches, real-world examples, thoughts, comments and ideas.
Thank you all!
One solution I came across is Amazon Simple Queue Service; it promises unique message processing/locking: http://aws.amazon.com/sqs/
Well, I'd do it like this:
Make your table for messages and add two more fields: PROCESS_ID and PROCESS_TIME. These are explained below.
Give each process a unique ID. They can generate it at the startup (like a GUID), or you can assign them yourself (then you can tell them apart more easily).
When a process wants to fetch a bunch of messages, it then does something like this:
UPDATE messages SET process_id = $id, process_time = NOW() WHERE process_id IS NULL LIMIT 20;
SELECT * FROM messages WHERE process_id = $id;
This will find 20 "free" messages and "lock" them. Then it will find the messages that it locked and process them. After each message is processed, DELETE it.
The UPDATE statement should be pretty atomic, especially if you use InnoDB, which wraps each such statement in a transaction automatically. MySQL should take care of all the concurrency there.
The PROCESS_TIME field is optional, but you can use it to see when a process has hung. If a message stays locked for too long, you can conclude that something went wrong and investigate.
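The UPDATE-then-SELECT claim above can be exercised end to end. Here is a sketch with SQLite standing in for MySQL; since stock SQLite builds lack UPDATE ... LIMIT, the sketch limits via a subquery, which is still a single atomic statement (the MySQL form is kept as a comment):

```python
import sqlite3, uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages "
             "(id INTEGER PRIMARY KEY, body TEXT, process_id TEXT, process_time REAL)")
conn.executemany("INSERT INTO messages (body) VALUES (?)",
                 [("m1",), ("m2",), ("m3",)])

def claim(batch_size=2):
    """UPDATE-then-SELECT: stamp free rows with our id, then read back only ours."""
    pid = str(uuid.uuid4())  # the per-process GUID from the answer
    # MySQL equivalent:
    #   UPDATE messages SET process_id=?, process_time=NOW()
    #   WHERE process_id IS NULL LIMIT ?;
    conn.execute(
        "UPDATE messages SET process_id=?, process_time=julianday('now') "
        "WHERE id IN (SELECT id FROM messages WHERE process_id IS NULL LIMIT ?)",
        (pid, batch_size))
    return conn.execute(
        "SELECT id, body FROM messages WHERE process_id=?", (pid,)).fetchall()

a = claim()  # first worker grabs two messages
b = claim()  # second worker can only get what's left
```

Because each worker only ever SELECTs rows stamped with its own ID, two workers can never process the same message.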
You could turn the problem around.
Instead of wrestling with getting things out of the queue at the same time, publish all the info as soon as you get it, but publish it with a rule that it is not supposed to be visible until a certain time. Doing things this way can help you avoid locking/contention problems.
Have a look at the Beanstalkd message queue. There are PHP clients for it. One of the nice features of Beanstalkd (as opposed to e.g. dropr) is that you can delay messages. That is, you can post a message to the queue and it will not be delivered to a client until X seconds have passed.
Beanstalkd does have one big downside though: It's an in-memory queue. That means if it (or your machine) crashes then the queue is empty and the contents lost. Persistence is a feature planned for the next version of beanstalkd.
A couple of hosted solutions:
Amazon SQS
The Google App Engine queue system
I guess the Google solution is much cheaper (it could even be free if you don't use it much).
I have also been thinking about implementing a queue in PHP/MySQL, and thought of using:
MySQL's GET_LOCK() to implement some sort of lock.
Putting the queue in a MySQL MEMORY (heap) table, because an in-memory queue is much faster than an on-disk queue; though you risk losing data when the machine crashes.
Named pipes to communicate with the worker processes.
