I'm writing a PHP project in Laravel. The admin can queue email alerts to be sent at a specific date/time. The natural choice was to use Laravel's Queue class with Beanstalkd (Laravel uses Pheanstalk internally).
However, the admin can choose to reschedule or delete email alerts that have not yet been sent. I have not yet found a way to delete a specific task so that I can insert a new one with the new timing.
What is the usual technique to accomplish something like this? I'm open to other ideas as well. I don't want to use CRON, since the volume of these emails will be pretty high and I'd rather reuse an already-tested solution for managing a task queue.
I have scoured the internet for such an answer myself.
Here's what I've found:
Beanstalkd
I think the simplest way within Beanstalkd is to use its ability to delay a job (giving it a number of seconds to delay as an argument).
You can then do date math to capture the difference in time between now() and when the job ideally is run (in seconds) and delay the job that long.
Here's an example (see first answer to that SO question - it's in Python, but you can get the gist of what the person is saying/doing)
Note that this doesn't guarantee the task is run on time - the delay just makes the job available after X seconds. How far behind your queue is on processing tasks when the delayed job becomes available determines when the job actually runs. (If your queue is backed up with a heavy number of jobs, the job won't necessarily run exactly on time!)
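The date math above can be sketched in PHP. The helper below is plain PHP; the commented usage assumes the pda/pheanstalk library and a tube name ("emails") that are my own choices, not anything from the question:

```php
<?php
// Compute the delay (in seconds) between now and the scheduled send time.
// beanstalkd expects a non-negative delay, so past times clamp to zero.
function secondsUntil(DateTimeInterface $runAt, ?DateTimeInterface $now = null): int
{
    $now = $now ?? new DateTimeImmutable();
    return max(0, $runAt->getTimestamp() - $now->getTimestamp());
}

// Usage (requires a running beanstalkd daemon and pda/pheanstalk):
// $pheanstalk = Pheanstalk\Pheanstalk::create('127.0.0.1');
// $delay = secondsUntil(new DateTimeImmutable('2024-06-01 09:00:00'));
// $pheanstalk->useTube('emails')->put(
//     json_encode(['alert_id' => 42]),
//     Pheanstalk\Pheanstalk::DEFAULT_PRIORITY,
//     $delay
// );
```

Rescheduling then becomes "delete the old job (if you stored its ID) or let the worker skip it, and put a new job with a freshly computed delay."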
Laravel
Laravel's Queue has a later method, which you can use instead of push
Whereas with push(), you'd do:
Queue::push('Some\Processing\Class', array('data' => $data));
with later(), you'd do:
$delay = 14400; // 4 hours in seconds
Queue::later($delay, 'Some\Processing\Class', array('data' => $data));
SaaS Options
Google App Engine can schedule tasks
iron.io can schedule tasks (works well with Laravel!)
Other languages
Python's Celery has similar options to delay jobs (with the same caveats)
Ruby's Resque has a scheduler that also roughly fits this
Related
I'm working on an existing custom eCommerce PHP application which is currently running a very resource intensive CRON every 15 minutes.
The gist of it is: customers can set up complex filters for products they are interested in, and receive emails based on these filters. The CRON, which runs every 15 minutes, checks for all new products that have been listed since it last ran and compares them with each customer's filters. If the products match a customer's filters, they are sent an email via Amazon SES.
Up until now, this method has been working OK, but as the number of active customers rises quickly, the CRON is starting to cause a noticeable performance drop in the application every 15 minutes, lasting for a minute or two while it runs.
I have been toying with other ideas to help spread out the load on the server, such as performing the task each time a product is listed, so the server doesn't need to catch up on multiple products at a time.
What is usually the best practice when approaching something like this?
My recommended approach is to use a RabbitMQ queue to which your cron will send messages. Then set up a couple of consumers (scripts that wait at the other end of the queue) that take the messages one by one, compose the email, and send it to the customer.
This way, you can scale the number of consumers to match the volume of emails needed to be sent.
If queues are not something you're familiar with, take a look at the RabbitMQ tutorials: https://www.rabbitmq.com/tutorials/tutorial-one-php.html
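A minimal sketch of that split, keeping the cheap filter-matching in the cron and leaving the slow email work to the consumers. The filter format, queue name ("alert_emails"), and the publishing code (which assumes php-amqplib) are all illustrative assumptions:

```php
<?php
// Toy filter: every key in the filter must be present and equal on the product.
function matchesFilter(array $product, array $filter): bool
{
    foreach ($filter as $key => $value) {
        if (!isset($product[$key]) || $product[$key] !== $value) {
            return false;
        }
    }
    return true;
}

// The cron builds one small message per customer with matching products;
// consumers decode these and send the actual emails via SES.
function buildAlertMessages(array $newProducts, array $customerFilters): array
{
    $messages = [];
    foreach ($customerFilters as $customerId => $filter) {
        $hits = array_values(array_filter(
            $newProducts,
            fn (array $p) => matchesFilter($p, $filter)
        ));
        if ($hits) {
            $messages[] = json_encode([
                'customer' => $customerId,
                'products' => array_column($hits, 'id'),
            ]);
        }
    }
    return $messages;
}

// Publishing side (assumes php-amqplib and a running RabbitMQ):
// $channel->queue_declare('alert_emails', false, true, false, false);
// foreach (buildAlertMessages($products, $filters) as $body) {
//     $channel->basic_publish(new AMQPMessage($body), '', 'alert_emails');
// }
```

Scaling out then means starting more consumer processes on the 'alert_emails' queue; the cron's runtime stays roughly constant because it no longer talks to SES at all.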
Message queues fit this perfectly well, and you can easily make use of them with the enqueue library. A few words on why you should choose it:
It supports a lot of transports from the simplest one (filesystem) to enterprise ones (RabbitMQ or Amazon SQS).
It comes with a very powerful bundle.
It has a top level abstraction which could be used with the greatest of ease.
There are a lot more features which might come in handy.
Instead of defining cron tasks, you need supervisord (or any other process manager). It must be configured to run a consume command. More on this in the docs.
Whenever a message is published, it is delivered to a consumer (by the broker) and processed.
I suggest using RabbitMQ broker.
I want to set up a system where a privileged user can create a new task to run from date/time X to date/time Y, saved in MySQL or SQLite. The task will send a request to a remote server via SSH, and when the end date/time is reached, another SSH request will be sent.
What I'm not sure about is how to actually trigger the event at the start time, and how to trigger the other at the end time.
Should I be polling the server somehow every minute (sounds like a performance hit), or set up jobs in Iron.io/Amazon SQS, or something else?
I noticed Amazon SQS only allows messages to stay queued for up to 14 days; how would that work for events weeks or months in the future?
I'm not looking for code, just the idea of how it should work.
Basically there are two solutions, but maybe a hybrid version suits your problem best...
Use a queue (built into Laravel) and set up delayed jobs in the queue to be fired later on. As you already mention, this might not be the best solution when a task is weeks or months away.
Use a cron job. The downside is granularity: check once a day and you could be up to 23h59m late; check every minute and you might run into performance issues (in most cases it kind of works, but it's definitely not perfect).
Combining 1 & 2 might be the best solution: at the beginning of each day, check whether there are tasks ending in the coming day. If so, schedule a job in the queue to end each task at the exact time it should end. This gives you scalability and the ability to create tasks that end a year after they were created.
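The daily check in that hybrid can be sketched as follows. The function, the 'ends_at'/'id' row fields, and the EndTask job class in the commented usage are assumptions for illustration, not an existing API:

```php
<?php
// Given task rows and the current unix time, return the tasks that end
// within the next $window seconds, each with the queue delay to use.
function tasksDueWithinWindow(array $tasks, int $now, int $window = 86400): array
{
    $due = [];
    foreach ($tasks as $task) {
        $remaining = $task['ends_at'] - $now;
        if ($remaining >= 0 && $remaining <= $window) {
            $due[] = ['id' => $task['id'], 'delay' => $remaining];
        }
    }
    return $due;
}

// Daily cron usage (Laravel; EndTask is a hypothetical job class):
// foreach (tasksDueWithinWindow($rows, time()) as $t) {
//     Queue::later($t['delay'], 'EndTask', ['task_id' => $t['id']]);
// }
```

Tasks ending further out than the window stay only in the database, which sidesteps both the SQS 14-day retention limit and the problem of cancelling a job that was queued months in advance: a cancelled task simply never makes it into the daily batch.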
I am currently evaluating Gearman to farm out some expensive data import jobs in our backend. So far this looks very promising. However, there is one piece missing that I just can't seem to find any info about: how can I get a list of scheduled jobs from Gearman?
I realize I can use the admin protocol to get the number of currently queued jobs for each function, but I need info about the actual jobs. There is also the option of using a persistent queue (e.g. MySQL) and querying the database for the jobs, but it feels pretty wrong to me to circumvent Gearman for this kind of information. Other than that, I'm out of ideas.
Probably I don't need this at all :) So here's some more background on what I want to do; I'm open to better suggestions. Both the client and the worker run in PHP. In our admin interface the admins can trigger a new import for a client; as the import takes a while, it is started as a background task. Now the simple questions I want to be able to answer: When was the last import run for this client? Is an import already queued for this client (in which case triggering a new import should have no effect)? Nice to have: At which position in the queue is this job (so I can estimate when it will run)?
Thanks!
The Admin protocol is what you'd usually use, but as you've discovered, it won't list the actual tasks in the queue. We've solved this by keeping track of the current tasks we've started in our application layer, and having a callback in our worker telling the application when the task has finished. This allows us to perform cleanup, notification etc. when the task has finished, and allows us to keep this logic in the application and not the worker itself.
Relating to progress, the best way is to just use the built-in progress mechanics in Gearman itself; in the PHP module you can call this using $job->sendStatus(percentDone, 100). A client can then retrieve this value from the server using the task handle (which is returned when you start the job). That allows you to show the current progress to users in your interface.
As long as you track the currently running tasks in your application, you can use that to answer whether similar tasks are already running, but you can also use Gearman's built-in job coalescing / de-duplication; see the $unique parameter when adding the task.
The position in the current queue will not be available through Gearman, so you'll have to do this in your application as well. I'd stay away from asking the Gearman persistence layer for this information.
You have pretty much given yourself the answer: use an RDBMS (MySQL or Postgres) as the persistence backend and query the gearman_queue table.
For instance, we developed a hybrid solution: we generate a unique id for the job and pass it as the third parameter to doBackground() (http://php.net/manual/en/gearmanclient.dobackground.php) when queuing the job.
Then we use this id to query the gearman table and check the job status, looking at the 'unique_key' field. You can also get the queue position, as the records are already ordered.
Bonus: we also catch exceptions inside the worker. If a job fails, we write the job payload (a JSON-serialized object) to a file, then pick the file up and requeue the job via cron job, incrementing an internal 'retry' counter so we retry a single job 3 times max and can inspect the job later if it still fails.
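A sketch of that hybrid: queue with an application-generated unique id, then look the job up in the persistence table. The commented calls assume the PECL gearman extension and the default libgearman MySQL queue schema (a gearman_queue table with a unique_key column); verify the table and column names against your own installation:

```php
<?php
// Generate an id the application controls, so it can find the job later.
$uniqueId = uniqid('import_', true);

// Queueing (requires gearmand and the pecl gearman extension):
// $client = new GearmanClient();
// $client->addServer('127.0.0.1');
// $client->doBackground('import', json_encode(['client_id' => 55]), $uniqueId);

// Status lookup via the persistent queue (requires the MySQL backend;
// table/column names from the default libgearman schema, verify locally):
// $stmt = $pdo->prepare('SELECT COUNT(*) FROM gearman_queue WHERE unique_key = ?');
// $stmt->execute([$uniqueId]);
// $isQueued = (bool) $stmt->fetchColumn();
```

Because the same $uniqueId is passed to doBackground(), Gearman's de-duplication also prevents a second identical import from being queued while the first is still pending.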
I'm looking for a queuing system that could support the following scenario:
A client adds a job - to check how many Facebook likes a particular url (URL1) has;
A client adds another job - to check the same information for URL2;
[....]
A worker picks up anything from 1 to 50 jobs (URLs) from the queue (e.g., if there are only 5, it picks up 5; if there are 60, it picks up 50, leaving the rest for another worker), and issues a request against the Facebook API (which allows multiple URLs per request). If it succeeds, all jobs are taken out of the queue; if it fails, all of them stay.
I'm working with PHP and I've looked into Gearman, Beanstalkd, but did not find any similar functionality. Is there any (free) queuing system that would support such a "batch-dequeuing"?
Or, maybe, anybody could suggest an alternative way of dealing with such an issue? I've considered keeping a list of "to check" URLs outside the queuing system and then adding them in bundles of at most N items with a cron job that runs every X period. But that's kind of building your own queue, which defeats the whole purpose, doesn't it?
I've used Beanstalkd to fetch 100 Twitter names at a time and then call an API with them all. When I was done, I deleted them - but I could have elected not to delete some (or all) if I wished.
It was a simple loop to reserve the initial 100 (one at a time), and I put the results (the job ID and the data returned) into an array. When I was done dealing with the payload (in this instance, a Twitter screen name), I went through deleting them - but I could easily have released them back into the queue.
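That reserve loop can be sketched generically. drainBatch() works with any callable that returns a job or null; with Pheanstalk that would be a non-blocking reserveWithTimeout(0). The callFacebookApi() helper in the commented usage is hypothetical:

```php
<?php
// Keep reserving until the queue is empty or the batch is full. This is the
// "pick up 5 if there are 5, 50 if there are 60" behaviour from the question.
function drainBatch(callable $reserve, int $max = 50): array
{
    $jobs = [];
    while (count($jobs) < $max && ($job = $reserve()) !== null) {
        $jobs[] = $job;
    }
    return $jobs;
}

// Usage with Pheanstalk (assumes a running beanstalkd daemon):
// $jobs = drainBatch(fn () => $pheanstalk->reserveWithTimeout(0) ?: null);
// try {
//     callFacebookApi(array_map(fn ($j) => $j->getData(), $jobs)); // hypothetical
//     foreach ($jobs as $j) { $pheanstalk->delete($j); }           // all succeed together
// } catch (Throwable $e) {
//     foreach ($jobs as $j) { $pheanstalk->release($j); }          // all stay for retry
// }
```

Because each job is individually reserved, the all-or-nothing semantics from the question fall out naturally: delete everything on success, release everything on failure.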
Perhaps you could take inspiration from MediaWiki's job queue system. Not very complicated, but it does have some issues that you may run into if you decide to roll your own.
The DB tables used for this are defined here.
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' is stored in a MySQL DB with a timestamp for when the job should run (which may be months in the future). Jobs can be cancelled at any time by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a cron script running every minute or so that gets all the jobs from the DB that need delivering in the next X minutes. It then splits them into batches of Z jobs and, for each batch, performs an asynchronous POST request back to the same server with all the data for those Z jobs (to achieve 'fake' multithreading). The server then processes each batch of Z jobs that comes in via HTTP.
The reason we use an async HTTP POST for multithreading, and not something like pcntl_fork, is so that we can add other servers, POST the data to those instead, and have them run the jobs rather than the current server.
So my question is - is there a better way to do this?
I appreciate that work queues like beanstalkd are available, but do they fit a model where jobs must run at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with a message that will become available to be reserved, and run, in X seconds (the time you want it to run minus the time now).
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to be reserved. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP). That could barely do 20 QPS at the very most, but Beanstalkd accepted over a thousand per second with barely any effort.
Edited to add: You can't delete a job without knowing its ID, though you could store that externally. OTOH, do users have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance; you could still have a single DB-reader that runs every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they bring.
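Since deleting a delayed beanstalkd job needs its ID, an alternative is for the worker to re-check the database row when the job finally comes up and skip it if the user cancelled in the meantime. The row shape ('cancelled_at') and the fetchJobRow()/deliverImapMessage() helpers in the commented loop are assumptions for illustration:

```php
<?php
// A job should run only if the user has not cancelled it since it was queued.
function shouldRun(array $jobRow): bool
{
    return empty($jobRow['cancelled_at']);
}

// Worker loop (assumes Pheanstalk, a $pdo handle, and hypothetical helpers):
// while ($job = $pheanstalk->reserve()) {
//     $id  = json_decode($job->getData(), true)['id'];
//     $row = fetchJobRow($pdo, $id);          // hypothetical DB lookup
//     if (shouldRun($row)) {
//         deliverImapMessage($row);           // hypothetical slow IMAP work
//     }
//     $pheanstalk->delete($job);              // done either way
// }
```

This keeps the DB as the single source of truth for the UI (reschedules and cancellations are just row updates), while the queue only carries "wake up and look at row N" messages.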
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.