I'm working on an existing custom eCommerce PHP application which currently runs a very resource-intensive cron job every 15 minutes.
The gist of it is: customers can set up complex filters for products they are interested in, and receive emails based on these filters. The cron job, which runs every 15 minutes, checks for all new products that have been listed since it last ran and compares them with each customer's filters. If the products match a customer's filters, the customer is sent an email via Amazon SES.
Up until now, this method has been working OK, but as the number of active customers rises quickly, the cron job is starting to cause a noticeable performance drop on the application every 15 minutes, which lasts for a minute or two while it runs.
I have been toying with other ideas to help spread out the load on the server, such as performing the task each time a product is listed, so the server doesn't need to catch up on multiple products at a time.
What is usually the best practice when approaching something like this?
My recommended approach is to use a RabbitMQ queue to which your cron will send messages. Then set up a couple of consumers (scripts that wait at the other end of the queue) that take the messages one by one, compose the email, and send it to the customer.
This way, you can scale the number of consumers to match the volume of emails needed to be sent.
If queues are not something you're familiar with, take a look at the RabbitMQ tutorials: https://www.rabbitmq.com/tutorials/tutorial-one-php.html
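For illustration, here is a minimal sketch of that layout using the php-amqplib client from those tutorials. The queue name, the message payload, and the sendViaSes() helper are made up for the example, and the exact ack/consume calls vary slightly between php-amqplib versions.

producer.php (run by the cron, or whenever a product matches a filter):

<?php
require_once __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('filter_emails', false, true, false, false); // durable queue

// One message per (customer, matching products) pair - hypothetical payload.
$payload = json_encode(['customer_id' => 42, 'product_ids' => [101, 102]]);
$msg = new AMQPMessage($payload, ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]);
$channel->basic_publish($msg, '', 'filter_emails');

$channel->close();
$connection->close();

consumer.php (run two or three of these in parallel; scale the count with email volume):

<?php
require_once __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('filter_emails', false, true, false, false);
$channel->basic_qos(null, 1, null); // give each consumer one unacknowledged message at a time

$channel->basic_consume('filter_emails', '', false, false, false, false, function ($msg) {
    $data = json_decode($msg->body, true);
    sendViaSes($data['customer_id'], $data['product_ids']); // hypothetical: compose the email and send it through Amazon SES
    $msg->ack();
});

while ($channel->is_consuming()) {
    $channel->wait();
}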
Message queues fit this use case perfectly, and you can easily make use of them with the enqueue library. Just a few words on why you should choose it:
It supports a lot of transports, from the simplest one (filesystem) to enterprise-grade ones (RabbitMQ or Amazon SQS).
It comes with a very powerful bundle.
It has a top-level abstraction which can be used with great ease.
There are many more features which might come in handy.
Instead of defining cron tasks, you need supervisord (or any other process manager). It must be configured to run a consume command. More on this in the docs.
Whenever a message is published, it is delivered to a consumer (by the broker) and processed.
I suggest using RabbitMQ broker.
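As a rough sketch of how the pieces fit together (this assumes the enqueue filesystem transport and its queue-interop style context API; swap in the RabbitMQ or SQS connection factory for production):

<?php
require_once __DIR__ . '/vendor/autoload.php';

use Enqueue\Fs\FsConnectionFactory;

$context = (new FsConnectionFactory('file:///tmp/enqueue'))->createContext(); // the directory must exist
$queue = $context->createQueue('filter_emails');

// Producer side: publish one message per matching customer (hypothetical payload).
$message = $context->createMessage(json_encode(['customer_id' => 42]));
$context->createProducer()->send($queue, $message);

// Consumer side: a long-running process kept alive by supervisord.
$consumer = $context->createConsumer($queue);
while (true) {
    if ($msg = $consumer->receive(5000)) {      // wait up to 5 seconds for a message
        // ... compose and send the email here ...
        $consumer->acknowledge($msg);           // or $consumer->reject($msg, true) to requeue
    }
}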
Related
I am developing a Web Application for businesses to track the status of their repairs & part orders that is running on LAMP (Linux Apache MySQL PHP). I just need some input as to how I should go about allowing users to customize the frequency of email notifications.
Currently, I just have a cron job running every Monday at 6:00AM that runs a php script that sends an email to each user of their un-processed jobs. But I would like to give users the flexibility of not only choosing the time they are sent at, but the days of the week as well.
One idea I had was to store their email notification preferences in a MySQL database, and then write a PHP script that sends the notification email only if the current date/time fits the criteria they have set, with code to prevent it from being sent twice within the same cycle. Then I could just run the cron job every minute, or every 5, or whatever.
Or would it be better to somehow create individual cron jobs for each user programmatically via PHP?
Any input would be greatly appreciated! :)
No, your first idea is the right one.
Individual crons would consume a lot of resources. Imagine 10k users, each asking for mail at a different time... that implies 10k tasks.
The best solution is a single cron task that iterates over your users and takes the correct actions.
Iterate over your users, check the date/time they have set, and when a notification is due, send the mail and set a flag somewhere that says "it's done" (an attribute like last_cron_scandate or next_calculated_cron_scandate could be a good solution).
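A very rough sketch of that single cron's body (the users table columns and the sendReminderEmail() helper are hypothetical, and a single notify_day column is used for brevity; adjust to your schema):

<?php
// Run every minute (or every 5) from cron.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$now = new DateTime();

foreach ($pdo->query('SELECT id, email, notify_day, notify_time, last_notified_at FROM users') as $user) {
    $due = $user['notify_day'] == $now->format('N')                                      // right day of the week
        && $user['notify_time'] <= $now->format('H:i:s')                                  // scheduled time has passed
        && substr((string) $user['last_notified_at'], 0, 10) !== $now->format('Y-m-d');   // not already sent today

    if ($due) {
        sendReminderEmail($user['email']); // hypothetical helper
        $pdo->prepare('UPDATE users SET last_notified_at = NOW() WHERE id = ?')->execute([$user['id']]);
    }
}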
I am currently evaluating Gearman to farm out some expensive data import jobs in our backend. So far this looks very promising. However, there is one piece missing that I just can't seem to find any info about: how can I get a list of scheduled jobs from Gearman?
I realize I can use the admin protocol to get the number of currently queued jobs for each function, but I need info about the actual jobs. There is also the option of using a persistent queue (eg. MySQL) and query the database for the jobs, but it feels pretty wrong to me to circumvent Gearman for this kind of information. Other than that, I'm out of ideas.
Probably I don't need this at all :) So here's some more background on what I want to do, I'm all open for better suggestions. Both the client and the worker run in PHP. In our admin interface the admins can trigger a new import for a client; as the import takes a while it is started as a background task. Now the simple questions I want to be able to answer: When was the last import run for this client? Is an import already queued for this client (in that case triggering a new import should have no effect)? Nice to have: At which position in the queue is this job (so I can make an estimate on when it will run)?
Thanks!
The Admin protocol is what you'd usually use, but as you've discovered, it won't list the actual tasks in the queue. We've solved this by keeping track of the current tasks we've started in our application layer, and having a callback in our worker telling the application when the task has finished. This allows us to perform cleanup, notification etc. when the task has finished, and allows us to keep this logic in the application and not the worker itself.
Relating to progress, the best way is to just use the built-in progress mechanics in Gearman itself; in the PHP module you can report it by calling $job->sendStatus(percentDone, 100). A client can then retrieve this value from the server using the task handle (which is returned when you start the job). That allows you to show the current progress to users in your interface.
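For example (standard pecl/gearman calls; the import loop is just a placeholder):

<?php
// Worker side: report progress while processing the import.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1');
$worker->addFunction('client_import', function (GearmanJob $job) {
    $total = 100; // placeholder: number of records to import
    for ($done = 1; $done <= $total; $done++) {
        // ... import one record ...
        $job->sendStatus($done, $total); // numerator / denominator
    }
});
while ($worker->work()) {}

// Client side: poll progress with the handle returned by doBackground().
$client = new GearmanClient();
$client->addServer('127.0.0.1');
$handle = $client->doBackground('client_import', json_encode(['client_id' => 42]));
list($known, $running, $numerator, $denominator) = $client->jobStatus($handle);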
As long as you keep the currently running tasks in your application, you can use that to answer whether there are similar tasks already running, but you can also use Gearman's built-in job coalescing / de-duplication; see the $unique parameter when adding the task.
The position in the current queue will not be available through Gearman, so you'll have to do this in your application as well. I'd stay away from asking the Gearman persistence layer for this information.
You have pretty much given yourself the answer: use an RDBMS (MySQL or Postgres) as the persistence backend and query the gearman_queue table.
For instance, we developed a hybrid solution: we generate a unique id for the job and pass it as the third parameter to doBackground() (http://php.net/manual/en/gearmanclient.dobackground.php) when queuing the job.
Then we use this id to query the gearman table and verify the job status by looking at the 'unique_key' field. You can also get the queue position, as the records are already ordered.
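Roughly like this (assuming the libgearman MySQL persistence schema with a gearman_queue table and a unique_key column; the table and column names depend on how the persistent queue was configured):

<?php
// Queue the import with an id we control, so we can find it again in the persistence table.
$uniqueId = 'import-client-42'; // hypothetical naming scheme
$client = new GearmanClient();
$client->addServer('127.0.0.1');
$client->doBackground('client_import', json_encode(['client_id' => 42]), $uniqueId);

// Later: is the job still queued, and roughly where does it sit in the queue?
$pdo = new PDO('mysql:host=localhost;dbname=gearman', 'user', 'pass');
$queued = $pdo->query('SELECT unique_key FROM gearman_queue')->fetchAll(PDO::FETCH_COLUMN); // records are already ordered, per above
$position = array_search($uniqueId, $queued); // false means it is not queued (already run, or never queued)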
As a bonus, we also catch exceptions inside the worker. If a job fails, we write the job payload (a JSON-serialized object) to a file, then pick up the file and requeue the job via a cron job, incrementing an internal 'retry' counter so we retry a single job 3 times at most and can inspect it later if it still fails.
I'm looking for a queuing system that could support the following scenario:
A client adds a job - to check how many Facebook likes a particular url (URL1) has;
A client adds another job - to check the same information for URL2;
[....]
A worker picks up anything from 1 to 50 jobs (URLs) from the queue (e.g., if there are only 5, it picks up 5; if there are 60, it picks up 50, leaving the rest for another worker) and issues a request against the Facebook API (which allows multiple URLs per request). If it succeeds, all jobs are taken out of the queue; if it fails, all of them stay.
I'm working with PHP and I've looked into Gearman, Beanstalkd, but did not find any similar functionality. Is there any (free) queuing system that would support such a "batch-dequeuing"?
Or, maybe, anybody could suggest an alternative way of dealing with such an issue? I've considered keeping a list of "to check" urls outside the queuing system and then adding them in bundles of max N items with a cron job that runs every X period. But that's kind of building your own queue, which defeats the whole purpose, doesn't it?
I've used Beanstalkd to fetch 100 Twitter names at a time and then call an API with them all. When I was done, I deleted them - but I could have elected not to delete some (or all) of them if I wished.
It was a simple loop reserving the initial 100 (one at a time), putting the results (the job ID and the data returned) into an array. When I was done dealing with the payloads (in this instance, Twitter screen names), I went through and deleted them - but I could just as easily have released them back into the queue.
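With the Pheanstalk client that pattern looks roughly like this (method names vary a little between Pheanstalk versions, and callFacebookApi() is a made-up stand-in for the batched API call):

<?php
require_once __DIR__ . '/vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = Pheanstalk::create('127.0.0.1');
$pheanstalk->watch('urls');

// Grab up to 50 jobs; reserveWithTimeout(0) returns null as soon as the tube is empty.
$jobs = [];
while (count($jobs) < 50 && ($job = $pheanstalk->reserveWithTimeout(0))) {
    $jobs[] = $job;
}

$urls = array_map(fn ($job) => $job->getData(), $jobs);
$ok = callFacebookApi($urls); // hypothetical: one request for the whole batch

foreach ($jobs as $job) {
    $ok ? $pheanstalk->delete($job) : $pheanstalk->release($job); // on failure, put them back in the queue
}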
Perhaps you could take inspiration from MediaWiki's job queue system. Not very complicated, but it does have some issues that you may run into if you decide to roll your own.
The DB tables used for this are defined here.
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' are stored in a MySQL DB with a timestamp for when the job should be run (may be months into the future). Jobs can be cancelled at anytime by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a system where we have cron script running every minute or so that gets all the jobs from the DB that need delivering in the next X minutes. It then splits them up into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (in order to achieve 'fake' multithreading). The server then processes each batch of Z jobs that come in via HTTP.
The reason we use an async HTTP POST for multithreading and not something like pcntl_fork() is so that we can add other servers and have the cron POST the data to those instead, letting them run the jobs rather than the current server.
So my question is - is there a better way to do this?
I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with a message that will become reservable, and run, in X seconds (the time you want it to run minus the time now).
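A minimal sketch of the put-with-delay side, assuming the Pheanstalk client and a $job row already read from your MySQL jobs table:

<?php
use Pheanstalk\Pheanstalk;

$pheanstalk = Pheanstalk::create('127.0.0.1');
$pheanstalk->useTube('imap_jobs');

$delay = max(0, strtotime($job['run_at']) - time()); // seconds until the job should become reservable
$pheanstalk->put(json_encode(['job_id' => $job['id']]), Pheanstalk::DEFAULT_PRIORITY, $delay, 300); // 300s TTR for the slow IMAP work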
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to be reserved. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP). That could barely do 20 QPS at the very most, but Beanstalkd accepted over a thousand per second with barely any effort.
Edited to add: you can't delete a job without knowing its ID, though you could store that outside. On the other hand, do users have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance, so you would still only have one DB-reader that ran every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they can bring.
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.
I am trying to find out the difficulty of implementing a queue system. I know how to implement a basic queue, so I'll explain a little about what I'm after, with some background:
I will be implementing a queue where messages will be placed; these will come from several users, and the messages will be scheduled to be posted at user-defined times (multiple occurrences are allowed, with a precision of minutes; from a UI perspective I will be restricting occurrences to "every minute" or "every hour", but I'd like the system to still be able to handle this).
Here is where my question comes in:
Eventually I may be in a situation (and maybe not) where MANY messages need to be posted at the current time. I'd like to have several processes (multiple instances of a script) running that each fetch a number of messages [x, 10, 25] from the queue at a time and process them. The problem is: how do I do this so that each instance processes unique messages (without processing something that is already being processed by another instance)? I'm worried about concurrent connections, how to lock records, and anything else I may not be thinking about.
Technologies I will be using are PHP and MySQL. I am looking for some solutions to the above, terms I should be using in my searches, real world examples, thoughts, comments and ideas?
Thank you all!
One solution I came across was the Amazon Simple Queue Service... it promises unique message processing/locking: http://aws.amazon.com/sqs/
Well, I'd do it like this:
Make your table for messages and add two more fields - "PROCESS_ID" and "PROCESS_TIME". These will be explained later.
Give each process a unique ID. They can generate it at the startup (like a GUID), or you can assign them yourself (then you can tell them apart more easily).
When a process wants to fetch a bunch of messages, it then does something like this:
UPDATE messages SET process_id = $id, process_time = NOW() WHERE process_id IS NULL LIMIT 20;
SELECT * FROM messages WHERE process_id = $id;
This will find 20 "free" messages and "lock" them. Then it will find the messages that it locked and process them. After each message is processed, DELETE it.
The UPDATE statement should be pretty atomic, especially if you use InnoDB, which wraps each such statement in a transaction automatically. MySQL should take care of all the concurrency there.
The PROCESS_TIME field is optional, but you can use it to see when a process has hung. If a message is locked for too long, you can conclude that something went wrong and investigate.
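Wrapped in PHP, the whole cycle could look like this (the messages table layout follows the description above; how you generate the process id is up to you):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=queue', 'user', 'pass');
$processId = bin2hex(random_bytes(8)); // unique id for this worker instance

// Claim up to 20 free messages atomically, then fetch only the ones we claimed.
$pdo->prepare('UPDATE messages SET process_id = ?, process_time = NOW() WHERE process_id IS NULL LIMIT 20')
    ->execute([$processId]);

$rows = $pdo->prepare('SELECT * FROM messages WHERE process_id = ?');
$rows->execute([$processId]);

foreach ($rows as $message) {
    // ... post / send the message ...
    $pdo->prepare('DELETE FROM messages WHERE id = ?')->execute([$message['id']]);
}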
You could turn the problem around.
Instead of dealing with the problem of getting things out of the queue at the same time, publish all the info as soon as you get it, but publish it with a rule that it is not supposed to be visible until a certain time. Doing things this way could help you avoid locking/contention problems.
Have a look at the Beanstalkd message queue. There are PHP clients for it. One of the nice features of Beanstalkd (as opposed to e.g. dropr) is that you can delay messages. That is, you can post a message to the queue and it will not be delivered to a client until X seconds have passed.
Beanstalkd does have one big downside though: It's an in-memory queue. That means if it (or your machine) crashes then the queue is empty and the contents lost. Persistence is a feature planned for the next version of beanstalkd.
Couple of online solutions:
Amazon SQS.
Google App Engine queue system
I guess the Google solution is much cheaper (it could even be free if you're not using it much).
I have also been thinking about implementing a queue in PHP/MySQL and thought of using:
Using MySQL's GET_LOCK() to implement some sort of locking (see the sketch below).
Putting the queue in MySQL's in-memory (MEMORY/HEAP) storage engine, because an in-memory queue is much faster than an on-disk one. But you risk losing data when the machine crashes.
Using named pipes to communicate with the processes.
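For the GET_LOCK() idea, a rough sketch (the lock name and timeout are arbitrary):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=queue', 'user', 'pass');

// Only one worker at a time gets past this point; GET_LOCK returns 1 on success, 0 on timeout.
if ($pdo->query("SELECT GET_LOCK('queue_worker', 5)")->fetchColumn() == 1) {
    // ... fetch and process the next batch of messages ...
    $pdo->query("SELECT RELEASE_LOCK('queue_worker')");
}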