I'm trying to wrap my head around the message queue model and jobs that I want to implement in a PHP app:
My goal is to offload messages/data that need to be sent to multiple third-party APIs, so that accessing them doesn't slow down the client. Sending the data to a message queue is ideal for this.
I considered using just Gearman to hold the MQ/jobs, but I wanted to use a cloud queue service like SQS or Rackspace Cloud Queues so I wouldn't have to manage the messages myself.
Here's a diagram of what I think I should do:
Questions:
My workers would be written in PHP. Do they all have to poll the cloud queue service? That could get expensive, especially when you have a lot of workers.
I was thinking of having just one worker poll the queue and, if there are messages, notify the other workers that they have jobs; I'd just have to keep this one worker online, using supervisord perhaps. Is this polling method better than using an MQ that can notify? How should I poll the MQ: once every second, or as fast as it can poll? And should I then add polling workers if I see it slowing down?
I was also thinking of having a single queue for all the messages, with a worker monitoring it that distributes the messages to other cloud MQs depending on where they need to be processed, since one message might need to be processed by two different workers.
Would I still need Gearman to manage my workers, or can I just use supervisord to spin workers up and down?
Isn't it more effective and faster to send a notification to the main worker whenever a message is sent, rather than polling the MQ? I assume I would then need Gearman to notify my main worker that the MQ has a message so it can start checking it. But if I have 300 messages per second, wouldn't this generate 300 jobs just to check the MQ?
Basically, how can I check the MQ as efficiently and as effectively as possible?
Suggestions or corrections to my architecture?
My suggestions basically boil down to: Keep it simple!
With that in mind, my first suggestion is to drop the DispatcherWorker. From my current understanding, the sole purpose of that worker is to listen to the MAIN queue and forward messages to the different task queues. Your application should take care of enqueuing the right message onto the right queue (or topic).
Answering your questions:
My workers would be written in PHP. Do they all have to poll the cloud queue service? That could get expensive, especially when you have a lot of workers.
Yes, there is no free lunch. Of course you can adapt and optimize your workers' poll rate to application usage (increase the poll rate when more messages arrive), to the time of day or week (if your users are active at specific times), and so on. Keep in mind that the engineering cost may soon be higher than that of unoptimized polling.
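One cheap optimization if you stay with SQS is long polling: with WaitTimeSeconds set, an idle worker makes one request every 20 seconds instead of hammering the API. A minimal sketch with the AWS SDK for PHP; the queue URL, region, and handleMessage() are placeholder assumptions, not part of your setup:

    <?php
    // Long-polling SQS worker (sketch). Queue URL, region and
    // handleMessage() are placeholders.
    require 'vendor/autoload.php';

    use Aws\Sqs\SqsClient;

    $sqs = new SqsClient(['region' => 'us-east-1', 'version' => '2012-11-05']);
    $queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/main-queue';

    while (true) {
        // Blocks for up to 20s waiting for messages, so an idle
        // worker costs roughly one API request per 20 seconds.
        $result = $sqs->receiveMessage([
            'QueueUrl'            => $queueUrl,
            'MaxNumberOfMessages' => 10,
            'WaitTimeSeconds'     => 20,
        ]);

        foreach ($result->get('Messages') ?? [] as $message) {
            handleMessage(json_decode($message['Body'], true)); // your job logic
            $sqs->deleteMessage([
                'QueueUrl'      => $queueUrl,
                'ReceiptHandle' => $message['ReceiptHandle'],
            ]);
        }
    }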
Instead, you might consider push queues (see below).
I was thinking of having just one worker poll the queue and, if there are messages, notify the other workers that they have jobs; I'd just have to keep this one worker online, using supervisord perhaps. Is this polling method better than using an MQ that can notify? How should I poll the MQ: once every second, or as fast as it can poll? And should I then add polling workers if I see it slowing down?
This sounds too complicated. Communication is unreliable; there are, however, reliable message queues. If you don't want to lose data, stick to the message queues and don't invent custom protocols.
I was also thinking of having a single queue for all the messages, with a worker monitoring it that distributes the messages to other cloud MQs depending on where they need to be processed, since one message might need to be processed by two different workers.
As already mentioned, the application should enqueue your message onto multiple queues as needed. This keeps things simple and in one place.
Would I still need Gearman to manage my workers, or can I just use supervisord to spin workers up and down?
There are many message queues and even more ways to use them. In general, if you are using poll queues you'll need to keep your workers alive yourself. If, however, you are using push queues, the queue service will call an endpoint you specify, so you just need to make sure your workers are reachable.
Basically, how can I check the MQ as efficiently and as effectively as possible?
This depends on your business requirements and the job your workers do. What time spans are critical: seconds, minutes, hours, days? For example, if you use workers to send emails, it shouldn't take hours, ideally a couple of seconds. Is there a difference (for the user) between polling every 3 seconds and polling every 15 seconds?
Solving your problem (with push queues):
My goal is to offload messages/data that need to be sent to multiple third-party APIs, so that accessing them doesn't slow down the client. Sending the data to a message queue is ideal for this. I considered using just Gearman to hold the MQ/jobs, but I wanted to use a cloud queue service like SQS or Rackspace Cloud Queues so I wouldn't have to manage the messages myself.
Indeed the scenario you describe is a good fit for message queues.
As you mentioned, you don't want to manage the message queue itself; maybe you don't want to manage the workers either? This is where push queues come in.
Push queues basically call your worker. For example, Amazon Elastic Beanstalk Worker Environments do the heavy lifting (polling) in the background and simply call your application with an HTTP request containing the queue message (refer to the docs for details). I have personally used the AWS push queues and have been happy with how easy they are. Note that there are other push queue providers, such as Iron.io.
Since you are using PHP, there is the QPush Bundle for Symfony, which handles incoming message requests. You may have a look at its code to roll your own solution.
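To give an idea of the receiving side, here is a bare-bones sketch of an endpoint a push queue could POST messages to; the payload shape and sendToThirdPartyApi() are assumptions, and in production the QPush Bundle (or your framework) should verify the request signature before trusting it:

    <?php
    // worker.php (sketch): a hypothetical endpoint called by a push queue.
    $message = json_decode(file_get_contents('php://input'), true);

    if ($message === null) {
        http_response_code(400); // unparsable payload
        exit;
    }

    // Do the actual work (call the third-party API, etc.).
    sendToThirdPartyApi($message); // placeholder for your job logic

    // A 2xx response acknowledges the message; most push queues treat
    // anything else as a failure and will retry/redeliver.
    http_response_code(200);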
I would recommend a different route, and that would be to use sockets. ZMQ is an example of a socket-based library that is already written. With sockets you can create a queue and decide what to do with messages as they come in. The machine sits in standby and uses minimal resources while waiting for a message to arrive.
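For illustration, a minimal PULL worker using the php-zmq extension; the port and payload format are arbitrary choices for this sketch:

    <?php
    // ZMQ PULL worker (sketch, requires the php-zmq extension).
    $context  = new ZMQContext();
    $receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
    $receiver->bind('tcp://*:5555');

    while (true) {
        // recv() blocks, so the worker uses next to no CPU while idle.
        $payload = json_decode($receiver->recv(), true);
        processMessage($payload); // placeholder for your job logic
    }

A producer is the mirror image: a PUSH socket that connects to tcp://localhost:5555 and calls send() with the JSON-encoded message.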
Related
I have implemented RabbitMQ in my current PHP application to handle asynchronous jobs that are handled by workers. My current problem is how I should monitor and scale the workers up or down. I also want to add error handling in case all the workers die. I have thought of the following two approaches but don't know which one is better:
At the producer end, I would check the RabbitMQ queue size. If the queue size (the list of pending tasks) exceeds a threshold, I would create one new worker each time the producer script executes, but first I would check the server load (using the Linux command uptime); a new worker would only be created if the server load is below a threshold. At the consumer end (in worker.php), I would apply the same method to scale the workers up, and I would also have the script die automatically if it has been idle for a given time (i.e., there is no pending task in the RabbitMQ queue), to automate scaling the workers down.
The second method is to use a background process or cron to monitor and scale the workers up/down. But I don't want to rely on cron (I have had very bad experiences with it) or on a background process, because if the background process crashes for some reason, there is no way to recover from it.
Please help.
I wouldn't recommend bothering to scale down to nothing when there's no work to be done. The worker that's left (if you want to scale back to 1) will simply wait for something else to consume, and that's not an expensive operation.
In terms of determining whether to scale up, I'd recommend leveraging the RabbitMQ Management HTTP API (http://hg.rabbitmq.com/rabbitmq-management/raw-file/3646dee55e02/priv/www-api/help.html). You can query the queue-related endpoints via GET to get information about queues, including how many entries are currently waiting to be processed.
With that info, you can decide to scale if it either hits a certain threshold, or keeps increasing with every check for a certain amount of time, or something similar. This can be done from the consumer side.
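As a sketch, reading the backlog from PHP could look like the following; the host, credentials, vhost (%2F is the default "/"), queue name, threshold, and spawnWorker() are all placeholders:

    <?php
    // Check a queue's backlog via the RabbitMQ Management API (sketch).
    $ch = curl_init('http://localhost:15672/api/queues/%2F/tasks');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERPWD, 'guest:guest');
    $queue = json_decode(curl_exec($ch), true);
    curl_close($ch);

    // "messages" = ready + unacknowledged messages in the queue.
    if ($queue['messages'] > 1000) {
        spawnWorker(); // placeholder: however you start a new consumer
    }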
In terms of error handling, I would recommend encapsulating the RabbitMQ connection aspect of your workers so that if a RabbitMQ exception occurs, the connection is re-established from scratch and the worker carries on.
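With php-amqplib, that encapsulation can be as simple as an outer loop that rebuilds the connection whenever an AMQP exception bubbles up. A sketch, assuming a recent php-amqplib and a placeholder queue name and handleJob():

    <?php
    // Self-healing consumer loop (sketch, php-amqplib).
    require 'vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;
    use PhpAmqpLib\Exception\AMQPExceptionInterface;

    while (true) {
        try {
            $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
            $channel = $connection->channel();
            $channel->basic_consume('tasks', '', false, false, false, false,
                function ($msg) {
                    handleJob($msg->body); // placeholder for your job logic
                    $msg->ack();           // ack only after success
                });
            while (count($channel->callbacks)) {
                $channel->wait();
            }
        } catch (AMQPExceptionInterface $e) {
            // Broker went away: pause, then rebuild the connection from scratch.
            error_log('AMQP error: ' . $e->getMessage());
            sleep(5);
        }
    }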
If it's a more serious exception that isn't RabbitMQ-related, you may need to catch it at a level where the worker spawns a replacement before it dies. And there are, of course, other kinds of exceptions (out-of-memory conditions, for example) where it really isn't feasible to continue and your program should just die.
It is very difficult to answer your question with any degree of accuracy, since many aspects of the context are not included.
How long do the tasks take to execute?
Why do you want to scale up/down? Why don't you have threads waiting for load in the first place?
That being said, coming from the world of Erlang and functional programming (Erlang being the language that powers RabbitMQ), I would like to suggest the concept of a SUPERVISOR thread. This thread would have the following responsibilities:
Spawn threads depending on the load/quantity of requests
Discard threads depending on the load/quantity of requests
Monitor the child threads and re-launch them as required, reprocessing the same messages if necessary or discarding them
The supervisor thread should be as simple as possible and built in such a way that it just loops, sleeps, and checks whether all the threads that need to be alive actually are; it can then check the load and spawn up or kill off workers as needed. In other words, spawn more or fewer depending on your needs.
You could easily use an exchange to send messages to both the supervisor queue and the worker queues; the supervisor could then keep a record/count of the messages in the queue without having to write polling code against the server, since it would simply listen to its own queue. You can increment/decrement the counter from the supervisor thread and manage everything from there.
Hope this helps.
See: http://docs.dotcloud.com/guides/daemons/
Regretfully I don't program in PHP and therefore cannot give you PHP-specific assistance; this is, however, the programming pattern that I recommend you use. If PHP doesn't allow multi-threaded programming and/or threads, then I would highly recommend using a language that does, since you will not be able to scale and use the full power of the local machine unless you use multiple threads. As for the supervisor crashing: if you keep minimal work in the supervisor and delegate all responsibilities to child threads, the risk of a supervisor crash is minimal.
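That said, PHP can approximate the supervisor pattern with processes instead of threads via the pcntl extension (CLI only). A toy sketch, where runWorker() and the worker count are placeholders:

    <?php
    // Toy process supervisor (sketch, requires ext-pcntl, CLI only).
    $workers = [];
    $desired = 4; // placeholder: derive this from load/queue depth

    while (true) {
        // Spawn children until the desired count is reached.
        while (count($workers) < $desired) {
            $pid = pcntl_fork();
            if ($pid === 0) {    // child process
                runWorker();     // placeholder: your consume loop
                exit(0);
            }
            $workers[$pid] = true;
        }

        // Reap any child that died; the next iteration respawns it.
        $pid = pcntl_waitpid(-1, $status, WNOHANG);
        if ($pid > 0) {
            unset($workers[$pid]);
        }
        sleep(1);
    }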
Perhaps this will help:
Philosophy:
http://soapatterns.org/design_patterns/service_agent
PHP-specific:
http://www.quora.com/PHP-programming-language-1/Is-there-an-actor-framework-for-php
We're streaming a high volume of data server-side into BigQuery using the google-api-php-client library. The streaming works fine apart from the performance.
Our load testing is giving us an average time of 1000ms (1 sec) to stream one row into BigQuery. We can't have the client waiting for more than 200ms. We've tested with smaller payloads and the time remains the same. Async calls on the client side are not an option for us.
The 'bottleneck' line of code is:
$service->tabledata->insertAll(PROJECT_NUMBER, DATA_SET, TABLE, $request);
Having looked under the hood of the library the call to insert the row is simply a cURL request (Curl.php in the library).
Is there any way to modify insertAll() to make it faster? We don't care about the result, so fire-and-forget would work for us. We've tried setting CURLOPT_CONNECTTIMEOUT_MS and CURLOPT_TIMEOUT_MS in the underlying cURL request, but it does not work.
Reading all your comments and side notes: the approach you've chosen does not scale, and it won't scale. You need to rethink the approach with async processes.
Processing IO-bound or CPU-bound tasks in the background is now common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer that can put jobs on a tube, say a JSON representation of the row. This was a killer feature for our use case: we have an API that receives the rows and places them on a tube; this takes just a few milliseconds, so you can achieve a fast response time.
On the other side, you now have a bunch of jobs on some tubes, and you need an agent: an agent/consumer can reserve a job.
Beanstalkd also helps with job management and retries: when a job is successfully processed, the consumer can delete it from the tube. In case of failure, the consumer can bury the job; the job will not be pushed back onto the tube, but will be kept available for further inspection.
A consumer can also release a job: Beanstalkd then pushes it back onto the tube and makes it available to another client.
Beanstalkd clients can be found for most common languages, and a web interface can be useful for debugging.
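With the PHP client Pheanstalk, the whole put/reserve/delete/bury cycle is only a few lines. A sketch (the tube name, payload, and insertIntoBigQuery() are made up, and the constructor syntax varies between Pheanstalk major versions):

    <?php
    // Beanstalkd cycle with Pheanstalk (sketch).
    require 'vendor/autoload.php';

    use Pheanstalk\Pheanstalk;

    $pheanstalk = Pheanstalk::create('127.0.0.1');
    $row = ['name' => 'example', 'value' => 42]; // stand-in for a real row

    // Producer: a single network round trip, hence millisecond responses.
    $pheanstalk->useTube('bigquery-rows')->put(json_encode($row));

    // Consumer: reserve() blocks until a job is available.
    $job = $pheanstalk->watch('bigquery-rows')->reserve();
    try {
        insertIntoBigQuery(json_decode($job->getData(), true)); // placeholder
        $pheanstalk->delete($job); // success: remove the job from the tube
    } catch (\Exception $e) {
        $pheanstalk->bury($job);   // failure: park the job for inspection
    }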
The scenario is: I am building a message queue model using RabbitMQ and php-amqplib. The model will have 15 programs; each program will consume a message from one queue and publish a message to another queue, and all these queues are different (i.e., around 30 queues). But I want to use only 2 connections across all these programs, one for publishing and one for consuming; I don't want to create a broker connection in each program. I can't work out how to do this. Any help? Thanks in advance.
If you want to use 2 connections, then the 15 producers and consumers should be part of a single process and run as threads, plus two additional threads, one for consuming and one for publishing.
The consumer thread consumes messages and pushes them to the pool of remaining worker threads.
Once the worker threads have completed their work, the response is pushed to internal storage inside the publisher, which in turn pops the response onto the Rabbit queues.
A few points to keep in mind:
Throughput: the number of consumers and producers should be decided based on the throughput you want your application to achieve.
Scalability: with a fixed number of consumers and producers, you will only be able to scale your application up to a certain limit.
Flow control: the number of consumers can be crucial in avoiding connection-based flow control.
Internal message caching by the consumer thread (QoS): set a well-defined QoS (prefetch) value matching the desired throughput.
Also check whether multi-threading is supported by the AMQP library you intend to use; if so, you can share the connection across threads.
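Note that with AMQP you often don't need one connection per consumer anyway: a single connection can carry many channels, one per logical consumer or producer, as long as they live in the same process. A php-amqplib sketch (queue names, prefetch count, and process() are placeholders):

    <?php
    // One connection, many channels (sketch, php-amqplib).
    require 'vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channels = [];

    // Each logical consumer gets its own channel on the shared connection.
    foreach (['queue_1', 'queue_2', 'queue_3'] as $queueName) {
        $channel = $connection->channel();
        $channel->basic_qos(null, 10, null); // QoS: prefetch up to 10 messages
        $channel->basic_consume($queueName, '', false, false, false, false,
            function ($msg) use ($queueName) {
                process($queueName, $msg->body); // placeholder handler
                $msg->ack();
            });
        $channels[] = $channel;
    }

    // Drain frames for all channels from the single shared socket.
    while (true) {
        foreach ($channels as $channel) {
            $channel->wait(null, true); // non-blocking: returns if idle
        }
        usleep(100000);
    }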
I'm writing a web app in PHP + Laravel + MySQL.
In the system, a user can schedule emails (and other API calls) at arbitrary times (much like scheduling posts in WordPress). I can use cron to inspect the database every 5 minutes or so to find emails that should be sent, send them, and update their status.
However, this is a SaaS app, so the number of emails to be sent at any particular time can grow rapidly. I can create a "lock file" every time the cron script runs so that only one instance runs at a time, with the lock file deleted when the script finishes.
But with potentially large volumes, I'd want a way to process multiple messages simultaneously, potentially using multiple "workers". Is there an existing solution for managing such a queue?
Yes! Task/message/job queues are what you are looking for. They let you put tasks in queues from which workers retrieve and process them, and this scales horizontally: each worker pulls a new task as soon as it finishes the previous one.
You should have the cron run every minute or two and do nothing but enqueue the task and what needs to be done. This keeps the cron very quick.
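In Laravel terms, the cron job's only duty is to turn due rows into queued jobs; the queue workers do the actual sending. A sketch, where the ScheduledEmail model, SendEmailJob class, and column names are assumptions about your schema, not Laravel built-ins:

    <?php
    // Runs every minute via the scheduler/cron: enqueue due emails only.
    use App\Models\ScheduledEmail;
    use App\Jobs\SendEmailJob;

    ScheduledEmail::where('status', 'pending')
        ->where('send_at', '<=', now())
        ->each(function (ScheduledEmail $email) {
            // Mark it first so the next cron run can't pick it up again.
            $email->update(['status' => 'queued']);
            SendEmailJob::dispatch($email); // processed by queue workers
        });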
Take a look at Iron.io. Here is an extract from its website which gives a nice overview of these kinds of systems:
An easy-to-use scalable task queue that gives cloud developers a simple way to offload front-end tasks, run scheduled jobs, and process tasks in the background and at scale.
Gearman is also a great solution that you can run yourself, and it is very simple. You can send the message in many different languages and use a different language to process it, say PHP -> C, etc.
The Wikipedia link will tell you everything you need to know, here is a quick excerpt:
Message queues provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time. Messages placed onto the queue are stored until the recipient retrieves them.
I have a script which picks up messages from a queue; it does the pre-processing required for the other processes to work.
Now, these messages have to be delivered, so I need to ack them, and if one of the services listening for messages goes down, it should receive the messages it missed when it comes back online.
A couple of questions:
1/ Does it make sense to have a queue for each post-processing service, which is added to every time the pre-processing runs? (I might add to 8 different queues at the same time after each run; this will be a ton of messages, hundreds of thousands per day.)
2/ How quick is it to add messages to a queue? Is adding to 8-10 queues going to slow down my software?
3/ Can I use a topic exchange to do this with fanout? My only concern is that if one of my services goes down, it will miss the messages.
4/ Any tips from people with experience?
A few thoughts:
If your post-processors each do a different 'job', then it makes sense to have queues for them to consume from. If you just have a bunch of post-processors all doing the same task, then you only need one queue from which they can all consume messages.
Adding messages to queues is FAST, adding queues to RabbitMQ is fast, and binding queues to exchanges is fast. What will slow your system down is the size of the messages, the number you are likely to receive, and how much processing actually needs to be done.
The other consideration is persistence of messages: should your messages survive a restart of RabbitMQ, that is, how critical are they? If it is critical that they not be lost (which, by the sound of your question, it is), then you will need to make sure they are persisted. The RabbitMQ documentation shows that there is a significant cost in doing this.
This depends on what your system is actually doing. Topics are good, fanouts are good; which one applies depends on what your system does. A sketch of a durable topic-exchange setup follows below.
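On point 3, the usual pattern is a topic (or fanout) exchange with one durable queue per service: the exchange copies each message into every bound queue, and because the queues are durable and independent, a service that goes down simply finds its backlog waiting when it reconnects. A php-amqplib sketch (exchange, queue, and routing-key names are placeholders):

    <?php
    // Durable topic exchange + per-service durable queues (sketch).
    require 'vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;
    use PhpAmqpLib\Message\AMQPMessage;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channel = $connection->channel();

    // Durable exchange and queues survive a broker restart.
    $channel->exchange_declare('preprocessed', 'topic', false, true, false);
    foreach (['service_a', 'service_b'] as $service) {
        $channel->queue_declare($service, false, true, false, false);
        $channel->queue_bind($service, 'preprocessed', 'messages.#');
    }

    // Persistent delivery is the disk-write cost mentioned above.
    $payload = ['id' => 1]; // stand-in for a pre-processed message
    $msg = new AMQPMessage(json_encode($payload),
        ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]);
    $channel->basic_publish($msg, 'preprocessed', 'messages.created');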
I would highly recommend reading RabbitMQ in Action it is an excellent resource and well worth the money.