I have a script which picks up messages from a queue, this does the pre-processing required for the other processes to work.
Now, these messages have to be delivered so I need to ack these messages and if one of the services listening for messages goes down then it should receive the messages it missed when it comes back online.
A couple questions:
1/ Does it make sense to have a queue for each post-processing service which is added to everytime the pre-processing runs? (So I might add to 8 different queues at the same time following each process - this will be a ton of messages (hundreds of thousands p/day).
2/ How quick is it to add messages to a queue? Is adding to 8-10 queues going to slow down my software?
3/ Can I use a topic exchange to do this with fanout? My only concern is if one of my services goes down they will miss the message.
4/ Any tips from persons with experience?
A few thoughts:
If your post-processors are each doing a different 'job' then it makes sense to have queues for them to consume from. If you just have a bunch of post-processors all doing the same task, then you only need to have one queue from which they can all consume messages from.
Adding messages to queues is FAST, adding queues into RabbitMQ is fast, binding the queues to exchanges is fast. The thing that will slow down your system would be the size of the messages and the number that you are likely to receive, and then how much processing actually needs to be done.
The other consideration is to do with persistence of messages, should your messages survive a restart of RabbitMQ, that is, how critical are they? if it is critical that they not be lost (which by the sounds of your question it is) then you will need to make sure they are persisted. If you look at the RabbitMQ documentation you will see that there is a significant cost in doing this.
This depends on what your system is actually doing...Topcis are good, Fanouts are good, but what your system does depends on which is applicable.
I would highly recommend reading RabbitMQ in Action it is an excellent resource and well worth the money.
Related
I know Laravel's queue drivers such as redis and beanstalkd and I read that you can increase the number of workers for beanstalkd etc. However I'm just not sure if these solutions are right for my scenario. Here's what I need;
I listen to an XML feed over a socket connection, and the data just keeps coming rapidly. forever. I get tens of XML documents in a second.
I read data from this socket line by line, and once I get to the XML closing tag, I send the buffer to another process to be parsed. I used to just encode the xml in base64, and run a separate php process for each xml. shell_exec('php parse.php' . $base64XML);
This allowed me to parse this never ending xml data quite rapidly. Sort of a manual threading. Now I'd like to utilize the same functionality with Laravel, but I wonder if there is a better way to do it. I believe Artisan::call('command') doesn't push it to the background. I could of course do a shell_exec within Laravel too, but I'd like to know if I can benefit from Beanstalkd or a similar solution.
So the real question is this: How can I set the number of queue workers for beanstalkd or redis drivers? Like I want 20 threads running at the same time. More if possible.
A slightly less important question is: How many threads is too many? If I had a very high-end dedicated server that can process the load just fine, would creating 500 threads/workers with these tools cause any problems on the code level?
Well laravel queues are just made for that.
Basicaly, you have to create a Job Class. All the heavy work you want to do on your xml document need to be here.
Then, you fetch your xml out of the socket, and as soon as you have received one document, you push it on your Queue.
Later, a queue worker will pick it up from the queue, and do the heavy work.
The advantage of that is that if you queue up documents faster than you work on them, the queue will take care of that high load moment and queue up tasks for later.
I also don't recommend you to do it without a queue (with a fork like you did). In fact, if too much documents come in, you'll create too many childs threads and overload your server. Bookkeeping these threads correctly is risky and not worth it when a simple queue with a fixed number of workers solve all these problems out of the box).
After a little more research, I found how to set the number of worker processes. I had missed that part in the documentation. Silly me. I still wonder if this supervisor tool can handle hundreds of workers for situations like mine. Hopefully someone can share their experience, but if not I'll be updating this answer once I do a performance test this week.
I tell you from experience that shell_exec() is not the ideal way to run async tasks in PHP.
Seems ok while developing, but if you have a small vps (1-2 GB ram) you could overload your server and apache/nginx/sql/something could brake while you're not around and your website could be down for hours / days.
I recommend Laravel Queues + Scheduler for these kind of things.
I'm trying to wrap my head around the message queue model and jobs that I want to implement in a PHP app:
My goal is to offload messages / data that needs to be sent to multiple third party APIs, so accessing them doesnt slow down the client. So sending the data to a message queue is ideal.
I considered using just Gearman to hold the MQ/Jobs, but I wanted to use a Cloud Queue service like SQS or Rackspace Cloud Queues so i wouldnt have to manage the messages.
Here's a diagram of what I think I should do:
Questions:
My workers, would be written in PHP they all have to be polling the cloud queue service? that could get expensive especially when you have a lot of workers.
I was thinking maybe have 1 worker just for polling the queue, and if there are messages, notify the other workers that they have jobs, i just have to keep this 1 worker online using supervisord perhaps? is this polling method better than using a MQ that can notify? How should I poll the MQ, once every second or as fast as it can poll? and then increase the polling workers if I see it slowing down?
I was also thinking of having a single queue for all the messages, then the worker monitoring that distributes the messages to other cloud MQs depending on where they need to be processed, since 1 message might need to be processed by 2 diff workers.
Would I still need gearman to manage my workers or can I just use supervisord to spin workers up and down?
Isn't it more effective and faster to also send a notification to the main worker whenever a message is sent vs polling the MQ? I assume I would the need to use gearman to notify my main worker that the MQ has a message, so it can start checking it. or if I have 300 messages per second, this would generate 300 jobs to check the MQ?
Basically how could I check the MQ as efficiently and as effectively as possible?
Suggestions or corrections to my architecture?
My suggestions basically boil down to: Keep it simple!
With that in mind my first suggestion is to drop the DispatcherWorker. From my current understanding, the sole purpose of the worker is to listen to the MAIN queue and forward messages to the different task queues. Your application should take care of enqueuing the right message onto the right queue (or topic).
Answering your questions:
My workers, would be written in PHP they all have to be polling the cloud queue service? that could get expensive especially when you have a lot of workers.
Yes, there is no free lunch. Of course you could adapt and optimize your worker poll rate by application usage (when more messages arrive increase poll rate) by day/week time (if your users are active at specific times), and so on. Keep in mind that engineering costs might soon be higher than unoptimized polling.
Instead, you might consider push queues (see below).
I was thinking maybe have 1 worker just for polling the queue, and if there are messages, notify the other workers that they have jobs, i just have to keep this 1 worker online using supervisord perhaps? is this polling method better than using a MQ that can notify? How should I poll the MQ, once every second or as fast as it can poll? and then increase the polling workers if I see it slowing down?
This sounds too complicated. Communication is unreliable, there are reliable message queues however. If you don't want to loose data, stick to the message queues and don't invent custom protocols.
I was also thinking of having a single queue for all the messages, then the worker monitoring that distributes the messages to other cloud MQs depending on where they need to be processed, since 1 message might need to be processed by 2 diff workers.
As already mentioned, the application should enqueue your message to multiple queues as needed. This keeps things simple and in place.
Would I still need gearman to manage my workers or can I just use supervisord to spin workers up and down?
There are so many message queues and even more ways to use them. In general, if you are using poll queues you'll need to keep your workers alive by yourself. If however you are using push queues, the queue service will call an endpoint specified by you. Thus you'll just need to make sure your workers are available.
Basically how could I check the MQ as efficiently and as effectively as possible?
This depends on your business requirements and the job your workers do. What time spans are critical? Seconds, Minutes, Hours, Days? If you use workers to send emails, it shouldn't take hours, ideally a couple of seconds. Is there a difference (for the user) between polling every 3 seconds or every 15 seconds?
Solving your problem (with push queues):
My goal is to offload messages / data that needs to be sent to multiple third party APIs, so accessing them doesnt slow down the client. So sending the data to a message queue is ideal. I considered using just Gearman to hold the MQ/Jobs, but I wanted to use a Cloud Queue service like SQS or Rackspace Cloud Queues so i wouldnt have to manage the messages.
Indeed the scenario you describe is a good fit for message queues.
As you mentioned you don't want to manage the message queue itself, maybe you do not want to manage the workers either? This is where push queues pop in.
Push queues basically call your worker. For example, Amazon ElasticBeanstalk Worker Environments do the heavy lifting (polling) in the background and simply call your application with an HTTP request containing the queue message (refer to the docs for details). I have personally used the AWS push queues and have been happy with how easy they are. Note, that there are other push queue providers like Iron.io.
As you mentioned you are using PHP, there is the QPush Bundle for Symfony, which handles incoming message requests. You may have a look at the code to roll your own solution.
I would recommend a different route, and that would be to use sockets. ZMQ is an example of a socket based library already written. With sockets you can create a Q and manage what to do with messages as they come in. The machine will be in stand-by mode and use minimal resources while waiting for a message to come in.
I have implemented rabbitMQ in my current php application to handle asynchroneous jobs that are handled by workers. But my current problem is that how should i monitor and scale up or down the workers. Also, i want to add error handling in case all the workers die. I have thought of following two ways but don't know which one is the better:
At producer end, i would analyze the rabbitMQ queue size. If queue size (list of pending tasks) is more than a threshold, i would create one new worker everytime producer script executes but before that i would check the server load (using linux command uptime). If server load is less than a threshold then only new worker would be created. At consumer end (in worker.php), i would apply same method to scale up the workers and i would also check that if script is idle for a given time (i.e. there is no pending task in rabbit mq queue) then it would automatically die (to automate scaling down of workers).
Second method is to use background process or cron to monitor and scale/up down the workers. But i don't want to rely on cron (as i have very bad experiences with it) or background process because if background process crashes for some reason then there is no way to recover from it.
Please help.
I wouldn't recommend bothering to scale them down to nothing when there's no work to be done. The worker that's left (if you want to scale back to 1) will simply wait for something else to consume and it's not an expensive operation.
In terms of determining whether to scale up, I'd recommend leveraging the RabbitMQ Management HTTP API (http://hg.rabbitmq.com/rabbitmq-management/raw-file/3646dee55e02/priv/www-api/help.html). You can use the queue related aspects via a GET operation to get information about queues, including how many entries are currently waiting to be processed.
With that info, you can decide to scale if it either hits a certain threshold, or keeps increasing with every check for a certain amount of time, or something similar. This can be done from the consumer side.
In terms of error handling, I would recommend encapsulating the RabbitMQ connection aspect of your workers such that if a RabbitMQ exception occurs the connection is re-established from scratch and continues.
If it's a more serious type of exception that isn't RabbitMQ-related, you may need to catch it at such a level where the worker basically spawns a new worker before it dies. Then of course there are other types of exceptions (out of memory conditions, for example), where it really isn't feasible to try to continue and your program should just completely die.
It is very difficult to answer your question with any degree of accuracy since there are many aspects of the context which are not included.
How long do the tasks take to execute?
Why do you want to scale up/down? Why don't you have threads waiting for load in the first place?
That being said, coming from the world of Erland and functional programming (which is the language used to power RabbitMQ) I would like to suggest the concept of a SUPERVISOR thread. This thread would have the following responsibilities:
Spawn threads depending on the load/qty of requests
Discard threads depending on the load/qty of requests
Monitor the children threads and re-launch them as required reprocessing the same messages if necessary or discarding them
The Supervisor thread should be as easy as possible and should be built in such a way that it simply loops, sleeps and checks if all the threads that need to be alive actually are - it can then check the load and spawn up or kill off the workers as needed. Or in other words, spawn more and/or not-spawn depending on your needs.
You could easily use an exchange to send messages to both the supervisor and the worker queues where the supervisor would then be able to keep a record/count of the messages in the queue without having to write polling code to the server, it would simply listen to it's own queue. You can increase/dec the counter from the supervisor thread and manage everything from there.
Hope this helps.
See: http://docs.dotcloud.com/guides/daemons/
Regretfully I don't program in PHP and therefore cannot give you PHP-specific assistance, this is however the programming pattern that I recommend that you use. If PHP doesn't allow multi-threaded programming and/or threads then I would highly recommend that you use a language that does since you will not be able to scale and use the full power of the local machine unless you use multiple threads. As for the supervisor crashing, if you keep minimal work in the supervisor and delegate all responsibilities to children threads then the risk of a supervisor crash is minimal.
Perhaps this will help:
Philosophy:
http://soapatterns.org/design_patterns/service_agent
PHP-specific:
http://www.quora.com/PHP-programming-language-1/Is-there-an-actor-framework-for-php
I am considering building a site using php, but there are several aspects of it that would perform far, far better if made in node.js. At the same time, large portions of of the site need to remain in PHP. This is because a lot of functionality is already developed in PHP, and redeveloping, testing, and so forth would be too large of an undertaking, and quite frankly, those parts of the site run perfectly fine in PHP.
I am considering rebuilding the sections in node.js that would benefit from running most in node.js, then having PHP pass the request to node.js using Gearman. This way, I scan scale out by launching more workers and have gearman handle the load distribution.
Our site gets a lot of traffic, and I am concerned if gearman can handle this load. I wan't to keep this question productive, so let's focus largely on the following addressable points:
Can gearman handle all of our expected load assuming we have the memory (potentially around 3000+ queued jobs at at time, with several thousand being processed per second)?
Would this run better if I just passed the requests to node.js using CURL, and if so, does node.js provide any way to distribute the load over multiple instances of a given script?
Can gearman be configured in a way that there is no single point of failure?
What are some issues that you guys can see arising both in terms of development and scaling?
I am addressing these wide range of points so anyone viewing this post can collect a wide range of information in one place regarding matters that strongly affect each other.
Of course I will test all of this, but I want to collect as much information as possible before potentially undertaking something like this.
Edit: A large reason I am using gearman is not because of it's non-blocking structure, but because of it's sheer speed.
I can only speak to your questions on Gearman:
Can gearman handle all of our expected load assuming we have the memory (potentially around 3000+ queued jobs at at time, with several thousand being processed per second)?
Short: Yes
Long: Everything has its limit. If your job payloads are inordinately large you may run into issues. Gearman stores its queue in memory.. so if your payloads exceed the amount of memory available to Gearman you'll run into problems.
Can gearman be configured in a way that there is no single point of failure?
Gearman has a plugin/extension/component available to use MySQL as a persistence store. That way, if Gearman or the machine itself goes down you can bring it right back up where it left off. Multiple worker-servers can help keep things going if other workers go down.
Node has a cluster module that can do basic load balancing against n processes. You might find it useful.
A common architecture here in nodejs-land is to have your nodes talk http and then use some way of load balancing such as an http proxy or a service registry. I'm sure it's more or less the same elsewhere. I don't know enough about gearman to say if it'll be "good enough," but if this is the general idea then I'd imagine it would be fine. At the least, other people would be interested in hearing how it went I'm sure!
Edit: Remember, number-crunching will block node's event loop! This is somewhat obvious if you think about it, but definitely something to keep in mind.
I have a web application that currently sends emails. At the time my web applications sends emails (sending of emails is based on user actions - not automatic), it has to run other processes like zipping up files.
I am trying to make my application "future proof" - so when there are a large number of users I don't want the server strained, so i thought that putting the emails that need to be sent and the files that need to be zipped in a queue. Put them in table and then use a cron job to check every second and execute them (x rows at a time).
Is the above a good idea? Or is there a better approach? I really need help to get this done properly to save myself headaches later on :)
Thanks all
It's a good approach, but the most important thing you can do right now is have a clear interface for queuing up the messages, and one for consuming the queue. Don't make the usage on either end hard-coded to a DB.
Later on, if this becomes a bottleneck, you may want the mail sending to be done from a different machine which may not even have access to the DB, so this tiny investment up front will give you options later.
One aspect you might have ignored is the zipping speed you are using, it might be in your best interest to use a lighter compression level in your zip process as that can make large improvements in zip time (easily double) which can add up to a lot when you get into the realm of multiple users.
Even better if you make the zipping intelligent and use no compression when you're zipping large already compressed files (MP3, ZIP, DOCX, XLSX, JPG, GIF, etc) and using high compression when you have simple text files (TXT, XML, DOC, XLS, etc) as they will zip very quickly even with heavy compression.
An important point might be that rather than having a cron job run once every second, have a always-running daemon that's automatically restarted on exit - or something like that.
One reason is, just as you describe it yourself, if a lot of users request emails to be sent out and the queue builds up, one cronjob won't have time to finnish before the ext one stats, and you risk having your system flooded with processes.
Is the above a good idea? yes
could there be a better solution to handle millions of users down the road? possibly.. but thats not what is important. what is important is that you have build in the layer of abstraction. If some day down the road you have crazy traffic and your cron queue isnt keeping up you can replace the functionality of that layer without changing any of the code which uses it.
Hmm. I don't really like the idea of cron running something every second. That seems like way too often. If your application really needs to be that responsive, then I think you should keep it synchronous. That is, keep the processing in your web application and look for other ways to keep the server strain level down.
If you can wait longer between checking, then it would be better to have your cron job check the queue for 1 item at a time. If there is one, process it, and then check again for the next one without exiting. If there isn't one, exit and don't try again for five minutes or so.
However, all that being said, any decent Mail Transfer Agent (sendmail, postfix, Exchange) will have queuing built-in. It will probably do a better job than you could of making sure delivery occurs when the unexpected happens. There's a lot to think about in processing queued e-mail. I generally prefer to hand-off outbound e-mail to an MTA as early in the process as I can.
--
bmb
Build in something that does distributed queuing. When you scale the volume you can scale the different layers of your tier where a bottleneck may come up.
Is there a reason to run the cron every second? Is the volume that high? I would say do your best to keep it an n-tier implementation because you will swap things in and out and refactor bits as they fight for your attention.
Try not to build anything you design for a few weeks. Often other scenarios will come to you then before things get locked in.
Good approach. Some refinements:
Don't use a cron job, instead query on a timer.
Include a state flag to keep multiple readers/writers sorted.
Reader should drain the queue - don't block until queue read is empty.
Keep it simple. Put complexity and subtlety into the writer/reader conversation.
In my experience this will scale nicely.