The scenario: I am building a message queue model using RabbitMQ and php-amqplib. The model will have 15 programs; each program will consume a message from one queue and publish a message to another queue. All of these queues are different (around 30 queues in total). However, I want to use only 2 connections across all these programs: one for publishing and one for consuming. I don't want to create a broker connection in each program, and I can't work out how to do this. Any help? Thanks in advance.
If you want to use only 2 connections, then the 15 producers and consumers should be part of a single process and run as threads, plus two additional threads: one for consuming and one for publishing.
The consumer thread consumes messages and pushes them to the pool of worker threads.
Once the worker threads have completed their work, the response is pushed to an internal store inside the publisher, which in turn publishes the responses to the RabbitMQ queues.
A few points to keep in mind:
Throughput: the number of consumers and producers should be decided on the basis of the throughput you want your application to achieve.
Scalability: with a fixed number of consumers and producers, you can only scale your application up to a limit.
Flow control: the number of consumers can be crucial in avoiding connection-based flow control.
Internal message caching by the consumer thread (QoS): set a well-defined QoS value according to the desired throughput.
Also check whether the AMQP library you want to use supports multi-threading. If it does, you could share the connection across threads.
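To make the shape of this design concrete, here is a minimal sketch of the consumer thread, worker pool and publisher thread. It is written in Python purely for illustration and uses in-memory queues instead of a real broker; the names run_pipeline, inbound, outbound and handler are assumptions of this sketch and are not part of php-amqplib or any RabbitMQ client.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

STOP = object()  # sentinel used to shut each stage down


def run_pipeline(incoming_messages, handler, workers=4):
    """Consume -> process in a worker pool -> publish, with one thread per 'connection'."""
    inbound = queue.Queue()    # stands in for the consuming connection
    outbound = queue.Queue()   # the publisher's internal storage
    published = []             # stands in for the publishing connection

    for msg in incoming_messages:
        inbound.put(msg)
    inbound.put(STOP)

    def work(msg):
        # A worker never touches a connection; it only hands its result on.
        outbound.put(handler(msg))

    def consumer():
        # The only thread that would own the consuming connection.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while True:
                msg = inbound.get()
                if msg is STOP:
                    break
                pool.submit(work, msg)
        # Leaving the 'with' block waits for all submitted work to finish.
        outbound.put(STOP)

    def publisher():
        # The only thread that would own the publishing connection.
        while True:
            result = outbound.get()
            if result is STOP:
                break
            published.append(result)

    threads = [threading.Thread(target=consumer), threading.Thread(target=publisher)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return published


results = run_pipeline(["a", "b", "c"], handler=str.upper)
```

The order of the published results is not guaranteed, because the workers run concurrently. In a real setup the two in-memory stand-ins would be replaced by the two broker connections, each used by exactly one thread.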
Related
I have a Laravel app (on Forge) that's posting messages to SQS. I then have another box on Forge which is running Supervisor with queue workers that are consuming the messages from SQS.
Right now, I just have one daemon worker processing a particular tube of data from SQS. When messages come up, they do take some time to process - anywhere from 30 to 60 seconds. The memory usage on the box is fine, but the CPU spikes almost instantly and then everything seems to get slower.
Is there any way to handle this? Should I instead dispatch many smaller jobs (which can be consumed by multiple workers) rather than one large job which can't be split amongst workers?
Also, I noted that Supervisor is only using one of my two cores. Any way to have it use both?
Memory-intensive applications are manageable as long as you can scale, but CPU spikes are harder to manage since they happen within one core, and when that happens your servers might even get sandboxed.
To answer your question, I see two possible ways to handle your problem.
Concurrent programming: keep the job as it is and see whether the larger task can be parallelized. If it can, parallelize the code so that each core handles a specific part of your large task, then gather the results in one coordinating core and assemble the final result. (Additionally, this can be done efficiently if GPU programming is considered.)
Dispatch Smaller Jobs (as given in the question): This is a good approach if you can manage multiple workers working on smaller tasks and finally there is a mechanism to coordinate everything together. This could be arranged as a Master-Slave setting. This would make everything easy (because parallelizing a problem is a bit hard), but you need to coordinate everything together.
I'm trying to wrap my head around the message queue model and jobs that I want to implement in a PHP app:
My goal is to offload messages / data that need to be sent to multiple third-party APIs, so accessing them doesn't slow down the client. So sending the data to a message queue is ideal.
I considered using just Gearman to hold the MQ/jobs, but I wanted to use a cloud queue service like SQS or Rackspace Cloud Queues so I wouldn't have to manage the messages.
Here's a diagram of what I think I should do:
Questions:
My workers would be written in PHP. Do they all have to poll the cloud queue service? That could get expensive, especially with a lot of workers.
I was thinking of having 1 worker just for polling the queue; if there are messages, it notifies the other workers that they have jobs. I would just have to keep this 1 worker online, perhaps using supervisord. Is this polling method better than using an MQ that can notify? How should I poll the MQ: once every second, or as fast as it can poll? And should I then add polling workers if I see it slowing down?
I was also thinking of having a single queue for all the messages, with the worker monitoring it distributing the messages to other cloud MQs depending on where they need to be processed, since 1 message might need to be processed by 2 different workers.
Would I still need Gearman to manage my workers, or can I just use supervisord to spin workers up and down?
Isn't it more effective and faster to also send a notification to the main worker whenever a message is sent, versus polling the MQ? I assume I would then need to use Gearman to notify my main worker that the MQ has a message, so it can start checking it. Or, if I have 300 messages per second, would this generate 300 jobs to check the MQ?
Basically how could I check the MQ as efficiently and as effectively as possible?
Suggestions or corrections to my architecture?
My suggestions basically boil down to: Keep it simple!
With that in mind my first suggestion is to drop the DispatcherWorker. From my current understanding, the sole purpose of the worker is to listen to the MAIN queue and forward messages to the different task queues. Your application should take care of enqueuing the right message onto the right queue (or topic).
Answering your questions:
My workers would be written in PHP. Do they all have to poll the cloud queue service? That could get expensive, especially with a lot of workers.
Yes, there is no free lunch. Of course you could adapt and optimize your worker poll rate by application usage (increase the poll rate when more messages arrive), by time of day or week (if your users are active at specific times), and so on. Keep in mind that the engineering costs might soon be higher than the cost of unoptimized polling.
Instead, you might consider push queues (see below).
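If you do want to adapt the poll rate, one possible policy (an assumption of this sketch, not something prescribed by any queue service) is to poll quickly while messages keep arriving and back off while the queue is empty. A minimal illustration in Python, where next_poll_interval and its default values are hypothetical:

```python
def next_poll_interval(current, got_messages, minimum=1.0, maximum=30.0, factor=2.0):
    """Return how many seconds to wait before the next poll."""
    if got_messages:
        # Traffic is flowing: go back to the fastest poll rate.
        return minimum
    # Queue was empty: wait longer, but never longer than the cap.
    return min(current * factor, maximum)


# Simulate five polls: three empty ones, one with messages, one empty.
interval = 1.0
history = []
for got in [False, False, False, True, False]:
    interval = next_poll_interval(interval, got)
    history.append(interval)
```

The cap keeps the worst-case delay bounded, which ties back to the question of which time spans are actually critical for your users.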
I was thinking of having 1 worker just for polling the queue; if there are messages, it notifies the other workers that they have jobs. I would just have to keep this 1 worker online, perhaps using supervisord. Is this polling method better than using an MQ that can notify? How should I poll the MQ: once every second, or as fast as it can poll? And should I then add polling workers if I see it slowing down?
This sounds too complicated. Communication is unreliable; reliable message queues do exist, however. If you don't want to lose data, stick to the message queues and don't invent custom protocols.
I was also thinking of having a single queue for all the messages, with the worker monitoring it distributing the messages to other cloud MQs depending on where they need to be processed, since 1 message might need to be processed by 2 different workers.
As already mentioned, the application should enqueue your message to multiple queues as needed. This keeps things simple and in place.
Would I still need Gearman to manage my workers, or can I just use supervisord to spin workers up and down?
There are so many message queues and even more ways to use them. In general, if you are using poll queues you'll need to keep your workers alive by yourself. If however you are using push queues, the queue service will call an endpoint specified by you. Thus you'll just need to make sure your workers are available.
Basically how could I check the MQ as efficiently and as effectively as possible?
This depends on your business requirements and the job your workers do. What time spans are critical? Seconds, Minutes, Hours, Days? If you use workers to send emails, it shouldn't take hours, ideally a couple of seconds. Is there a difference (for the user) between polling every 3 seconds or every 15 seconds?
Solving your problem (with push queues):
My goal is to offload messages / data that need to be sent to multiple third-party APIs, so accessing them doesn't slow down the client. So sending the data to a message queue is ideal. I considered using just Gearman to hold the MQ/jobs, but I wanted to use a cloud queue service like SQS or Rackspace Cloud Queues so I wouldn't have to manage the messages.
Indeed the scenario you describe is a good fit for message queues.
As you mentioned, you don't want to manage the message queue itself, so maybe you do not want to manage the workers either? This is where push queues come in.
Push queues basically call your worker. For example, AWS Elastic Beanstalk worker environments do the heavy lifting (polling) in the background and simply call your application with an HTTP request containing the queue message (refer to the docs for details). I have personally used the AWS push queues and have been happy with how easy they are. Note that there are other push queue providers, like Iron.io.
As you mentioned you are using PHP, there is the QPush Bundle for Symfony, which handles incoming message requests. You may have a look at the code to roll your own solution.
I would recommend a different route, and that would be to use sockets. ZMQ is an example of a socket-based library that is already written. With sockets you can create a queue and manage what to do with messages as they come in. The machine will be in stand-by mode and use minimal resources while waiting for a message to come in.
I have implemented RabbitMQ in my current PHP application to handle asynchronous jobs that are processed by workers. My current problem is how I should monitor the workers and scale them up or down. I also want to add error handling in case all the workers die. I have thought of the following two ways but don't know which one is better:
At the producer end, I would analyze the RabbitMQ queue size. If the queue size (the list of pending tasks) exceeds a threshold, I would create one new worker every time the producer script executes, but first I would check the server load (using the Linux command uptime) and create the new worker only if the load is below a threshold. At the consumer end (in worker.php), I would apply the same method to scale up the workers, and I would also check whether the script has been idle for a given time (i.e. there is no pending task in the RabbitMQ queue), in which case it would automatically die (to automate scaling down of workers).
The second method is to use a background process or cron to monitor the workers and scale them up/down. But I don't want to rely on cron (I have had very bad experiences with it) or on a background process, because if the background process crashes for some reason there is no way to recover from it.
Please help.
I wouldn't recommend bothering to scale them down to nothing when there's no work to be done. The worker that's left (if you want to scale back to 1) will simply wait for something else to consume and it's not an expensive operation.
In terms of determining whether to scale up, I'd recommend leveraging the RabbitMQ Management HTTP API (http://hg.rabbitmq.com/rabbitmq-management/raw-file/3646dee55e02/priv/www-api/help.html). You can use the queue related aspects via a GET operation to get information about queues, including how many entries are currently waiting to be processed.
With that info, you can decide to scale if it either hits a certain threshold, or keeps increasing with every check for a certain amount of time, or something similar. This can be done from the consumer side.
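As a rough illustration of that decision rule, here is a small Python sketch. It assumes you have already collected a list of queue-depth samples (for example, the message count the management API reports for a queue on each check); the function name should_scale_up and the default threshold values are hypothetical.

```python
def should_scale_up(depth_samples, threshold=1000, rising_checks=3):
    """Decide whether to add a worker from recent queue depths (oldest first)."""
    if not depth_samples:
        return False
    # Rule 1: the latest depth has hit the threshold.
    if depth_samples[-1] >= threshold:
        return True
    # Rule 2: the depth has increased on each of the last `rising_checks` checks.
    recent = depth_samples[-(rising_checks + 1):]
    if len(recent) < rising_checks + 1:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


over_threshold = should_scale_up([10, 1200])
steadily_rising = should_scale_up([10, 20, 30, 40])
fluctuating = should_scale_up([10, 20, 15, 40])
```

Here over_threshold and steadily_rising are true, while fluctuating is false because the depth dropped during the window.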
In terms of error handling, I would recommend encapsulating the RabbitMQ connection aspect of your workers such that if a RabbitMQ exception occurs the connection is re-established from scratch and continues.
If it's a more serious type of exception that isn't RabbitMQ-related, you may need to catch it at such a level where the worker basically spawns a new worker before it dies. Then of course there are other types of exceptions (out of memory conditions, for example), where it really isn't feasible to try to continue and your program should just completely die.
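A minimal sketch of that separation, in Python for illustration: broker-related failures trigger a fresh connection, while any other exception is allowed to propagate and end the worker. BrokerError, connect and consume are hypothetical stand-ins here, not names from a real client library.

```python
import time


class BrokerError(Exception):
    """Hypothetical stand-in for the client library's connection exception."""


def run_with_reconnect(connect, consume, max_attempts=5, delay=0.0, sleep=time.sleep):
    """Run consume(connection), rebuilding the connection after broker errors."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            connection = connect()   # re-established from scratch each time
            consume(connection)
            return attempts
        except BrokerError:
            sleep(delay)             # brief pause before reconnecting
        # Any other exception type is deliberately not caught here.
    raise RuntimeError("giving up after %d attempts" % max_attempts)


# Demo: the first two connection attempts fail, the third succeeds.
failures = iter([True, True, False])


def flaky_connect():
    if next(failures):
        raise BrokerError("connection refused")
    return "connection"


consumed = []
attempts_needed = run_with_reconnect(flaky_connect, consumed.append)
```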
It is very difficult to answer your question with any degree of accuracy since there are many aspects of the context which are not included.
How long do the tasks take to execute?
Why do you want to scale up/down? Why don't you have threads waiting for load in the first place?
That being said, coming from the world of Erlang and functional programming (Erlang is the language used to power RabbitMQ), I would like to suggest the concept of a SUPERVISOR thread. This thread would have the following responsibilities:
Spawn threads depending on the load/qty of requests
Discard threads depending on the load/qty of requests
Monitor the children threads and re-launch them as required reprocessing the same messages if necessary or discarding them
The supervisor thread should be as simple as possible and should be built in such a way that it simply loops, sleeps and checks whether all the threads that need to be alive actually are. It can then check the load and spawn or kill off workers as needed.
You could easily use an exchange to send messages to both the supervisor and the worker queues. The supervisor would then be able to keep a record/count of the messages in the queue without having to poll the server; it would simply listen to its own queue. You can increment/decrement the counter from the supervisor thread and manage everything from there.
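The core of such a supervisor can be very small. The following Python sketch shows only the check-and-reconcile step that the loop would call on each pass; the spawn, is_alive and stop callbacks are hypothetical hooks you would implement for your own worker processes or threads.

```python
import itertools


class Supervisor:
    """Keeps the number of live workers equal to a desired count."""

    def __init__(self, spawn, is_alive, stop):
        self.spawn = spawn
        self.is_alive = is_alive
        self.stop = stop
        self.workers = []

    def tick(self, desired):
        # Forget workers that died since the last check.
        self.workers = [w for w in self.workers if self.is_alive(w)]
        # Relaunch or scale up.
        while len(self.workers) < desired:
            self.workers.append(self.spawn())
        # Scale down.
        while len(self.workers) > desired:
            self.stop(self.workers.pop())
        return len(self.workers)


# Demo with fake workers identified by integers.
ids = itertools.count(1)
alive = set()


def spawn():
    worker_id = next(ids)
    alive.add(worker_id)
    return worker_id


supervisor = Supervisor(spawn, lambda w: w in alive, alive.discard)
supervisor.tick(3)
after_start = list(supervisor.workers)
alive.discard(2)                 # simulate a crashed worker
supervisor.tick(3)
after_crash = list(supervisor.workers)
supervisor.tick(1)               # load dropped, scale down
after_scale_down = list(supervisor.workers)
```

The real loop would sleep between calls to tick and derive the desired count from the load, for instance from the message counter described above.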
Hope this helps.
See: http://docs.dotcloud.com/guides/daemons/
Regretfully I don't program in PHP and therefore cannot give you PHP-specific assistance, this is however the programming pattern that I recommend that you use. If PHP doesn't allow multi-threaded programming and/or threads then I would highly recommend that you use a language that does since you will not be able to scale and use the full power of the local machine unless you use multiple threads. As for the supervisor crashing, if you keep minimal work in the supervisor and delegate all responsibilities to children threads then the risk of a supervisor crash is minimal.
Perhaps this will help:
Philosophy:
http://soapatterns.org/design_patterns/service_agent
PHP-specific:
http://www.quora.com/PHP-programming-language-1/Is-there-an-actor-framework-for-php
I have a script which picks up messages from a queue, this does the pre-processing required for the other processes to work.
Now, these messages have to be delivered so I need to ack these messages and if one of the services listening for messages goes down then it should receive the messages it missed when it comes back online.
A couple questions:
1/ Does it make sense to have a queue for each post-processing service, which is added to every time the pre-processing runs? (So I might add to 8 different queues at the same time following each process. This will be a ton of messages: hundreds of thousands per day.)
2/ How quick is it to add messages to a queue? Is adding to 8-10 queues going to slow down my software?
3/ Can I use a topic exchange to do this with fanout? My only concern is if one of my services goes down they will miss the message.
4/ Any tips from persons with experience?
A few thoughts:
If your post-processors are each doing a different 'job' then it makes sense to have a queue for each of them to consume from. If you just have a bunch of post-processors all doing the same task, then you only need one queue from which they can all consume messages.
Adding messages to queues is FAST, adding queues into RabbitMQ is fast, binding the queues to exchanges is fast. The thing that will slow down your system would be the size of the messages and the number that you are likely to receive, and then how much processing actually needs to be done.
The other consideration is persistence of messages: should your messages survive a restart of RabbitMQ? That is, how critical are they? If it is critical that they not be lost (which, by the sounds of your question, it is) then you will need to make sure they are persisted. If you look at the RabbitMQ documentation you will see that there is a significant cost in doing this.
This depends on what your system is actually doing... Topics are good, fanouts are good, but which one is applicable depends on what your system does.
I would highly recommend reading RabbitMQ in Action; it is an excellent resource and well worth the money.
I've a problem for which I'm having a hard time figuring out the ideal solution and, to better explain it, I'm going to lay out my scenario here.
I've a server that will receive orders from several clients. Each client will submit a set of recurring tasks that should be executed at some specified intervals, eg.: client A submits task AA that should be executed every minute between 2009-12-31 and 2010-12-31; so if my math is right that's about 525 600 operations in a year, given more clients and tasks it would be infeasible to let the server process all these tasks so I came up with the idea of worker machines. The server will be developed on PHP.
Worker machines are just regular cheap Windows-based computers that I'll host on my home or at my workplace, each worker will have a dedicated Internet connection (with dynamic IPs) and a UPS to avoid power outages. Each worker will also query the server every 30 seconds or so via web service calls, fetch the next pending job and process it. Once the job is completed the worker will submit the output to the server and request a new job and so on ad infinitum. If there is a need to scale the system I should just set up a new worker and the whole thing should run seamlessly.
The worker client will be developed in PHP or Python.
At any given time my clients should be able to log on to the server and check the status of the tasks they ordered.
Now here is where the tricky part kicks in:
I must be able to reconstruct the already processed tasks if for some reason the server goes down.
The workers are not client-specific, one worker should process jobs for any given number of clients.
I've some doubts regarding the general database design and which technologies to use.
Originally I thought of using several SQLite databases and joining them all on the server but I can't figure out how I would group by clients to generate the job reports.
I've never actually worked with any of the following technologies: memcached, CouchDB, Hadoop and the like, but I would like to know if any of these is suitable for my problem, and if so, which you would recommend for a newbie in "distributed computing" (or is this parallel?) like me. Please keep in mind that the workers have dynamic IPs.
Like I said before, I'm also having trouble with the general database design, partly because I still haven't chosen any particular R(D)DBMS. But one issue I have, which I think is agnostic to the DBMS I choose, is related to the queuing system... Should I precalculate all the absolute timestamps for a specific job and have a large set of timestamps, executing them and flagging them as complete in ascending order, or should I have a more clever system like "when timestamp modulus 60 == 0 -> execute"? The problem with this "clever" system is that some jobs will not be executed in the order they should be, because some workers could be waiting doing nothing while others are overloaded. What do you suggest?
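For reference, the arithmetic behind the two options can be sketched as follows. This is only an illustration in Python; total_runs and next_run are hypothetical helper names, and storing just the next due time per task is one possible middle ground rather than a recommendation taken from the question.

```python
from datetime import datetime, timedelta


def total_runs(start, end, interval):
    """How many executions a recurring task has between start and end."""
    return (end - start) // interval


def next_run(start, interval, last_run=None):
    """Next due time, derived from the previous one instead of a stored list."""
    return start if last_run is None else last_run + interval


start = datetime(2009, 12, 31)
end = datetime(2010, 12, 31)
every_minute = timedelta(minutes=1)

runs = total_runs(start, end, every_minute)       # rows needed if precalculated
first = next_run(start, every_minute)             # first execution
second = next_run(start, every_minute, first)     # computed after the first completes
```

For the example task, runs comes out at 525600, which matches the estimate in the question, whereas the second approach needs only one stored timestamp per task.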
PS: I'm not sure if the title and tags of this question properly reflect my problem and what I'm trying to do; if not please edit accordingly.
Thanks for your input!
@timdev:
The input will be a very small JSON-encoded string; the output will also be a JSON-encoded string but a bit larger (on the order of 1-5 KB).
The output will be computed using several available resources from the Web so the main bottleneck will probably be the bandwidth. Database writes may also be one - depending on the R(D)DBMS.
It looks like you're on the verge of recreating Gearman. Here's the introduction for Gearman:
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
You can write both your client and the back-end worker code in PHP.
Re your question about a Gearman Server compiled for Windows: I don't think it's available in a neat package pre-built for Windows. Gearman is still a fairly young project and they may not have matured to the point of producing ready-to-run distributions for Windows.
Sun/MySQL employees Eric Day and Brian Aker gave a tutorial for Gearman at OSCON in July 2009, but their slides mention only Linux packages.
Here's a link to the Perl CPAN Testers project, that indicates that Gearman-Server can be built on Win32 using the Microsoft C compiler (cl.exe), and it passes tests: http://www.nntp.perl.org/group/perl.cpan.testers/2009/10/msg5521569.html But I'd guess you have to download source code and build it yourself.
Gearman seems like the perfect candidate for this scenario. You might even want to virtualize your Windows machines into multiple worker nodes per machine, depending on how much computing power you need.
Also, the persistent queue system in Gearman prevents jobs from getting lost when a worker or the Gearman server crashes. After a service restart the queue just continues where it left off before the crash/reboot. You don't have to take care of all this in your application, which is a big advantage and saves a lot of time and code.
Working out a custom solution might work, but given the advantages of Gearman, especially the persistent queue, it seems to me that this might very well be the best solution for you at the moment. I don't know about a Windows binary for Gearman, though, but I think it should be possible.
A simpler solution would be to have a single database with multiple PHP nodes connected. If you use a proper RDBMS (MySQL + InnoDB will do), you can have one table act as a queue. Each worker will then pull tasks from that table to work on and write the results back into the database upon completion, using transactions and locking to synchronise. This depends a bit on the size of the input/output data. If it's large, this may not be the best scheme.
I would avoid SQLite for this sort of task. Although it is a wonderful database for small apps, it does not handle concurrency very well: it has only one locking strategy, which is to lock the entire database and keep it locked until a single transaction is complete.
Consider Postgres, which has industrial-strength concurrency and lock management and can handle multiple simultaneous transactions very nicely.
Also, this sounds like a job for queuing! If you were in the Java world I would recommend a JMS-based architecture for your solution. There is a 'dropr' project to do something similar in PHP, but it's all fairly new so it might not be suitable for your project.
Whichever technology you use, you should go for a "free market" solution where the worker threads consume available "jobs" as fast as they can, rather than a "command economy" where a central process allocates tasks to chosen workers.
The setup of a master server and several workers looks right in your case.
On the master server I would install MySQL (Percona InnoDB version is stable and fast) in master-master replication so you won't have a single point of failure.
The master server will host an API which the workers will poll every N seconds. The master will check whether a job is available; if so, it has to flag that the job has been assigned to worker X and return the appropriate input to the worker (all of this via HTTP).
Also, here you can store all the script files of the workers.
On the workers, I would strongly suggest you to install a Linux distro. On Linux it's easier to set up scheduled tasks and in general I think it's more appropriate for the job.
With Linux you can even create a live cd or iso image with a perfectly configured worker and install it fast and easy on all the machines you want.
Then set up a cron job that will RSync with the master server to update/modify the scripts. In this way you will change the files in just one place (the master server) and all the workers will get the updates.
In this configuration you don't have to care about the IPs or the number of workers, because the workers connect to the master, not vice versa.
The worker job is pretty easy: ask the API for a job, do it, send back the result via API. Rinse and repeat :-)
Rather than re-inventing the queuing wheel via SQL, you could use a messaging system like RabbitMQ or ActiveMQ as the core of your system. Each of these systems provides the AMQP protocol and has hard-disk backed queues. On the server you have one application that pushes new jobs into a "worker" queue according to your schedule and another that writes results from a "result" queue into the database (or acts on it some other way).
All the workers connect to RabbitMQ or ActiveMQ. They pop the work off the work queue, do the job and put the response into another queue. After they have done that, they ACK the original job request to say "it's done". If a worker drops its connection, the job will be restored to the queue so another worker can do it.
Everything other than the queues (job descriptions, client details, completed work) can be stored in the database. But anything realtime should be put somewhere else. In my own work I'm streaming live power usage data and having many people hitting the database to poll it is a bad idea. I've written about live data in my system.
I think you're going in the right direction with a master job distributor and workers. I would have them communicate via HTTP.
I would choose C, C++, or Java for the clients, as they have the capability to run scripts (execvp in C, Runtime.exec or ProcessBuilder in Java). Jobs could just be the name of a script and the arguments to that script. You can have the clients return a status on the jobs. If the jobs failed, you could retry them. You can have the clients poll for jobs every minute (or every x seconds, and make the server sort out the jobs).
PHP would work for the server.
MySQL would work fine for the database. I would just make two timestamps: start and end. On the server, I would look for WHEN SECONDS==0