I'm looking for a way to add items to a queue and execute them one by one, similar to Google App Engine's task queue. Each task will be executed via an HTTP request to a PHP script.
Since I'm using Amazon, my understanding is that the best practice is to use the SNS service, which would be responsible for receiving new tasks, adding them to a queue (Amazon's SQS service), and notifying my PHP worker that a new task has been pushed onto the queue so it can fetch and execute it.
There are several issues with that approach, such as the need to limit the number of worker instances from within the worker itself, or the possibility that the task won't be in the queue yet when the worker is called, because the task is added to the queue at the same time.
I would like to hear whether there are better options or a nicer way of implementing a task manager. I prefer using Amazon's services, but I'm open to any suggestion; I'm looking for the best method. Features that are missing in Amazon, like FIFO and priority support, would also be a nice addition.
Thanks!
Ben
I have found a good solution.
AWS Elastic Beanstalk apparently lets you define a new environment as either a "web server" or a "worker". If you define it as a worker, you can attach it to an SQS queue, and it will be responsible for polling the queue and performing the task (with the code you deploy to the instance).
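To make the worker side concrete: in a worker environment, a daemon on the instance polls the queue and POSTs each message body to a local HTTP path in your application. A 200 response tells it to delete the message; any other status leaves the message in the queue to be retried. Below is a minimal sketch of that contract, in Python for brevity (the same shape applies to a PHP script). `process_task`, `handle_post` and the `action` field are hypothetical names, not part of any AWS API.

```python
import json


def process_task(task):
    """Hypothetical task body; replace with the real work."""
    if "action" not in task:
        raise ValueError("task has no 'action' field")


def handle_post(body):
    """Return the HTTP status to send back for one queue message.

    200 means "done, delete the message"; anything else means "retry later".
    """
    try:
        process_task(json.loads(body))
    except Exception:
        # Any failure (bad JSON, missing field, task error) asks for a retry.
        return 500
    return 200
```

The web framework that receives the POST would just call `handle_post` with the raw request body and use the return value as the response status.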
I am building a multi-tenant web application using Laravel/PHP that will eventually be hosted on AWS as SaaS. I have around 15-20 different background jobs that need scheduling for each tenant, and the jobs need to be fired every 5 minutes. Thus the number of jobs that need to be fired for 100 tenants would be around 2000. I am left with 2 challenges in achieving this:
Is there a cloud solution that distributes and manages the load of the scheduled jobs automatically?
If one is out there, how can we create those 15+ scheduled jobs on the fly? Is there an API available?
Looking for your assistance
Finally, I have found a solution to my problem.
We cannot scale the background jobs in the way I want. It required me to look into the solution from a completely different angle.
The ideal solution to my problem is to generate SQS messages (with a payload describing the tenant id, the job that needs to be executed, and any additional parameters), one per tenant, at a set interval and queue them.
For example, if I have 100 tenants and I want to run "Job 1" every hour, the main application will generate 100 SQS messages and put them on a particular SQS queue every hour. It will do the same for all 15 different jobs I have per tenant.
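A minimal sketch of that fan-out, in Python for brevity (the same shape applies in PHP). The helper names `build_messages` and `batches`, the job name and the payload fields are assumptions for illustration. It only builds the batch entries; sending is left to the SQS client (for example, each batch could be passed as `Entries` to boto3's `send_message_batch`).

```python
import json

SQS_BATCH_LIMIT = 10  # SQS accepts at most 10 messages per batch request


def build_messages(tenant_ids, job_name, params=None):
    """Build one SQS batch entry per tenant for the given job."""
    entries = []
    for index, tenant_id in enumerate(tenant_ids):
        payload = {"tenant_id": tenant_id, "job": job_name, "params": params or {}}
        # "Id" only has to be unique within a single batch request.
        entries.append({"Id": f"m{index}", "MessageBody": json.dumps(payload)})
    return entries


def batches(entries, size=SQS_BATCH_LIMIT):
    """Yield the entries in chunks small enough for one batch request."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]
```

With 100 tenants this produces 100 entries in 10 batches per job, so 15 jobs per tenant come to 150 batch requests per interval.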
On the other end, a scalable AWS Lambda function listening to the SQS queue will pick up the payload and execute the intended task based on the data being carried by the payload.
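For illustration, a Lambda function triggered by SQS receives an event with a `Records` list, where each record carries the message in its `body` field. A rough sketch of such a handler is below; the job functions and the `JOBS` table are hypothetical, and the payload fields follow the description above.

```python
import json


def sync_invoices(tenant_id, params):
    # Hypothetical job; stands in for the real per-tenant work.
    return f"invoices synced for tenant {tenant_id}"


def send_reports(tenant_id, params):
    # Hypothetical job; stands in for the real per-tenant work.
    return f"reports sent for tenant {tenant_id}"


JOBS = {"sync_invoices": sync_invoices, "send_reports": send_reports}


def handler(event, context=None):
    """Run the job named in each SQS message of the batch."""
    results = []
    for record in event["Records"]:
        payload = json.loads(record["body"])
        # An unknown job name raises KeyError, which fails the invocation
        # so the messages become visible on the queue again.
        job = JOBS[payload["job"]]
        results.append(job(payload["tenant_id"], payload.get("params", {})))
    return results
```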
But unfortunately, my expertise lies in PHP/Laravel technology which is still not in the AWS Lambda stack. Hence I figured out a workaround as follows.
I built a Docker image with my PHP/Laravel application and placed it in Amazon ECS (EC2 container service). Still, I have the AWS Lambda function in place but this time it acts as a trigger to my docker containers. The Lambda picks an SQS Message, processes the payload and spawns a Docker container on ECS based on my Docker image. I got some of the ideas from the following article to arrive at this solution.
https://aws.amazon.com/blogs/compute/better-together-amazon-ecs-and-aws-lambda/
Laravel has an option to schedule tasks/jobs:
Refer: https://laravel.com/docs/6.x/scheduling
So you can keep your clients' jobs in your database and then do something like the following:
Scheduling Queued Jobs
The job method may be used to schedule a queued job. This method provides a convenient way to schedule jobs without using the call method to manually create Closures to queue the job:
$schedule->job(new ClientJob)->everyFiveMinutes();
// Dispatch the job to the "clientjob" queue...
$schedule->job(new ClientJob, 'clientjob')->everyFiveMinutes();
or
Scheduling Shell Commands
The exec method may be used to issue a command to the operating system:
$schedule->exec('node /home/forge/script.js')->everyFiveMinutes();
I'll preface this by admitting slight sleep-deprivation.
The setup is as follows:
API Endpoint (Server A) receives an incoming call, and adds this to a specific queue on the RabbitMQ Server (Server B).
RabbitMQ (Server B) is simply a RabbitMQ Queue Server. Nothing more, nothing less.
Laravel Installation (Server C) is our actual Laravel install, which is meant to look for jobs on specific queues and do things with them.
We have a RabbitMQ package in the Laravel install, which allows the use of the regular Laravel Queue mechanics over a RabbitMQ connection.
The issue I've come across is that we can spawn a worker for a queue - but since we're not generating the jobs passing a $job class (the job content itself is most often a JSON array), the Laravel install has no idea what to do with the job.
So my question revolves mainly around how to approach a scenario like this. I'm thinking that using the Queue-functionality in Laravel won't do what I need it to do. Can you see an approach that I'm missing? Do I really need to spawn a daemon on a non-framework script to handle this?
Your input is much appreciated!
An alternative approach would be a listener in your Laravel application that consumes the JSON messages and acts on them.
A queue listener can be created using a package such as https://github.com/bschmitt/laravel-amqp (a generic AMQP bridge for Laravel) or https://github.com/needle-project/laravel-rabbitmq (a bridge more specialised for RabbitMQ).
The queue consumer then reads the JSON payload, saves the payload as appropriate data, and decides which jobs to dispatch as a result within the Laravel application, as handled by the https://github.com/vyuldashev/laravel-queue-rabbitmq package.
The two applications still communicate with plain JSON, and not the Laravel-oriented JSON containing the serialised job class.
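The translation step in the consumer can be sketched roughly as follows, in Python for brevity (in Laravel this would live in the listener and end in regular job dispatches). The event names, job names and fields are made up for illustration; the point is only that plain JSON comes in and framework-side job descriptions come out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Job:
    # Hypothetical stand-in for a framework job class plus its arguments.
    name: str
    args: dict = field(default_factory=dict)


def jobs_for(raw):
    """Translate one plain-JSON message into the jobs it should trigger."""
    message = json.loads(raw)
    event = message.get("event")
    if event == "order.created":
        order = {"order_id": message["order_id"]}
        return [Job("SendConfirmationEmail", order), Job("UpdateInventory", order)]
    if event == "order.cancelled":
        return [Job("RefundPayment", {"order_id": message["order_id"]})]
    # Unknown events trigger nothing; the caller decides whether to log them.
    return []
```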
The solution is indeed to replicate the job code in the application that issues the job. That copy will not need every dependency the job requires to actually run, since the pushing side only serializes the job.
I'm trying to wrap my head around the message queue model and jobs that I want to implement in a PHP app:
My goal is to offload messages / data that need to be sent to multiple third-party APIs, so that accessing them doesn't slow down the client. So sending the data to a message queue is ideal.
I considered using just Gearman to hold the MQ/jobs, but I wanted to use a cloud queue service like SQS or Rackspace Cloud Queues so I wouldn't have to manage the messages.
Here's a diagram of what I think I should do:
Questions:
My workers would be written in PHP. Do they all have to poll the cloud queue service? That could get expensive, especially when you have a lot of workers.
I was thinking of having 1 worker just for polling the queue which, if there are messages, notifies the other workers that they have jobs. I would just have to keep this 1 worker online, using supervisord perhaps? Is this polling method better than using an MQ that can notify? How should I poll the MQ: once every second, or as fast as it can poll? And then increase the polling workers if I see it slowing down?
I was also thinking of having a single queue for all the messages, with the worker monitoring it distributing the messages to other cloud MQs depending on where they need to be processed, since 1 message might need to be processed by 2 different workers.
Would I still need Gearman to manage my workers, or can I just use supervisord to spin workers up and down?
Isn't it more effective and faster to also send a notification to the main worker whenever a message is sent, versus polling the MQ? I assume I would then need to use Gearman to notify my main worker that the MQ has a message, so it can start checking it. Or, if I have 300 messages per second, would this generate 300 jobs to check the MQ?
Basically how could I check the MQ as efficiently and as effectively as possible?
Suggestions or corrections to my architecture?
My suggestions basically boil down to: Keep it simple!
With that in mind my first suggestion is to drop the DispatcherWorker. From my current understanding, the sole purpose of the worker is to listen to the MAIN queue and forward messages to the different task queues. Your application should take care of enqueuing the right message onto the right queue (or topic).
Answering your questions:
My workers would be written in PHP. Do they all have to poll the cloud queue service? That could get expensive, especially when you have a lot of workers.
Yes, there is no free lunch. Of course you could adapt and optimize your workers' poll rate by application usage (increase the poll rate when more messages arrive), by time of day or week (if your users are active at specific times), and so on. Keep in mind that the engineering costs might soon be higher than the cost of unoptimized polling.
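One simple way to adapt the poll rate to usage is to back off while the queue is empty and snap back as soon as messages arrive. A minimal sketch, in Python for brevity; the function name and the default limits are assumptions, not recommendations:

```python
def next_delay(current, received, minimum=1.0, maximum=20.0, factor=2.0):
    """Return how many seconds to wait before the next poll.

    current  -- the delay used before the poll that just finished
    received -- how many messages that poll returned
    """
    if received > 0:
        # Traffic is flowing: poll again at the fastest allowed rate.
        return minimum
    # Queue was empty: wait longer, but never beyond the maximum.
    return min(current * factor, maximum)
```

A worker loop would call this after every poll and sleep for the returned number of seconds. Note that SQS also supports long polling (a receive call can wait up to 20 seconds for a message), which reduces the number of empty responses without any custom logic.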
Instead, you might consider push queues (see below).
I was thinking of having 1 worker just for polling the queue which, if there are messages, notifies the other workers that they have jobs. I would just have to keep this 1 worker online, using supervisord perhaps? Is this polling method better than using an MQ that can notify? How should I poll the MQ: once every second, or as fast as it can poll? And then increase the polling workers if I see it slowing down?
This sounds too complicated. Communication is unreliable, but reliable message queues exist. If you don't want to lose data, stick to the message queues and don't invent custom protocols.
I was also thinking of having a single queue for all the messages, with the worker monitoring it distributing the messages to other cloud MQs depending on where they need to be processed, since 1 message might need to be processed by 2 different workers.
As already mentioned, the application should enqueue your message to multiple queues as needed. This keeps things simple and in place.
Would I still need Gearman to manage my workers, or can I just use supervisord to spin workers up and down?
There are so many message queues and even more ways to use them. In general, if you are using poll queues you'll need to keep your workers alive by yourself. If however you are using push queues, the queue service will call an endpoint specified by you. Thus you'll just need to make sure your workers are available.
Basically how could I check the MQ as efficiently and as effectively as possible?
This depends on your business requirements and the job your workers do. What time spans are critical? Seconds, Minutes, Hours, Days? If you use workers to send emails, it shouldn't take hours, ideally a couple of seconds. Is there a difference (for the user) between polling every 3 seconds or every 15 seconds?
Solving your problem (with push queues):
My goal is to offload messages / data that need to be sent to multiple third-party APIs, so that accessing them doesn't slow down the client. So sending the data to a message queue is ideal. I considered using just Gearman to hold the MQ/jobs, but I wanted to use a cloud queue service like SQS or Rackspace Cloud Queues so I wouldn't have to manage the messages.
Indeed the scenario you describe is a good fit for message queues.
As you mentioned, you don't want to manage the message queue itself; maybe you don't want to manage the workers either? This is where push queues come in.
Push queues basically call your worker. For example, AWS Elastic Beanstalk worker environments do the heavy lifting (polling) in the background and simply call your application with an HTTP request containing the queue message (refer to the docs for details). I have personally used the AWS push queues and have been happy with how easy they are. Note that there are other push queue providers, such as Iron.io.
As you mentioned you are using PHP, there is the QPush Bundle for Symfony, which handles incoming message requests. You may have a look at the code to roll your own solution.
I would recommend a different route, and that would be to use sockets. ZMQ is an example of an existing socket-based library. With sockets you can create a queue and manage what to do with messages as they come in. The machine will be in stand-by mode and use minimal resources while waiting for a message to come in.
I see a common pattern in the services that we are trying to develop, and I wonder if there are tools / libraries out there that would help here. While the default jobs discussed in the microservice literature are of a REQUEST -> RESPONSE nature, our jobs are more or less assignments of semi-permanent tasks.
Examples of such tasks
Listen on the message queue for data from source X and Y, correlate the data that comes in and store it in Z.
Keep an in-memory buffer that calculates a running average of the past 15 minutes of data every time a new data entry comes in.
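To illustrate why such a task is semi-permanent (it carries state between messages), here is a minimal sketch of the second example, in Python for brevity. The class name is hypothetical, and it assumes entries arrive with non-decreasing timestamps in seconds.

```python
from collections import deque


class RollingAverage:
    """Average of the values seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds=15 * 60):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record a new entry and return the current rolling average."""
        self.samples.append((timestamp, value))
        self.total += value
        # Drop entries that have fallen out of the window.
        cutoff = timestamp - self.window
        while self.samples and self.samples[0][0] <= cutoff:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.total / len(self.samples)
```

Because the buffer lives in the memory of one process, the job cannot simply be handed to whichever worker is free; it has to stay assigned to one worker until it is explicitly moved.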
Currently our services are written in PHP. Due to the perceived overhead of PHP processes and connections to the message queue, we'd like a single service process to handle several of those jobs simultaneously.
A chart that hopefully illustrates the setup that we have in our heads:
Service Workers are currently daemonized PHP scripts
For the Service Registry we are looking at Zookeeper
While Zookeeper (and Curator) do load balancing, I did not find anything about distributing permanent jobs (that are updatable, removable, and must be reassigned when a worker dies)
Proposed responsibilities of a Job Manager
Knows about jobs
Knows about services that can do these jobs
Can assign jobs to services
Can send job updates to services
Can reassign jobs if a worker dies
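To make those responsibilities concrete, here is a rough in-memory sketch, in Python for brevity. Everything in it is hypothetical (class, method names, the "least loaded worker" rule); how a dead worker is detected is left out, although with Zookeeper that would typically be an ephemeral node disappearing.

```python
class JobManager:
    """Tracks which worker owns which long-running job."""

    def __init__(self):
        self.workers = {}  # worker id -> set of job ids it currently runs
        self.jobs = {}     # job id -> owning worker id, or None if unassigned

    def add_worker(self, worker_id):
        self.workers.setdefault(worker_id, set())
        # A new worker picks up any jobs that had nowhere to go.
        for job_id, owner in list(self.jobs.items()):
            if owner is None:
                self._assign(job_id)

    def add_job(self, job_id):
        self.jobs[job_id] = None
        self._assign(job_id)

    def _assign(self, job_id):
        if not self.workers:
            return  # stays unassigned until a worker registers
        # Pick the least loaded worker; ties are broken by worker id.
        target = min(self.workers, key=lambda w: (len(self.workers[w]), w))
        self.workers[target].add(job_id)
        self.jobs[job_id] = target

    def worker_died(self, worker_id):
        orphaned = self.workers.pop(worker_id, set())
        for job_id in sorted(orphaned):
            self.jobs[job_id] = None
            self._assign(job_id)
```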
Are there any libraries / tools that can tackle such problems, and can thus function as the Job Manager? Or is this all one big anti pattern and should we do it some other way?
You should have a look at Gearman.
It consists of a client which assigns the jobs, one or more workers which pick up and execute the jobs, and a server which maintains the list of functions (services) and pending jobs. It will re-assign the jobs if a worker dies.
Your workers sound like (API-less) services themselves. So, your requirements can be reformulated as:
Knows about deployed services
Knows about nodes that can host these services
Can deploy services to nodes
Can [send job updates to services] = redeploy services/invoke some API on deployed services
Can redeploy service if service or node dies
Look at Docker to deploy, run and manage isolated processes on host.
RabbitMQ is a simple message queue that is fairly easy to get going with.
I'm looking to use a work queue to delegate some jobs.
I know that some service like Amazon SQS or Beanstalkd are perfect for this problem.
But, in both I have to create a daemon that poll the queue every x seconds.
Is there other ways to do that with some kind of push system?
Does anyone have experience with SQS+SNS to call the workers?
Thanks.