Scaling cronjobs over multiple servers - php

Right now we have a single server with a crontab that sends out daily emails. We would like to scale that server. The application is a standard Zend Framework application deployed on a CentOS server in the Amazon cloud.
We have already taken care of load balancing, content management, and deployment. However, the cron jobs are still an issue for us, as we need to guarantee that some jobs are performed only once.
For example, the daily-emails cron job must be executed exactly once, by a single server. I'm looking for the best method to guarantee that only one server will execute it, and only once.
I'm thinking about two solutions, but I was wondering if someone else has had the same issue.
Make one of the servers the "master", which alone sends out the daily emails. That is a problem if the server malfunctions, and in general we don't want a "special" server. It would also mean we would need to keep track of which server is the master.
Have a queue of scheduled tasks to be performed. Each server opens that queue and sees which tasks need to be performed. The first server to grab a task performs it and marks it as done. I was looking at Amazon Simple Queue Service as a solution for the queue.
Both solutions have advantages and disadvantages, and I was wondering if someone has thought of something else that might help us here.

When you need to scale out cron jobs, you are better off using a job manager like Gearman.
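For the daily-email case above, a minimal sketch with the pecl/gearman extension could look like the following; the function name send_daily_emails and the payload are made up for illustration, and gearmand is assumed to run on 127.0.0.1:4730. Because the job is submitted once and handed to exactly one registered worker, only one server ends up sending the emails.

client.php (run by whichever machine or trigger schedules the job):

<?php
// submit the job once; doBackground() returns without waiting for completion
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('send_daily_emails', json_encode(['date' => date('Y-m-d')]));

worker.php (run one copy on each app server; gearmand gives the job to only one of them):

<?php
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('send_daily_emails', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... build and send the daily emails here ...
});
while ($worker->work());   // block, waiting for jobs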

Beanstalkd could also be an option for you.
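A comparable sketch for Beanstalkd, assuming the pda/pheanstalk client library (the method names below follow its v4-style API and may differ between versions); the tube name is made up. Because reserve() hands each job to exactly one consumer, only one server processes it.

<?php
require 'vendor/autoload.php';
use Pheanstalk\Pheanstalk;

// producer - put the daily-email job on a tube
$queue = Pheanstalk::create('127.0.0.1');
$queue->useTube('daily-emails')->put(json_encode(['date' => date('Y-m-d')]));

// consumer - run on every server; only one of them gets each reserved job
$queue->watch('daily-emails');
$job = $queue->reserve();          // blocks until a job is available
// ... send the emails described in json_decode($job->getData(), true) ...
$queue->delete($job);              // remove it so it can never run twice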

I had the same problem. What I did was dead simple.
I spun up the cheapest EC2 instance on AWS.
I created the cron job(s) only on this server.
The cron jobs just make a simple request to my endpoint / API (i.e. api.mydomain.com).
On my API, I have a route watching for these special requests, and it runs the job I want. So basically, instead of running the task from a cron job directly, I'm running the task via an HTTP request.
I hope that makes sense! Now it doesn't matter how many servers you have; it will just scale. Also, your cron job server's only function is to run dead simple jobs that send a request, nothing more.
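To make that concrete, the crontab on the single tiny instance and the receiving route might look roughly like this; the URL, token, and sendDailyEmails() function are hypothetical placeholders.

# crontab on the scheduler instance - fires the job at 08:00 every day
0 8 * * * curl -fsS "https://api.mydomain.com/cron/daily-emails?token=SECRET" > /dev/null

<?php
// hypothetical route handler behind api.mydomain.com/cron/daily-emails
if (($_GET['token'] ?? '') !== getenv('CRON_TOKEN')) {
    http_response_code(403);
    exit('forbidden');
}
sendDailyEmails();   // placeholder for your existing Zend job code
echo 'ok';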

Related

Executing scripts in background permanently on webserver

To work around the API's request limits, I want to fetch data from an API endpoint and provide it to my users from a third-party hosting platform. These platforms usually support PHP, so I was thinking of using it. The data should update about once a minute or every two minutes. The fetching process itself could be as simple as possible, e.g. like this:
$json = file_get_contents('https://abc.com/xyz');
file_put_contents('example.json', $json);
This would fetch the endpoint and write it into a local file. But to repeat this step continuously and keep the data updated, the script would need to run permanently or be executed frequently. The only way I found was to use cron jobs, but is that a recommendable way to keep files updated? Or are there much better methods?
I know there are better setups for this, like handling it with Node.js, but I'm considering a platform like this so I only have to manage the communication between the API and the server, not between the server and the clients. I didn't find another way to do so, but I'm open to other suggestions!
While it can be done differently (with Node.js as you mentioned, or other methods), I believe a system cron job run every X minutes (depending on how long the API takes to respond) will suffice and keep things simple.
Provided, of course, that you are able to set up system cron jobs on your webserver.
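For example, a crontab entry along these lines (paths are placeholders) would run the fetch script every two minutes and log its output:

*/2 * * * * /usr/bin/php /path/to/fetch.php >> /var/log/fetch.log 2>&1

If the hosting platform only offers a web-based cron (common on shared hosts), pointing it at a URL that executes the same script achieves the same effect.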

SendGrid for PHP is slow. Are non-blocking requests possible?

We are currently developing a mobile app for iOS and Android. For this, we need stable webservices.
Requirements: based on PHP and MySQL; must be blazing fast; must be scalable.
I've created simple custom-coded webservices with multiple endpoints to pass data from the app to our database, and vice versa.
My Question:
Our average response time with my custom-coded solution is below 100 ms (measured using New Relic) for normal requests (say, updating a DB field or performing an INSERT INTO). This is without any load, however (below 100 users daily).
When we make outbound requests (specifically, sending e-mail using the SendGrid PHP library) we see a response time of over 1000 ms. It appears the request is waiting for a response from SendGrid. Is it possible to tell the script not to wait for a response? The current behaviour is not really ideal.
My idea was to store all "pending" requests in a separate table and then use a cron job to run through the "pending" requests and mark them as "completed". Is this a viable solution? And would one cron run per minute be enough to process the requests (with a possible delay of up to 1 minute for each e-mail)?
As always, any replies or suggestions are very appreciated. Thanks in advance!
To answer the first part of your question: yes, you can make asynchronous requests with PHP, and even ignore the service's response. However, as you correctly say, it's not a great solution.
Asynchronous Requests
This excellent blog post on PHP Asynchronous Requests by Segment.io comes to several conclusions:
You can open a socket and write to it, as described by this Stack Overflow topic (a minimal sketch of this approach follows this list). However, it seems that this is actually blocking and fairly slow (300 ms in their tests).
You can write to a log file and then process it in another way (essentially a queue, like you describe). However, this requires another process to read the log and process it. Using the file system can be slow, and shared files can cause all sorts of problems.
You can fork a cURL request. However, this means you aren't waiting for a response, so if SendGrid (or some other service) responds with an error, you can't catch it and react.
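To make the first option concrete, here is a minimal fire-and-forget sketch: open a socket, write the HTTP request, and close without reading the reply. The host, path, and payload are illustrative only, and as noted above you never learn whether the call succeeded.

<?php
function fireAndForget(string $host, string $path, string $body): void
{
    // 1-second connect timeout; TLS on port 443
    $fp = fsockopen('ssl://' . $host, 443, $errno, $errstr, 1);
    if (!$fp) {
        return;   // failures are silently ignored - the main drawback of this approach
    }
    $request  = "POST {$path} HTTP/1.1\r\n";
    $request .= "Host: {$host}\r\n";
    $request .= "Content-Type: application/json\r\n";
    $request .= "Content-Length: " . strlen($body) . "\r\n";
    $request .= "Connection: close\r\n\r\n";
    $request .= $body;
    fwrite($fp, $request);
    fclose($fp);   // close without waiting for or reading the response
}

fireAndForget('api.example.com', '/send-email', json_encode(['to' => 'nick@example.com']));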
Opinion Land
We're now entering semi-opinion land, but queues as you describe (such as a MySQL one with a cron job, a text file, or something else) tend to be very scalable, as you can throw workers at the queue if you need it to process faster. They can also live outside your user-facing system (and therefore not share its resources).
Queues
With a queue, you'd have a separate service responsible for sending an email with SendGrid (for example). It would pull tasks off a queue (e.g. "send an email to Nick") and then execute them.
There are several ways to implement queues that you can process.
You can write your own - As you seem to want to stay on PHP/MySQL, if you do this you'll need to take into account a bunch of queueing problems and weird edge cases. However, you'll have absolute control, and for a simple application this may well work (a rough sketch follows this list).
You can run a self-hosted task queue - Celery is meant to be a distributed task queue, and øMQ (ZeroMQ) and RabbitMQ can also be used as task queues. These are meant to be fast and distributed and have had a lot of thought put into them. You'd need to benchmark them in your system to see if they speed it up, and it also means you have to host additional pieces yourself. This, however, is likely to be the fastest solution from a communication standpoint.
You can pass things off to a hosted task queue - IronMQ and Amazon SQS are both good hosted solutions, which means you wouldn't need to dedicate resources to them; additionally, with IronWorkers (for example) you could have the worker side taken care of as well. However, since you're trying to optimize a request to an external service, this probably isn't the solution in this scenario.
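If you take the "write your own" route, a very rough PHP/MySQL sketch could look like the following. The table and column names are made up, there is no retry or contention handling, and it assumes InnoDB so the FOR UPDATE row locks work.

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// 1) In the user-facing request: just record the email - fast, no outbound call
$pdo->prepare("INSERT INTO email_queue (recipient, subject, body, status) VALUES (?, ?, ?, 'pending')")
    ->execute([$to, $subject, $body]);   // $to/$subject/$body come from the current request

// 2) In worker.php (run by cron every minute, or in a loop): claim and send pending emails
$pdo->beginTransaction();
$rows = $pdo->query("SELECT * FROM email_queue WHERE status = 'pending' ORDER BY id LIMIT 50 FOR UPDATE")
            ->fetchAll(PDO::FETCH_ASSOC);
foreach ($rows as $row) {
    // ... call SendGrid here ...
    $pdo->prepare("UPDATE email_queue SET status = 'sent' WHERE id = ?")->execute([$row['id']]);
}
$pdo->commit();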
Queueing Emails
On the topic of queueing emails specifically, this is common among email senders. As with everything else, it means you get better reliability (because if a service down the line fails, you can keep the message in the queue and retry).
With email, however, there are specific services out there for queueing messages: SMTP servers. Theoretically you can set up a server like sendmail, set SendGrid as your "smarthost" or relay, and have the server forward to SendGrid. It then queues, deals with service interruptions, and sends mail with very little additional code. However, SMTP servers are a pain to deal with, even when they're just forwarding messages. Additionally, SMTP is even slower than HTTP at establishing a connection, so it's probably not what you want, but it's good to know about.
Another possible solution, if you control your own server environment, that will speed up both your email sending and your application, is to install a mail server such as Postfix locally. You then configure Postfix to use your SendGrid credentials, so any email sent goes from your server to SendGrid.
This is not a PHP solution, but it removes the need to write your own custom solution. If you set Postfix as the default mail server, you can then just use the PHP mail() function to send email.
https://sendgrid.com/docs/Integrate/Mail_Servers/postfix.html
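Once Postfix relays through SendGrid (the page above covers the relayhost and SASL settings), the application side shrinks to PHP's built-in mail() call, which hands the message to the local queue and returns almost immediately; the addresses below are placeholders.

<?php
// mail() drops the message into the local Postfix queue;
// Postfix then relays it to SendGrid in the background.
mail(
    'customer@example.com',
    'Your order confirmation',
    "Thanks for your order!\n",
    "From: noreply@yourdomain.com\r\n"
);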

Distributed video encoding - Gearman vs Beanstalkd

I'm looking to build a distributed video encoding cluster of a few dozen machines. I've never worked with a message queue before, but the two I started playing around with were Gearman and Beanstalkd.
Beanstalkd seems to be a lot simpler and easier to use than Gearman, but it's not as feature-rich.
One thing I don't understand is: how do you spawn new workers on all the servers? I plan to use PHP. Is it as simple as running worker.php from the CLI with "&" and just having it sit there waiting for work?
I noticed Gearman doesn't actually kill the process after a job is done, but Beanstalkd does, so I have to restart the script after every job, on every server.
Currently I'm more inclined to use Beanstalkd; the general flow I planned was:
Run a minutely cron on each server that checks whether the pre-defined number of workers is running. If there are fewer than there should be, spawn new worker processes. Each process will take roughly 2-30 minutes.
Maybe I have a flaw in my logic here? Let me know what would be a "better" or "proper" way of doing this.
Some terminology, just to be clear...
There is the concept of a producer and a consumer. The producer generates jobs that are put on a queue (i.e. the beanstalkd service), which is then read by a consumer.
There are multiple ways to write a consumer. You can either run it every X time frame via a cron job, or have a consumer running in a while(true) loop in PHP (or whatever you prefer).
Where to install the service really depends on what you are going after. I normally install the service either on one of the consumers or on its own separate box (though the latter is sometimes overkill, depending on your needs).
If you want durability on the queue side, you should use Beanstalkd's binlog parameter (-b <binlog directory>). If something happens to your beanstalkd service, this will allow you to restart it with minimal (if any) loss of data from the queues. Durability on the producer side can come from having multiple queues to try against.
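A consumer in the while-loop style could look roughly like this, assuming the pda/pheanstalk client and a made-up tube name; beanstalkd itself would be started with something like beanstalkd -b /var/lib/beanstalkd so the queue survives restarts.

<?php
// worker.php - started in the background on each encoding box (e.g. "php worker.php &")
require 'vendor/autoload.php';
use Pheanstalk\Pheanstalk;

$queue = Pheanstalk::create('queue-host');
$queue->watch('encode-jobs');

while (true) {
    $job  = $queue->reserve();                  // blocks until a job arrives
    $spec = json_decode($job->getData(), true);
    // ... run the encoder on $spec['source'] here (the 2-30 minute part) ...
    $queue->delete($job);                       // delete only after the encode succeeded
}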

Threads in PHP?

I am creating a web application using Zend. I have built an interface from which user A can send email to more than one user, and it works well, but it slows down execution, so user A has to wait too long for the acknowledgement response (which is shown after the emails have been sent).
In Java there are threads, with which we could perform that task (sending the emails) without slowing down the rest of the application.
Is there any technique in PHP/Zend, like threads in Java, by which we can split off tasks that take a long time, e.g. sending emails?
EDIT (thanks #Efazati, there seems to be new development in this direction)
http://php.net/manual/en/book.pthreads.php
Caution (from the bottom of the linked page):
pthreads was, and is, an experiment with pretty good results. Any of its limitations or features may change at any time; [...]
/EDIT
No threads in PHP!
The workaround is to store jobs in a queue (say, rows in a table with the emails) and have a cron job call your PHP script at a given interval (say, 2 minutes) to poll for jobs. When jobs are present, fetch a few (depending on your PHP install's timeout) and send the emails. A rough sketch follows the gotchas list below.
The main idea is to defer execution:
main script adds jobs to the queue
cron script sends them in tiny slices
Gotchas:
make sure you don't send an email without removing it from the queue (the worst case would be a user receiving the same spam every 2 minutes ...)
make sure you don't delete a job without executing it first ...
handle bouncing emails using a scoring algorithm
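A rough sketch of such a cron-driven sender (the table, columns, and use of mail() are just for illustration) that tries to respect both gotchas by claiming rows first and only marking them sent after the send succeeds; note it is not fully race-free if two cron runs overlap.

<?php
// send_queue.php - called by cron every 2 minutes
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// claim a small batch so one run stays well under the cron interval / PHP timeout
$pdo->exec("UPDATE email_queue SET status = 'sending' WHERE status = 'queued' ORDER BY id LIMIT 20");

$rows = $pdo->query("SELECT * FROM email_queue WHERE status = 'sending'")->fetchAll(PDO::FETCH_ASSOC);
foreach ($rows as $row) {
    $sent = mail($row['recipient'], $row['subject'], $row['body']);
    // gotcha 1: mark the row as soon as it is sent, so it is never sent twice
    // gotcha 2: never drop a row that was not sent - requeue it and count the failure
    $sql = $sent
        ? "UPDATE email_queue SET status = 'sent' WHERE id = ?"
        : "UPDATE email_queue SET status = 'queued', failures = failures + 1 WHERE id = ?";
    $pdo->prepare($sql)->execute([$row['id']]);
}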
You could look into using multiple processes, such as with fork. The communication between them wouldn't be as simple as with threads (but then, it won't come with all of their pitfalls either), but if you're just sending emails, it might not be necessary to communicate much, if at all.
Watch out for doing forks on an Apache process. You may get some behaviors that you are not expecting. If you are looking to do any kind of asynchronous execution it should be via some kind of queuing mechanism. Gearman is one. Zend Server Job Queue is another. I have some demo code at Do you queue? Introduction to the Zend Server Job Queue. Cron can be used, but you'll have the problem of depending on your cron scheduler to run tasks whereas asynchronous computing often needs to be run immediately. Using a queuing system allows you to do that without threading.
There is a Threading extension being developed based on PThreads that looks promising at https://github.com/krakjoe/pthreads
There is pcntl, which allows you to create sub-processes, but PHP doesn't work very well for this kind of architecture. You're probably better off creating a long-running script (a daemon) and spawning multiple instances of it.
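A minimal pcntl sketch of that idea (CLI only, per the Apache caution above); sendEmails() and $recipients are placeholders for your own code.

<?php
// requires the pcntl extension; intended for CLI scripts, not mod_php under Apache
$pid = pcntl_fork();
if ($pid === -1) {
    exit("could not fork\n");
} elseif ($pid === 0) {
    // child process: do the slow work, then exit
    sendEmails($recipients);
    exit(0);
}
// parent process: respond to the user straight away; in a long-lived daemon,
// remember to reap finished children with pcntl_wait() or a SIGCHLD handler
echo "Emails are being sent in the background\n";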
As of now, there are no threads in PHP. However, you can have a look at this roundabout way:
http://www.alternateinterior.com/2007/05/multi-threading-strategies-in-php.html
You may want to use a queue system for your email sending and send the email from another system which supports threads. PHP is just a tool, and you should use the tool that is best suited for the job.
PHP doesn't include threading as part of the language, there are some methods that can emulate it but they aren't foolproof.
This Google search shows a few potential workarounds

Anatomy of a Distributed System in PHP

I have a problem that is giving me a hard time figuring out the ideal solution, so, to better explain it, I'm going to lay out my scenario here.
I have a server that will receive orders from several clients. Each client will submit a set of recurring tasks that should be executed at specified intervals, e.g. client A submits task AA, which should be executed every minute between 2009-12-31 and 2010-12-31; if my math is right, that's about 525,600 operations in a year. Given more clients and tasks, it would be infeasible to let the server process all these tasks itself, so I came up with the idea of worker machines. The server will be developed in PHP.
Worker machines are just regular, cheap Windows-based computers that I'll host at home or at my workplace; each worker will have a dedicated Internet connection (with a dynamic IP) and a UPS to avoid power outages. Each worker will query the server every 30 seconds or so via web service calls, fetch the next pending job, and process it. Once the job is completed, the worker will submit the output to the server and request a new job, and so on ad infinitum. If there is a need to scale the system, I just set up a new worker and the whole thing should run seamlessly. The worker client will be developed in PHP or Python.
At any given time my clients should be able to log on to the server and check the status of the tasks they ordered.
Now here is where the tricky part kicks in:
I must be able to reconstruct the already-processed tasks if for some reason the server goes down.
The workers are not client-specific; one worker should be able to process jobs for any number of clients.
I have some doubts regarding the general database design and which technologies to use.
Originally I thought of using several SQLite databases and joining them all on the server, but I can't figure out how I would group by client to generate the job reports.
I've never actually worked with any of the following technologies: memcached, CouchDB, Hadoop and the like, but I would like to know if any of them is suitable for my problem, and if so, which you would recommend to a newbie to "distributed computing" (or is this parallel computing?) like me. Please keep in mind that the workers have dynamic IPs.
Like I said before, I'm also having trouble with the general database design, partly because I still haven't chosen a particular (R)DBMS, but one issue I have, which I think is agnostic to the DBMS I choose, is related to the queuing system... Should I precalculate all the absolute timestamps for a specific job and have a large set of timestamps, executing and flagging them as complete in ascending order, or should I have a more clever system like "when timestamp modulus 60 == 0 -> execute"? The problem with this "clever" system is that some jobs would not be executed in the order they should be, because some workers could be sitting idle while others are overloaded. What do you suggest?
PS: I'm not sure if the title and tags of this question properly reflect my problem and what I'm trying to do; if not please edit accordingly.
Thanks for your input!
@timdev:
The input will be a very small JSON-encoded string; the output will also be a JSON-encoded string, but a bit larger (on the order of 1-5 KB).
The output will be computed using several resources available on the Web, so the main bottleneck will probably be the bandwidth. Database writes may also be one, depending on the (R)DBMS.
It looks like you're on the verge of recreating Gearman. Here's the introduction for Gearman:
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
You can write both your client and the back-end worker code in PHP.
Re your question about a Gearman Server compiled for Windows: I don't think it's available in a neat package pre-built for Windows. Gearman is still a fairly young project and they may not have matured to the point of producing ready-to-run distributions for Windows.
Sun/MySQL employees Eric Day and Brian Aker gave a tutorial for Gearman at OSCON in July 2009, but their slides mention only Linux packages.
Here's a link to the Perl CPAN Testers project, that indicates that Gearman-Server can be built on Win32 using the Microsoft C compiler (cl.exe), and it passes tests: http://www.nntp.perl.org/group/perl.cpan.testers/2009/10/msg5521569.html But I'd guess you have to download source code and build it yourself.
Gearman seems like the perfect candidate for this scenario; you might even want to virtualize your Windows machines into multiple worker nodes per machine, depending on how much computing power you need.
Also, the persistent queue system in Gearman prevents jobs from getting lost when a worker or the Gearman server crashes. After a service restart, the queue just continues where it left off before the crash/reboot; you don't have to take care of any of this in your application, which is a big advantage and saves a lot of time/code.
Working out a custom solution might work, but the advantages of Gearman, especially the persistent queue, suggest to me that this may well be the best solution for you at the moment. I don't know of a Windows binary for Gearman, though, but I think it should be possible.
A simpler solution would be to have a single database with multiple PHP nodes connected to it. If you use a proper RDBMS (MySQL + InnoDB will do), you can have one table act as a queue. Each worker then pulls tasks from it to work on and writes the results back into the database upon completion, using transactions and locking to synchronise. This depends a bit on the size of the input/output data; if it's large, this may not be the best scheme.
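A sketch of that table-as-a-queue idea with InnoDB row locking, so two workers cannot claim the same task; the schema, column names, and runTask() are illustrative only.

<?php
// run on each worker node in a loop: claim the next pending task atomically
$pdo = new PDO('mysql:host=master-db;dbname=jobs', 'worker', 'secret');

$pdo->beginTransaction();
$task = $pdo->query("SELECT id, payload FROM tasks
                     WHERE status = 'pending' AND run_at <= NOW()
                     ORDER BY run_at LIMIT 1 FOR UPDATE")->fetch(PDO::FETCH_ASSOC);
if ($task) {
    $pdo->prepare("UPDATE tasks SET status = 'running', worker = ? WHERE id = ?")
        ->execute([gethostname(), $task['id']]);
}
$pdo->commit();   // after the commit, other workers no longer see the row as pending

if ($task) {
    $output = runTask($task['payload']);   // placeholder for the actual work
    $pdo->prepare("UPDATE tasks SET status = 'done', output = ? WHERE id = ?")
        ->execute([$output, $task['id']]);
}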
I would avoid SQLite for this sort of task. Although it is a very nice database for small apps, it does not handle concurrency very well: it has only one locking strategy, which is to lock the entire database and keep it locked until a single transaction is complete.
Consider Postgres, which has industrial-strength concurrency and lock management and can handle multiple simultaneous transactions very nicely.
Also, this sounds like a job for queuing! If you were in the Java world I would recommend a JMS-based architecture for your solution. There is a 'dropr' project that does something similar in PHP, but it's all fairly new, so it might not be suitable for your project.
Whichever technology you use, you should go for a "free market" solution where the worker threads consume available "jobs" as fast as they can, rather than a "command economy" where a central process allocates tasks to chosen workers.
The setup of a master server and several workers looks right in your case.
On the master server I would install MySQL (Percona InnoDB version is stable and fast) in master-master replication so you won't have a single point of failure.
The master server will host an API which the workers will poll every N seconds. The master checks whether a job is available; if so, it flags the job as assigned to worker X and returns the appropriate input to the worker (all of this over HTTP).
Also, here you can store all the script files of the workers.
On the workers, I would strongly suggest you install a Linux distro. On Linux it's easier to set up scheduled tasks, and in general I think it's more appropriate for the job.
With Linux you can even create a live CD or ISO image with a perfectly configured worker and install it quickly and easily on all the machines you want.
Then set up a cron job that will rsync with the master server to update/modify the scripts. This way you change the files in just one place (the master server) and all the workers get the updates.
In this configuration you don't care about the IPs or the number of workers, because the workers connect to the master, not vice versa.
The worker's job is pretty easy: ask the API for a job, do it, send back the result via the API. Rinse and repeat :-)
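The worker loop described here can stay very small; in the sketch below the API routes, JSON fields, and runTask() are made up for illustration.

<?php
// worker.php - runs forever on each worker machine
while (true) {
    $raw = file_get_contents('https://master.example.com/api/next-job?worker=' . gethostname());
    $job = json_decode($raw, true);
    if (empty($job)) {
        sleep(30);                 // nothing to do, poll again in 30 seconds
        continue;
    }
    $output = runTask($job);       // placeholder for the actual processing

    // POST the result back to the master
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => json_encode(['job_id' => $job['id'], 'output' => $output]),
    ]]);
    file_get_contents('https://master.example.com/api/complete-job', false, $context);
}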
Rather than re-inventing the queuing wheel via SQL, you could use a messaging system like RabbitMQ or ActiveMQ as the core of your system. Each of these systems provides the AMQP protocol and has hard-disk backed queues. On the server you have one application that pushes new jobs into a "worker" queue according to your schedule and another that writes results from a "result" queue into the database (or acts on it some other way).
All the workers connect to RabbitMQ or ActiveMQ. They pop work off the work queue, do the job, and put the response into another queue. After they have done that, they ACK the original job request to say "it's done". If a worker drops its connection, the job will be restored to the queue so another worker can do it.
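With php-amqplib (an assumption; the method names below follow recent versions of that library and the queue names are made up), the producer and worker sides of this could look roughly like:

<?php
require 'vendor/autoload.php';
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();
$channel->queue_declare('work', false, true, false, false);   // durable, disk-backed queue

// producer (on the server): push a job according to the schedule
$channel->basic_publish(
    new AMQPMessage(json_encode(['task' => 'AA', 'client' => 'A']),
                    ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]),
    '', 'work'
);

// worker: take one job at a time, ACK only after it is done
$channel->basic_qos(null, 1, null);
$channel->basic_consume('work', '', false, false, false, false, function ($msg) {
    $result = runTask(json_decode($msg->body, true));   // runTask() is a placeholder
    // ... publish $result to a results queue here ...
    $msg->ack();   // un-ACKed jobs are re-queued if the worker disconnects
});
while ($channel->is_consuming()) {
    $channel->wait();
}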
Everything other than the queues (job descriptions, client details, completed work) can be stored in the database. But anything realtime should be put somewhere else. In my own work I'm streaming live power usage data and having many people hitting the database to poll it is a bad idea. I've written about live data in my system.
I think you're going in the right direction with a master job distributor and workers. I would have them communicate via HTTP.
I would choose C, C++, or Java for the clients, as they have the capability to run scripts (execvp in C, System.Desktop.something in Java). Jobs could just be the name of a script and the arguments to that script. You can have the clients return a status for each job; if a job fails, you can retry it. You can have the clients poll for jobs every minute (or every X seconds and let the server sort out the jobs).
PHP would work for the server.
MySQL would work fine for the database. I would just use two timestamp columns: start and end. On the server, I would look for WHEN SECONDS == 0.
