I'm looking for PHP solutions for running cronjobs over multiple servers and guaranteeing that only a single server runs these cronjobs automatically. We have cronjobs that we need to run only once, like daily digest emails or weekly reports.
Right now, we have a "master" server which has the crontab installed, and multiple "normal" servers which only have Apache installed on them. The issue is that if the "master" server fails, nobody runs the cronjobs anymore. It also means we need to keep track of which server is the master, and it's creating some scaling issues for us.
Are there any ready-made php solutions for running unique tasks by multiple servers?
I have looked at Gearman (http://www.slideshare.net/felixdv/high-gear-php-with-gearman), but it's a little too complex. I just need to guarantee that only one server out of the farm runs the cronjobs.
I build my concurrency and 'running' checks in the scripts themselves.
Update a 'lock' in the database or in memcached that states the last execution time.
If that lock is present, bail in the other copies of the script... unless the lock is too old.
If the lock has been sitting around > max_execution, the script failed or ran too long and never unlocked. Email yourself on that condition.
Remember to unset the lock at script close.
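A minimal sketch of that pattern using the Memcached extension (the key name, timeout and the send_daily_digest() job are illustrative assumptions):

<?php
// Acquire a lock so only one server in the farm runs this job.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$lockKey      = 'cron:daily_digest';
$maxExecution = 600; // seconds; the lock expires on its own if the script dies

// add() fails if the key already exists, so only one server wins the lock
if (!$mc->add($lockKey, gethostname(), $maxExecution)) {
    exit; // another server is (or recently was) running this job
}

try {
    send_daily_digest(); // hypothetical: the actual cron job
} finally {
    $mc->delete($lockKey); // unset the lock at script close
}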
I don't know if you want a free solution or a paid one.
Anyway, the only solution I can think of right now is Tivoli Workload Scheduler (TWS), which is provided by IBM.
As far as I know it works on both Windows and Linux/Unix machines.
http://www-01.ibm.com/software/tivoli/products/scheduler/
Context
I'm currently implementing a feature to schedule notifications for a specific period through a web form using PHP and Firebase.
To send the notification I use Firebase, which delivers notifications to Android/iOS.
To schedule the notification I use the Linux at service, as it seems a better fit than cron: cron runs at recurring intervals, whereas at runs once at a specific time.
man page about the AT: man page AT
Sample code
echo "/usr/bin/php send_notification.php" | at 15:40 2021-07-11
This creates a job file on Linux that will run only once, at 2021-07-11 15:40.
Problems
The AT service, like CRON, creates files inside a directory on the operating system that represent the jobs.
1 - If the machine on AWS is scaled out, the jobs would likely be duplicated and consequently send notifications more than once. (Note: I don't know much about machine scaling, but I believe this would happen.)
2 - If the machine is down, for example while some new functionality is being deployed, I believe that as things currently stand the job would not be executed.
3 - Another problem, though not the main one, would arise if I were using a Docker container. Since Ubuntu + PHP are inside the container, the job files would probably be lost if I restarted the container. A volume would probably solve that, but it is not my problem right now, as the application currently uses only one machine on AWS EB with the PHP image.
Doubts
Is there any solution I can apply to solve this duplicate job problem using PHP?
Is the approach using at the most suitable? I see a lot of people suggesting cron, but cron runs the job repeatedly, and that's not what I'm looking for.
I think you need a place where scheduled and finished notifications are persisted, independently of whether you use cron or at.
If I had such a task, I would go with a solution like this: run a special script, "scheduler.php", every minute (or more, e.g. every 5 minutes) via cron. It checks some log file (or a remote database, in the case of several machines) and looks for new lines. If a new line is present, and it contains a timestamp in the past and the status "scheduled", the script locks it and runs your "sender.php". After that it marks the line as "done". Each line in the storage should contain a timestamp to run at and one of three statuses: "scheduled", "running" and "done".
With such an approach you can plan new notifications by adding a line with the needed time and the status "scheduled" to the storage. Note that there can be a small delay between the scheduled time and the actual notification, depending on the cron interval, but I suppose that is not critical.
This allows you to run any number of crons on different machines and guarantees that each job will be done once.
Important: if you adopt this scheme, be sure that your scheduler.php reads and updates the storage in a single atomic operation, to prevent race conditions between the several crons. File locks or "SELECT ... FOR UPDATE" will do.
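A rough sketch of that scheduler using PDO/MySQL with "SELECT ... FOR UPDATE" (the notifications table and its columns are assumptions):

<?php
// scheduler.php -- run from cron every minute on each machine.
$pdo = new PDO('mysql:host=db-host;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->beginTransaction();

// Lock one due row so concurrent schedulers on other machines skip it
$job = $pdo->query(
    "SELECT id FROM notifications
     WHERE status = 'scheduled' AND run_at <= NOW()
     ORDER BY run_at LIMIT 1 FOR UPDATE"
)->fetch(PDO::FETCH_ASSOC);

if ($job) {
    $pdo->prepare("UPDATE notifications SET status = 'running' WHERE id = ?")
        ->execute([$job['id']]);
    $pdo->commit(); // the row is now claimed by this machine

    shell_exec('php sender.php ' . (int) $job['id']); // send the notification

    $pdo->prepare("UPDATE notifications SET status = 'done' WHERE id = ?")
        ->execute([$job['id']]);
} else {
    $pdo->commit(); // nothing due right now
}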
I'm looking for better solution to handling our cron tasks in a load balanced environment.
Currently have:
PHP application running on 3 CentOS servers behind a load balancer.
Tasks that need to be run periodically but only on a single machine at a time.
Good old cron set up to run those tasks on the first server.
Problems if the first server is out of play for whatever reason.
Looking for:
Something more robust and de-centralized.
Load balancing the tasks, so each task still runs only once but on random/different servers, to spread the load.
Ensuring the tasks still run when the first server goes down.
Being able to manage tasks and see aggregate reports ideally using a web interface.
Notifications if anything goes wrong.
The solution doesn't need to be implemented in PHP but it would be nice as it would allow us to easily tweak it if needed.
I have found two projects that look promising: GNUBatch and Job Scheduler. I will most likely test both further, but I wonder if someone has a better solution for the above.
Thanks.
You can use this small library that uses redis to create a temporary timed lock:
https://github.com/AlexDisler/MutexLock
The servers should be identical and have the same cron configuration. The server that will be first to create the lock will also execute the task. The other servers will see the lock and exit without executing anything.
For example, in the php file that executes the scheduled task:
MutexLock\Lock::init([
    'host' => $redisHost,
    'port' => $redisPort
]);

// check if a lock was already created;
// if it was, it means that another server is already executing this task
if (!MutexLock\Lock::set($lockKeyName, $lockTimeInSeconds)) {
    return;
}

// if no lock was created, execute the scheduled task
scheduledTaskThatRunsOnlyOnce();
To run the tasks in a de-centralized way and spread the load, take a look at: https://github.com/chrisboulton/php-resque
It's a PHP port of the Ruby version of Resque, and it stores the data in the exact same format, so you can use https://github.com/resque/resque-web or http://resqueboard.kamisama.me/ to monitor the workers and see reports.
Assuming you have a database available that is not hosted on one of those 3 servers:
Write a "wrapper" script that goes in cron and takes the program you're running as its argument. The very first thing it does is connect to the remote database and check when the last entry was inserted into a table (created for this wrapper). If the last insertion is older than the job's interval (i.e. the job hasn't run when it was supposed to), insert a new record into the table with the current time and execute the wrapper's argument (your cron job).
Cron up the wrapper on each server, each set X minutes behind the other (server A runs at the top of the hour, server B runs at 5 minutes, C at 10 minutes, etc).
The first server will always execute the cron first, so the other two servers never will. If the first server goes down, the second server will see that it hasn't run, and will run it.
If you also record in the table which server it was that executed the job, you'll have a log of when/where the script was executed.
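A rough sketch of such a wrapper (the cron_runs table, its columns and the hourly interval are assumptions):

<?php
// cron_wrapper.php -- usage: php cron_wrapper.php '<command to run>'
if (empty($argv[1])) {
    fwrite(STDERR, "usage: php cron_wrapper.php '<command>'\n");
    exit(1);
}
$command  = $argv[1];
$interval = 3600; // how often the job is supposed to run, in seconds

$pdo = new PDO('mysql:host=remote-db;dbname=app', 'user', 'pass');

// When did any server last run this job?
$stmt = $pdo->prepare('SELECT MAX(ran_at) FROM cron_runs WHERE job = ?');
$stmt->execute([$command]);
$lastRun = $stmt->fetchColumn();

// Only run if nobody else ran it within the current interval
if (!$lastRun || strtotime($lastRun) <= time() - $interval) {
    $pdo->prepare('INSERT INTO cron_runs (job, ran_at, server) VALUES (?, NOW(), ?)')
        ->execute([$command, gethostname()]);
    passthru($command); // execute the actual cron job
}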
Wouldn't this be an ideal situation for using a message / task queue?
I ran into the same problem but came up with this little repository:
https://github.com/incapption/LoadBalancedCronTask
Right now, we have a single server with a crontab that sends out daily emails. We would like to scale that server. The application is a standard Zend Framework application deployed on a CentOS server in the Amazon cloud.
We already took care of the load balancing, content management and deployment. However, the cron job is still an issue for us, as we need to guarantee that some jobs are performed only once.
For example, the daily emails cron job must only be executed once, by a single server. I'm looking for the best method to guarantee that only one server will execute it, and only once.
I'm thinking about 2 solutions, but I was wondering if someone else has had the same issue.
Make one of the servers the "master", which alone sends out the daily emails. That will be an issue if the server malfunctions, and generally we don't want to have a "special" server. It would also mean we need to keep track of which server is the master.
Have a queue of scheduled tasks to be performed. Each server opens that queue and sees which tasks need to be performed. The first server to "grab" a task performs it and marks it as done. I was looking at Amazon Simple Queue Service as a solution for the queue.
Both solutions have advantages and disadvantages, and I was wondering if someone has thought of something else that might help us here.
When you need to scale out cron jobs, you are better off using a job manager like Gearman
Beanstalkd could also be an option for you.
I had the same problem. What I did was dead simple.
I spun up the cheapest EC2 instance on AWS.
I created the cronjob(s) only on this server.
The cron jobs just make a simple request to my endpoint / API (i.e. api.mydomain.com).
On my API, I just have a route watching for these special requests that runs the job I want. So basically, instead of running the task directly from a cron job, I'm running the task via an HTTP request.
I hope that makes sense! Now it doesn't matter how many servers you have, it will just scale! Also, your cron job server's only function is to run dead simple jobs that send a request, nothing more.
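As a rough illustration (the route, header name and send_daily_digest() function below are made-up examples, not a prescribed API), the crontab entry on the small instance is just a curl call, and the endpoint handler can be as simple as:

# crontab on the cheap EC2 instance
# 0 6 * * * curl -s -H "X-Cron-Token: $CRON_TOKEN" https://api.mydomain.com/cron/daily-digest

<?php
// Endpoint handler behind the /cron/daily-digest route.
// The shared-secret header keeps random visitors from triggering the job.
if (($_SERVER['HTTP_X_CRON_TOKEN'] ?? '') !== getenv('CRON_TOKEN')) {
    http_response_code(403);
    exit;
}

send_daily_digest();     // hypothetical: the job formerly run directly by cron
http_response_code(204); // nothing to render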
I am using MySQL as my database and PHP as my programming language. I want to run a cron job that runs until the current system date matches the "deadline" (date) column in my database table called "PROJECT". Once the dates are the same, an update query has to run that changes the status (a field of the PROJECT table) from "open" to "close".
I am not really sure if cron jobs are the best way, or whether I could use triggers or maybe something else. Also, I am using Apache as my web server and my OS is Windows Vista.
Which is the best way to do it: PHP scheduler, cron jobs, or some other method? Can anybody enlighten me?
I think your concept needs to change.
PHP cannot schedule a job, and neither can MySQL. Triggers in MySQL execute when a query occurs, not at a specific time, so neither gives you time-based execution on its own.
This limitation usually isn't a problem in web development, because your PHP application should control all data going in and out. Usually this means the HTML that displays that data, or other formats served to users or to other programs.
In your case you can think about it this way: the deadline is a set date. You can treat it as data and save it to your database. When the deadline occurs is not important; what matters is that the data in your database is viewed correctly.
When a request is made to your application, check whether the deadline is in the past; if it is, display the project as closed - or update its status to closed just before display.
There really is no reason to update data independently of your PHP application.
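For example, a lazy check along those lines with PDO (connection details are placeholders; the table and column names follow the question) can be as simple as:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Flip any project whose deadline has passed from "open" to "close",
// right before fetching it for display.
$pdo->exec("UPDATE project SET status = 'close'
            WHERE status = 'open' AND deadline < CURDATE()");

// ...then query and display the project(s) as usual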
Usually, the only things you want to schedule are jobs that would affect your application in terms of load, or that need to be done only once, or where concurrency or time is an issue.
In your case none of those apply.
PS: I haven't tried PHPscheduler but I can guess it isn't a true scheduler. Cron is a daemon that sleeps until a given task is due in its queue, executes the task, then sleeps till the next one is due (at least that's what it does in the current algorithm). PHP cannot do that without the sockets and fork extensions, or a special setup. So PHPscheduler is most likely just checking if a date for a task has expired, on each load of a webpage (whenever PHP executes a page). This is no different than you just checking if the date on the project has expired, without the overhead of PHPscheduler.
I would always go for a cron job for anything scheduling related.
The big bonus point is that you can echo info out as well and it gets emailed to you.
You'll find once you start using cronjobs, it's hard to stop.
Cron does not exist, per se, on Vista, but what does exist is the standard Windows task scheduler, which you can point at a command line like "php -q -f myfile.php" to execute the PHP file at the given time.
You can also use a port of the cron program; there are many out there.
If it is not critical to the second, any Windows scheduling application will do; just be sure to have your PHP bin path in your PATH variable for simplicity.
For Windows CRON jobs I cannot recommend PyCron enough.
While CRON and Windows Scheduled Tasks are the tried and true ways of scheduling jobs/tasks to run on a regular basis, there are use cases where having a different scheduled task in CRON/Windows can become tedious. Namely when you want to let users schedule things to run, or for instances where you prefer simplicity/maintainability/portability/etc or all of the above.
In cases where I prefer to not use CRON/Windows for scheduled tasks, I build into the application a task scheduling system. This still requires 1 CRON job or Windows Task to be scheduled. The idea is to store Job details in the database (job name, job properties, last run time, run interval, anything else that is important for your implementation). You then schedule a "Master" job in CRON or Windows which handles running all of your other jobs for you. You'll need this master job to run at least as often as your shortest interval; if you want to be able to schedule jobs that run every minute the master job needs to run every minute.
You can then launch each scheduled job in the background from PHP with minimal effort (if you want). In memory constrained systems you can monitor memory usage or keep track of the PIDs (various methods) and limit to N jobs running at a given time.
I've had a great deal of success with this method, YMMV however based on your needs and your implementation.
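For illustration, the "master" job could look roughly like this (the jobs table and its columns are assumptions specific to this sketch):

<?php
// master.php -- the single cron / Scheduled Task entry, run every minute.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Find every job whose interval has elapsed since its last run
$jobs = $pdo->query(
    "SELECT id, command FROM jobs
     WHERE last_run IS NULL
        OR last_run <= DATE_SUB(NOW(), INTERVAL run_interval SECOND)"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($jobs as $job) {
    $pdo->prepare('UPDATE jobs SET last_run = NOW() WHERE id = ?')
        ->execute([$job['id']]);

    // Launch in the background so one slow job does not block the others
    exec(sprintf('php %s > /dev/null 2>&1 &', escapeshellarg($job['command'])));
}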
How about PHPscheduler? Isn't that better than cron jobs? I think crons would be independent of the application, and hence difficult to deal with if one has to change the host... I am not really sure though. It would be great if anyone could comment on this! Thanks!
I have a problem that is giving me a hard time figuring out the ideal solution, and, to explain it better, I'm going to describe my scenario here.
I have a server that will receive orders from several clients. Each client will submit a set of recurring tasks that should be executed at specified intervals, e.g.: client A submits task AA that should be executed every minute between 2009-12-31 and 2010-12-31; so if my math is right that's about 525,600 operations in a year. Given more clients and tasks, it would be infeasible to let the server process all these tasks, so I came up with the idea of worker machines. The server will be developed in PHP.
Worker machines are just regular cheap Windows-based computers that I'll host at my home or at my workplace; each worker will have a dedicated Internet connection (with dynamic IPs) and a UPS to avoid power outages. Each worker will also query the server every 30 seconds or so via web service calls, fetch the next pending job and process it. Once the job is completed the worker will submit the output to the server and request a new job, and so on ad infinitum. If there is a need to scale the system I should just set up a new worker and the whole thing should run seamlessly. The worker client will be developed in PHP or Python.
At any given time my clients should be able to log on to the server and check the status of the tasks they ordered.
Now here is where the tricky part kicks in:
I must be able to reconstruct the already processed tasks if for some reason the server goes down.
The workers are not client-specific; one worker should process jobs for any given number of clients.
I've some doubts regarding the general database design and which technologies to use.
Originally I thought of using several SQLite databases and joining them all on the server but I can't figure out how I would group by clients to generate the job reports.
I've never actually worked with any of the following technologies: memcached, CouchDB, Hadoop and the like, but I would like to know if any of these is suitable for my problem, and if so, which you would recommend to a newbie in "distributed computing" (or is this parallel computing?) like me. Please keep in mind that the workers have dynamic IPs.
Like I said before, I'm also having trouble with the general database design, partly because I still haven't chosen any particular RDBMS. One issue that I have, and which I think is agnostic to the DBMS I choose, is related to the queuing system... Should I precalculate all the absolute timestamps for a specific job and have a large set of timestamps, executing and flagging them as complete in ascending order, or should I have a cleverer system like "when timestamp modulo 60 == 0 -> execute"? The problem with this "clever" system is that some jobs would not be executed in the order they should be, because some workers could be sitting idle while others are overloaded. What do you suggest?
PS: I'm not sure if the title and tags of this question properly reflect my problem and what I'm trying to do; if not please edit accordingly.
Thanks for your input!
#timdev:
The input will be a very small JSON-encoded string; the output will also be a JSON-encoded string, but a bit larger (on the order of 1-5 KB).
The output will be computed using several available resources from the Web so the main bottleneck will probably be the bandwidth. Database writes may also be one - depending on the R(D)DBMS.
It looks like you're on the verge of recreating Gearman. Here's the introduction for Gearman:
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
You can write both your client and the back-end worker code in PHP.
Re your question about a Gearman Server compiled for Windows: I don't think it's available in a neat package pre-built for Windows. Gearman is still a fairly young project and they may not have matured to the point of producing ready-to-run distributions for Windows.
Sun/MySQL employees Eric Day and Brian Aker gave a tutorial for Gearman at OSCON in July 2009, but their slides mention only Linux packages.
Here's a link to the Perl CPAN Testers project, that indicates that Gearman-Server can be built on Win32 using the Microsoft C compiler (cl.exe), and it passes tests: http://www.nntp.perl.org/group/perl.cpan.testers/2009/10/msg5521569.html But I'd guess you have to download source code and build it yourself.
Gearman seems like the perfect candidate for this scenario; you might even want to virtualize your Windows machines into multiple worker nodes per machine, depending on how much computing power you need.
Also, the persistent queue system in Gearman prevents jobs from getting lost when a worker or the Gearman server crashes. After a service restart the queue just continues where it left off before the crash/reboot; you don't have to take care of all this in your application, and that is a big advantage and saves a lot of time/code.
Working out a custom solution might work, but the advantages of Gearman, especially the persistent queue, suggest to me that this might very well be the best solution for you at the moment. I don't know of a Windows binary for Gearman, though, but I think it should be possible.
A simpler solution would be to have a single database with multiple PHP nodes connected. If you use a proper RDBMS (MySQL + InnoDB will do), you can have one table act as a queue. Each worker will then pull tasks from it to work on and write the results back into the database upon completion, using transactions and locking to synchronise. This depends a bit on the size of the input/output data; if it's large, this may not be the best scheme.
I would avoid SQLite for this sort of task. Although it is a very wonderful database for small apps, it does not handle concurrency very well: it has only one locking strategy, which is to lock the entire database and keep it locked until a single transaction is complete.
Consider Postgres, which has industrial-strength concurrency and lock management and can handle multiple simultaneous transactions very nicely.
Also, this sounds like a job for queuing! If you were in the Java world I would recommend a JMS-based architecture for your solution. There is a 'dropr' project to do something similar in PHP, but it's all fairly new, so it might not be suitable for your project.
Whichever technology you use, you should go for a "free market" solution where the worker threads consume available "jobs" as fast as they can, rather than a "command economy" where a central process allocates tasks to chosen workers.
The setup of a master server and several workers looks right in your case.
On the master server I would install MySQL (Percona InnoDB version is stable and fast) in master-master replication so you won't have a single point of failure.
The master server will host an API which the workers will pull at every N seconds. The master will check if there is a job available, if so it has to flag that the job has been assigned to the worker X and return the appropriate input to the worker (all of this via HTTP).
Also, here you can store all the script files of the workers.
On the workers, I would strongly suggest you install a Linux distro. On Linux it's easier to set up scheduled tasks, and in general I think it's more appropriate for the job.
With Linux you can even create a live CD or ISO image with a perfectly configured worker and install it quickly and easily on all the machines you want.
Then set up a cron job that will rsync with the master server to update/modify the scripts. This way you change the files in just one place (the master server) and all the workers get the updates.
In this configuration you don't need to care about the IPs or the number of workers, because the workers connect to the master, not vice versa.
The worker job is pretty easy: ask the API for a job, do it, send back the result via API. Rinse and repeat :-)
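A minimal sketch of that worker loop (the API URLs and the run_job() helper are assumptions):

<?php
// worker.php -- pulls a job from the master via HTTP, runs it, posts the result.
$api = 'https://master.example.com/api';

while (true) {
    $response = @file_get_contents($api . '/job?worker=' . urlencode(gethostname()));
    $job = $response ? json_decode($response, true) : null;

    if ($job) {
        $output = run_job($job); // hypothetical: do the actual work

        $ctx = stream_context_create(['http' => [
            'method'  => 'POST',
            'header'  => 'Content-Type: application/json',
            'content' => json_encode(['id' => $job['id'], 'output' => $output]),
        ]]);
        file_get_contents($api . '/result', false, $ctx); // send the result back
    }

    sleep(30); // ask the master again in ~30 seconds
}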
Rather than re-inventing the queuing wheel via SQL, you could use a messaging system like RabbitMQ or ActiveMQ as the core of your system. Each of these systems provides the AMQP protocol and has hard-disk backed queues. On the server you have one application that pushes new jobs into a "worker" queue according to your schedule and another that writes results from a "result" queue into the database (or acts on it some other way).
All the workers connect to RabbitMQ or ActiveMQ. They pop work off the work queue, do the job and put the response into another queue. After they have done that, they ACK the original job request to say "it's done". If a worker drops its connection, the job is restored to the queue so another worker can do it.
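For illustration, a worker along those lines could look roughly like this with a recent version of the php-amqplib library (the queue names and the do_job() handler are assumptions):

<?php
require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$conn    = new AMQPStreamConnection('rabbit-host', 5672, 'guest', 'guest');
$channel = $conn->channel();
$channel->queue_declare('work', false, true, false, false);    // durable work queue
$channel->queue_declare('results', false, true, false, false); // durable result queue
$channel->basic_qos(null, 1, null); // hand each worker one unacked job at a time

$channel->basic_consume('work', '', false, false, false, false,
    function (AMQPMessage $msg) use ($channel) {
        $result = do_job(json_decode($msg->getBody(), true)); // hypothetical handler
        $channel->basic_publish(new AMQPMessage(json_encode($result)), '', 'results');
        $msg->ack(); // the job leaves the work queue only once it is acknowledged
    }
);

while ($channel->is_consuming()) {
    $channel->wait();
}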
Everything other than the queues (job descriptions, client details, completed work) can be stored in the database. But anything realtime should be put somewhere else. In my own work I'm streaming live power usage data and having many people hitting the database to poll it is a bad idea. I've written about live data in my system.
I think you're going in the right direction with a master job distributor and workers. I would have them communicate via HTTP.
I would choose C, C++, or Java to be clients, as they have capabilities to run scripts (execvp in C, System.Desktop.something in Java). Jobs could just be the name of a script and arguments to that script. You can have the clients return a status on the jobs. If the jobs failed, you could retry them. You can have the clients poll for jobs every minute (or every x seconds and make the server sort out the jobs)
PHP would work for the server.
MySQL would work fine for the database. I would just make two timestamps: start and end. On the server, I would look for WHEN SECONDS==0