I have a few scripts that need to run concurrently as separate processes. My plan is to have a cron job that executes multiple instances of these scripts at a set interval. Is this a good idea? What are the pros/cons to this approach? Are there any other options I need to consider?
Bottom line: I'm trying to mimic multithreading. Any race conditions will be handled via code (e.g. setting statuses in the DB, etc.). The scripts are supposed to do processing-intensive tasks (e.g. creating thumbnails, etc.).
You can use forking. The startup script would load all the default configurations and initializations, then fork child processes to do the processing. It could then monitor the processes to see if they are still running.
http://php.net/manual/en/function.pcntl-fork.php
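A minimal sketch of that startup-and-fork pattern, assuming the pcntl extension is available (it usually is on CLI builds); the config path and the process_thumbnails() helper are hypothetical:

<?php
// startup script: load shared config once, then fork children to do the heavy work
$config = parse_ini_file('/etc/myapp/config.ini'); // hypothetical config location

$children = array();
for ($i = 0; $i < 4; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("could not fork\n");
    } elseif ($pid === 0) {
        process_thumbnails($config, $i); // hypothetical worker function
        exit(0);
    }
    $children[] = $pid; // parent remembers each child PID
}

// parent: monitor the children and report how each one exited
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
    echo "child $pid exited with status " . pcntl_wexitstatus($status) . "\n";
}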
Well, if you need it as a cronjob, go ahead. If you want multiple processes, you most likely want to use pcntl_fork to create multiple instances of the same script.
Depending on how quickly you need to react to those jobs, and since you're doing processor-intensive tasks, you can also spread that processing out using a queuing system. Check out Gearman or beanstalkd, with multiple workers per machine if you have multiple cores/processors.
Doesn't PHP have fork()? While that's not really multithreading, it is a basic way to run work concurrently in separate processes.
One con of using cron is that it will execute a copy of your script at the interval you set regardless of how many script processes are already running. This means the scripts need a way to communicate with each other so that a maximum of N scripts are kept running concurrently (excess scripts can just exit immediately).
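One hedged way to enforce that cap from inside the script itself is to try to grab one of N lock files with flock() and exit if they are all taken; the lock path and the limit below are placeholders:

<?php
// allow at most N concurrent copies of this cron-launched script
$maxInstances = 4;
$lock = null;

for ($i = 0; $i < $maxInstances; $i++) {
    $fh = fopen("/tmp/myscript.lock.$i", 'c');
    if ($fh && flock($fh, LOCK_EX | LOCK_NB)) {
        $lock = $fh;   // grabbed slot $i; keep the handle open for the script's lifetime
        break;
    }
    if ($fh) {
        fclose($fh);
    }
}

if ($lock === null) {
    exit(0);           // N copies are already running, so this excess copy exits immediately
}

// ... processing-intensive work goes here ...

flock($lock, LOCK_UN);
fclose($lock);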
An alternative to cron could be supervisord which will execute a configurable number of scripts and monitor each one so any that exit are respawned.
Running an infinite loop in a cron job. Suppose I have written a PHP-based script to run on my server using a cron job, and I want to use an infinite loop in that PHP script. Any ideas for running an infinite loop in a cron job?
Infinite-looping applications are usually called daemons. They are system services that perform some kind of constant processing and/or stay ready to accept incoming work.
Gearman is a system daemon you can install that can handle various tasks you give it. It's a complex tool that allows many things, but it could be used to implement what you need.
PHP::Gearman is a Gearman client that talks to the Gearman daemon and sends tasks to the daemon specifying the conditions under which the task must be executed.
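As a rough illustration with the pecl/gearman extension installed, a worker and a client might look like this; the 'resize' function name and the payload are made up:

<?php
// worker.php - registers a callback and blocks waiting for jobs
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('resize', function (GearmanJob $job) {
    $task = json_decode($job->workload(), true);
    // ... create the thumbnail described by $task (hypothetical payload) ...
});
while ($worker->work());

<?php
// client.php - hands a job to the daemon and returns immediately
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('resize', json_encode(array('image_id' => 42)));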
The limitations that @Jeffrey emphasized about PHP are true because PHP was designed as a share-nothing architecture (one page load equals one script execution; each page load works under its own data context).
Perhaps System Daemon (a PEAR package) may assist in overcoming some or all of the limitations mentioned above. I haven't used it, so I can't tell you much more about it, but it's as good a place to start as any.
My app takes a long list of URLs and splits it into X parts (where X = $threads), then starts a thread.php for each part and hands it its share of the URLs. Each thread.php then makes GET and POST requests to retrieve data.
I am using this:
for ($x = 1; $x <= $threads; $x++) {
    // launch one background copy of thread.php and capture its PID
    $pid[] = exec("/path/bin/php thread.php <options> > /dev/null & echo \$!");
}
For "threading" (I know its not really threading, is it forking or what?), I save the pids into a file for later checking if N thread is running and to stop them.
Now I want to move away from PHP; I was thinking about using Python because I'd like to learn more about it.
How can I achieve this kind of "threading" with Python (or Ruby)?
Or is there a better way to launch multiple background threads in Python or Ruby that run in parallel (at the same time)?
The threads don't need to communicate with each other or with a main thread; they are independent. They make HTTP requests and interact with a MySQL DB, and they may need to access/modify the same table entries (I haven't thought about this or how I will solve it yet).
The app works with "projects", each project has a "max threads" variable and I use a web interface to control it (so I could still use php for the interface [starting/stopping threads] in the new app).
I wanted to use
from threading import Thread
in Python, but I've been told those threads won't run in parallel, only one at a time.
The app is intended to run on linux web servers.
Any suggestion will be appreciated.
For Python 2.6+, consider the multiprocessing module:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows
For Python 2.5, the same functionality is available via pyprocessing.
In addition to the example at the links above, here are some additional links to get you started:
multiprocessing Basics
Communication between processes with multiprocessing
You don't want threading. You want a work queue like Gearman that you can send jobs to asynchronously.
It's worth noting that this is a cross-platform, cross-language solution. There are bindings for many languages (including Python and PHP) provided officially, and many more unofficially that a bit of Googling will turn up.
The original intent is effectively load balancing, but it works just as well with only one machine. Basically, you can create one or more Workers that listen for Jobs. You can control the number of Workers and the types of Jobs they can listen for.
If you insert five Jobs into the queue at the same time, and there happen to be five Workers waiting, each Worker will be handed one of the Jobs. If there are more Jobs than Workers, the Jobs get handled sequentially. Your Client (the thing that submits Jobs) can either wait for all of the Jobs it's created to complete, or it can simply place them in the queue and continue on.
I'm building a system that watches a queue and activates a set of tasks on a regular interval.
I'm interested in running multiple instances of my processing "bots" based on how many items are in the queue. So if there are 5 items I'll run two bots, and if there are 10 I'll run four.
I know how to run multiple instances from CLI (manually), but how would I do this as a function of my application? And how would I properly track the creation and destruction of these bots?
It seems like cron (*nix) or Task Scheduler (Windows) would be what you need.
http://en.wikipedia.org/wiki/Cron
http://msdn.microsoft.com/en-us/library/aa383614%28VS.85%29.aspx
These can run a PHP script that determines how many "bots" need to run, does calculations, etc. Anything PHP is capable of.
Also, for running the multiple bots in the background (after the main controller script has finished executing) you may want to look at PHP process forking.
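A hedged sketch of such a controller script, launched by cron, that scales the number of bots to the queue depth and backgrounds them with exec (the DSN, table, and bot.php path are placeholders):

<?php
// controller.php - run by cron; starts roughly one bot per 2-3 queued items
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder DSN
$queued = (int) $db->query('SELECT COUNT(*) FROM queue')->fetchColumn();

$wanted  = (int) ceil($queued / 3);                      // 5 items -> 2 bots, 10 -> 4
$running = (int) trim(shell_exec('pgrep -fc bot.php'));  // bots already alive

for ($i = $running; $i < $wanted; $i++) {
    // detach each bot so this controller can finish immediately
    exec('nohup /usr/bin/php /path/to/bot.php > /dev/null 2>&1 & echo $!', $out);
    // $out accumulates the PIDs if you want to record them for later cleanup
}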
You might also want to look at gearman ( http://gearman.org/ )
Greetings All!
I am having some trouble figuring out how to execute thousands upon thousands of requests to a web service (eBay). I have a limit of 5 million calls per day, so there are no problems on that end.
However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.
Basically the flow is:
1) Get list of items from database (1,000 to 10,000 items)
2) Make a API POST request for each item
3) Accept return data, process data, update database
Obviously a single PHP instance running this in a loop would be impossible.
I am aware that PHP is not a multithreaded language.
I tried the CURL solution, basically:
1) Get list of items from database
2) Initialize multi curl session
3) For each item add a curl session for the request
4) Execute the multi curl session
So you can imagine 1,000-10,000 GET requests occurring...
This was OK; around 100-200 requests were occurring in about a minute or two. However, only 100-200 of the 1,000 items actually got processed, so I am thinking I'm hitting some sort of Apache or MySQL limit?
But this does add latency; it's almost like performing a DoS attack on myself.
I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the return data from the web service... and this needs to be done within 5 minutes at most.
I am using PHP and MySQL with the Zend Framework.
Thanks!
I've had to do something similar, but with Facebook, updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up, because the script spends most of its time waiting for a response.
You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.
exec('nohup /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId); // $processId[0] will hold the PID of the backgrounded script
You can pass parameters (getopt) to the PHP script on the command line to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see if the scripts are still running by checking for the process IDs. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
Combine multiple processes with multi-curl, and you should easily be able to do what you need.
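For the multi-curl half, here is a hedged sketch of batching, say, 50 requests at a time; the batch size and the id => url shape of $items are made up:

<?php
// fire a batch of requests in parallel with curl_multi and collect the response bodies
function fetch_batch(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    $active = 0;
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);      // wait for activity instead of busy-looping
        }
    } while ($active && $status == CURLM_OK);

    $results = array();
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// process the items in batches of 50 instead of queuing all 10,000 at once
foreach (array_chunk($items, 50, true) as $batch) {   // $items assumed to map id => url
    $bodies = fetch_batch($batch);
    // ... update the database from $bodies ...
}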
My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.
Mysqli allows multiple-statement queries, so you could definitely batch those database updates.
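A hedged example of batching those updates into one round trip with mysqli::multi_query; the table, columns, and credentials are invented:

<?php
// build one multi-statement query instead of thousands of single UPDATE round trips
$mysqli = new mysqli('localhost', 'user', 'pass', 'app'); // placeholder credentials

$sql = '';
foreach ($results as $itemId => $data) {               // hypothetical: id => API response
    $sql .= sprintf(
        "UPDATE items SET payload = '%s' WHERE id = %d;",
        $mysqli->real_escape_string($data),
        (int) $itemId
    );
}

if ($mysqli->multi_query($sql)) {
    // multi_query requires draining every result before issuing another query
    while ($mysqli->more_results() && $mysqli->next_result()) {
        // UPDATE statements return no result sets to fetch
    }
}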
The http requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an api call and stores the results in a file or memcached. The master can periodically read the results and update the db. (Careful to rotate the data store for safe reading and writing by multiple processes.)
To understand your requirements better: must you implement your solution only in PHP, or can you interface a PHP part with another part written in another language?
If you cannot go for another language, try to perform this update as a PHP script that runs in the background (from the CLI) rather than through Apache.
You can follow Brent Baisley's advice for a simple use case.
If you want to build a robust solution, then you need to:
set up a representation of the actions in a database table that will be your process queue;
set up a script that pops this queue and processes your actions (see the sketch after this list);
set up a cron job that runs this script every X minutes.
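A minimal sketch of that queue-popping script, assuming a hypothetical actions table with id, payload, status, and worker_pid columns, and MySQL's UPDATE ... LIMIT support; the DSN is a placeholder:

<?php
// worker.php - claims pending actions one at a time so parallel copies never collide
$db  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder DSN
$pid = getmypid();

while (true) {
    // atomically claim one pending row
    $claimed = $db->exec(
        "UPDATE actions SET status = 'running', worker_pid = $pid
         WHERE status = 'pending' LIMIT 1"
    );
    if (!$claimed) {
        break; // queue is empty
    }

    $row = $db->query(
        "SELECT id, payload FROM actions
         WHERE status = 'running' AND worker_pid = $pid"
    )->fetch(PDO::FETCH_ASSOC);

    // ... perform the eBay request described by $row['payload'] and store the result ...

    $db->prepare("UPDATE actions SET status = 'done' WHERE id = ?")
       ->execute(array($row['id']));
}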
This way you can have 1000 PHP scripts running, using your OS's parallelism capabilities, and not hanging when eBay is taking too long to respond.
The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting:
the number of requests one PHP script makes;
the order / number / type / priority of the actions in the queue;
the number of scripts the cron job runs.
Thanks everyone for the awesome and quick answers!
The advice from Brent Baisley and e-satis works nicely. Rather than executing the sub-processes using cURL like I did before, forking takes a massive load off, and it also nicely gets around the issue of maxing out my Apache connection limit.
Thanks again!
It is true that PHP is not multithreaded, but it can certainly be set up with multiple processes.
I have created a system that resembles the one you are describing. It's running in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.
It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with the full context taken from the database.
Here is a basic description of the system.
1. Start control process
2. Check database for new jobs
3. Spawn child process with the job data as a parameter
4. Keep a table of the child processes to be able to control the number of simultaneous processes (a rough sketch of steps 3 and 4 follows).
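A rough sketch of those two steps under the same assumptions; fetch_new_jobs() and run_job() are hypothetical helpers, and the concurrency cap is arbitrary:

<?php
// steps 3-4: fork one child per job, cap concurrency, and reap finished children
$maxChildren = 8;
$children    = array();                  // pid => job id

foreach (fetch_new_jobs() as $job) {     // hypothetical: reads the jobs table
    while (count($children) >= $maxChildren) {
        $pid = pcntl_waitpid(-1, $status, WNOHANG);
        if ($pid > 0) {
            unset($children[$pid]);      // a slot opened up
        } else {
            sleep(1);
        }
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        continue;                        // fork failed; skip this job for now
    }
    if ($pid === 0) {
        run_job($job);                   // hypothetical: the actual batch processing
        exit(0);
    }
    $children[$pid] = $job['id'];
}

// wait for the remaining children before the control process exits
while ($children) {
    $pid = pcntl_waitpid(-1, $status, 0);
    unset($children[$pid]);
}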
Unfortunately it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.
The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.
PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.
The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.
I'm working on a PHP web interface that will receive huge traffic. Some insert/update requests will contain images that will have to be resized to some common sizes to speed up their further retrieval.
One way to do it is probably to set up some asynchronous queue on the server. E.g., set up a table in a DB with a task queue that would be populated by PHP requests, and let some other process on the server watch the table and process any waiting tasks. How would you do that? What would be the proper environment for that long-running process? Java, or maybe something lighter would do?
If what you're doing is really high volume then what you're looking for is something like beanstalkd. It is a distributed work queue processor. You just put a job on the queue and then forget about it.
Of course then you need something at the other end reading the queue and processing the work. There are multiple ways of doing this.
The easiest is probably to have a cron job that runs sufficiently often to read the work queue and process the requests. Alternatively you can use some kind of persistent daemon process that is woken up by work becoming available.
The advantage of this kind of approach is you can tailor the number of workers to how much work needs to get done, and beanstalkd handles distributed processing (in the sense that the listeners can be on different machines).
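As a rough illustration using Pheanstalk (a common PHP client for beanstalkd, not something this answer mandates), the producing side and a worker could look like this; the tube name and payload are made up:

<?php
// web request side: enqueue the job and return to the user immediately
use Pheanstalk\Pheanstalk;
require 'vendor/autoload.php';

$queue = new Pheanstalk('127.0.0.1');    // Pheanstalk 3.x style; 4.x uses Pheanstalk::create()
$queue->useTube('thumbnails');
$queue->put(json_encode(array('image' => '/uploads/photo.jpg'))); // made-up payload

<?php
// worker side: block on reserve(), process the job, delete it
use Pheanstalk\Pheanstalk;
require 'vendor/autoload.php';

$queue = new Pheanstalk('127.0.0.1');
$queue->watch('thumbnails');

while (true) {
    $job  = $queue->reserve();           // blocks until work is available
    $spec = json_decode($job->getData(), true);
    // ... resize the image described by $spec ...
    $queue->delete($job);
}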
You may set up a cron task that would check the queue table. The script that handles the actions waiting in the queue can be written e.g. in PHP, so you don't have to change implementation languages.
I use Perl for long running process in combination with beanstalkd. The nice thing is that the Beanstalkd client for Perl has a blocking reserve method. This way it uses almost no CPU time when there is nothing to do. But when it has to do its job, it will automatically start processing. Very efficient.
You would want to create a daemon which would "sleep" for a period of time and then check the database for items to process. Once it found items to process, it would process them and then check again as soon as it was done; if there were no more, it would sleep.
You can create daemons in any language, including PHP.
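In PHP that sleep/check loop can be as small as this, run from the CLI (e.g. under supervisord); the fetch/process helpers are hypothetical:

<?php
// daemon.php - poll the task table, process whatever is waiting, then sleep
while (true) {
    $tasks = fetch_pending_tasks();      // hypothetical: reads the queue table
    if (!$tasks) {
        sleep(5);                        // nothing to do; back off briefly
        continue;
    }
    foreach ($tasks as $task) {
        process_task($task);             // hypothetical: resize the image, mark it done
    }
}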
Alternatively, you can just have PHP execute a script and continue on. So that PHP won't wait for the script to finish before continuing, execute it in the background.
exec("nohup /usr/bin/php -f /path/to/script/script.php > /dev/null 2>&1 &");
Although you have to be careful with that, since you could end up with too many processes running in the background because there is no queueing.
You could use a service like IronWorker to do image processing in the background and take the load off your servers. Since it's a service, you won't need to manage anything or set anything else up and it will scale with you as you grow so if you can do one image with it, you can scale to millions of images with zero effort.
Here's an article on how to do a bunch of image processing transformations:
http://dev.iron.io/solutions/image-processing/
The examples there are in Ruby, but you could do the same stuff with PHP pretty easily.