I am developing a video upload site and I have run into a dilemma: uploaded videos need to be converted into the FLV format in order to be displayed to a visitor, but if I execute the conversion command within the script, the script hangs for about 10-15 minutes while FFmpeg converts the video.
My idea was to insert a record into the database indicating that the file needs to be processed, then use a cron job set to every 5 minutes to select the records that need processing, process them, and update the database to show they have been processed. My worry is executing too many processes and the server crashing under the strain, so has anyone got a solution to this, or a way to improve the process I have in mind?
Okay, this is now what I have in mind: the user uploads a video and a row is inserted into the database indicating the video needs to be processed. A cron job set to every 5 minutes checks what needs to be processed and what is already being processed. Say I allow a maximum of five processes at one time: the script checks how many videos are currently being processed and, if it is fewer than five, updates a record to indicate it is being processed. Once the video has been processed, the record is updated to show it is done, and the next cron run starts over. Any thoughts?
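For what it's worth, a minimal sketch of that dispatcher, assuming a videos table with id and status ('pending'/'processing'/'done') columns and a separate convert.php that runs FFmpeg and marks the row done when it finishes (all names here are illustrative):

<?php
// process_queue.php - run from cron every 5 minutes.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');

$maxConcurrent = 5;
$running = (int) $pdo->query(
    "SELECT COUNT(*) FROM videos WHERE status = 'processing'"
)->fetchColumn();

$slots = $maxConcurrent - $running;
if ($slots <= 0) {
    exit; // already at the concurrency limit, try again on the next run
}

$stmt = $pdo->query(
    "SELECT id FROM videos WHERE status = 'pending' LIMIT " . (int) $slots
);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $video) {
    $pdo->prepare("UPDATE videos SET status = 'processing' WHERE id = ?")
        ->execute([$video['id']]);

    // Launch the conversion in the background so this script returns quickly;
    // convert.php runs ffmpeg and sets status = 'done' when it finishes.
    shell_exec('nohup php convert.php ' . (int) $video['id'] . ' > /dev/null 2>&1 &');
}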
Gearman is a good solution for this kind of problem: it lets you instantly dispatch a job and have any number of workers (which may be on different servers) available to fulfil it.
To start with you can run a few workers on the same server, but if you start to run into load issues then you can just fire up another server with some more workers, so it's horizontally scalable.
If you're using PHP-FPM then you can make use of fastcgi_finish_request(), as documented on PHP.net under FastCGI Process Manager (FPM):
fastcgi_finish_request() - special function to finish request and flush all data while continuing to do something time-consuming (video converting, stats processing etc.);
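A rough sketch of how that might look in the upload handler (the ffmpeg command and the file paths are just placeholders):

<?php
// Upload handler running under PHP-FPM.
echo "Upload received - your video is being converted.";

// Flush the response to the browser; the visitor sees the page immediately.
fastcgi_finish_request();

// Anything below runs after the response has been sent.
shell_exec('ffmpeg -i /uploads/in.avi /videos/out.flv 2>&1');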
If you're not using PHP-FPM or want something more advanced then you might consider using a queue manager like Gearman which is perfectly suited to the scenario you're describing. The advantage of using Gearman over running a process with shell_exec is you can take a look at how many jobs are running / how many are left and check their statuses. You also make scaling much easier as it's now trivial to add job servers:
$worker->addServer("10.0.0.1");
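For illustration, a sketch of what the client and worker sides might look like with the PECL gearman extension; the convert_video function name and the payload are made up:

<?php
// client.php - called from the upload request; returns immediately.
$client = new GearmanClient();
$client->addServer('127.0.0.1');
$client->doBackground('convert_video', json_encode(['id' => 123]));

// worker.php - long-running process (run several copies for concurrency).
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1');
$worker->addFunction('convert_video', function (GearmanJob $job) {
    $data = json_decode($job->workload(), true);
    // Illustrative conversion command; the real paths would come from your database.
    shell_exec('ffmpeg -i in-' . (int) $data['id'] . '.avi out-' . (int) $data['id'] . '.flv');
});
while ($worker->work()); // block, handling one job at a time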
I love this class (see the specific comment) in the PHP manual: http://www.php.net/manual/en/function.exec.php#88704
Basically, it lets you spin off a background process on *nix systems. It returns a PID, which you can store in the session. When you reload the page to check on it, you simply recreate the ForkedProcess class with the saved PID and check on its status. If it's complete, the process should be done.
It doesn't allow for much error checking, but it's incredibly lightweight.
If you expect a lot of traffic you should seriously consider a dedicated server.
On a single server, you can use shell_exec along with the UNIX nohup command to get the PID of the process.
function run_in_background($Command, $Priority = 0)
{
    // nohup detaches the command from the terminal; "echo $!" returns its PID.
    if ($Priority)
        $PID = shell_exec("nohup nice -n $Priority $Command 2> /dev/null & echo $!");
    else
        $PID = shell_exec("nohup $Command 2> /dev/null & echo $!");
    return $PID;
}

function is_process_running($PID)
{
    // "ps" prints a header line plus one line per matching process.
    exec("ps $PID", $ProcessState);
    return count($ProcessState) >= 2;
}
A full description of this technique is here: http://nsaunders.wordpress.com/2007/01/12/running-a-background-process-in-php/
You could perhaps put the list of PIDs in a MySQL table and then use your cron job every 5 mins to detect when a video is complete and update the relevant values in the database.
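For example, the cron script might walk the stored PIDs using the is_process_running() helper above; the conversions table and its columns are assumptions:

<?php
// check_conversions.php - run from cron every 5 minutes.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');

$rows = $pdo->query(
    "SELECT id, pid FROM conversions WHERE status = 'processing'"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    if (!is_process_running($row['pid'])) {
        // The ffmpeg process has exited, so mark the video as done.
        $pdo->prepare("UPDATE conversions SET status = 'done' WHERE id = ?")
            ->execute([$row['id']]);
    }
}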
You can call ffmpeg using system() and send the output to /dev/null; this will make the call return right away, effectively handling it in the background.
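Something along these lines; the redirection plus the trailing & are what let the call return immediately (the command and file names are placeholders):

<?php
// The output redirection and the trailing & detach the process,
// so system() returns right away instead of waiting for ffmpeg.
system('ffmpeg -i input.avi output.flv > /dev/null 2>&1 &');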
Spawn a couple of worker processes which will consume messages from a message queue such as beanstalkd. This way you can control the number of concurrent tasks (conversions) and you also don't pay the price of spawning processes (because the workers keep running in the background).
I think it would be even faster if you coded this in C and used Redis as your message queue. Redis has a very good C client library named Hiredis. I don't think this would be insanely difficult to accomplish.
I'm creating a web service for an Android app in PHP with MySQL. I want to continuously check whether any data is available. I haven't got any idea how to fetch data as a background process. How can I execute a query without any request, or without calling a file?
I searched and got some code like
$command = "php -d max_execution_time=50 -f myfile.php '".$param."' >/dev/null &";
exec($command);
But where should I put this code so this query will run continuously?
Yes, the ampersand trick will work. You can use something like supervisord to restart it every few hours, so that any memory leaks are dealt with. This also makes it less fragile if it were to crash or hang.
Also, you can use something like cron to run a task for 10 minutes, then die off and wait for cron to start it again. Bear in mind that with most background tasks it doesn't matter if there's a short period when the task is not running, since it will catch up. It's worth checking in each run whether the previous one is still running, and exiting early if it is: that way you don't have two background tasks causing race conditions when retrieving work from your database.
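A rough sketch of that cron-plus-lock pattern (the lock-file path and the 10-minute budget are arbitrary choices):

<?php
// poller.php - started by cron; exits after ~10 minutes so cron restarts it.
$lock = fopen('/tmp/poller.lock', 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit; // a previous run is still going; avoid overlapping workers
}

$deadline = time() + 10 * 60;
while (time() < $deadline) {
    // ... fetch and process pending work from the database here ...
    sleep(5); // small pause so an empty queue does not spin the CPU
}

flock($lock, LOCK_UN);
fclose($lock);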
Finally you can use a job server, such as Gearman. This will allow you to send tasks to it in an asynchronous fashion, and they will be run by worker tasks (in either time or priority order). This is probably the most reliable approach, but it takes a bit more work to set up. There's a PHP module for this, but in my experience it's more of a hassle to use than Net_Gearman, which is available in PEAR.
I have a site where auctions end at varying times. I need to send an automated email to the seller and the buyer after the auction is finished to notify them of the auction ending and the results. Obviously I can't really wait for someone to load the page to run the script, so is there a good way to automate this by checking the current time, comparing that to the time of the auction end, and running that script?
The site is on a UNIX server so a cron job is an option, but I'm concerned that running a cron job like that will put quite a load on the server.
A cron job runs at most once per minute.
Whatever load it generates on the server really depends on the kind of script you're going to run. By the way, I'm assuming that you're using the CLI to run the script (rather than just doing a curl http://mysite.com).
If your script takes longer than one minute (you should monitor this), simply either:
Increase the interval between runs, or
Use a lock file to make sure no two instances of your script can run at the same time.
if (($fp = fopen('/tmp/mylockfile', 'c')) === false) { // 'c' creates the file if it doesn't exist yet
    die("Could not open lock file");
}
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    die("Could not obtain lock");
}

// run your code here

// release the lock and close the file
flock($fp, LOCK_UN);
fclose($fp);
OTOH, if the script needs to run more than once per minute, you will need a different mechanism entirely.
Q: What is the best way to run a PHP script at a particular time, or interval?
A: Use cron
Q: Does a cronjob create a big load on the server?
A: Depends of course on your script. But checking whether an auction should be closed, closing it, and sending two emails shouldn't be too difficult. Be sure to create some kind of lock file so that if your script runs longer than the interval set, it isn't run twice.
Q: Running a script at shorter intervals than 1 minute?
A: Can't answer this one for you. Sorry :)
Use Cron. It allows you to run any command at most once per minute: http://clickmojo.com/code/cron-tutorial.html
As far as server load goes, it generally won't be a concern unless you are running a massive number of database calls very often on a very low-end server. I speak in generalities, but the idea is sound.
If you are using something else (besides PHP) to run your auction timer mechanism, I recommend you attach some code to that timer mechanism that also executes a mail-sending script when the timer runs down to zero and determines a winner.
Run the PHP script as a command line script. This will not put a load on the webserver - just a load on the server and you can easily run it via CRON.
If you add #!/usr/bin/php to the top of the script and set the execute bit on the file with chmod +x scriptname.php, you can execute the script directly without passing it through php.
http://php.net/manual/en/features.commandline.php
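For example (scriptname.php and the crontab line shown in the comment are illustrative):

#!/usr/bin/php
<?php
// scriptname.php - make executable with: chmod +x scriptname.php
// A crontab entry such as the following then runs it every 5 minutes:
//   */5 * * * * /path/to/scriptname.php
echo "Checking for finished auctions...\n";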
A couple of things you need for this:
Store something in your auction information indicating whether you've sent this e-mail yet or not (could be a boolean or a date for when it was sent which might be null). Although I have to assume you need to do something besides send this e-mail? Like mark the auction as closed so no more bidding can take place?
A bit of code that finds auctions which need this e-mail sent: e.g. they've ended and the notification has not yet been sent (a sketch of this follows below).
Something to repeatedly execute the bit of code in 2. You could use cron. Alternatively you can write a pretty simple daemon for unix that runs constantly in a loop of (wait at least a few ms or more; do some stuff). The latter is a lot more work but in my opinion scales much better. See http://pear.php.net/package/System_Daemon for some useful tools if you're interested in this approach.
One thing to consider is how careful you want to be about accidentally double-sending this e-mail. If you're only running this code in a single thread it's pretty easy, but if you ever want to build out to the point where you have several different distributed machines that create and send these e-mails, you have to be a bit more careful. If you're running it out of cron, can you guarantee one run of it will always be finished before another one starts?
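To make points 1 and 2 concrete, here is one possible shape for the cron script; the auctions table, its columns, and the e-mail bodies are all assumptions, and the notified flag is claimed before sending to reduce the chance of double-sending:

<?php
// notify_auctions.php - run from cron once a minute.
$pdo = new PDO('mysql:host=localhost;dbname=auctions', 'user', 'pass');

$stmt = $pdo->query(
    "SELECT id, seller_email, buyer_email FROM auctions
     WHERE end_time <= NOW() AND notified = 0"
);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $auction) {
    // Claim the row first; if another run already claimed it, skip it.
    $claim = $pdo->prepare(
        "UPDATE auctions SET notified = 1 WHERE id = ? AND notified = 0"
    );
    $claim->execute([$auction['id']]);
    if ($claim->rowCount() === 0) {
        continue;
    }

    mail($auction['seller_email'], 'Your auction has ended', 'Results...');
    mail($auction['buyer_email'], 'Auction results', 'Results...');
}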
Recently I've been researching the use of Beanstalkd with PHP. I've learned quite a bit but have a few questions about the setup on a server, etc.
Here is how I see it working:
1) I install Beanstalkd and any dependencies (such as libevent) on my Ubuntu server. I then start the Beanstalkd daemon (which should basically run at all times).
2) Somewhere in my website (such as when a user performs some actions, etc.) tasks get added to various tubes within the Beanstalkd queue.
3) I have a bash script (such as the following one) that is run as a daemon and basically executes a PHP script.
#!/bin/sh
php worker.php
4) The worker script would have something like this to execute the queued up tasks:
while (1) {
    // Block until a job arrives on the "test" tube.
    $job = $this->pheanstalk->watch('test')->ignore('default')->reserve();

    $job_encoded = json_decode($job->getData(), false);
    $done_jobs[] = $job_encoded;
    $this->log('job: ' . print_r($job_encoded, 1));

    // Remove the job from the queue once it has been handled.
    $this->pheanstalk->delete($job);
}
Now here are my questions based on the above setup (correct me if I'm wrong about any of it):
Say I have the task of importing an RSS feed into a database or something. If 10 users do this at once, they'll all be queued up in the "test" tube. However, they'd then only be executed one at a time. Would it be better to have 10 different tubes all executing at the same time?
If I do need more tubes, does that then also mean that I'd need 10 worker scripts? One for each tube, all running concurrently with basically the same code except for the string literal in the watch() function?
If I run that script as a daemon, how does that work? Will it constantly be executing the worker.php script? That script loops until the queue is empty theoretically, so shouldn't it only be kicked off once? How does the daemon decide how often to execute worker.php? Is that just a setting?
Thanks!
If the worker isn't taking too long to fetch the feed, it will be fine. You can run multiple workers if required to process more than one at a time. I've got a system (currently using Amazon SQS, but I've done similar with BeanstalkD before), with up to 200 (or more) workers pulling from the queue.
A single worker script (the same script running multiple times) should be fine - the script can watch multiple tubes at the same time, and the first one available will be reserved. You can also use the stats-job command to see where a particular $job came from (which tube), or put some meta-information into the message if you need to tell each type from another.
A good example of running a worker is described here. I've also added supervisord (also, a useful post to get started) to easily start and keep running a number of workers per machine (I run shell scripts, as in the first link). I would limit the number of times the worker loops, and also pass a number into the reserve() call so it waits a few seconds, or more, for the next job to become available, rather than spinning out of control in a tight loop that never pauses even when there is nothing to do.
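As a hedged rework of the worker loop from the question, bounding the iterations and passing a timeout to reserve() (this keeps the same $this->pheanstalk as the question's class; older Pheanstalk versions accept a timeout on reserve() and return false when it expires, newer ones call it reserveWithTimeout()):

<?php
// worker.php - process at most $maxJobs jobs, then exit and let the shell
// script (or supervisord) start a fresh copy.
$maxJobs = 100;

for ($i = 0; $i < $maxJobs; $i++) {
    // Wait up to 10 seconds for a job instead of spinning in a tight loop.
    $job = $this->pheanstalk->watch('test')->ignore('default')->reserve(10);
    if (!$job) {
        continue; // nothing to do right now
    }

    $payload = json_decode($job->getData(), false);
    $this->log('job: ' . print_r($payload, 1));

    $this->pheanstalk->delete($job);
}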
Addendum:
The shell script would be run as many times as you need (the link shows how to have it re-run as required with exec $0). Whenever the PHP script exits, the shell script re-runs it.
Apparently there's a Django app to show some stats, but it's trivial enough to connect to the daemon, get a list of tubes, and then get the stats for each tube - or just counts.
Greetings All!
I am having some trouble working out how to execute thousands upon thousands of requests to a web service (eBay). I have a limit of 5 million calls per day, so there are no problems on that end.
However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.
Basically the flow is:
1) Get list of items from database (1,000 to 10,000 items)
2) Make a API POST request for each item
3) Accept return data, process data, update database
Obviously a single PHP instance running this in a loop would be impossible.
I am aware that PHP is not a multithreaded language.
I tried the CURL solution, basically:
1) Get list of items from database
2) Initialize multi curl session
3) For each item add a curl session for the request
4) Execute the multi curl session
So you can imagine 1,000-10,000 GET requests occurring...
This was OK: around 100-200 requests were occurring in about a minute or two; however, only 100-200 of the 1,000 items actually processed. I am thinking that I'm hitting some sort of Apache or MySQL limit?
But this does add latency; it's almost like performing a DoS attack on myself.
I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the return data from the web service... And this needs to be done in at least 5 minutes.
I am using PHP and MySQL with the Zend Framework.
Thanks!
I've had to do something similar, but with Facebook, updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up, because the script is spending most of its time waiting for a response.
You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.
exec('nohup /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId); // $processId[0] holds the PID echoed by the shell
You can pass parameters (getopt) to the PHP script on the command line to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see if the scripts are still running by checking for the process IDs. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
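A sketch of that master/child arrangement; master.php, child.php, the batch count, and the log paths are all made up for illustration (posix_kill with signal 0 just tests whether a PID still exists):

<?php
// master.php - split the item list into batches and launch a child per batch.
$batches = range(0, 9);          // e.g. 10 batches of ~1,000 items each
$pids = [];

foreach ($batches as $batch) {
    $cmd = 'nohup php child.php --batch=' . $batch
         . ' >> /tmp/child-' . $batch . '.log 2>&1 & echo $!';
    $pids[$batch] = (int) trim(shell_exec($cmd));
}

// Sleep/check cycle: wait until every child PID has exited.
while ($pids) {
    foreach ($pids as $batch => $pid) {
        if (!posix_kill($pid, 0)) {      // signal 0 only checks existence
            unset($pids[$batch]);
        }
    }
    sleep(5);
}

// child.php would start with something like:
//   $opts = getopt('', ['batch:']);
//   $batch = (int) $opts['batch'];
// and then fetch and process only the items belonging to that batch.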
Combine multiple processes with multi-curl, and you should easily be able to do what you need.
My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.
Mysqli allows multiple-statement queries, so you could definitely batch those database updates.
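For instance, a small sketch of batching the updates with mysqli::multi_query(); the items table, columns, and sample data are assumptions:

<?php
// Batch several UPDATEs into a single round trip with mysqli::multi_query().
$mysqli = new mysqli('localhost', 'user', 'pass', 'mysite');

$results = [['id' => 1, 'price' => 9.99], ['id' => 2, 'price' => 4.50]]; // from the API calls

$sql = '';
foreach ($results as $item) {
    $sql .= sprintf(
        "UPDATE items SET price = %f WHERE id = %d;",
        $item['price'],
        $item['id']
    );
}

if ($mysqli->multi_query($sql)) {
    // UPDATEs return no rows, but each statement's result must still be consumed.
    while ($mysqli->more_results() && $mysqli->next_result()) {
        // nothing to fetch for UPDATE statements
    }
}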
The http requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an api call and stores the results in a file or memcached. The master can periodically read the results and update the db. (Careful to rotate the data store for safe reading and writing by multiple processes.)
To understand your requirements better: must you implement your solution only in PHP, or can you interface a PHP part with another part written in another language?
If you cannot go for another language, try to perform this update as a PHP script that runs in the background rather than through Apache.
You can follow Brent Baisley advice for a simple use case.
If you want to build a robust solution, then you need to:
set up a representation of the actions in a database table that will be your process queue;
set up a script that pops this queue and processes your actions;
set up a cron daemon that runs this script every x minutes.
This way you can have 1,000 PHP scripts running, using your OS's parallelism capabilities, without hanging when eBay takes too long to respond.
The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting:
the number of requests one PHP script makes;
the order / number / type / priority of the actions in the queue;
the number of scripts the cron daemon runs.
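As one possible shape for that queue table and the script that pops it (all table, column, and script names are illustrative): claiming rows with an UPDATE first lets several copies of the script run in parallel without processing the same row twice.

<?php
// pop_queue.php - launched by cron; each copy claims its own slice of rows.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$workerId = getmypid();

// Claim up to 50 pending rows atomically so parallel scripts don't collide.
$claim = $pdo->prepare(
    "UPDATE action_queue SET status = 'working', worker = ?
     WHERE status = 'pending' ORDER BY id LIMIT 50"
);
$claim->execute([$workerId]);

$rows = $pdo->prepare(
    "SELECT id, payload FROM action_queue WHERE status = 'working' AND worker = ?"
);
$rows->execute([$workerId]);

foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // ... call eBay for this item and update the database with the result ...
    $pdo->prepare("UPDATE action_queue SET status = 'done' WHERE id = ?")
        ->execute([$row['id']]);
}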
Thanks everyone for the awesome and quick answers!
The advice from Brent Baisley and e-satis works nicely. Rather than executing the sub-processes using cURL like I did before, forking takes a massive load off, and it also nicely gets around the issue of maxing out my Apache connection limit.
Thanks again!
It is true that PHP is not multithreaded, but it can certainly be set up with multiple processes.
I have created a system that resembles the one that you are describing. It's running in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.
It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with the full context taken from the database.
Here is a basic description of the system.
1. Start control process
2. Check database for new jobs
3. Spawn child process with the job data as a parameter
4. Keep a table of the child processes to be able to control the number of simultaneous processes.
Unfortunately it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.
The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.
PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.
The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.
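A minimal sketch of that control-process pattern with the PCNTL functions; fetch_next_job() and process_job() are hypothetical placeholders for the database lookup and the actual work:

<?php
// Hypothetical placeholders - replace with real queries and processing logic.
function fetch_next_job() { return null; }   // e.g. SELECT the next pending row
function process_job($job) { /* do the actual work in the child */ }

$maxChildren = 8;
$children = [];

while (true) {
    // Reap any children that have finished so their slots free up.
    while (($pid = pcntl_waitpid(-1, $status, WNOHANG)) > 0) {
        unset($children[$pid]);
    }

    $job = (count($children) < $maxChildren) ? fetch_next_job() : null;
    if ($job !== null) {
        $pid = pcntl_fork();
        if ($pid === 0) {
            process_job($job);   // the child runs actual PHP code, then exits
            exit(0);
        }
        $children[$pid] = true;  // the parent remembers the child PID
    } else {
        sleep(1);                // all slots busy or nothing to do
    }
}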
I'm working on a PHP web interface that will receive huge traffic. Some insert/update requests will contain images that will have to be resized to some common sizes to speed up their further retrieval.
One way to do it is probably to set up some asynchronous queue on the server. E.g. set up a table in a DB with a task queue that would be populated by PHP requests, and let some other process on the server watch the table and process any waiting tasks. How would you do that? What would be the proper environment for that long-running process? Java, or maybe something lighter would do?
If what you're doing is really high volume then what you're looking for is something like beanstalkd. It is a distributed work queue processor. You just put a job on the queue and then forget about it.
Of course then you need something at the other end reading the queue and processing the work. There are multiple ways of doing this.
The easiest is probably to have a cron job that runs sufficiently often to read the work queue and process the requests. Alternatively you can use some kind of persistent daemon process that is woken up by work becoming available.
The advantage of this kind of approach is you can tailor the number of workers to how much work needs to get done, and beanstalkd handles distributed processing (in the sense that the listeners can be on different machines).
You may set up a cron task that checks the queue table. The script that handles the actions waiting in the queue can be written in PHP, for example, so you don't have to change implementation language.
I use Perl for long-running processes in combination with beanstalkd. The nice thing is that the beanstalkd client for Perl has a blocking reserve method. This way it uses almost no CPU time when there is nothing to do. But when it has to do its job, it will automatically start processing. Very efficient.
You would want to create a daemon which would "sleep" for a period of time and then check the database for items to process. Once it finds items, it processes them and then checks again as soon as it is done; if there are none left, it goes back to sleep.
You can create daemons in any language, including PHP.
Alternatively, you can just have PHP execute a script and continue on. So that PHP won't wait for the script to finish before continuing, execute it in the background.
exec("nohup /usr/bin/php -f /path/to/script/script.php > /dev/null 2>&1 &");
Although you have to be careful with that, since you could end up with too many processes running in the background because there is no queuing.
You could use a service like IronWorker to do image processing in the background and take the load off your servers. Since it's a service, you won't need to manage anything or set anything else up and it will scale with you as you grow so if you can do one image with it, you can scale to millions of images with zero effort.
Here's an article on how to do a bunch of image processing transformations:
http://dev.iron.io/solutions/image-processing/
The examples there are in Ruby, but you could do the same stuff with PHP pretty easily.