Speeding up a PHP App - php

I have a list of data that needs to be processed. The way it works right now is this:
A user clicks a process button.
The PHP code takes the first item that needs to be processed, takes 15-25 secs to process it, moves on to the next item, and so on.
This takes way too long. What I'd like instead is that:
The user clicks the process button.
A PHP script takes the first item and starts to process it.
Simultaneously another instance of the script takes the next item and processes it.
And so on, so around 5-6 of the items are being processed simultaneously and we get 6 items processed in 15-25 secs instead of just one.
Is something like this possible?
I was thinking that I use CRON to launch an instance of the script every second. All items that need to be processed will be flagged as such in the MySQL database, so whenever an instance is launched through CRON, it will simply take the next item flagged to be processed and remove the flag.
Thoughts?
Edit: To clarify something, each 'item' is stored in a MySQL database table as a separate row. Whenever processing starts on an item, it is flagged as being processed in the db, so each new instance will simply grab the next row which is not being processed and process it. Hence I don't have to supply the items as command line arguments.
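The "grab the next unflagged row" step needs care once several instances run at the same moment, since two of them could pick the same row. A minimal sketch of one way to claim a row safely, assuming a hypothetical `items` table with `id` and `status` columns and a PDO connection (the table, the column names and the `claimNextItem` helper are illustrative, not taken from the question):

```php
<?php
// Hypothetical helper: claims one pending row for this worker, or returns null.
// Assumes a table items(id, status) where status is 'pending' or 'processing'.
function claimNextItem(PDO $db): ?int
{
    while (true) {
        $id = $db->query("SELECT id FROM items WHERE status = 'pending' ORDER BY id LIMIT 1")
                 ->fetchColumn();
        if ($id === false) {
            return null; // nothing left to process
        }
        // The status check in the WHERE clause is what makes the claim safe:
        // only one worker's UPDATE can move this row away from 'pending'.
        $stmt = $db->prepare("UPDATE items SET status = 'processing' WHERE id = ? AND status = 'pending'");
        $stmt->execute([$id]);
        if ($stmt->rowCount() === 1) {
            return (int) $id;
        }
        // Another worker claimed the row first; loop and try the next one.
    }
}
```

Each instance would call `claimNextItem()` in a loop and stop when it returns null, so it does not matter how the instances are launched (cron, exec or a queue worker).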

Here's one solution, not the greatest, but will work fine on Linux:
Split the processing PHP into a separate CLI script in which:
The command line inputs include `$id` and `$item`
The script writes its PID to a file in `/tmp/$id.$item.pid`
The script echos results as XML or something that can be read into PHP to stdout
When finished the script deletes the `/tmp/$id.$item.pid` file
Your master script (presumably on your webserver) would do:
`exec("nohup php myprocessing.php $id $item > /tmp/$id.$item.xml &");` for each item (the trailing `&` backgrounds the command so that exec returns immediately)
Poll the `/tmp/$id.$item.pid` files until all are deleted (sleep/check poll is enough)
If they are never deleted kill all the processing scripts and report failure
If successful, read from `/tmp/$id.$item.xml` to format the output for the user
Delete the XML files if you don't want to cache for later use
A backgrounded nohup started application will run independent of the script that started it.
This interested me sufficiently that I decided to write a POC.
test.php
<?php
$dir = realpath(dirname(__FILE__));
$start = time();
// Time in seconds after which we give up and kill everything
$timeout = 25;
// The unique identifier for the request
$id = uniqid();
// Our "items" which would be supplied by the user
$items = array("foo", "bar", "0xdeadbeef");
// We exec a nohup command that is backgrounded which returns immediately
foreach ($items as $item) {
exec("nohup php proc.php $id $item > $dir/proc.$id.$item.out &");
}
echo "<pre>";
// Run until timeout or all processing has finished
while(time() - $start < $timeout)
{
echo (time() - $start), " seconds\n";
clearstatcache(); // Required since PHP will cache for file_exists
$running = array();
foreach($items as $item)
{
// If the pid file still exists the process is still running
if (file_exists("$dir/proc.$id.$item.pid")) {
$running[] = $item;
}
}
if (empty($running)) break;
echo implode(',', $running), " running\n";
flush();
sleep(1);
}
// Clean up if we time out
if (!empty($running)) {
clearstatcache();
foreach ($items as $item) {
// Kill process of anything still running (i.e. that has a pid file)
if(file_exists("$dir/proc.$id.$item.pid")
&& $pid = file_get_contents("$dir/proc.$id.$item.pid")) {
posix_kill($pid, 9);
unlink("$dir/proc.$id.$item.pid");
// Would want to log this in the real world
echo "Failed to process: ", $item, " pid ", $pid, "\n";
}
// delete the useless data
unlink("$dir/proc.$id.$item.out");
}
} else {
echo "Successfully processed all items in ", time() - $start, " seconds.\n";
foreach ($items as $item) {
// Grab the processed data and delete the file
echo(file_get_contents("$dir/proc.$id.$item.out"));
unlink("$dir/proc.$id.$item.out");
}
}
echo "</pre>";
?>
proc.php
<?php
$dir = realpath(dirname(__FILE__));
$id = $argv[1];
$item = $argv[2];
// Write out our pid file
file_put_contents("$dir/proc.$id.$item.pid", posix_getpid());
for($i=0;$i<80;++$i)
{
echo $item,':', $i, "\n";
usleep(250000);
}
// Remove our pid file to say we're done processing
unlink("$dir/proc.$id.$item.pid");
?>
Put test.php and proc.php in the same folder of your server, load test.php and enjoy.
You will of course need nohup (unix) and PHP cli to get this to work.
Lots of fun, I may find a use for it later.

Use an external work queue like Beanstalkd, which your PHP script writes a bunch of jobs to. You then have as many worker processes as you like pulling jobs from beanstalkd and processing them as fast as possible. You can spin up as many workers as you have memory and CPU for. Your job body should contain as little information as possible, maybe just some IDs which you hit the DB with. beanstalkd has a slew of client APIs and itself has a very basic API, think memcached.
We use beanstalkd to process all of our background jobs, and I love it. It's easy to use and very fast.

There is no multithreading in PHP, however you can use fork.
php.net:pcntl-fork
Or you could execute a system() command and start another process which is multithreaded.

Can you implement threading in JavaScript on the client side? It seems to me I've seen a JavaScript library (from Google, perhaps?) that implements it. Search for it and I'm sure you'll find something. I've never done it, but I know it's possible. Your client-side JavaScript could call a PHP script via AJAX once for each item, so the requests run in parallel. That might be easier than trying to do it all on the server side.
-don

If you are running a high traffic PHP server you are INSANE if you do not use Alternative PHP Cache: http://php.net/manual/en/book.apc.php . You do not have to make code modifications to run APC.
Another useful technique that can work along with APC is using the Smarty template system which allows you to cache output so that pages do not have to be rebuilt.

To solve this problem, I've used two different products; Gearman and RabbitMQ.
The benefit of putting your jobs into some sort of queuing software like Gearman or RabbitMQ is that if you have multiple machines, they can all participate in processing items off the queue(s).
Gearman is easier to set up, so I'd suggest poking around with it a bit first. If you find you need something more heavy duty with more robust queues, look into RabbitMQ.
http://www.danga.com/gearman/
http://pear.php.net/package/Net_Gearman (PEAR library)

You can use pcntl_fork() and family to fork a process - however you may need something like IPC to communicate back to the parent process that the child process (the one you fork'd) is finished.
You could have them write to shared memory, like via memcache or a DB.
You could also have the child process write the completed data to a file that the parent process keeps checking. As each child process completes, the file is created/written to/updated, and the parent process can grab the results one at a time and then throw them back to the caller/client.
The parent's job is to control the queue, to make sure the same data isn't processed twice and also to sanity check the children (better kill that runaway process and start over...etc)
Something else to keep in mind: on Windows platforms you are going to be severely limited. I don't even think you have access to the pcntl_ functions unless you compiled PHP with support for them.
Also, can you cache the data once it's been processed, or is it unique data every time? That would surely speed things up.

Related

In PHP, exec fails silently, sometimes, when calling many exec commands, but the same command run again later will work

I have a PHP script that uses exec('command args > /log/file &'); within a loop to create multiple child scripts that run at the same time. Basically, the parent script gets user information out of a database and creates child scripts running in parallel, then the child script creates an email to send to a single user. This happens approximately 50,000 times.
To prevent the creation of 50,000 simultaneously running processes, I have a database table that keeps track of the currently running processes, and before creating a new process the parent checks the current child count and sleeps if 25 children are currently active. The child, upon completing its task, deletes its row in the table, freeing the parent to create more children.
The problem is, about 10% of the exec commands fail silently, and for seemingly no reason. I can run the parent script again (it's smart enough not to email the same user twice), and it will work, once again, 90% of the time using the same exec commands that failed last time. Running the script five or six times in a row will email everyone.
By putting a sleep immediately after the exec, I can increase my success rate to around 95%.
Why would exec be failing, if the same command will work later? I can just keep the script repeating until it completes, but I'd much rather solve the exec problem.
Some highly simplified sample code:
Parent script:
do {
// get user, group, and supergroup information for users that haven't
// been emailed yet
foreach ($users as $userArray) {
$processId = insertIntoProcessQueue($userArray);
$cmd = 'sudo php -q ./childScript.php ' . cliArg($userArray) . ' ' .
cliArg($groupArray) . ' ' . cliArg($supergroupArray) .
' ' . $processId . ' > file.log &';
exec($cmd);
// Wait until fewer than 25 children are active before starting another
while (numChildren() >= 25) {
sleep(1);
}
}
$incomplete = moreUsersToEmail() > 0;
} while ($incomplete);
function cliArg($array) {
return escapeshellarg(json_encode($array));
}
Child script:
ignore_user_abort(true);
$user = json_decode($argv[1]);
$group = json_decode($argv[2]);
$supergroup = json_decode($argv[3]);
print_r($user);
$email = createEmail($user, $group, $supergroup);
$email->sendEmail();
removeFromProcessQueue($argv[4]);
flush();
exit;
The print_r will only show up in the log file when the script completes and I never get any errors, so I can't get any data about why it's failing. To add to that, it doesn't fail consistently on any individual users, and it doesn't fail running a single user at a time, so I have to run the script through everyone and try and catch the errors amidst the 45,000 that are working properly. And, since the parent and child never communicate beyond the parent starting the child, I can't detect (from the parent) when a child fails (otherwise I could immediately try and start any failed children again instead of rerunning the parent post-hoc).
Edit: So it turns out there's an included script that's dynamically generated and is destroyed and regenerated every time it's used (don't ask me why), which creates a race condition while running processes in parallel that caused the script to fail.
Thanks everyone for your unfortunately wasted time.
I just looked at the PHP docs for exec() and you can pass an array as a reference with a second parameter which will be filled with the output of exec. You can use this to determine a) why the command is failing and b) when the command fails and integrate that into your code.
So I'd change:
exec($cmd);
To something like:
function check_exec_results($results)
{
// Dump the captured output to see what the exec commands return; remove this once you know what to check for
echo '<HR><PRE>',print_r($results,true),'</PRE><HR>';
// Placeholder: replace with a real check on $results
$results_look_good = true;
return $results_look_good;
}
$successful_exec = false;
do
{
$exec_results = array();
exec($cmd,$exec_results);
$successful_exec = check_exec_results($exec_results);
}
while (!$successful_exec);
Note that this is potentially an infinite loop so I'd also go a step further and set a limit to the number of times exec() can be called for each user.
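The retry limit mentioned above could be sketched roughly as follows. `runWithRetries`, its parameters and the pause between attempts are hypothetical choices for illustration. Note also that a command ending in `&` returns immediately with no output, so a check like this is only meaningful when exec actually waits for the command.

```php
<?php
// Hypothetical helper: run $command at most $maxAttempts times and stop as
// soon as $looksGood accepts the captured output lines and exit code.
function runWithRetries(string $command, callable $looksGood, int $maxAttempts = 3): bool
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $output = [];
        $exitCode = 0;
        exec($command, $output, $exitCode);
        if ($looksGood($output, $exitCode)) {
            return true;
        }
        if ($attempt < $maxAttempts) {
            usleep(200000); // brief pause before the next attempt
        }
    }
    return false; // give up; the caller should log this rather than loop forever
}
```

A call such as `runWithRetries($cmd, function ($out, $code) { return $code === 0; })` would then replace the open-ended do/while loop.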

Multiple "agents" handling a single array

Apologies if this has been covered before - I did my searching but possibly may not know the correct terms to have used.
This process is handled with PHP.
Here's the situation:
I have a large array of file names. The script I have opens these files and enters their content into a database. Processing these files one at a time takes over 24 hours, and these files are updated on a daily basis.
Breaking the single large array into four smaller arrays and running concurrent processes finishes the job before the 24 hour window elapses, but sometimes one or two processes will finish hours before the others because file sizes vary on a daily basis.
Much like people who stock retail shelves (who else has worked that nightmare before?) pitch in to help out with what's left after finishing their own tasks, I'd like to have a script in place where these "agents" do the same.
Here's some basics of what I have figured out - it could be wrong, and I'm not too proud to protest if I am :-)
$files = array('file1','file2','file3','file4','file5');
//etc... on to over 4k elements
while($file = array_pop($files)){
//Something in here... I have no idea what.
}
Ideas? Something like four function calls or four loops within that overarching 'while' has crossed my mind, but I'm pretty sure it's going to wait on executing subsequent calls until the previous one(s) finish.
Any help is appreciated. I'm seriously stuck on this one!
Thanks!
A database-backed message queue seems the obvious solution but I think that's overkill in this case. I would simply put the files to be processed into a single dedicated queue directory, then use the DirectoryIterator class to scan it. Something like this:
while (true) {
look in the queue directory for a file
if you don't find one, exit the script, all processing is done
if you find one, rename it or move it to a work directory
if the rename/move command succeeded, process the file
if the rename/move command failed, one of the other threads got it first
}
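The pseudocode above could be turned into PHP along these lines. This is only a sketch: `processQueue`, the two directory arguments and the callback are hypothetical names, and it assumes both directories live on the same POSIX filesystem so that rename() is atomic.

```php
<?php
// Hypothetical worker loop: $queueDir holds files waiting to be processed,
// $workDir receives each file once this worker has claimed it.
function processQueue(string $queueDir, string $workDir, callable $processFile): int
{
    $processed = 0;
    while (true) {
        $claimed = null;
        foreach (new DirectoryIterator($queueDir) as $entry) {
            if (!$entry->isFile()) {
                continue; // skip ".", ".." and subdirectories
            }
            $target = $workDir . '/' . $entry->getFilename();
            // Only one worker can succeed in moving a given file.
            if (@rename($entry->getPathname(), $target)) {
                $claimed = $target;
                break;
            }
            // rename() failed: another worker got this file first, try the next.
        }
        if ($claimed === null) {
            return $processed; // nothing left in the queue directory
        }
        $processFile($claimed);
        $processed++;
    }
}
```

Every worker runs the same loop, so a worker that finishes its small files early simply keeps claiming whatever is left.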
Edit:
Regarding launching the workers, you could use a simple shell script to spawn the PHP processes in the background:
NUM_WORKERS=5
for WORKER in $(seq 1 ${NUM_WORKERS})
do
echo "starting worker ${WORKER}"
php -f /path/to/my/process.php &
done
Then, create a cron entry to run this launcher, for example, at midnight:
0 0 * * * /path/to/launcher.sh
You want what's called a "message queue". Something like beanstalkd
You'll basically create a list of messages that include your individual filenames. You'll then create a set of processors to process them. Each processor will handle one file then go back to the queue to see if there are more messages/files waiting to be processed.
EDIT:
Here's an analogy to help explain message queues. Your first idea is like a human manager taking a stack of files, dividing them into four piles and then handing each of his four employees a pile to process. A message queue is more like this: the manager puts all the files on a table and tells each employee to take a single file from the table and process it. He tells them when they're done with the first file to keep taking files until there are no more files on the table. When all the files are done, the employees can go home.
One employee might end up with really large files and only handle a few, while another employee might get smaller files and handle many. It doesn't matter how many each employee handles, they'll all keep working until the table is empty.
I would have a socket server master script that hands out file paths to x number of slave scripts, until there are no files left to process. This way, all the slave scripts will keep running, and you can hand out file paths dynamically as they are requested.
Something like this:
master.php
<?php
// load the array of files to process (however you do this)
$fileList = file('filelist.txt');
// Create a listening socket on localhost
$serverSocket = stream_socket_server('tcp://127.0.0.1:7878');
$sockets = array($serverSocket);
$clients = array();
// Loop while there are still files to process
while (count($fileList)) {
// Run a select() call on the existing sockets' read buffers
// stream_select() takes its arrays by reference, so they must be variables
$read = $sockets;
$write = NULL;
$except = NULL;
// Skip to next iteration if no sockets are waiting for handling
if (stream_select($read, $write, $except, 1) < 1) {
continue;
}
// Loop sockets with data to read
foreach ($read as $socket) {
if ($socket == $serverSocket) {
// Accept new clients
$sockets[] = $clients[] = stream_socket_accept($serverSocket);
} else if (trim(fgets($socket)) == 'next') {
// Hand out a new file path to the client
fwrite($socket, array_shift($fileList)."\n");
if (!count($fileList)) {
break 2;
}
}
}
}
// When we're done, disconnect the clients
foreach ($clients as $socket) {
@fclose($socket);
}
// ...and close the listen socket
@fclose($serverSocket);
slave.php
<?php
$socket = fsockopen('127.0.0.1', 7878);
while (!feof($socket)) {
// Get a new file path from the master
fwrite($socket,"next\n");
$path = trim(fgets($socket));
if (is_file($path)) {
// Process the file at $path here
}
}
You then just need to start master.php, then when it is running, you can start however many instances of slave.php as you want, and they will all keep running until there are no more files to process.
Obviously, this has no error handling, but it should provide a basic framework to get you started. This relies on blocking function calls (stream_select() and fgets()) to avoid a race condition - this may or may not be sufficient for your purposes.

Progress Bar when running function inside foreach loop

I have a foreach loop that calls a function to set values to an array. Sometimes it takes hours to complete depending on how many times it has to run thru the function to complete.
What I would like to have is a progress bar or at least a 1/1000 completed type progress indicator.
Is this possible? If so how could I implement this into my code? Would it be in the function or in the foreach loop? Been researching and found some examples using for and $i++ but I am not really sure how to implement that since I am already using a foreach loop.
Thanks much.
function scrape_amazon($links) {
//my code runs here to set all values in $ret array.
}
foreach($links as $link) {
$ret = scrape_amazon($link);
}
PHP probably isn't really the right tool for this task, however what you could do is:
Launch the slow code as a background process, and output progress to a file.
Have a PHP script that polls that file for progress information (either by page refresh or AJAX)
Launching the background process can be done in several ways, including:
Launch via cron every 60 seconds, and poll for new jobs spooled in some readable area
Launch via a fork/exec mechanism from a web page
Launch as a daemon at system startup
It will take some effort to avoid problems with multiple executions and/or overlap.
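The "output progress to a file" idea works with a foreach loop as well; it only needs a counter. A rough sketch, where `processWithProgress`, `readProgress` and the "done/total" file format are all hypothetical choices:

```php
<?php
// Hypothetical worker: processes each link and records "done/total" in $progressFile.
function processWithProgress(array $links, callable $work, string $progressFile): array
{
    $total = count($links);
    $done = 0;
    $results = [];
    foreach ($links as $link) {
        $results[] = $work($link);
        $done++;
        // The file is tiny, so a plain overwrite is enough for a sketch.
        file_put_contents($progressFile, "$done/$total");
    }
    return $results;
}

// Hypothetical poller: returns the percentage complete, or 0 if nothing is written yet.
function readProgress(string $progressFile): int
{
    if (!is_file($progressFile)) {
        return 0;
    }
    $parts = array_pad(explode('/', trim((string) file_get_contents($progressFile))), 2, '0');
    $done = (int) $parts[0];
    $total = (int) $parts[1];
    return $total > 0 ? (int) floor($done * 100 / $total) : 0;
}
```

The background process would call `processWithProgress()`, and the page polled by the browser would only call `readProgress()` and print the number.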
I use this approach. It isn't AJAX, it only flushes output as it goes, but it isn't too ugly.
I place an image
<img src='progress.gif' height=18 width=0 name=probar>
Then, each time a task finishes on the server, echo a line like this and flush:
echo "<script language='JavaScript'>\ndocument.probar.width=".(($sys["probar_width"]/$task_all)*$task_i).";\n</script>\n";
flush();
If your server (e.g. Apache) buffers or compresses output (e.g. gzip is enabled), it won't work well.

Executing functions parallelly in PHP

Can PHP call a function and not wait for it to return? So something like this:
function callback($pause, $arg) {
sleep($pause);
echo $arg, "\n";
}
header('Content-Type: text/plain');
fast_call_user_func_array('callback', array(3, 'three'));
fast_call_user_func_array('callback', array(2, 'two'));
fast_call_user_func_array('callback', array(1, 'one'));
would output
one (after 1 second)
two (after 2 seconds)
three (after 3 seconds)
rather than
three (after 3 seconds)
two (after 3 + 2 = 5 seconds)
one (after 3 + 2 + 1 = 6 seconds)
Main script is intended to be run as a permanent process (TCP server). callback() function would receive data from client, execute external PHP script and then do something based on other arguments that are passed to callback(). The problem is that main script must not wait for external PHP script to finish. Result of external script is important, so exec('php -f file.php &') is not an option.
Edit:
Many have recommended to take a look at PCNTL, so it seems that such functionality can be achieved. PCNTL is not available in Windows, and I don't have an access to a Linux machine right now, so I can't test it, but if so many people have advised it, then it should do the trick :)
Thanks, everyone!
On Unix platforms you can enable the PCNTL functions, and use pcntl_fork to fork the process and run your jobs in child processes.
Something like:
function fast_call_user_func_array($func, $args) {
if (pcntl_fork() == 0) {
// Child: run the callback, then exit instead of continuing the parent's code
call_user_func_array($func, $args);
exit(0);
}
}
Once you call pcntl_fork, two processes will execute your code from the same position. The parent process will get the child's PID returned from pcntl_fork, while the child process will get 0. (If there's an error, pcntl_fork returns -1 in the parent and no child is created, which is worth checking for in production code.)
You can check out PHP Process Control:
http://us.php.net/manual/en/intro.pcntl.php
Note: This is not threading, but the handling of separate processes. There is more overhead attached.
Wouldn't it solve your problem to fork, keeping the parent process free for other connections and actions? See http://www.php.net/pcntl_fork. If you need an answer back, you could listen on a socket in the parent and write to it from the child. A simple while(true) loop with a read would probably do, and you probably already have that basic functionality if you run a permanent TCP server. Another option would be to keep track of your child process IDs, keep an accessible store somewhere (file, database, memcached, etc.), call pcntl_wait in the main process with WNOHANG to check which process has exited, and retrieve the data from the store.
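The last option could look roughly like this. It is a sketch under several assumptions: the PCNTL extension is enabled, results are small enough to serialize, and temp files named after the child's PID serve as the "store". `runInChildren` is a hypothetical name.

```php
<?php
// Hypothetical helper: run each job in a forked child, keep its result in a
// temp file named after the child's PID, and collect results as children exit.
function runInChildren(array $jobs, callable $work): array
{
    $pending = []; // child pid => job key
    foreach ($jobs as $key => $job) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child: do the work, save the result, and exit without returning.
            file_put_contents(sys_get_temp_dir() . '/job.' . getmypid(), serialize($work($job)));
            exit(0);
        }
        $pending[$pid] = $key;
    }
    $results = [];
    while ($pending) {
        // WNOHANG makes this return 0 straight away if no child has exited yet.
        $pid = pcntl_waitpid(-1, $status, WNOHANG);
        if ($pid > 0 && isset($pending[$pid])) {
            $file = sys_get_temp_dir() . '/job.' . $pid;
            $results[$pending[$pid]] = unserialize(file_get_contents($file));
            unlink($file);
            unset($pending[$pid]);
        } elseif ($pid === -1) {
            break; // no children left to wait for
        } else {
            usleep(50000); // the parent is free to do other work here
        }
    }
    return $results;
}
```

In a long-running TCP server, the polling part would sit inside the main loop rather than in its own while loop, so that new connections are still accepted while children run.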
You can get something like threading in PHP with pcntl_fork, which runs the work in a separate process.
http://ca.php.net/manual/en/function.pcntl-fork.php
I have never used this myself, but there are some good examples of how to use it on php.net.
PHP doesn't have this functionality as far as I know
You can emulate the function using a different technique, like this one:
Parallel functions in PHP
PHP does not support multi-threading, so there's no other option than taking advantage of the OS or the web server multi processing capabilities. Note that actually you can fetch both the result and output of exec:
string exec ( string $command [, array &$output [, int &$return_var ]] )
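For instance, a minimal sketch using a placeholder shell command (note that exec blocks until the command has finished, which is exactly the wait the question is trying to avoid):

```php
<?php
// Run a command, wait for it, and capture both its output lines and exit status.
$output = [];
$exitCode = null;
$lastLine = exec('echo first; echo second', $output, $exitCode);

// $output holds one array element per output line, $exitCode the exit status,
// and the return value of exec() is the last line of output.
printf("last line: %s, lines: %d, exit code: %d\n", $lastLine, count($output), $exitCode);
```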
You can, at least, prevent the parent process from hanging until the child process is done by ignoring the child signals using pcntl_signal(SIGCHLD, SIG_IGN).
So, let's say you want to fork a process and execute another PHP function that takes a while without making the parent wait for it to finish (since you want the main process to finish in a timely manner):
pcntl_signal(SIGCHLD, SIG_IGN);
$pid = pcntl_fork();
if ($pid < 0) {
exit(0);
} elseif (!$pid) {
my_slow_function();
exit(0);
}
// Parent keeps executing and finishes before the child does
If you want to execute a slow external script as the child process, pcntl_exec is handy:
$script = array('/path/to/my/script'); // E.g. /home/my_user/my_script.php
pcntl_exec('/path/to/program/executable',$script); // E.g. /usr/bin/php

PHP: How to return information to a waiting script and continue processing

Suppose there are two scripts, Requester.php and Provider.php. Requester requires processing from Provider and makes an HTTP request to it (Provider.php?data="data"). In this situation, Provider quickly finds the answer, but to maintain the system it must then perform various updates throughout the database. Is there a way to immediately return the value to Requester, and then continue processing in Provider?
Pseudocode
Provider.php
{
$answer = getAnswer($_GET['data']);
echo $answer;
//SIGNAL TO REQUESTER THAT WE ARE FINISHED
processDBUpdates();
return;
}
You can flush the output buffer with the flush() command.
Read the comments in the PHP manual for more info
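A sketch of how that might look in Provider.php. `respondThenContinue` is a hypothetical helper, and whether the requester really gets its answer early depends on the server setup: `fastcgi_finish_request()` only exists under PHP-FPM, and a plain flush() may leave the connection open until the script ends.

```php
<?php
// Hypothetical helper: send $answer to the client, then keep working.
function respondThenContinue(string $answer, callable $afterwards): void
{
    ignore_user_abort(true); // keep running even if the client disconnects
    echo $answer;
    if (function_exists('fastcgi_finish_request')) {
        // Under PHP-FPM: flushes the response and finishes the request.
        fastcgi_finish_request();
    } else {
        // Elsewhere, flush() only pushes the output towards the client.
        flush();
    }
    $afterwards(); // e.g. the slow database updates
}
```

With the question's names, the call would be along the lines of `respondThenContinue(getAnswer($_GET['data']), 'processDBUpdates');`.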
I use this code for running a process in the background (works on Linux).
The process runs with its output redirected to a file.
That way, if I need to display status on the process, it's just a matter of writing a small amount of code to read and display the contents of the output file.
I like this approach because it means you can completely close the browser and easily come back later to check on the status.
You basically want to signal the end of 1 process (return to the original Requester.php) and spawn a new process (finish Provider.php). There is probably a more elegant way to pull this off, but I've managed this a couple different ways. All of them basically result in exec-ing a command in order to shell off the second process.
adding the following > /dev/null 2>&1 & to the end of your command will allow it to run in the background without inhibiting the actual execution of your current script
Something like the following may work for you:
exec("wget -O - \"$url\" > /dev/null 2>&1 &");
-- though you could do it as a command line PHP process as well.
You could also save the information that needs to be processed and handle the remaining processing on a cron job that re-creates the same sort of functionality without the need to exec.
I think you'll need the Provider to send the data (be sure to flush), and then on the Requester use fopen/fread to read an expected amount of data, so you can drop the connection to the Provider and continue. If you don't specify an amount of data to expect, I would think the Requester would sit there waiting for the Provider to close the connection, which probably doesn't happen until the end of its run (i.e. once all the secondary, work-intensive tasks are complete). You'll need to try out a few POCs.
Good luck.
Split the Provider in two: ProviderCore and ProviderInterface. In ProviderInterface just do the "quick and easy" part, also save a flag in database that the recent request hasn't been processed yet. Run ProviderCore as a cron job that searches for that flag and completes processing. If there's nothing to do, ProviderCore will terminate and retry in (say) 2 minutes.
I'm going out on a limb here, but perhaps you should try cURL or use a socket to update the requester?
You could start another php process in Provider.php using pcntl_fork()
Provider.php
{
// Fork process
$pid = pcntl_fork();
// You are now running both a daemon process and the parent process
// through the rest of the code below
if ($pid > 0) {
// PARENT Process
$answer = getAnswer($_GET['data']);
echo $answer;
//SIGNAL TO REQUESTER THAT WE ARE FINISHED
return;
}
if ($pid == 0) {
// DAEMON Process
processDBUpdates();
return;
}
// If you get here the daemon process failed to start
handleDaemonErrorCondition();
return;
}
