I tried to run a massive update of field values through an API and ran into the maximum execution time for my PHP script.
So I divided the job into smaller tasks to run them asynchronously as smaller jobs...
Asynchronous PHP calls?
I found this post and it looks about right, but the comments are a little off-putting... Will using curl to run external script files prevent the calling script from hitting the maximum execution time, or will curl still wait for a response from the server and kill my page?
The question really is: How do you do asynchronous jobs in PHP? Something like Ajax.
EDIT:
There is a project management tool which has lots of rows of data.
I am using this tool's API to access the rows of data and display them on my page.
The user using my tool will select multiple rows of data with a checkbox, and type a new value into a box.
The user will then press an "update row values" button which runs an update script.
This update script divides the hundreds or thousands of items possibly selected into groups of 100.
At this point I was going to use some asynchronous method to contact the project management tool and update all 100 items.
Because updating those items could take the remote server a long time, I need to make sure that my original page splitting up those jobs is not left waiting on each operation, so that it can fire off more update requests and tell my user: "Okay, the update is currently happening; it may take a while and we'll send an email once it's complete."
$step          = 100;
$itemCount     = GetItemCountByAppId( $appId );
$loopsRequired = ceil( $itemCount / $step );

$process = array();
for( $a = 0; $a < $loopsRequired; $a++ )
{
    $items = GetItemsByAppId( $appId, array(
        "amount" => $step,
        "offset" => ( $step * $a )
    ) );

    foreach( $items[ "items" ] as $key => $item )
    {
        foreach( $fieldsGroup as $fieldId => $fieldValues )
        {
            $itemId = $item->__attributes[ "item_id" ];
            /*array_push( $process, array(
                "itemId"  => $itemId,
                "fieldId" => $fieldId,
            ) );*/

            // This Update function is actually calling the server and I assume it must be
            // waiting for a response... thus my code times out after 30 secs of execution
            UpdateFieldValue( $itemId, $fieldId, $fieldValues );
        }
    }

    //curl_post_async($url, $params);
}
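For reference, the commented-out curl_post_async() call could be a simple "fire-and-forget" request: write the HTTP request and close the connection without reading the response, so the calling page is not held up. This is only a sketch using fsockopen(); the worker script it posts to is assumed to call ignore_user_abort(true) and set its own time limit, otherwise it may stop when the caller disconnects.
function curl_post_async($url, array $params)
{
    $parts = parse_url($url);
    $body  = http_build_query($params);
    $port  = isset($parts['port']) ? $parts['port'] : 80;
    $path  = isset($parts['path']) ? $parts['path'] : '/';

    $fp = fsockopen($parts['host'], $port, $errno, $errstr, 5);
    if (!$fp) {
        return false;
    }

    $out  = "POST $path HTTP/1.1\r\n";
    $out .= "Host: " . $parts['host'] . "\r\n";
    $out .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $out .= "Content-Length: " . strlen($body) . "\r\n";
    $out .= "Connection: Close\r\n\r\n";
    $out .= $body;

    fwrite($fp, $out);
    fclose($fp); // do not wait for or read the response
    return true;
}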
If you are using PHP-CLI, try threads (the pthreads extension), or pcntl_fork() for the non-thread-safe version.
Depending on how you implement it, asynchronous PHP might be used to decouple the web request from the processing and therefore isolate the web request from any timeout in the processing (but you could do the same thing within a single thread).
Will breaking the task into smaller concurrent parts make it run faster? Probably not - usually this will extend the total time it takes for the job to complete; about the only time this is not the case is when you have very large processing capacity and can distribute the task effectively (e.g. map-reduce).
Are HTTP calls (curl) an efficient way to distribute work like this? No. There are other methods, including synchronous and asynchronous messaging, batch processing, process forking, threads... each with their own benefits and complications - and we don't know what problem you are actually trying to solve.
So even before we get to your specific questions, this does not look like a good strategy.
Will using curl to run external script files prevent the caller file triggering maximum execution time
It will be constrained by whatever timeouts are configured on the target server - if that's the same server as the invoking script, then it will be the same timeouts.
will the curl still wait for a response from the server and kill my page?
I don't know what you're asking here - it rather implies that there are functional dependencies you've not told us about.
It sounds like you've picked a solution and are now trying to make it fit your problem.
Related
I have a foreach in CakePHP that processes products from a distributor, but the thing is the lists have up to 200 products, and each product can have 3 big pictures with 2 resizes each.
So I have in total about 1200 big actions - too much for one request.
I broke the foreach after every 10 products, removing them from the array and redirecting to the same page. But after a while I get a redirect loop.
Any ideas on how to avoid this?
If I add another page to this redirect frenzy, will it work?
Does the redirect loop appear only when redirecting to the same page?
The thing is, the loop will end, but the browser doesn't know that.
$this->data = $this->Session->read('Parser.data');
$limit = 0;
foreach ($this->data as $key => $data):
    $limit++;
    if ($limit == 4)
        $this->redirect($this->here);
    ...
    $this->Session->delete('Parser.data.' . $key);
endforeach;
$this->redirect(array('controller' => 'parser', 'action' => 'index')); // if $this->data is empty it redirects to the upload page
The server works with any number of records from what I have tested, but I have this action along these lines:
$this->getImage(WWW_ROOT . $folder . DS, $new_path, $image['path']);
which looks like this:
protected function getImage($folder = null, $path = null, $from = null) {
    if (isset($from) && !empty($from))
        file_put_contents($folder . $path, file_get_contents($from));
}
This loads up the server's memory and crashes.
That is why I have to break the foreach into several passes.
I also tried other functions to fetch the images, such as cURL, but with the same results!
Let me copy my answer from another very similar question:
Never use URLs to do this kind of task; it is simply plain wrong, insecure, and can cause your script to die or the server to stop responding.
Let's say you have 10000 users and a script runtime of 30 sec. It is very likely that the script times out before it finishes, and you end up with only part of your users processed. The other scenario, a very high or infinite script runtime, can lock up your server. Depending on the script or DB actions, it might put the server under high load, and users who visit the site while the script is running will encounter a horribly slow or unresponsive site.
Also, you can't really run a loop on a single URL; well, you could redirect from one to another that does the limit and offset thing to simulate a loop over the 100000 users. If you don't loop over the records but fetch all 100000 at the same time, it's likely your script dies from running out of memory.
You should create a shell that processes the users in a loop and always just processes batches of, for example, 10, 50 or 100 users.
When executing your shell I recommend running it together with the "nice" command to limit the amount of CPU the shell is allowed to use, preventing it from taking 100% CPU and keeping your site responsive.
Look at creating a shell
and setting up a cron in Cake.
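For illustration, a rough sketch of what such a batch-processing shell could boil down to; fetch_batch() and process_product() are placeholders for your own model calls, and the cron line in the comment shows the "nice" usage:
<?php
// process_products.php - sketch of a CLI batch worker; run it from cron, e.g.
//   */5 * * * * nice -n 19 php /path/to/process_products.php
set_time_limit(0);        // the CLI normally has no limit, but be explicit
$batchSize = 50;

while (true) {
    $products = fetch_batch($batchSize);    // placeholder: next unprocessed rows
    if (empty($products)) {
        break;                              // nothing left to do
    }
    foreach ($products as $product) {
        process_product($product);          // placeholder: download/resize images, save
    }
    gc_collect_cycles();                    // free memory between batches
}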
So I have this function for making non-blocking curl requests. It works fine on what I've tested so far (small amounts of requests). But I need this to scale up to thousands of requests (maybe max 10,000). My issue is that I don't want to run into issues with too many parallel requests running at once.
What would you suggest to rate-limit the requests? usleep()? Requests in batches? The function is below:
function poly_curl($requests){
    $queue = curl_multi_init();
    $curl_array = array();
    $count = 0;

    // Create a handle for each request and add it to the multi handle
    foreach($requests as $request)
    {
        $curl_array[$count] = curl_init($request);
        curl_setopt($curl_array[$count], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($queue, $curl_array[$count]);
        $count++;
    }

    // Run all requests; note this loop busy-waits until every handle is done
    $running = NULL;
    do {
        curl_multi_exec($queue, $running);
    } while($running > 0);

    // Collect the responses
    $res = array();
    $count = 0;
    foreach($requests as $request)
    {
        $res[$count] = curl_multi_getcontent($curl_array[$count]);
        $count++;
    }

    // Clean up
    $count = 0;
    foreach($requests as $request){
        curl_multi_remove_handle($queue, $curl_array[$count]);
        $count++;
    }
    curl_multi_close($queue);

    return $res;
}
I think curl_multi_exec is bad for this purpose, because even if you use batches in groups of 100, 99 requests could be finished and you would still have to wait for the last request to complete.
But you need 100 parallel requests, and when one finishes another should be started immediately. So you cannot use curl_multi_exec at all.
I would use a normal producer-consumer algorithm with multiple (a constant number of) consumers, each consumer processing only one URL at a time. For example php-resque and COUNT=100 php resque.php
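For illustration, a rough sketch of that setup with the chrisboulton/php-resque library; the exact bootstrap and the name of the worker script vary by version, so treat this as a sketch rather than copy-paste code:
<?php
require 'vendor/autoload.php';

Resque::setBackend('127.0.0.1:6379');

// Producer: one job per URL, so each consumer handles exactly one URL at a time.
foreach ($urls as $url) {
    Resque::enqueue('urls', 'FetchUrlJob', array('url' => $url));
}

// Consumer: a job class with a perform() method.
class FetchUrlJob
{
    public function perform()
    {
        $response = file_get_contents($this->args['url']);
        // ... handle the response ...
    }
}

// Then start the workers, e.g.: QUEUE=urls COUNT=100 php resque.php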
You may want to implement something called Exponential Backoff (Wikipedia).
Basically, it is an algorithm that allows you to dynamically scale your processes depending on some feedback.
You define a rate in your application, and on the first time-out, error, or anything you decide, you decrease this rate until the requests finish again.
You may implement it easily using the HTTP response code, for example.
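A minimal sketch of that idea; send_request() is a placeholder for your own curl call returning the HTTP status code and body, and the doubling delay is the backoff:
function request_with_backoff($url, $maxAttempts = 5)
{
    $delay = 1; // seconds
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        list($code, $body) = send_request($url);   // placeholder helper
        if ($code >= 200 && $code < 300) {
            return $body;                          // success
        }
        if ($code == 429 || $code >= 500) {        // throttled or server error
            sleep($delay);
            $delay *= 2;                           // back off exponentially
            continue;
        }
        return false;                              // other 4xx: retrying won't help
    }
    return false;                                  // gave up after $maxAttempts
}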
The last time I was doing something like this, it involved downloading and "parsing" files. I was able to process only 4 subpages at a time, limited by a very weak processor (2 cores with HT). I ended up with two queues: one for waiting, one for in-process. Every time a task left the second queue, a new one was taken from the first.
It may sound complicated, but it ended up as two nested loops and simple count()s.
Btw, considering such a high rate I would think about using Node.js - for simplicity - or anything more non-blocking and more suitable for daemons than PHP. As long as threads are a PHP weak point, it just doesn't fit here.
PS: nice & useful bit of code, thanks.
We used to face the same problem with C++ connection pooling code. The approach in those days involved some serious analysis.
But the essence was that we created a pool, and requests would get processed depending on the number of available slots. What we also did was assign a maximum number of connection pools [this was determined by testing].
What you really need is a method to determine how many requests are being processed and to put a limit on it. In your case that is $count.
Just compare $count to a maximum value [say, $max] and stop there. Define the value depending on the system the program runs on; $max could be hardcoded or dynamic.
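As a concrete illustration of capping $count: the simplest variant keeps poly_curl() exactly as it is and feeds it chunks of at most $max requests. This is plain batching, so one slow request can still hold up its batch, as the other answer points out, but it does bound the number of parallel connections:
function poly_curl_limited(array $requests, $max = 100)
{
    $results = array();
    // Never hand poly_curl() more than $max handles at once.
    foreach (array_chunk($requests, $max) as $batch) {
        $results = array_merge($results, poly_curl($batch));
    }
    return $results;
}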
I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes somewhat like this:
I fetch a base URL and get all secondary URLs from this page; then for each secondary URL I fetch it, process the found page, download some photos (which takes quite a long time) and store this data in a database, then fetch the next URL and repeat the process.
In this process I think I am wasting some time fetching the secondary URL at the start of each iteration, so I am trying to fetch the next URLs in parallel while processing the first iteration.
The solution in my mind is to call, from the main process, a PHP script - say, a downloader - which will download all the URLs (with curl_multi or wget) and store them in some database.
My questions are:
How do I call such a downloader asynchronously? I don't want my main script to wait until the downloader completes.
Where can I store the downloaded data, such as in shared memory? Of course, anywhere other than the database.
Is there any chance that data gets corrupted while storing and retrieving, and how do I avoid this?
Also, please let me know if anyone has a better plan.
When I hear that someone uses curl_multi_exec, it usually turns out they just load it with, say, 100 URLs, then wait until all complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing so too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, and it really saves a lot of time, especially if you reuse already open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here of course ("small" is still quite a bit of code), but here's a simplified version of the main thing to give you the general idea:
public function launch() {
    $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
    $activeJobs = array();
    $running = 0;
    do {
        // pick jobs for free channels:
        while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
            // take free channel, (re)init curl handle and let
            // queued object set options
            $chId = key($freeChannels);
            if (empty($channels[$chId])) {
                $channels[$chId] = curl_init();
            }
            $job = array_pop($this->jobQueue);
            $job->init($channels[$chId]);
            curl_multi_add_handle($this->master, $channels[$chId]);
            $activeJobs[$chId] = $job;
            unset($freeChannels[$chId]);
        }
        $pending = count($activeJobs);

        // launch them:
        if ($pending > 0) {
            while(($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
            // poke it while it wants
            curl_multi_select($this->master);
            // wait for some activity, don't eat CPU
            while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                // some connection(s) finished, locate that job and run response handler:
                $pending--;
                $chId = array_search($info['handle'], $channels);
                $content = curl_multi_getcontent($channels[$chId]);
                curl_multi_remove_handle($this->master, $channels[$chId]);
                $freeChannels[$chId] = NULL;
                // free up this channel
                if ( !array_key_exists($chId, $activeJobs) ) {
                    // impossible, but...
                    continue;
                }
                $activeJobs[$chId]->onComplete($content);
                unset($activeJobs[$chId]);
            }
        }
    } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version the $jobs are actually instances of a separate class, not of controllers or models. They just handle setting cURL options, parsing the response, and calling a given onComplete callback.
With this structure new requests will start as soon as something out of the pool finishes.
Of course it doesn't really save you if it's not just the retrieving that takes time but the processing as well... and it isn't true parallel handling. But I still hope it helps. :)
P.S. This did the trick for me. :) A once 8-hour job now completes in 3-4 minutes using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP it rarely works exactly as supposed... That was like "OK, hope it finishes in at least an hour... Wha... wait... Already?! 8-O"
You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client side and using Javascript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to; it then does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs, you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON.
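A sketch of what the caller side could look like; mysite.tld/huntergatherer is of course the hypothetical endpoint from above:
$urls = array('http://example.com/a', 'http://example.com/b');

$ch = curl_init('http://mysite.tld/huntergatherer');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode(array('urls' => $urls)));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);

$results = json_decode($response, true); // one entry per fetched URL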
Aside from the curl multi solution, another one is just having a batch of Gearman workers. If you go this route, I've found supervisord a nice way to start a load of daemon workers.
Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO; a plain-streams sketch follows below)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, Ruby EventMachine or similar tools are quite great for doing this stuff, the things I mentioned make it fairly easy in PHP too.
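For the non-blocking streams item, a minimal plain-streams sketch (no PHP-MIO) that multiplexes several HTTP responses with stream_select() so no single slow host blocks the rest; the hosts and the bare GET requests are just placeholders:
$hosts   = array('example.com', 'example.org');
$streams = array();

foreach ($hosts as $host) {
    $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 5);
    if ($s) {
        fwrite($s, "GET / HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
        stream_set_blocking($s, false);
        $streams[$host] = $s;
    }
}

$responses = array_fill_keys(array_keys($streams), '');

while (!empty($streams)) {
    $read  = $streams;
    $write = $except = null;
    if (stream_select($read, $write, $except, 5) === false) {
        break; // select error
    }
    foreach ($read as $host => $s) {
        $chunk = fread($s, 8192);
        if ($chunk === '' || $chunk === false) { // EOF: this response is complete
            fclose($s);
            unset($streams[$host]);
        } else {
            $responses[$host] .= $chunk;
        }
    }
}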
Try executing python-pycurl scripts from PHP. It's easier and faster than PHP curl.
I have a script that takes very long to execute, so when I run it, it hits the max execution time on my web server and ends up timing out.
To illustrate that, imagine I have a for loop that makes some pretty intensive manipulation one million times. How could I spread this loop execution over several parts so that I don't hit the max execution time of my web server?
Many thanks,
If you have an application that is going to loop a known number of times (i.e. you are sure that it's going to finish some time) you can increase time limit inside the loop:
foreach ($data as $row) {
set_time_limit(10);
// do your stuff here
}
This solution will protect you from having one run-away iteration, but will let your whole script run undisturbed as long as you need.
The best solution is to use set_time_limit() (http://php.net/manual/en/function.set-time-limit.php) to change the timeout. Otherwise, you can use redirects to send the browser to an updated URL before the timeout hits:
$threshold = 10; // seconds to run before redirecting
$start     = microtime(true);
$i         = isset( $_GET['i'] ) ? (int) $_GET['i'] : 0;

for( ; $i < 10000000; $i++ )
{
    if( microtime(true) - $start > $threshold )
    {
        // Hand the current position to a fresh request before we time out
        header('Location: http://www.example.com/?i='.$i);
        exit;
    }

    // Your code
}
The browser will only follow a limited number of redirects before it stops; beyond that you're better off using JavaScript to force a page reload.
I once used a technique where I split the work from one file into three parts. It was just an array of 120,000 elements with an intensive operation. I created a splitter script which stored the arrays in a database, 40,000 elements each. Then I created an HTML file with a redirect to the first PHP file to compute the first 40,000 elements. After computing the first 40,000 elements I had another HTML forward to the next PHP file, and so on.
Not very elegant, but it worked :-)
If you have the right permissions on your hosting server, you could use the php interpreter to execute a php script and have it run in the background.
See Asynchronous shell exec in PHP.
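A minimal sketch of that approach; /path/to/worker.php is a placeholder for your own long-running script:
// The web request returns immediately; the worker keeps running under the
// PHP CLI, which has no max_execution_time by default.
exec('nohup php /path/to/worker.php > /dev/null 2>&1 &');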
If you are running a script that needs to execute for an unknown amount of time, you can use:
set_time_limit(0);
If possible, you can make the script handle a portion of the wanted operations. Once it completes, say, 10%, you call the script again via AJAX to execute the next 10%. But there are circumstances where this is not an ideal solution; it really depends on what you are doing.
I used this method to create a web-based crawler which only ran on my computer, for instance. If it had to do all the operations at once, it would time out as well. So it was split into 200 "tasks", each called via AJAX once the previous one completes. It works perfectly, and it's been over a year since it started running (crawling?).
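A rough sketch of the server side of that chunked pattern, with process_chunk() standing in for the real work; the client-side JavaScript just keeps re-requesting the URL with the returned offset until done is true:
$offset    = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$chunkSize = 100;

$processed = process_chunk($offset, $chunkSize); // placeholder: returns rows handled

header('Content-Type: application/json');
echo json_encode(array(
    'done'   => ($processed < $chunkSize),  // fewer rows than requested: finished
    'offset' => $offset + $processed,       // the client sends this back next call
));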
I have a list of data that needs to be processed. The way it works right now is this:
A user clicks a process button.
The PHP code takes the first item that needs to be processed, takes 15-25 secs to process it, moves on to the next item, and so on.
This takes way too long. What I'd like instead is that:
The user clicks the process button.
A PHP script takes the first item and starts to process it.
Simultaneously another instance of the script takes the next item and processes it.
And so on, so around 5-6 of the items are being processed simultaneously and we get 6 items processed in 15-25 secs instead of just one.
Is something like this possible?
I was thinking that I'd use CRON to launch an instance of the script every second. All items that need to be processed will be flagged as such in the MySQL database, so whenever an instance is launched through CRON, it will simply take the next item flagged to be processed and remove the flag.
Thoughts?
Edit: To clarify something, each 'item' is stored in a MySQL database table as a separate row. Whenever processing starts on an item, it is flagged as being processed in the DB, hence each new instance will simply grab the next row which is not being processed and process it. Hence I don't have to supply the items as command line arguments.
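For illustration, one way each cron-launched instance could claim the next row without two instances grabbing the same one; the items table with a status column ('pending'/'processing'/'done') and a claimed_by column is an assumed schema, and PDO is used for brevity:
$pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$token = uniqid('worker_', true); // identifies this instance

// The single UPDATE is atomic, so only one instance can flag a given row.
$stmt = $pdo->prepare(
    "UPDATE items SET status = 'processing', claimed_by = ?
     WHERE status = 'pending' ORDER BY id LIMIT 1"
);
$stmt->execute(array($token));

// Find out which row (if any) we just claimed.
$stmt = $pdo->prepare(
    "SELECT * FROM items WHERE claimed_by = ? AND status = 'processing' LIMIT 1"
);
$stmt->execute(array($token));
$item = $stmt->fetch(PDO::FETCH_ASSOC);

if ($item) {
    // ... the 15-25 seconds of processing happen here ...
    $pdo->prepare("UPDATE items SET status = 'done' WHERE id = ?")
        ->execute(array($item['id']));
}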
Here's one solution - not the greatest, but it will work fine on Linux:
Split the processing PHP into a separate CLI script in which:
The command line inputs include `$id` and `$item`
The script writes its PID to a file in `/tmp/$id.$item.pid`
The script echos results as XML or something that can be read into PHP to stdout
When finished the script deletes the `/tmp/$id.$item.pid` file
Your master script (presumably on your webserver) would do:
`exec("nohup php myprocessing.php $id $item > /tmp/$id.$item.xml");` for each item
Poll the `/tmp/$id.$item.pid` files until all are deleted (a sleep/check poll is enough)
If they are never deleted kill all the processing scripts and report failure
If successful, read from `/tmp/$id.$item.xml` to format/output to the user
Delete the XML files if you don't want to cache for later use
A backgrounded, nohup-started application will run independently of the script that started it.
This interested me sufficiently that I decided to write a POC.
test.php
<?php
$dir = realpath(dirname(__FILE__));
$start = time();

// Time in seconds after which we give up and kill everything
$timeout = 25;
// The unique identifier for the request
$id = uniqid();
// Our "items" which would be supplied by the user
$items = array("foo", "bar", "0xdeadbeef");

// We exec a nohup command that is backgrounded which returns immediately
foreach ($items as $item) {
    exec("nohup php proc.php $id $item > $dir/proc.$id.$item.out &");
}

echo "<pre>";

// Run until timeout or all processing has finished
while (time() - $start < $timeout)
{
    echo (time() - $start), " seconds\n";
    clearstatcache(); // Required since PHP will cache for file_exists
    $running = array();
    foreach ($items as $item)
    {
        // If the pid file still exists the process is still running
        if (file_exists("$dir/proc.$id.$item.pid")) {
            $running[] = $item;
        }
    }
    if (empty($running)) break;
    echo implode(',', $running), " running\n";
    flush();
    sleep(1);
}

// Clean up if we timed out
if (!empty($running)) {
    clearstatcache();
    foreach ($items as $item) {
        // Kill the process of anything still running (i.e. that has a pid file)
        if (file_exists("$dir/proc.$id.$item.pid")
            && $pid = file_get_contents("$dir/proc.$id.$item.pid")) {
            posix_kill((int) $pid, 9);
            unlink("$dir/proc.$id.$item.pid");
            // Would want to log this in the real world
            echo "Failed to process: ", $item, " pid ", $pid, "\n";
        }
        // delete the useless data
        unlink("$dir/proc.$id.$item.out");
    }
} else {
    echo "Successfully processed all items in ", time() - $start, " seconds.\n";
    foreach ($items as $item) {
        // Grab the processed data and delete the file
        echo(file_get_contents("$dir/proc.$id.$item.out"));
        unlink("$dir/proc.$id.$item.out");
    }
}
echo "</pre>";
?>
proc.php
<?php
$dir = realpath(dirname(__FILE__));
$id = $argv[1];
$item = $argv[2];

// Write out our pid file
file_put_contents("$dir/proc.$id.$item.pid", posix_getpid());

for ($i = 0; $i < 80; ++$i)
{
    echo $item, ':', $i, "\n";
    usleep(250000);
}

// Remove our pid file to say we're done processing
unlink("$dir/proc.$id.$item.pid");
?>
Put test.php and proc.php in the same folder on your server, load test.php and enjoy.
You will of course need nohup (Unix) and the PHP CLI to get this to work.
Lots of fun, I may find a use for it later.
Use an external work queue like Beanstalkd which your PHP script writes a bunch of jobs to. You then have as many worker processes as you like pulling jobs from beanstalkd and processing them as fast as possible; you can spin up as many workers as you have memory / CPU for. Your job body should contain as little information as possible, maybe just some IDs which you hit the DB with. Beanstalkd has a slew of client APIs and itself has a very basic protocol - think memcached.
We use beanstalkd to process all of our background jobs, and I love it. It's easy to use and very fast.
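For illustration, a sketch with the pda/pheanstalk client (3.x-style API shown; newer versions construct the client differently, so check the library's README):
use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');

// Producer (the web request): enqueue lightweight job bodies, e.g. just IDs.
foreach ($itemIds as $id) {
    $pheanstalk->useTube('process-items')->put(json_encode(array('id' => $id)));
}

// Worker (a separate long-running CLI process; start as many copies as you like):
while (true) {
    $job  = $pheanstalk->watch('process-items')->ignore('default')->reserve();
    $data = json_decode($job->getData(), true);
    // ... load the row for $data['id'] from the DB and process it ...
    $pheanstalk->delete($job);
}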
There is no multithreading in PHP, however you can use fork.
php.net:pcntl-fork
Or you could execute a system() command and start another process which is multithreaded.
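A minimal pcntl_fork() sketch (CLI only, and the pcntl extension must be available): the parent splits the work across a fixed number of children and waits for them all.
$items   = range(1, 100);  // whatever needs processing
$workers = 4;
$chunks  = array_chunk($items, (int) ceil(count($items) / $workers));
$pids    = array();

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child: process its own chunk, then exit.
        foreach ($chunk as $item) {
            // ... process $item ...
        }
        exit(0);
    }
    $pids[] = $pid; // parent keeps the child's pid
}

// Parent: wait for all children to finish.
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}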
Can you implement threading in JavaScript on the client side? It seems to me I've seen a JavaScript library (from Google perhaps?) that implements it. Google it and I'm sure you'll find something. I've never done it, but I know it's possible. Anyway, your client-side JavaScript could activate (via Ajax) a PHP script once for each item in separate threads. That might be easier than trying to do it all on the server side.
-don
If you are running a high-traffic PHP server, you are INSANE if you do not use the Alternative PHP Cache: http://php.net/manual/en/book.apc.php . You do not have to make code modifications to run APC.
Another useful technique that can work along with APC is using the Smarty template system, which allows you to cache output so that pages do not have to be rebuilt.
To solve this problem, I've used two different products: Gearman and RabbitMQ.
The benefit of putting your jobs into some sort of queuing software like Gearman or Rabbit is that if you have multiple machines, they can all participate in processing items off the queue(s).
Gearman is easier to set up, so I'd suggest poking around with it a bit first. If you find you need something more heavy-duty with queue robustness, look into RabbitMQ.
http://www.danga.com/gearman/
http://pear.php.net/package/Net_Gearman (PEAR library)
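For illustration, a rough sketch with the pecl/gearman extension; the function name 'process_item' and the ID payload are placeholders:
// Client side (e.g. inside the web request):
$client = new GearmanClient();
$client->addServer(); // 127.0.0.1:4730 by default
foreach ($itemIds as $id) {
    // doBackground() returns immediately; the workers do the real work.
    $client->doBackground('process_item', (string) $id);
}

// Worker side (a separate CLI process; start several of them):
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction('process_item', function (GearmanJob $job) {
    $id = $job->workload();
    // ... the 15-25 seconds of processing for this item ...
});
while ($worker->work());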
You can use pcntl_fork() and family to fork a process - however you may need something like IPC to communicate back to the parent process that the child process (the one you forked) is finished.
You could have them write to shared memory, for example via memcache or a DB.
You could also have the child process write the completed data to a file that the parent process keeps checking - as each child process completes, the file is created/written to/updated, and the parent process can grab it, one at a time, and send them back to the caller/client.
The parent's job is to control the queue, to make sure the same data isn't processed twice, and also to sanity-check the children (better to kill that runaway process and start over... etc).
Something else to keep in mind - on Windows platforms you are going to be severely limited - I don't even think you have access to pcntl_* unless you compiled PHP with support for it.
Also, can you cache the data once it's been processed, or is it unique data every time? That would surely speed things up...?