Progress Bar when running function inside foreach loop - php

I have a foreach loop that calls a function to set values to an array. Sometimes it takes hours to complete depending on how many times it has to run thru the function to complete.
What I would like to have is a progress bar or at least a 1/1000 completed type progress indicator.
Is this possible? If so how could I implement this into my code? Would it be in the function or in the foreach loop? Been researching and found some examples using for and $i++ but I am not really sure how to implement that since I am already using a foreach loop.
Thanks much.
function scrape_amazon($links) {
//my code runs here to set all values in $ret array.
}
foreach($links as $link) {
$ret = scrape_amazon($link);
}

PHP probably isn't really the right tool for this task, however what you could do is:
Launch the slow code as a background process, and output progress to a file.
Have a PHP script that polls that file for progress information (either by page refresh or AJAX)
Launching the background process can be done in several ways, including:
Launch via cron every 60 seconds, and poll for new jobs spooled in some readable area
Launch via a fork/exec mechanism from a web page
Launch as a daemon at system startup
It will take some effort to avoid problems with multiple executions and/or overlap.

I use this, which well, not an ajax, do only flushing, but not so ugly.
I place an image
<img src='progress.gif' height=18 width=0 name=probar>
Then set on every event done on server a echo a line, then flush:
echo "<script language='JavaScript'>\ndocument.probar.width=".(($sys["probar_width"]/$task_all)*$task_i).";\n</script>\n";
flush();
If your server (eg. apache) use caching (eg. gzip is enabled) it won't work well.

Related

Single PHP script with multiple CRON jobs > optimize into > Single CRON?

I have a script that I run for multiple clients.
Same script, I'm just using a different GET variable to load the client credentials.
eg.
example.com/script.php?client=lego
example.com/script.php?client=nike
example.com/script.php?client=stackoverflow
I've setup multiple crons to hit the script at midnight, with each cron having a different client GET variable.
What would be the best way to run a single CRON but process all clients? So I don't need to setup a CRON each time for each client.
There can be various solutions but without knowing the code what comes to my mind is.
Delete all crons and setup just one.
example.com/script.php
Inside script.php wrap whatever you earlier had in a function, create an array of clients and call that function for every client by passing username. For example
<?php
// if you have lots of clients and script can exhaust time limit
ini_set('max_execution_time', 0);
$clients = ['lego', 'nike', 'stackoverflow'];
foreach ($clients as $client) {
myScript($client);
}
function myScript($client)
{
// Whatever you had in script.php earlier replacing $_GET['client'] with $client.
}
Hope it answers your question.

Downloading pages in parallel using PHP

I have to scrap a web site where i need to fetch multiple URLs and then process them one by one. The current process somewhat goes like this.
I fetch a base URL and get all secondary URLs from this page, then for each secondary url I fetch that URL, process found page, download some photos (which takes quite a long time) and store this data to database, then fetch next URL and repeat the process.
In this process, I think I am wasting some time in fetching secondary URL at the start of each iteration. So I am trying to fetch next URLs in parallel while processing first iteration.
The solution in my mind is, from main process call a PHP script, say downloader, which will download all the URL (with curl_multi or wget) and store them in some database.
My questions are
How to call such downloder asynchronously, I don't want my main script to wait till downloder completes.
Any location to store downloaded data, such as shared memory. Of course, other than database.
There any chances that data gets corrupt while storing and retrieving, how to avoid this?
Also, please guide me know if anyone have a better plan.
When I hear someone uses curl_multi_exec it usually turns out they just load it with, say, 100 urls, then wait when all complete, and then process them all, and then start over with the next 100 urls... Blame me, I was doing so too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, And it really saves a lot of time, especially if you reuse already open connections. I wrote a small library to handle queue of requests with callbacks; I'm not posting full version here of course ("small" is still quite a bit of code), but here's a simplified version of the main thing to give you the general idea:
public function launch() {
$channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
$activeJobs = array();
$running = 0;
do {
// pick jobs for free channels:
while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
// take free channel, (re)init curl handle and let
// queued object set options
$chId = key($freeChannels);
if (empty($channels[$chId])) {
$channels[$chId] = curl_init();
}
$job = array_pop($this->jobQueue);
$job->init($channels[$chId]);
curl_multi_add_handle($this->master, $channels[$chId]);
$activeJobs[$chId] = $job;
unset($freeChannels[$chId]);
}
$pending = count($activeJobs);
// launch them:
if ($pending > 0) {
while(($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
// poke it while it wants
curl_multi_select($this->master);
// wait for some activity, don't eat CPU
while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
// some connection(s) finished, locate that job and run response handler:
$pending--;
$chId = array_search($info['handle'], $channels);
$content = curl_multi_getcontent($channels[$chId]);
curl_multi_remove_handle($this->master, $channels[$chId]);
$freeChannels[$chId] = NULL;
// free up this channel
if ( !array_key_exists($chId, $activeJobs) ) {
// impossible, but...
continue;
}
$activeJobs[$chId]->onComplete($content);
unset($activeJobs[$chId]);
}
}
} while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version $jobs are actually of separate class, not instances of controllers or models. They just handle setting cURL options, parsing response and call a given callback onComplete.
With this structure new requests will start as soon as something out of the pool finishes.
Of course it doesn't really save you if not just retrieving takes time but processing as well... And it isn't a true parallel handling. But I still hope it helps. :)
P.S. did a trick for me. :) Once 8-hour job now completes in 3-4 mintues using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP it rarely works exactly as supposed... That was like "ok, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"
You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client side and using Javascript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to, then it does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON.
Aside from the curl multi solution, another one is just having a batch of gearman workers. If you go this route, I've found supervisord a nice way to start a load of deamon workers.
Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, ruby EventMachine or similar tools are quite great for doing this stuff, the things I mentioned make it fairly easy in PHP too.
Try execute from PHP, python-pycurl scripts. Easier, faster than PHP curl.

PHP spreading a script into multiple parts to avoid server timeout

I have a script that is very long to execute, so when i run it it hit the max execution time on my webserver and end up timing out.
To illustrate that imagine i have a for loop that make some pretty intensive manipulation one million time. How could i spread this loop execution in several parts so that i don t hit the max execution time of my Webserver?
Many thanks,
If you have an application that is going to loop a known number of times (i.e. you are sure that it's going to finish some time) you can increase time limit inside the loop:
foreach ($data as $row) {
set_time_limit(10);
// do your stuff here
}
This solution will protect you from having one run-away iteration, but will let your whole script run undisturbed as long as you need.
Best solution is to use http://php.net/manual/en/function.set-time-limit.php to change the timeout. Otherwise, you can use 301 redirects to send to an updated URL on a timeout.
$threshold = 10000;
$t = microtime();
$i = isset( $_GET['i'] ) ? $_GET['i'] : 0;
for( $i; $i < 10000000; $i++ )
{
if( microtime - $t > $threshold )
{
header('Location: http://www.example.com/?i='.$i);
exit;
}
// Your code
}
The browser will only respect a few redirects before it stops, you're better to use javascript to force a page reload.
I someday used a technique where I splitted the work from one file into three parts. It was just an array of 120.000 elements with intensive operation. I created a splitter script which stored the arrays in a database of the size of 40.000 each one. Then I created an HTML file with a redirect to the first PHP file to compute the first 40.000 elements. After computing the first 40.000 elments I had again a HTML forward to the next PHP file and so on.
Not very elegant, but it worked :-)
If you have the right permissions on your hosting server, you could use the php interpreter to execute a php script and have it run in the background.
See Asynchronous shell exec in PHP.
if you are running a script that needs to execute for unknown time, you can use:
set_time_limit(0);
If possible you can make the script so that it handles a portion of the wanted operations. Once it completes say 10%, you via AJAX call the script again to execute the next 10%. But there are circumstances where this is not an ideal solution, it really depends on what you are doing.
I used this method to create a web-based crawler which only ran on my computer for instance. If it had to do the operations at once it would time out as well. So it was split into 200 "tasks", each called via Ajax once the previous completes. Works perfectly, and it's been over a year since it started running (crawling?)

Is there a function similar to setTimeout() (JavaScript) for PHP?

The question sort of says it all - is there a function which does the same as the JavaScript function setTimeout() for PHP? I've searched php.net, and I can't seem to find any...
There is no way to delay execution of part of the code of in the current script. It wouldn't make much sense, either, as the processing of a PHP script takes place entirely on server side and you would just delay the overall execution of the script. There is sleep() but that will simply halt the process for a certain time.
You can, of course, schedule a PHP script to run at a specific time using cron jobs and the like.
There's the sleep function, which pauses the script for a determined amount of time.
See also usleep, time_nanosleep and time_sleep_until.
PHP isn't event driven, so a setTimeout doesn't make much sense. You can certainly mimic it and in fact, someone has written a Timer class you could use. But I would be careful before you start programming in this way on the server side in PHP.
A few things I'd like to note about timers in PHP:
1) Timers in PHP make sense when used in long-running scripts (daemons and, maybe, in CLI scripts). So if you're not developing that kind of application, then you don't need timers.
2) Timers can be blocking and non-blocking. If you're using sleep(), then it's a blocking timer, because your script just freezes for a specified amount of time.
For many tasks blocking timers are fine. For example, sending statistics every 10 seconds. It's ok to block the script:
while (true) {
sendStat();
sleep(10);
}
3) Non-blocking timers make sense only in event driven apps, like websocket-server. In such applications an event can occur at any time (e.g incoming connection), so you must not block your app with sleep() (obviously).
For this purposes there are event-loop libraries, like reactphp/event-loop, which allows you to handle multiple streams in a non-blocking fashion and also has timer/ interval feature.
4) Non-blocking timeouts in PHP are possible.
It can be implemented by means of stream_select() function with timeout parameter (see how it's implemented in reactphp/event-loop StreamSelectLoop::run()).
5) There are PHP extensions like libevent, libev, event which allow timers implementation (if you want to go hardcore)
Not really, but you could try the tick count function.
http://php.net/manual/en/class.evtimer.php is probably what you are looking for, you can have a function called during set intervals, similar to setInterval in javascript. it is a pecl extension, if you have whm/cpanel you can easily install it through the pecl software/extension installer page.
i hadn't noticed this question is from 2010 and the evtimer class started to be coded in 2012-2013. so as an update to an old question, there is now a class that can do this similar to javascripts settimeout/setinterval.
Warning: You should note that while the sleep command can make a PHP process hang, or "sleep" for a given amount of time, you'd generally implement visual delays within the user interface.
Since PHP is a server side language, merely writing its execution output (generally in the form of HTML) to a web server response: using sleep in this fashion will generally just stall or delay the response.
With that being said, sleep does have practical purposes. Delaying execution can be used to implement back off schemes, such as when retrying a request after a failed connection. Generally speaking, if you need to use a setTimeout in PHP, you're probably doing something wrong.
Solution: If you still want to implement setTimeout in PHP, to answer your question explicitly: Consider that setTimeout possesses two parameters, one which represents the function to run, and the other which represents the amount of time (in milliseconds). The following code would actually meet the requirements in your question:
<?php
// Build the setTimeout function.
// This is the important part.
function setTimeout($fn, $timeout){
// sleep for $timeout milliseconds.
sleep(($timeout/1000));
$fn();
}
// Some example function we want to run.
$someFunctionToExecute = function() {
echo 'The function executed!';
}
// This will run the function after a 3 second sleep.
// We're using the functional property of first-class functions
// to pass the function that we wish to execute.
setTimeout($someFunctionToExecute, 3000);
?>
The output of the above code will be three seconds of delay, followed by the following output:
The function executed!
if you need to make an action after you execute some php code you can do it with an echo
echo "Success.... <script>setTimeout(function(){alert('Hello')}, 3000);</script>";
so after a time in the client(browser) you can do something else, like a redirect to another php script for example or echo an alert
There is a Generator class available in PHP version > 5.5 which provides a function called yield that helps you pause and continue to next function.
generator-example.php
<?php
function myGeneratorFunction()
{
echo "One","\n";
yield;
echo "Two","\n";
yield;
echo "Three","\n";
yield;
}
// get our Generator object (remember, all generator function return
// a generator object, and a generator function is any function that
// uses the yield keyword)
$iterator = myGeneratorFunction();
OUTPUT
One
If you want to execute the code after the first yield you add these line
// get the current value of the iterator
$value = $iterator->current();
// get the next value of the iterator
$value = $iterator->next();
// and the value after that the next value of the iterator
// $value = $iterator->next();
Now you will get output
One
Two
If you minutely see the setTimeout() creates an event loop.
In PHP there are many libraries out there E.g amphp is a popular one that provides event loop to execute code asynchronously.
Javascript snippet
setTimeout(function () {
console.log('After timeout');
}, 1000);
console.log('Before timeout');
Converting above Javascript snippet to PHP using Amphp
Loop::run(function () {
Loop::delay(1000, function () {
echo date('H:i:s') . ' After timeout' . PHP_EOL;
});
echo date('H:i:s') . ' Before timeout' . PHP_EOL;
});
Check this Out!
<?php
set_time_limit(20);
while ($i<=10)
{
echo "i=$i ";
sleep(100);
$i++;
}
?>
Output:
i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9 i=10

Speeding up a PHP App

I have a list of data that needs to be processed. The way it works right now is this:
A user clicks a process button.
The PHP code takes the first item that needs to be processed, takes 15-25 secs to process it, moves on to the next item, and so on.
This takes way too long. What I'd like instead is that:
The user clicks the process button.
A PHP script takes the first item and starts to process it.
Simultaneously another instance of the script takes the next item and processes it.
And so on, so around 5-6 of the items are being process simultaneously and we get 6 items processed in 15-25 secs instead of just one.
Is something like this possible?
I was thinking that I use CRON to launch an instance of the script every second. All items that need to be processed will be flagged as such in the MySQL database, so whenever an instance is launched through CRON, it will simply take the next item flagged to be processed and remove the flag.
Thoughts?
Edit: To clarify something, each 'item' is stored in a mysql database table as seperate rows. Whenever processing starts on an item, it is flagged as being processed in the db, hence each new instance will simply grab the next row which is not being processed and process it. Hence I don't have to supply the items as command line arguments.
Here's one solution, not the greatest, but will work fine on Linux:
Split the processing PHP into a separate CLI scripts in which:
The command line inputs include `$id` and `$item`
The script writes its PID to a file in `/tmp/$id.$item.pid`
The script echos results as XML or something that can be read into PHP to stdout
When finished the script deletes the `/tmp/$id.$item.pid` file
Your master script (presumably on your webserver) would do:
`exec("nohup php myprocessing.php $id $item > /tmp/$id.$item.xml");` for each item
Poll the `/tmp/$id.$item.pid` files until all are deleted (sleep/check poll is enough)
If they are never deleted kill all the processing scripts and report failure
If successful read the from `/tmp/$id.$item.xml` for format/output to user
Delete the XML files if you don't want to cache for later use
A backgrounded nohup started application will run independent of the script that started it.
This interested me sufficiently that I decided to write a POC.
test.php
<?php
$dir = realpath(dirname(__FILE__));
$start = time();
// Time in seconds after which we give up and kill everything
$timeout = 25;
// The unique identifier for the request
$id = uniqid();
// Our "items" which would be supplied by the user
$items = array("foo", "bar", "0xdeadbeef");
// We exec a nohup command that is backgrounded which returns immediately
foreach ($items as $item) {
exec("nohup php proc.php $id $item > $dir/proc.$id.$item.out &");
}
echo "<pre>";
// Run until timeout or all processing has finished
while(time() - $start < $timeout)
{
echo (time() - $start), " seconds\n";
clearstatcache(); // Required since PHP will cache for file_exists
$running = array();
foreach($items as $item)
{
// If the pid file still exists the process is still running
if (file_exists("$dir/proc.$id.$item.pid")) {
$running[] = $item;
}
}
if (empty($running)) break;
echo implode($running, ','), " running\n";
flush();
sleep(1);
}
// Clean up if we timeout out
if (!empty($running)) {
clearstatcache();
foreach ($items as $item) {
// Kill process of anything still running (i.e. that has a pid file)
if(file_exists("$dir/proc.$id.$item.pid")
&& $pid = file_get_contents("$dir/proc.$id.$item.pid")) {
posix_kill($pid, 9);
unlink("$dir/proc.$id.$item.pid");
// Would want to log this in the real world
echo "Failed to process: ", $item, " pid ", $pid, "\n";
}
// delete the useless data
unlink("$dir/proc.$id.$item.out");
}
} else {
echo "Successfully processed all items in ", time() - $start, " seconds.\n";
foreach ($items as $item) {
// Grab the processed data and delete the file
echo(file_get_contents("$dir/proc.$id.$item.out"));
unlink("$dir/proc.$id.$item.out");
}
}
echo "</pre>";
?>
proc.php
<?php
$dir = realpath(dirname(__FILE__));
$id = $argv[1];
$item = $argv[2];
// Write out our pid file
file_put_contents("$dir/proc.$id.$item.pid", posix_getpid());
for($i=0;$i<80;++$i)
{
echo $item,':', $i, "\n";
usleep(250000);
}
// Remove our pid file to say we're done processing
unlink("proc.$id.$item.pid");
?>
Put test.php and proc.php in the same folder of your server, load test.php and enjoy.
You will of course need nohup (unix) and PHP cli to get this to work.
Lots of fun, I may find a use for it later.
Use an external workqueue like Beanstalkd which your PHP script writes a bunch of jobs too. You have as many worker processes pulling jobs from beanstalkd and processing them as fast as possible. You can spin up as many workers as you have memory / CPU. Your job body should contain as little information as possible, maybe just some IDs which you hit the DB with. beanstalkd has a slew of client APIs and itself has a very basic API, think memcached.
We use beanstalkd to process all of our background jobs, I love it. Easy to use, its very fast.
There is no multithreading in PHP, however you can use fork.
php.net:pcntl-fork
Or you could execute a system() command and start another process which is multithreaded.
can you implementing threading in javascript on the client side? seems to me i've seen a javascript library (from google perhaps?) that implements it. google it and i'm sure you'll find something. i've never done it, but i know its possible. anyway, your client-side javascript could activate (ajax) a php script once for each item in separate threads. that might be easier than trying to do it all on the server side.
-don
If you are running a high traffic PHP server you are INSANE if you do not use Alternative PHP Cache: http://php.net/manual/en/book.apc.php . You do not have to make code modifications to run APC.
Another useful technique that can work along with APC is using the Smarty template system which allows you to cache output so that pages do not have to be rebuilt.
To solve this problem, I've used two different products; Gearman and RabbitMQ.
The benefit of putting your jobs into some sort of queuing software like Gearman or Rabbit is that you have multiple machines, they can all participate in processing items off the queue(s).
Gearman is easier to setup, so I'd suggest poking around with it a bit first. If you find you need something more heavy duty with queue robustness; Look into RabbitMQ
http://www.danga.com/gearman/
http://pear.php.net/package/Net_Gearman (PEAR library)
You can use pcntl_fork() and family to fork a process - however you may need something like IPC to communicate back to the parent process that the child process (the one you fork'd) is finished.
You could have them write to shared memory, like via memcache or a DB.
You could also have the child process write the completed data to a file, that the parent process keeps checking - as each child process completes the file is created/written to/updated, and parent process can grab it, one at a time, and them throw them back to the callee/client.
The parent's job is to control the queue, to make sure the same data isn't processed twice and also to sanity check the children (better kill that runaway process and start over...etc)
Something else to keep in mind - on windows platforms you are going to be severely limited - I dont even think you have access to pcntl_ unless you compiled PHP with support for it.
Also, can you cache the data once its been processed, or is it unique data every time? that would surely speed things up..?

Categories