Execute Code in Parallel in PHP to minimize execution time - php

Initial condition: I have code in a PHP file. Initially it took 30 seconds to execute; within that file the code was called 5 times.
What will happen next: If I need to execute this code 50 times, it will take 300 seconds for one request in the browser; for 500 times, 3000 seconds. So the code executes serially.
What I need: I need to execute this code in parallel, as several instances, to minimize the execution time so the user does not have to wait so long.
What I did: I used PHP cURL to execute this code in parallel, calling the file several times to reduce the total execution time.
So I want to know whether this method is correct, how many cURL requests I can run at once, and how many resources they require. I would also appreciate a better method for executing this code in parallel, ideally with a tutorial.
Any help will be appreciated.

Probably the simplest option without changing your code (too much), though, would be to call PHP through the command line and not cURL. This cuts the overhead of Apache (both in memory and speed), networking, etc. Plus cURL is not a portable option, as some servers can't see themselves (in network terms).
$process1 = popen('php myfile.php [parameters]', 'r');
$process2 = popen('php myfile.php [parameters]', 'r');
// get the responses from the children: you can loop until all have completed
$response1 = stream_get_contents($process1);
$response2 = stream_get_contents($process2);
pclose($process1);
pclose($process2);
You'll need to remove any references to Apache-added variables in $_SERVER and replace $_GET with argv/argc references, but otherwise it should just work.
But the best solution will probably be pthreads (http://php.net/manual/en/book.pthreads.php), which lets you do exactly what you want. It will require some editing of your code (and possibly installing the extension), but it does what you're asking.
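As a rough illustration only (it assumes the pthreads extension is installed on a thread-safe PHP build, and doTheWork() is a placeholder for your 30-second job), a pthreads version might look something like this:

<?php
// Minimal pthreads sketch; doTheWork() is a hypothetical stand-in for the real job.
class Job extends Thread {
    private $param;
    public $result;

    public function __construct($param) {
        $this->param = $param;
    }

    public function run() {
        // each Job executes in its own thread
        $this->result = doTheWork($this->param);
    }
}

$jobs = array();
for ($i = 0; $i < 50; $i++) {
    $jobs[$i] = new Job($i);
    $jobs[$i]->start();   // launch the thread
}
foreach ($jobs as $job) {
    $job->join();         // wait for the thread to finish
    // $job->result now holds that job's output
}

In practice you would cap the number of threads running at once (pthreads also ships a Pool class for that) rather than starting all 50 together.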

php curl is low enough overhead to not have to worry about it. If you can make loopback calls to a server farm through a load balancer, that's a good use case for curl. I've also used pcntl_fork() for same-host parallelism, but it's harder to set up. I've written classes built on both; see my php lib at https://github.com/andrasq/quicklib for ideas (or just borrow code, it's open source)

Consider using Gearman. Documentation: http://php.net/manual/en/book.gearman.php
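A minimal sketch of the idea, assuming a gearmand server is reachable and using made-up names (run_job, doTheWork()): one worker script registers the function, the client submits jobs to it, and running several workers gives you the parallelism.

<?php
// worker.php: run one or more copies of this in the background
$worker = new GearmanWorker();
$worker->addServer();                    // defaults to 127.0.0.1:4730
$worker->addFunction('run_job', function (GearmanJob $job) {
    return doTheWork($job->workload());  // hypothetical: the 30-second task
});
while ($worker->work());

// client.php: called from the web request
$params = array('job 1', 'job 2');       // placeholder work items
$client = new GearmanClient();
$client->addServer();
foreach ($params as $p) {
    $client->doBackground('run_job', serialize($p)); // queue it and return immediately
}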

Related

Optimize PHP CURL for web crawler

I am trying to code a crawler based on PHP with cURL. I have a database of 20,000-30,000 URLs that I have to crawl. Each call to curl to fetch a webpage takes around 4-5 seconds.
How can I optimize this and reduce the time required to fetch a page?
You can use curl_multi_* for that. The number of curl handles you add to one multi handle is the number of parallel requests it will make. I usually start with 20-30, depending on the size of the returned content (make sure your script won't hit the memory limit).
Note that a batch will run as long as its slowest request takes, so if a request times out you might wait for a very long time. To avoid that, it's a good idea to set the timeout to some acceptable value.
You can see the code example at my answer in another thread here.
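As a rough sketch of the curl_multi_* pattern (with placeholder URLs, not the linked answer):

<?php
// Fetch a batch of URLs in parallel with curl_multi; $urls is a placeholder list.
$urls = array('http://example.com/page1', 'http://example.com/page2');
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);   // cap slow requests so one URL can't stall the batch
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}
// run all handles until every request has finished
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh, 1.0);             // wait up to a second for activity
} while ($running > 0);
// collect the responses and clean up
$pages = array();
foreach ($handles as $i => $ch) {
    $pages[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);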

max_execution_time Alternative

So here's the lowdown:
The client I'm developing for is on HostGator, which has limited their max_execution_time to 30 seconds, and it cannot be overridden (I've tried, and confirmed via their support and wiki that it can't).
What I have the code doing is take an uploaded file and...
loop through the XML
get all feed download links within the file
download each XML file
individually loop through each XML array of each file and insert the information of each item into the database based on where it is from (i.e. the filename)
Now is there any way I can queue this somehow or split the workload into multiple files possibly? I know the code works flawlessly and checks to see if each item exists before inserting it but I'm stuck getting around the execution_limit.
Any suggestions are appreciated, let me know if you have any questions!
The time limit is in effect only when executing PHP scripts through a web server; if you execute the script from the CLI or as a background process, it should work fine.
Note that executing an external script is somewhat dangerous if you are not careful enough, but it's a valid option.
Check the following resources (a rough fork sketch follows the list below):
Process Control Extensions
And specifically:
pcntl-exec
pcntl-fork
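A minimal pcntl_fork sketch under assumed names (the $chunks list and process_feed() are placeholders) to show the shape of the approach; it only works from the CLI:

<?php
// Fork one child per chunk of work; the parent waits for all of them.
$chunks = array('feed1.xml', 'feed2.xml', 'feed3.xml'); // placeholder work items
$children = array();
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die('could not fork');
    } elseif ($pid === 0) {
        // child process: do the work for this chunk, then exit
        process_feed($chunk); // hypothetical worker function
        exit(0);
    }
    $children[] = $pid;       // parent: remember the child PID
}
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status); // wait for each child to finish
}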
Did you know you can trick the max_execution_time by registering a shutdown handler? Within that code you can run for another 30 seconds ;-)
Okay, now for something more useful.
You can add a small queue table in your database to keep track of where you are, in case the script dies mid-way (a rough sketch follows the steps below).
After getting all the download links, you add those to the table
Then you download one file and process it; when you're done, you check them off (delete from) from the queue
Upon each run you check if there's still work left in the queue
For this to work you need to request that URL a few times; perhaps use JavaScript to keep reloading until the work is done?
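Roughly, the queue idea could look like this (the feed_queue table, its columns, and process_feed() are all made up for the sketch):

<?php
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// 1) after parsing the uploaded file, add every download link to the queue
//    ($downloadLinks is a placeholder for the links extracted earlier)
foreach ($downloadLinks as $link) {
    $db->prepare('INSERT INTO feed_queue (url) VALUES (?)')->execute(array($link));
}

// 2) on each run, grab one queued URL, process it, then check it off
$row = $db->query('SELECT id, url FROM feed_queue ORDER BY id LIMIT 1')->fetch();
if ($row) {
    process_feed($row['url']); // hypothetical: download the XML and insert its items
    $db->prepare('DELETE FROM feed_queue WHERE id = ?')->execute(array($row['id']));
}
// 3) the caller (cron, or a JavaScript reload loop) keeps requesting this script until the queue is empty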
I am in such a situation. My approach is similar to Jack's
accept that execution time limit will simply be there
design the application to cope with sudden exit (look into register_shutdown_function)
identify all time-demanding parts of the process
continuously save progress of the process
modify your components so that they are able to start from an arbitrary point, e.g. a position in an XML file, or continue downloading your to-be-fetched list of XML links
For the task I made two modules: Import for the actual processing, and TaskManagement for dealing with these tasks.
For invoking the TaskManager I use cron; whether that is enough depends on what your web hosting offers. There's also WebCron.
The advantage of Jack's JavaScript method is that it only adds requests when needed; if there are no tasks to execute, the script's runtime will be very short and arguably wasted*, but still. The downsides are that it requires the user to wait the whole time and not close the tab/browser, JS support, etc.
*) Likely much less demanding than one click from one user at such a moment
Then of course look into performance improvements, caching, skipping what's not needed/hasn't changed etc.

Creating a long TCPDF document without timeout (so long running php process)

I'm building a feature of a site that will generate a PDF (using TCPDF) into a booklet of 500+ pages. The layout is very simple but just due to the number of records I think it qualifies as a "long running php process". This will only need to be done a handful of times per year and if I could just have it run in the background and email the admin when done, that would be perfect. Considered Cron but it is a user-generated type of feature.
What can I do to keep my PDF rendering for as long as it takes? I am "good" with PHP but not so much with *nix. Even a tutorial link would be helpful.
Honestly you should avoid doing this entirely from a scalability perspective. I'd use a database table to "schedule" the job with the parameters, have a script that is continuously checking this table. Then use JavaScript to poll your application for the file to be "ready", when the file is ready then let the JavaScript pull down the file to the client.
It will be incredibly hard to maintain and troubleshoot this process while you're wondering why your web server is suddenly so slow. Apache doesn't make it easy to determine which process is eating up which CPU.
Also by using a database you can do things like limit the number of concurrent threads, or even provide faster rendering time by letting multiple processes render each PDF page and then re-assemble them together with yet another process... etc.
Good luck!
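A rough sketch of that pattern, with an assumed pdf_jobs table and a hypothetical render_booklet() function; the worker runs from the CLI or cron, not through Apache, so the web server's time limit never applies:

<?php
set_time_limit(0); // long-running background worker
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
while (true) {
    $job = $db->query("SELECT id, params FROM pdf_jobs WHERE status = 'pending' LIMIT 1")->fetch();
    if (!$job) {
        sleep(10);  // nothing scheduled yet, check again shortly
        continue;
    }
    // params is assumed to be a JSON blob written when the job was scheduled
    render_booklet(json_decode($job['params'], true)); // hypothetical: builds the PDF with TCPDF
    $db->prepare("UPDATE pdf_jobs SET status = 'done' WHERE id = ?")->execute(array($job['id']));
    // the page's JavaScript polls a separate status endpoint until status = 'done', then fetches the file
}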
What you need is to change the allowed maximum execution time for PHP scripts. You can do that by several means from the script itself (you should prefer this if it would work) or by changing php.ini.
BEWARE - Changing the execution time might seriously lower the performance of your server. A script is allowed to run only for a certain time (30 seconds by default) before the parser terminates it; this helps prevent poorly written scripts from tying up the server. You should know exactly what you are doing before you change this.
You can find some more info about:
setting max-execution-time in php.ini here http://www.php.net/manual/en/info.configuration.php#ini.max-execution-time
limiting the maximum execution time by set_time_limit() here http://php.net/manual/en/function.set-time-limit.php
PS: This should work if you use PHP to generate the PDF. It will not work if you use some stuff outside of the script (called by exec(), system() and similar).
This question is already answered, but as a result of other questions / answers here, here is what I did and it worked great: (I did the same thing using pdftk, but on a smaller scale!)
I put the following code in an iframe:
set_time_limit(0);           // ignore the PHP timeout
// ignore_user_abort(true);  // optional: keep going even if the user pulls the plug*
while (ob_get_level()) {     // remove all output buffers
    ob_end_clean();
}
ob_implicit_flush(true);     // flush output as soon as it is produced
This avoided the page load timeout. You might want to put a countdown or progress bar on the parent page. I originally had the iframe issuing progress updates back to the parent, but browser updates broke that.

PHP Background Process on BSD uses 100% CPU

I have a PHP script that runs as a background process. This script simply uses fopen to read from the Twitter Streaming API, essentially an HTTP connection that never ends. I can't post the script, unfortunately, because it is proprietary. On Ubuntu the script runs normally and uses very little CPU; on BSD, however, the script always uses nearly 100% CPU. The script works just fine on both machines and is exactly the same script. Can anyone think of something that might point me in the right direction to fix this? This is the first PHP script I have written to run continuously in the background.
The script is an infinite loop, it reads the data out and writes to a json file every minute. The script will write to a MySQL database whenever a reconnect happens, which is usually after days of running. The script does nothing else and is not very long. I have little experience with BSD or writing PHP scripts that run infinite loops. Thanks in advance for any suggestions, let me know if this belongs in another StackExchange. I will try to answer any questions as quickly as possible, because I realize the question is very vague.
Without seeing the script, it is very difficult to give you a definitive answer; however, what you need to do is ensure that your script is waiting for data appropriately. What you should absolutely, definitely not do is call stream_set_timeout($fp, 0); or stream_set_blocking($fp, 0); on your file pointer.
The basic structure of a script to do something like this that should avoid racing would be something like this:
// Open the file pointer and set blocking mode
$fp = fopen('http://www.domain.tld/somepage.file', 'r');
stream_set_timeout($fp, 1);
stream_set_blocking($fp, 1);

while (!feof($fp)) { // This should loop until the server closes the connection
    // This should be pretty much the first line in the loop.
    // It will try to fetch a line from $fp and block for up to 1 second,
    // or until one is available. This should help avoid racing.
    // You can also use fread() in the same way if necessary.
    if (($str = fgets($fp)) === FALSE) continue;

    // rest of app logic goes here
}
You can use sleep()/usleep() to avoid racing as well, but the better approach is to rely on a blocking function call to do your blocking. If it works on one OS but not on another, try setting the blocking modes/behaviour explicitly, as above.
If you can't get this to work with a call to fopen() passing a HTTP URL, it may be a problem with the HTTP wrapper implementation in PHP. To work around this, you could use fsockopen() and handle the request yourself. This is not too difficult, especially if you only need to send a single request and read a constant stream response.
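If you do go the fsockopen() route, a rough sketch (host and path are placeholders) might look like this:

<?php
$fp = fsockopen('stream.example.com', 80, $errno, $errstr, 30);
if (!$fp) {
    die("Connection failed: $errstr ($errno)");
}
stream_set_blocking($fp, 1);
stream_set_timeout($fp, 1);

// send the HTTP request by hand instead of relying on the http:// wrapper
fwrite($fp, "GET /stream HTTP/1.1\r\nHost: stream.example.com\r\n\r\n");

while (!feof($fp)) {
    // fgets blocks until a line arrives (or the 1-second timeout), so this won't spin the CPU
    if (($line = fgets($fp)) === false) continue;
    // skip/parse the response headers first, then handle the streamed body here
}
fclose($fp);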
It sounds to me like one of your functions is blocking briefly on Linux, but not BSD. Without seeing your script it is hard to get specific, but one thing I would suggest is to add a usleep() before the next loop iteration:
usleep(100000); //Sleep for 100ms
You don't need a long sleep... just enough so that you're not using 100% CPU.
Edit: Since you mentioned you don't have a good way to run this in the background right now, I suggest checking out this tutorial for "daemonizing" your script. Included is some handy code for doing this. It can even make a file in init.d for you.
What does the code that does the actual reading look like? Do you just hammer the socket until you get something?
One really effective way to deal with this is to use the libevent extension, but that's not for the feeble minded.

Faster alternative to file_get_contents()

Currently I'm using file_get_contents() to submit GET data to an array of sites, but upon execution of the page I get this error:
Fatal error: Maximum execution time of 30 seconds exceeded
All I really want the script to do is start loading the webpage, and then leave. Each webpage may take up to 5 minutes to load fully, and I don't need it to load fully.
Here is what I currently have:
foreach ($sites as $s) { // Create one line to read from a wide array
    file_get_contents($s['url']); // Send to the shells
}
EDIT: To clear any confusion, this script is being used to start scripts on other servers, that return no data.
EDIT: I'm now attempting to use cURL to do the trick, by setting a timeout of one second to make it send the data and then stop. Here is my code:
$ch = curl_init($s['url']); //load the urls
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); //Only send the data, don't wait.
curl_exec($ch); //Execute
curl_close($ch); //Close it off.
Perhaps I've set the option wrong. I'm looking through some manuals as we speak. Just giving you an update. Thank you all of you that are helping me thus far.
EDIT: Ah, found the problem. I was using CURLOPT_CONNECTTIMEOUT instead of CURLOPT_TIMEOUT. Whoops.
However now, the scripts aren't triggering. They each use ignore_user_abort(TRUE); so I can't understand the problem
Hah, scratch that. Works now. Thanks a lot everyone
There are many ways to solve this.
You could use cURL with its curl_multi_* functions to execute the requests asynchronously. Or use cURL the common way but with a 1-second timeout limit, so the call will time out and return while the request still gets executed.
If you don't have cURL installed, you could continue using file_get_contents but forking processes (not so cool, but works) using something like ZendX_Console_Process_Unix so you avoid the waiting between each request.
As Franco mentioned and I'm not sure was picked up on, you specifically want to use the curl_multi functions, not the regular curl ones. This packs multiple curl objects into a curl_multi object and executes them simultaneously, returning (or not, in your case) the responses as they arrive.
Example at http://php.net/curl_multi_init
Re your update that you only need to trigger the operation:
You could try using file_get_contents with a timeout. This would lead to the remote script being called, but the connection being terminated after n seconds (e.g. 1).
If the remote script is configured so it continues to run even if the connection is aborted (in PHP that would be ignore_user_abort), it should work.
Try it out. If it doesn't work, you won't get around increasing your time_limit or using an external executable. But from what you're saying - you just need to make the request - this should work. You could even try to set the timeout to 0 but I wouldn't trust that.
From here:
<?php
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));
file_get_contents("http://example.com/", 0, $ctx);
?>
To be fair, Chris's answer already includes this possibility: curl also has a timeout switch.
It is not file_get_contents() that consumes that much time, but the network connection itself.
Consider not submitting GET data to an array of sites, but instead creating an RSS feed and letting them pull the RSS data.
I don't fully understand the meaning behind your script.
But here is what you can do:
In order to avoid the fatal error quickly, you can just add set_time_limit(120) at the beginning of the file. This will allow the script to run for 2 minutes. Of course you can use any number you want, and 0 for infinite.
If you just need to call the URL and you don't "care" about the result, you should use cURL asynchronously. That way a call to the URL will not wait until it finishes, and you can fire them all off very quickly.
BR.
If the remote pages take up to 5 minutes to load, your file_get_contents will sit and wait for that 5 minutes. Is there any way you could modify the remote scripts to fork into a background process and do the heavy processing there? That way your initial hit will return almost immediately, and not have to wait for the startup period.
Another possibility is to investigate if a HEAD request would do the trick. HEAD does not return any data, just headers, so it may be enough to trigger the remote jobs and not wait for the full output.
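If the HEAD idea fits, a rough cURL sketch (reusing the question's $sites array) could be:

<?php
foreach ($sites as $s) {
    $ch = curl_init($s['url']);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // send a HEAD request: headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return instead of echoing any output
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // don't wait around in any case
    curl_exec($ch);
    curl_close($ch);
}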
