PHP Multiple Curl Requests - php

I use cURL from PHP a lot, and fetching about 100 pages takes a long time on every run. For every request I'm using code like this:
$ch = curl_init();
// get source
curl_close($ch);
What are my options to speed things up?
How should I use curl_multi_init() and the related functions?

Reuse the same cURL handle ($ch) instead of calling curl_close after every request. This speeds things up a little.
Use curl_multi_init to run the requests in parallel. This can have a tremendous effect.
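The first option above could look roughly like the following. This is only a sketch: the function name fetch_sequential and the 10-second timeout are assumptions for illustration, not something from the question.

```php
<?php
// Hypothetical helper: fetch several URLs one after another over a single,
// reused cURL handle, so connections to the same host can be kept open.
function fetch_sequential(array $urls): array
{
    $results = [];
    if ($urls === []) {
        return $results;
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed per-request limit, in seconds

    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);        // only the URL changes between requests
        $body = curl_exec($ch);
        // curl_exec() returns false on failure; store null so callers can tell
        $results[$url] = ($body === false) ? null : $body;
    }

    curl_close($ch); // close once, after the last request
    return $results;
}
```

The requests still run one at a time, so the gain is limited to what is saved by not setting up a new handle and connection for each page.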

Use curl_multi; it is far better. Save the handshakes: they are not needed every time!

When I use the code given at http://php.net/curl_multi_init, the responses of the two requests get mixed together.
The code in the answer linked below, however, returns each response separately (as an array):
https://stackoverflow.com/a/21362749/3177302
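A minimal sketch of that idea, keeping every response separate in an array, might look like this. It is not the code from the linked answer; the function name fetch_parallel and the default timeout are assumptions.

```php
<?php
// Hypothetical helper: run all requests through one multi handle and
// return each body separately, under the same key as its URL.
function fetch_parallel(array $urls, int $timeout = 10): array
{
    $results = [];
    if ($urls === []) {
        return $results;
    }

    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed for curl_multi_getcontent()
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // assumed per-request limit, in seconds
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000); // select could not wait; back off briefly
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the keys of the transfers that failed
    $failed = [];
    while (($info = curl_multi_info_read($mh)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = array_search($info['handle'], $handles, true);
        }
    }

    foreach ($handles as $key => $ch) {
        $results[$key] = in_array($key, $failed, true) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Because each handle has its own buffer read with curl_multi_getcontent(), the responses cannot end up interleaved in a single output stream.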

Or use pcntl_fork to fork child processes that each call curl_exec. But it's not as good as curl_multi.

Related

Execute Code in Parallel in PHP to minimize execution time

Initial condition: I have code in a PHP file that initially took 30 seconds to execute. In this file the code was called 5 times.
What will happen next: If I need to execute this code 50 times, one run in the browser will take 300 seconds, and 500 times will take 3000 seconds, because the code executes serially.
What I need: I need to execute this code in parallel, as several instances, to minimize the execution time so the user does not have to wait that long.
What I did: I used PHP cURL to execute this code in parallel, calling the file several times to minimize the execution time.
So I want to know whether this method is correct, how many cURL requests I can run at once, and how many resources they require. I need a better method for executing this code in parallel, ideally with a tutorial.
Any help would be appreciated.
Probably the simplest option without changing your code (too much), though, would be to call PHP through the command line and not cURL. This cuts the overhead of Apache (both in memory and speed), networking, etc. Plus, cURL is not a portable option, as some servers can't see themselves (in network terms).
// popen() needs a mode; 'r' lets the parent read each child's output
$process1 = popen('php myfile.php [parameters]', 'r');
$process2 = popen('php myfile.php [parameters]', 'r');
// get response from children : you can loop until all completed
$response1 = stream_get_contents($process1);
$response2 = stream_get_contents($process2);
pclose($process1);
pclose($process2);
You'll need to remove any reference to Apache-added variables in $_SERVER, and replace $_GET with argv/argc references. But otherwise it should just work.
But the best solution will probably be pthreads (http://php.net/manual/en/book.pthreads.php), which allows you to do what you want. It will require some editing of code (and possibly installing the extension), but it does what you're asking.
PHP cURL has low enough overhead that you don't have to worry about it. If you can make loopback calls to a server farm through a load balancer, that's a good use case for cURL. I've also used pcntl_fork() for same-host parallelism, but it's harder to set up. I've written classes built on both; see my PHP lib at https://github.com/andrasq/quicklib for ideas (or just borrow code, it's open source).
Consider using Gearman. Documentation :
http://php.net/manual/en/book.gearman.php

Curl with multithreading

I am scraping data from a URL using cURL:
for ($i = 0; $i < 1000000; $i++) {
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, 'http://example.com?page='.$i);
    curl_exec($curl_handle);
    curl_close($curl_handle);
    // some code to save the HTML page on HDD
}
I wanted to know if there is some way that I could speed up the process? Maybe multithreading? How could I do it?
cURL Multi does not make parallel requests, it makes asynchronous requests.
The documentation was wrong until 5 minutes ago, it will take some time for the corrected documentation to be deployed and translated.
Asynchronous I/O (using something like the cURL Multi API) is the simplest thing to do. However, it only makes the requests asynchronous: processing the data once downloaded (writing to disk, for example) would still cause lots of blocking I/O, and further processing of the data (parsing JSON, for example) would likewise occur synchronously, in a single thread of execution.
Multi-threading is the other option; this requires a thread-safe build of PHP with the pthreads extension installed.
Multi-threading has the advantage that all processing can be done for each download and subsequent actions in parallel, fully utilizing all the CPU cores available.
What is best depends largely on how much processing of downloaded data your code must perform, and even then can be considered a matter of opinion.
You're looking for the curl_multi_* set of functions: "Allows the processing of multiple cURL handles in parallel".
Take a look at the complete example on the curl_multi_init() page.
Check out these articles for more information about how curl_multi_exec() works:
http://technosophos.com/2012/10/26/php-and-curlmultiexec.html
http://www.somacon.com/p537.php

Optimize PHP CURL for web crawler

I am trying to code a crawler based on PHP with curl. I have database of 20,000-30,000 URLs that I have to crawl. Each call to curl to fetch a webpage takes around 4-5 seconds.
How can I optimize this and reduce the time required to fetch a page?
You can use curl_multi_* for that. The number of curl resources you add to one multi handle is the number of parallel requests it will make. I usually start with 20-30 parallel requests, depending on the size of the returned content (make sure your script won't hit the memory limit).
Note that it will run as long as it takes to run the slowest request. So if a request times out, you might wait a very long time. To avoid that, it is a good idea to set the timeout to some acceptable value.
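One way to combine both points, a bounded number of parallel requests and a per-request timeout, is a rolling window that starts a new transfer whenever one finishes. The following is a sketch under assumptions: the function name crawl_rolling, the window of 20, and the timeout values are illustrative, not taken from the answer.

```php
<?php
// Hypothetical helper: keep at most $maxParallel transfers in flight and
// start the next URL as soon as any transfer completes or times out.
function crawl_rolling(array $urls, int $maxParallel = 20, int $timeout = 15): array
{
    $results = [];
    $queue = array_values(array_unique($urls)); // results are keyed by URL
    $active = 0;
    $mh = curl_multi_init();

    do {
        // Top up the window from the queue
        while ($active < $maxParallel && $queue !== []) {
            $url = array_shift($queue);
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);  // assumed connect limit, in seconds
            curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);  // whole-request limit, in seconds
            curl_setopt($ch, CURLOPT_PRIVATE, $url);      // remember which URL this handle serves
            curl_multi_add_handle($mh, $ch);
            $active++;
        }
        if ($active === 0) {
            break; // nothing queued and nothing running
        }

        curl_multi_exec($mh, $running);

        // Harvest every transfer that has finished so far
        while (($info = curl_multi_info_read($mh)) !== false) {
            $ch = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = ($info['result'] === CURLE_OK) ? curl_multi_getcontent($ch) : null;
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $active--;
        }

        // Wait for network activity instead of spinning
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);
        }
    } while ($active > 0 || $queue !== []);

    curl_multi_close($mh);
    return $results;
}
```

With this design a slow request only occupies one slot until its timeout expires, rather than holding up a whole batch. For tens of thousands of URLs it would probably be better to write each body to disk or a database as it arrives instead of collecting everything in $results.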
You can see the code example at my answer in another thread here.

using file get contents or curl

I was asked at work to use a simple Facebook API that returns the number of likes or shares as a JSON string.
Since I am going to do this for a very large number of links, which one is better:
file_get_contents or cURL?
Both seem to return the same results, and cURL seems more complicated to use, but what is the difference between them? Why do most people recommend cURL over file_get_contents?
Before I run the API calls, which might take a whole day to process, I would like some feedback.
A few years ago I benchmarked the two and CURL was faster. With CURL you create one CURL instance which can be used for every request, and it maps directly to the very fast libcurl library. Using file_get_contents you have the overhead of protocol wrappers and the initialization code getting executed for every single request.
I will dig out my benchmark script and run on PHP 5.3 but I suspect that CURL will still be faster.
cURL supports https requests more widely than file_get_contents and it's not too terribly complicated. Although the one-line file_get_contents solution sure is clean looking, its behind-the-scenes overhead is larger than cURL's.
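$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $feedURL);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);    // give up connecting after 2 seconds
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
// Note: this disables TLS certificate verification; leave it enabled (the default) where possible
curl_setopt($curl_handle, CURLOPT_SSL_VERIFYPEER, false);
$buffer = curl_exec($curl_handle);
curl_close($curl_handle);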
This is what I use to make Facebook API calls, as many of them require an access_token and Facebook will only accept access_token information over a secure connection. I've also noticed a large difference in execution time (cURL is much faster).

Faster alternative to file_get_contents()

Currently I'm using file_get_contents() to submit GET data to an array of sites, but upon execution of the page I get this error:
Fatal error: Maximum execution time of 30 seconds exceeded
All I really want the script to do is start loading the webpage, and then leave. Each webpage may take up to 5 minutes to load fully, and I don't need it to load fully.
Here is what I currently have:
foreach ($sites as $s) // Create one line to read from a wide array
{
    file_get_contents($s['url']); // Send to the shells
}
EDIT: To clear any confusion, this script is being used to start scripts on other servers, that return no data.
EDIT: I'm now attempting to use cURL to do the trick, by setting a timeout of one second to make it send the data and then stop. Here is my code:
$ch = curl_init($s['url']); //load the urls
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); //Only send the data, don't wait.
curl_exec($ch); //Execute
curl_close($ch); //Close it off.
Perhaps I've set the option wrong. I'm looking through some manuals as we speak. Just giving you an update. Thank you all of you that are helping me thus far.
EDIT: Ah, found the problem. I was using CURLOPT_CONNECTTIMEOUT instead of CURLOPT_TIMEOUT. Whoops.
However now, the scripts aren't triggering. They each use ignore_user_abort(TRUE); so I can't understand the problem
Hah, scratch that. Works now. Thanks a lot everyone
There are many ways to solve this.
You could use cURL with its curl_multi_* functions to execute the requests asynchronously. Or use cURL the usual way but with a timeout of 1 second, so the call times out on your side while the request is still executed remotely.
If you don't have cURL installed, you could continue using file_get_contents but forking processes (not so cool, but works) using something like ZendX_Console_Process_Unix so you avoid the waiting between each request.
As Franco mentioned and I'm not sure was picked up on, you specifically want to use the curl_multi functions, not the regular curl ones. This packs multiple curl objects into a curl_multi object and executes them simultaneously, returning (or not, in your case) the responses as they arrive.
Example at http://php.net/curl_multi_init
Re your update that you only need to trigger the operation:
You could try using file_get_contents with a timeout. This would lead to the remote script being called, but the connection being terminated after n seconds (e.g. 1).
If the remote script is configured so it continues to run even if the connection is aborted (in PHP that would be ignore_user_abort), it should work.
Try it out. If it doesn't work, you won't get around increasing your time_limit or using an external executable. But from what you're saying - you just need to make the request - this should work. You could even try to set the timeout to 0 but I wouldn't trust that.
From here:
<?php
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1 // seconds
    )
));
// second argument is use_include_path, so pass false rather than 0
file_get_contents("http://example.com/", false, $ctx);
?>
To be fair, Chris's answer already includes this possibility: curl also has a timeout switch.
It is not file_get_contents() that consumes that much time, but the network connection itself.
Consider not submitting GET data to an array of sites, but creating an RSS feed and letting them fetch the RSS data.
I don't fully understand the purpose of your script.
But here is what you can do:
In order to avoid the fatal error quickly you can just add set_time_limit(120) at the beginning of the file. This will allow the script to run for 2 minutes. Of course you can use any number that you want and 0 for infinite.
If you just need to call the URL and you don't care about the result, you should use cURL in asynchronous mode. In this case, a call to the URL will not wait until it has finished, and you can call them all very quickly.
If the remote pages take up to 5 minutes to load, your file_get_contents will sit and wait for that 5 minutes. Is there any way you could modify the remote scripts to fork into a background process and do the heavy processing there? That way your initial hit will return almost immediately, and not have to wait for the startup period.
Another possibility is to investigate if a HEAD request would do the trick. HEAD does not return any data, just headers, so it may be enough to trigger the remote jobs and not wait for the full output.
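A HEAD request can be sent with cURL by setting CURLOPT_NOBODY. The sketch below assumes a hypothetical helper name, trigger_with_head, and a 5-second timeout; whether the remote script actually starts its job on a HEAD request depends on the remote server and would need to be verified.

```php
<?php
// Hypothetical helper: send a HEAD request and return the HTTP status code,
// or null if the request failed (for example on a timeout).
function trigger_with_head(string $url, int $timeout = 5): ?int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD: ask for headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not print anything
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // assumed limit, in seconds

    $ok = curl_exec($ch) !== false;
    $status = $ok ? curl_getinfo($ch, CURLINFO_HTTP_CODE) : null;
    curl_close($ch);
    return $status;
}
```

It could be called in the loop from the question in place of file_get_contents($s['url']), for example trigger_with_head($s['url']).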
