Could you please tell me which of these code samples uses the least RAM? Here are my two examples:
$ch = curl_init();
foreach ($URLS as $url){
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url.'&no_cache');
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
curl_exec($ch);
}
// close cURL resource, and free up system resources
curl_close($ch);
or
foreach ($URLS as $url){
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url.'&no_cache');
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
}
The first one has lighter overhead, as you only instantiate the curl object once, but if curl has any leaks in it and you're fetching a large-ish number of URLs, you could run out of memory.
Usually I only create a new curl object if the next URL to fetch needs settings that differ too much from the previous request. It is easier to start from a default setup and make changes from there than to try to "undo" the conflicting settings from the previous run.
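One way to get the "start from a default setup" behaviour while still reusing a single handle is curl_reset() (PHP 5.5+), which restores every option on a handle to its default. A minimal sketch; the helper names merge_curl_options and prepare_handle and the default values are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical helper: combine default options with per-request overrides.
function merge_curl_options(array $defaults, array $overrides): array
{
    // array_replace keeps the integer CURLOPT_* keys intact
    // (array_merge would renumber integer keys).
    return array_replace($defaults, $overrides);
}

// Hypothetical helper: reset a shared handle, then apply defaults plus
// overrides, so nothing carries over from the previous request.
function prepare_handle($ch, string $url, array $overrides = []): void
{
    // Illustrative defaults, not taken from the question.
    $defaults = [
        CURLOPT_HEADER         => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ];
    curl_reset($ch); // back to a clean slate
    curl_setopt_array($ch, merge_curl_options($defaults, $overrides));
    curl_setopt($ch, CURLOPT_URL, $url);
}
```

In a loop, prepare_handle() would be called once per URL before curl_exec($ch); the handle itself is still created and closed only once.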
The following script just seems to run forever. It never finishes.
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
for ($i = 500; $i < 3000; $i++) {
$url = "http://abcedfg.com/$i/index.html";
curl_setopt($ch, CURLOPT_URL, $url);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
Try calling curl_init and curl_close for every request.
Like this:
function callurl($myurl) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $myurl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
$response = curl_exec($ch);
curl_close($ch);
return $response;
}
You'll then have to call this function for every URL, for example in a for loop.
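The loop could be sketched as follows. Because callurl() sets CURLOPT_HEADER and CURLOPT_NOBODY, it returns only the raw response headers, so the status code has to be read from the status line. The helper names status_from_headers and check_range are hypothetical, and callurl() is assumed to be the function defined above:

```php
<?php
// Hypothetical helper: extract the HTTP status code from raw header text.
function status_from_headers(string $headers): ?int
{
    // After redirects there can be several header blocks; use the last status line.
    if (preg_match_all('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#m', $headers, $m)) {
        return (int) end($m[1]);
    }
    return null; // no status line found
}

// Loop sketch using the URL pattern from the question.
// Relies on callurl() from the answer above being defined.
function check_range(int $from, int $to): array
{
    $codes = [];
    for ($i = $from; $i < $to; $i++) {
        $url = "http://abcedfg.com/$i/index.html";
        $response = callurl($url);
        // curl_exec() returns false on failure, so guard before parsing.
        $codes[$url] = is_string($response) ? status_from_headers($response) : null;
    }
    return $codes;
}
```

Starting with something like check_range(500, 520) keeps the first test run small.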
Also, test with only 10-20 requests before going big.
Consider that 2500 requests, at 1 second per request, add up to about 41 minutes of activity.
No server is configured by default to keep a PHP script running for 40 minutes (for web requests, max_execution_time defaults to 30 seconds). You can change this setting if you have access to the server configuration.
It's also possible that you're stuck because the server doesn't have enough resources to make that many requests. Ideally you should fine-tune your server configuration to get better performance.
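The arithmetic above, and one way to raise the limit from inside the script, can be sketched like this. The function name estimated_minutes and the 2x safety margin are assumptions; set_time_limit() is the standard PHP call, and it can be blocked or overridden by server configuration:

```php
<?php
// Rough estimate of total runtime for sequential requests, in minutes.
function estimated_minutes(int $requests, float $secondsPerRequest): float
{
    return $requests * $secondsPerRequest / 60;
}

// 2500 requests at about 1 second each: roughly 41.7 minutes.
$minutes = estimated_minutes(2500, 1.0);

// Ask PHP for enough time for this script (a limit of 0 would mean "no limit").
// set_time_limit() returns false if the limit could not be changed.
$limitRaised = set_time_limit((int) ceil($minutes * 60 * 2)); // 2x margin, arbitrary
```

On the command line PHP applies no execution time limit by default, so running a long job from the CLI or cron avoids the problem entirely.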
Also consider using curl_multi_init for better performance and asynchronous requests.
But this does not guarantee that the requests won't be dropped because of a timeout, so fine-tuning the server may still be needed.
See also this post on how to increase the time limit:
It's better to close the handle every time you open one, so that the memory held for it is released.
You can build the list of all the URLs in the loop first, and then fetch them with a single multi-curl request.
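A minimal sketch of that multi-curl approach, using the curl_multi_* functions to run header-only requests in parallel and collect the status codes. The function name fetch_status_codes and the 10-second timeout are assumptions; for thousands of URLs you would probably want to feed the list in smaller batches rather than all at once:

```php
<?php
// Sketch: request many URLs in parallel and return [url => HTTP status code].
function fetch_status_codes(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // headers only
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not echo output
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // illustrative per-request cap
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000); // select failed; avoid a busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    $codes = [];
    foreach ($handles as $url => $ch) {
        $codes[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $codes;
}
```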
I am calling an API with cURL on the same domain (right now it's localhost) with the following code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url );
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
It is very slow (up to 7 seconds) unless I add
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,1);
I know it's not simply the time it takes for the API to load, because if I request the API URL in the browser, it's almost instant.
How would you suggest troubleshooting this issue? Or should I not be using cURL at all?
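One way to narrow this down is to ask cURL where the time goes. curl_getinfo() reports cumulative timestamps (seconds since the request started) for name lookup, connect, first byte, and completion. A minimal sketch; the function names timing_breakdown and phase_durations are hypothetical. If the DNS phase dominates, a common suspect (an assumption here, not confirmed by the question) is how "localhost" gets resolved, and trying 127.0.0.1 or CURLOPT_IPRESOLVE with CURL_IPRESOLVE_V4 is worth a test:

```php
<?php
// Sketch: run one request and collect cURL's cumulative timing counters.
function timing_breakdown(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    $timings = [
        'dns_lookup' => curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME),
        'connect'    => curl_getinfo($ch, CURLINFO_CONNECT_TIME),
        'first_byte' => curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME),
        'total'      => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
    ];
    curl_close($ch);
    return $timings;
}

// The counters are cumulative, so subtract neighbours to get each phase.
function phase_durations(array $t): array
{
    return [
        'dns'      => $t['dns_lookup'],
        'connect'  => $t['connect'] - $t['dns_lookup'],
        'wait'     => $t['first_byte'] - $t['connect'],
        'download' => $t['total'] - $t['first_byte'],
    ];
}
```

Printing phase_durations(timing_breakdown($url)) for the slow request shows which phase accounts for the delay.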
I have a site that scrapes its sister sites, and for reporting reasons I'd like to be able to work out how long the task took to run. How would I approach this with PHP? Is it even possible?
In an ideal world if the task couldn't connect to actually run after 5 seconds I'd like to kill the function from running and report the failure.
Thank you all!
If you use cURL for scraping, you can set a timeout like this:
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options including timeout
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the result in a string
curl_setopt($ch, CURLOPT_TIMEOUT, 5); // maximum number of seconds the whole request may take (CURLOPT_CONNECTTIMEOUT limits only the connection phase)
// grab the info
if (($result = curl_exec($ch)) === false) // an empty body is not an error
{
trigger_error(curl_error($ch));
}
// close cURL resource, and free up system resources
curl_close($ch);
// process the $result
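To also report how long the task took, the request can be wrapped in a timer based on microtime(true). A minimal sketch; the names run_timed and scrape and the 30-second overall cap are assumptions, while the 5-second connect limit comes from the question:

```php
<?php
// Hypothetical wrapper: run any task, capture failure, and report elapsed seconds.
function run_timed(callable $task): array
{
    $start = microtime(true);
    try {
        $result = $task();
        $error = null;
    } catch (Throwable $e) {
        $result = null;
        $error = $e->getMessage();
    }
    return ['result' => $result, 'error' => $error, 'seconds' => microtime(true) - $start];
}

// Scrape one URL, giving up if no connection is made within 5 seconds.
function scrape(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // connection phase only
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);       // whole request (illustrative value)
    $result = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);
    if ($result === false) {
        throw new RuntimeException($error); // surfaces as 'error' in run_timed()
    }
    return $result;
}
```

A call such as run_timed(function () use ($url) { return scrape($url); }) then yields the page or the error message together with the elapsed time for the report.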
I have to fetch multiple web pages, let's say 100 to 500. Right now I am using curl to do so.
function get_html_page($url) {
//create curl resource
$ch = curl_init();
//set url
curl_setopt($ch, CURLOPT_URL, $url);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
//$html contains the output string
$html = curl_exec($ch);
//close curl resource to free up system resources
curl_close($ch);
return $html;
}
My major concern is the total time taken by my script to fetch all these web pages. I know that the time taken depends mostly on my internet speed, so the majority of the time is spent in the $html = curl_exec($ch); call.
I was thinking that, instead of creating and destroying a curl instance for each and every web page, I could create it only once, reuse it for every page, and destroy it at the end. Something like:
<?php
function get_html_page($ch, $url) {
//set url for this request on the shared handle
curl_setopt($ch, CURLOPT_URL, $url);
//$html contains the output string
$html = curl_exec($ch);
return $html;
}
//create curl resource once
$ch = curl_init();
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
.
.
.
<fetch web pages using get_html_page()>
.
.
.
//close curl resource to free up system resources
curl_close($ch);
?>
Will it make any significant difference in the total time taken? If there is a better approach, please let me know about that as well.
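How about benchmarking it? The second way may be more efficient, but I don't think it will add up to much; your system can create and destroy curl instances in microseconds. One real difference is that a reused handle can keep the connection to a server open between requests, so the gain is larger when many pages come from the same host. For pages on different hosts, a new HTTP connection has to be made each time either way.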
If you were running many of these at the same time and were worried about system resources, not time, it might be worth exploring. As you noted, most of the time spent doing this will be waiting for network transfers, so I don't think you'll notice a change in overall time with either method.
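A benchmark along those lines could look like the sketch below. It is a rough comparison under stated assumptions: the function names time_it, fetch_fresh, and fetch_reused are hypothetical, and network variance will usually dominate, so several runs against your own URL list are needed before drawing conclusions:

```php
<?php
// Time a callable once and return the elapsed seconds.
function time_it(callable $fn): float
{
    $start = microtime(true);
    $fn();
    return microtime(true) - $start;
}

// Strategy A: create and close a handle for every URL.
function fetch_fresh(array $urls): array
{
    $pages = [];
    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $pages[$url] = curl_exec($ch);
        curl_close($ch);
    }
    return $pages;
}

// Strategy B: create one handle and reuse it for every URL.
function fetch_reused(array $urls): array
{
    $pages = [];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch);
    }
    curl_close($ch);
    return $pages;
}
```

Comparing time_it(function () use ($urls) { fetch_fresh($urls); }) with the same call for fetch_reused($urls) gives the two totals.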
For web scraping I would use YQL + JSON + XPath. You'll implement it using cURL.
I think you'll save a lot of resources.
Normally I POST data when I initiate cURL, then wait for the response, parse it, etc.
I want to simply post data, and not wait for any response.
In other words, can I send data to a URL via cURL and close my connection immediately (not waiting for any response, or even checking whether the URL exists)?
It's not a normal thing to ask, but I'm asking anyway.
Here's what I have so far:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $MyUrl);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_to_send);
curl_exec($ch);
curl_close($ch);
I believe the only way to not actually receive the whole response from the remote server is by using CURLOPT_WRITEFUNCTION. For example:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $MyUrl);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_to_send);
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'do_nothing');
curl_exec($ch);
curl_close($ch);
function do_nothing($curl, $input) {
return 0; // aborts transfer with an error
}
Important notes
Be aware that this will generate a warning, as the transfer will be aborted.
Make sure that you do not set the value of CURLOPT_RETURNTRANSFER, as this will interfere with the write callback.
You could do this through the curl_multi_* functions that are designed to execute multiple simultaneous requests - just fire off one request and don't bother asking for the response.
Not sure what the implications are in terms of what will happen if the script exits and curl is still running.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $MyUrl);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_to_send);
$mh = curl_multi_init();
curl_multi_add_handle($mh,$ch);
$running = 0; // overwritten by curl_multi_exec() with the number of active transfers
curl_multi_exec($mh,$running); // asynchronous
// don't bother with the usual cleanup
Not sure if this helps, but via command-line I suppose you could use the '--max-time' option - "Maximum time in seconds that you allow the whole operation to take."
I had to do something quick and dirty and didn't want to have to re-program code or wait for a response, so found the --max-time option in the curl manual
curl --max-time 1 URL
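In PHP, the closest counterpart to --max-time is CURLOPT_TIMEOUT (seconds) or CURLOPT_TIMEOUT_MS (milliseconds), which cap the whole operation. A minimal sketch, assuming hypothetical helper names short_timeout_post_options and post_and_move_on; note that if the timeout is too short the server may never receive the complete request:

```php
<?php
// Build options for a POST that gives up after $timeoutMs milliseconds.
function short_timeout_post_options(string $url, $data, int $timeoutMs): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POSTFIELDS     => $data,
        CURLOPT_RETURNTRANSFER => true,       // keep any partial response off the output
        CURLOPT_TIMEOUT_MS     => $timeoutMs, // whole-operation limit, like --max-time
        CURLOPT_NOSIGNAL       => true,       // needed on some builds for sub-second timeouts
    ];
}

// Send the POST and move on; a timeout is expected and deliberately ignored.
function post_and_move_on(string $url, $data, int $timeoutMs = 1000): void
{
    $ch = curl_init();
    curl_setopt_array($ch, short_timeout_post_options($url, $data, $timeoutMs));
    curl_exec($ch);
    curl_close($ch);
}
```

Calling post_and_move_on($MyUrl, $data_to_send) mirrors curl --max-time 1 URL: the script waits at most about one second, whatever the server does.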