Process Timing when Scraping - php

I have a site that that scrapes off of it's sister sites but for reporting reason I'd like to be able to work out how long the task took to run. How would I approach with this with PHP, Is it even possible?
In an ideal world if the task couldn't connect to actually run after 5 seconds I'd like to kill the function from running and report the failure.
Thank you all!

If you use cURL for scraping, you can use the timeout function like this
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options including timeout
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the result in a string
curl_setopt($ch, CURLOPT_TIMEOUT, 5); // The number of seconds to wait while trying to connect.
// grab the info
if (!$result = curl_exec($ch))
{
trigger_error(curl_error($ch));
}
// close cURL resource, and free up system resources
curl_close($ch);
// process the $result

Related

php script loading but doesnt get to end of file

The following script just seems to run forever. It never gets to finished.
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
for ($i = 500; $i<3000; i++){
$url = "http://abcedfg.com/$i/index.html";
curl_setopt($ch, CURLOPT_URL, $url);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
Try to wrap curl_init and curl_close in every request.
Like this:
function callurl($myurl) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $myurl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
$response = curl_exec ($ch);
curl_close ($ch);
return $response;
}
And You'll have to call this function for every URL for example using a loop for.
Also try to test with only 10-20 requests before to go BIG.
Consider that 2500 requests, if every request takes 1 second, is translated to 41 minutes of activity.
No server is configured by default to keep a PHP session active for 40min. You can change this settings on the server if You have access to the server.
It's also possible that You're stuck because the server doesn't have so much resources for making so much requests at the same time. Ideally You should fine tune Your server configuration in order to achieve better performance.
Also consider to use
curl_multi_init for better performance and asynchronous requests.
But this will not guarantee that the request will be dropped because of TIMEOUT. So fine tune the server could be still needed.
Check also this post for how to encrease the time Limit:
It's better to close the file, everytime you open it, so that it realese the memory for the open file.
You can list all the urls by running the loop, and then do a multicurl request.

php curl execution time limitation

While using PHP I am taking image links from my mysql database, and echoing them out. There are 600 or so, but it keeps stopping after running 100 or so. It is not a logic error, it seems there is a setting that is stopping php from continuing the curl. Please advise which setting I should expand to allow a longer CURL thanks!
Here is what I am using now:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch,CURLOPT_BINARYTRANSFER, true);
$data = curl_exec($ch);
return $data;
}
$htmlaa = file_get_contents_curl($getimagefrom);
$docaa = new DOMDocument();
#$docaa->loadHTML($htmlaa);
Again, it is worknig just fine but just keeps stopping after running for maybe 3 minutes.
You can set the curl timeout like so:
curl_setopt($ch, CURLOPT_TIMEOUT, 1000); //seconds to live
Since there are multiple factors that influence execution time you should also check out these two as well:
http://php.net/manual/en/function.set-time-limit.php
http://php.net/manual/en/info.configuration.php#ini.max-execution-time
Also please note that CURLOPT_TIMEOUT defines the amount of time that any cURL function is allowed to take to execute. You should also checkout CURLOPT_CONNECTTIMEOUT option.

PHP cURL causing massive server load on error?

I'm currently making a PHP site that gets his data from an API. At first cURL seemed to do that perfectly, but if the API returns an empty response (and we make the request again, because we can't give a correct response without it) it seems to spawn a child process. This didn't use too much CPU when developing, but in production it can get as high as 150% CPU load.
Code used to get data from the API:
while (empty($output )) {
$ch = curl_init();
set_curl($ch);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
curl_setopt($ch, CURLOPT_URL, "http://domain.com");
$output = curl_exec($ch);
}
Is there any way to fix this?

How to make a cUrl request without receiving the response?

Normally I Post data when I initiate cURL. And I wait for the response, parse it, etc...
I want to simply post data, and not wait for any response.
In other words, can I send data to a Url, via cURL, and close my connection immediately? (not waiting for any response, or even to see if the url exists)
It's not a normal thing to ask, but I'm asking anyway.
Here's what I have so far:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $MyUrl);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_to_send);
curl_exec($ch);
curl_close($ch);
I believe the only way to not actually receive the whole response from the remote server is by using CURLOPT_WRITEFUNCTION. For example:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $MyUrl);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_to_send);
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'do_nothing');
curl_exec($ch);
curl_close($ch);
function do_nothing($curl, $input) {
return 0; // aborts transfer with an error
}
Important notes
Be aware that this will generate a warning, as the transfer will be aborted.
Make sure that you do not set the value of CURLOPT_RETURNTRANSFER, as this will interfere with the write callback.
You could do this through the curl_multi_* functions that are designed to execute multiple simultaneous requests - just fire off one request and don't bother asking for the response.
Not sure what the implications are in terms of what will happen if the script exits and curl is still running.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $MyUrl);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_to_send);
$mh = curl_multi_init();
curl_multi_add_handle($mh,$ch);
$running = 'idc';
curl_multi_exec($mh,$running); // asynchronous
// don't bother with the usual cleanup
Not sure if this helps, but via command-line I suppose you could use the '--max-time' option - "Maximum time in seconds that you allow the whole operation to take."
I had to do something quick and dirty and didn't want to have to re-program code or wait for a response, so found the --max-time option in the curl manual
curl --max-time 1 URL

cURL Checking for Timeout

I know how to set the timeout in cURL but I want to alert the user that the request timed out.
I have created an ajax script that allows the user to request data from various insurance sites and aggregate into a list. If any of the insurance sites fail to respond within a certain time I want to alert the user that the current quote from that company is not available at the moment.
Does cURL return anything to signal a timeout?
curl_errno() returns 28 if the operation timed out. See http://curl.haxx.se/libcurl/c/libcurl-errors.html for other error codes.
Or another solution that can cover even more cases (server timed out, server errored out with a blank page) is to check if your get_url function result is different that "" or FALSE.
Example of get_url function :
function get_url($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$tmp = curl_exec($ch);
curl_close($ch);
return $tmp;
}

Categories