I have to fetch multiple web pages, let's say 100 to 500. Right now I am using curl to do so.
function get_html_page($url) {
//create curl resource
$ch = curl_init();
//set url
curl_setopt($ch, CURLOPT_URL, $url);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
//$html contains the output string
$html = curl_exec($ch);
//close curl resource to free up system resources
curl_close($ch);
return $html;
}
My major concern is the total time my script takes to fetch all these web pages. I know the time taken is largely determined by my internet speed, and hence most of it is spent in the $html = curl_exec($ch); call.
I was thinking that instead of creating and destroying a cURL instance again and again for each web page, I could create it only once, reuse it for every page, and destroy it at the end. Something like:
<?php
function get_html_page($ch, $url) {
//set the url for this request on the shared handle
curl_setopt($ch, CURLOPT_URL, $url);
//$html contains the output string
$html = curl_exec($ch);
return $html;
}
//create curl resource once
$ch = curl_init();
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
.
.
.
<fetch web pages using get_html_page()>
.
.
.
//close curl resource to free up system resources
curl_close($ch);
?>
Will it make any significant difference in the total time taken? If there is a better approach, please let me know about that as well.
How about benchmarking it? It may be slightly more efficient the second way, but I don't think it will add up to much: your system can create and destroy curl handles in microseconds, and unless many of the pages live on the same host (where a reused handle can keep the connection alive), it has to set up the same HTTP connections either way.
If you were running many of these at the same time and were worried about system resources rather than time, it might be worth exploring. As you noted, most of the time is spent waiting on network transfers, so I don't think you'll notice a change in overall time with either method.
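If the goal is to cut total wall-clock time, the bigger win is usually fetching the pages in parallel rather than reusing one handle serially. Here is a rough sketch using PHP's curl_multi API; the function name get_html_pages and the fire-everything-at-once approach are just for illustration:
function get_html_pages(array $urls) {
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_multi_add_handle($mh, $ch);
$handles[$url] = $ch;
}
//run all transfers concurrently
do {
curl_multi_exec($mh, $running);
if ($running) {
curl_multi_select($mh); //wait until one of the transfers has activity
}
} while ($running > 0);
//collect the results and clean up
$results = array();
foreach ($handles as $url => $ch) {
$results[$url] = curl_multi_getcontent($ch);
curl_multi_remove_handle($mh, $ch);
curl_close($ch);
}
curl_multi_close($mh);
return $results;
}
With 100 to 500 URLs you would probably want to feed them in smaller batches (say 10 to 20 at a time) so you don't open hundreds of connections at once.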
For web scraping I would use YQL + JSON + XPath, implemented with cURL.
I think you'll save a lot of resources.
Related
I have lots of websites where I use a lot of includes. The files I include are hosted on an external include server. My problem is: I want to make those includes redundant, so that if the include server goes down they are fetched from my second include server instead.
Doing that manually on each website would take far too long, so I wonder if there is a way to do it on the server side (so that if the server is down, requests are forwarded to the other server).
Here is an example of how I usually include my files:
<?php
$url = 'http://myincludeserver.com/folder/fileiwanttoinclude.php';
function get_data($url)
{
$ch = curl_init($url);
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $_REQUEST);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data($url);
if(!empty($returned_content))
{
echo $returned_content;
}
else
{
include('includes/local_error_message.php');
}
?>
Thanks for reading!
Short answer:
You're more than likely going to want to refactor your code.
Longer answer:
If you truly want to do this at the server level then you're looking at implementing a "failover." You can read the Wikipedia article, or this how-to guide, for a more in-depth explanation. To explain it simply, you would basically need three web servers:
Your include server
A backup server
A monitoring / primary server
It sounds like you've already got all three, but bullet three would ideally be a service provided by a third party for extra redundancy, to handle the DNS (there could still be downtime while DNS updates propagate). Of course, this introduces several gotchas that might have you end up refactoring anyway. For example, you might run into load-balancing challenges: your application now needs to consider resources shared between servers, such as anything written to disk, sessions, or databases. Tools like HAProxy can help.
The simpler option, especially if the domains associated with the includes are hidden from the user, is to refactor and simply replace bullet three with a script similar to your get_data function:
function ping($domain) {
$ch = curl_init($domain);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
return $response ? true : false;
}
$server1 = 'http://example.com';
$server2 = 'http://google.com';
if (ping($server1)) {
$server = $server1;
} else {
$server = $server2;
}
//$server now holds whichever include server responded
This would require you to update all of your files, but the good news is that you can automate the process by traversing all of your PHP files and replacing the code via regex or a tokenizer. How you implement this option depends entirely on your actual code and on any differences between the sites.
The only caveat here is that it could potentially double the hits to your server, so it would probably be better to run the check periodically through cron and store the result in an environment or global variable, rather than pinging on every request.
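For instance, one way to avoid those extra hits is to cache the ping result for a few minutes instead of checking on every request. A minimal sketch along those lines; the cache file path and TTL are made up, and ping() is the function above:
//hypothetical cache location and lifetime, adjust to your environment
define('SERVER_CACHE_FILE', '/tmp/include_server.cache');
define('SERVER_CACHE_TTL', 300); //seconds between checks
function get_include_server($primary, $backup) {
//reuse the cached answer while it is still fresh
if (file_exists(SERVER_CACHE_FILE) && (time() - filemtime(SERVER_CACHE_FILE)) < SERVER_CACHE_TTL) {
return file_get_contents(SERVER_CACHE_FILE);
}
//otherwise ping again and remember the result
$server = ping($primary) ? $primary : $backup;
file_put_contents(SERVER_CACHE_FILE, $server);
return $server;
}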
I hope that helps.
I have a site that scrapes off of its sister sites, but for reporting reasons I'd like to be able to work out how long the task took to run. How would I approach this with PHP? Is it even possible?
In an ideal world, if the task couldn't connect after 5 seconds, I'd like to kill the function and report the failure.
Thank you all!
If you use cURL for scraping, you can use the timeout options like this:
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options including timeout
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the result in a string
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // the number of seconds to wait while trying to connect
curl_setopt($ch, CURLOPT_TIMEOUT, 5); // the maximum number of seconds the whole request may take
// grab the info
if (!$result = curl_exec($ch))
{
trigger_error(curl_error($ch));
}
// close cURL resource, and free up system resources
curl_close($ch);
// process the $result
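For the reporting part of the question, cURL can also tell you how long the request actually took; read it with curl_getinfo before calling curl_close. A minimal addition to the example above:
// ... after curl_exec(), before curl_close() ...
$elapsed = curl_getinfo($ch, CURLINFO_TOTAL_TIME); //total transaction time in seconds
echo "Fetch took {$elapsed} seconds";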
I am using the following code, in PHP, to get the thumbnail (from the JSON) of a Vimeo video according to the Vimeo API:
private function curl_get($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 30);
$return = curl_exec($curl);
curl_close($curl);
return $return;
}
After profiling the page I noticed that curl_exec takes about 220 milliseconds, which I find a lot considering that I only want the thumbnail of the video.
Do you know a faster way to get the thumbnail?
curl_exec takes about 220 milliseconds
It's probably the network overhead (DNS lookup - connect - transfer - fetching the transferred data). It may not be possible to speed this up any further.
Make sure you are caching the results locally, so they don't have to be fetched anew every time.
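As a rough sketch of that caching idea (the cache location and TTL are placeholders, and curl_get refers to the method above):
private function curl_get_cached($url, $ttl = 3600)
{
//hypothetical cache path, adjust to your setup
$cacheFile = sys_get_temp_dir() . '/vimeo_' . md5($url) . '.json';
//serve from cache while it is still fresh
if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
return file_get_contents($cacheFile);
}
//otherwise fetch it once and store it
$data = $this->curl_get($url);
if ($data !== false) {
file_put_contents($cacheFile, $data);
}
return $data;
}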
For some reason my curl call is very slow. Here is the code I used.
$postData = "test"
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);
$result = curl_exec($ch);
Executing this code takes on average 250ms to finish.
However when I just open the url in a browser, firebug says it only takes about 80ms.
Is there something I am doing wrong? Or is this just overhead associated with PHP cURL?
It's the call to curl_exec that is taking up all the time.
UPDATE:
So I figured out right after I posted this that setting the curl option
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
significantly slows down curl_exec. The post data could be anything and it still slows it down. Even if I set
curl_setopt($ch, CURLOPT_POST, false);
it's still slow.
I'll try to work around it by just adding the parameters to the URI as a query string.
SECOND UPDATE:
Confirmed that if I just call the URI using GET and pass the parameters as a query string, it is much faster than using POST and putting the parameters in the body.
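For reference, a sketch of that workaround, assuming the parameters live in an array called $params (a made-up name):
$params = array('foo' => 'bar'); //whatever was previously sent as POST fields
$ch = curl_init($url . '?' . http_build_query($params));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);
$result = curl_exec($ch);
curl_close($ch);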
cURL has some problems with DNS look-ups. Try using an IP address instead of a domain name.
Curl has the ability to tell exactly how long each piece took and where the slowness is (name lookup, connect, transfer time). Use curl_getinfo (http://www.php.net/manual/en/function.curl-getinfo.php) after you run curl_exec.
If curl is slow, it is generally not the PHP code, it's almost always network related.
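For example, right after curl_exec you can dump the timing breakdown that curl_getinfo collects:
$result = curl_exec($ch);
$info = curl_getinfo($ch); //per-phase timings, all in seconds
printf("dns: %.3f, connect: %.3f, first byte: %.3f, total: %.3f",
$info['namelookup_time'],
$info['connect_time'],
$info['starttransfer_time'],
$info['total_time']);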
try this
curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4 );
Adding "curl_setopt($ch, CURLOPT_POSTREDIR, CURL_REDIR_POST_ALL);" solved here. Any problem with this solution?
I just resolved this exact problem by removing the following two options:
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
Somehow, on the site I was fetching, the POST request took over ten full seconds. With GET, it's less than a second.
So, in my wrapper function that does the cURL requests, it now only sets those two options when there is something in $postData.
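A sketch of what such a wrapper might look like (the function name curl_request is made up):
function curl_request($url, $postData = null) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//only switch to POST when there is actually a body to send
if (!empty($postData)) {
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
}
$result = curl_exec($ch);
curl_close($ch);
return $result;
}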
I just experienced a massive speed-up through compression. By setting the Accept-Encoding header to "gzip, deflate", or to all formats cURL supports, my ~200 MB download took 6s instead of 20s:
curl_setopt($ch, CURLOPT_ENCODING, '');
Notes:
If an empty string, "", is set, a header containing all supported encoding types is sent.
You do not even have to handle decompression after the download; cURL does it internally.
CURLOPT_ENCODING requires Curl 7.10+
The curl functions in PHP use libcurl, the same library that powers the curl command-line tool on *nix systems.
Therefore it really only depends on the network speed, since in general curl itself is much faster than a web browser: by default it does not load any additional data such as images or stylesheets referenced by the page.
It might also be that the network performance of the server on which you tested your PHP script is far worse than that of the local computer where you tested in the browser, in which case the two measurements are not really comparable.
Generally that's acceptable when you are loading content from, or posting to, the slower end of the world; the time a cURL call takes depends directly on your network speed and the throughput of the web server you're contacting.
Could you please tell me which of these code samples uses the least RAM? Here are my two examples:
$ch = curl_init();
foreach ($URLS as $url){
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url.'&no_cache');
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
curl_exec($ch);
}
// close cURL resource, and free up system resources
curl_close($ch);
or
foreach ($URLS as $url){
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url.'&no_cache');
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
curl_exec($ch);
curl_close($ch);
}
The first one has lighter overhead, as you only instantiate the curl handle once; but if curl has any leaks in it and you're fetching a large-ish number of URLs, you could run out of memory.
Usually I only create a new curl handle if the next URL to fetch needs settings too different from the previous one. It's easier to start with a default setup and make changes from that than to try to "undo" the conflicting settings from the previous run.
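For completeness, a sketch of the first approach with that caveat in mind: set the shared options once, change only the URL per iteration, and on PHP 5.5+ reach for curl_reset() if a previous request left conflicting settings behind:
$ch = curl_init();
//shared options, set once
curl_setopt($ch, CURLOPT_HEADER, 0);
foreach ($URLS as $url) {
//only the URL changes per request
curl_setopt($ch, CURLOPT_URL, $url.'&no_cache');
curl_exec($ch); //output goes straight to the browser, as in the examples above
//curl_reset($ch); //PHP 5.5+: wipes every option, so re-set the shared ones if you use it
}
curl_close($ch);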