I'm trying to use cURL in PHP to read an unreliable web page. The page is often unavailable because of server errors, but I still need to read it when it is available. I also don't want the unreliability of the web page to affect my code: I would like my PHP to fail gracefully and move on. Here is what I have so far:
<?php
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 2;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$handle = get_url_contents('http://www.mydomain.com/mypage.html');
?>
Use this instead. cURL isn't really recommended anymore from what I've heard, since the PHP stream wrappers offer much better performance and are always available wherever you go:
// Remember the current default context options so they can be restored afterwards.
$previousOptions = stream_context_get_options(stream_context_get_default());
// The timeout must live under the 'http' wrapper key, and stream_context_set_default() takes an options array.
stream_context_set_default(array('http' => array('timeout' => 2)));
// $url is the page address; @ suppresses the warning so a failed request just returns false.
$content = @file_get_contents($url);
// Put the previous default options back.
stream_context_set_default($previousOptions);
This sets the default stream context to time out after 2 seconds and gets the content of the URL via a stream wrapper, which has been available in every PHP version from 5.2 up.
You are not obliged to restore the previous default context, depending on the rest of your site's code, but it is always a good thing to do. If you skip that step, the whole operation can be done in two lines of code, as shown below.
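For reference, a minimal two-line sketch of that shorter form, assuming $url holds the page address; it passes a context directly to file_get_contents() instead of touching the default one:
$context = stream_context_create(array('http' => array('timeout' => 2)));
$content = @file_get_contents($url, false, $context);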
You could also test the HTTP response code to see whether the page was successfully retrieved. As a rule of thumb, 2xx codes mean success and 3xx codes mean a redirect (the check below also lets 300 and 301 through); have a quick peek at a list of HTTP response codes if you use this method.
<?php
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 2;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret['pagesource'] = curl_exec($crl);
    // Grab the HTTP status code of the response before closing the handle.
    $httpcode = curl_getinfo($crl, CURLINFO_HTTP_CODE);
    curl_close($crl);
    // Treat 2xx (and 300/301) as success; anything else is a failure.
    if ($httpcode >= 200 && $httpcode < 302) {
        $ret['response'] = true;
    } else {
        $ret['response'] = false;
    }
    return $ret;
}

$handle = get_url_contents('http://192.168.1.118/newTest/mainBoss.php');
if ($handle['response'] == false) {
    echo 'page is no good';
} else {
    echo 'page is ok and here it is: ' . $handle['pagesource'] . ' DONE.<br>';
}
?>
Related
I'm trying to get all links from a specific web page. I only need 'a' tags with a specific attribute, but first, as far as I know, I have to download the whole page.
I use this (mostly not mine) code:
<?php
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Make curl return the data instead of printing it to the browser.
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$startUrl = 'address';
$data = file_get_contents_curl($startUrl);
echo($data);
?>
With this I'm getting a "Too many requests" error. The question is: can I change the number of requests made while finding the elements of the links array?
I'm thinking about curl_multi, but as far as I understand it, it assumes that I already have the array of URLs and only lets me run those requests in parallel.
Help, please.
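A minimal sketch of one way to approach this, assuming the goal is to pull the 'a' tags out of the page already downloaded into $data and then fetch them one at a time with a pause between requests so the server stops rate-limiting; DOMDocument does the parsing, and the href check is a placeholder for whatever attribute you actually need to filter on:
$dom = new DOMDocument();
@$dom->loadHTML($data);              // @ hides warnings from messy real-world HTML
$links = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {  // swap in the attribute you actually need
        $links[] = $a->getAttribute('href');
    }
}
foreach ($links as $link) {
    $page = file_get_contents_curl($link);
    // ... inspect $page here ...
    sleep(1);                        // throttle so the server stops answering "Too many requests"
}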
If I can get the title of the page, I can tell whether the download link is active or dead.
For example: "Free online storage" is the title of a dead link and "[file name]" is the title of an active link (mediafire). But my page takes too long to respond, so is there any other way to check if a download link is active or dead?
This is what I have done:
<?php
function getTitle($Url){
    $str = file_get_contents($Url);
    // Only read the title if we got content back and a <title> tag was actually matched.
    if (strlen($str) > 0 && preg_match("/\<title\>(.*)\<\/title\>/", $str, $title)) {
        return $title[1];
    }
    return null;
}
?>
Do not perform a GET request, which downloads the whole page/file; use a HEAD request instead, which gets only the HTTP headers, then check that the status is 200 and that the content-type is not text/html.
Something like this...
function url_validate($link)
{
    # http://www.example.com/determining-if-a-url-exists-with-curl/
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request: headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);         // follow up to 10 redirections - avoids loops
    $data = curl_exec($ch);
    curl_close($ch);
    if (!$data) {
        return false;
    }
    // Pull every status line out of the headers and keep the last one - the final response after redirects.
    preg_match_all("/HTTP\/\d(?:\.\d)?\s(\d{3})/", $data, $matches);
    $code = end($matches[1]);
    if ($code == 200) {
        return true;
    } elseif ($code == 404) {
        return false;
    }
}
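The function above only looks at the status code. If you also want the content-type check mentioned earlier (a dead link tends to serve an HTML page, an active one serves the file itself), a rough sketch using curl_getinfo() could look like this; the function name and the text/html test are assumptions about how the host signals a dead link:
function url_is_downloadable($link)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_NOBODY, true);            // HEAD request: headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);     // final status after redirects
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);  // e.g. "text/html; charset=UTF-8"
    curl_close($ch);
    // A 200 with a non-HTML content type suggests an actual file rather than an error or landing page.
    return $code == 200 && stripos((string) $type, 'text/html') === false;
}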
You can safely use any cURL library function. It is legitimate and would not be regarded as a hacking attempt. The only requirement is that your web hosting company has the cURL extension installed, which is very likely.
cURL should do the job. You can check the headers returned and the text content as well if you want.
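For instance, a rough sketch along those lines, assuming (as in the question) that a dead mediafire link can be recognised by the "Free online storage" title text; the helper name and that marker string are placeholders for whatever your pages actually return:
function link_is_active($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Dead if the request failed or the status is not 2xx...
    if ($body === false || $code < 200 || $code >= 300) {
        return false;
    }
    // ...or if the page carries the known "dead link" title text.
    return stripos($body, 'Free online storage') === false;
}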
If you just enter the URLs into the browser you can see that both work (cdon works even without JavaScript), so have they blocked cURL somehow?
I'm trying to build a scraper for legal movies online, which would benefit those sites a whole lot; blocking scrapers in general seems silly, imho. Although I'm far from sure that's what's going on here! It might just be an error somewhere...
// Works
get_file1('http://sfanytime.com/sv-SE/Sokresultat/?field=all&q=The+Matrix', '/', 'sfanytime.html');
// Saves a blank 0 KB file
get_file1('http://downloads.cdon.com/index.phtml?action=search&search_terms=The+Matrix', '/', 'cdon.html');
function get_file1($file, $local_path, $newfilename) {
$out = fopen($newfilename, 'wb');
if ($out === FALSE) {
return false;
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_FILE, $out);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_URL, $file);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
$error = curl_error($ch);
if (strlen($error) > 0) {
echo "<br>Error is : ". $error;
return false;
}
curl_close($ch);
return true;
}
You should change the line
curl_setopt($ch, CURLOPT_FAILONERROR, true);
...to...
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
CURLOPT_FAILONERROR causes a "silent fail", which, from what you say, is not what you want. I have replaced it with CURLOPT_FOLLOWLOCATION, because when I visit the second URL I get redirected to a "choose your country" type page, and that redirect response has an empty body, which is why you get an empty file.
There is no problem with your code as such, simply a problem with the way you handle the response from the second URL. You don't see an error because, technically, there wasn't one.
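If you want to see for yourself what the second URL is doing, a small diagnostic sketch, separate from your get_file1(), that just prints what cURL reports could look like this:
function inspect_url($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    // curl_getinfo() reports the final status code, the URL actually fetched
    // after redirects, and how much content came back.
    echo 'HTTP code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br>';
    echo 'Final URL: ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . '<br>';
    echo 'Bytes received: ' . strlen((string) $body) . '<br>';
    curl_close($ch);
}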
I have tried Google reverse geocoding. The following function is called multiple times in a for loop, but it only works intermittently: sometimes the response contains the address perfectly, sometimes I get no response at all. What is the problem here?
function reversegeo($ilatt,$ilonn)
{
$url1='http://maps.googleapis.com/maps/api/geocode/json?latlng='.$ilatt.','.$ilonn.'&sensor=false';
$ch1 = curl_init();
curl_setopt($ch1, CURLOPT_URL, $url1);
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch1, CURLOPT_REFERER, 'http://www.mywebsiteurl.com/Trackfiles/report.php');
$body1 = curl_exec($ch1);
curl_close($ch1);
$json1 = json_decode($body1);
$add=$json1->results[0]->formatted_address;
return $add;
}
You are probably hitting the server too often, or too fast. Add some delays in there with sleep().
Also, when you say "no response got", you need to be more specific. Google will give an error code if you are hitting it too often; it won't just return nothing.
Instead of:
$body1 = curl_exec($ch1);
Do:
if (($body1 = curl_exec($ch1)) === false) {
    echo 'Curl error: ' . curl_error($ch1);
}
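Putting the two suggestions together, a minimal sketch of the calling loop; $points and the one-second delay are assumptions (tune the delay as needed), and Google also reports throttling through the "status" field of its JSON response, which you could additionally check inside reversegeo():
$points = array(array(51.5074, -0.1278), array(48.8566, 2.3522)); // example coordinates
foreach ($points as $point) {
    $address = reversegeo($point[0], $point[1]);
    if ($address == '') {
        echo 'No address returned for ' . $point[0] . ',' . $point[1] . '<br>';
    }
    sleep(1); // pause between calls so Google does not rate-limit the requests
}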
In PHP, how can I determine if any remote file (accessed via HTTP) exists?
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10); //follow up to 10 redirections - avoids loops
$data = curl_exec($ch);
curl_close($ch);
if (!$data) {
    echo "Domain could not be found";
}
else {
    // Take the last status line in the headers - the final response after any redirects.
    preg_match_all("/HTTP\/\d(?:\.\d)?\s(\d{3})/", $data, $matches);
    $code = end($matches[1]);
    if ($code == 200) {
        echo "Page Found";
    }
    elseif ($code == 404) {
        echo "Page Not Found";
    }
}
Modified version of code from here.
I like curl or fsockopen to solve this problem. Either one can provide header data regarding the status of the file requested. Specifically, you would be looking for a 404 (File Not Found) response. Here is an example I've used with fsockopen:
http://www.php.net/manual/en/function.fsockopen.php#39948
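The link above has the full example; as a rough self-contained sketch of the same idea (a plain HEAD request over fsockopen(), checking the status line by hand; the host and path below are placeholders):
function remote_file_exists_fsockopen($host, $path)
{
    $fp = fsockopen($host, 80, $errno, $errstr, 10);   // 10 second connection timeout
    if (!$fp) {
        return false;                                  // DNS or connection failure
    }
    // Send a minimal HEAD request and read just the status line.
    fwrite($fp, "HEAD $path HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $statusLine = fgets($fp, 128);
    fclose($fp);
    // Expect something like "HTTP/1.1 200 OK"; anything else counts as missing here.
    return strpos($statusLine, ' 200 ') !== false;
}
// Example: remote_file_exists_fsockopen('www.example.com', '/index.html');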
This function will return the response code (the last one, in case of redirection), or false in case of a DNS or other error. If only one argument (the URL) is supplied, a HEAD request is made. If a second argument is given, a full request is made and the content of the response, if any, is stored by reference in the variable passed as the second argument.
function url_response_code($url, & $contents = null)
{
    $context = null;
    if (func_num_args() == 1) {
        // Only the URL was supplied, so a HEAD request is enough.
        $context = stream_context_create(array('http' => array('method' => 'HEAD')));
    }
    $contents = @file_get_contents($url, false, $context);
    $code = false;
    if (isset($http_response_header)) {
        // Keep the last status line, i.e. the final response after any redirects.
        foreach ($http_response_header as $header) {
            if (strpos($header, 'HTTP/') === 0) {
                list(, $code) = explode(' ', $header);
            }
        }
    }
    return $code;
}
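For example, called with one argument it issues a HEAD request and only returns the status code; with a second argument it does a full GET and fills that variable with the body:
$code = url_response_code('http://www.example.com/');        // HEAD request
if ($code == 200) {
    echo "Page exists";
}
$code = url_response_code('http://www.example.com/', $html); // GET request; body goes into $html
if ($code == 200) {
    echo "Fetched " . strlen($html) . " bytes";
}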
I was recently looking for the same info and found some really nice code here: http://php.assistprogramming.com/check-website-status-using-php-and-curl-library.html
function Visit($url) {
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode < 300) {
        return true;
    }
    else {
        return false;
    }
}

if (Visit("http://www.site.com")) {
    echo "Website OK";
}
else {
    echo "Website DOWN";
}
Use cURL, and check whether the request went through successfully.
http://w-shadow.com/blog/2007/08/02/how-to-check-if-page-exists-with-curl/
Just a note that these solutions will not work on a site that does not give an appropriate response for a page that is not found. For example, I just had a problem testing for a page on a site that simply loads its main page whenever it gets a request it cannot handle, so the site nearly always gives a 200 response, even for non-existent pages.
Some sites will show a custom error on a standard page and still not send a 404 header.
There is not much you can do in these situations, unless you know the expected content of the page and test that the expected content exists, or test for some expected error text within the page, and that all gets a bit messy...
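As a rough sketch of that last resort; the function name and the marker strings are made up, and you would substitute text you know appears on the real page or on the site's custom error page:
function page_really_exists($url, $expectedText, $errorText = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($body === false || $code != 200) {
        return false;
    }
    // Even with a 200, reject the page if it shows the site's known error text
    // or is missing content the real page is known to contain.
    if ($errorText !== null && stripos($body, $errorText) !== false) {
        return false;
    }
    return stripos($body, $expectedText) !== false;
}
// Example: page_really_exists('http://www.site.com/somepage', 'Product details', 'Page not found');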