I'm checking for the presence of an XML sitemap on different URLs. If I supply a URL like example.com/sitemap.xml and it has a 301 to www.example.com/sitemap.xml, I get a 301, obviously. If www.example.com/sitemap.xml doesn't exist, I won't see the 404. So, if I get a 301, I execute another cURL request to see if a 404 comes back for www.example.com/sitemap.xml. But, for some reason, I get random 404 and 303 status codes.
private function check_http_status($domain, $file){
    $url = $domain . "/" . $file;
    $curl = new Curl();
    $curl->url = $url;
    $curl->nobody = true;
    $curl->userAgent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.1) Gecko/20060601 Firefox/2.0.0.1 (Ubuntu-edgy)';
    $curl->execute();
    $retcode = $curl->httpCode();
    if ($retcode == 301 || $retcode == 302){
        $url = "www." . $domain . "/" . $file;
        $curl = new Curl();
        $curl->url = $url;
        $curl->nobody = true;
        $curl->userAgent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.1) Gecko/20060601 Firefox/2.0.0.1 (Ubuntu-edgy)';
        $curl->execute();
        $retcode = $curl->httpCode();
    }
    return $retcode;
}
Have a look at the list of response codes returned: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html.
Usually a web browser will handle these automatically, but as you are doing things manually with cURL, you need to understand what each response means. A 301 or 302 means that you should use the alternative URL supplied in the response to access the resource. That may be as simple as adding www to the request, but it may also be more complex, such as a redirect to a different domain altogether.
A 303 means that you made a POST request and should retrieve the result with a GET request to the URL supplied in the Location header.
Well, when you receive a 301 or 302 you should use the Location given in the response, not just assume another location and try that.
As you can see in this example, the response from the server contains the new location of the file. Use that for your next request:
http://en.wikipedia.org/wiki/HTTP_301#Example
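A minimal sketch of that idea, using plain curl_* calls rather than the Curl wrapper class from the question: request only the headers, and read the Location header instead of guessing the "www." variant. get_redirect_target() is a hypothetical helper name.

function get_redirect_target($url) {
    // Returns the Location header of a 301/302 response, or null if there is no redirect.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true);          // include headers in the output
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // inspect the redirect, don't follow it
    $headers = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if (($code == 301 || $code == 302) && preg_match('/^Location:\s*(.+)$/mi', $headers, $m)) {
        return trim($m[1]); // the URL the server actually redirects to
    }
    return null; // no redirect
}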
"followLocation" works very well. Here is how I implemented it:
$url = "http://www.YOURSITE.com//"; // Assign you url here.
$ch = curl_init(); // initialize curl.
curl_setopt($ch, CURLOPT_URL, $url); // Pass the URL as the option/target.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // 0 will print html. 1 does not.
curl_setopt($ch, CURLOPT_HEADER, 0); // Please curl, inlude the header in the output.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // ..and yes, follow what the server sends as part of the HTTP header.
$response_data = curl_exec($ch); // execute curl with the target URL.
$http_header = curl_getinfo($ch); // Gets information about the last transfer i.e. our URL
// Print the URLs that are not returning 200 Found.
if($http_header['http_code'] != "200") {
echo " <b> PAGE NOT FOUND => </b>"; print $http_header['http_code'];
}
// print $http_header['url']; // Print the URL sent back in the header. This will print the page to wich you were redirected.
print $url; // this will print the original URLs that you are trying to access
curl_close($ch); // we are done with curl; so let's close it.
I have a function that I use to test whether a URL is valid before I store it in my db.
function url_exists($url)
{
    ini_set("default_socket_timeout", "5");
    set_time_limit(5);
    $f = fopen($url, "r");
    $r = fread($f, 1000);
    fclose($f);
    return strlen($r) > 1;
}
if( !url_exists($test['urlRedirect']) ) { ... }
It works great; however, one of my users reported an issue today, and when I tested it, indeed the following URL was flagged as invalid:
http://www.artleaguehouston.org/charge-grant-survey
So I tried removing the page name and using only the domain, and still got the error. What is it about this domain that my script chokes on?
You're trying to eat soup with a Swiss Army knife there!
PHP supports URL wrappers in file_exists (note this relies on the wrapper supporting stat(), which the plain HTTP wrapper may not, so the cURL variant below is generally the safer check):
if (file_exists("http://www.artleaguehouston.org/charge-grant-survey")) {
// URL returns a good status code for your IP and User Agent "PHP/x.x.x"
}
CURL:
$ch = curl_init('http://www.artleaguehouston.org/charge-grant-survey');
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0'
);
curl_exec($ch);
$statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($statusCode == 200) {
// Site up and good status code
}
(Mostly taken from How can one check to see if a remote file exists using PHP?, just to give correct credit.)
I am currently attempting to configure a cURL and PHP function found online that, when called, checks whether the HTTP response code is in the 200-300 range to determine if a web page is up. This works when run against an individual website with the code below (not the function itself, but the if statements etc.). The function returns true or false depending on the HTTP response code:
$page = "www.google.com";
$page = gzdecode($page);
if (Visit($page))
{
    echo $page;
    echo " Is OK <br>";
}
else
{
    echo $page;
    echo " Is DOWN <br>";
}
However, when running against an array of URLs stored within the script through a foreach loop, it reports every web page in the list as down, even though the code is the same apart from the added loop, of course.
Does anyone know what the issue may be surrounding this?
Edit - adding the Visit function
My bad, sorry, I wasn't thinking it through.
The Visit function is the following:
function Visit($url){
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_SSLVERSION, 3);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if($httpcode >= 200 && $httpcode < 310) return true;
    else return false;
}
The foreach loop as mentioned looks like this:
foreach($Urls as $URL)
{
    $page = $URL;
    $page = gzdecode($page);
    if (Visit($page))
The if statement for the Visit part is the same as before.
$page = $URL;
$page = gzdecode($page);
Why are you trying to uncompress the non-compressed URL? Assuming you really meant to uncompress the content returned from the URL, why would the remote server compress it when you've told it that the client does not support compression? And why are you fetching the entire page just to see the headers?
The code you've shown us here has never worked.
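A minimal sketch of the fix being suggested, reusing the $Urls array and the output format from the question: drop gzdecode() entirely and let cURL fetch only the headers. VisitHead() is a hypothetical variant of the Visit() function above.

function VisitHead($url) {
    // Check the status code with a HEAD request; no body is downloaded.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpcode >= 200 && $httpcode < 310;
}

foreach ($Urls as $URL) {
    echo $URL . (VisitHead($URL) ? " Is OK <br>" : " Is DOWN <br>");
}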
I get the following Error:
Warning:
file_get_contents(https://www.readability.com/api/content/v1/parser?url=http://www.redmondpie.com/ps1-and-ps2-games-will-be-playable-on-playstation-4-very-soon/?utm_source=dlvr.it&utm_medium=twitter&token=MYAPIKEY)
[function.file-get-contents]: failed to open stream: HTTP request
failed! HTTP/1.1 404 NOT FOUND in
/home/DIR/htdocs/readability.php
on line 23
With some echoes I checked the URL built by the function, and it is fine and valid; when I make the request from my browser it works OK.
The thing is that I get the error above with file_get_contents and I really don't understand why.
The URL is valid and the function is NOT blocked by the free hosting service (so I shouldn't need cURL).
If someone could spot the error in my code, I would appreciate it!
Thanks...
Here is my code:
<?php
class jsonRes{
    public $url;
    public $author;
    public $image;
    public $excerpt;
}

function getReadable($url){
    $api_key='MYAPIKEY';
    if(isset($url) && !empty($url)){
        // I tried changing to http, no 'www' etc... -THE URL IS VALID/The browser opens it normally-
        $requesturl='https://www.readability.com/api/content/v1/parser?url=' . urlencode($url) . '&token=' . $api_key;
        $response = file_get_contents($requesturl); // * here the code FAILS! *
        $g = json_decode($response);

        $article_link=$g->url;
        $article_author='';
        if($g->author != null){
            $article_author=$g->author;
        }
        $article_image='';
        if($g->lead_image_url != null){
            $article_image=$g->lead_image_url;
        }
        $article_excerpt=$g->excerpt;

        $toJSON=new jsonRes();
        $toJSON->url=$article_link;
        $toJSON->author=$article_author;
        $toJSON->image=$article_image;
        $toJSON->excerpt=$article_excerpt;

        $retJSONf=json_encode($toJSON);
        return $retJSONf;
    }
}
?>
Sometimes a website will block crawlers (from remote servers) from getting to its pages.
The workaround is to spoof a browser's headers: pretend to be Mozilla Firefox instead of the sneaky PHP web scraper you are.
This is a function which uses the cURL library to do just that.
function get_data($url) {
    $userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    else {
        return $html;
    }
} //End of cURL function
One would then call it as below:
$response = get_data($requesturl);
cURL offers many more options for fetching remote content and for error checking than file_get_contents does. If you want to customize it further, check out the list of cURL options here: Abridged list of cURL options.
I have the following code for cURL using PHP:
$product_id_edit="Playful Minds (1062)";
$item_description_edit="TEST";
$rank_edit="0";
$price_type_edit="2";
$price_value_edit="473";
$price_previous_value_edit="473";
$active_edit="1";
$platform_edit="ios";
//set POST variables
$url = 'https://www.domain.com/adm_test/phpgen/offline_items.php?operation=insert';
$useragent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1';
$fields = array(
'product_id_edit'=>urlencode($product_id_edit),
'item_description_edit'=>urlencode($item_description_edit),
'rank_edit'=>urlencode($rank_edit),
'price_type_edit'=>urlencode($price_type_edit),
'price_value_edit'=>urlencode($price_value_edit),
'price_previous_value_edit'=>urlencode($price_previous_value_edit),
'active_edit'=>urlencode($active_edit),
'platform_edit'=>urlencode($platform_edit)
);
$fields_string="";
//url-ify the data for the POST
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
rtrim($fields_string,'&');
//open connection
$ch = curl_init();
//set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
//add useragent
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt ($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch,CURLOPT_POSTFIELDS,$fields_string);
curl_setopt($ch,CURLOPT_POST,count($fields));
//execute post
$result = curl_exec($ch);
if(curl_errno($ch)){
print "" . curl_error($ch);
}else{
//print_r($result);
}
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
//echo "HTTP Response Code: " . curl_error($ch);
echo $httpCode;
//close connection
curl_close($ch);
I have $httpCode printed and I get code 200; I presume this is OK from what I have read in the manual pages. However, when I check the site, the POSTed values do not exist.
Does this have something to do with cross-domain requests, as I am not posting from the same domain? I'm doing it from 127.0.0.1/site/scrpt.php. And how could I get a 200 response code if the request was not successful?
I also tried to get a 404, which I did by removing part of the request URL; it did return a 404, so cURL seems to be working properly, in my assumption.
Does the URL https://www.domain.com/adm_test/phpgen/offline_items.php?operation=insert having the "?operation=insert" part have something to do with it?
Let's presume (though it's not implied) that I'm from another site and I want to post values into the form of another website, sort of like a robot, though my objective does not imply any evil intentions. Would I have to encode thousands of lines of info if this is not doable?
Likewise, I don't need a response from the server (but if one is available, that's just fine).
The operation should be passed with CURLOPT_POSTFIELDS, along with the other parameters.
The cross-domain issue only happens in the case of a browser. Your code is PHP server-side code, so this should not be an issue.
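A minimal sketch of that suggestion, reusing the $fields array and URL from the question (whether the target script actually reads operation from the POST body rather than the query string is an assumption here; if you use http_build_query, skip the urlencode() calls when building $fields, since it encodes for you):

// Move "operation=insert" out of the query string and into the POST body.
$url = 'https://www.domain.com/adm_test/phpgen/offline_items.php';

$fields['operation'] = 'insert';            // pass it along with the other parameters
$fields_string = http_build_query($fields); // builds and urlencodes key=value&... for you

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);        // boolean flag, not a count of fields
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
$result = curl_exec($ch);
curl_close($ch);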
Not sure if this is the solution or if the problem lies elsewhere, but this line:
rtrim($fields_string,'&');
Should be this:
$fields_string = rtrim($fields_string,'&');
curl_setopt($ch, CURLOPT_POST, TRUE);
CURLOPT_POST is a boolean; it is not a count of values, it is the "use POST" flag.
Code 200 indicates that the connection was set up correctly and a response was received from the server, but it does not mean that the requested action has been carried out.
Print $result after the request to see the response from the web server.
The curl_getinfo function returns a lot of metadata about the result of an HTTP request. However, for some reason it doesn't include the bit of information I want at the moment, which is the target URL if the request returns an HTTP redirection code.
I'm not using CURLOPT_FOLLOWLOCATION because I want to handle specific redirect codes as special cases.
If cURL can follow redirects, why can't it tell me what they redirect to when it isn't following them?
Of course, I could set the CURLOPT_HEADER flag and pick out the Location header. But is there a more efficient way?
This can be done in 4 steps:
Step 1. Initialise curl
$ch = curl_init(); //initialise the curl handle
//COOKIESESSION is optional, use if you want to keep cookies in memory
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
Step 2. Get the headers for $url
curl_setopt($ch, CURLOPT_URL, $url); //specify your URL
curl_setopt($ch, CURLOPT_HEADER, true); //include headers in http data
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); //don't follow redirects
$http_data = curl_exec($ch); //hit the $url
$curl_info = curl_getinfo($ch);
$headers = substr($http_data, 0, $curl_info['header_size']); //split out header
Step 3. Check if you have the correct response code
if (!($curl_info['http_code']>299 && $curl_info['http_code']<309)) {
//return, echo, die, whatever you like
return 'Error - http code'.$curl_info['http_code'].' received.';
}
Step 4. Parse the headers to get the new URL
preg_match("!\r\n(?:Location|URI): *(.*?) *\r\n!", $headers, $matches);
$url = $matches[1];
Once you have the new URL you can then repeat steps 2-4 as often as you like.
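Putting the steps together, a minimal sketch of a manual redirect-following loop (the starting URL and the hop limit are arbitrary examples):

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

$url = 'http://example.com/sitemap.xml'; // starting URL (example)
for ($hops = 0; $hops < 10; $hops++) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $http_data = curl_exec($ch);
    $curl_info = curl_getinfo($ch);

    if ($curl_info['http_code'] < 300 || $curl_info['http_code'] > 308) {
        break; // not a redirect, we are done
    }
    $headers = substr($http_data, 0, $curl_info['header_size']);
    if (!preg_match("!\r\n(?:Location|URI): *(.*?) *\r\n!", $headers, $matches)) {
        break; // redirect without a Location header
    }
    $url = $matches[1]; // next hop
}
curl_close($ch);
echo $url; // final (or last known) URL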
You can simply use CURLINFO_REDIRECT_URL:
$info = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
echo $info; // the redirect URL without following it
As you mentioned, disable the CURLOPT_FOLLOWLOCATION option (before executing) and place the code above after curl_exec().
CURLINFO_REDIRECT_URL - With the CURLOPT_FOLLOWLOCATION option
disabled: redirect URL found in the last transaction, that should be
requested manually next. With the CURLOPT_FOLLOWLOCATION option
enabled: this is empty. The redirect URL in this case is available in
CURLINFO_EFFECTIVE_URL
Reference
cURL doesn't seem to have a function or option to get the redirect target directly, but it can be extracted using various techniques:
From the response:
Apache can respond with an HTML page in the case of a 301 redirect (this doesn't seem to be the case with 302s).
If the response has a format similar to:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.xxx.yyy/">here</a>.</p>
<hr>
<address>Apache/2.2.16 (Debian) Server at www.xxx.yyy Port 80</address>
</body></html>
You can extract the redirect URL using DOMXPath:
$i = 0;
$results = array();
foreach($urls as $url) {
    if(substr($url, 0, 4) == "http") {
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        $result = @curl_exec($c);
        $status = curl_getinfo($c, CURLINFO_HTTP_CODE);
        curl_close($c);
        $results[$i]['code'] = $status;
        $results[$i]['url'] = $url;
        if($status === 301) {
            $xml = new DOMDocument();
            $xml->loadHTML($result);
            $xpath = new DOMXPath($xml);
            $href = $xpath->query("//*[@href]")->item(0);
            $results[$i]['target'] = $href->attributes->getNamedItem('href')->nodeValue;
        }
        $i++;
    }
}
Using CURLOPT_NOBODY
There is a faster way, however, as @gAMBOOKa points out: using CURLOPT_NOBODY. This approach sends a HEAD request instead of GET (not downloading the actual content, so it should be faster and more efficient) and stores the response header.
Using a regex, the target URL can be extracted from the header:
$i = 0;
foreach($urls as $url) {
    if(substr($url, 0, 4) == "http") {
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($c, CURLOPT_NOBODY, true);
        curl_setopt($c, CURLOPT_HEADER, true);
        $result = @curl_exec($c);
        $status = curl_getinfo($c, CURLINFO_HTTP_CODE);
        curl_close($c);
        $results[$i]['code'] = $status;
        $results[$i]['url'] = $url;
        if($status === 301 || $status === 302) {
            preg_match("#https?://([-\w\.]+)+(:\d+)?(/([\w/_\-\.]*(\?\S+)?)?)?#", $result, $m);
            $results[$i]['target'] = $m[0];
        }
        $i++;
    }
}
No, there is no more efficient way.
You can use CURLOPT_WRITEHEADER plus a variable stream.
That way you can write the headers to a variable and parse them.
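A minimal sketch of that idea, assuming an in-memory php://temp stream and an example URL:

// Write the response headers to a stream, then parse the Location out of them.
$headerStream = fopen('php://temp', 'w+');

$ch = curl_init('http://example.com/sitemap.xml'); // example URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);            // headers are enough
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);   // see the redirect, don't follow it
curl_setopt($ch, CURLOPT_WRITEHEADER, $headerStream);
curl_exec($ch);
curl_close($ch);

rewind($headerStream);
$headers = stream_get_contents($headerStream);
fclose($headerStream);

if (preg_match('/^Location:\s*(.+)$/mi', $headers, $m)) {
    $redirectTarget = trim($m[1]); // the URL the server redirects to
}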
I had the same problem, and curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); wasn't any help.
So I decided not to use cURL but file_get_contents instead:
$data = file_get_contents($url);
$data = str_replace("<meta http-equiv=\"Refresh\" content=\"0;","<meta",$data);
The last line helped me block the redirection, although the result is not clean HTML code.
I parsed the data and could retrieve the redirection URL I wanted to get.
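A minimal sketch of that parsing step, assuming the redirect is expressed as a meta refresh tag in the fetched HTML (the regex and variable names are illustrative):

// Pull the target URL out of a <meta http-equiv="Refresh" content="0; url=..."> tag.
$data = file_get_contents($url);
if (preg_match('/<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+content=["\'][^"\']*url=([^"\'>\s]+)/i', $data, $m)) {
    $redirectUrl = $m[1]; // the URL the page would redirect to
    echo $redirectUrl;
}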