As mentioned above, the PHP file_get_contents() function, and even the fopen()/fread() combination, gets stuck and times out when trying to read this simple image URL:
http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png
but the same image loads easily in a browser. What's the catch?
EDITED:
As requested in the comments, here is the function I used to fetch the data:
function customRead($url)
{
    $contents = '';
    $handle = fopen($url, "rb");
    $dex = 0;
    while (!feof($handle)) {
        if ($dex++ > 100) {
            echo "\nbreaking due to too many calls...\n";
            break;
        }
        $contents .= fread($handle, 2048);
    }
    fclose($handle);
    return $contents;
}
I also tried simply this:
echo file_get_contents('http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png');
Both approaches give the same issue.
EDITED:
As suggested in the comments, I used cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.1 Safari/537.11');
$res = curl_exec($ch);
$rescode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$reserror = curl_error($ch); // read the error before curl_close(), afterwards it is always empty
curl_close($ch);
echo "\n\n\n[DATA:";
echo $res;
echo "]\n\n\n[CODE:";
print_r($rescode);
echo "]\n\n\n[ERROR:";
echo $reserror;
echo "]\n\n\n";
This is the result:
[DATA:]
[CODE:0]
[ERROR:]
If you don't get the remote data with file_get_contents, you can try it with cURL, as it can provide error messages via curl_error(). If you get nothing, not even an error, then something on your server blocks outgoing connections. Maybe you even want to try curl from the command line over SSH. I'm not sure if that makes any difference, but it's worth the try. If you still get nothing, you may want to consider contacting the server admin (if that isn't you) or the provider.
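For reference, a minimal diagnostic sketch along those lines (the timeout values are only suggestions): read curl_error() before curl_close(), and set explicit connect and transfer timeouts so a blocked outgoing connection fails quickly instead of hanging.
// Diagnostic sketch for the image URL from the question.
$url = 'http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);        // fail fast if outgoing connections are blocked
curl_setopt($ch, CURLOPT_TIMEOUT, 30);               // overall transfer timeout
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');  // some hosts reject PHP's default agent
$data  = curl_exec($ch);
$code  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);                            // must be read BEFORE curl_close()
curl_close($ch);
if ($data === false) {
    echo "Request failed (HTTP $code): $error\n";
} else {
    echo "Got " . strlen($data) . " bytes (HTTP $code)\n";
}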
I implemented this function in order to parse HTML pages using two different "methods".
As you can see, both use the very handy class called simple_html_dom.
The difference is that the first method also uses cURL to load the HTML, while the second does not use cURL.
Both methods work fine on a lot of pages, but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 Access Denied response.
Did I do something wrong?
Or is there another method to avoid this kind of denial?
function searchThroughDOM($url, $method)
{
    echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
    $time_start = microtime(true);
    switch ($method) {
        case 'curl':
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, false);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_REFERER, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
            $str = curl_exec($curl);
            curl_close($curl);
            // Create a DOM object
            $html = new simple_html_dom();
            // Load HTML from a string
            $html->load($str);
            break;
        case 'simple_html_dom':
            $html = new simple_html_dom();
            $html->load_file($url);
            break;
    }
    $collection = $html->find('h1');
    foreach ($collection as $x => $x_value) {
        echo 'x = '.$x.' => value = '.$x_value.'<br>';
    }
    $html->save('result.htm');
    $html->clear();
    $time_end = microtime(true);
    echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view, there is nothing wrong with simple_html_dom. You may remove the simple_html_dom part of the code and leave only the cURL part, which I assume is the source of the problem.
There are lots of reasons why cURL might not work on a page.
First of all, I can see you added
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
You should also try adding CURLOPT_SSL_VERIFYHOST set to false.
Secondly, check your cURL version and see if it is too old.
Third, if none of the above works, you may want to enable cookies; it is possible that disabled cookies cause the website to detect that a machine, not a real person, is sending the request.
Lastly, if all of the above attempts fail, try another library or even file_get_contents. cURL is not your only option, although it is the most powerful one.
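A rough sketch of the header/cookie suggestions above (the cookie-file path and the extra request headers are assumptions, not something the site is known to require):
// Sketch only: browser-like headers plus a cookie jar, which sometimes
// gets past a 403 that is triggered by "bare" cURL requests.
$url = 'https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html';
$cookieJar = sys_get_temp_dir() . '/rakuten_cookies.txt'; // assumed location for the cookie jar

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);   // as suggested above
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);   // store whatever cookies the site sets
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);  // and send them back on redirects
curl_setopt($curl, CURLOPT_ENCODING, '');            // accept gzip/deflate like a browser
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml',
    'Accept-Language: fr-FR,fr;q=0.9,en;q=0.8',
));

$str = curl_exec($curl);
$status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);

if ($status == 403 || $str === false) {
    echo "Still blocked: HTTP $status"; // the site may also require JavaScript or a non-datacenter IP
} else {
    $html = new simple_html_dom();
    $html->load($str);
}
Even with all of this, some sites block data-center IP ranges or require JavaScript, so there is no guarantee this gets past the 403.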
I am currently attempting to configure a cURL & PHP function found online that, when called, checks whether the HTTP response code is in the 200-300 range to determine if the web page is up. This works when run against an individual website with the code below (not the function itself, but the if statements etc.). The function returns true or false depending on the HTTP response code:
$page = "www.google.com";
$page = gzdecode($page);
if (Visit($page))
{
    echo $page;
    echo " Is OK <br>";
}
else
{
    echo $page;
    echo " Is DOWN <br>";
}
However, when running it against an array of URLs stored within the script, using a foreach loop, it reports every web page in the list as down, even though the code is the same apart from the added loop.
Does anyone know what the issue may be?
Edit - adding the Visit function
My bad, sorry; I wasn't thinking fully.
The Visit function is the following:
function Visit($url) {
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_SSLVERSION, 3);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode < 310) return true;
    else return false;
}
The foreach loop as mentioned looks like this:
foreach ($Urls as $URL)
{
    $page = $URL;
    $page = gzdecode($page);
    if (Visit($page))
The if/else block for the Visit() call is the same as before.
$page = $URL;
$page = gzdecode($page);
Why are you trying to uncompress the non-compressed URL? Assuming you really meant to uncompress the content returned from the URL, why would the remote server compress it when you've told it that the client does not support compression? And why are you fetching the entire page just to see the headers?
The code you've shown us here has never worked.
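To illustrate, a minimal sketch of the loop without gzdecode(), using CURLOPT_NOBODY so only the headers are fetched (note that a few servers reject HEAD requests, so treat this as a sketch rather than a drop-in fix):
// Sketch: pass the URL straight to the check and only request headers.
function isUp($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD-style request: headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code >= 200 && $code < 310;
}

foreach ($Urls as $URL) {                            // $Urls as in the original loop
    echo $URL . (isUp($URL) ? ' Is OK' : ' Is DOWN') . '<br>';
}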
I get the following error:
Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http://www.redmondpie.com/ps1-and-ps2-games-will-be-playable-on-playstation-4-very-soon/?utm_source=dlvr.it&utm_medium=twitter&token=MYAPIKEY) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 NOT FOUND in /home/DIR/htdocs/readability.php on line 23
With some echoes I checked the URL built by the function, and it is fine and valid; when I make the request from my browser, it works.
The thing is that I get the error above with file_get_contents, and I really don't understand why.
The URL is valid, and the function is not blocked by the free hosting service (so I don't need cURL).
If someone could spot the error in my code, I would appreciate it!
Thanks...
Here is my code:
<?php
class jsonRes{
    public $url;
    public $author;
    public $image;
    public $excerpt;
}

function getReadable($url){
    $api_key='MYAPIKEY';
    if(isset($url) && !empty($url)){
        // I tried changing to http, no 'www' etc... -THE URL IS VALID/The browser opens it normally-
        $requesturl='https://www.readability.com/api/content/v1/parser?url=' . urlencode($url) . '&token=' . $api_key;
        $response = file_get_contents($requesturl); // * here the code FAILS! *
        $g = json_decode($response);

        $article_link=$g->url;
        $article_author='';
        if($g->author != null){
            $article_author=$g->author;
        }
        $article_image='';
        if($g->lead_image_url != null){
            $article_image=$g->lead_image_url;
        }
        $article_excerpt=$g->excerpt;

        $toJSON=new jsonRes();
        $toJSON->url=$article_link;
        $toJSON->author=$article_author;
        $toJSON->image=$article_image;
        $toJSON->excerpt=$article_excerpt;

        $retJSONf=json_encode($toJSON);
        return $retJSONf;
    }
}
?>
Sometimes a website will block crawlers (i.e. requests coming from remote servers) from getting to its pages.
The way to work around this is to spoof a browser's headers: pretend to be Mozilla Firefox instead of the sneaky PHP web scraper you are.
This is a function which uses the cURL library to do just that.
function get_data($url) {
    $userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    else {
        return $html;
    }
} // End of cURL function
One would then call it as below:
$response = get_data($requesturl);
cURL offers many more options for fetching remote content and checking for errors than file_get_contents does. If you want to customize it further, check out the list of cURL options here - Abridged list of cURL options
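If cURL is unavailable, a stream context can attach a similar User-Agent to file_get_contents(); a small sketch, reusing $requesturl from above (the header and timeout values are just examples):
// Sketch: spoof the User-Agent without cURL, via an HTTP stream context.
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'GET',
        'header'  => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\r\n",
        'timeout' => 10,
    ),
));
$response = @file_get_contents($requesturl, false, $context);
if ($response === false) {
    // $http_response_header holds the raw response headers when the HTTP request itself was answered
    print_r(isset($http_response_header) ? $http_response_header : 'request failed');
}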
define('COOKIE', './cookie.txt');
define('MYURL', 'https://register.pandi.or.id/main');
function getUrl($url, $method = '', $vars = '', $open = false) {
    $agents = 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16';
    $header_array = array(
        "Via: 1.1 register.pandi.or.id",
        "Keep-Alive: timeout=15,max=100",
    );
    static $cookie = false;
    if (!$cookie) {
        $cookie = session_name() . '=' . time();
    }
    $referer = 'https://register.pandi.or.id/main';
    $ch = curl_init();
    if ($method == 'post') {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, "$vars");
    }
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header_array);
    curl_setopt($ch, CURLOPT_USERAGENT, $agents);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 5);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
    curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    $buffer = curl_exec($ch);
    if (curl_errno($ch)) {
        echo "error " . curl_error($ch);
        die;
    }
    curl_close($ch);
    return $buffer;
}
function save_captcha($ch) {
    $agents = 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16';
    $url = "https://register.pandi.or.id/jcaptcha";
    static $cookie = false;
    if (!$cookie) {
        $cookie = session_name() . '=' . time();
    }
    $ch = curl_init(); // Initialize a cURL session.
    curl_setopt($ch, CURLOPT_URL, $url); // Pass URL as parameter.
    curl_setopt($ch, CURLOPT_USERAGENT, $agents);
    curl_setopt($ch, CURLOPT_COOKIESESSION, true);
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
    curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return stream contents.
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1); // We'll be returning this.
    $data = curl_exec($ch); // Grab the jpg and save the contents.
    curl_close($ch); // Close cURL resource and free up system resources.
    $captcha_tmpfile = './captcha/captcha-' . rand(1000, 10000) . '.jpg';
    $fp = fopen($tmpdir . $captcha_tmpfile, 'w');
    fwrite($fp, $data);
    fclose($fp);
    return $captcha_tmpfile;
}
if (isset($_POST['captcha'])) {
    $id = "yudohartono";
    $pw = "mypassword";
    $postfields = "navigation=authenticate&login-type=registrant&username=" . $id . "&password=" . $pw . "&captcha_response=" . $_POST['captcha'] . "press=login";
    $url = "https://register.pandi.or.id/main";
    $result = getUrl($url, 'post', $postfields);
    echo $result;
} else {
    $open = getUrl('https://register.pandi.or.id/main', '', '', true);
    $captcha = save_captcha($ch);
    $fp = fopen($tmpdir . "/cookie12.txt", 'r');
    $a = fread($fp, filesize($tmpdir . "/cookie12.txt"));
    fclose($fp);
    echo "
    <form action='' method='POST'>
        <img src='" . $captcha . "' />
        <input type='text' name='captcha' value=''>
        <input type='submit' value='proses'>
    </form>";
    if (!is_readable('cookie.txt') && !is_writable('cookie.txt')) {
        echo "cookie fail to read";
        chmod('../pandi/', '777');
    }
}
This is the cookie.txt:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
register.pandi.or.id FALSE / FALSE 0 JSESSIONID 05CA8241C5B76F70F364CA244E4D1DF4
After I submit the form, it just displays:
HTTP/1.1 200 OK Date: Wed, 27 Apr 2011 07:38:08 GMT Server: Apache-Coyote/1.1 X-Powered-By: Servlet 2.4; Tomcat-5.0.28/JBoss-4.0.0 (build: CVSTag=JBoss_4_0_0 date=200409200418) Content-Length: 0 Via: 1.1 register.pandi.or.id Content-Type: text/plain X-Pad: avoid browser bug
If not, I get the error "Captcha invalid".
The login to PANDI always fails.
What is wrong in my script?
I don't want to break the captcha; I want to display the captcha and let the user type it in on my web page, so the user can register a dotID domain from my site automatically.
A captcha is intended to differentiate between humans and robots (programs). It seems like you are trying to log in with a program, so the captcha seems to be doing its job :).
I don't see a legal way around it.
It happens because you took your captcha image from the first getUrl() call (i.e. the first curl_exec) and processed that captcha, but to submit it you requested getUrl() again (i.e. another curl_exec), which loads a new page with a new captcha.
So you are answering the old captcha against the new one. I had the same problem and resolved it this way.
A captcha is a dynamic image created by the server when you hit the page. It will keep changing; you must extract the captcha from the page, parse it, and then submit your page for the login. The captcha changes every time the page is triggered to load!
Using a headless browsing solution this is possible, e.g. zombie.js / coffee.js on Node. It may also be possible to extract the image from the captcha and, using image recognition, "read" the image and convert it to text, which is then posted with the form.
As of today, the only surefire method to "trick" a captcha is to use headless browsing.
Yes, Andro Selva is right. On the second request it serves a new captcha. It loads one captcha with the getUrl function and a second one from the save_captcha function, so these are two different images.
It must do something like this:
Download the captcha image before closing cURL and before the POST, and make the script wait until you provide the captcha answer (I would use preg_match to pull it out of the page). It will require some JavaScript as well.
If the captcha image is generated by JavaScript, you need to execute that JavaScript with the same cookie or token. In this situation, the easier solution is to record the headers with, e.g., the LiveHTTPHeaders add-on for Mozilla Firefox.
With PHP I do not know how to do it; you have to get the captcha and find a way to solve it. There are a lot of algorithms that will do it for you, but if you want to use Java, I already hacked the source code from this link to get the code that solves the captcha, and it works very well for a lot of captcha systems.
So you could try to implement your own captcha solver, which will take a lot of time, try to find an existing implementation for PHP, or, IMHO the best option, use the JDownloader code base.
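To make that flow concrete, here is a rough sketch of the whole session in one place. The username, password, cookie file and URLs are taken from the question; the helper function, the ./captcha/ directory and the &press=login separator are assumptions on my part:
// Sketch: fetch the captcha and post the login with the SAME cookie jar,
// and do not hit the page again in between (each hit generates a new captcha).
$cookieJar = './cookie.txt'; // one jar for the whole session

function fetchWithCookies($url, $cookieJar, $post = null) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);   // write cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);  // and send them back
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    if ($post !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    }
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

if (!isset($_POST['captcha'])) {
    // First page view: establish the session, grab the captcha image ONCE, show the form.
    fetchWithCookies('https://register.pandi.or.id/main', $cookieJar);
    file_put_contents('./captcha/current.jpg',
        fetchWithCookies('https://register.pandi.or.id/jcaptcha', $cookieJar));
    echo "<form action='' method='POST'>
            <img src='./captcha/current.jpg' />
            <input type='text' name='captcha' value=''>
            <input type='submit' value='proses'>
          </form>";
} else {
    // Form submission: post the answer with the cookies from the first page view,
    // WITHOUT fetching the main page or the captcha again (that would invalidate it).
    $post = 'navigation=authenticate&login-type=registrant&username=yudohartono'
          . '&password=mypassword&captcha_response=' . urlencode($_POST['captcha'])
          . '&press=login';
    echo fetchWithCookies('https://register.pandi.or.id/main', $cookieJar, $post);
}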
In PHP, how can I determine if any remote file (accessed via HTTP) exists?
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10); //follow up to 10 redirections - avoids loops
$data = curl_exec($ch);
curl_close($ch);
if (!$data) {
    echo "Domain could not be found";
}
else {
    preg_match_all("/HTTP\/1\.[1|0]\s(\d{3})/", $data, $matches);
    $code = end($matches[1]);
    if ($code == 200) {
        echo "Page Found";
    }
    elseif ($code == 404) {
        echo "Page Not Found";
    }
}
Modified version of code from here.
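As a side note, curl_getinfo() can return the final status code directly, which avoids the regex; a brief sketch:
// Sketch: let cURL report the final status code instead of regex-parsing the headers.
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // status of the last response after any redirects
curl_close($ch);
echo $code == 200 ? "Page Found" : ($code == 404 ? "Page Not Found" : "HTTP $code");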
I like curl or fsockopen to solve this problem. Either one can provide header data regarding the status of the file requested. Specifically, you would be looking for a 404 (File Not Found) response. Here is an example I've used with fsockopen:
http://www.php.net/manual/en/function.fsockopen.php#39948
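The linked comment boils down to sending a HEAD request over a raw socket and inspecting the status line; a rough sketch of that idea (host, path and timeout are placeholders):
// Sketch: check a remote file with fsockopen by reading only the status line.
function remote_file_exists($host, $path) {
    $fp = @fsockopen($host, 80, $errno, $errstr, 5); // 5-second connect timeout
    if (!$fp) {
        return false; // could not connect at all
    }
    fwrite($fp, "HEAD $path HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $status = fgets($fp, 128); // e.g. "HTTP/1.1 200 OK"
    fclose($fp);
    return (bool) preg_match('#^HTTP/\d\.\d\s+2\d\d#', $status);
}

var_dump(remote_file_exists('www.example.com', '/index.html'));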
This function will return the response code (the last one in case of redirection), or false in case of a DNS or other error. If one argument (the URL) is supplied, a HEAD request is made. If a second argument is given, a full request is made and the content, if any, of the response is stored by reference in the variable passed as the second argument.
function url_response_code($url, &$contents = null)
{
    $context = null;
    if (func_num_args() == 1) {
        $context = stream_context_create(array('http' => array('method' => 'HEAD')));
    }
    $contents = @file_get_contents($url, false, $context); // suppress the warning; we report via the return value
    $code = false;
    if (isset($http_response_header)) {
        foreach ($http_response_header as $header) {
            if (strpos($header, 'HTTP/') === 0) {
                list(, $code) = explode(' ', $header);
            }
        }
    }
    return $code;
}
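Usage might look like this (the URLs are placeholders):
// HEAD-only check:
$code = url_response_code('http://www.example.com/robots.txt');

// Full GET, with the body captured by reference:
$body = null;
$code = url_response_code('http://www.example.com/', $body);

echo $code === false ? 'DNS or connection error' : "HTTP $code, " . strlen((string) $body) . " bytes";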
I recently was looking for the same info. Found some really nice code here: http://php.assistprogramming.com/check-website-status-using-php-and-curl-library.html
function Visit($url) {
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode < 300) {
        return true;
    }
    else {
        return false;
    }
}
if (Visit("http://www.site.com")) {
    echo "Website OK";
}
else {
    echo "Website DOWN";
}
Use cURL, and check if the request went through successfully.
http://w-shadow.com/blog/2007/08/02/how-to-check-if-page-exists-with-curl/
Just a note that these solutions will not work on a site that does not give an appropriate response for a page that is not found. For example, I just had a problem testing for a page on a site that simply loads its main page whenever it gets a request it cannot handle, so it gives a 200 response for nearly every request, even for non-existent pages.
Some sites will serve a custom error on a standard page and still not send a 404 header.
There is not much you can do in these situations unless you know the expected content of the page and test that it exists, or test for some expected error text within the page, and that all gets a bit messy...
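For that messier case, one workable (if fragile) approach is to fetch the body and check for a marker string that only the real page should contain; a sketch, with the marker text as an assumption:
// Sketch: detect a "soft 404" site that returns 200 for everything by
// checking the body for content that only the real page should contain.
function pageReallyExists($url, $expectedMarker) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // The status code must be OK *and* the marker must be present in the body.
    return $code == 200 && $body !== false && strpos($body, $expectedMarker) !== false;
}

// The marker is whatever text you know appears only on the real page.
var_dump(pageReallyExists('http://www.example.com/some-page', 'Product details'));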