cURL and file_get_contents blocked - php

I know this question has been dealt with on a few occasions, but none of the fixes seem to work for my particular problem.
I am trying to grab any page from http://www.lewmar.com, but somehow they are managing to block all attempts. My latest script is as follows:
function curl_get_contents($url)
{
$ch = curl_init();
$browser_id = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0"; // CURLOPT_USERAGENT adds the "User-Agent:" header name itself
$ip = $_SERVER["SERVER_ADDR"];
curl_setopt($ch, CURLOPT_USERAGENT, $browser_id);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $ip);
$headers = array();
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Connection: keep-alive';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$headers[] = 'Accept-Language: en-US,en;q=0.5';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = 'http://www.lewmar.com';
$contents = curl_get_contents($url);
echo strlen($contents);
I have tried to replicate most of the headers, and the site doesn't seem to check for JavaScript support, yet I still can't get anything returned.
Does anyone have any idea how they might be recognizing cURL and blocking it?
Cheers

When you first visit that site, it checks whether you have a cookie. If you don't, it sends you one along with a redirect (to the same page). Your code has nothing to store cookies, so you end up going round in a circle. cURL gives up after 20 redirects. Solution: enable cookies!
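curl_setopt($ch, CURLOPT_COOKIESESSION, true); // start a fresh cookie session
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt'); // cookies are written here when the handle is closed
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // cookies are read from here; also turns on the cookie engine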
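For reference, here is a minimal sketch of how the cookie options might fit into a complete fetch function. The names cookie_options and fetch_with_cookies are made up for illustration, and using one file for both the jar and the cookie file is my assumption, not something the site requires.

```php
<?php
// Hypothetical helper: cURL options that enable the cookie engine and
// follow the site's "set cookie, then redirect" handshake.
function cookie_options(string $cookieFile): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 20,          // stop after 20 redirects
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies, enables the cookie engine
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies when the handle is closed
    ];
}

// Hypothetical helper: fetch a page with cookies enabled, or throw on failure.
function fetch_with_cookies(string $url, string $cookieFile): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, cookie_options($cookieFile));
    $body = curl_exec($ch);
    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL failed: $error");
    }
    curl_close($ch);
    return $body;
}
```

Something like echo strlen(fetch_with_cookies('http://www.lewmar.com', 'cookies.txt')); would then replace the last lines of the script in the question.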

Related

CURL returns the page with errors (JS, CSS, etc errors)

I have seen some similar cases on Stack Overflow, but it seems I have errors in my PHP code and cannot display the page correctly. The page I am trying to get is a resource from Pentaho BI (version 7.1.0.0.12). I have tried many things, but nothing works.
First, I authenticate using cookie-based authentication (the method provided by Pentaho). Information: https://help.pentaho.com/Documentation/7.1/0R0/070/010/00A
To get the cookie, I perform an HTTP POST with PHP cURL. That works well; I am able to get the cookie from Pentaho.
Please check the code below:
$post_data['username'] = "suzy";
$post_data['password'] = "password";
foreach ($post_data as $key => $value) {
$post_items[] = $key . '=' . $value;
}
$post_string = implode('&', $post_items);
$curl_connection = curl_init('http://10.10.10.215:8080/pentaho/j_spring_security_check');
curl_setopt($curl_connection, CURLOPT_CONNECTTIMEOUT, 3000 * 10);
curl_setopt($curl_connection, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($curl_connection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_connection, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl_connection, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl_connection, CURLOPT_POSTFIELDS, 'j_username=suzy&j_password=password');
curl_setopt($curl_connection, CURLOPT_HTTPHEADER, array('Content-Type: application/x-www-form-urlencoded'));
$result = curl_exec($curl_connection);
$sessionID = explode("=", $url);
$cookie = $sessionID[1];
So the variable $cookie contains the session ID that I should use to access the resource from Pentaho.
Then I perform an HTTP GET with PHP cURL to get the page (resource) from Pentaho.
This is the part that doesn't work. What I can see is that PHP is "connected" to Pentaho through the URL and the cookie I previously requested, and the call returns the whole page, but when I display the page in the browser it throws a lot of errors, as I said before (JS, CSS errors and more).
Please check the code below:
$url = "http://10.10.10.215:8080/pentaho/api/repos/%3Apublic%3ASteel%20Wheels%3ADashboards%3ACTools_dashboard.wcdf/generatedContent";
$ch = curl_init();
$headers[] = 'Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap';
$headers[] = 'Connection: Keep-Alive';
$headers[] = 'Cache-Control: no-cache';
$headers[] = 'Pragma: no-cache';
$headers[] = 'Transfer-Encoding: chunked';
$headers[] = 'Accept-Language: nl-NL,nl;q=0.8,en_US;q=0.6,en;q=0.4';
$headers[] = 'Accept-Encoding: gzip, deflate';
$headers[] = 'Accept: text/plain,* /*;q=0.01';
$headers[] = 'Content-Type: text/html; charset=utf-8';
$headers[] = 'Cookie: JSESSIONID='.$sessionID[1];
$user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_POSTFIELDS, '');
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($ch);
echo $response;die();
I'd like to clarify: I am able to get the page, but with all of the errors I mentioned.
I have also tried to load this content into an IFRAME but couldn't. Is there any way to do it?
Any information you can add is welcome, as is any code that I can check.

cURL, script returning 503 Error (service unavailable)

I am trying to log in on the page 'http://portal.demo.ascio.com/Logon.aspx' with cURL, but I am getting this error: "503 Service Unavailable
No server is available to handle this request."
I am sending the right POST data, which I captured with Firebug while logging in through the normal page.
I have been searching for a while and figured that maybe I am not sending the right headers.
The code is:
$url = 'http://portal.demo.ascio.com/Logon.aspx';
$login_string = 'THERE ARE MY POSTS WHICH CONTAINS PASSWORD...thats is definitelly right';
$headers = array();
$headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$headers[] = "Connection: Keep-Alive";
$headers[] = "Host: portal.demo.ascio.com";
$headers[] = "Referer: http://portal.demo.ascio.com/Logon.aspx";
$headers[] = "Accept-Encoding: gzip, deflate";
$headers[] = "Accept-Language: en-US,en;q=0.5";
$headers[] = "Content-type: application/x-www-form-urlencoded";
//open connection
$ch = curl_init();
//set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, false);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 5);
curl_setopt($ch, CURLOPT_POSTFIELDS, $login_string);
$result = curl_exec($ch);
print $result;
curl_close($ch);
Thanks for the help.

In general, a 5xx status such as 503 means your request reached the server side and the server could not handle it. Something in your input may have caused the server to "break", possibly by design of the server itself. Double-check what you are sending and compare it against the service's API or specification.
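One way to double-check what is actually being sent is to ask cURL for the request headers it used and compare them with what Firebug shows. The sketch below assumes nothing about the Ascio portal; status_class and post_and_inspect are hypothetical helper names.

```php
<?php
// Hypothetical helper: name the class of an HTTP status code.
function status_class(int $code): string
{
    if ($code >= 500 && $code <= 599) return 'server error';
    if ($code >= 400 && $code <= 499) return 'client error';
    if ($code >= 300 && $code <= 399) return 'redirect';
    if ($code >= 200 && $code <= 299) return 'success';
    return 'other'; // includes 0, which cURL reports when no response arrived
}

// Hypothetical helper: POST a body and return the status code together
// with the exact request headers cURL sent.
function post_and_inspect(string $url, string $postBody): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $postBody,
        CURLOPT_RETURNTRANSFER => true,
        CURLINFO_HEADER_OUT    => true, // record the outgoing request headers
    ]);
    $body = curl_exec($ch);
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);

    return [
        'status'          => $code,
        'class'           => status_class($code),
        'request_headers' => $sent,
        'body'            => $body,
    ];
}
```

Printing the 'request_headers' entry next to the request recorded in the browser should show any header or body difference.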

simulate browser with php curl not working in ebay

I am trying to read the following eBay web page into a PHP variable for processing:
http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidsLogin&_trksid=p2047675.l2564&rt=nc&item=321069150620
It shows fine in any modern browser without the need to log in.
When I try to read the page into a PHP variable with the following code:
$url="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidsLogin&_trksid=p2047675.l2564&rt=nc&item=321069150620";
$header = array();
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Pragma: ';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$string = curl_exec($ch);
curl_close ($ch);
echo $string;
I get the following page, http://www.talumets.com/tmp/error.jpg, which asks me to enter numbers from photos to continue. Sometimes my code works, but 95% of the time it asks me to enter the numbers. I have also tried $string = file_get_contents($url), but I get the same problem. Any idea how to bypass this?
Thanks,
Tom

What you are seeing is eBay's CAPTCHA protection against scripts such as yours. I don't think there is a good way to bypass it.
You could try to limit your request rate and hope you do not trigger the CAPTCHA.
The ideal solution (if you don't want to use the API) would be to use multiple servers, each making only a few requests per second.
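Limiting the request rate could look roughly like the sketch below. The two-second interval is an arbitrary assumption (eBay does not publish the threshold that triggers the CAPTCHA), and throttle_wait_us and fetch_throttled are hypothetical names.

```php
<?php
// Hypothetical helper: microseconds to wait so that two consecutive
// requests are at least $minInterval seconds apart.
function throttle_wait_us(float $lastRequestAt, float $now, float $minInterval): int
{
    $elapsed = $now - $lastRequestAt;
    if ($elapsed >= $minInterval) {
        return 0;
    }
    return (int) round(($minInterval - $elapsed) * 1000000);
}

// Hypothetical helper: fetch several URLs one by one, pausing between them.
function fetch_throttled(array $urls, float $minInterval = 2.0): array
{
    $results = [];
    $last = 0.0; // no request made yet, so the first one is not delayed
    foreach ($urls as $url) {
        $wait = throttle_wait_us($last, microtime(true), $minInterval);
        if ($wait > 0) {
            usleep($wait);
        }
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $results[$url] = curl_exec($ch); // false on failure
        curl_close($ch);
        $last = microtime(true);
    }
    return $results;
}
```

This only spaces the requests out; it does not guarantee that the CAPTCHA will not appear.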

file_get_contents()/curl getting unexpected page

I'm doing some scraping with PHP. I've been extracting data, including the link to the next relevant page, so the whole thing is automatic. The problem is that I seem to be getting a page that is slightly modified compared to what I would expect when using that URL in my browser (e.g. the dates are different).
I've tried using cURL and file_get_contents, but both get the wrong file.
At the moment I am using:
$url = "http://www.example.com";
$ch = curl_init();
$timeout = 5;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$temp = curl_exec($ch);
curl_close($ch);
What is going on here?
UPDATE:
I've tried imitating a browser using the following code but still unsuccessful. I find this bizarre.
function get_url_contents($url){
$crl = curl_init();
$timeout = 10;
$header=array(
'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-us,en;q=0.5',
'Accept-Encoding: gzip,deflate',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Keep-Alive: 115',
'Connection: keep-alive',
);
curl_setopt($crl, CURLOPT_HTTPHEADER, $header);
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt ($crl, CURLOPT_AUTOREFERER, FALSE);
curl_setopt ($crl, CURLOPT_FOLLOWLOCATION, FALSE);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
Further update:
Seems that the site is using my location to discriminate. Is there a locale option?

It can be many things:
The server may render pages differently based on the cookies and headers sent.
The server may render pages differently based on existing pre-conditions and state on the server.
There may be a proxy in between that modifies the content based on the User-Agent; if your request does not send a browser-like User-Agent, the proxy may send back different content.
These are just a few of the things that could happen!
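To narrow it down, it may help to look at the response headers the script receives (Set-Cookie, Vary, Content-Language and so on) and compare them with what the browser gets. Below is a rough sketch; parse_header_block and fetch_with_headers are hypothetical names, and if the site picks content by IP-based location, no request header will change the result.

```php
<?php
// Hypothetical helper: turn a raw header block into
// ['lower-case-name' => [value, ...]]; status lines are skipped.
function parse_header_block(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r?\n/', trim($raw)) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $headers[strtolower(trim($parts[0]))][] = trim($parts[1]);
        }
    }
    return $headers;
}

// Hypothetical helper: fetch a URL and return parsed response headers and body.
function fetch_with_headers(string $url, array $requestHeaders = []): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true, // keep response headers in the output
        CURLOPT_HTTPHEADER     => $requestHeaders,
    ]);
    $response = curl_exec($ch);
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    if ($response === false) {
        return ['headers' => [], 'body' => false];
    }
    return [
        'headers' => parse_header_block(substr($response, 0, $headerSize)),
        'body'    => substr($response, $headerSize),
    ];
}
```

Calling fetch_with_headers($url, ['Accept-Language: en-us,en;q=0.5']) with different language values is one quick way to see whether the page varies by that header.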

Getting error code 28 with cURL

THIS HAS BEEN SOLVED - SEE ANSWER AT THE END OF THIS POST
I am trying to retrieve data from a remote server using PHP / cURL
If I put the following URL into a browser the data comes back correctly.
http://realm103.c7.castle.wonderhill.com/api/map.json?user%5Fid=5245274&x=375&y=375&timestamp=1310554325&%5Fsession%5Fid=5b2070a46a083a33e053d60dbc2d062e&dragon%5Fheart=098d2deb0a37f18c97428d636c456572f9bade24&version=3
However, when I try to access it with PHP / cURL, it just times out (error code 28).
$json = curl($jsonurl, $realm['intRealmID'], $realm['strRealmServer']);
function curl($url, $realm, $realmServer){
$header = array();
$header[] = 'Host: realm'.strval($realm).'.'.$realmServer.'.castle.wonderhill.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Accept-Encoding: gzip,deflate';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$result = curl_exec($ch);
curl_close($ch); // release the handle before returning
return $result;
}
Anybody have any ideas why it works from the browser but not via cURL? Thanks
ADDITIONAL INFO
Whilst cURL isn't working for the URL above, it works just fine for the URL below. The only difference is the server the data is being requested from. The data itself and the POST are identical.
http://realm4.c5.castle.wonderhill.com/api/map.json?user%5Fid=1053774&x=375&y=375&timestamp=1310616808&%5Fsession%5Fid=5b2070a46a083a33e053d60dbc2d062e&dragon%5Fheart=f35f476facab91f0e901eaf2209a0c8a9b9bedcc&version=3
ANSWER
I finally got back to this and found that the referrer was the problem. The server was expecting to see no referrer in the request header. When it did see one, the request was blocked. That behaviour probably was not consistent across all servers at the time, but it is now. Removing the referrer from the request header and leaving everything else the same now works.
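For anyone hitting the same thing, a request that sends no referrer could be sketched like this. The helper names without_referer and fetch_without_referer are made up for illustration; 28 is cURL's error number for a timed-out operation.

```php
<?php
// Hypothetical helper: drop any "Referer:" line from a header list.
function without_referer(array $headers): array
{
    return array_values(array_filter($headers, function (string $h): bool {
        return stripos($h, 'Referer:') !== 0;
    }));
}

// Hypothetical helper: fetch a URL with no Referer header at all.
function fetch_without_referer(string $url, array $headers = []): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => without_referer($headers),
        CURLOPT_AUTOREFERER    => false, // do not add a Referer on redirects
        CURLOPT_ENCODING       => '',    // accept any encoding cURL supports
        CURLOPT_TIMEOUT        => 20,
    ]);
    $body = curl_exec($ch);
    $errno = curl_errno($ch);
    $error = curl_error($ch);
    curl_close($ch);

    if ($errno === 28) { // operation timed out
        throw new RuntimeException("Request timed out: $error");
    }
    if ($body === false) {
        throw new RuntimeException("cURL failed ($errno): $error");
    }
    return $body;
}
```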

The biggest difference between your cURL function and requesting the information directly is the CURLOPT_HTTPHEADER option; I would first try removing it from the code.

Try this:
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('your url');
Alternatively, you can use the file_get_contents function to fetch remote URLs, but many hosts don't allow this.
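Whether a host allows remote file_get_contents depends on the allow_url_fopen setting, so a script can check it before choosing a method. This is only a sketch, and remote_fopen_enabled and get_data_any are hypothetical names.

```php
<?php
// Hypothetical helper: true when PHP may open remote URLs with
// file functions (the allow_url_fopen ini setting).
function remote_fopen_enabled(): bool
{
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: use file_get_contents when allowed, cURL otherwise.
// Returns the body as a string, or false on failure.
function get_data_any(string $url)
{
    if (remote_fopen_enabled()) {
        return file_get_contents($url);
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
```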
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
Some other options I use:
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
try this:
$ctx = stream_context_create( array(
'socket' => array(
'bindto' => '192.168.0.107:0',
)
));
$c= file_get_contents('http://php.net', 0, $ctx);
