simulate browser with php curl not working in ebay - php

I am trying to read the following eBay page into a PHP variable for processing:
http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidsLogin&_trksid=p2047675.l2564&rt=nc&item=321069150620
It displays fine in any modern browser without any need to log in.
When I try to read the page into a PHP variable with the following code:
$url="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidsLogin&_trksid=p2047675.l2564&rt=nc&item=321069150620";
$header = array();
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Pragma: ';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$string = curl_exec($ch);
curl_close ($ch);
echo $string;
I am getting the following page, http://www.talumets.com/tmp/error.jpg, which asks me to enter numbers from images to continue. My code sometimes works, but about 95% of the time it asks me to enter the numbers. I have also tried $string = file_get_contents($url), with the same result. Any idea how to bypass this?
Thanks,
Tom

What you are seeing is eBay's CAPTCHA protection against scripts such as yours. I don't think there is a good way to bypass it.
You could try to limit your request rate and hope that you do not trigger the CAPTCHA.
The ideal solution (if you don't want to use the API) would be to use multiple servers, each making only a few requests per second.
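The rate-limiting idea above can be sketched roughly as follows. This is only an illustration, not something eBay documents: the helper names throttle_wait_seconds() and fetch_throttled() and the 2-second default interval are hypothetical, and no particular interval is known to avoid the CAPTCHA.

```php
<?php
// Hypothetical helper: how long to wait before the next request so that
// consecutive requests start at least $minInterval seconds apart.
function throttle_wait_seconds(float $lastRequestAt, float $now, float $minInterval): float
{
    $elapsed = $now - $lastRequestAt;
    return max(0.0, $minInterval - $elapsed);
}

// Fetch a list of URLs one by one, pausing between requests.
// $fetch is any callable that takes a URL and returns the body
// (for example a wrapper around the cURL code from the question).
function fetch_throttled(array $urls, callable $fetch, float $minInterval = 2.0): array
{
    $results = [];
    $last = null;
    foreach ($urls as $url) {
        if ($last !== null) {
            $wait = throttle_wait_seconds($last, microtime(true), $minInterval);
            if ($wait > 0) {
                usleep((int) round($wait * 1000000));
            }
        }
        $last = microtime(true);
        $results[$url] = $fetch($url);
    }
    return $results;
}
```

Passing the fetch function in as a callable keeps the pacing logic separate from the HTTP code, so the same loop works with cURL or file_get_contents.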

Related

cURL and file_get_contents blocked

I know this question has been dealt with on a few occasions but none of the fixes seem to work with my particular problem.
I am trying to grab any page from http://www.lewmar.com, but somehow they are managing to block all attempts. My latest script is as follows:
function curl_get_contents($url)
{
$ch = curl_init();
$browser_id = "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0";
$ip = $_SERVER["SERVER_ADDR"];
curl_setopt($ch, CURLOPT_USERAGENT, $browser_id);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $ip);
$headers = array();
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Connection: keep-alive';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$headers[] = 'Accept-Language: en-US,en;q=0.5';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = 'http://www.lewmar.com';
$contents = curl_get_contents($url);
echo strlen($contents);
I have tried to replicate most of the headers, and the site doesn't seem to check for JavaScript support, yet I still can't get anything returned.
Does anyone have any idea how they might be recognizing cURL and blocking it?
Cheers
When you first visit that site, it checks whether you have a cookie. If you don't, it sends you one together with a redirect to the same page. Your code has nothing to store cookies, so you end up going round in circles, and cURL gives up after 20 redirects. Solution: enable cookies!
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
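Put together, the cookie fix might look something like the sketch below. The helper names cookie_curl_options() and fetch_with_cookies() are hypothetical, as is the redirect cap of 10. The one deliberate choice is that the same file is used for reading and writing cookies, so cookies saved by one run are available to the next.

```php
<?php
// Hypothetical helper: cURL options that keep cookies across the
// redirect the site sends on the first visit.
function cookie_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,          // assumed cap; stop early if the loop persists
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from this file
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle is released
    ];
}

// Fetch a URL with cookies enabled; returns the body, or false on failure.
function fetch_with_cookies(string $url, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, cookie_curl_options($url, $cookieFile));
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
    }
    curl_close($ch);
    return $body;
}

// Usage (performs a network request):
// echo strlen((string) fetch_with_cookies('http://www.lewmar.com', 'cookies.txt'));
```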

why does "PHP Simple HTML DOM Parser" sometimes fail to parse an HTML body?

The library I'm using is: PHP Simple HTML DOM Parser v1.5.0
With some URLs, its str_get_html() and file_get_html() calls return false. For example:
$html = HtmlDomParser::file_get_html('http://finance.yahoo.com/');
How can I fix this?
Simple HTML DOM Parser can handle invalid HTML, so malformed markup is probably not the cause.
The problem is more likely in loading the Yahoo page itself: there are all kinds of redirects along the way, which cURL can cope with.
So a fast, dirty and probably far from ideal workaround is to load the page contents with cURL and then parse the resulting string with Simple HTML DOM.
Something like this:
<?php
include('simple_html_dom.php');
$search_url = 'http://finance.yahoo.com/';
function bCurl($url) {
$cookie_file = "cookie1.txt";
$header = array();
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Pragma: ';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$result = curl_exec($ch);
curl_close ($ch);
return $result;
}
$result = bCurl($search_url);
$html = new simple_html_dom();
$html->load($result);
$keywords = $html->find("meta[name=keywords]",0)->getAttribute('content');
print_r($keywords);
?>
The proper solution would probably be to analyze the Yahoo Finance page and check what is going on there: redirects, anti-scraping mechanisms, JavaScript and Flash objects... but hey, we all like instant, fast and dirty solutions like this one, right? ;)
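One caveat with the workaround: find(..., 0) returns null when nothing matches, so calling getAttribute() on the result fails if the page has no keywords meta tag or if the download returned nothing. Below is a minimal sketch of the same extraction with those cases guarded. Note that it swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, and that extract_meta_keywords() is a hypothetical name.

```php
<?php
// Hypothetical helper: return the content of <meta name="keywords">,
// or null when the input is empty or the tag is missing.
function extract_meta_keywords($html): ?string
{
    if (!is_string($html) || $html === '') {
        return null; // e.g. curl_exec() returned false
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//meta[@name="keywords"]/@content');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

// Usage with the bCurl() function from the answer above:
// $keywords = extract_meta_keywords(bCurl('http://finance.yahoo.com/'));
// var_dump($keywords);
```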

cURL and Facebook problems

I am using cURL to access a Facebook page. Locally it works perfectly, but when I upload it to my dev server, it breaks and returns an empty string. I've checked and cURL is installed on the server. Here's the code I use to access Facebook:
$header = array();
$header[] = 'Accept: text/json';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Pragma: ';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://facebook.com/feeds/page.php?format=json&id=135137236003');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$result = curl_exec($ch);
curl_close ($ch);
Any help is appreciated!
Change the Accept header to */* or application/json, since Facebook sends the response as application/json.
Also change this URL
http://facebook.com/feeds/page.php?format=json&id=135137236003
to
http://www.facebook.com/feeds/page.php?format=json&id=135137236003
because Facebook redirects non-www requests to www. The non-www URL still works for you because you set CURLOPT_FOLLOWLOCATION, but using the www URL saves one round trip.
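If the result is still an empty string on the dev server, it may help to log what cURL actually saw there before changing anything else. The sketch below is one way to do that; describe_curl_result() is a hypothetical helper, not part of the cURL extension, and the message formats are arbitrary.

```php
<?php
// Hypothetical helper: turn the outcome of a cURL call into one log line.
function describe_curl_result($body, int $errno, string $error, int $httpCode, string $finalUrl): string
{
    if ($body === false) {
        return "cURL failed ($errno): $error";
    }
    if ($body === '') {
        return "Empty body, HTTP $httpCode from $finalUrl";
    }
    return "OK, HTTP $httpCode, " . strlen($body) . " bytes from $finalUrl";
}

// Usage with a real handle (performs a network request):
// $ch = curl_init('http://www.facebook.com/feeds/page.php?format=json&id=135137236003');
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// $body = curl_exec($ch);
// echo describe_curl_result(
//     $body,
//     curl_errno($ch),
//     curl_error($ch),
//     (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
//     (string) curl_getinfo($ch, CURLINFO_EFFECTIVE_URL)
// );
// curl_close($ch);
```

Comparing this line between the local machine and the dev server should show whether the request fails outright, is redirected somewhere unexpected, or returns a non-200 status.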

Some way to access Facebook page data with PHP or JS

I have a Facebook page feed that I want to access:
http://facebook.com/feeds/page.php?format=json&id=123456789 (Not a real ID)
When I put the URL in the browser, it works just fine, but when I try to access it using file_get_contents, Facebook sends me to a page that says I am using an unsupported browser. This data is public though so I shouldn't need an access token to obtain it. Is there an extra step I need to take in order to access this data? I also tried using cURL with no success.
Any help is appreciated. Thanks.
You can use curl and mimic a browser, see this thread on how to do it.
$header = array();
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Pragma: ';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'YOUR URL HERE');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$result = curl_exec($ch);
curl_close ($ch);
echo $result;
Don't use this URL; it is a really old way of accessing the feed. Use their relatively new Graph API instead.

Getting error code 28 with cURL

THIS HAS BEEN SOLVED - SEE ANSWER AT THE END OF THIS POST
I am trying to retrieve data from a remote server using PHP / cURL
If I put the following URL into a browser the data comes back correctly.
http://realm103.c7.castle.wonderhill.com/api/map.json?user%5Fid=5245274&x=375&y=375&timestamp=1310554325&%5Fsession%5Fid=5b2070a46a083a33e053d60dbc2d062e&dragon%5Fheart=098d2deb0a37f18c97428d636c456572f9bade24&version=3
However, when I try to access it with PHP / cURL, it just times out (error code 28).
$json = curl($jsonurl, $realm['intRealmID'], $realm['strRealmServer']);
function curl($url, $realm, $realmServer){
$header = array();
$header[] = 'Host: realm'.strval($realm).'.'.$realmServer.'.castle.wonderhill.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Accept-Encoding: gzip,deflate';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$result = curl_exec($ch);
curl_close($ch); // close the handle before returning; it was unreachable after the return
return $result;
}
Anybody have any ideas why it works from the browser but not via cURL? Thanks
ADDITIONAL INFO
Whilst cURL isn't working for the URL above, it works just fine for the URL below. The only difference is the server the data is being requested from. The data itself and the POST are identical.
http://realm4.c5.castle.wonderhill.com/api/map.json?user%5Fid=1053774&x=375&y=375&timestamp=1310616808&%5Fsession%5Fid=5b2070a46a083a33e053d60dbc2d062e&dragon%5Fheart=f35f476facab91f0e901eaf2209a0c8a9b9bedcc&version=3
ANSWER
I finally got back to this and found that the referrer was the problem. The server expected to see no referrer in the request headers; when it saw one, the request was blocked. That behaviour was probably not consistent across all servers at the time, but it is now. Removing the referrer from the request headers and leaving everything else the same now works.
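For reference, one way to make sure cURL sends no Referer is sketched below. It relies on the libcurl convention that a header with nothing after the colon is omitted from the request (the 'Pragma: ' lines in other snippets on this page use the same trick). headers_without_referer() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: request headers that make sure no Referer is sent.
function headers_without_referer(array $headers): array
{
    // Drop any Referer header the caller supplied (case-insensitive match).
    $kept = array_values(array_filter($headers, function ($h) {
        return stripos($h, 'Referer:') !== 0;
    }));
    // An empty value after the colon tells libcurl to omit that header.
    $kept[] = 'Referer:';
    return $kept;
}

// Usage inside the curl() function from the question:
// curl_setopt($ch, CURLOPT_AUTOREFERER, false);
// curl_setopt($ch, CURLOPT_HTTPHEADER, headers_without_referer($header));
```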
The biggest difference between your cURL function and requesting the URL directly is the custom header list set with CURLOPT_HTTPHEADER. I would first try removing that from the code.
try this
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('your url');
Alternatively, you can use file_get_contents() on a remote URL, but many hosts don't allow this (it requires allow_url_fopen to be enabled).
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
Some other options I use:
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
try this:
$ctx = stream_context_create( array(
'socket' => array(
'bindto' => '192.168.0.107:0',
)
));
$c= file_get_contents('http://php.net', 0, $ctx);
