I'm doing some scraping with PHP. I've been extracting data, including the link to the next relevant page, so the whole thing is automatic. The problem is that I seem to be getting a page which is slightly modified compared to what I get using the same URL in my browser (e.g. the dates are different).
I've tried both curl and file_get_contents, but both fetch the wrong page.
At the moment I am using:
$url = "http://www.example.com";
$ch = curl_init();
$timeout = 5;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
url_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$temp = curl_exec($ch);
curl_close($ch);
What is going on here?
UPDATE:
I've tried imitating a browser with the following code, but it's still unsuccessful. I find this bizarre.
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 10;
    $header = array(
        'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-us,en;q=0.5',
        'Accept-Encoding: gzip,deflate',
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Keep-Alive: 115',
        'Connection: keep-alive',
    );
    curl_setopt($crl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($crl, CURLOPT_ENCODING, '');        // decompress the gzip/deflate bodies requested above
    curl_setopt($crl, CURLOPT_AUTOREFERER, FALSE);
    curl_setopt($crl, CURLOPT_FOLLOWLOCATION, FALSE);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
Further update:
It seems the site is using my location to serve different content. Is there a locale option?
It could be many things:
The server may render pages differently based on the cookies and headers sent.
The server may render pages differently based on existing pre-conditions and state on the server side.
A proxy in between may modify the content based on the user agent, and since cURL identifies itself with its own default user agent, the proxy may be sending back different content.
These are just a few of the things that could happen! One concrete thing to try is sketched below.
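cURL has no dedicated locale option, but you can ask for one in the request headers. Below is a minimal sketch, assuming the site keys its content on cookies and Accept-Language (the URL and locale are placeholders):
// Sketch: request a page with an explicit locale preference and cookie handling.
$ch = curl_init('http://www.example.com'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept-Language: en-GB,en;q=0.5', // placeholder locale
));
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');  // write cookies here when the handle closes
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // read them back on subsequent requests
$html = curl_exec($ch);
curl_close($ch);
If the site discriminates by IP address rather than by headers, no header will change what it serves; the request would then have to come from an IP in the desired region.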
Related
I have a PHP script that periodically performs an HTTP request to a remote API server.
I use cURL to perform this task. My code is like the following example:
$url='http://apiserver.com/services/someservice.php?apikey='.$my_key;
$ch=curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Accept-Encoding: gzip, deflate',
'Accept: */*',
'Connection: keep-alive',
'User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:48.0) Gecko/20100101 Firefox/48.0'
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
$data=curl_exec($ch);
curl_close($ch);
$json_data = json_decode($data,TRUE);
Recently I verified that it no longer fetches new data; instead it is receiving a cached version. I tried adding the following cURL flags to the code:
curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);
curl_setopt($ch, CURLOPT_FORBID_REUSE, 1);
This did not solve the problem; I still receive the same cached response.
If I append an extra parameter to the API URL, like "&time=".time(), that fixes the problem, but I don't want to add extra parameters to the URL.
What can I do to solve the problem?
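Note that CURLOPT_FRESH_CONNECT and CURLOPT_FORBID_REUSE only control TCP connection reuse, not HTTP caching, so they cannot bypass a caching proxy. A minimal sketch of an alternative, assuming the cache honours standard revalidation headers, is to extend the header array already passed to CURLOPT_HTTPHEADER:
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept: */*',
    'Cache-Control: no-cache', // ask HTTP/1.1 caches to revalidate with the origin
    'Pragma: no-cache',        // same request for older HTTP/1.0 caches
));
If the caching happens on the API server itself, only the server operator, or a cache-busting parameter like the one already found, can get around it.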
I know this question has been dealt with on a few occasions, but none of the fixes seem to work for my particular problem.
I am trying to grab any page from http://www.lewmar.com but somehow they are managing to block all attempts. My latest script is as follows:
function curl_get_contents($url)
{
$ch = curl_init();
$browser_id = "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0";
$ip = $_SERVER["SERVER_ADDR"];
curl_setopt($ch, CURLOPT_USERAGENT, $browser_id);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $ip);
$headers = array();
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Connection: keep-alive';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$headers[] = 'Accept-Language: en-US,en;q=0.5';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = 'http://www.lewmar.com';
$contents = curl_get_contents($url);
echo strlen($contents);
I have tried to replicate most of the headers, and the site doesn't seem to check for JavaScript compatibility, yet I still can't get anything returned.
Does anyone have any idea how they might be recognizing cURL and blocking it?
Cheers
When you first visit that site it checks to see if you have a cookie. If you don't, it will send you one and send a redirect (to the same page). You haven't got anything in your code to store cookies so you end up going round in a circle. Curl gives up after 20 redirects. Solution: enable cookies!
curl_setopt($ch, CURLOPT_COOKIESESSION, true);        // start a fresh cookie session
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');   // file cookies are written to when the handle closes
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies2.txt'); // file cookies are read from when the request starts
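Even with two different file names as above, cookies received during a single request chain are kept in memory for the lifetime of the handle, which is what breaks the redirect loop. Pointing CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE at the same file additionally persists the session across separate runs of the script.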
I have a script that downloads PDF files after logging into another site. It has so far worked great for all sites, but I am now getting something strange with a new site I'm scraping: some of the downloaded files are 1kb (i.e. the download didn't work) while others work just fine. Using the download link in the browser opens the "do you want to save this file" window, and the file is correct there.
Here is my code (I include both the general cURL parameters used throughout the scrape and the final part where I try downloading the files):
//Initial connection to login page
$header[] = 'Host: www.domain.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-US,en;q=0.5';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Then several operations to login, grab the list of links to PDF download files (...)
//Loop through the array containing the url of the file to download and save it to a folder (writable)
curl_setopt($ch, CURLOPT_POST, false);
foreach($foundBills as $key => $bill)
{
curl_setopt($ch, CURLOPT_URL, $bill['url']);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1); // no effect as of PHP 5.1.3; kept for older versions
$pdfFile = curl_exec($ch);
$randomFileName = rand_string(20); //generates a 20 char long random string
$newPDF = $userBillsRoot.$randomFileName.'.pdf';
write_file($newPDF, $pdfFile, 'wb'); //using a Codeigniter function to save the file
}
The files are under 1MB each. Any ideas? How can I see more details about why it's not working (e.g. a timeout)? Thanks!
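To get more detail on why a given download failed, one option, sketched below with the same variable names as the loop above, is to check curl_error() and curl_getinfo() after each curl_exec; a 1kb "PDF" is often an HTML error page, which the reported content type and status code would reveal:
$pdfFile = curl_exec($ch);
if ($pdfFile === false) {
    echo 'cURL error: '.curl_error($ch)."\n"; // e.g. a timeout or SSL failure
} else {
    $info = curl_getinfo($ch);
    // text/html here would mean an error page rather than a PDF
    echo 'HTTP '.$info['http_code'].' '.$info['content_type'].' ('.$info['size_download']." bytes)\n";
}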
I am trying to scrape a list of bills from a website after logging into it via cURL, but on one of the pages the content is not the same as in my browser (namely, instead of showing a list of bills it shows "Your bill history cannot be displayed"). I can correctly scrape other pages that are only available after login, so I'm quite puzzled as to why that page refuses to display the bill history when I use cURL.
Here is my code:
//Load login page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Submit post to login page to authentify
$postVariables = 'emailAddress='.urlencode($username).
'&password='.urlencode($password); // encode values so special characters survive the POST body
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postVariables);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login/POST.servlet');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/login');
$webpage = curl_exec($ch);
//Go to my account main page now that we are logged in
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/My_Account');
curl_setopt($ch, CURLOPT_REFERER, $target); // $target is set earlier in the full script
$webpage = curl_exec($ch); //shows the same content as in the browser
$accountNumber = return_between($webpage, 'id="accountID1">', '<', EXCL); //this is correctly found
//Go to bills page
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/Bill_History/?accountnumber='.$accountNumber);
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/My_Account');
$webpage = curl_exec($ch); //Not showing the same content as in the browser
The last curl_exec is the one that doesn't work properly.
I have checked the logic of the page extensively and used Tamper Data to analyse what was going on: there doesn't seem to be any JavaScript / AJAX call that would pull the bill history separately, and no POST request; as far as I can see, the bill history should be displayed at page load.
Any ideas as to what I could try or what the problem could be? The fact that it works on other pages is especially puzzling.
Thanks in advance!
EDIT: It still doesn't work, but I have found another page on their site where I can get what I need and where the content is displayed correctly, so no need for a solution anymore.
You might add additional header fields that "real" browsers usually transmit:
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
Just to name a few. A sketch of wiring these into the request follows below.
If you happen to use Firefox, get the handy "Live HTTP Headers" plugin and check what headers your browser transmits when loading the relevant page. Then try to do the same.
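For completeness, a minimal sketch of attaching the list above to a handle, where $ch stands in for whichever cURL handle the code already uses:
curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // send the browser-like headers with the request
$response = curl_exec($ch);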
THIS HAS BEEN SOLVED - SEE ANSWER AT THE END OF THIS POST
I am trying to retrieve data from a remote server using PHP / cURL
If I put the following URL into a browser the data comes back correctly.
http://realm103.c7.castle.wonderhill.com/api/map.json?user%5Fid=5245274&x=375&y=375&timestamp=1310554325&%5Fsession%5Fid=5b2070a46a083a33e053d60dbc2d062e&dragon%5Fheart=098d2deb0a37f18c97428d636c456572f9bade24&version=3
However, when I try to access it with PHP / cURL, it just times out (error code 28).
$json = curl($jsonurl, $realm['intRealmID'], $realm['strRealmServer']);
function curl($url, $realm, $realmServer){
$header = array();
$header[] = 'Host: realm'.strval($realm).'.'.$realmServer.'.castle.wonderhill.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Accept-Encoding: gzip,deflate';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$result = curl_exec($ch);
curl_close($ch); // close the handle before returning
return $result;
}
Anybody have any ideas why it works from the browser but not via cURL? Thanks
ADDITIONAL INFO
Whilst cURL isn't working for the URL above, it works just fine for the URL below. The only difference is the server the data is being requested from; the request itself is identical.
http://realm4.c5.castle.wonderhill.com/api/map.json?user%5Fid=1053774&x=375&y=375&timestamp=1310616808&%5Fsession%5Fid=5b2070a46a083a33e053d60dbc2d062e&dragon%5Fheart=f35f476facab91f0e901eaf2209a0c8a9b9bedcc&version=3
ANSWER
Finally got back to this and found that the referrer was the problem. The server was expecting to see no referrer in the request headers; when one was present, the request was blocked. That behaviour probably was not consistent across all servers at the time, but it is now. Removing the referrer from the request and leaving everything else the same now works.
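In the function above the referrer came from CURLOPT_AUTOREFERER, which makes cURL add a Referer header automatically when it follows redirects. Assuming nothing else sets one, a minimal sketch of the fix is simply:
// Don't let cURL add a Referer header when following redirects,
// and don't set CURLOPT_REFERER explicitly anywhere.
curl_setopt($ch, CURLOPT_AUTOREFERER, false);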
The biggest difference between your cURL function and requesting the information directly is the CURLOPT_HEADER option; I would first try removing it from the code.
try this
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('your url');
Alternatively, you can use the file_get_contents function on remote URLs, but many hosts don't allow this (it requires allow_url_fopen to be enabled). A short sketch follows.
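As a minimal sketch, assuming allow_url_fopen is enabled, a stream context can supply browser-like headers to file_get_contents as well (the URL is a placeholder):
$ctx = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0\r\n",
    ),
));
$content = file_get_contents('http://www.example.com', false, $ctx);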
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
Some other options I use:
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
try this:
// Bind the outgoing connection to a specific local IP address (port 0 = any port).
$ctx = stream_context_create(array(
    'socket' => array(
        'bindto' => '192.168.0.107:0',
    ),
));
$c = file_get_contents('http://php.net', false, $ctx);
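This binds the outgoing connection to one of the machine's local IP addresses, which can help when the host has several and the site serves different content per source address. The cURL counterpart is CURLOPT_INTERFACE:
// Sketch: cURL equivalent of the "bindto" socket option above;
// the address must be one of the machine's own IPs.
curl_setopt($ch, CURLOPT_INTERFACE, '192.168.0.107');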