I am trying to scrape a list of bills from a website after logging in via curl, but on one of the pages the content is not the same as in my browser: instead of showing a list of bills it shows "Your bill history cannot be displayed". I can correctly scrape other pages that are only available after login, so I'm quite puzzled as to why that page refuses to display the bill history when I use curl.
Here is my code:
//Load login page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Submit post to login page to authenticate
$postVariables = 'emailAddress='.urlencode($username).
'&password='.urlencode($password);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postVariables);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login/POST.servlet');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/login');
$webpage = curl_exec($ch);
//Go to my account main page now that we are logged in
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/My_Account');
curl_setopt($ch, CURLOPT_REFERER, $target);
$webpage = curl_exec($ch); //shows the same content as in the browser
$accountNumber = return_between($webpage, 'id="accountID1">', '<', EXCL); //this is correctly found
//Go to bills page
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/Bill_History/?accountnumber='.$accountNumber);
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/My_Account');
$webpage = curl_exec($ch); //Not showing the same content as in the browser
The last curl_exec is the one that doesn't work properly.
I have checked extensively the logic of the page and used Tamper Data to analyse what was going on: there doesn't seem to be any javascript / ajax call that would pull the bill history separately, and no POST request: as far as I can see the bill history should be displayed at page load.
Any ideas as to what I could try to fix it or what could be the problem? The fact that it works on other pages is especially puzzling.
Thanks in advance!
EDIT: it still doesn't work but I have found another page on their site where I can get what I need and where the content is displayed correctly - so no need for a solution anymore.
You might add additional header fields that "real" browsers usually transmit:
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
Just to name a few.
If you happen to use Firefox, get the handy "Live HTTP Headers" plugin and check which headers your browser transmits when loading the relevant page. Then try to send the same ones.
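As a rough sketch of how such fields could be attached to a request with PHP's cURL extension: `build_browser_headers` and `apply_browser_headers` are hypothetical helper names, the header values are assumptions to be replaced with whatever your own browser sends, and the URL in the usage comment is a placeholder.

```php
<?php
// Hypothetical helper: returns header lines similar to what a desktop
// browser sends. The values are assumptions; copy the real ones from
// your own browser for the page in question.
function build_browser_headers(string $acceptLanguage = 'en-us,en;q=0.5'): array
{
    return [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ' . $acceptLanguage,
        'Connection: keep-alive',
    ];
}

// Attach the header lines to an existing cURL handle before curl_exec().
function apply_browser_headers($ch, array $headers): bool
{
    return curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
}

// Usage (placeholder URL):
// $ch = curl_init('https://www.domain.com/Bill_History/');
// apply_browser_headers($ch, build_browser_headers());
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $webpage = curl_exec($ch);
```

Each call to `CURLOPT_HTTPHEADER` replaces the previous list, so all custom headers should be passed in one array.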
I am building an onsite booking process and am now calling a REST API to get the booking data. The problem arises when I set the form URL to:
$url = 'https://book.api.ean.com/ean-services/rs/hotel/v3/res?
minorRev=99
&cid=55505
&sig=1893d9f7e3e9fbd3f8a36f43cd61287d
&apiKey=1bn8n4or4tjajq23fe4l6m18lp
&customerUserAgent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
&customerIpAddress=223.30.152.118
&customerSessionId=e80df6de9008af772cfb48a389465415
&locale=en_US
&currencyCode=USD
&hotelId=106347
&arrivalDate=10/30/2015
&departureDate=11/01/2015
&supplierType=E
&rateKey=469e1aff-49de-4944-a64d-25d96ccde3aa
&roomTypeCode=200127420
&rateCode=200706716
&chargeableRate=257.20
&room1=2,5,7
&room1FirstName=test
&room1LastName=testers
&room1BedTypeId=23
&room1SmokingPreference=NS
&email=test#yourSite.com
&firstName=tester
&lastName=testing
&homePhone=2145370159
&workPhone=2145370159
&creditCardType=CA
&creditCardNumber=5401999999999999
&creditCardIdentifier=123
&creditCardExpirationMonth=11
&creditCardExpirationYear=2015
&address1=travelnow
&city=Bellevue
&stateProvinceCode=WA
&countryCode=US
&postalCode=98004';
When I post the data manually I get a response, but when I use curl to post to the URL shown above I get an error.
My curl code is:
$header[] = "Accept: application/json";
$header[] = "Accept-Encoding: gzip";
$header[] = "Content-length: 0";
$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
$verbose = fopen('php://temp', 'rw+');
curl_setopt($ch, CURLOPT_STDERR, $verbose);
$result = curl_exec($ch);
After posting the data I get this response:
{"HotelRoomReservationResponse":{"EanWsError":{"itineraryId":-1,"handling":"UNRECOVERABLE","category":"EXCEPTION","exceptionConditionId":-1,"presentationMessage":"TravelNow.com cannot service this request.","verboseMessage":"Exception Caught: null"},"customerSessionId":"8ab1d482-f968-49d2-a429-a1cbab748fe5"}}
I get that error repeatedly. Please help me find out how to get the right data.
Your problem here is that you are passing the parameters in the URL; they need to be given in the request body. See the top of this page: http://developer.ean.com/docs/book-reservation/examples/rest-reservation/
I'm not sure how you do this in PHP, but on the command line you can use -d.
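In PHP the rough counterpart of `-d` is `CURLOPT_POSTFIELDS`. A minimal sketch, assuming the endpoint accepts a form-encoded body as the linked page describes; `build_post_body` and `post_form` are hypothetical helper names, and `$params` stands for the fields listed in the question.

```php
<?php
// Hypothetical helper: encode the reservation fields for the request body.
// http_build_query() URL-encodes every value (spaces, '/', ',', ...).
function build_post_body(array $params): string
{
    return http_build_query($params, '', '&');
}

// Hypothetical helper: POST the fields in the body instead of the URL.
function post_form(string $endpoint, array $params)
{
    $ch = curl_init($endpoint);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, build_post_body($params));
    // No manual Content-Length header: cURL computes it from the body.
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Accept: application/json']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}

// Usage (values are placeholders):
// $result = post_form(
//     'https://book.api.ean.com/ean-services/rs/hotel/v3/res',
//     ['minorRev' => 99, 'hotelId' => 106347, 'arrivalDate' => '10/30/2015']
// );
```

Note that the `Content-length: 0` header in the question tells the server that the request has no body, so it would have to be dropped once the parameters move into the body.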
I'm doing some scraping with PHP. I've been extracting data, including the link to the next relevant page, so the whole thing is automatic. The problem is that I seem to be getting a page which is slightly modified compared to what I would expect when using that URL in my browser (e.g. the dates are different).
I've tried using curl and file_get_contents but both get the wrong file.
At the moment I am using:
$url = "http://www.example.com";
$ch = curl_init();
$timeout = 5;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$temp = curl_exec($ch);
curl_close($ch);
What is going on here?
UPDATE:
I've tried imitating a browser using the following code but still unsuccessful. I find this bizarre.
function get_url_contents($url){
$crl = curl_init();
$timeout = 10;
$header=array(
'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-us,en;q=0.5',
'Accept-Encoding: gzip,deflate',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Keep-Alive: 115',
'Connection: keep-alive',
);
curl_setopt($crl, CURLOPT_HTTPHEADER, $header);
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt ($crl, CURLOPT_AUTOREFERER, FALSE);
curl_setopt ($crl, CURLOPT_FOLLOWLOCATION, FALSE);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
Further update:
Seems that the site is using my location to discriminate. Is there a locale option?
It can be many things:
- The server may render pages differently based on the cookies and headers sent.
- The server may render pages differently based on existing pre-conditions and state on the server.
- There may be a proxy in between that modifies the content based on the user agent; if your cURL request does not send a browser-like user agent, the proxy may be sending back different content.
These are just a few of the things that could happen!
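One way to narrow these possibilities down is to look at what the server actually returned for the cURL request and compare it with the browser's network tools. A minimal sketch; `summarize_transfer` is a hypothetical helper that formats the array returned by `curl_getinfo()`, and the URL in the usage comment is a placeholder.

```php
<?php
// Hypothetical helper: condense curl_getinfo() output into one line so the
// cURL transfer can be compared with what the browser reports.
function summarize_transfer(array $info): string
{
    return sprintf(
        'status=%d redirects=%d final_url=%s type=%s',
        $info['http_code'] ?? 0,
        $info['redirect_count'] ?? 0,
        $info['url'] ?? '',
        $info['content_type'] ?? ''
    );
}

// Usage (placeholder URL):
// $ch = curl_init('http://www.example.com');
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// curl_exec($ch);
// echo summarize_transfer(curl_getinfo($ch)), "\n";
```

A different status code, an unexpected redirect, or a different final URL than in the browser usually points at cookies or headers rather than at the page itself.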
I have a script that downloads PDF files after logging into another site. It has so far worked great for all sites but I am now getting something strange with a new site I'm scraping: some of the files downloaded are 1kb (i.e it didn't work) while others work just fine. Using the download link in the browser opens the "do you want to save this file" window and the file is correct there.
here is my code (I include both the general curl parameters used throughout the scrape, and the final part where I try downloading the files):
//Initial connection to login page
$header[] = 'Host: www.domain.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-US,en;q=0.5';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Then several operations to login, grab the list of links to PDF download files (...)
//Loop through the array containing the url of the file to download and save it to a folder (writable)
curl_setopt($ch, CURLOPT_POST, false);
foreach($foundBills as $key => $bill)
{
curl_setopt($ch, CURLOPT_URL, $bill['url']);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
$pdfFile = curl_exec($ch);
$randomFileName = rand_string(20); //generates a 20 char long random string
$newPDF = $userBillsRoot.$randomFileName.'.pdf';
write_file($newPDF, $pdfFile, 'wb'); //using a Codeigniter function to save the file
}
The files are under 1mb each. Any ideas? How can I see more details about why it's not working (e.g timeout)? Thanks!
I am a member of Lynda.com and I want to fetch an HTML page from their site and save it to my disk. The problem is that whenever I try to fetch a page via cURL, I get the non-member page (it asks me to sign up). I can't understand why I can't get the member page :(
My code:
get_remote_file_to_cache();
function get_remote_file_to_cache()
{
$the_site = "http://www.lynda.com/AIR-3-0-tutorials/Flex-4-6-and-Mobile-Apps-New-Features/90366-2.html";
$curl = curl_init();
$fp = fopen("cache/temp_file.html", "w");
curl_setopt($curl, CURLOPT_URL, $the_site);
curl_setopt($curl, CURLOPT_COOKIE, '/cookie.txt');
curl_setopt($curl, CURLOPT_FILE, $fp);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$http_headers = array(
'Host: www.lynda.com',
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) Gecko/20100101 Firefox/6.0.2',
'Accept: */*',
'Accept-Language: en-us,en;q=0.5',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Connection: keep-alive'
);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $http_headers);
curl_exec($curl);
$httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
if($httpCode == 404)
{
touch('cache/404_err.txt');
}
else
{
$contents = curl_exec($curl);
fwrite($fp, $contents);
}
curl_close($curl);
}
I am on Windows 7 and running on this on WAMP.
One of the things I am not sure about is if the "cookie.txt" file is getting read or not (not sure if the path is correct so I put the cookie.txt file in the root of the server as well as in the directory I am running this script from).
Thanks in advance!
----------- Found some code via the online manual ---------
// $url = page to POST data
// $ref_url = tell the server which page you came from (spoofing)
// $login = true will make a clean cookie-file.
// $proxy = proxy data
// $proxystatus = do you use a proxy ? true/false
function curl_grab_page($url,$ref_url,$data,$login,$proxy,$proxystatus){
if($login == 'true') {
$fp = fopen("ryanCookie.txt", "w");
fclose($fp);
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, "ryanCookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "ryanCookie.txt");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_TIMEOUT, 40);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
if ($proxystatus == 'true') {
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
}
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $ref_url);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
ob_start();
return curl_exec ($ch); // execute the curl command
ob_end_clean();
curl_close ($ch);
unset($ch);
}
echo curl_grab_page("https://www.lynda.com/login/login.aspx", "http://www.lynda.com/", "simple_username=*******&simple_password=*******", "true", "null", "false")."done!";
But it still does not work :(
This is the page where I got the above code: http://php.net/manual/en/function.curl-setopt.php
You need to understand how the internet and HTTP work. When you access a website, it usually gives you cookies to track your status, and you start out as a non-logged-in visitor. After you hit the login button, the server updates your status to logged-in and stores it, either in a server-side session or in your browser using cookies.
Back to your question: since you want to access a member page, you first need to learn how lynda.com's login works. The steps below are fairly general:
1. Load the login page and get the form information.
2. Fill in the form information with your login details and send the form back to the server.
3. Store the cookies received from the server.
4. Load the member page (don't forget to include the cookies from step 3) and fetch the HTML.
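The steps above can be sketched roughly as follows, assuming a plain HTML login form. `extract_hidden_fields` and `fetch_member_page` are hypothetical helpers; every URL, field name and file path is supplied by the caller and is an assumption, not lynda.com's actual form.

```php
<?php
// Step 1 helper (hypothetical): collect hidden <input> fields from the
// login page so they can be sent back together with the credentials.
function extract_hidden_fields(string $html): array
{
    $fields = [];
    if ($html === '') {
        return $fields;
    }
    $doc = new DOMDocument();
    // Suppress warnings caused by real-world, non-well-formed HTML.
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('input') as $input) {
        if (strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
    }
    return $fields;
}

function fetch_member_page(string $loginUrl, array $credentials, string $memberUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Step 3: read and write cookies in the same file for every request.
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);

    // Step 1: load the login page (sets the initial session cookie).
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    $loginPage = (string) curl_exec($ch);

    // Step 2: send the hidden fields plus the credentials back.
    // This assumes the form posts to the login URL itself.
    $fields = array_merge(extract_hidden_fields($loginPage), $credentials);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields, '', '&'));
    curl_exec($ch);

    // Step 4: switch back to GET and request the member page.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $memberUrl);
    return curl_exec($ch);
}
```

If the form's `action` attribute points at a different URL than the login page, the POST in step 2 would need to go there instead.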
For more information, you can look at these resources:
http://www.codingforums.com/showthread.php?t=252335
http://simpletest.sourceforge.net/en/browser_documentation.html
https://gist.github.com/3697293
Maybe you need to send an Authorization header, which contains your username and password for the site, in the HTTP headers.
To get the member page you need to log in on the website. To do that, you need to:
1. Visit the login page.
2. Make the same request as your browser would to submit the login credentials.
3. Fetch the member page.
Alternatively, you could try to extract the cookies from your browser after login and use them in curl with curl_setopt($ch, CURLOPT_COOKIE, 'a=b;c=d');, but this might not work as the website can also use an IP or session check.
Here is the login page:
http://www.ifreewind.net/iFreeWind.aspx
I need the content of this page, which requires logging in first:
http://www.ifreewind.net/Users/Search.aspx?R=1&P=00102&i=2
In case you need it, my POST data is here:
$data = "__VIEWSTATE=%2FwEPDwULLTE3NjQ3MDc3NDQPZBYCAgMPZBYCAgEPFgIeB1Zpc2libGVoZBgBBR5fX0NvbnRyb2xzUmVxdWlyZVBvc3RCYWNrS2V5X18WAQUSUmVtZW1iZXJNZUNoZWNrQm94r57YdIUtbSps%2FGLW1PUtjxcILdE%3D&__EVENTVALIDATION=%2FwEWBQLKivfjBgLw2N3fDgLC9%2FChAwLxuKbKAgL%2BjNCfDwU6DJjH4Q2acTlGVXmDrSv2Nn4G&UserNameTextBox=myemailaddress%40gmail.com&PasswordTextBox=mypassword&LoginButton=%E7%99%BB%E9%99%86";
curl_setopt($datapost, CURLOPT_POSTFIELDS, $data);
My code is here, but it does not work:
$site = "http://www.ifreewind.net/Users/Search.aspx?R=1&P=00102&i=2";
$ch = curl_init();
$headers = array('Host:www.ifreewind.net',
'User-Agent:Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1',
'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language:zh-cn,zh;q=0.5',
'Accept-Encoding:gzip, deflate',
'Accept-Charset:GB2312,utf-8;q=0.7,*;q=0.7',
'Connection:keep-alive',
'Referer:http://www.ifreewind.net/Users/Search.aspx?R=1&P=00102&i=2',
'Cookie:Hm_lvt_7fa3bcf45d96b91c6a87d1433c045849=1327324205986; VisitUrl_-1=ok; ASP.NET_SessionId=4smyrujt3m3cnu2sxbh55z55; Hm_lpvt_7fa3bcf45d96b91c6a87d1433c045849=1327324205986; MyId=7087; iTechAuthen=FDDE38649ADA11A5C73923D4D9437097226833721D48739F39720A91C49A95DCC345C8E9DE670B71D837808619CBF23213C6252AE82112A06CE37271D7D1A3466979E2B264845C8C75B7E1791DDB49C910178DA0BC6D5BD4D6AC536842279D41FA2866DA5B4F278BAB6443D2F370B96F1E5723C685AA015BE611317F40F66965DD2CF0FD5E7C1DB794D7172CC784EF1C2B773CFCAE05772EE611B6F82EF6894F8B32EA932D01F81F70F73C18F1CB8C6F3DDC5E44',
'Cache-Control:max-age=0'
);
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_TIMEOUT, 6000);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 60);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
//curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_POST, TRUE);
//curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);//add
ob_start();
return curl_exec($ch);
ob_end_clean();
curl_close($ch);
unset($ch);
My code actually produces strange results in the web browser, such as a lot of "����ܰ��". I tried switching the character encoding in Firefox, but neither UTF-8 nor the others display it properly, so please help.
You can use a header parser to get back the PHPSESSID. This makes for cleaner future curl requests. Here's an example.
This was part of some object-oriented code where I could curl into different parts of the site and send my session ID to track the session.
function login($email,$password){
$login=false;
$post='member[email]='.urlencode($email).'&member[password]='.urlencode($password);
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, 'http://signon.page/login.php');
curl_setopt($ch,CURLOPT_USERAGENT,$this->useragent);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
//POST
curl_setopt($ch,CURLOPT_POST,true);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch, CURLOPT_HEADER, 1);
$output=curl_exec($ch);
//get cookies in map array
$rows=explode("\n",$output);
foreach($rows as $num=>$row){
$trim=substr($row,0,5);
$trim2=substr($row,0,29);
if ($trim2=="Location: /public/member/home")$login=true;
/* if the site sends back a header redirect my login worked.*/
if ($trim=="Set-C") {$rownum=$num;}}
$cookies=$rows[$rownum];
$cookies=substr($cookies,12);/*RAW COOKIE*/
$cookies=explode("; ",$cookies);
$arr=array();
foreach ($cookies as $n=>$v){
$s=explode("=",$v);
$arr[$s[0]]=$s[1];}
$cookies=$arr;
$_SESSION['SN']=$cookies['PHPSESSID'];
curl_close($ch);
$_SESSION['auth']=$login;
return $login;}//end login
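The cookie-parsing part of this idea can be written more compactly. A sketch, assuming the response was fetched with `CURLOPT_HEADER` enabled so that the raw headers precede the body; `parse_set_cookies` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: collect name => value pairs from every Set-Cookie
// header line in a raw HTTP response (headers followed by body).
function parse_set_cookies(string $rawResponse): array
{
    $cookies = [];
    // Match each "Set-Cookie: name=value" line; attributes after the
    // first ';' (path, expires, ...) are ignored.
    $pattern = '/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi';
    if (preg_match_all($pattern, $rawResponse, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $cookies[$m[1]] = $m[2];
        }
    }
    return $cookies;
}

// Usage, continuing from a response fetched with CURLOPT_HEADER set:
// $cookies = parse_set_cookies($output);
// curl_setopt($ch, CURLOPT_COOKIE, 'PHPSESSID=' . $cookies['PHPSESSID']);
```

Unlike the loop above, this picks up every Set-Cookie line rather than only the last one. It scans the whole string, so trimming the response to its header section first (for example with `CURLINFO_HEADER_SIZE`) avoids matching text in the body.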