Issue downloading some files using PHP curl

I have a script that downloads PDF files after logging into another site. It has worked great for every site so far, but I'm now seeing something strange with a new site I'm scraping: some of the downloaded files are 1 KB (i.e. the download didn't work) while others come through just fine. Using the download link in the browser opens the "do you want to save this file" dialog and the file is correct there.
Here is my code (I include both the general curl parameters used throughout the scrape and the final part where I try downloading the files):
//Initial connection to login page
$header[] = 'Host: www.domain.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-US,en;q=0.5';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Then several operations to login, grab the list of links to PDF download files (...)
//Loop through the array containing the url of the file to download and save it to a folder (writable)
curl_setopt($ch, CURLOPT_POST, false);
foreach ($foundBills as $key => $bill)
{
    curl_setopt($ch, CURLOPT_URL, $bill['url']);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $pdfFile = curl_exec($ch);
    $randomFileName = rand_string(20); //generates a 20 char long random string
    $newPDF = $userBillsRoot.$randomFileName.'.pdf';
    write_file($newPDF, $pdfFile, 'wb'); //using a CodeIgniter helper to save the file
}
The files are under 1 MB each. Any ideas? How can I see more details about why it's not working (e.g. a timeout)? Thanks!
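One general way to get more detail on each transfer (a sketch, not specific to this site, assuming CodeIgniter's log_message() helper since write_file() is already used above) is to check curl_error() and curl_getinfo() right after each curl_exec():
foreach ($foundBills as $key => $bill)
{
    curl_setopt($ch, CURLOPT_URL, $bill['url']);
    $pdfFile = curl_exec($ch);

    if ($pdfFile === false) {
        // transport-level failure: timeout, DNS, SSL, ...
        log_message('error', 'cURL error: '.curl_error($ch));
        continue;
    }

    $httpCode    = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    if ($httpCode != 200 || stripos((string) $contentType, 'pdf') === false) {
        // probably an HTML error or login page came back instead of the PDF
        log_message('error', "Unexpected response: HTTP $httpCode, type $contentType, ".strlen($pdfFile).' bytes');
        continue;
    }

    write_file($userBillsRoot.rand_string(20).'.pdf', $pdfFile, 'wb');
}
A 1 KB result is very often a small HTML error or redirect page, which the Content-Type check above would reveal.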

Related

(PHP) Getting Blank Page with Curl (Mytischtennis)

I'm new to PHP and trying to program a few things.
As a first step I want to show the following page, "https://www.mytischtennis.de/public/home", on my website. I am using cURL to grab the page, but every time I try to output it I get a blank page.
My code looks like this:
<?php
function grab_page($site){
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 40);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_URL, $site);
$page = curl_exec($ch);
curl_close($ch); // statements after a return never run, so close the handle before returning
return $page;
}
echo grab_page("https://www.mytischtennis.de/public/home");
echo "hallo";
?>
With other pages this code works; only for "https://www.mytischtennis.de/public/home" it won't work for me.
Can someone help me understand why I get a blank page only with this site?
Thank you :)
You need to set two more options in your curl request:
// Add some more headers here if you need to
$headers = [
"Accept-Encoding: gzip, deflate, br"
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
// Decode the response automatically when needed
curl_setopt($ch, CURLOPT_ENCODING, '');
After this you will get the page you want.
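For what it's worth, setting CURLOPT_ENCODING to an empty string also makes libcurl send its own Accept-Encoding header listing every encoding it supports, so a minimal standalone check could look like this (a sketch against the same URL):
$ch = curl_init('https://www.mytischtennis.de/public/home');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
curl_setopt($ch, CURLOPT_ENCODING, ''); // accept and transparently decode gzip/deflate responses
$html = curl_exec($ch);
curl_close($ch);
echo strlen((string) $html); // a non-zero length indicates the compressed response is now decoded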

Get Facebook cookies while logging in using cURL and PHP

I want to get the Facebook cookies and then save them into a text file. I can log in successfully, but I can't save the cookie info to my directory in txt format.
$cookie = 'cookie.txt';
function Get_Login($em,$pa,$cookie){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://m.facebook.com/login.php');
curl_setopt($ch, CURLOPT_POSTFIELDS,'email='.urlencode($em).'&pass='.urlencode($pa).'&login=Login');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch,CURLOPT_HTTPHEADER,$ip);
curl_setopt($ch, CURLOPT_COOKIEJAR,$cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE,$cookie);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (BlackBerry; U; BlackBerry 9900; en) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.346 Mobile Safari/534.11+');
curl_setopt($ch, CURLOPT_REFERER,"https://www.facebook.com");
$body = curl_exec($ch) or die(curl_error($ch));
}
Using this code I can log in successfully. I want to save the cookie file into my directory.
<?php
$body = curl_exec($ch) or die(curl_error($ch));
$cookie = 'cookie.txt';
// Open the file to get existing content
$current = file_get_contents($cookie);
// Append the response body to the file
$current .= $body;
// Write the contents back to the file
file_put_contents($cookie, $current);
?>
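As an alternative sketch: the Get_Login() function in the question already sets CURLOPT_COOKIEJAR, and libcurl writes that jar to disk when the handle is closed, so closing the handle and reading the file back may be all that is needed (the path below is just an example):
// Use an absolute path so there is no doubt about where the jar is written (example location)
$cookie = __DIR__.'/cookie.txt';

$ch = curl_init('https://m.facebook.com/login.php');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'email='.urlencode($em).'&pass='.urlencode($pa).'&login=Login');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
$body = curl_exec($ch);
curl_close($ch); // the cookie jar is only written to disk when the handle is closed

echo file_get_contents($cookie); // the saved cookies, in libcurl's Netscape cookie-file format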

I cannot get the cookie written with curl and PHP

I'm fairly new to curl, and I've searched the internet for a solution but cannot find one. I'm trying to fill a remote form using curl and send the data by POST. The problem is that the external website has some security measures. One of those is that I need to load the form first to get a generated value and keep the cookie. The external page's code reads:
document.getElementById('sell_session').value = readCookie('classified_session');
My code is this:
$cookie_file = "/home/reelonhe/public_html//temp/cookie.txt";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,'http://www.olx.com.ar/posting.php?categ_id=857');
curl_setopt($ch , CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2');
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_SSLVERSION,3);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$result1 = curl_exec($ch);
$error = curl_error($ch);
$contents = curl_exec($ch);
$httpcode = curl_getinfo($ch,CURLINFO_HTTP_CODE);
curl_close($ch);
echo $error;
I tried the cookie with an absolute path, with a relative path, etc. The folder has read and write permissions, and still nothing. I don't know what else to do.
In your cookie file path
$cookie_file = "/home/reelonhe/public_html//temp/cookie.txt";
There seems to be an extra '/' before the temp folder.
It should be
$cookie_file = "/home/reelonhe/public_html/temp/cookie.txt";
I am not sure it will solve your problem.
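If fixing the path does not help, two other things are worth checking: the directory must be writable by the web server user, and the jar is only flushed to disk when curl_close() is called. A small sketch using the path from the question:
$cookie_file = "/home/reelonhe/public_html/temp/cookie.txt";

// Verify PHP can actually create the file before handing the path to curl
if (!is_writable(dirname($cookie_file))) {
    die('Cookie directory is not writable by the web server user');
}

$ch = curl_init('http://www.olx.com.ar/posting.php?categ_id=857');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_exec($ch);
curl_close($ch); // the jar is written here, not during curl_exec()

var_dump(file_exists($cookie_file) ? filesize($cookie_file) : 'cookie file was not created');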

PHP cURL shows a different page than the browser

I am trying to scrape a list of bills from a website after logging into it via curl, but on one of the pages the content is not the same as in my browser (namely, instead of showing a list of bills it shows "Your bill history cannot be displayed"). I can correctly scrape other pages that are only available after login, so I'm quite puzzled as to why that page refuses to display the bill history when I use curl.
Here is my code:
//Load login page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Submit post to login page to authentify
$postVariables = 'emailAddress='.$username.
'&password='.$password;
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postVariables);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login/POST.servlet');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/login');
$webpage = curl_exec($ch);
//Go to my account main page now that we are logged in
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/My_Account');
curl_setopt($ch, CURLOPT_REFERER, $target);
$webpage = curl_exec($ch); //shows the same content as in the browser
$accountNumber = return_between($webpage, 'id="accountID1">', '<', EXCL); //this is correctly found
//Go to bills page
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/Bill_History/?accountnumber='.$accountNumber);
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/My_Account');
$webpage = curl_exec($ch); //Not showing the same content as in the browser
The last curl_exec is the one that doesn't work properly.
I have checked the page's logic extensively and used Tamper Data to analyse what is going on: there doesn't seem to be any JavaScript/AJAX call that would pull the bill history separately, and no POST request; as far as I can see the bill history should be displayed at page load.
Any ideas as to what I could try to fix it or what could be the problem? The fact that it works on other pages is especially puzzling.
Thanks in advance!
EDIT: it still doesn't work but I have found another page on their site where I can get what I need and where the content is displayed correctly - so no need for a solution anymore.
You might add additional header fields that "real" browsers usually transmit:
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
Just to name a few.
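They would then be applied to the existing handle before repeating the problematic request, e.g. (a short sketch reusing the $header array above and the variables from the question):
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/Bill_History/?accountnumber='.$accountNumber);
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/My_Account');
$webpage = curl_exec($ch);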
If you happen to use Firefox, get that handy "Live HTTP Headers" plugin and check what headers your browser transmits when loading the relevant page, then try to do the same.

CURL page to Lynda.com

I am a member of Lynda.com. I want to fetch an HTML page from their site and save it to my disk. The problem is that whenever I try to fetch a page via cURL, I get the non-member page (it asks me to sign up). I can't understand why I can't get the members' page :(
My code:
get_remote_file_to_cache();
function get_remote_file_to_cache()
{
$the_site = "http://www.lynda.com/AIR-3-0-tutorials/Flex-4-6-and-Mobile-Apps-New-Features/90366-2.html";
$curl = curl_init();
$fp = fopen("cache/temp_file.html", "w");
curl_setopt($curl, CURLOPT_URL, $the_site);
curl_setopt($curl, CURLOPT_COOKIE, '/cookie.txt');
curl_setopt($curl, CURLOPT_FILE, $fp);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$http_headers = array(
'Host: www.lynda.com',
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) Gecko/20100101 Firefox/6.0.2',
'Accept: */*',
'Accept-Language: en-us,en;q=0.5',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Connection: keep-alive'
);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $http_headers);
curl_exec($curl);
$httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
if($httpCode == 404)
{
touch('cache/404_err.txt');
}
else
{
$contents = curl_exec($curl);
fwrite($fp, $contents);
}
curl_close($curl);
}
I am on Windows 7 and running this on WAMP.
One of the things I am not sure about is whether the "cookie.txt" file is getting read or not (I'm not sure the path is correct, so I put the cookie.txt file in the root of the server as well as in the directory I am running this script from).
Thanks in advance!
----------- Found some code via the online manual ---------
// $url = page to POST data
// $ref_url = tell the server which page you came from (spoofing)
// $login = true will make a clean cookie-file.
// $proxy = proxy data
// $proxystatus = do you use a proxy ? true/false
function curl_grab_page($url, $ref_url, $data, $login, $proxy, $proxystatus){
if($login == 'true') {
$fp = fopen("ryanCookie.txt", "w");
fclose($fp);
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, "ryanCookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "ryanCookie.txt");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_TIMEOUT, 40);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
if ($proxystatus == 'true') {
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
}
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $ref_url);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
$result = curl_exec($ch); // execute the curl command
curl_close($ch); // statements after a return never run, so close the handle before returning
return $result;
}
echo curl_grab_page("https://www.lynda.com/login/login.aspx", "http://www.lynda.com/", "simple_username=*******&simple_password=*******", "true", "null", "false")."done!";
But it still does not work :(
This is the page where I got the above code: http://php.net/manual/en/function.curl-setopt.php
You need to understand how the internet and HTTP work. When you access a website, it usually gives you cookies to track your status. You will also start as a non-logged-in member. After you hit the login button, the server updates your status to logged-in and stores this status, either in a server-side session or in your browser using cookies.
Back to your question: since you want to access a member page, you first need to learn how lynda.com works. My steps below are rather general (a rough sketch in code follows the list):
1. Load the login page and get the form information
2. Inject your login info into the form data and send the form back to the server
3. Store the cookies received from the server
4. Load the member page (don't forget to include the cookie information from step 3) and fetch the HTML
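A rough sketch of those steps with cURL, using a cookie jar and hypothetical form field names (the real field names, and any hidden tokens, have to be read from lynda.com's actual login form):
$cookieJar = __DIR__.'/lynda_cookies.txt';

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);  // step 3: received cookies end up here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar); // ...and are sent back on later requests

// Step 1: load the login page (also where any hidden form fields would be scraped from)
curl_setopt($ch, CURLOPT_URL, 'https://www.lynda.com/login/login.aspx');
$loginPage = curl_exec($ch);

// Step 2: post the credentials; the field names below are placeholders, not Lynda's real ones
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'username' => 'your_username', // hypothetical field name
    'password' => 'your_password', // hypothetical field name
)));
$afterLogin = curl_exec($ch);

// Step 4: fetch the member page with the same handle so the stored cookies are reused
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_URL, 'http://www.lynda.com/AIR-3-0-tutorials/Flex-4-6-and-Mobile-Apps-New-Features/90366-2.html');
$memberPage = curl_exec($ch);
curl_close($ch);

file_put_contents('cache/temp_file.html', $memberPage);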
For more information, you can look at these resources:
http://www.codingforums.com/showthread.php?t=252335
http://simpletest.sourceforge.net/en/browser_documentation.html
https://gist.github.com/3697293
Maybe you need to send an Authorization header, which contains your username and password for the site, as part of the HTTP headers.
To get the member page you need to log in on the website. To do that, you need to:
visit login page
make the same request as your browser would do to submit login credentials
fetch the member page
Alternatively, you could try to extract the cookies from your browser after logging in and use them in curl with curl_setopt($ch, CURLOPT_COOKIE, 'a=b;c=d');, but this might not work, as the website may also use IP or session checks.
