jibberish returned when using CURLOPT_URL - php

Im trying to grab a pages data using CURLOPT_URL, to do so ive used the below code, which is working fine for other pages (with the exception of where the page uses relative paths to its css / js in which case those dont load) .
function grab_page($site){
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_TIMEOUT, 40000000);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_URL, $site);
ob_start();
return curl_exec ($ch);
ob_end_clean();
curl_close ($ch);
}
echo grab_page("$page_to_get");
But when i load the page i get returned a screen of jibberish like this, but a whole page, same when i view the source.
Looking at the source of the page, through my browser, they seem to be using charset=utf-8", im not sure if that has anything to do with it though ? Any ideas ?

Calling:
curl_setopt($ch,CURLOPT_ENCODING , "gzip");
will fix it if the encoding is know to be gzipped or as you stated
curl_setopt($ch,CURLOPT_ENCODING , "");
should tickle curl into negotiating the encoding itself (why this is not the default is beyond me)

Related

php curl not retrieving as expected

I have the following code to capture the html code of a given url:
$url = "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=77212&cvm=true";
$ch = curl_init();
curl_setopt($ch, CURLOPT_CAINFO, '/etc/ssl/certs/cacert.pem');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
echo "$url\n\n";
die($html);
For some reason the result of the following url is not as expected:
"https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=77212&cvm=true"
Instead of the code, the result is a giant meaningless string.
I've have successfully used the same code with other pages of the same domain.
I can assure that the desired page's content is not loaded by any js/ajax method (i did the test loading the page when disabling javascript).
My question is:
There is any cUrl option that i should set to correct this error?
My whole site depends on capturing this pages.
Any help would be truly appreciated.
That is base64 encoded, all you need to do is decode it back to plain text like this
echo base64_decode($html);
and you will see HTML

Page have reloaded all time when I try to get her via cURL

So, I need to parse some content from http://israelbar.org.il, and using for this cURL, but when I run script - tab in browser all time reload, and nothing showed.
$browser = curl_init();
curl_setopt($browser, CURLOPT_URL, $url);
curl_setopt($browser, CURLOPT_REFERER, $referer);
curl_setopt($browser, CURLOPT_USERAGENT, $agent);
curl_setopt($browser, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($browser, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($browser, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($browser, CURLOPT_CONNECTTIMEOUT, 10); //times out after 11s
curl_setopt($browser, CURLOPT_TIMEOUT, 50); //times out after 51s
curl_setopt($browser, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($browser, CURLOPT_COOKIEFILE, $cookie_file_path);
$retVal = curl_exec ($browser);
curl_close ($browser);
unset($browser);
return $retVal;
Also, I try NodeJS, and get some, unknown for me listing of JS code in console.
I think, that main problem - it's different headers, and I must send the same headers via cURL, like a browser.
Have you tried capturing the working browser request with a tool like Fiddler? (http://www.telerik.com/fiddler). Also, can you clarify the language and set up you are using (looks like PHP?) And which 'tab' is reloading (is it your own site's?)
Other things to try:
- Call a site you know will work first -- to ensure your code is correct
- Adjust the timeout values to much larger values when testing

How can I properly follow all redirects on sites I am trying to scrape with cURL in PHP?

I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all redirects that the website sends back: it is putting several of the redirect URLs as relative to my server and my PHP script's path, instead of the website's server and the path that the website's pages should be relative to. Is there any way to set the base path or server path in cURL, so my script can properly follow the relative redirects it comes across when scraping through the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx"
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx"
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read ASP viewstate and eventvalidation fields
$viewstate = parseExtract($data,$regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into redirects that are JavaScript redirects.
To find out what is there:
This will give you additional info.
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off: CURLOPT_FOLLOWLOCATION
And do each request individually. You can get the redirect location from curl_getinfo("redirect_url")
Or you can set CURLOPT_MAXREDIRS to the number of successful redirects, then do a separate curl request for the problem redirect location
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
When you get the response, if no curl error, get the resposne header
$data = curl_exec($ch);
if (curl_errno($ch)){
$data .= 'Retreive Base Page Error: ' . curl_error($ch);
echo $data;
}
else {
$skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
$responseHeader = substr($data,0,$skip);
$data= substr($data,$skip);
$info = curl_getinfo($ch);
$info = var_export($info,true);
}
echo $responseHeader . $info . $data;
A better way to web scraping a webpage is to use 2 PHP Packages = Guzzle + DomCrawler.
I made a lot of tests with this combination and i came to the conclusion that this is the best choice.
Here, you will find an example for your implementation.
Let me know if you have any problem! ;)

cURL to get response from Google Safe Browsing Lookup

I'm trying to get response using the Google Safe Browsing Lookup API like this:
$url ="https://sb-ssl.google.com/safebrowsing/api/lookup?client=myappname&apikey=mykey&appver=1.0&pver=3.0&url=".urlencode($myurl);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
$body = curl_exec($ch);
$info = curl_getinfo($ch);
I know that my URL is correct since if I dump it and then paste it to the browser I get the expected result (for example 'malware').
So I am assuming it must be something with cURL. I'm working on localhost and extension=php_curl.dll is un-commented in my php.ini, Php version 5.4.4
The html_code is always 0
Simply set CURLOPT_SSL_VERIFYPEER and CURLOPT_SSL_VERIFYHOST to false.
Anything wrong with my cURL code (http status of 0)?
Also check http://code.google.com/p/twitter-api/issues/detail?id=1291 , it might help. It is different APIs, but with the same problem anyway.
The problem and its solution is very nicely described here:
http://richardwarrender.com/2007/05/the-secret-to-curl-in-php-on-windows/
Basically if you are not using a standalone version of cURL the chances are that the cURL functions do not include a certificate bundle which was needed in my case since I was trying to connect to secure host.
$ch = curl_init();
// Apply various settings
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_CAINFO, "C:/xampp/ca-bundle.crt"); //path to the CA-bundle
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
curl_close($ch);

curl problem, can't download full web page

With this code I'm trying to download this web page: http://www.kayak.com/s/...
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,'http://www.kayak.com/s/search/air?ai=kayaksample&do=y&ft=ow&ns=n&cb=e&pa=1&l1=ZAG&t1=a&df=dmy&d1=4/10/2010&depart_flex=exact&r1=y&l2=LON&t2=a&d2=11/10/2010&return_flex&r2=y');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_REFERER,"http://wwww.google.com");
$content = curl_exec ($ch);
echo $content;
You can see the demo at: http://www.pointout.org/test.php
As you can see the part with prices is missing.
What could be wrong?
This is not going to work the way you think it will. The reason is the prices are not in the initial HTML response that you get. Rather, there is some Javascript magic occurring which is using AJAX to load the prices when the page is loaded.

Categories