PHP cURL not retrieving as expected

I have the following code to capture the HTML of a given URL:
<?php
$url = "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=77212&cvm=true";
$ch = curl_init();
curl_setopt($ch, CURLOPT_CAINFO, '/etc/ssl/certs/cacert.pem');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0); // note: this disables the CA bundle check set above
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);
echo "$url\n\n";
die($html);
For some reason the result for the following URL is not what I expected:
"https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=77212&cvm=true"
Instead of the HTML source, the result is a giant, meaningless-looking string.
I have successfully used the same code with other pages on the same domain.
I can confirm that the page's content is not loaded by any JS/AJAX method (I tested by loading the page with JavaScript disabled).
My question is:
Is there any cURL option I should set to correct this?
My whole site depends on capturing these pages.
Any help would be truly appreciated.

That response is Base64 encoded; all you need to do is decode it back to plain text, like this:
echo base64_decode($html);
and you will see the HTML.
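If you are not sure whether a given response body is Base64, a defensive sketch could look like the following. It assumes the endpoint returns either raw HTML or one Base64 blob (the strict-mode round-trip check guards against decoding ordinary HTML by accident):

```php
<?php
// Decode a response body that may be Base64-encoded HTML.
// Assumption: the body is either plain HTML or a single Base64 blob.
function maybeDecodeBase64(string $body): string
{
    $trimmed = trim($body);
    // Strict mode returns false if any character is outside the Base64 alphabet.
    $decoded = base64_decode($trimmed, true);
    if ($decoded !== false && base64_encode($decoded) === $trimmed) {
        return $decoded; // it was Base64: return the decoded HTML
    }
    return $body; // already plain text/HTML
}

echo maybeDecodeBase64(base64_encode('<html><body>ok</body></html>'));
```

Note the caveat: a short plain-text body that happens to be valid Base64 would be decoded too, so this heuristic is only a convenience, not a guarantee.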

Related

cURL does not give full HTML contents

I'm trying to parse a shopping-site page using cURL in PHP.
The URL is: http://computers.pricegrabber.com/printers/HP-Officejet-Pro-8600-Plus-All-One-Wireless-Inkjet-Printer/m916995235.html/zip_code=97045/sort_type=bottomline
Here's the code I use:
function getWebsiteCURL($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
return $output;
}
echo getWebsiteCURL("http://computers.pricegrabber.com/printers/HP-Officejet-Pro-8600-Plus-All-One-Wireless-Inkjet-Printer/m916995235.html/zip_code=97045/sort_type=bottomline");
It works, but I can't get the full HTML code.
Does anybody have any idea why?
TIA.
This is often caused by a connection timeout.
Try setting the option:
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl cannot interpret JavaScript; you can see what curl sees by disabling JavaScript in a browser and navigating to the page. If you need JavaScript interpreted, I would use a headless browser such as PhantomJS. In PHP you could use PHP PhantomJS.
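As a sketch, the timeout is set like any other cURL option; a separate connect timeout is often useful too. The URL and the values here are arbitrary placeholders:

```php
<?php
// Placeholder URL; substitute the page you are actually scraping.
$ch = curl_init('http://example.com/');
$ok = curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 120, // fail if the whole transfer exceeds 120 s
    CURLOPT_CONNECTTIMEOUT => 10,  // give up sooner if connecting alone takes over 10 s
]);
// $output = curl_exec($ch); // then run the transfer as usual
curl_close($ch);
```

CURLOPT_TIMEOUT caps the entire transfer, while CURLOPT_CONNECTTIMEOUT only caps the connection phase, so setting both distinguishes "slow server" from "unreachable server".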

How do I use cURL & PHP to spoof the referrer?

I'm trying to learn cURL with PHP in order to spoof the referrer sent to a website.
With the following script I expected to accomplish this, but it doesn't seem to work.
Any ideas or suggestions about where I'm going wrong?
Or do you know of any tutorials that could help me figure this out?
Thanks!
Jessica
<?php
$host = "http://mysite.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $host);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, false);
curl_setopt($ch, CURLOPT_REFERER, "http://google.com");
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$result = curl_exec($ch);
curl_close($ch);
?>
You won't be able to see the result in the web server's analytics, because analytics are most likely collected by JavaScript, and cURL won't run/execute JavaScript. All cURL does is fetch the content of the page as if it were a text file; it won't run any of the scripts.
To be clearer: if the HTML contains a tag like
<img src="path/to/image/image.jpg" />
cURL will treat it as a line of text; it won't load image.jpg from the server. The same goes for JavaScript: if there is a
<script type="text/javascript" src="analytics.js"></script>
a browser would normally load and run analytics.js, but cURL won't.
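If you actually need those referenced assets, you have to find their URLs in the returned HTML and fetch each one with its own request. A rough sketch using PHP's built-in DOM extension (the hard-coded HTML string stands in for a real cURL response):

```php
<?php
// Collect the src attributes of <img> and <script> tags from raw HTML,
// so each asset can then be fetched with its own cURL request.
function collectAssetUrls(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from real-world markup
    $urls = [];
    foreach (['img', 'script'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            if ($node->hasAttribute('src')) {
                $urls[] = $node->getAttribute('src');
            }
        }
    }
    return $urls;
}

print_r(collectAssetUrls(
    '<html><body><img src="a.jpg"/><script src="analytics.js"></script></body></html>'
));
```

Relative URLs returned this way still need to be resolved against the page's base URL before fetching.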

curl problem, can't download full web page

With this code I'm trying to download this web page: http://www.kayak.com/s/...
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,'http://www.kayak.com/s/search/air?ai=kayaksample&do=y&ft=ow&ns=n&cb=e&pa=1&l1=ZAG&t1=a&df=dmy&d1=4/10/2010&depart_flex=exact&r1=y&l2=LON&t2=a&d2=11/10/2010&return_flex&r2=y');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_REFERER,"http://www.google.com");
$content = curl_exec ($ch);
echo $content;
You can see the demo at: http://www.pointout.org/test.php
As you can see the part with prices is missing.
What could be wrong?
This is not going to work the way you think it will. The reason is that the prices are not in the initial HTML response you get; instead, some JavaScript runs after the page loads and uses AJAX to fetch the prices.
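One common workaround, where the site's terms allow it, is to open the browser's network tab, find the XHR request that delivers the prices, and call that endpoint directly with cURL. The endpoint below is purely a placeholder, not Kayak's real API:

```php
<?php
// Hypothetical AJAX endpoint; the real one must be discovered in the
// browser's network tab while the page loads (this URL is a placeholder).
$ch = curl_init('http://www.example.com/api/prices');
$ok = curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => [
        'X-Requested-With: XMLHttpRequest', // many backends use this header to detect AJAX
        'Accept: application/json',
    ],
]);
// $json = curl_exec($ch);           // then fetch the payload...
// $data = json_decode($json, true); // ...AJAX endpoints usually return JSON
curl_close($ch);
```

Such endpoints are often easier to parse than the HTML page, but they can change without notice and may require cookies or tokens from the initial page load.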

How to get cURL's response in an array

I have a problem regarding cURL.
I am trying to send a cURL request to http://whatismyipaddress.com
so...
Is there any way to get the cURL response as an array?
Right now it displays the HTML page, but I want the response as an array.
Here is my code:
$ipAdd = '121.101.152.170';
$ch = curl_init("http://whatismyipaddress.com/ip/".$ipAdd);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_POST, 1); // note: this page is normally fetched with GET
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
So currently I am getting the full HTML page, but I want the output as an array or XML.
The "easiest" path is going to be to find the surrounding text and extract based on that.
If you're willing to step up your dedication to this, you can use something like http://simplehtmldom.sourceforge.net/ and go from there.
Edit: actually, you can use the DOM extension, which is built into PHP 5: http://php.net/manual/en/book.dom.php
More specifically, this: http://www.php.net/manual/en/domdocument.loadhtml.php
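As a minimal sketch of that DOM approach, with a hard-coded HTML string standing in for the cURL response and a generic tag-based extraction (the real page's markup may of course differ):

```php
<?php
// Turn selected pieces of an HTML page into a plain PHP array.
function htmlToArray(string $html, string $tagName): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // silence warnings caused by sloppy real-world markup
    $out = [];
    foreach ($doc->getElementsByTagName($tagName) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Stand-in for $output = curl_exec($ch);
$output = '<html><body><table><tr><td>IP</td><td>121.101.152.170</td></tr></table></body></html>';
print_r(htmlToArray($output, 'td'));
```

From there you can json_encode() the array, or build XML with DOMDocument, depending on the output format you need.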

Curl grab HTML content

I have a webpage which requires a login.
I am using cURL to build the HTTP authentication request. It works, but I am not able to grab all the content from my links; I am missing all the images.
How can I grab the images as well?
<?php
// create cURL resource
$URL = "http://10.123.22.38/nagios/nagvis/nagvis/index.php?map=Nagvis_CC";
// init cURL
$ch = curl_init();
// set the destination URL and HTTP authentication options
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC); // basic HTTP authentication
curl_setopt($ch, CURLOPT_USERPWD, "guest:test"); // pass the user name and password
// grab the URL and pass it to the browser
$content = curl_exec($ch);
$result = curl_getinfo($ch);
// close cURL resource, and free up system resources
curl_close($ch);
echo $content;
print_r($result); // curl_getinfo() returns an array; echo would only print "Array"
?>
I'm getting this warning message: Warning: curl_error(): 2 is not a valid cURL handle resource in C:\xampp\htdocs\LiveServices\LoginTest.php on line 24
cURL doesn't get images or any other 'content'; it just gets the raw HTML page. Are you saying you are missing <img /> tags that are present on the original page?
cURL also doesn't parse any CSS or JavaScript, so if the content is modified by those, it won't come through. For example, you may be unable to get an element's background-image unless you do more scraping, that is, get the associated CSS file and parse that too.
The main issue I have is that I cannot see the HTML, so I cannot be sure what the problem is. Having said that, two things occur to me.
The first thing to check is whether the image URLs are relative. If they appear in the form ../xyz/foo.jpg or foo.jpg, you will either need to rewrite each image's src to the full URL or add a <base> tag to the HTML.
For parsing HTML, use the Simple HTML DOM library; it is faster than rolling your own.
The second issue may be that the images also require the user to be logged in. If that is the case, you would also have to download all the images and either embed them in the content after Base64 encoding them or store them temporarily on your server.
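Embedding a downloaded image as a data: URI is straightforward once you have its bytes. In this sketch the bytes are faked with a local string; in practice they would come from a second curl_exec() that reuses the same CURLOPT_USERPWD credentials as the page request:

```php
<?php
// Build a data: URI so an image can be inlined into the scraped HTML.
// $bytes would normally be the body of a second authenticated cURL request.
function toDataUri(string $bytes, string $mime): string
{
    return 'data:' . $mime . ';base64,' . base64_encode($bytes);
}

$fakePng = "\x89PNG fake bytes"; // placeholder for real image data
echo '<img src="' . toDataUri($fakePng, 'image/png') . '" />';
```

Inlining avoids hosting the images yourself, at the cost of roughly a one-third size increase from the Base64 encoding.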
Here is some of the HTML:
The images that I want to get:
<img id="backgroundImage" style="z-index: 0;" src="/nagios/nagvis/nagvis/images/maps/Nagvis_CC.png"/>
<a href="/nagios/cgi-bin/extinfo.cgi?type=2&host=business_processes&service=NLThirdPartyLive" target="_self">
And a lot of JavaScript.
I tried to use the Simple HTML DOM library, but find() returns an array, so echoing it directly printed nothing.
require("/simplehtmldom/simple_html_dom.php");
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'WhateverBrowser1.45');
curl_setopt($ch, CURLOPT_URL, 'http://10.123.22.38/nagios/nagvis/nagvis/index.php?map=Nagvis_CC');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC); // basic HTTP authentication
curl_setopt($ch, CURLOPT_USERPWD, "guest:test"); // pass the user name and password
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
$result = curl_exec($ch);
curl_close($ch);
$html = str_get_html($result);
// find() returns an array of element objects; echo each match, not the array itself
foreach ($html->find('table[class=header_table]') as $table) {
    echo $table->outertext;
}
echo $result;
