Cryptic result when crawling a site with simple html Dom

I'm currently trying to gather some data from PolitiFact using Simple HTML DOM, but much of the time I get weird output instead of the expected HTML.
The goal is not to brute-force the site but to request it once or twice a day and cache the result.
Here is what I get most of the time:
‹������í]{wÛ6²ÿ»=g¿ªn#»1EËJœÄ–µ×vœ&ÙÄñÚn²{r{|( ’S$Ã‡euÛï~3à¤¨‡c'ÛísNÄ`f0˜Úß=}sxþ¯“#1ŠÆŽ8ùùàÕ‹CQ3Ló]ëÐ4Ÿž?ÿ|~þú•h66Åy`¹¡Ùžk9¦yt\µQù;¦9™L“...
And here's the super simple code:
$html = file_get_html('http://www.politifact.com/personalities/barack-obama');
print_r($html->plaintext);
Do you have any idea why?
Is there some sort of protection or redirection on the website's side?
Thank you very much!

You received the expected page, but gzip-compressed. It looks like the server ignores the fact that the request has no Accept-Encoding header and, instead of sending an uncompressed response by default, sends gzipped data anyway.
I don't think simple-html-dom can decompress the data, but you can use cURL for that purpose:
$ch = curl_init('http://www.politifact.com/personalities/barack-obama/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
$data = curl_exec($ch);
$html = str_get_html($data);

Related

simple_html_dom doesn't take data from some websites

simple_html_dom does not take data from some websites.
For www.google.pl it downloads the page source,
but for others, such as gearbest.com and stooq.pl, it does not download any data.
require('simple_html_dom.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.google.com/"); // works
/*
curl_setopt($ch, CURLOPT_URL, "https://www.gearbest.com/"); // doesn't work
curl_setopt($ch, CURLOPT_URL, "https://stooq.pl/"); // doesn't work
*/
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);
$html = new simple_html_dom();
$html->load($response);
echo $html;
What should I change in the code to receive data from websites?

The root problem here (at least on my computer; it may differ with
your version) is that the site returns gzipped data, and it isn't being
decompressed by PHP and curl before being passed to the DOM
parser. If you are using PHP 5.4 or later, you can use gzdecode and
file_get_contents to decompress it yourself; the example below uses
gzinflate and strips the gzip header and trailer by hand instead.
<?php
// download the site
$data = file_get_contents("http://www.tsetmc.com/loader.aspx?ParTree=151311&i=49776615757150035");
// decompress it (a bit hacky to strip off the gzip header)
$data = gzinflate(substr($data, 10, -8));
include("simple_html_dom.php");
// parse and use
$html = str_get_html($data);
echo $html->root->innertext();
Note that this hack will not work on most sites. The underlying
reason seems to be that curl doesn't announce that it accepts
gzip data, but the web server on that domain ignores the missing
header and gzips the response anyway. Then neither curl nor PHP
checks the Content-Encoding header on the response; both assume
the body isn't gzipped and pass it through without raising an error or
decompressing it. Bugs in both the server and the client here!
For a more robust solution, you could use curl to fetch the headers
and inspect them yourself to determine whether you need to decompress the body.
Or you can just use this hack for this site and the normal method for
others to keep things simple.
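As a rough sketch of a more robust check, one option is to look at the body itself instead of the headers: a gzip stream starts with the two bytes 0x1f 0x8b, so a small helper can decompress only when that signature is present. The helper name maybe_gunzip is made up for this illustration, and gzdecode needs PHP 5.4 or later with the zlib extension.

```php
<?php
// Hypothetical helper (not part of simple_html_dom or curl):
// decompress $body only if it starts with the gzip magic bytes 0x1f 0x8b.
function maybe_gunzip($body)
{
    if (strlen($body) >= 2 && substr($body, 0, 2) === "\x1f\x8b") {
        $decoded = @gzdecode($body);
        // fall back to the raw body if it only looked like gzip
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}

// Local demonstration, no network needed
$plain = '<html><body><h1>Hello</h1></body></html>';
$gzipped = gzencode($plain);

var_dump(maybe_gunzip($gzipped) === $plain); // compressed input is decoded
var_dump(maybe_gunzip($plain) === $plain);   // plain input passes through unchanged
```

With something like this, the result of file_get_contents or curl_exec could be passed through maybe_gunzip() before handing it to str_get_html().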
It might also help to set the character encoding on your output.
Add this before you echo anything to ensure the data you read isn't
corrupted again in the user's browser by being read as the wrong charset:
header('Content-Type: text/html; charset=utf-8');

Get Content from Web Pages with PHP

I am working on a small project to get information from several web pages based on each page's HTML markup, and I do not know where to start at all.
The basic idea is to get the title from the <h1></h1> tags, the content from the <p></p> tags, and any other important information that is required.
I would have to set up a separate case for each source for it to work the way it needs to. I believe the right method is using the $_GET method with PHP. The goal of the project is to build a database of information.
What is the best method to grab the information I need?

First of all: PHP's $_GET is not a method. As you can see in the documentation, $_GET is simply an array populated with the GET parameters your web server received with the current request. As such, it is not what you want to use for this kind of thing.
What you should look into is cURL, which lets you compose even fairly complex requests, send them to the destination server, and retrieve the response. For example, for a POST request you could do something like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
"postvar1=value1&postvar2=value2&postvar3=value3");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
// http_build_query(array('postvar1' => 'value1')));
// receive server response ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec ($ch);
curl_close ($ch);
Source
Of course, if you don't need any complex requests but only simple GET requests, you can use the PHP function file_get_contents.
After you have received the web page content, you have to parse it. IMHO the best way to do this is by using PHP's DOM functions. How to use them should really be another question, but you can find tons of examples without much effort.

<?php
$remote = file_get_contents('http://www.remote_website.html');
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed real-world HTML
@$doc->loadHTML($remote);
// collect the text of every <h1>
$titles = array();
foreach ($doc->getElementsByTagName('h1') as $cell)
{
$titles[] = $cell->nodeValue;
}
// collect the text of every <p>
$content = array();
foreach ($doc->getElementsByTagName('p') as $cell)
{
$content[] = $cell->nodeValue;
}
?>

You can get the HTML source of a page with:
<?php
$html= file_get_contents('http://www.example.com/');
echo $html;
?>
Then, once you have the structure of the page, you can get the required tag with substr() and strpos().
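A minimal sketch of that substr()/strpos() approach might look like the following. The helper name text_between and the sample markup are invented for illustration; it only finds the first occurrence and assumes the opening tag has no attributes, so a DOM parser is usually the safer choice on real pages.

```php
<?php
// Hypothetical helper: return the text between the first $open and the
// next $close, or null when either marker is missing.
function text_between($html, $open, $close)
{
    $start = strpos($html, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);
    $end = strpos($html, $close, $start);
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// Sample markup standing in for a downloaded page
$html = '<html><body><h1>Page title</h1><p>First paragraph.</p></body></html>';
echo text_between($html, '<h1>', '</h1>'), "\n";
echo text_between($html, '<p>', '</p>'), "\n";
```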

How can I send and receive data from an external webpage?

I am building a PHP script that needs to use content loaded from an external webpage, but I don't know how to send/receive data.
The external webpage is http://packer.50x.eu/ .
Basically, I want to send a script (which is manually done in the first form) and receive the output (from the second form).
I want to learn how to do it because it will surely be a useful thing in the future, but I have no clue where to start.
Can anyone help me? Thanks.

You can use curl to receive data from an external page. Look at this example:
$url = "http://packer.50x.eu/";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
// curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // optional: give up connecting after 5 seconds
// you can set many other options; read about them in the manual
$data = curl_exec($ch);
curl_close($ch);
var_dump($data); // <-- the received page data (false on failure)
curl_setopt manual
Hope this helps.

You may want to look at file_get_contents($url), as it is very simple to use, simpler than cURL (though more limited), so your code could look like:
$url = "http://packer.50x.eu/";
$url_content=file_get_contents($url);
echo $url_content;
Look at the documentation for its other parameters, such as the stream context; note that the manual says offset is not supported for remote files.
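Since the question is about sending form data as well as receiving a page, here is a hedged sketch of how file_get_contents can send a POST body through a stream context. The helper name post_context and the field name 'src' are assumptions made for illustration; the real field name has to be read from the form's HTML.

```php
<?php
// Hypothetical helper: build a stream context that makes file_get_contents
// send $fields as a URL-encoded POST body.
function post_context(array $fields)
{
    return stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
            'timeout' => 10,
        ),
    ));
}

// 'src' is an assumed field name; check the form's HTML for the real one.
$context = post_context(array('src' => 'function hello() { return 1; }'));

// Inspect what would be sent, without touching the network
$options = stream_context_get_options($context);
echo $options['http']['content'], "\n";

// The actual request would then be:
// $response = file_get_contents('http://packer.50x.eu/', false, $context);
```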

Attempting to read portion of HTML file with curl and the output is strange

set_time_limit(0);
$ch = curl_init('http://www.tibia.com/community/?subtopic=highscores&world=Antica');
curl_setopt($ch, CURLOPT_RANGE, '0-999');
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
echo $data = curl_exec($ch);//get curl response
curl_close($ch);
And the example output is (it is slightly different every time I hit refresh button):
‹í]kSã8ºþNÕüOÍÕr™&™‚ ³ÜÒ3{æ¥8J¢Æ±2¾é=ûßÏ«‹Û±BãYÜåž,[¥÷òHz%[ÎÁØ›XÖÆàˆÐG=‹tz´O1ª S‡té[ZxF tm[Ô&è–YÝÀ%jÐ'¿½?¿<ütrÇË‹_X_BD1‡Ýîõç«^¤œàŠ(Èå%ÓwH¤$lšÌ·½°¨Ñý)s&W¼‰Ç¯Rb ¢‰Ð{dÐª zp×½=¿é¡ÞÿÞœ,úµ³vð·JU«0D#EÔ²ˆ³æ¯G'„A{Û0Ì1tjä´ÜØ4þŽšµZmóãÚÐ·MÞÆQ4{íßà=:DSì#³ß ò2Ý-‹Ø#oŒ:¨± ‰þ-üË1}‡=¹ÄéÍ¦¨ÝF%F‹ºßb²#i˜éO¢W> à]Ã»\b ·<6]~ßÿÿ?k v¥&¨JuÖÖÞ®PB)¶èWm ƒÁþ~}¿ÖÜÛö÷wš ÒØ®Õ04IØÜ!õdS€1½ªuHö
The page is displayed correctly when I comment out CURLOPT_RANGE
EDIT: I added
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
The output seems to be okay, but only when the range starts at 0. If the range is, for example, 2000-3000, it outputs nothing at all.
EDIT 2: The error message is: "Error while processing content unencoding: invalid distance too far back"

You're getting gzipped content. You should explicitly tell the server which content encoding you accept, for example by adding the following option:
curl_setopt($ch, CURLOPT_ENCODING, 'deflate');

I've never used CURLOPT_RANGE. Is there a reason you need to use it?
The reason gzip only works when you start from 0 is that the start of the stream holds the header, and later parts of the compressed data refer back to data decoded earlier, so decompression cannot begin mid-stream (which is consistent with the "invalid distance too far back" error). If you must use Range, you should capture the data for each range, combine them, and then un-gzip the result.
EDIT:
You mention in some comments that you use Range to fetch only part of the data and save bandwidth. I checked the page using Firebug and it's under 10 KB. With all the images it's almost 500 KB. You are already saving quite a bit, and unless you're on dial-up internet, 10 KB is nothing. Don't worry about using Range and combining the chunks; just let cURL handle the gzip.
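The point about ranges can be tried locally without any network access. The sketch below assumes PHP 7 or later with the zlib extension and uses a made-up table of rows in place of the real page: the first half of a gzip stream decodes to the beginning of the page, the second half alone cannot be decoded, and the two halves joined back together decode fully.

```php
<?php
// Build a page locally and gzip it, standing in for the server's compressed response.
$page = '';
for ($i = 0; $i < 2000; $i++) {
    $page .= "<tr><td>player$i</td><td>" . (($i * 7919) % 1000) . "</td></tr>\n";
}
$gz = gzencode($page);
$half = intdiv(strlen($gz), 2);

// A range starting at byte 0 contains the gzip header, so a streaming
// decoder can recover the beginning of the page.
$ctx = inflate_init(ZLIB_ENCODING_GZIP);
$start = inflate_add($ctx, substr($gz, 0, $half), ZLIB_SYNC_FLUSH);
var_dump($start !== false && $start !== '' && strpos($page, $start) === 0);

// A range from the middle has no header and no earlier data to refer back to.
$ctx = inflate_init(ZLIB_ENCODING_GZIP);
$middle = @inflate_add($ctx, substr($gz, $half), ZLIB_SYNC_FLUSH);
var_dump($middle === false);

// Joining the ranges back together before decompressing works.
$joined = substr($gz, 0, $half) . substr($gz, $half);
var_dump(gzdecode($joined) === $page);
```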

How to process XML returned after submitting a form?

I have just started a project that involves sending data via POST from HTML forms to another company's server. This returns XML. I need to process this XML to display certain information on a web page.
I am using PHP and have no idea where to start with how to access the XML. Once I know how to get at it, I know how to query it using XPath.
Any tips on how to get started, or links to sites with information on this, would be very useful.

You should check out the DOMDocument class; it comes as part of the standard PHP installation on most systems.
http://us3.php.net/manual/en/class.domdocument.php
Ohhh, I see. You should set up a PHP script that the user's form posts to. That script then passes those fields on to the remote server using cURL and receives the XML response, which you can process.
http://us.php.net/manual/en/book.curl.php
A simple example would be something like:
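<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://the_remote_server");
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $_POST);
// return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$YourXMLResponse = curl_exec($ch);
curl_close($ch);
?>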
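Once the XML is in a string, DOMDocument and DOMXPath can be used to query it. The following is only a sketch: the sample document and its element names (response, status, item) are invented, since the real server's format is not shown here.

```php
<?php
// Sample response standing in for $YourXMLResponse; element names are made up.
$xml = '<?xml version="1.0"?>
<response>
  <status>ok</status>
  <items>
    <item id="1">First</item>
    <item id="2">Second</item>
  </items>
</response>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die('The response was not well-formed XML');
}

$xpath = new DOMXPath($doc);

// evaluate() can return a scalar directly for expressions such as string(...)
$status = $xpath->evaluate('string(/response/status)');

// query() returns a node list that can be iterated
$names = array();
foreach ($xpath->query('/response/items/item') as $item) {
    $names[$item->getAttribute('id')] = $item->nodeValue;
}

echo $status, "\n";
print_r($names);
```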
