How to output DOMDocuments? - php

Maybe I am missing something... but the DOM Object is empty in this code:
$input = file_get_contents('http://www.google.com/');
$doc = new DOMDocument();
#$doc->loadHTML($input); //supress errors on invalid html!
var_dump($doc);
die();
I really don't know what could be wrong with that code. I have verified that $input is actually filled with the html contents of the web page.
The output is:
object(DOMDocument)#3 (0) { }
I don't understand why...

This is expected behaviour. To see the HTML, use DOMDocument::saveHTML() (or saveXML()).

The output is: object(DOMDocument)#3 (0) { }
Yes. That's what a var_dumped DOMDocument looks like.
If you want to look at the HTML representation of the content inside the document, saveHTML() on it. That spits out a cleaned up version of the HTML on Google's home page for me.

Try this
$input = file_get_contents('http://www.google.com/');
$doc = new DOMDocument();
$test=#$doc->loadHTML($input); //supress errors on invalid html!
var_dump($test);
die();
//output
//bool(true)
?>
or try
$input = file_get_contents('http://www.google.com/');
$buffer = ob_get_clean();
$tidy = new tidy();
$input = $tidy->repairString($input);
$doc = new DOMDocument();
#$doc->loadHTML($input); //supress errors on invalid html!
var_dump($doc);
die();

Related

Domdocument Load not loading

I am trying to load xml file through url(i.e. rss).
But when I Use
$doc = new DOMDocument();
$doc->load($url);
if($doc->load($url,LIBXML_NOWARNING)===false)
{
echo "Hello";
//echo #$doc->load($url,LIBXML_NOWARNING);
//exit;
$error = $doc->load($url);
print_r($error);exit;
}
It only prints Hello..
No warning displayed for line 2.
Please provide me solution that which error occurs as I am getting nothing.
Remove exit; from the code after echo "Hello"; that is the reason
That is because the rendered content is not visible. Try pressing Ctrl+U on your browser.
Also, instead of print_r try with var_dump
$error = $doc->load($url);
var_dump($error);
EDIT :
So it seems like your $doc->load failed in the first place. You need to change your if statement to
if($doc->load($url,LIBXML_NOWARNING)===true) // Replaced false with true.
or simply
if($doc->load($url,LIBXML_NOWARNING))
Your XML load failed that's why it went inside the if statement. Check whether the URL is spelled right or check if the URL really exists.
Try using libxml_use_internal_errors() to capture XML parsing errors:
<?php
$doc = new DOMDocument();
$doc->recover = true;
libxml_use_internal_errors(true);
$url = 'http://page2rss.com/rss/91a83628a27c43b6ab4f0b3959f69f5a';
$doc->load($url);
$errors = libxml_get_errors();
foreach ($errors as $error) {
printf("Error %d at line %d, column %d:\n\t%s\n",
$error->code, $error->line, $error->column, $error->message);
}
libxml_use_internal_errors(false);
// Error 9 at line 82, column 155:
// Input is not proper UTF-8, indicate encoding !
// Bytes: 0xAE 0x20 0x28 0x52

php simple html dom parser how to get the content of html tag

I am trying to get the specific tag content, but seems I am not able to do so using following function
<?PHP
include_once('simple_html_dom.php');
function read_page($url = 'http://google.com')
{
$doc = new DOMDocument();
$data = file_get_html($url);
$content = $data->find('div#footer');
print_r( $content);
}
read_page();
?>
Try $data->find('div[id="footer"]')

Echoing only a div with php

I'm attempting to make a script that only echos the div that encolose the image on google.
$url = "http://www.google.com/";
$page = file($url);
foreach($page as $theArray) {
echo $theArray;
}
The problem is this echos the whole page.
I want to echo only the part between the <div id="lga"> and the next closest </div>
Note: I have tried using if's but it wasn't working so I deleted them
Thanks
Use the built-in DOM methods:
<?php
$page = file_get_contents("http://www.google.com");
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($page);
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$lga = $domx->query("//*[#id='lga']")->item(0);
$domd2 = new DOMDocument();
$domd2->appendChild($domd2->importNode($lga, true));
echo $domd2->saveHTML();
In order to do this you need to parse the DOM and then get the ID you are looking for. Check out a parsing library like this http://simplehtmldom.sourceforge.net/manual.htm
After feeding your html document into the parser you could call something like:
$html = str_get_html($page);
$element = $html->find('div[id=lga]');
echo $element->plaintext;
That, I think, would be your quickest and easiest solution.

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom;
throws
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,
Catchable fatal error: Object of class DOMDocument could not be converted to string in test.php on line 10
To evaporate the warning, you can use libxml_use_internal_errors(true)
// create new DOMDocument
$document = new \DOMDocument('1.0', 'UTF-8');
// set error level
$internalErrors = libxml_use_internal_errors(true);
// load HTML
$document->loadHTML($html);
// Restore error level
libxml_use_internal_errors($internalErrors);
I would bet that if you looked at the source of http://www.somesite.com/ you would find special characters that haven't been converted to HTML. Maybe something like this:
link
Should be
link
$dom->#loadHTML($html);
This is incorrect, use this instead:
#$dom->loadHTML($html);
There are 2 errors: the second is because $dom is no string but an object and thus cannot be "echoed". The first error is a warning from loadHTML, caused by invalid syntax of the html document to load (probably an & (ampersand) used as parameter separator and not masked as entity with &).
You ignore and supress this error message (not the error, just the message!) by calling the function with the error control operator "#" (http://www.php.net/manual/en/language.operators.errorcontrol.php )
#$dom->loadHTML($html);
The reason for your fatal error is DOMDocument does not have a __toString() method and thus can not be echo'ed.
You're probably looking for
echo $dom->saveHTML();
Regardless of the echo (which would need to be replaced with print_r or var_dump), if an exception is thrown the object should stay empty:
DOMNodeList Object
(
)
Solution
Set recover to true, and strictErrorChecking to false
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->recover = true;
$doc->strictErrorChecking = false;
$doc->loadHTML($content);
Use php's entity-encoding on the markup's contents, which is a most common error source.
replace the simple
$dom->loadHTML($html);
with the more robust ...
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page))
{
$errors="";
foreach (libxml_get_errors() as $error) {
$errors.=$error->message."<br/>";
}
libxml_clear_errors();
print "libxml errors:<br>$errors";
return;
}
$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML(htmlspecialchars($html));
echo $dom;
try this
I know this is an old question, but if you ever want ot fix the malformed '&' signs in your HTML. You can use code similar to this:
$page = file_get_contents('http://www.example.com');
$page = preg_replace('/\s+/', ' ', trim($page));
fixAmps($page, 0);
$dom->loadHTML($page);
function fixAmps(&$html, $offset) {
$positionAmp = strpos($html, '&', $offset);
$positionSemiColumn = strpos($html, ';', $positionAmp+1);
$string = substr($html, $positionAmp, $positionSemiColumn-$positionAmp+1);
if ($positionAmp !== false) { // If an '&' can be found.
if ($positionSemiColumn === false) { // If no ';' can be found.
$html = substr_replace($html, '&', $positionAmp, 1); // Replace straight away.
} else if (preg_match('/&(#[0-9]+|[A-Z|a-z|0-9]+);/', $string) === 0) { // If a standard escape cannot be found.
$html = substr_replace($html, '&', $positionAmp, 1); // This mean we need to escape the '&' sign.
fixAmps($html, $positionAmp+5); // Recursive call from the new position.
} else {
fixAmps($html, $positionAmp+1); // Recursive call from the new position.
}
}
}
Another possibile solution is
$sContent = htmlspecialchars($sHTML);
$oDom = new DOMDocument();
$oDom->loadHTML($sContent);
echo html_entity_decode($oDom->saveHTML());
Another possibile solution is,maybe your file is ASCII type file,just change the type of your files.
Even after this my code is working fine , so i just removed all warning messages with this statement at line 1 .
<?php error_reporting(E_ERROR); ?>

PHP returning page error on simplexml print_r

The problem is only happening with one file when I try to do a DocumentDOM/SimpleXML method, so it seems like the issue is with that file. No clue what it could be.
If I do the following:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
print_r($xml);
in Chrome, I get a "Page Unavailable" error. In Firefox, I get nothing.
If I do the same thing but to a "test2.html", I get a print out as expected.
If I try the same thing but doing it this way:
$file = "test1.html";
$data = file_get_contents($file)
$dom = DOMDocument::loadHTML($data);
$xml = simplexml_import_dom($dom);
print_r($xml);
I get the same issue.
If I comment out the print_r line, Chrome goes from the "Page Unavailable" to blank.
I changed the permissions to 777, in case that was an issue, no fix.
I tried simply echoing out the contents of the html, no problem at all.
Any clues as to why a) Chrome would do that, and b) why I'm not getting any usable results?
Update:
If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
$xml = simplexml_import_dom($dom);
print_r($xml);
}
I get the same issue. If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
echo "Load!";
}
I get the "Load!" output, meaning that the dom method shouldn't be the problem (?)
I'll try the same exact test with the simplexml.
Update2:
If I do this:
I get the same issue. If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
}
I get "Load!" but if I do:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
print_r($xml);
}
I get the error. I did finally notice that I had an option to view the error in Chrome:
Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error.
The troublesome html file is 288Kb. Could that be the issue? If so, how would I adjust for that?
Last Update:
Very Odd. I can use methods and functions on the object (as simplexml or domdocument), so I can do things like xpath to delete or parse the html, etc. In some cases (small results) it can echo out results, but for big stuff (show all spans), it fails in the same way.
So, since the end result, I think will fit in these parameters, I SHOULD be okay (I guess).
But any real solution is very welcome.
Turn on error reporting: error_reporting(E_ALL); in the first line of your PHP code.
Check the memory limit of your PHP configuration: memory_limit in the respective php.ini
What's the difference between test1.html and test2.html? Perhaps test1.html is not well-formed.
DocumentDOM and/or SimpleXML may bail out if the document is malformed. Try something like:
$dom = DOMDocument::loadHTMLFile($file);
if (!$dom) {
echo 'Loading file failed';
exit;
}
$xml = simplexml_import_dom($dom);
if (!$xml) {
...
}
If creating the $dom worked, conversion to $xml should work as well, but make sure anyway.
Edit: As Gehrig said, make sure error reporting is on, that should make it obvious where the process fails.

Categories