Domdocument Load not loading - php

I am trying to load an XML file from a URL (an RSS feed).
But when I use
$doc = new DOMDocument();
$doc->load($url);
if($doc->load($url,LIBXML_NOWARNING)===false)
{
echo "Hello";
//echo @$doc->load($url,LIBXML_NOWARNING);
//exit;
$error = $doc->load($url);
print_r($error);exit;
}
It only prints Hello.
No warning is displayed for line 2.
Please tell me how to find out which error occurs, as I am getting no output at all.

Remove exit; from the code after echo "Hello"; that is the reason

That is because the rendered content is not visible. Try pressing Ctrl+U on your browser.
Also, instead of print_r try with var_dump
$error = $doc->load($url);
var_dump($error);
EDIT :
So it seems like your $doc->load failed in the first place. You need to change your if statement to
if($doc->load($url,LIBXML_NOWARNING)===true) // Replaced false with true.
or simply
if($doc->load($url,LIBXML_NOWARNING))
Your XML load failed that's why it went inside the if statement. Check whether the URL is spelled right or check if the URL really exists.

Try using libxml_use_internal_errors() to capture XML parsing errors:
<?php
$doc = new DOMDocument();
$doc->recover = true;
libxml_use_internal_errors(true);
$url = 'http://page2rss.com/rss/91a83628a27c43b6ab4f0b3959f69f5a';
$doc->load($url);
$errors = libxml_get_errors();
foreach ($errors as $error) {
printf("Error %d at line %d, column %d:\n\t%s\n",
$error->code, $error->line, $error->column, $error->message);
}
libxml_use_internal_errors(false);
// Error 9 at line 82, column 155:
// Input is not proper UTF-8, indicate encoding !
// Bytes: 0xAE 0x20 0x28 0x52
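The "Input is not proper UTF-8" error above typically means the feed contains bytes from a single-byte encoding without declaring one. Here is a minimal sketch of one possible workaround, assuming the feed is really ISO-8859-1 (where byte 0xAE is the "®" sign). The sample feed string is made up for illustration, and the real encoding of the feed in the question is not known.

```php
<?php
// Hypothetical feed body: no encoding declaration, but a raw 0xAE byte
// ("®" in ISO-8859-1), which is not valid UTF-8 on its own.
$raw = "<?xml version=\"1.0\"?><rss><channel><title>Acme\xAE news</title></channel></rss>";

// Assumption: the real encoding is ISO-8859-1. Convert to UTF-8 before parsing.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadXML($utf8);

// Print anything libxml still complains about.
foreach (libxml_get_errors() as $error) {
    printf("Error %d at line %d: %s", $error->code, $error->line, $error->message);
}
libxml_clear_errors();
libxml_use_internal_errors(false);

$title = $loaded ? $doc->getElementsByTagName('title')->item(0)->textContent : null;
```

When loading from a URL, the same idea applies: fetch the body with file_get_contents(), convert it, and pass the result to loadXML() instead of calling load() on the URL directly.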

Related

How to skip invalid XML file with incomplete closing tags in PHP

I am using PHP DOM Xpath to read XML files. In some cases tags are not properly closed like below
<data>
<name> value </name>
<address
I have following code to check if XML is valid
$doc = new DOMDocument();
if(!$doc->load('test.xml'))
{
foreach (libxml_get_errors() as $error)
{
print_r($error);
}
libxml_clear_errors();
}
else
{
$valid_xml = 'y';
}
if($valid_xml=='y')
// then process XML
else
// skip and take next file
but I am getting the errors below at the line if(!$doc->load('test.xml'))
Message: DOMDocument::load(): Couldn't find end of Start Tag AdjustmentsToReconcile
Message: DOMDocument::load(): Premature end of data in tag
You were almost there. Try calling libxml_use_internal_errors(true); before everything else. It tells PHP not to emit the errors as warnings but to store them, so that you can iterate through them as your code already does.
This should help you:
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$valid_xml = true;
if(!$doc->load('test.xml'))
{
$valid_xml = (count(libxml_get_errors()) === 0);
libxml_clear_errors();
}
if($valid_xml)
// then process XML
else
// skip and take next file
libxml_use_internal_errors is the key.
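The "skip and take next file" loop can be sketched as follows. This is only an illustration: parseXmlOrNull is a hypothetical helper name, and it parses in-memory strings with loadXML() so the example is self-contained. For files on disk, $doc->load($file) would take its place.

```php
<?php
/**
 * Parse an XML string; return the document, or null if libxml reported errors.
 * (Hypothetical helper, not part of the original code.)
 */
function parseXmlOrNull(string $xml): ?DOMDocument
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the caller's setting

    return ($ok && count($errors) === 0) ? $doc : null;
}

// Made-up inputs: one complete document and one truncated like in the question.
$inputs = [
    'good'   => '<data><name> value </name><address>x</address></data>',
    'broken' => '<data><name> value </name><address',
];

$processed = [];
foreach ($inputs as $label => $xml) {
    $doc = parseXmlOrNull($xml);
    if ($doc === null) {
        continue; // skip and take the next input
    }
    $processed[] = $label; // then process XML
}
```

Restoring the previous libxml_use_internal_errors() value keeps the helper from changing error handling for the rest of the script.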

Trying to use DOMDocument::loadHTMLFile with a generated url

I'm calling the DOMDocument::loadHTMLFile method using a URL I built.
This is the code I used to build the url:
$url = "http://en.wikipedia.org".$path;
The $path is obtained from an href attribute of another file. When I echo it, it prints /wiki/Pop_music
If I hardcode the url to http://en.wikipedia.org/wiki/Pop_music the page returns fine, but if I try to use my generated path I am getting errors.
This is the code I'm currently working with:
foreach ($paths as $path)
{
echo $path; // will cause error
//echo $path = '/wiki/Pop_music'; // will work
$url = "http://en.wikipedia.org"."$path";
$doc = getHTML($url, 1);
if($doc !== false)
{
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
}
The getHTML function is:
function getHTML($url, $domainID)
{
$conArtistsCrawler = new mysqli(HOST, USERNAME, PASSWORD, CRAWLER_DB_NAME);
// Load HTML
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
// Update the time to show that the domain was crawled.
$sql = "UPDATE Domain SET LastCrawled = CURRENT_TIMESTAMP() WHERE DomainID = '$domainID'";
$conArtistsCrawler->query($sql);
$conArtistsCrawler->close();
// Delay 1 second after the request to avoid getting BANNED
sleep(1);
// Check to see if URL is valid
if($isSuccessful === false)
{
//URL invalid!
echo "\"".$url."\" is invalid<br>";
return false;
}
return $doc;
}
The code outputs:
With hardcoded path:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already
defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music Warning: DOMDocument::loadHTMLFile(): Tag audio
invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Warning: DOMDocument::loadHTMLFile(): Tag source invalid in
http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Pop music
With path variable:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in
http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music
Warning:
DOMDocument::loadHTMLFile(http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E):
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77
Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load
external entity "http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E"
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77 "http://en.wikipedia.org/wiki/Pop_music " is invalid
Short answer:
Well, the error you're getting is due to the fact that $doc is not a DOMDocument object but the boolean false. Since you're suppressing DOMDocument warnings, you can't know why getHTML() is returning false.
So, lose the @ operator, check what DOMDocument is complaining about, and debug from there.
Edit:
but I am still unsure why when I pass in the variable I get a
different result then when I hardcode it. When I echo both path values
or url values they look identical
They are certainly not identical. There is a <br /> tag after Pop_music, which makes the URL invalid.
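A browser renders a stray <br /> as a line break, which is why the two echoed values look the same on screen. The sketch below shows one way to expose and remove it. The $path value is a hypothetical reconstruction based on the %3Cbr%20/%3E in the warning, and strip_tags() plus trim() is just one possible cleanup.

```php
<?php
// Hypothetical value: what $path might really contain if markup was appended to it.
$path = "/wiki/Pop_music<br />";

// var_dump() reports the string length, which exposes characters a browser hides.
var_dump($path);

// Strip any tags and surrounding whitespace before building the URL.
$cleanPath = trim(strip_tags($path));
$url = "http://en.wikipedia.org" . $cleanPath;

echo $url, "\n";
```

Viewing the page source, or echoing htmlspecialchars($path), would reveal the extra markup as well.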
Long Answer
Running this script:
$path = '/wiki/Pop_music';
$url = "http://en.wikipedia.org$path";
$doc = new \DOMDocument();
$success = @$doc->loadHTMLFile($url);
if ($success) {
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
produces the following result:
Pop music<br />
So, in order to troubleshoot your script there are a couple of things you should do...
Lose the @ operator
Do not use the @ operator. It swallows any warning thrown at you and makes debugging a lot harder. Admittedly, DOMDocument complains a lot, sometimes about errors that aren't really errors (such as some HTML5 tags). But it also raises valid warnings, such as malformed HTML or an unreachable URL.
The best way to handle this is to install a custom error handler before calling DOMDocument.
This lets you digest the warnings given by DOMDocument and differentiate between important and trivial ones.
Example:
set_error_handler(function($errno, $errstr, $errfile, $errline) {
//Digest error here
});
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
restore_error_handler();
Note: You can also use libxml_use_internal_errors(true);
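To make "digest error here" more concrete, here is a minimal sketch of a handler that drops the noisy "Tag ... invalid" warnings and keeps everything else. The isTrivialDomWarning function and its regular expression are assumptions for illustration. Which warnings count as trivial is a judgment call, and the exact warnings raised depend on the libxml version.

```php
<?php
// Hypothetical classifier: treat "Tag ... invalid" (unknown HTML5 tags) as noise.
function isTrivialDomWarning(string $message): bool
{
    return preg_match('/Tag \w+ invalid/', $message) === 1;
}

$important = [];
set_error_handler(function (int $errno, string $errstr) use (&$important): bool {
    if (!isTrivialDomWarning($errstr)) {
        $important[] = $errstr; // keep warnings that may point at a real problem
    }
    return true; // handled; do not fall through to PHP's default handler
});

$doc = new DOMDocument();
// Inline markup stands in for the remote page so the sketch runs offline.
$isSuccessful = $doc->loadHTML(
    '<html><body><h1 id="firstHeading">Pop music</h1><audio></audio></body></html>'
);

restore_error_handler();

foreach ($important as $message) {
    echo $message, "\n";
}
```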
Your getHTML function returns an inconsistent type
Your getHTML function can return either a DOMDocument object or a boolean. While that isn't a bad thing per se (and internally, PHP does that with a lot of functions), it means you can't assume $doc is an object, because it can be the boolean false. So you have to test the returned value before passing it as an argument to DOMXPath. In fact, that's the error you're getting:
You're passing a boolean to DOMXPath instead of a DOMDocument object.
Either throw an exception (or error) in the function or test the returned value before passing it to DOMXPath.
example:
$doc = getHTML($url, 1);
if ($doc instanceof \DOMDocument) {
$xpath = new DOMXPath($doc);
}
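The other option mentioned, throwing instead of returning false, could look roughly like this. loadHtmlFileOrFail is a hypothetical name, and the sketch leaves out the database update and the delay from the original getHTML. The missing path is assumed not to exist so that the failure branch runs.

```php
<?php
/**
 * Load an HTML file or URL and always return a DOMDocument; throw on failure.
 * (Hypothetical helper, not part of the original code.)
 */
function loadHtmlFileOrFail(string $source): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // keep parser noise out of the output
    $doc = new DOMDocument();
    $ok = $doc->loadHTMLFile($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($ok === false) {
        throw new RuntimeException("Could not load HTML from \"$source\"");
    }
    return $doc;
}

$failed = false;
try {
    // Assumed not to exist, to demonstrate the failure path.
    $doc = loadHtmlFileOrFail('/nonexistent/no-such-file.html');
    $xpath = new DOMXPath($doc); // only reached when $doc is a real DOMDocument
} catch (RuntimeException $e) {
    $failed = true;
    echo $e->getMessage(), "\n";
}
```

With a return type of DOMDocument, the caller no longer needs an instanceof check; a failed load can only surface as an exception.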

How to output DOMDocuments?

Maybe I am missing something... but the DOM Object is empty in this code:
$input = file_get_contents('http://www.google.com/');
$doc = new DOMDocument();
@$doc->loadHTML($input); //suppress errors on invalid html!
var_dump($doc);
die();
I really don't know what could be wrong with that code. I have verified that $input is actually filled with the html contents of the web page.
The output is:
object(DOMDocument)#3 (0) { }
I don't understand why...
This is expected behaviour. To see the HTML, use DOMDocument::saveHTML() (or saveXML()).
The output is: object(DOMDocument)#3 (0) { }
Yes. That's what a var_dumped DOMDocument looks like.
If you want to look at the HTML representation of the content inside the document, saveHTML() on it. That spits out a cleaned up version of the HTML on Google's home page for me.
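A short sketch of the saveHTML() approach, using inline markup (made up for the example) instead of a downloaded page so that it runs offline:

```php
<?php
$doc = new DOMDocument();
// Hypothetical inline markup standing in for the downloaded page.
$doc->loadHTML('<html><body><p>Hello</p></body></html>');

// Serialize the whole document to a string and print it.
$html = $doc->saveHTML();
echo $html;

// saveHTML() also accepts a node, to serialize just that part of the tree.
$paragraph = $doc->getElementsByTagName('p')->item(0);
$fragment = $doc->saveHTML($paragraph);
echo $fragment, "\n";
```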
Try this
$input = file_get_contents('http://www.google.com/');
$doc = new DOMDocument();
$test=@$doc->loadHTML($input); //suppress errors on invalid html!
var_dump($test);
die();
//output
//bool(true)
?>
or try
$input = file_get_contents('http://www.google.com/');
$buffer = ob_get_clean();
$tidy = new tidy();
$input = $tidy->repairString($input);
$doc = new DOMDocument();
@$doc->loadHTML($input); //suppress errors on invalid html!
var_dump($doc);
die();

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom;
throws
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,
Catchable fatal error: Object of class DOMDocument could not be converted to string in test.php on line 10
To suppress the warning, you can use libxml_use_internal_errors(true)
// create new DOMDocument
$document = new \DOMDocument('1.0', 'UTF-8');
// set error level
$internalErrors = libxml_use_internal_errors(true);
// load HTML
$document->loadHTML($html);
// Restore error level
libxml_use_internal_errors($internalErrors);
I would bet that if you looked at the source of http://www.somesite.com/ you would find special characters that haven't been converted to HTML. Maybe something like this:
link
Should be
link
$dom->@loadHTML($html);
This is incorrect, use this instead:
@$dom->loadHTML($html);
There are 2 errors: the second is because $dom is not a string but an object and thus cannot be "echoed". The first error is a warning from loadHTML, caused by invalid syntax in the HTML document being loaded (probably an & (ampersand) used as a parameter separator and not masked as the entity &amp;).
You can ignore and suppress this error message (not the error, just the message!) by calling the function with the error control operator "@" (http://www.php.net/manual/en/language.operators.errorcontrol.php )
@$dom->loadHTML($html);
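To illustrate the ampersand point: when the separator is written as &amp; in the markup, the parser decodes it back to a plain & in the attribute value. The markup and the page.php URL below are made up for the example, and whether the unescaped form triggers a warning depends on the libxml version.

```php
<?php
// Hypothetical markup: the "&" in the query string is written as "&amp;".
$html = '<html><body><a href="page.php?a=1&amp;b=2">link</a></body></html>';

libxml_use_internal_errors(true); // collect parser messages instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors(false);

// The attribute value comes back with the entity decoded.
$href = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
echo $href, "\n";
```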
The reason for your fatal error is DOMDocument does not have a __toString() method and thus can not be echo'ed.
You're probably looking for
echo $dom->saveHTML();
Regardless of the echo (which would need to be replaced with print_r or var_dump), if an exception is thrown the object should stay empty:
DOMNodeList Object
(
)
Solution
Set recover to true, and strictErrorChecking to false
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->recover = true;
$doc->strictErrorChecking = false;
$doc->loadHTML($content);
Use php's entity-encoding on the markup's contents, which is a most common error source.
replace the simple
$dom->loadHTML($html);
with the more robust ...
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page))
{
$errors="";
foreach (libxml_get_errors() as $error) {
$errors.=$error->message."<br/>";
}
libxml_clear_errors();
print "libxml errors:<br>$errors";
return;
}
$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML(htmlspecialchars($html));
echo $dom->saveHTML();
try this
I know this is an old question, but if you ever want to fix the malformed '&' signs in your HTML, you can use code similar to this:
$page = file_get_contents('http://www.example.com');
$page = preg_replace('/\s+/', ' ', trim($page));
fixAmps($page, 0);
$dom = new DOMDocument();
$dom->loadHTML($page);
function fixAmps(&$html, $offset) {
$positionAmp = strpos($html, '&', $offset);
$positionSemiColumn = strpos($html, ';', $positionAmp+1);
$string = substr($html, $positionAmp, $positionSemiColumn-$positionAmp+1);
if ($positionAmp !== false) { // If an '&' can be found.
if ($positionSemiColumn === false) { // If no ';' can be found.
$html = substr_replace($html, '&amp;', $positionAmp, 1); // Replace straight away.
} else if (preg_match('/&(#[0-9]+|[A-Z|a-z|0-9]+);/', $string) === 0) { // If a standard escape cannot be found.
$html = substr_replace($html, '&amp;', $positionAmp, 1); // This means we need to escape the '&' sign.
fixAmps($html, $positionAmp+5); // Recursive call from the new position.
} else {
fixAmps($html, $positionAmp+1); // Recursive call from the new position.
}
}
}
Another possible solution is
$sContent = htmlspecialchars($sHTML);
$oDom = new DOMDocument();
$oDom->loadHTML($sContent);
echo html_entity_decode($oDom->saveHTML());
Another possible solution: maybe your file is an ASCII-type file; just change the file's type.
Even after this my code was working fine, so I just removed all warning messages with this statement at line 1.
<?php error_reporting(E_ERROR); ?>

PHP returning page error on simplexml print_r

The problem only happens with one file when I try to use a DOMDocument/SimpleXML method, so it seems like the issue is with that file. No clue what it could be.
If I do the following:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
print_r($xml);
in Chrome, I get a "Page Unavailable" error. In Firefox, I get nothing.
If I do the same thing but to a "test2.html", I get a print out as expected.
If I try the same thing but doing it this way:
$file = "test1.html";
$data = file_get_contents($file);
$dom = DOMDocument::loadHTML($data);
$xml = simplexml_import_dom($dom);
print_r($xml);
I get the same issue.
If I comment out the print_r line, Chrome goes from the "Page Unavailable" to blank.
I changed the permissions to 777, in case that was an issue, no fix.
I tried simply echoing out the contents of the html, no problem at all.
Any clues as to why a) Chrome would do that, and b) why I'm not getting any usable results?
Update:
If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
$xml = simplexml_import_dom($dom);
print_r($xml);
}
I get the same issue. If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
echo "Load!";
}
I get the "Load!" output, meaning that the dom method shouldn't be the problem (?)
I'll try the same exact test with the simplexml.
Update2:
If I do this:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
}
I get "Load!" but if I do:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
print_r($xml);
}
I get the error. I did finally notice that I had an option to view the error in Chrome:
Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error.
The troublesome html file is 288Kb. Could that be the issue? If so, how would I adjust for that?
Last Update:
Very Odd. I can use methods and functions on the object (as simplexml or domdocument), so I can do things like xpath to delete or parse the html, etc. In some cases (small results) it can echo out results, but for big stuff (show all spans), it fails in the same way.
So, since the end result, I think will fit in these parameters, I SHOULD be okay (I guess).
But any real solution is very welcome.
Turn on error reporting: error_reporting(E_ALL); in the first line of your PHP code.
Check the memory limit of your PHP configuration: memory_limit in the respective php.ini
What's the difference between test1.html and test2.html? Perhaps test1.html is not well-formed.
DOMDocument and/or SimpleXML may bail out if the document is malformed. Try something like:
$dom = DOMDocument::loadHTMLFile($file);
if (!$dom) {
echo 'Loading file failed';
exit;
}
$xml = simplexml_import_dom($dom);
if (!$xml) {
...
}
If creating the $dom worked, conversion to $xml should work as well, but make sure anyway.
Edit: As Gehrig said, make sure error reporting is on, that should make it obvious where the process fails.
