Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, - php

$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom;
throws
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,
Catchable fatal error: Object of class DOMDocument could not be converted to string in test.php on line 10

To suppress the warning, you can use libxml_use_internal_errors(true):
// create new DOMDocument
$document = new \DOMDocument('1.0', 'UTF-8');
// set error level
$internalErrors = libxml_use_internal_errors(true);
// load HTML
$document->loadHTML($html);
// Restore error level
libxml_use_internal_errors($internalErrors);

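I would bet that if you looked at the source of http://www.somesite.com/ you would find special characters that haven't been converted to HTML entities. Maybe something like this:
<a href="/script.php?foo=bar&hello=world">link</a>
Should be
<a href="/script.php?foo=bar&amp;hello=world">link</a>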

$dom->@loadHTML($html);
This is incorrect, use this instead:
@$dom->loadHTML($html);

There are 2 errors: the second is because $dom is not a string but an object and thus cannot be "echoed". The first is a warning from loadHTML, caused by invalid syntax in the HTML document being loaded (probably an & (ampersand) used as a parameter separator and not escaped as the entity &amp;).
You can ignore and suppress this error message (not the error, just the message!) by calling the function with the error control operator "@" (http://www.php.net/manual/en/language.operators.errorcontrol.php ):
@$dom->loadHTML($html);

The reason for your fatal error is DOMDocument does not have a __toString() method and thus can not be echo'ed.
You're probably looking for
echo $dom->saveHTML();
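Putting both fixes together, a minimal sketch might look like the following. The inline markup is a made-up stand-in for the fetched page, and whether libxml actually reports the stray ampersand depends on the libxml2 version, so the collected warnings may be empty.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<html><body><a href="/page.php?a=1&b=2">link</a></body></html>';

// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$warnings = libxml_get_errors(); // may be empty on some libxml2 versions
libxml_clear_errors();
libxml_use_internal_errors($previous);

// DOMDocument has no __toString(), so serialize it explicitly.
$output = $dom->saveHTML();
echo $output;
```

Restoring the previous libxml setting keeps the change local to this one load.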

Regardless of the echo (which would need to be replaced with print_r or var_dump), if an exception is thrown the object should stay empty:
DOMNodeList Object
(
)
Solution
Set recover to true, and strictErrorChecking to false
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->recover = true;
$doc->strictErrorChecking = false;
$doc->loadHTML($content);
Alternatively, run PHP's entity encoding on the markup's contents; unencoded entities are the most common source of this error.

Replace the simple
$dom->loadHTML($html);
with the more robust
libxml_use_internal_errors(true);
if (!$dom->loadHTML($html)) {
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br/>";
    }
    libxml_clear_errors();
    print "libxml errors:<br>$errors";
    return;
}

$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML(htmlspecialchars($html));
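echo $dom->saveHTML();
Try this. Note that htmlspecialchars() also escapes every < and >, so the whole page is parsed as text rather than as markup.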

I know this is an old question, but if you ever want to fix the malformed '&' signs in your HTML, you can use code similar to this:
$page = file_get_contents('http://www.example.com');
$page = preg_replace('/\s+/', ' ', trim($page));
fixAmps($page, 0);
$dom = new DOMDocument();
$dom->loadHTML($page);
function fixAmps(&$html, $offset) {
    $positionAmp = strpos($html, '&', $offset);
    if ($positionAmp === false) { // If no further '&' can be found, we are done.
        return;
    }
    $positionSemiColumn = strpos($html, ';', $positionAmp + 1);
    if ($positionSemiColumn === false) { // If no ';' can be found.
        $html = substr_replace($html, '&amp;', $positionAmp, 1); // Replace straight away.
        fixAmps($html, $positionAmp + 5); // Later '&' signs cannot be entities either.
        return;
    }
    // Candidate entity: everything from the '&' up to the next ';'.
    $string = substr($html, $positionAmp, $positionSemiColumn - $positionAmp + 1);
    if (preg_match('/^&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z0-9]+);$/', $string) === 0) { // If a standard escape cannot be found.
        $html = substr_replace($html, '&amp;', $positionAmp, 1); // This means we need to escape the '&' sign.
        fixAmps($html, $positionAmp + 5); // Recursive call from the new position.
    } else {
        fixAmps($html, $positionAmp + 1); // Recursive call from the new position.
    }
}
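As an alternative to the recursive function, the same idea can probably be expressed as a single regular expression with a negative lookahead. This is only a sketch: escapeStrayAmpersands is a hypothetical helper name, and the pattern only checks that an entity looks well-formed, not that the name is one HTML actually defines.

```php
<?php
// Escape every '&' that does not already start a character reference
// such as &amp;, &#38; or &#x26;.
function escapeStrayAmpersands(string $html): string
{
    $pattern = '/&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)/';
    return preg_replace($pattern, '&amp;', $html);
}

$fixed = escapeStrayAmpersands('<a href="?a=1&b=2">Tom &amp; Jerry &#38; co &</a>');
echo $fixed, PHP_EOL;
// prints: <a href="?a=1&amp;b=2">Tom &amp; Jerry &#38; co &amp;</a>
```

Because it avoids recursion, this version should also cope with very large pages that contain many ampersands.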

Another possible solution is
$sContent = htmlspecialchars($sHTML);
$oDom = new DOMDocument();
$oDom->loadHTML($sContent);
echo html_entity_decode($oDom->saveHTML());

Another possible cause: your file may be saved as plain ASCII. Changing the file's encoding may fix it.

Even with this warning my code works fine, so I simply hid all warning messages with this statement on line 1.
<?php error_reporting(E_ERROR); ?>

Related

PHP How to avoid this warning: DOMDocument::loadHTML(): Invalid char in CDATA

I'm trying to collect some info from a web service, but I'm having issues with the CDATA Section of a page, because everything goes right when I use something like this:
$url = 'http://www.example.com';
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach($doc->getElementsByTagName('h3') as $subtitle) {
echo $subtitle->textContent; //The output is the Subtitle/s.
}
But when the page contains CDATA sections there is a problem with this error on the line $doc->loadHTML($content).
Warning: DOMDocument::loadHTML(): Invalid char in CDATA
I've seen over here a solution that I tried to implement without any success.
function sanitize_html($content) {
if (!$content) return '';
$invalid_characters = '/[^\x9\xa\x20-\xD7FF\xE000-\xFFFD]/';
return preg_replace($invalid_characters,'', $content);
}
$url = 'http://www.example.com';
$content = file_get_contents($url);
$cleanContent = sanitize_html($content);
$doc = new DOMDocument();
$doc->loadHTML($cleanContent); //Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
But I got this other error:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
What could be a good way to deal with the CDATA sections of a page? Greetings.
The solution is to replace the & symbol with &amp;
Or, if you must keep that & as it is, maybe you could enclose it in a CDATA section: <![CDATA[ - ]]>
Try adding PCLZIP before load IOFactory as shown:
require_once '/Classes/PHPExcel.php';
\PHPExcel_Settings::setZipClass(\PHPExcel_Settings::PCLZIP);
Add libxml_use_internal_errors(true) before loading and libxml_clear_errors() afterwards. This worked for me; see the code in the screenshot below.
https://i.stack.imgur.com/6MN4H.png
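One detail worth checking in the sanitize_html attempt above: PCRE reads \xD7FF as the byte \xD7 followed by the literal characters "FF", so the character class does not describe the intended Unicode ranges. A hedged sketch of the same filter, written with \x{...} escapes and the /u modifier, could look like this. sanitizeForLibxml is a hypothetical name, and the ranges follow the XML 1.0 definition of a valid character.

```php
<?php
// Remove characters that fall outside the XML 1.0 "Char" ranges.
function sanitizeForLibxml(string $content): string
{
    $invalid = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($invalid, '', $content);
    // With /u, preg_replace() returns null if the input is not valid UTF-8.
    return $clean ?? '';
}

$clean = sanitizeForLibxml("a\x00b\x0Bc");
var_dump($clean);
```

This only removes invalid characters. Unescaped ampersands still have to be handled separately, which is what the "no name in Entity" warning points at.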

Getting Notice: Undefined offset error: 1 when using file_get_contents and explode

I am trying to figure out some things about getting data from an external page using the PHP file_get_contents function.
This is the PHP code I am trying to get to work:
$url = 'http://www.controller.com/listings/aircraft/for-sale/list/category/3/jet-aircraft/manufacturer/cessna/model/citation-mustang';
$content = file_get_contents($url);
$first_step = explode('<div class="listing">',$content);
$second_step = explode("</div>",$first_step[1]);
echo $second_step[0];
It's simple code that should echo the content of the divs with class 'listing'. For one reason or another, I keep getting the
notice Undefined offset error: 1
and can't figure out a way to fix this. When I turn off error reporting, it just returns an empty page. I have read that it has something to do with empty arrays, but I'm not sure how to fix it.
Thanks in advance!
You can get elements by class name using DOMDocument and DOMXPath:
$url = 'http://www.controller.com/listings/aircraft/for-sale/list/category/3/jet-aircraft/manufacturer/cessna/model/citation-mustang';
$content = file_get_contents($url);
$doc = new DOMDocument();
if (!$doc->loadHTML($content)) {
die ('error');
}
$a = new DOMXPath($doc);
$class = 'listing';
$divs = $a->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]");
// $divs contains every element with "listing" in its class attribute
// you can get the content like this:
foreach ($divs as $div) {
echo $div->nodeValue;
// or
echo $div->textContent;
}
More info with this question from stackoverflow : Get all elements by class name using DOMDocument
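The notice itself appears because explode() returns a one-element array when the delimiter never occurs, so index 1 does not exist. The sketch below shows a guard for that case next to the XPath class query, run against a small made-up HTML string instead of the live page.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$content = '<div class="listing featured">Jet A</div>'
         . '<div class="other">ignored</div>'
         . '<div class="listing">Jet B</div>';

// explode() approach: check that the delimiter was found before using [1].
$parts = explode('<div class="missing">', $content);
$found = isset($parts[1]); // false here, so $parts[1] must not be read

// XPath approach: match "listing" as a whole word in the class attribute.
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$class = 'listing';
$nodes = $xpath->query("//div[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]");

$texts = [];
foreach ($nodes as $node) {
    $texts[] = trim($node->textContent);
}
print_r($texts);
```

The XPath version also matches divs that carry additional classes, which the literal explode() delimiter cannot do.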

Trying to use DOMDocument::loadHTMLFile with a generated url

I'm calling the DOMDocument::loadHTMLFile method using a URL I built.
This is the code I used to build the URL:
$url = "http://en.wikipedia.org".$path;
The $path is obtained from an href attribute of another file. When I echo it, it prints /wiki/Pop_music
If I hardcode the URL to http://en.wikipedia.org/wiki/Pop_music the page loads fine, but if I try to use my generated path I get errors.
This is the code I'm currently working with:
foreach ($paths as $path)
{
echo $path; // will cause error
//echo $path = '/wiki/Pop_music'; // will work
$url = "http://en.wikipedia.org"."$path";
$doc = getHTML($url, 1);
if($doc !== false)
{
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
}
The getHTML function is:
function getHTML($url, $domainID)
{
$conArtistsCrawler = new mysqli(HOST, USERNAME, PASSWORD, CRAWLER_DB_NAME);
// Load HTML
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
// Update the time to show that the domain was crawled.
$sql = "UPDATE Domain SET LastCrawled = CURRENT_TIMESTAMP() WHERE DomainID = '$domainID'";
$conArtistsCrawler->query($sql);
$conArtistsCrawler->close();
// Delay 1 second after the request to avoid getting BANNED
sleep(1);
// Check to see if URL is valid
if($isSuccessful === false)
{
//URL invalid!
echo "\"".$url."\" is invalid<br>";
return false;
}
return $doc;
}
The code outputs:
With hardcoded path:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already
defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music Warning: DOMDocument::loadHTMLFile(): Tag audio
invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Warning: DOMDocument::loadHTMLFile(): Tag source invalid in
http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Pop music
With path variable:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in
http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music
Warning:
DOMDocument::loadHTMLFile(http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E):
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77
Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load
external entity "http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E"
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77 "http://en.wikipedia.org/wiki/Pop_music " is invalid
Short answer:
Well, the error you're getting is due to the fact that $doc is not a DOMDocument object but it's the boolean false. Since you're suppressing DOMDocument warnings, you can't know why getHTML() is returning false.
So, lose the @ operator, check what DOMDocument is complaining about and debug from there.
Edit:
but I am still unsure why when I pass in the variable I get a
different result than when I hardcode it. When I echo both path values
or url values they look identical
They are certainly not identical. You have a <br/> tag after Pop_Music which makes the url invalid.
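A quick way to spot this kind of hidden difference is var_dump(), which reports the string length, followed by stripping markup and whitespace before the URL is built. The sketch below uses a made-up $path value that mimics the failing case.

```php
<?php
// Hypothetical href value that picked up a trailing tag along the way.
$path = "/wiki/Pop_music<br />";

// var_dump() shows the real length, which echo in a browser would hide.
var_dump($path);

// Remove any markup and surrounding whitespace before building the URL.
$cleanPath = trim(strip_tags($path));
$url = "http://en.wikipedia.org" . $cleanPath;
echo $url, PHP_EOL;
```

In a browser the <br /> renders as a line break, which is why the two echoed values look identical on screen.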
Long Answer
Running this script:
$path = '/wiki/Pop_music';
$url = "http://en.wikipedia.org$path";
$doc = new \DOMDocument();
$success = @$doc->loadHTMLFile($url);
if ($success) {
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
produces the following result:
Pop music<br />
So, in order to troubleshoot your script there are a couple of things you should do...
Lose the @ operator
Do not use the @ operator. It will eat any warning thrown at you and makes debugging a lot harder. In all truth, DOMDocument complains a lot, sometimes about errors that aren't really errors (such as some HTML5 tags). But it will also throw valid warnings, such as malformed HTML or an unreachable URL.
The best way to handle this is to register a custom error handler before calling DOMDocument.
This will enable you to digest the warnings given by DOMDocument and differentiate between important and trivial ones.
Example:
set_error_handler(function($errno, $errstr, $errfile, $errline) {
//Digest error here
});
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
restore_error_handler();
Note: You can also use libxml_use_internal_errors(true);
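As a rough illustration of "digesting" the warnings, the handler can collect the messages and a simple rule can separate noise from real problems. The rule used here (treating "Tag ... invalid" as trivial) is only an assumption for the example, the inline markup stands in for a fetched page, and which warnings libxml emits depends on its version.

```php
<?php
$warnings = [];

// Collect warnings instead of letting PHP print them.
set_error_handler(function (int $errno, string $errstr) use (&$warnings): bool {
    $warnings[] = $errstr;
    return true; // skip PHP's default handler for this warning
}, E_WARNING);

$doc = new DOMDocument();
$isSuccessful = $doc->loadHTML('<html><body><audio></audio><p>ok</p></body></html>');

restore_error_handler();

// Hypothetical triage rule: unknown HTML5 tags are noise, the rest matters.
$important = array_filter($warnings, function (string $message): bool {
    return !preg_match('/Tag \w+ invalid/', $message);
});

var_dump($isSuccessful, count($important));
```

Restoring the handler right after the load keeps the rest of the script on PHP's normal error reporting.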
Your getHTML function returns an inconsistent type
Your getHTML function can return either a DOMDocument object or a boolean. While that isn't a bad thing per se (and internally, PHP does that with a lot of functions), it means you can't assume $doc is an object, because it can be the boolean false. So you have to test the returned value before passing it as an argument to DOMXPath. In fact, that's the error you're getting:
You're passing a boolean to DOMXPath instead of a DOMDocument object
Either throw an exception (or error) in the function or test the returned value before passing it to DOMXPath.
example:
$doc = getHTML($url, 1);
if ($doc instanceof \DOMDocument) {
$xpath = new DOMXPath($doc);
}
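The other option mentioned, throwing instead of returning false, might be sketched as follows. loadDocumentOrFail is a hypothetical helper, and it parses a string rather than fetching a URL so that the example stays self-contained.

```php
<?php
// Always returns a DOMDocument; failure is signalled by an exception.
function loadDocumentOrFail(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Check for an empty string first: loadHTML() rejects empty input.
    $ok = $html !== '' && $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        throw new RuntimeException('Could not parse HTML');
    }
    return $doc;
}

try {
    $doc = loadDocumentOrFail('<h1 id="firstHeading">Pop music</h1>');
    $xpath = new DOMXPath($doc);
    $heading = $xpath->query("//h1[@id='firstHeading']")->item(0)->nodeValue;
    echo $heading, PHP_EOL; // prints: Pop music
} catch (RuntimeException $e) {
    echo 'Load failed: ', $e->getMessage(), PHP_EOL;
}
```

With a single return type, the caller no longer needs an instanceof check before creating the DOMXPath object.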

Domdocument Load not loading

I am trying to load an XML file from a URL (i.e. an RSS feed).
But when I use
$doc = new DOMDocument();
$doc->load($url);
if($doc->load($url,LIBXML_NOWARNING)===false)
{
echo "Hello";
//echo @$doc->load($url,LIBXML_NOWARNING);
//exit;
$error = $doc->load($url);
print_r($error);exit;
}
It only prints Hello.
No warning is displayed for line 2.
Please tell me how to find out which error occurs, as I am getting nothing.
Remove exit; from the code after echo "Hello"; that is the reason
That is because the rendered content is not visible. Try pressing Ctrl+U on your browser.
Also, instead of print_r try with var_dump
$error = $doc->load($url);
var_dump($error);
EDIT :
So it seems like your $doc->load failed in the first place. You need to change your if statement to
if($doc->load($url,LIBXML_NOWARNING)===true) // Replaced false with true.
or simply
if($doc->load($url,LIBXML_NOWARNING))
Your XML load failed that's why it went inside the if statement. Check whether the URL is spelled right or check if the URL really exists.
Try using libxml_use_internal_errors() to capture XML parsing errors:
<?php
$doc = new DOMDocument();
$doc->recover = true;
libxml_use_internal_errors(true);
$url = 'http://page2rss.com/rss/91a83628a27c43b6ab4f0b3959f69f5a';
$doc->load($url);
$errors = libxml_get_errors();
foreach ($errors as $error) {
printf("Error %d at line %d, column %d:\n\t%s\n",
$error->code, $error->line, $error->column, $error->message);
}
libxml_use_internal_errors(false);
// Error 9 at line 82, column 155:
// Input is not proper UTF-8, indicate encoding !
// Bytes: 0xAE 0x20 0x28 0x52

How to output DOMDocuments?

Maybe I am missing something... but the DOM Object is empty in this code:
$input = file_get_contents('http://www.google.com/');
$doc = new DOMDocument();
@$doc->loadHTML($input); //suppress errors on invalid html!
var_dump($doc);
die();
I really don't know what could be wrong with that code. I have verified that $input is actually filled with the html contents of the web page.
The output is:
object(DOMDocument)#3 (0) { }
I don't understand why...
This is expected behaviour. To see the HTML, use DOMDocument::saveHTML() (or saveXML()).
The output is: object(DOMDocument)#3 (0) { }
Yes. That's what a var_dumped DOMDocument looks like.
If you want to look at the HTML representation of the content inside the document, saveHTML() on it. That spits out a cleaned up version of the HTML on Google's home page for me.
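To confirm that a document really was parsed, it can help to inspect it through the DOM API rather than var_dump(). The sketch below uses a small inline page instead of Google's home page.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>Demo</title></head><body><p>Hello</p></body></html>');

// Count matching elements to confirm that the tree is populated.
$paragraphs = $doc->getElementsByTagName('p');
echo $paragraphs->length, PHP_EOL;          // prints: 1

// The root element of a parsed HTML document is <html>.
echo $doc->documentElement->nodeName, PHP_EOL; // prints: html

// saveHTML() also accepts a node and serializes just that subtree.
$fragment = $doc->saveHTML($paragraphs->item(0));
echo $fragment, PHP_EOL;                    // prints: <p>Hello</p>
```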
Try this
$input = file_get_contents('http://www.google.com/');
$doc = new DOMDocument();
$test = @$doc->loadHTML($input); //suppress errors on invalid html!
var_dump($test);
die();
//output
//bool(true)
or try
$input = file_get_contents('http://www.google.com/');
$buffer = ob_get_clean();
$tidy = new tidy();
$input = $tidy->repairString($input);
$doc = new DOMDocument();
@$doc->loadHTML($input); //suppress errors on invalid html!
var_dump($doc);
die();
