PHP How to avoid this warning: DOMDocument::loadHTML(): Invalid char in CDATA - php

I'm trying to collect some info from a web service, but I'm having issues with the CDATA Section of a page, because everything goes right when I use something like this:
$url = 'http://www.example.com';
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach($doc->getElementsByTagName('h3') as $subtitle) {
echo $subtitle->textContent; //The output is the Subtitle/s.
}
But when the page contains CDATA sections there is a problem with this error on the line $doc->loadHTML($content).
Warning: DOMDocument::loadHTML(): Invalid char in CDATA
I've seen over here a solution that I tried to implement without any success.
function sanitize_html($content) {
if (!$content) return '';
$invalid_characters = '/[^\x9\xa\x20-\xD7FF\xE000-\xFFFD]/';
return preg_replace($invalid_characters,'', $content);
}
$url = 'http://www.example.com';
$content = file_get_contents($url);
$cleanContent = sanitize_html($content);
$doc = new DOMDocument();
$doc->loadHTML($cleanContent); //Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
But I got this other error:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
What could be a good way to deal with the CDATA sections of a page? Greetings.

The solution is to replace the & symbol with &amp;
or, if you must keep the & as it is, you could enclose it in: <![CDATA[ - ]]>
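As a rough sketch of the first suggestion, the snippet below escapes only the ampersands that are not already part of an entity before handing the markup to loadHTML(). The helper name escapeBareAmpersands() and the sample markup are assumptions made up for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: escape "bare" ampersands, leaving existing
// entities such as &amp; or &#169; untouched.
function escapeBareAmpersands(string $html): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/',
        '&amp;',
        $html
    );
}

// Made-up sample markup with two bare ampersands and two valid entities.
$html  = '<p><a href="/page.php?a=1&b=2">Tom & Jerry &amp; friends &#169;</a></p>';
$fixed = escapeBareAmpersands($html);

libxml_use_internal_errors(true);   // keep any remaining parser noise quiet
$doc = new DOMDocument();
$doc->loadHTML($fixed);
libxml_clear_errors();

// The parser decodes &amp; back to & in the attribute value.
echo $doc->getElementsByTagName('a')->item(0)->getAttribute('href'), PHP_EOL;
```

The negative lookahead is what keeps already-escaped entities from being escaped a second time.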

Try adding PCLZIP before load IOFactory as shown:
require_once '/Classes/PHPExcel.php';
\PHPExcel_Settings::setZipClass(\PHPExcel_Settings::PCLZIP);

Add libxml_use_internal_errors(true) before loading and libxml_clear_errors() afterwards; this worked for me. The code is in the screenshot below:
https://i.stack.imgur.com/6MN4H.png
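Since the screenshot is not reproduced here, the following is a minimal sketch of what that approach usually looks like, not the answerer's exact code. The sample markup is made up; it contains a bare & that libxml's HTML parser normally complains about.

```php
<?php
// Made-up markup with a bare "&".
$html = '<html><body><h3>Fish & Chips</h3></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Optionally inspect what the parser complained about.
foreach (libxml_get_errors() as $error) {
    error_log(sprintf('libxml: %s (line %d)', trim($error->message), $error->line));
}
libxml_clear_errors();                    // free the collected errors
libxml_use_internal_errors($previous);    // restore the previous setting

$subtitles = [];
foreach ($doc->getElementsByTagName('h3') as $subtitle) {
    $subtitles[] = $subtitle->textContent;
}
echo implode(PHP_EOL, $subtitles), PHP_EOL;
```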

Related

Code to know where visitors come from with error simplexml_load_string():

I'm receiving these errors because of some PHP code:
PHP Warning: simplexml_load_string(): Entity: line 42: parser error : Premature end of data in tag meta line 4 in /home/*****/public_html/wp-config.php on line 19
PHP Warning: simplexml_load_string(): in /home/*****/public_html/wp-config.php on line 19
PHP Warning: simplexml_load_string(): ^ in /home/*****/public_html/wp-config.php on line 19
PHP Warning: simplexml_load_string(): Entity: line 37: parser error : StartTag: invalid element name in /home/*****/public_html/wp-config.php on line 19
PHP Warning: simplexml_load_string(): Entity: line 40: parser error : Opening and ending tag mismatch: script line 34 and body in /home/*****/public_html/wp-config.php on line 19
PHP Warning: simplexml_load_string(): Entity: line 42: parser error : Premature end of data in tag head line 3 in /home/*****/public_html/wp-config.php on line 19
This is the code:
<?php
define( 'WP_CACHE', true ); // Added by WP Rocket
define('WP_AUTO_UPDATE_CORE', 'minor');// This setting is required to make sure that WordPress updates can be properly managed in WordPress Toolkit. Remove this line if this WordPress website is not managed by WordPress Toolkit anymore.
// Added by WP Rocket
function convertXML($xml_content){
if(is_object($xml_content)){
foreach ($xml_content as $key => $value){
$xml_content[$key] = $value;
}
}
else {
$xml_content = $xml;
}
return $xml_content;
}
$IP_ADDR = $_SERVER['REMOTE_ADDR'];
$xml_get = file_get_contents("http://freegeoip.net/xml/$IP_ADDR");
$xml_content = simplexml_load_string($xml_get);
$xml_convert = convertXML($xml_content);
if($xml_convert['CountryCode'] != 'BR' or $xml_convert['CountryCode'] != 'US'){
$block_cmd = "\r\n deny from $IP_ADDR \r\n";
$include = 'testando.html';
$open = fopen($include, 'a');
fwrite($open,$block_cmd);
fclose($open);
}else{
}
/** Enable W3 Total Cache Edge Mode */
define('W3TC_EDGE_MODE', true); // Added by W3 Total Cache
/**
* The base configurations of the WordPress.
This code is meant to find out where the visitors of a website are coming from. I tried modifying line 19 ($IP_ADDR = $_SERVER['REMOTE_ADDR'];) in several ways, but the error persists and I don't know what's happening. I also tried adding this function, which I saw in another topic:
function sxe($url)
{
$xml = file_get_contents($url);
foreach ($http_response_header as $header)
{
if (preg_match('#^Content-Type: text/xml; charset=(.*)#i', $header, $m))
{
switch (strtolower($m[1]))
{
case 'utf-8':
// do nothing
break;
case 'iso-8859-1':
$xml = utf8_encode($xml);
break;
default:
$xml = iconv($m[1], 'utf-8', $xml);
}
break;
}
}
return simplexml_load_string($xml);
}
but this didn't work either.
I would take a closer look at what you're actually being sent from freegeoip.net.
If it is indeed HTML you're getting back from freegeoip.net, but has the information you want, you could try
$dom = new DOMDocument;
$dom->loadHTML($xml_get);
$xml_content = simplexml_import_dom($dom);
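One way to take that closer look is to check whether the payload parses as XML at all and, if it does not, print the start of it. The sketch below assumes made-up payloads standing in for the service's response; parseXmlOrNull() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null when the
// payload is not well-formed XML (for example, an HTML error page).
function parseXmlOrNull(string $payload): ?SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($payload);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}

// Made-up payloads; the real service response may look different.
$good = '<Response><IP>203.0.113.7</IP><CountryCode>BR</CountryCode></Response>';
$bad  = '<html><head><meta charset="utf-8"><title>Moved</title></head><body></body></html>';

foreach (['good' => $good, 'bad' => $bad] as $label => $payload) {
    $xml = parseXmlOrNull($payload);
    if ($xml === null) {
        // Show the beginning of what was actually received.
        echo "$label: not XML, starts with: ", substr($payload, 0, 40), PHP_EOL;
    } else {
        echo "$label: country = ", (string) $xml->CountryCode, PHP_EOL;
    }
}
```

The unclosed meta tag in the second payload is the same kind of input that produces "Premature end of data in tag meta" when HTML is fed to an XML parser.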

Trying to use DOMDocument::loadHTMLFile with a generated url

I'm calling the DOMDocument::loadHTMLFile method using a URL I built.
This is the code I used to build the url:
$url = "http://en.wikipedia.org".$path;
The $path is obtained from an href attribute of another file. When I echo it, it prints /wiki/Pop_music
If I hardcode the url to http://en.wikipedia.org/wiki/Pop_music the page returns fine, but if I try to use my generated path I am getting errors.
This is the code I'm currently working with:
foreach ($paths as $path)
{
echo $path; // will cause error
//echo $path = '/wiki/Pop_music'; // will work
$url = "http://en.wikipedia.org"."$path";
$doc = getHTML($url, 1);
if($doc !== false)
{
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
}
The getHTML function is:
function getHTML($url, $domainID)
{
$conArtistsCrawler = new mysqli(HOST, USERNAME, PASSWORD, CRAWLER_DB_NAME);
// Load HTML
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
// Update the time to show that the domain was crawled.
$sql = "UPDATE Domain SET LastCrawled = CURRENT_TIMESTAMP() WHERE DomainID = '$domainID'";
$conArtistsCrawler->query($sql);
$conArtistsCrawler->close();
// Delay 1 second after the request to avoid getting BANNED
sleep(1);
// Check to see if URL is valid
if($isSuccessful === false)
{
//URL invalid!
echo "\"".$url."\" is invalid<br>";
return false;
}
return $doc;
}
The code outputs:
With hardcoded path:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already
defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music Warning: DOMDocument::loadHTMLFile(): Tag audio
invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Warning: DOMDocument::loadHTMLFile(): Tag source invalid in
http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Pop music
With path variable:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in
http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music
Warning:
DOMDocument::loadHTMLFile(http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E):
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77
Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load
external entity "http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E"
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77 "http://en.wikipedia.org/wiki/Pop_music " is invalid
Short answer:
Well, the error you're getting is due to the fact that $doc is not a DOMDocument object but it's the boolean false. Since you're suppressing DOMDocument warnings, you can't know why getHTML() is returning false.
So, lose the @ operator, check what DOMDocument is complaining about and debug from there.
Edit:
but I am still unsure why when I pass in the variable I get a
different result then when I hardcode it. When I echo both path values
or url values they look identical
They are certainly not identical. You have a <br/> tag after Pop_Music which makes the url invalid.
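When such a value is echoed into a browser, the tag is rendered as a line break, so the two paths look the same on screen. A small sketch of how to reveal and remove it, assuming the path picked up a trailing <br /> as the failed URL suggests (the sample value is made up):

```php
<?php
// Hypothetical value: an href-derived path with a stray <br /> appended.
$path = "/wiki/Pop_music<br />";

// var_dump() shows the real length, which exposes the hidden markup.
var_dump($path);

// Strip any tags and surrounding whitespace before building the URL.
$cleanPath = trim(strip_tags($path));
$url = "http://en.wikipedia.org" . $cleanPath;

echo $url, PHP_EOL;
```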
Long Answer
Running this script:
$path = '/wiki/Pop_music';
$url = "http://en.wikipedia.org$path";
$doc = new \DOMDocument();
$success = @$doc->loadHTMLFile($url);
if ($success) {
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
produces the following result:
Pop music<br />
So, in order to troubleshoot your script there are a couple of things you should do...
Lose the @ operator
Do not use the @ operator. It swallows any warning thrown at you and makes debugging a lot harder. In all truth, DOMDocument complains a lot, sometimes about errors that aren't really errors (such as some HTML5 tags). But it also raises valid warnings, such as malformed HTML or an unreachable URL.
The best way to handle this is to install a custom error handler before calling DOMDocument.
This lets you digest the warnings given by DOMDocument and differentiate between important and trivial ones.
Example:
set_error_handler(function($errno, $errstr, $errfile, $errline) {
//Digest error here
});
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
restore_error_handler();
Note: You can also use libxml_use_internal_errors(true);
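To make the error-handler idea concrete, here is a minimal sketch that records the warnings in an array instead of discarding them. The helper name loadHtmlCollectingWarnings() and the sample markup are assumptions for illustration; which warnings are produced depends on the libxml version installed.

```php
<?php
// Hypothetical helper: returns [DOMDocument|false, string[] of warnings].
function loadHtmlCollectingWarnings(string $html): array
{
    $warnings = [];
    set_error_handler(
        function (int $errno, string $errstr) use (&$warnings): bool {
            $warnings[] = $errstr;   // keep the message for later inspection
            return true;             // handled: skip PHP's default handler
        },
        E_WARNING
    );

    try {
        $doc = new DOMDocument();
        $ok  = $doc->loadHTML($html);
    } finally {
        restore_error_handler();     // always restore the previous handler
    }

    return [$ok ? $doc : false, $warnings];
}

// Made-up markup with a bare "&" to provoke a parser complaint.
[$doc, $warnings] = loadHtmlCollectingWarnings(
    '<html><body><h1 id="firstHeading">Pop music</h1><p>A & B</p></body></html>'
);

foreach ($warnings as $warning) {
    error_log("DOM warning: $warning");
}

$heading = null;
if ($doc instanceof DOMDocument) {
    $xpath   = new DOMXPath($doc);
    $heading = $xpath->query("//h1[@id='firstHeading']")->item(0)->nodeValue;
    echo $heading, PHP_EOL;
}
```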
Your getHTML function returns an inconsistent type
Your getHTML function can return either a DOMDocument object or a boolean. While that isn't a bad thing per se (and internally, PHP does that with a lot of functions), it means you can't assume $doc is an object, because it can be the boolean false. So you have to test the returned value before passing it as an argument to DOMXPath. In fact, that's the error you're getting:
You're passing a boolean to DOMXPath instead of a DOMDocument object.
Either throw an exception (or error) in the function or test the returned value before passing it to DOMXPath.
example:
$doc = getHTML($url, 1);
if ($doc instanceof \DOMDocument) {
$xpath = new DOMXPath($doc);
}

PHP DomDocument failing to handle quotes in a url

When I try to open a URL like this one, which contains a quote:
http://api.anghami.com/rest/v1/GETsearch.view?sid=11754134061397734622103190992&query=Can't Remember to Forget You Shakira&searchtype=SONG&ook&songCount=1
in the browser, everything works fine and the output is valid XML.
But when I try to call it from a php file:
$url = "http:/api.anghami.com/rest/v1/GETsearch.view?sid=11754134061397734622103190992&query=Can't Remember to Forget You Shakira&searchtype=SONG&ook&songCount=1"
//using DOMDocument for parsing.
$data = new DOMDocument();
// loading the xml from Anghami API.
if($data->load("$url")){// Getting the Tag song.
foreach ($data->getElementsByTagName('song') as $searchNode)
{
$count++;
$n++;
//Getting the information of Anghami Song from the XML file.
$valueID = $searchNode->getAttribute('id');
$titleAnghami = $searchNode->getAttribute('title');
$album = $searchNode->getAttribute('album');
$albumID = $searchNode->getAttribute('albumID');
$artistAnghami = $searchNode->getAttribute('artist');
$track = $searchNode->getAttribute('track');
$year = $searchNode->getAttribute('year');
$coverArt = $searchNode->getAttribute('coverArt');
$ArtistArt = $searchNode->getAttribute('ArtistArt');
$size = $searchNode->getAttribute('size');
}
}
I get this error:
'Warning: DOMDocument::load(): I/O warning : failed to load external entity /var/www/html/http:/api.anghami.com/rest/v1/GETsearch.view?sid=11754134061397734622103190992&query=Can't Remember to Forget You Shakira&searchtype=SONG&ook&songCount=1" in /var/www/html/search.php on line 93'
Can anyone help please?
@Fracsi is correct: the URL needs to start with http:// not http:/
The other problem is that the XML has a default namespace (defined with the xmlns attribute on the root element), so you need to use
$data->getElementsByTagNameNS('http://api.anghami.com/rest/v1', 'song')
to select all the "song" elements.
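A small sketch of namespace-aware selection follows. Only the namespace URI is taken from the answer above; the XML document, its root element name, and the attribute values are made up for illustration.

```php
<?php
// Made-up response with a default namespace on the root element.
$xml = <<<XML
<response xmlns="http://api.anghami.com/rest/v1">
  <song id="1" title="First song" artist="Artist A"/>
  <song id="2" title="Second song" artist="Artist B"/>
</response>
XML;

$data = new DOMDocument();
$data->loadXML($xml);

$ns = 'http://api.anghami.com/rest/v1';

// Select elements by namespace URI plus local name.
$titles = [];
foreach ($data->getElementsByTagNameNS($ns, 'song') as $song) {
    $titles[] = $song->getAttribute('title');
}
print_r($titles);

// With XPath, the namespace must be bound to a prefix of your choosing.
$xpath = new DOMXPath($data);
$xpath->registerNamespace('a', $ns);
$songCount = $xpath->query('//a:song')->length;
echo $songCount, PHP_EOL;
```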

beginner attempting to read xml into php

I have an xml feed located here that I am trying to read into a php script, then cycle through the <packages>, and sum the <downloads>. I've attempted to do this using DOMDocument, but have thus far failed.
The basic method I've been trying to use is as follows:
<?php
$dom = new DomDocument;
$dom->loadXML('http://www.phogue.net/feed');
$packages = $dom->getElementsByTagName('package');
foreach($packages as $item)
{
echo $item->getAttribute('uid').'<br>';
}
?>
The above code is meant to just print out the name of each item, but it's not working. I am currently getting the following error:
Warning: DOMDocument::loadXML() [domdocument.loadxml]: Start tag expected, '<' not found in Entity, line: 1 in /home/a8744502/public_html/userbar.php on line 3
WORKING CODE:
<?php
$dom = new DomDocument;
$dom->load('http://www.phogue.net/feed/');
$package = $dom->getElementsByTagName('package');
$value=0;
foreach ($package as $plugin) {
$downloads = $plugin->getElementsByTagName("downloads");
$download = $downloads->item(0)->nodeValue;
$authors = $plugin->getElementsByTagName("author");
$author = $authors->item(0)->nodeValue;
if($author == "Zaeed")
{
$value += $download;
}
}
echo $value;
?>
DOMDocument::loadXML() expects a string of XML. Try DOMDocument::load() instead - http://www.php.net/manual/en/domdocument.load.php
Keep in mind that to open an XML file via HTTP, you will need the appropriate wrapper enabled.
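The difference between the two methods can be sketched without any network access by passing the XML text directly to loadXML(). The feed below is made up; the element names follow the question, the values are illustrative.

```php
<?php
// Made-up feed; element names follow the question, values are illustrative.
$feed = <<<XML
<packages>
  <package uid="one"><author>Zaeed</author><downloads>120</downloads></package>
  <package uid="two"><author>Someone</author><downloads>45</downloads></package>
  <package uid="three"><author>Zaeed</author><downloads>30</downloads></package>
</packages>
XML;

$dom = new DOMDocument();
$dom->loadXML($feed);        // loadXML(): the argument is the XML text itself
// $dom->load('feed.xml');   // load(): the argument is a filename or URL

// Sum the downloads of every package written by one author.
$total = 0;
foreach ($dom->getElementsByTagName('package') as $package) {
    $author = $package->getElementsByTagName('author')->item(0)->nodeValue;
    if ($author === 'Zaeed') {
        $total += (int) $package->getElementsByTagName('downloads')->item(0)->nodeValue;
    }
}
echo $total, PHP_EOL;
```

Passing a URL to loadXML() makes the parser treat the URL string itself as markup, which is why it reports that a start tag was expected.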
You have an open parenthesis at the beginning of your echo.

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom;
throws
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,
Catchable fatal error: Object of class DOMDocument could not be converted to string in test.php on line 10
To suppress the warning, you can use libxml_use_internal_errors(true):
// create new DOMDocument
$document = new \DOMDocument('1.0', 'UTF-8');
// set error level
$internalErrors = libxml_use_internal_errors(true);
// load HTML
$document->loadHTML($html);
// Restore error level
libxml_use_internal_errors($internalErrors);
I would bet that if you looked at the source of http://www.somesite.com/ you would find special characters that haven't been converted to HTML entities. Most likely it is a link whose URL contains a bare & between query parameters, where it should be written as &amp;.
$dom->@loadHTML($html);
This is incorrect, use this instead:
@$dom->loadHTML($html);
There are 2 errors: the second is because $dom is not a string but an object and thus cannot be "echoed". The first is a warning from loadHTML, caused by invalid syntax in the HTML document being loaded (probably an & (ampersand) used as a parameter separator and not escaped as the entity &amp;).
You can ignore and suppress this error message (not the error, just the message!) by calling the function with the error control operator "@" (http://www.php.net/manual/en/language.operators.errorcontrol.php):
@$dom->loadHTML($html);
The reason for your fatal error is DOMDocument does not have a __toString() method and thus can not be echo'ed.
You're probably looking for
echo $dom->saveHTML();
Regardless of the echo (which would need to be replaced with print_r or var_dump), if an exception is thrown the object should stay empty:
DOMNodeList Object
(
)
Solution
Set recover to true, and strictErrorChecking to false
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->recover = true;
$doc->strictErrorChecking = false;
$doc->loadHTML($content);
Use php's entity-encoding on the markup's contents, which is a most common error source.
replace the simple
$dom->loadHTML($html);
with the more robust ...
libxml_use_internal_errors(true);
if (!$dom->loadHTML($html))
{
$errors="";
foreach (libxml_get_errors() as $error) {
$errors.=$error->message."<br/>";
}
libxml_clear_errors();
print "libxml errors:<br>$errors";
return;
}
$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML(htmlspecialchars($html));
echo $dom;
try this
I know this is an old question, but if you ever want to fix the malformed '&' signs in your HTML, you can use code similar to this:
$page = file_get_contents('http://www.example.com');
$page = preg_replace('/\s+/', ' ', trim($page));
fixAmps($page, 0);
$dom->loadHTML($page);
function fixAmps(&$html, $offset) {
$positionAmp = strpos($html, '&', $offset);
$positionSemiColumn = strpos($html, ';', $positionAmp+1);
$string = substr($html, $positionAmp, $positionSemiColumn-$positionAmp+1);
if ($positionAmp !== false) { // If an '&' can be found.
if ($positionSemiColumn === false) { // If no ';' can be found.
$html = substr_replace($html, '&amp;', $positionAmp, 1); // Replace straight away.
} else if (preg_match('/&(#[0-9]+|[A-Z|a-z|0-9]+);/', $string) === 0) { // If a standard escape cannot be found.
$html = substr_replace($html, '&amp;', $positionAmp, 1); // This means we need to escape the '&' sign.
fixAmps($html, $positionAmp+5); // Recursive call from the new position.
} else {
fixAmps($html, $positionAmp+1); // Recursive call from the new position.
}
}
}
Another possible solution is
$sContent = htmlspecialchars($sHTML);
$oDom = new DOMDocument();
$oDom->loadHTML($sContent);
echo html_entity_decode($oDom->saveHTML());
Another possible cause: maybe your file is saved as an ASCII file; just change the type of the file.
Even after this my code was working fine, so I just removed all warning messages with this statement at line 1.
<?php error_reporting(E_ERROR); ?>
