Trying to use DOMDocument::loadHTMLFile with a generated url

I'm calling the DOMDocument::loadHTMLFile method with a URL I built.
This is the code I used to build the URL:
$url = "http://en.wikipedia.org".$path;
The $path is obtained from an href attribute in another file. When I echo it, it prints /wiki/Pop_music
If I hardcode the URL as http://en.wikipedia.org/wiki/Pop_music the page loads fine, but if I use my generated path I get errors.
This is the code I'm currently working with:
foreach ($paths as $path)
{
echo $path; // will cause error
//echo $path = '/wiki/Pop_music'; // will work
$url = "http://en.wikipedia.org"."$path";
$doc = getHTML($url, 1);
if($doc !== false)
{
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
}
The getHTML function is:
function getHTML($url, $domainID)
{
$conArtistsCrawler = new mysqli(HOST, USERNAME, PASSWORD, CRAWLER_DB_NAME);
// Load HTML
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
// Update the time to show that the domain was crawled.
$sql = "UPDATE Domain SET LastCrawled = CURRENT_TIMESTAMP() WHERE DomainID = '$domainID'";
$conArtistsCrawler->query($sql);
$conArtistsCrawler->close();
// Delay 1 second after the request to avoid getting BANNED
sleep(1);
// Check to see if URL is valid
if($isSuccessful === false)
{
//URL invalid!
echo "\"".$url."\" is invalid<br>";
return false;
}
return $doc;
}
The code outputs:
With hardcoded path:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already
defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music Warning: DOMDocument::loadHTMLFile(): Tag audio
invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Warning: DOMDocument::loadHTMLFile(): Tag source invalid in
http://en.wikipedia.org/wiki/Pop_music, line: 225 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Pop music
With path variable:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in
http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in
/Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music
Warning:
DOMDocument::loadHTMLFile(http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E):
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77
Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load
external entity "http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E"
in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line
77 "http://en.wikipedia.org/wiki/Pop_music " is invalid

Short answer:
Well, the error you're getting occurs because $doc is not a DOMDocument object but the boolean false. Since you're suppressing DOMDocument warnings, you can't know why getHTML() is returning false.
So, lose the @ operator, check what DOMDocument is complaining about,
and debug from there.
Edit:
but I am still unsure why when I pass in the variable I get a
different result than when I hardcode it. When I echo both path values
or url values they look identical
They are certainly not identical. You have a <br /> tag after Pop_music, which makes the URL invalid. It is invisible when echoed to a browser, but it shows up URL-encoded as %3Cbr%20/%3E in the warning.
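One way to guard against this is to clean each scraped href before building the URL. The following is only a minimal sketch: buildWikiUrl is a hypothetical helper name, and it assumes the stray markup is the only thing wrong with the path.

```php
<?php
// Hypothetical helper (not from the original code): strip markup and
// surrounding whitespace from a scraped href before appending it to the base URL.
function buildWikiUrl(string $path): string
{
    // strip_tags() removes the stray <br />, trim() removes spaces and newlines.
    $clean = trim(strip_tags($path));
    return 'http://en.wikipedia.org' . $clean;
}

// The scraped value looks fine in a browser but actually ends with a tag.
$path = "/wiki/Pop_music<br />";
echo buildWikiUrl($path), PHP_EOL; // http://en.wikipedia.org/wiki/Pop_music
```

Running var_dump($path) on the raw value, or viewing the page source instead of the rendered page, would also reveal the hidden tag.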
Long Answer
Running this script:
$path = '/wiki/Pop_music';
$url = "http://en.wikipedia.org$path";
$doc = new \DOMDocument();
$success = @$doc->loadHTMLFile($url);
if ($success) {
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
produces the following result:
Pop music<br />
So, in order to troubleshoot your script there are a couple of things you should do...
Lose the @ operator
Do not use the @ operator. It swallows every warning thrown at you and makes debugging a lot harder. In all truth, DOMDocument complains a lot, sometimes about things that aren't really errors (such as some HTML5 tags). But it also raises valid warnings, such as malformed HTML or an unreachable URL.
The best way to handle this is to install a custom error handler
before calling DOMDocument.
This lets you digest the warnings given by DOMDocument and differentiate between important and trivial ones.
Example:
set_error_handler(function($errno, $errstr, $errfile, $errline) {
//Digest error here
});
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
restore_error_handler();
Note: You can also use libxml_use_internal_errors(true);
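As a rough sketch of the libxml_use_internal_errors() route, the snippet below collects parser messages instead of letting them surface as PHP warnings. The inline $html string is an assumption that stands in for a fetched page, and which messages libxml reports for it depends on the installed libxml version.

```php
<?php
// Inline markup stands in for a fetched page (an assumption for this sketch).
$html = '<html><body><h1 id="firstHeading">Pop music</h1><audio></audio></body></html>';

// Collect libxml messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$ok = $doc->loadHTML($html);

// Grab whatever the parser complained about, then reset the buffer and the setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with level, code, line, column and message.
    printf("[level %d] line %d: %s\n", $error->level, $error->line, trim($error->message));
}

$xpath = new DOMXPath($doc);
echo $xpath->query("//h1[@id='firstHeading']")->item(0)->nodeValue, PHP_EOL; // Pop music
```

Here you could decide per message whether it is trivial (an unknown HTML5 tag) or worth acting on.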
Your getHTML function returns an inconsistent type
Your getHTML function can return either a DOMDocument object or a boolean. While that isn't a bad thing per se (and internally, PHP does that with a lot of functions), it means you can't assume $doc is an object, because it can be the boolean false. So you have to test the returned value before passing it as an argument to DOMXPath. In fact, that's the error you're getting:
You're passing a boolean to DOMXPath instead of a DOMDocument object.
Either throw an exception (or error) in the function or test the returned value before passing it to DOMXPath.
Example:
$doc = getHTML($url, 1);
if ($doc instanceof \DOMDocument) {
$xpath = new DOMXPath($doc);
}

Related

PHP How to avoid this warning: DOMDocument::loadHTML(): Invalid char in CDATA

I'm trying to collect some info from a web service, but I'm having issues with the CDATA Section of a page, because everything goes right when I use something like this:
$url = 'http://www.example.com';
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach($doc->getElementsByTagName('h3') as $subtitle) {
echo $subtitle->textContent; //The output is the Subtitle/s.
}
But when the page contains CDATA sections, the line $doc->loadHTML($content) fails with this error:
Warning: DOMDocument::loadHTML(): Invalid char in CDATA
I found a solution here that I tried to implement, without success:
function sanitize_html($content) {
if (!$content) return '';
$invalid_characters = '/[^\x9\xa\x20-\xD7FF\xE000-\xFFFD]/';
return preg_replace($invalid_characters,'', $content);
}
$url = 'http://www.example.com';
$content = file_get_contents($url);
$cleanContent = sanitize_html($content);
$doc = new DOMDocument();
$doc->loadHTML($cleanContent); //Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
But I got this other error:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
What could be a good way to deal with the CDATA sections of a page? Greetings.

The solution is to replace the & symbol with &amp;
or, if you must keep that & as it is, maybe you could enclose it in a CDATA section: <![CDATA[ - ]]>
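A minimal sketch of that replacement, assuming the goal is to escape only the ampersands that do not already start an entity. escapeBareAmpersands is a hypothetical helper name, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper: escape every "&" that is not already the start of a
// named (&amp;), decimal (&#169;) or hexadecimal (&#xA9;) entity.
function escapeBareAmpersands(string $html): string
{
    $pattern = '/&(?!(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)/';
    return preg_replace($pattern, '&amp;', $html);
}

// Made-up sample containing both bare and already-escaped ampersands.
$html = '<p>Tom & Jerry &amp; friends &#169; <a href="/s.php?a=1&b=2">link</a></p>';
echo escapeBareAmpersands($html), PHP_EOL;
// <p>Tom &amp; Jerry &amp; friends &#169; <a href="/s.php?a=1&amp;b=2">link</a></p>
```

The negative lookahead leaves existing entities alone, so running the function twice does not double-escape anything.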

Try adding PCLZIP before load IOFactory as shown:
require_once '/Classes/PHPExcel.php';
\PHPExcel_Settings::setZipClass(\PHPExcel_Settings::PCLZIP);

Add libxml_use_internal_errors(true) and libxml_clear_errors(). This worked for me; see the code in the screenshot below:
https://i.stack.imgur.com/6MN4H.png

Domdocument Load not loading

I am trying to load an XML file from a URL (i.e. an RSS feed).
But when I use
$doc = new DOMDocument();
$doc->load($url);
if($doc->load($url,LIBXML_NOWARNING)===false)
{
echo "Hello";
//echo @$doc->load($url,LIBXML_NOWARNING);
//exit;
$error = $doc->load($url);
print_r($error);exit;
}
It only prints Hello.
No warning is displayed for line 2.
Please tell me how to find out which error occurs, as I am getting nothing.

Remove exit; from the code after echo "Hello"; that is the reason

That is because the rendered content is not visible. Try pressing Ctrl+U on your browser.
Also, instead of print_r try with var_dump
$error = $doc->load($url);
var_dump($error);
EDIT :
So it seems like your $doc->load failed in the first place. You need to change your if statement to
if($doc->load($url,LIBXML_NOWARNING)===true) // Replaced false with true.
or simply
if($doc->load($url,LIBXML_NOWARNING))
Your XML load failed that's why it went inside the if statement. Check whether the URL is spelled right or check if the URL really exists.

Try using libxml_use_internal_errors() to capture XML parsing errors:
<?php
$doc = new DOMDocument();
$doc->recover = true;
libxml_use_internal_errors(true);
$url = 'http://page2rss.com/rss/91a83628a27c43b6ab4f0b3959f69f5a';
$doc->load($url);
$errors = libxml_get_errors();
foreach ($errors as $error) {
printf("Error %d at line %d, column %d:\n\t%s\n",
$error->code, $error->line, $error->column, $error->message);
}
libxml_use_internal_errors(false);
// Error 9 at line 82, column 155:
// Input is not proper UTF-8, indicate encoding !
// Bytes: 0xAE 0x20 0x28 0x52

beginner attempting to read xml into php

I have an XML feed located here that I am trying to read into a PHP script, then cycle through the <packages> and sum the <downloads>. I've attempted to do this using DOMDocument, but have so far failed.
The basic method I've been trying to use is as follows:
<?php
$dom = new DomDocument;
$dom->loadXML('http://www.phogue.net/feed');
$packages = $dom->getElementsByTagName('package');
foreach($packages as $item)
{
echo $item->getAttribute('uid').'<br>';
}
?>
The above code is meant to just print out the name of each item, but it's not working. I am currently getting the following error:
Warning: DOMDocument::loadXML() [domdocument.loadxml]: Start tag expected, '<' not found in Entity, line: 1 in /home/a8744502/public_html/userbar.php on line 3
WORKING CODE:
<?php
$dom = new DomDocument;
$dom->load('http://www.phogue.net/feed/');
$package = $dom->getElementsByTagName('package');
$value=0;
foreach ($package as $plugin) {
$downloads = $plugin->getElementsByTagName("downloads");
$download = $downloads->item(0)->nodeValue;
$authors = $plugin->getElementsByTagName("author");
$author = $authors->item(0)->nodeValue;
if($author == "Zaeed")
{
$value += $download;
}
}
echo $value;
?>

DOMDocument::loadXML() expects a string of XML. Try DOMDocument::load() instead - http://www.php.net/manual/en/domdocument.load.php
Keep in mind that to open an XML file via HTTP, you will need the appropriate wrapper enabled.
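To make the difference concrete, here is a small sketch that uses loadXML() with an inline string. The sample feed is invented; its element names are assumed from the code in the question, not taken from the real feed.

```php
<?php
// Sample feed standing in for the real one; the structure is assumed
// from the element names used in the question's code.
$xml = <<<XML
<packages>
  <package uid="a"><author>Zaeed</author><downloads>10</downloads></package>
  <package uid="b"><author>Other</author><downloads>5</downloads></package>
  <package uid="c"><author>Zaeed</author><downloads>7</downloads></package>
</packages>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);        // loadXML() takes the XML text itself
// $dom->load('feed.xml');  // load() takes a filename or URL instead

// Sum the <downloads> value of every <package>.
$total = 0;
foreach ($dom->getElementsByTagName('package') as $package) {
    $total += (int) $package->getElementsByTagName('downloads')->item(0)->nodeValue;
}
echo $total, PHP_EOL; // 22
```

Passing a URL to loadXML() makes the parser treat the URL text itself as the document, which is why it complains that no '<' start tag was found.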

You have an open parenthesis at the beginning of your echo.

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom;
throws
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,
Catchable fatal error: Object of class DOMDocument could not be converted to string in test.php on line 10

To suppress the warning, you can use libxml_use_internal_errors(true)
// create new DOMDocument
$document = new \DOMDocument('1.0', 'UTF-8');
// set error level
$internalErrors = libxml_use_internal_errors(true);
// load HTML
$document->loadHTML($html);
// Restore error level
libxml_use_internal_errors($internalErrors);

I would bet that if you looked at the source of http://www.somesite.com/ you would find special characters that haven't been converted to HTML entities. Most likely it is a link whose href contains a bare & (for example between query-string parameters), which should be written as &amp;

$dom->@loadHTML($html);
This is incorrect, use this instead:
@$dom->loadHTML($html);

There are 2 errors: the second is because $dom is not a string but an object and thus cannot be "echoed". The first is a warning from loadHTML, caused by invalid syntax in the HTML document being loaded (probably an & (ampersand) used as a parameter separator and not escaped as the entity &amp;).
You can ignore and suppress this error message (not the error, just the message!) by calling the function with the error control operator "@" (http://www.php.net/manual/en/language.operators.errorcontrol.php )
@$dom->loadHTML($html);

The reason for your fatal error is DOMDocument does not have a __toString() method and thus can not be echo'ed.
You're probably looking for
echo $dom->saveHTML();
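A short sketch of that, using an inline string in place of the fetched page (an assumption so the example runs without a network connection):

```php
<?php
// Inline markup stands in for the fetched page (an assumption for this sketch).
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>Hello</p></body></html>');

// saveHTML() serializes the whole document back to a string.
$output = $dom->saveHTML();
echo $output;

// Passing a node serializes only that node and its children.
$paragraph = $dom->getElementsByTagName('p')->item(0);
echo $dom->saveHTML($paragraph), PHP_EOL;
```

The full-document output will also contain whatever the parser added, such as a DOCTYPE line.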

Regardless of the echo (which would need to be replaced with print_r or var_dump), if an exception is thrown the object should stay empty:
DOMNodeList Object
(
)
Solution
Set recover to true, and strictErrorChecking to false
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->recover = true;
$doc->strictErrorChecking = false;
$doc->loadHTML($content);
Use php's entity-encoding on the markup's contents, which is a most common error source.

replace the simple
$dom->loadHTML($html);
with the more robust ...
libxml_use_internal_errors(true);
if (!$dom->loadHTML($html))
{
$errors="";
foreach (libxml_get_errors() as $error) {
$errors.=$error->message."<br/>";
}
libxml_clear_errors();
print "libxml errors:<br>$errors";
return;
}

$html = file_get_contents("http://www.somesite.com/");
$dom = new DOMDocument();
$dom->loadHTML(htmlspecialchars($html));
echo $dom;
try this

I know this is an old question, but if you ever want to fix the malformed '&' signs in your HTML, you can use code similar to this:
$page = file_get_contents('http://www.example.com');
$page = preg_replace('/\s+/', ' ', trim($page));
fixAmps($page, 0);
$dom = new DOMDocument();
$dom->loadHTML($page);
function fixAmps(&$html, $offset) {
$positionAmp = strpos($html, '&', $offset);
$positionSemiColumn = strpos($html, ';', $positionAmp+1);
$string = substr($html, $positionAmp, $positionSemiColumn-$positionAmp+1);
if ($positionAmp !== false) { // If an '&' can be found.
if ($positionSemiColumn === false) { // If no ';' can be found.
$html = substr_replace($html, '&amp;', $positionAmp, 1); // Replace straight away.
} else if (preg_match('/&(#[0-9]+|[A-Z|a-z|0-9]+);/', $string) === 0) { // If a standard escape cannot be found.
$html = substr_replace($html, '&amp;', $positionAmp, 1); // This means we need to escape the '&' sign.
fixAmps($html, $positionAmp+5); // Recursive call from the new position.
} else {
fixAmps($html, $positionAmp+1); // Recursive call from the new position.
}
}
}

Another possible solution is
$sContent = htmlspecialchars($sHTML);
$oDom = new DOMDocument();
$oDom->loadHTML($sContent);
echo html_entity_decode($oDom->saveHTML());

Another possible solution: maybe your file is an ASCII-type file; just change the type of your file.

Even with this warning my code works fine, so I just suppressed all warning messages with this statement at line 1.
<?php error_reporting(E_ERROR); ?>

PHP returning page error on simplexml print_r

The problem only happens with one file when I try to use a DOMDocument/SimpleXML method, so it seems like the issue is with that file. No clue what it could be.
If I do the following:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
print_r($xml);
in Chrome, I get a "Page Unavailable" error. In Firefox, I get nothing.
If I do the same thing but to a "test2.html", I get a print out as expected.
If I try the same thing but doing it this way:
$file = "test1.html";
$data = file_get_contents($file);
$dom = DOMDocument::loadHTML($data);
$xml = simplexml_import_dom($dom);
print_r($xml);
I get the same issue.
If I comment out the print_r line, Chrome goes from the "Page Unavailable" to blank.
I changed the permissions to 777, in case that was an issue, no fix.
I tried simply echoing out the contents of the html, no problem at all.
Any clues as to why a) Chrome would do that, and b) why I'm not getting any usable results?
Update:
If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
$xml = simplexml_import_dom($dom);
print_r($xml);
}
I get the same issue. If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
echo "Load!";
}
I get the "Load!" output, meaning that the dom method shouldn't be the problem (?)
I'll try the same exact test with the simplexml.
Update2:
If I do this:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
}
I get "Load!" but if I do:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
print_r($xml);
}
I get the error. I did finally notice that I had an option to view the error in Chrome:
Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error.
The troublesome html file is 288Kb. Could that be the issue? If so, how would I adjust for that?
Last Update:
Very odd. I can use methods and functions on the object (as SimpleXML or DOMDocument), so I can do things like XPath to delete or parse the HTML, etc. In some cases (small results) it can echo out results, but for big stuff (show all spans), it fails in the same way.
So, since I think the end result will fit within these limits, I SHOULD be okay (I guess).
But any real solution is very welcome.

Turn on error reporting: error_reporting(E_ALL); in the first line of your PHP code.
Check the memory limit of your PHP configuration: memory_limit in the respective php.ini
What's the difference between test1.html and test2.html? Perhaps test1.html is not well-formed.

DOMDocument and/or SimpleXML may bail out if the document is malformed. Try something like:
$dom = DOMDocument::loadHTMLFile($file);
if (!$dom) {
echo 'Loading file failed';
exit;
}
$xml = simplexml_import_dom($dom);
if (!$xml) {
...
}
If creating the $dom worked, conversion to $xml should work as well, but make sure anyway.
Edit: As Gehrig said, make sure error reporting is on, that should make it obvious where the process fails.
