PHP DOMDocument->getElementByID adding Â in place of empty <span> - php

I'm using PHP's DOMDocument object to parse some HTML (fetched with cURL). When I get an element by ID and output it, any empty <span> </span> tags get an additional character and become <span>Â </span>.
The Code:
<?php
$document = new DOMDocument();
$document->validateOnParse = true;
$document->loadHTML( curl_exec($handle) );
curl_close($handle);
$element = $document->getElementById( __ELEMENT_ID__ );
echo $document->saveHTML();
echo $document->saveHTML($element);
?>
The $document->saveHTML() command behaves as expected and prints out the entire page. But, as I said above, the echo $document->saveHTML($element) command transforms empty <span> tags into <span>Â </span>.
This happens to all <span> </span> tags within $element.
What in this process (of getting the element by ID and outputting the element) is inserting this extra character? I could work around it, but I'm more interested in getting to the root cause.

I was able to fix the problem by setting the character encoding of the page. The page I was fetching did not have a defined character encoding, and my page was just a snippet without defined header info. When I added
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
the problem disappeared.
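A likely explanation, sketched under two assumptions: the "empty" spans actually contain a non-breaking space (U+00A0) stored as UTF-8, and the parser falls back to ISO-8859-1 when no charset is declared. The snippet below only reproduces the byte-level effect with mb_convert_encoding() (it assumes the mbstring extension is available); it does not call DOMDocument at all.

```php
<?php
// Assumption: the span holds a non-breaking space, encoded as UTF-8 (2 bytes).
$nbspUtf8 = "\xC2\xA0";

// Treating those two bytes as ISO-8859-1 yields two characters,
// U+00C2 (Â) and U+00A0, which are then written out again as UTF-8.
$misread = mb_convert_encoding($nbspUtf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($nbspUtf8), "\n"; // c2a0
echo bin2hex($misread), "\n";  // c382c2a0

// The misread bytes are exactly "Â" followed by a non-breaking space.
echo $misread === "\u{00C2}\u{00A0}" ? "matches\n" : "differs\n";
```

Declaring the charset, as in the fix above, means the two bytes are read as one character in the first place.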

Related

Why does DOMDocument::saveHTML()'s behavior differ in encoding UTF-8 as entities in style & script elements?

Given a DOMDocument constructed with a stylesheet that contains an emoji character like so:
$dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><head><meta charset=utf-8><style>span::before{ content: \"⚡️\"; }</style></head><body><span></span></body></html>" );
I've found some strange behavior when serializing the DOM back out to HTML.
If I do $dom->saveHTML( $dom->documentElement ) then I get (as desired):
<html><head><meta charset="utf-8">
<style>span::before{ content: "⚡️"; }</style>
</head><body><span></span></body></html>
However, if I instead do $dom->saveHTML() to save the entire document I get (erroneously):
<html><head><meta charset="utf-8">
<style>span::before{ content: "&#9889;&#65039;"; }</style>
</head><body><span></span></body></html>
Notice how the emoji “⚡️” is encoded as the HTML entities &#9889;&#65039; inside the stylesheet. Browsers do not decode entities there, so the value is treated as a literal string; a CSS escape such as \26A1 would have to be used instead.
I tried setting $dom->substituteEntities = false but without any effect.
The same HTML entity conversion is also happening inside of script tags, which causes similar problems in browsers.
Test via online PHP shell: https://3v4l.org/jMfDd

Get text with PHP Simple HTML DOM Parser

I'm using PHP Simple HTML DOM Parser to get text from a webpage.
The page I need to manipulate looks something like this:
<html>
<head>
<title>title</title>
<body>
<div id="content">
<h1>HELLO</h1>
Hello, world!
</div>
</body>
</html>
I need to get the h1 element and the text that is not wrapped in any tag.
To get the h1 I use this code:
$html = file_get_html("remote_page.html");
foreach($html->find('#content') as $text){
echo "H1: ".$text->find('h1', 0)->plaintext;
}
But how do I get the other text?
I also tried this inside the foreach, but I get the full text:
$text->plaintext;
It returns the H1 content as well.
It looks like $text->find('text',2); gets what you're looking for, however I'm not sure how well that will work when the amount of text nodes is unknown. I'll keep looking.
You can simply strip the HTML tags using strip_tags:
<?php
strip_tags($input, '<br>');
?>
Use strip_tags, as @Peachy pointed out. However, passing it a second argument of <br> means strip_tags will leave <br> tags in place, which is unnecessary. In your case,
<?php
strip_tags($text);
?>
would work as you'd like, given that you are only selecting content inside the element with the id content.
Try this:
echo "H1: ".$text->find('h1', 0)->innertext;
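For comparison, here is a minimal sketch that separates the heading from the untagged text. It swaps in PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and it inlines a copy of the sample page (with the missing </head> added) instead of fetching remote_page.html; both choices are assumptions made to keep the example self-contained.

```php
<?php
// Inlined copy of the sample page from the question.
$html = '<html><head><title>title</title></head><body>'
      . '<div id="content"><h1>HELLO</h1>Hello, world!</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// string(...) returns the text content of the first matching h1.
$h1 = $xpath->evaluate('string(//div[@id="content"]/h1)');

// text() selects only text nodes that are direct children of the div,
// so the text inside the h1 is not included.
$loose = '';
foreach ($xpath->query('//div[@id="content"]/text()') as $node) {
    $loose .= $node->nodeValue;
}
$loose = trim($loose);

echo "H1: $h1\n";     // H1: HELLO
echo "Text: $loose\n"; // Text: Hello, world!
```

Because the query selects every direct text child, it does not depend on knowing how many text nodes the div contains.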

Preserve utf8 when loading HTML from file

Well, apparently, PHP and its standard libraries have some problems, and DOMDocument isn't an exception.
There are workarounds for utf8 characters when loading HTML string - $dom->loadHTML().
Apparently, I haven't found a way to do this when loading HTML from file - $dom->loadHTMLFile(). While it reads and sets the encoding from <meta /> tags, the problem strikes back if I haven't defined those. For instance, when loading a fragment of HTML (template part, like, footer.html), not a fully built HTML document.
So, how do I preserve utf8 characters when loading HTML from a file that doesn't have its <meta /> tags present, when adding them is not an option?
Update
footer.html (the file is encoded in UTF-8 without BOM):
<div id="footer">
<p>My sūpēr ōzōm ūtf8 štrīņģ</p>
</div>
index.php:
$dom = new DOMDocument;
$dom->loadHTMLFile('footer.html');
echo $dom->saveHTML(); // results in all familiar effed' up characters
Thanks in advance!
Try a hack like this one:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper
Several others are listed in the user comments here: http://php.net/manual/en/domdocument.loadhtml.php. It is also important that your document head includes a meta tag to specify the encoding FIRST, directly after the <head> tag.
I would suggest using my answer here: https://stackoverflow.com/a/12846243/816753 and instead of adding another <head>, wrap your entire fragment in
<html>
<head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>
<body><!-- your content here --></body>
</html>
While I'm not sure about how to go about solving the problem with ->loadHTMLFile(), have you considered using file_get_contents() to get the HTML, run mb_convert_encoding() on that string, then pass that value in to ->loadHTML()?
Edit: Also, when you initialize DOMDocument, are you giving it the $encoding argument?
The key is for your browser only. Once the page is all built up, your browser should display the page correctly if it has the meta at the end.
You can always try to use the utf8_decode (or encode, I'm never sure lol) function before echo'ing the data like so:
echo utf8_decode($dom->saveHTML());
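The "read the file yourself, convert, then loadHTML()" idea suggested above can be sketched as follows. This is one possible variant, not the method any of the answers spell out: it uses mb_encode_numericentity() (assuming the mbstring extension) to turn every non-ASCII character into a numeric entity, so the parser never has to guess an encoding. The fragment is inlined here; in practice it would come from file_get_contents('footer.html').

```php
<?php
// Stand-in for file_get_contents('footer.html'); the source is UTF-8.
$html = "<div id=\"footer\">\n<p>My sūpēr ōzōm ūtf8 štrīņģ</p>\n</div>";

// Replace every character in U+0080..U+10FFFF with a numeric entity.
// The result is plain ASCII, so no charset declaration is needed.
$ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($ascii);

// Strings read back from the DOM are always UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n"; // My sūpēr ōzōm ūtf8 štrīņģ
```

The fragment itself stays untouched on disk, which fits the constraint that adding <meta /> tags to the file is not an option.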

PHP XPath and Line Numbers

The following XPath works, but how can I get the line number at which it finds the p element with class 'blah'?
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
</head>
<body>
<p class="blah">some text here</p>
</body>
</html>');
$xpath = new DOMXPath($doc);
$xpath_query = "//*[contains(@class, 'blah')]";
?>
XPath and DOM have no concept of a line number. They only see nodes, and the linkages between them.
The DOM object itself may have some internal metadata which can relate a node back to which line it was on in the source file, but you'd have to rummage around inside the object and the DOM source to find out. Doesn't seem to be anything mentioned at http://php.net/dom.
Alternatively, if the node you're looking at, and/or the surrounding HTML is fairly/totally unique in the document, you could search the raw html for the matching HTML text of the node and get a line number that way.
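For what it's worth, PHP's DOM extension does expose the parser's line information through DOMNode::getLineNo(). A minimal sketch using the document from the question (with the class attribute in double quotes so the PHP string stays valid) might look like this; how accurate the reported number is for a given input depends on libxml, so treat it as a hint rather than a guarantee.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
</head>
<body>
<p class="blah">some text here</p>
</body>
</html>');

$xpath = new DOMXPath($doc);

foreach ($xpath->query("//*[contains(@class, 'blah')]") as $node) {
    // getLineNo() returns the line on which the parser saw the node.
    $line = $node->getLineNo();
    echo $node->nodeName, ' found on line ', $line, "\n";
}
```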

html to text with domdocument class

How can I get an HTML page's source code without the HTML tags?
For example:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="content-language" content="hu"/>
<title>this is the page title</title>
<meta name="description" content="this is the description" />
<meta name="keywords" content="k1, k2, k3, k4" />
start the body content
<!-- <div>this is comment</div> -->
open
End now one noframes tag.
<noframes><span>text</span></noframes>
<select name="select" id="select"><option>ttttt</option></select>
<div class="robots-nocontent"><span>something</span></div>
<img src="url.png" alt="this is alt attribute" />
I need this result:
this is the page title this is the description k1, k2, k3, k4 start the body content this is title attribute open End now one noframes tag. text ttttt something this is alt attribute
I also need the title and alt attributes.
Any ideas?
You could do it with a regex.
$regex = '/<[^>]*>/';
would be a very simple start to remove anything with < and > around it. But in order to do this, you're going to have to pull in the HTML as a file_get_contents() or some other function that will turn the code into text.
Addendum:
If you want individual attributes pulled as well, you're going to have to write a more complex regex to pull that text out. For instance:
$regex2 = '/\<.(?<=(title))(\=\").(?=\")/';
Would pull out (I think... I'm still learning RegEx) any text between < and title=", assuming you had no other matching expressions before title. Again, this would be a pretty complicated regex process.
This cannot be done in a fully automated way. PHP cannot know which node attributes you want to keep or omit. You'd either have to write code that iterates over all attributes and text nodes, driven by a map that defines when to use a node's content, or just pick what you want with XPath, one by one.
An alternative would be to use XMLReader. It allows you to iterate over the entire document and define callbacks for the element names. This way, you can define what to do with what element. See
http://www.ibm.com/developerworks/library/x-pullparsingphp.html
My solution is a bit more complicated, but it worked fine for me.
If you are sure that you have XHTML, you can simply consider the code as XML (but you have to put everything in a proper wrapping).
Then with XSLT you can define some basic templates that do what you need.
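The "pick what you want with XPath" approach mentioned above can be sketched roughly as follows. The sample markup is a shortened, hypothetical version of the page in the question; in particular, the <a title="..."> element is an assumption, since the original link markup is not shown. Which attributes to collect is a choice encoded in the query, not something the parser decides.

```php
<?php
// Hypothetical sample; the <a title="..."> element is an assumption.
$html = '<html><head><title>this is the page title</title>'
      . '<meta name="description" content="this is the description"></head>'
      . '<body><a href="url.html" title="this is title attribute">open</a>'
      . '<!-- <div>this is comment</div> -->'
      . '<img src="url.png" alt="this is alt attribute"></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Non-empty text nodes (comments are not text nodes) plus the
// attributes we decided to keep: meta content, title and alt.
$nodes = $xpath->query(
    '//text()[normalize-space()] | //meta[@name]/@content | //@title | //@alt'
);

$parts = [];
foreach ($nodes as $node) {
    $parts[] = trim($node->nodeValue);
}

$result = implode(' ', $parts);
echo $result, "\n";
```

Extending the union with further attribute paths is how more fields would be added; text inside script or style elements would also match //text() and would need to be excluded explicitly on real pages.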
