What does LIBXML_NOBLANKS do, exactly? - php

What is the difference between
$domd=new DOMDocument();
$domd->loadHTML($html, LIBXML_NOBLANKS);
and
$domd=new DOMDocument();
$domd->loadHTML($html, 0);
?
Edit: in case someone wants to remove all empty or whitespace-only text nodes (which is not exactly what LIBXML_NOBLANKS does), here is a function that does just that:
$removeAnnoyingWhitespaceTextNodes = function (\DOMNode $node) use (&$removeAnnoyingWhitespaceTextNodes): void {
    if ($node->hasChildNodes()) {
        // Warning: iterate backwards. Removing nodes while going forwards can invalidate
        // the DOMNodeList index, which is why foreach() is not used here.
        for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
            $removeAnnoyingWhitespaceTextNodes($node->childNodes->item($i));
        }
    }
    if ($node->nodeType === XML_TEXT_NODE && !$node->hasChildNodes() && !$node->hasAttributes() && (strlen(trim($node->textContent)) === 0)) {
        $node->parentNode->removeChild($node);
    }
};
$dom=new DOMDocument();
$dom->loadHTML($html);
$removeAnnoyingWhitespaceTextNodes($dom);
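For comparison, the same cleanup can probably be done with XPath instead of recursion. This is only a minimal sketch: the sample XML and the variable names are made up for illustration. DOMXPath::query() returns a node list that is not live, so removing nodes inside the foreach should be safe here. Note that normalize-space() only treats space, tab, CR and LF as whitespace, which is slightly narrower than what trim() strips by default.

```php
<?php
// Sample document (an assumption for this sketch, not from the question).
$dom = new DOMDocument();
$dom->loadXML("<doc>\n  <elem>text</elem>\n  <elem> </elem>\n</doc>");

$xpath = new DOMXPath($dom);
// Select every text node whose content is empty after whitespace normalization.
foreach ($xpath->query('//text()[not(normalize-space())]') as $blank) {
    $blank->parentNode->removeChild($blank);
}

echo $dom->saveXML($dom->documentElement), "\n";
```

After the loop, doc has only its two elem children left, and the second elem is empty.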

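The LIBXML_NOBLANKS parser option removes text nodes that contain only whitespace ("blank" nodes). libxml2 decides heuristically which of them can be dropped, so it does not necessarily remove every such node. Consider the following document, for example: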
<doc>
<elem>text</elem>
</doc>
Normally, the element doc has three children: A whitespace text node, the element elem and another whitespace text node. When parsing with LIBXML_NOBLANKS, the doc element will only have a single element child.
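To check this yourself, here is a small sketch that counts the children of doc with and without the option. It uses loadXML() because the sample is XML; the variable names are made up for illustration.

```php
<?php
$xml = "<doc>\n<elem>text</elem>\n</doc>";

// Parse once with default options.
$plain = new DOMDocument();
$plain->loadXML($xml);

// Parse again with LIBXML_NOBLANKS.
$noBlanks = new DOMDocument();
$noBlanks->loadXML($xml, LIBXML_NOBLANKS);

// Count the children of <doc> in each case.
$plainCount = $plain->documentElement->childNodes->length;
$noBlanksCount = $noBlanks->documentElement->childNodes->length;
echo "default: $plainCount, LIBXML_NOBLANKS: $noBlanksCount\n";
```

With the behaviour described above, this should report 3 children by default and 1 with LIBXML_NOBLANKS.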

Probably:
LIBXML_NOBLANKS Removes all insignificant whitespace within the document.
However, I found no clear sign that this (borrowed) description matches what the PHP documentation says:
LIBXML_NOBLANKS (int)
Remove blank nodes
That made me wonder, and I guess the PHP documentation is referring to libxml2:
XML_PARSE_NOBLANKS = 256 : remove blank nodes
There is more Q&A at https://stackoverflow.com/a/18521956/367456, and it seems that "blank nodes" is probably not the same thing as insignificant whitespace.

Related

PHP DOM Why does removing a child node of an element with removeChild interrupt a foreach loop over its child nodes?

I have encountered a puzzling behavior of the DOM method removeChild. When looping over the child nodes of a DOMElement, removing one of these nodes along the way interrupts the loop, i.e., the loop does not iterate over the remaining child nodes.
Here is a minimal example:
$test_string = <<<XML
<test>
<text>A sample text with <i>mixed content</i> of <b>various sorts</b></text>
</test>
XML;
$test_DOMDocument = new DOMDocument();
$test_DOMDocument->loadXML($test_string);
$test_DOMNode = $test_DOMDocument->getElementsByTagName("text");
foreach ($test_DOMNode as $text) {
foreach ($text->childNodes as $node) {
if (preg_match("/text/", $node->nodeValue)) {
echo $node->nodeValue;
$node->parentNode->removeChild($node);
} else {
echo $node->nodeValue;
}
}
}
If I comment out the line $node->parentNode->removeChild($node);, then the output is the entire test string, i.e., A sample text with mixed content of various sorts, as expected. With that line, however, only the first child node is output, i.e., A sample text with. That is, removing the first child node as the loop passes over it apparently interrupts the loop; the remaining child nodes are not processed. Why is that?
Thanks in advance for your help!
Implementing the suggestions of the comments on my question, I came up with the following solution:
$test_string = <<<XML
<test>
<text>A sample text with <i>mixed content</i> of <b>various sorts</b></text>
</test>
XML;
$test_DOMDocument = new DOMDocument();
$test_DOMDocument->loadXML($test_string);
$test_DOMNode = $test_DOMDocument->getElementsByTagName("text");
foreach ($test_DOMNode as $text) {
$child_nodes = $text->childNodes;
for($n = $child_nodes->length-1; $n >= 0; --$n) {
$node = $child_nodes->item($n);
if (preg_match("/text/", $node->nodeValue)) {
echo $node->nodeValue;
$node->parentNode->removeChild($node);
} else {
echo $node->nodeValue;
}
}
}
That is, I go through the child nodes in reverse order, using a method suggested in another posting. In this way, all nodes are processed: The output is various sorts of mixed contentA sample text with. Note the reverse order of the text fragments. In my specific use case, this reversal does not matter because I am not actually echoing the text nodes, but performing another kind of operation on them.
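If the original order matters, another option is to copy the child nodes into a plain array first and loop over that copy. This is a sketch, not the method from the comments: childNodes is a live list, while the array from iterator_to_array() is a snapshot that is not affected by removals. The $output variable is only there to collect the text.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<test><text>A sample text with <i>mixed content</i> of <b>various sorts</b></text></test>');

$output = '';
foreach ($doc->getElementsByTagName('text') as $text) {
    // Copy the live DOMNodeList into an array before modifying the tree.
    foreach (iterator_to_array($text->childNodes) as $node) {
        $output .= $node->nodeValue;
        if (preg_match('/text/', $node->nodeValue)) {
            $node->parentNode->removeChild($node);
        }
    }
}
echo $output, "\n";
```

Here all four child nodes are visited in document order, and only the first one (the only one containing "text") is removed.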

PHP DOM parsing text between <hr> tags

I am trying to parse some HTML to get the text between two <hr> tags using DOM with PHP but I don't get any output when I pass in hr into getElementsByTagName:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<hr>Text<hr>");
$hr = $dom->getElementsByTagName("hr");
for ($i=0; $i<$hr->length; $i++) {
echo "[". $i . "]" . $hr->item($i)->nodeValue . "</br>";
}
?>
When I run this code, it doesn't output anything. However, if I change "hr" to "*", it outputs:
[0]Text
[1]Text
[2]
[3]
(Why four lines of results?)
I run this code on a webserver which has PHP version 7.1.3 running. I can't use functions such as file_get_html or str_get_html because it returns an error about Undefined call to function ...
Why doesn't the hr tag produce results?
Perhaps what you're looking for is the contents of the text node between two <hr> elements? In that case we go looking for siblings with an XPath expression:
<?php
$dom = new DOMDocument();
$dom->loadHTML("Some text<hr>The text<hr>Other text");
$xp = new DomXPath($dom);
$result = $xp->query("//text()[(preceding-sibling::hr and following-sibling::hr)]");
foreach ($result as $i=>$node) {
echo "[$i]$node->textContent<br/>\n";
}
This happens because the <hr> element has no child nodes (text nodes are children too), so its nodeValue is empty.
To get the text between the <hr> nodes, you have to iterate over all nodes on the same level and check that the current node is a text node (nodeType == 3), that its previous sibling is an hr element and that its next sibling is an hr element too.
<?php
$dom = new DOMDocument();
$dom->loadHTML("<hr>Text<hr>");
// loadHTML() wraps the markup in <html><body>, so look at the children of <body>.
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $childNode) {
    if (XML_TEXT_NODE !== $childNode->nodeType) {
        continue;
    }
    // DOMDocument reports HTML element names in lower case.
    if (!$childNode->previousSibling || ('hr' !== $childNode->previousSibling->nodeName)) {
        continue;
    }
    if (!$childNode->nextSibling || ('hr' !== $childNode->nextSibling->nodeName)) {
        continue;
    }
    echo "{$childNode->nodeValue}\n";
}
But if you want to get anything between the hr nodes it will be more complicated.
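As for the four lines the question got with "*": a quick way to see where they come from is to print the element names next to their text content. This is just an illustrative sketch; $lines is a made-up variable.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML("<hr>Text<hr>");

$lines = [];
foreach ($dom->getElementsByTagName('*') as $el) {
    // nodeName is the tag name; textContent includes the text of all descendants.
    $lines[] = $el->nodeName . ': ' . json_encode($el->textContent);
}
echo implode("\n", $lines), "\n";
```

The parser adds html and body around the markup, so the list contains html, body and the two hr elements. The first two report "Text" because textContent includes descendant text; the hr elements are empty.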

getting the text content of a specific DOMElement

After a little hairpulling, I discovered that DOMElement->textContent also returns the combined text from the children of that element.
Looking around a bit I saw people suggesting DOMElement->firstChild->textContent but this is no good for me because I'm looking through the document following the hierarchy and cues from element attributes, the data is just as likely to be on a branch rather than a leaf so I would get multiple hits even though only one of them is the correct one.
Is there an actual way to get the text content of this one specific element and none of its children's?
EDIT: nvm, found a way to make sure
function get_text($el) {
if (is_a($el->firstChild, "DOMText")) return $el->firstChild->textContent;
return "";
}
Simply iterate the child nodes and check whether each one is a text node. You might want to skip the nodes consisting of only space characters, though:
function getNodeText(DOMNode $node) {
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->textContent;
    }
    $node = $node->firstChild;
    while ($node) {
        // Compare against '' explicitly so that a text like "0" is not skipped.
        if ($node->nodeType === XML_TEXT_NODE &&
            ($text = trim($node->textContent)) !== '')
        {
            return $text;
        }
        $node = $node->nextSibling;
    }
    return '';
}
$xml = <<<'EOXML'
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child>
<x>x text</x>
child text
</child>
root text
</root>
EOXML;
$doc = new DOMDocument();
$doc->loadXML($xml);
var_dump(getNodeText($doc->getElementsByTagName('x')[0]));
var_dump(getNodeText($doc->getElementsByTagName('root')[0]));
var_dump(getNodeText($doc->getElementsByTagName('child')[0]));
Sample output
string(6) "x text"
string(9) "root text"
string(10) "child text"
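If the element's own text is split around child elements, the first text node alone is not the whole story. A variation (only a sketch; ownText is a hypothetical helper name, not part of the DOM API) is to concatenate every text node that is a direct child:

```php
<?php
// Hypothetical helper: join only the text nodes that are direct children of $el.
function ownText(DOMElement $el): string
{
    $text = '';
    foreach ($el->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    return trim($text);
}

$doc = new DOMDocument();
$doc->loadXML('<p>before <b>bold</b> after</p>');
echo ownText($doc->documentElement), "\n";
```

The text of the b element is left out, and the whitespace on both sides of it is kept as it was in the source.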

DOMDocument adds the line break between nodes where there is nothing between the nodes

How to prevent DOMDocument from adding the line break \n after the first paragraph node? When there is space between the nodes the line break is not added.
<?php
$text = '<p></p><p></p>';
$dom = new \DOMDocument();
$dom->loadHTML($text);
$innerHTML = "";
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$innerHTML .= $dom->saveHTML($child);
}
echo json_encode($innerHTML);
The code above returns:
"<p><\/p>\n<p><\/p>"
The code can be run online at https://3v4l.org/UfZTG
I ran into this issue today. My use-case seems to have been the same, namely generating an “inner HTML” string from an element. Because I did not want to indiscriminately change or trim white space from nodes, I found a different solution.
Running DOMDocument::saveHTML on a DOMDocumentFragment (in my testing) never seems to add any extra white space.
Going from your example, you can get the HTML output of the first P element without a trailing \n by doing:
$frag = $dom->createDocumentFragment();
$frag->appendChild(
$dom->getElementsByTagName('p')->item(0)->cloneNode(true)
);
echo json_encode($dom->saveHTML($frag)); // Renders "<p><\/p>".
Note that you must use DOMNode::cloneNode, otherwise you are moving the element into the DOMDocumentFragment and remove it from its original place.
If you are looking for an inner HTML function, the following should work. It will move all the child nodes of an element into a DOMDocumentFragment, then get the HTML value, and put the nodes back where they belong. This means we aren't cloning nodes, nor leaving the tree changed when we are done.
function innerHTML(\DOMElement $element): string
{
$fragment = $element->ownerDocument->createDocumentFragment();
while ($element->hasChildNodes()) {
$fragment->appendChild($element->firstChild);
}
$html = $element->ownerDocument->saveHTML($fragment);
$element->appendChild($fragment);
return $html;
}

DOM in PHP: Decoded entities and setting nodeValue

I want to perform certain manipulations on an XML document with PHP using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. To illustrate what bothers me, I give a quick example.
Suppose we have the following code
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
foreach($node_list as $node) {
//do something
}
If the code in the loop is something like
$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);
it works fine. But if it's more like
$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;
and $text contains some decoded &, it doesn't get encoded, even if one does nothing with $text at all.
At the moment, I apply htmlspecialchars on $text before I set $node->nodeValue to it. Now I want to know
if that is sufficient,
if not, what would suffice,
and if there are more elegant solutions for this, as in the case of attribute manipulation.
The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.
EDIT
It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->loadXML($output);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');
foreach($node_list as $node) {
$node->nodeValue = $node->textContent;
}
echo $doc->saveXML();
If I execute this code on the CLI with
php beeb.php |egrep 'link|Warning'
I get results like
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>
which should be
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>
(and is, if the loop is omitted), along with corresponding warnings:
Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15
When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.
Your question is basically whether DOMText::nodeValue should be set to an XML-encoded string or to a verbatim string.
So let's just try that out and set it to & and &amp; and see what happens:
$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');
$text = $doc->documentElement->childNodes->item(0);
echo "Before Edit: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&";
echo "After Edit 1: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&amp;";
echo "After Edit 2: ", $doc->saveXML($text), "\n";
The output then is as follows (PHP 5.0.0 - 5.5.0):
Before Edit: *
After Edit 1: &amp;
After Edit 2: &amp;amp;
This shows that setting the nodeValue of a DOMText-node expects a UTF-8 encoded string and the DOM library encodes the XML reserved characters automatically.
So you should not apply htmlspecialchars() onto any text you add this way. That would create a double-encoding.
Since you write that you experience the opposite, I suggest you execute an isolated PHP example on the command line or within your IDE so that you can see the exact output. It may be that your browser renders this as HTML, which makes it look as if the reserved XML characters have not been encoded.
As you have pointed out, you're not editing a DOMText but a DOMElement node. That works a bit differently: here the & character needs to be passed as the entity &amp; instead of a verbatim &, however only this character.
So this needs a little bit more work:
Read out the text-content and turn it into a DOMText node. Everything will be perfectly encoded.
Remove the node-value of the element node so it's empty.
Append the DOMText node from the first step as a child.
And done. Here your inner foreach modified showing this:
foreach($node_list as $node) {
$text = $doc->createTextNode($node->textContent);
$node->nodeValue = "";
$node->appendChild($text);
}
For your concrete example, though, I must admit I don't understand why you do that: the loop does not change the value, so it would not be needed.
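The three steps can also be wrapped in a small function that does not touch the nodeValue setter at all. This is only a sketch: setElementText is a hypothetical helper name, and the sample document and URL are made up. It relies on createTextNode() taking the string verbatim, as shown for DOMText above.

```php
<?php
// Hypothetical helper: replace all children of $el with a single text node.
function setElementText(DOMElement $el, string $text): void
{
    // Empty the element without going through nodeValue.
    while ($el->firstChild !== null) {
        $el->removeChild($el->firstChild);
    }
    // The text node is escaped on output, so pass the verbatim string.
    $el->appendChild($el->ownerDocument->createTextNode($text));
}

$doc = new DOMDocument();
$doc->loadXML('<item><link>old</link></item>');
$link = $doc->getElementsByTagName('link')->item(0);
setElementText($link, 'http://example.com/?a=1&b=2');
echo $doc->saveXML($link), "\n";
```

The & in the URL is passed in verbatim and comes out as &amp; in the serialized XML, so no htmlspecialchars() is needed.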
Tip: In PHP DOMDocument can open this feed directly, you don't need curl here:
$doc = new DOMDocument();
$doc->load("http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText and DOMElement differ in this regard.
To illustrate this, an example:
$doc = new DOMDocument();
$doc->formatOutput = true;
$doc->loadXML('<root/>');
$s = 'text &amp;&lt;<"\'&text;&text';
$root = $doc->documentElement;
$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);
$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);
$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);
echo $doc->saveXML();
outputs
Warning: DOMDocument::createElement(): unterminated entity reference text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
<tag1>text &amp;&lt;&lt;"'&text;</tag1>
<tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
<tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
</root>
In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers one gets a quite elegant solution.
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
$visitTextNode = function (DOMText $node) {
$text = $node->textContent;
/*
do something with $text
*/
$node->nodeValue = $text;
};
foreach ($node_list as $node) {
if ($node->nodeType == XML_TEXT_NODE) {
$visitTextNode($node);
} else {
foreach ($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$visitTextNode($child);
}
}
}
}
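To make the pattern above concrete, here is a runnable sketch with made-up input: the XML string, the //a query and the use of strtoupper() as the "do something" step are all assumptions for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><a>x &amp; y</a><a>p <b>q</b> r</a></root>');
$xpath = new DOMXPath($doc);

// Only DOMText nodes are modified, so & is escaped automatically on output.
$visitTextNode = function (DOMText $node): void {
    $node->nodeValue = strtoupper($node->textContent);
};

foreach ($xpath->query('//a') as $node) {
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $visitTextNode($child);
        }
    }
}
echo $doc->saveXML($doc->documentElement), "\n";
```

Only the text nodes directly under each a element are changed; the text inside b is left alone because it is not a direct child.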
