Error when parsing DOMDocument with PHP - php

We are upgrading our Software to PHP 7.2.3 and I have the following code snippet which worked fine in previous versions:
$doc = new DOMDocument();
$doc->loadHTML("<html><body>".($_POST['reportForm_structure'])."</body></html>");
$root = $doc->documentElement->firstChild->firstChild->firstChild;
file_put_contents('D:\testoutput.txt', print_r($root ,true));
foreach($root->childNodes as $child) {
if ($child->nodeName == "ul") {
foreach($child->childNodes as $ulChild) {
$this->loadNodes($ulChild, $this->report);
}
}
}
The file_put_contents call is just for debugging.
I get the following error: Invalid argument supplied for foreach(). The message refers to the line with the first foreach loop, so the data structure is not initialized correctly. The conversion from HTML to DOMDocument apparently no longer works as before: when I check the output of file_put_contents, I can see that $root is a DOMText object instead of a DOMElement object, but why? When I pass the argument of loadHTML directly to file_put_contents,
file_put_contents('D:\testoutput.txt', print_r("<html><body>".($_POST['reportForm_structure'])."</body></html>", true));
the output looks like proper HTML, which is why I am confused that it does not work anymore.
<html><body><ul class="ltr">
<li class="open last" id="root" rel="root">
<ins> </ins>HeaderText
<ul><li class="open last" id="id1" rel="header"><ins> </ins>Test123
<ul><li class="open leaf last" id="id2" rel="header"><a class="clicked" href="#"><ins> </ins>Test456</a></li></ul></li></ul></li>
Does anyone know how to solve this issue? Did I miss something in the configuration here?

I couldn't reproduce the DOMText node with the code you show, but my guess is that you are preserving whitespace and then fetching the whitespace node between the ul element and the li element.
v-------- whitespace node
<html><body><ul class="ltr">
<li class="open last" id="root" rel="root">
In any case, if you want the element with the ID "root", use a more precise query, e.g. use
$root = $doc->getElementById("root");
You can also set $doc->preserveWhiteSpace = false, but it's better to query for the node by ID than to traverse down three children and assume it's that node.
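To illustrate the ID lookup combined with a node-type check, here is a minimal sketch. The $fragment string is a made-up stand-in for $_POST['reportForm_structure'], and the loop only collects the nested ul elements instead of calling loadNodes(), so treat it as an outline rather than a drop-in replacement.

```php
<?php
// Hypothetical sample input standing in for $_POST['reportForm_structure'].
$fragment = "<ul class=\"ltr\">\n"
    . "<li class=\"open last\" id=\"root\" rel=\"root\">HeaderText\n"
    . "<ul><li id=\"id1\">Test123</li></ul></li>\n"
    . "</ul>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML("<html><body>" . $fragment . "</body></html>");
libxml_clear_errors();

// Look the element up by ID instead of walking firstChild three times.
$root = $doc->getElementById("root");

$lists = [];
if ($root !== null) {
    foreach ($root->childNodes as $child) {
        // Skip text nodes (including whitespace-only ones) and keep only <ul> elements.
        if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === "ul") {
            $lists[] = $child;
        }
    }
}

echo count($lists), " nested list(s) found\n";
```

Checking nodeType before using a child keeps the loop working whether or not the parser kept whitespace text nodes.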

Thanks @Gordon and @DarsVaeda for pointing me in the right direction. DOMDocument interprets carriage returns and tabs as text nodes. I had to remove those to make it work again. Changed
$doc->loadHTML("<html><body>".$_POST['reportForm_structure']."</body></html>");
to
$doc = new DOMDocument();
$string = trim(preg_replace('/\t+/', '', $_POST['reportForm_structure']));
$string = preg_replace( "/\r|\n/", "", $string );
$doc->loadHTML("<html><body>".$string."</body></html>");
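A narrower variant of the same idea, as a sketch: remove only the whitespace that sits between two tags, so line breaks and tabs inside text content are left alone. The helper name stripInterTagWhitespace() and the sample input are made up for illustration. Be aware that whitespace between inline elements can matter for rendering, so this is not always safe for arbitrary HTML.

```php
<?php
// Hypothetical helper: drops whitespace-only runs between a closing ">" and the next "<".
function stripInterTagWhitespace(string $html): string
{
    $stripped = preg_replace('/>\s+</', '><', $html);
    // preg_replace() returns null on a regex error; fall back to the input in that case.
    return trim($stripped ?? $html);
}

// Made-up input with CRLF line endings and a tab, similar to the posted form data.
$input = "<ul class=\"ltr\">\r\n\t<li id=\"root\">Header Text</li>\r\n</ul>\n";
$clean = stripInterTagWhitespace($input);

echo $clean, "\n";
```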

Related

Changing a tag <a> to <div> with DOMDocument on WordPress

I'm a beginner in PHP and I would like to set up several functions to replace specific code bits on WordPress (including plugin elements that I can't edit directly).
Below is an example (first line: initial result, second line: desired result):
<span class="fn" itemprop="name">Gael Beyries</span>
<div class="vcard author"><span class="fn" itemprop="name">Gael Beyries</span></div>
PS: I came across this topic: Parsing WordPress post content but the example is too complicated for what I want to do. Could you present me an example code that solves this problem so I can try to modify it to modify other html elements?
Although I'm not sure how this fits into WP, I have basically taken the code from the linked answer and adapted it to your requirements.
I've assumed you want to find the <a> tags with class="vcard author" and this is the basis of the XPath expression. The code in the foreach() loop just copies the data into a new node and replaces the old one...
function replaceAWithDiv($content){
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$aTags = $xpath->query('//a[@class="vcard author"]');
foreach($aTags as $a){
// Create replacement element
$div = $dom->createElement("div");
$div->setAttribute("class", "vcard author");
// Move contents from a tag to div; childNodes is a live list, so take
// firstChild until none is left instead of iterating with foreach
while ($a->firstChild) {
$div->appendChild($a->firstChild);
}
// Replace a tag with div
$a->parentNode->replaceChild($div, $a);
}
return $dom->saveHTML();
}
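Here is a self-contained run of the same replacement on a small sample, as a sketch. The input markup (an a element wrapping the span, with href="#" and a trailing text node) is assumed for illustration, since the question only shows the inner span. Children are moved with a while loop because appendChild() detaches each node from the a element as it goes.

```php
<?php
// Hypothetical input: the <a> wrapper and its href are assumed for illustration.
$content = '<a class="vcard author" href="#">'
    . '<span class="fn" itemprop="name">Gael Beyries</span> (author)</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[@class="vcard author"]') as $a) {
    $div = $dom->createElement('div');
    $div->setAttribute('class', 'vcard author');
    // Each appendChild() removes the node from <a>, so loop until <a> is empty.
    while ($a->firstChild !== null) {
        $div->appendChild($a->firstChild);
    }
    $a->parentNode->replaceChild($div, $a);
}

// Count what ended up inside the new <div>: the <span> and the trailing text node.
$movedChildren = $xpath->query('//div[@class="vcard author"]/node()')->length;
$remainingLinks = $xpath->query('//a')->length;

echo $dom->saveHTML();
```

Note that saveHTML() without an argument serializes the whole document, including the doctype and the html and body wrappers that loadHTML() adds.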

Can't get the plaintext without using foreach

I'm trying to use simple_html_dom to get the plaintext ('THIS TEXT') of one HTML element:
<div class="parent">
<span><i class="fa fa-awesome"></i>THIS TEXT</span>
</div>
I'm getting that text by using:
foreach($html->find('div.parent span.child') as $text){
echo $text->plaintext;
}
But it is just one element and I'm searching for a way to get that plaintext without using foreach loop (since it is just one element).
P.S: I've been trying this:
$html->find('div.parent span.child', 1);
But var_dump-ing that results in a NULL.
I also tried this:
$html->find('div.delivery-status span.status', 1)->plaintext;
But var_dump-ing it results in:
Notice: Trying to get property 'plaintext' of non-object in
C:\xampp\htdocs\curl\index.php on line 19
I also read the documentation but i can't seem to be able to figure this one out :(. Can somebody please help me or at least point me into the right direction? :-s
Thank you!:D
You're using a pretty ancient library, but it looks like a foreach loop is how the author intended it to work. This is typical for DOM functions, most of which return a node list. What's wrong with the loop? You could do this in plain old PHP as well:
$html = <<< HTML
<div class="parent">
<span><i class="fa fa-awesome"></i>THIS TEXT</span>
</div>
HTML;
$dom = new \DomDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new \DOMXPath($dom);
$data = $xpath->query("//div[@class='parent']/span/text()");
echo $data[0]->textContent;
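If the goal is a single string with no node list at all, DOMXPath::evaluate() with the XPath string() function is another option. A minimal sketch, using the markup from the question:

```php
<?php
$html = '<div class="parent"><span><i class="fa fa-awesome"></i>THIS TEXT</span></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// string() converts the first matching node to its text value, so no loop or index is needed.
$text = $xpath->evaluate("string(//div[@class='parent']/span)");

echo $text, "\n";
```

When nothing matches, string() yields an empty string rather than null, which avoids the "property of non-object" notice.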
The <span> in the question does not have a child css class, so your selector is not correct. Also you seem to be missing the point that when calling find, the index of children is zero based. Try this:
$str = '<div class="parent"><span><i class="fa fa-awesome"></i>THIS TEXT</span></div>';
$html = str_get_html($str);
// no .child for the span, and 0 as the index of target child
print $html->find('div.parent span', 0)->plaintext;

DOMDocument adds the line break between nodes where there is nothing between the nodes

How to prevent DOMDocument from adding the line break \n after the first paragraph node? When there is space between the nodes the line break is not added.
<?php
$text = '<p></p><p></p>';
$dom = new \DOMDocument();
$dom->loadHTML($text);
$innerHTML = "";
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$innerHTML .= $dom->saveHTML($child);
}
echo json_encode($innerHTML);
The code above returns:
"<p><\/p>\n<p><\/p>"
The code can be run online at https://3v4l.org/UfZTG
I ran into this issue today. My use-case seems to have been the same, namely generating an “inner HTML” string from an element. Because I did not want to indiscriminately change or trim white space from nodes, I found a different solution.
Running DOMDocument::saveHTML on a DOMDocumentFragment (in my testing) never seems to add any extra white space.
Going from your example, you can get the HTML output of the first P element without a trailing \n by doing:
$frag = $dom->createDocumentFragment();
$frag->appendChild(
$dom->getElementsByTagName('p')->item(0)->cloneNode(true)
);
echo json_encode($dom->saveHTML($frag)); // Renders "<p><\/p>".
Note that you must use DOMNode::cloneNode, otherwise you are moving the element into the DOMDocumentFragment and removing it from its original place.
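The difference between cloning and moving can be seen by counting the children of the original parent. This is a small sketch on a made-up XML document, chosen so the node counts are easy to follow:

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><p>one</p><p>two</p></root>');
$root = $dom->documentElement;

// Appending a clone leaves the original tree untouched.
$cloneFragment = $dom->createDocumentFragment();
$cloneFragment->appendChild($root->firstChild->cloneNode(true));
$afterClone = $root->childNodes->length;

// Appending the node itself moves it out of <root>.
$moveFragment = $dom->createDocumentFragment();
$moveFragment->appendChild($root->firstChild);
$afterMove = $root->childNodes->length;

echo "children after clone: $afterClone, after move: $afterMove\n";
```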
If you are looking for an inner HTML function, the following should work. It will move all the child nodes of an element into a DOMDocumentFragment, then get the HTML value, and put the nodes back where they belong. This means we aren't cloning nodes, nor leaving the tree changed when we are done.
function innerHTML(\DOMElement $element): string
{
$fragment = $element->ownerDocument->createDocumentFragment();
while ($element->hasChildNodes()) {
$fragment->appendChild($element->firstChild);
}
$html = $element->ownerDocument->saveHTML($fragment);
$element->appendChild($fragment);
return $html;
}

php DOMDocument nodeName property returning '#text' with the nodeName

I want to extract the content of the body of an HTML page along with the tag names of its children. I have taken an example HTML document like this:
<html>
<head></head>
<body>
<h1>This is H1 tag</h1>
<h2>This is H2 tag</h2>
<h3>This is H3 tag</h3>
</body>
</html>
I have implemented the PHP code like below and it's working fine.
$d=new DOMDocument();
$d->loadHTMLFile('file.html');
$l=$d->childNodes->item(1)->childNodes->item(1)->childNodes;
for($i=0;$i<$l->length;$i++)
{
echo "<".$l->item($i)->nodeName.">".$l->item($i)->nodeValue."</".$l->item($i)->nodeName.">";
}
This code is working perfectly fine, but when I tried to do this using foreach loop instead of for loop, the nodeName property was returning '#text' with every actual nodeName.
Here is that code
$l=$d->childNodes->item(1)->childNodes->item(1)->childNodes;
foreach ($l as $li) {
echo $li->childNodes->item(0)->nodeName."<br/>";
}
Why so?
When I've had this problem it was fixed by doing the following.
$xmlDoc = new DOMDocument();
$xmlDoc->preserveWhiteSpace = false; // important!
You can trace out your $node->nodeType to see the difference. I get 3, 1, 3 even though there was only one node (child). Turn white space off and now I just get 1.
GL.
In DOM, everything is a 'node'. Not just the elements (tags); comments and text between the elements (even if it's just whitespaces or newlines, which seems to be the case in your example) are nodes, too. Since text nodes don't have an actual node name, it's substituted with #text to indicate it's a special kind of node.
Note that the two loops in the question are not equivalent. Both item() and foreach return every child node, text nodes included, but the foreach version prints $li->childNodes->item(0)->nodeName, i.e. the name of the first child of each node. For an element such as <h1>, that first child is its text content, so the output is #text.
Besides nodeName and nodeValue, a DOMNode also has a nodeType property. By checking this property against certain constants you can determine the type of the node and thus filter out unwanted nodes.
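A minimal sketch of filtering on nodeType. It uses loadXML() on an inline string instead of loadHTMLFile('file.html') so that the example is self-contained; the markup mirrors the body from the question.

```php
<?php
$xml = "<body>\n"
    . "<h1>This is H1 tag</h1>\n"
    . "<h2>This is H2 tag</h2>\n"
    . "<h3>This is H3 tag</h3>\n"
    . "</body>";

$doc = new DOMDocument();
$doc->loadXML($xml);

$children = $doc->documentElement->childNodes;
$names = [];
foreach ($children as $node) {
    // XML_ELEMENT_NODE is 1; whitespace between the tags shows up as XML_TEXT_NODE (3).
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $names[] = $node->nodeName;
    }
}

echo $children->length, " child nodes, elements: ", implode(", ", $names), "\n";
```

With whitespace preserved (the default), the list holds seven children here: four whitespace text nodes and three elements.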
I'm coming a little late to this, but the best solution for me was different. The issue is that the text node doesn't know its name but its parent does, so all you need to do is ask the parent for its nodeName to get the key.
$dom = new DOMDocument();
$dom->loadXML($stringXML);
$valorizador = $dom->getElementsByTagName("tagname");
foreach ($valorizador->item(0)->childNodes as $item) {
$childs = $item->childNodes;
$key = $item->nodeName;
foreach ($childs as $i) {
echo $key." => ".$i->nodeValue. "\n";
}
}

How to keep DOMDocument from saving < as &lt;

I'm using SimpleXML to add a child node within one of my XML documents. When I do a print_r on my SimpleXML object, the < is still displayed as a < in the view source. However, after I save this object back to XML using DOMDocument, the < is converted to &lt; and the > is converted to &gt;.
Any ideas on how to change this behavior? I've tried adding $dom->substituteEntities = false;, but this did no good.
//Convert SimpleXML element to DOM and save
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = false;
$dom->substituteEntities = false;
$dom->loadXML($xml->asXML());
$dom->save($filename);
Here is where I'm using the <:
$new_hint = '<![CDATA[' . $value[0] . ']]>';
$PrintQuestion->content->multichoice->feedback->hint->Passage->Paragraph->addChild('TextFragment', $new_hint);
The problem is that I'm using SimpleXML to iterate through certain nodes in the XML document, and if an attribute matches a given ID, a specific child node is added with CDATA. Then, after all processing, I save the XML back to file using DOMDocument, which is where the < is converted to &lt;, etc.
Here is a link to my entire class file, so you can get a better idea on what I'm trying to accomplish. Specifically refer to the hint_insert() method at the bottom.
http://pastie.org/1079562
SimpleXML and php5's DOM module use the same internal representation of the document (facilitated by libxml). You can switch between both apis without having to re-parse the document via simplexml_import_dom() and dom_import_simplexml().
I.e. if you really want/have to perform the iteration with the SimpleXML api once you've found your element you can switch to the DOM api and create the CData section within the same document.
<?php
$doc = new SimpleXMLElement('<a>
<b id="id1">a</b>
<b id="id2">b</b>
<b id="id3">c</b>
</a>');
foreach( $doc->xpath('b[@id="id2"]') as $b ) {
$b = dom_import_simplexml($b);
$cdata = $b->ownerDocument->createCDataSection('0<>1');
$b->appendChild($cdata);
unset($b);
}
echo $doc->asxml();
prints
<?xml version="1.0"?>
<a>
<b id="id1">a</b>
<b id="id2">b<![CDATA[0<>1]]></b>
<b id="id3">c</b>
</a>
The problem is that you're likely adding that as a string, instead of as an element.
So, instead of:
$simple->addChild('foo', '<something/>');
which will be treated as text, do:
$child = $simple->addChild('foo');
$child->addChild('something');
You can't have a literal < in the body of the XML document unless it's the opening of a tag.
Edit: After what you describe in the comments, I think you're after:
DOMDocument::createCDATASection()
$child = $dom->createCDataSection('your < cdata > body ');
$dom->appendChild($child);
Edit2: After reading your edit, there's only one thing I can say:
You're doing it wrong... You can't add elements as a string value for another element. Sorry, you just can't. That's why it's escaping things, because DOM and SimpleXML are there to make sure you always create valid XML. You need to create the element as an object... So, if you want to create the CDATA child, you'd have to do something like this:
$child = $PrintQuestion.....->addChild('TextFragment');
$domNode = dom_import_simplexml($child);
$cdata = $domNode->ownerDocument->createCDataSection($value[0]);
$domNode->appendChild($cdata);
That's all there should be to it...
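For comparison, here is a small self-contained sketch that shows both behaviours side by side. The Paragraph and TextFragment names come from the question; the Escaped element and the sample value are made up for illustration.

```php
<?php
$xml = new SimpleXMLElement('<Paragraph/>');

// Passing the CDATA markup as a string value: it is stored as plain text,
// so the angle brackets are escaped when the document is serialized.
$xml->addChild('Escaped', '<![CDATA[1 < 2]]>');

// Creating an empty child and attaching a real CDATA section through the DOM API.
$child = $xml->addChild('TextFragment');
$domNode = dom_import_simplexml($child);
$domNode->appendChild($domNode->ownerDocument->createCDATASection('1 < 2'));

$out = $xml->asXML();
echo $out;
```

Because SimpleXML and DOM share the same underlying document, the CDATA section added through $domNode appears in the output of $xml->asXML() without any re-parsing.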
