Raw content of DOMElement::innerHTML - php

How can I get, in PHP, the raw content of a DOMElement, like JS innerHTML does?
I tried calling saveHTML() or saveXML() on each of the childNodes to simulate innerHTML, but that rewrites the markup, for example turning <br /> into <br> (or into <br/> in the XML version).

This can be achieved in a hacky but reliable way. PHP has the equivalent of outerHTML: pass the node to its parent document's saveHTML() method. Because this output is well-formed and escaped, you can easily strip the single outer tag from the text, leaving the desired innerHTML.
Example:
$dom = new DOMDocument;
$dom->loadHTML('<div><p with="scary<>\'"" attrs=40 ok>Hello <em>World</em></div>');
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//p') as $p) {
$innerHTML = preg_replace('#^<([^>\\s]+)[^>]*>(.*)</\\1>$#s', '$2', $dom->saveHTML($p));
var_dump($innerHTML);
}
Demo of regex: https://regex101.com/r/yEVMQx/2
Note the s flag on the regex is critical.
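As an alternative to stripping the outer tag with a regex, the children of the node can be serialized one by one and concatenated. The following is only a sketch: the helper name innerHTML() and the sample markup are made up for illustration, and passing a node to saveHTML() needs PHP 5.3.6 or later. Like the regex version, it returns the parser's normalized markup (for example <br /> comes back as <br>), because the DOM does not keep the original source text.

```php
<?php
// Hypothetical helper: serialize every child of $node and join the results.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument requires PHP 5.3.6+.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument;
$dom->loadHTML('<div><p class="intro">Hello <em>World</em><br />again</p></div>');
$p = $dom->getElementsByTagName('p')->item(0);

$inner = innerHTML($p);
echo $inner, "\n";
```

This avoids any assumption about what the opening tag looks like, at the cost of one saveHTML() call per child node.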

Related

Output BR tag, using simpleXML

I want the text string "hello there" split across 2 lines.
For this I need SimpleXML to create a br tag in the output file result.xml, but the output only contains the escaped text &lt;br&gt;.
<?php
// DOMDocument
$dom = new DomDocument('1.0', 'UTF-8');
$dom->formatOutput = true;
$html = $dom->appendChild($dom->createElement("html"));
$xmlns = $dom->createAttribute('xmlns');
$xmlns->value = 'http://www.w3.org/1999/xhtml';
$html->appendChild($xmlns);
// SimpleXML
$sxe = simplexml_import_dom($dom);
$head = $sxe->addChild('head', ' ');
$body = $sxe->addChild('body', 'hello <br> there');
echo $sxe->asXML('result.xml');
Result:
hello &lt;br&gt; there
Wanted Result:
hello
there
Firstly, PHP's SimpleXML extension works only with XML, not HTML. You're rightly mentioning XHTML in your setup code, but that means you need to use XML self-closing elements like <br /> not HTML unclosed tags like <br>.
Secondly, the addChild method takes text content as its second parameter, not raw document content; so as you've seen, it will automatically escape < and > for you.
SimpleXML is really designed around the kind of XML that's a strict tree of elements, rather than a markup language with elements interleaved with text content like XHTML, so this is probably a case where you're better off sticking to the DOM.
Even then, there's no equivalent of the JS innerHTML property, I'm afraid, so I believe you'll have to add the text and the br element as separate nodes, e.g.
$body = $html->appendChild( $dom->createElement('body') );
$body->appendChild( $dom->createTextNode('hello') );
$body->appendChild( $dom->createElement('br') );
$body->appendChild( $dom->createTextNode('there') );
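Put together as a complete script, the DOM-only approach might look like the sketch below. It builds the document from scratch rather than reusing the question's variables, and it keeps the result in a string instead of writing result.xml; both choices are assumptions made to keep the example self-contained.

```php
<?php
// Build a minimal XHTML document with the DOM extension only.
$dom = new DOMDocument('1.0', 'UTF-8');
$html = $dom->appendChild($dom->createElement('html'));
$html->setAttribute('xmlns', 'http://www.w3.org/1999/xhtml');

$body = $html->appendChild($dom->createElement('body'));
// Text and the br element are added as separate sibling nodes.
$body->appendChild($dom->createTextNode('hello'));
$body->appendChild($dom->createElement('br'));
$body->appendChild($dom->createTextNode('there'));

// saveXML() serializes the empty element in self-closing XML form.
$xml = $dom->saveXML();
echo $xml;
```

To write the file instead, $dom->save('result.xml') can replace the saveXML() call.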

How can I remove <br/> if no text comes before or after it? DOMxpath or regex?

How can I remove <br/> if no text comes before or after it?
For instance,
<p><br/>hello</p>
<p>hello<br/></p>
they should be rewritten like this,
<p>hello</p>
<p>hello</p>
Should I use DOMXPath, or would regex be better?
(Note: I have a post about removing <p><br/></p> with DOMxpath earlier, and then I came across this issue!)
EDIT:
If I have this in the input,
$content = '<p><br/>hello<br/>hello<br/></p>';
then it should be
<p>hello<br/>hello</p>
To select the mentioned br you can use:
"//p[node()[1][self::br]]/br[1] | //p[node()[last()][self::br]]/br[last()]"
or, (maybe) faster:
"//p[br]/node()[self::br and (position()=1 or position()=last())]"
Both expressions select a br only when it is the first or the last node of its p.
This will select br such as:
<p><br/>hello</p>
<p>hello<br/></p>
and first and last br like in:
<p><br/>hello<br/>hello<br/></p>
not middle br like in:
<p>hello<br/>hello</p>
PS: to select the first br of an adjacent pair such as <br/><br/>:
"//br[following::node()[1][self::br]]"
If you want some code, I got it working like this. It uses a very slightly modified version of @empo's XPath and shows the removal of the matches as well as some more test cases:
$html = <<<EOD
<p><br/>hello</p>
<p>hello<br/></p>
<p>hello<br/>Chello</p>
<p>hello <i>molly</i><br/></p>
<p>okidoki</p>
EOD;
$doc = new DomDocument;
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$nodes = $xpath->query('//p[node()[1][self::br] or node()[last()][self::br]]/br');
foreach($nodes as $node) {
$node->parentNode->removeChild($node);
}
var_dump($doc->saveHTML());
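One thing to watch: an expression of the form //p[...]/br selects every br child of a matching p, so a paragraph that starts or ends with a br would lose its middle br elements too. A sketch that removes only the leading and trailing br, using the position-based XPath from the earlier answer, could look like this (the sample markup is made up for illustration):

```php
<?php
$doc = new DOMDocument;
$doc->loadHTML('<p><br/>hello<br/>hello<br/></p><p>hello<br/>hello</p>');
$xpath = new DOMXPath($doc);

// Select a br only if it is the first or the last child node of its p.
$query = '//p/node()[self::br and (position()=1 or position()=last())]';

// Copy the matches into an array before modifying the tree.
foreach (iterator_to_array($xpath->query($query)) as $br) {
    $br->parentNode->removeChild($br);
}

// Serialize each paragraph on its own line.
$result = '';
foreach ($doc->getElementsByTagName('p') as $p) {
    $result .= $doc->saveHTML($p) . "\n";
}
echo $result;
```

Note that whitespace-only text before a leading br or after a trailing br counts as a node, so such a br would not be selected by this query.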

simplexml php get xml

If I have a document like this:
<!-- in doc.xml -->
<a>
<b>
greetings?
<c>hello</c>
<d>goodbye</d>
</b>
</a>
Is there any way to use simplexml (or any php builtin really) to get a string containing:
greetings?
<c>hello</c>
<d>goodbye</d>
Whitespace and such doesn't matter.
Thanks!
I must admit this wasn't as simple as one would think. This is what I came up with:
$xml = new DOMDocument;
$xml->load('doc.xml');
// find just the <b> node(s)
$xpath = new DOMXPath($xml);
$results = $xpath->query('/a/b');
// get entire <b> node as text
$node = $results->item(0);
$text = $xml->saveXML($node);
// remove encapsulating <b></b> tags
$text = preg_replace('#^<b>#', '', $text);
$text = preg_replace('#</b>$#', '', $text);
echo $text;
Regarding the XPath query
The query returns all matching nodes, so if there are multiple matching <b> tags, you can loop through $results to get them all.
My query for '/a/b' assumes that <a> is the root and <b> is its child/immediate descendant. You could alter it for different scenarios. Here's an XPath reference. Some adjustments might include:
'//a/b' –– <b> is child of <a>, but <a> is anywhere, not just in the root
'a//b' –– <b> is a descendant of <a> no matter how deep, not just a direct child
'//b' –– all <b> nodes anywhere in the document
Regarding method of obtaining string contents
I tried using $node->nodeValue or $node->textContent, but both of them strip out the <c> and <d> tags, leaving just the text contents of those. I also tried casting it as a DOMText object, but that didn't directly work and was more trouble than it was worth.
Regarding the use of regular expressions
It could be done without regex, but I found it easiest to use them. I wanted to make sure that I only stripped the <b> and </b> at the very beginning and end of the string, just in case there were other <b> nodes within the contents.
How about this? Since you already know the XML format:
<?php
$xml = simplexml_load_file('doc.xml');
$str = $xml->b;
$str .= "<c>".$xml->b->c."</c>";
$str .= "<d>".$xml->b->d."</d>";
echo $str;
?>
Here's an alternative using DOM (to balance the SimpleXML answers!) that outputs all of the contents of the first <b> element.
$doc = new DOMDocument;
$doc->load('doc.xml');
$bee = $doc->getElementsByTagName('b')->item(0);
$innerxml = '';
foreach ($bee->childNodes as $node) {
$innerxml .= $doc->saveXML($node);
}
echo $innerxml;

How do I assemble pieces of HTML into a DOMDocument?

It appears that loadHTML and loadHTMLFile, when given files that each represent a section of an HTML document, fill in html and body tags for each section, as revealed when I output the result with the following:
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('*');
if( !is_null($elements) ) {
foreach( $elements as $element ) {
echo "<br/>". $element->nodeName. ": ";
$nodes = $element->childNodes;
foreach( $nodes as $node ) {
echo $node->nodeValue. "\n";
}
}
}
Since I plan to assemble these parts into the larger document within my own code, and I've been instructed to use DOMDocument to do it, what can I do to prevent this behavior?
This is part of several modifications the HTML parser module of libxml makes to the document in order to work with broken HTML. It only occurs when using loadHTML and loadHTMLFile on partial markup. If you know the partial is valid X(HT)ML, use load and loadXML instead.
You could use
$doc->saveXml($doc->getElementsByTagName('body')->item(0));
to dump the outerHTML of the body element, e.g. <body>anything else</body> and strip the body element with str_replace or extract the inner html with substr.
$html = '<p>I am a fragment</p>';
$dom = new DOMDocument;
$dom->loadHTML($html); // added html and body tags
echo substr(
$dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
),
6, -7
);
// <p>I am a fragment</p>
Note that this will use XHTML compliant markup, so <br> would become <br/>. Up to PHP 5.3.5 there was no way to pass a node to saveHTML(); the optional node argument was added in PHP 5.3.6.
The closest you can get is to use the DOMDocumentFragment.
Then you can do:
$doc = new DOMDocument();
...
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$someElement->appendChild($f);
However, this expects XML, not HTML.
In any case, I think you're creating an artificial problem. Since you know the behavior is to create the html and body tags, you can just extract the elements in the file from within the body tag and then import them into the DOMDocument where you're assembling the final file. See DOMDocument::importNode.
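A minimal sketch of that importNode approach follows. The fragment strings, the target skeleton and the div with id "content" are all assumptions for illustration; in practice each fragment would probably come from loadHTMLFile().

```php
<?php
// Hypothetical fragments standing in for the separate section files.
$fragments = ['<p>First part</p>', '<ul><li>Second part</li></ul>'];

// The document being assembled, with a container for the sections.
$target = new DOMDocument;
$target->loadHTML('<html><body><div id="content"></div></body></html>');
$container = $target->getElementsByTagName('div')->item(0);

foreach ($fragments as $fragment) {
    $part = new DOMDocument;
    $part->loadHTML($fragment); // the parser wraps the fragment in html/body
    $body = $part->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child) {
        // importNode() with deep = true copies the node and its subtree
        // into $target; the copy can then be appended there.
        $container->appendChild($target->importNode($child, true));
    }
}

$out = $target->saveHTML($container);
echo $out;
```

Because importNode() copies rather than moves, the per-fragment documents are left untouched and can simply be discarded.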

Add an attribute to an HTML element

I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so <a> gets style="xxxx:yyyy;" added. How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times. Don't use regexes for HTML parsing.
// $html holds the markup to modify
$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);
// Set the attribute on every <a> element in the document
foreach ($x->query("//a") as $node) {
    $node->setAttribute("style", "xxxx:yyyy;");
}
$newHtml = $dom->saveHTML();
Here is using regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
However, a regex cannot reliably handle malformed HTML documents.
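Since the question asks for a way to add any attribute to any tag, the DOM approach can be wrapped in a small function. This is only a sketch: the function name addAttribute() and the sample input are made up, and because it uses loadHTML() and saveHTML() the returned string is a full document (doctype, html and body included), not just the input fragment.

```php
<?php
// Hypothetical helper: set $name="$value" on every <$tag> element in $html.
function addAttribute(string $html, string $tag, string $name, string $value): string
{
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName($tag) as $node) {
        // setAttribute() adds the attribute or overwrites an existing one.
        $node->setAttribute($name, $value);
    }
    return $dom->saveHTML();
}

$out = addAttribute('<p>See <a href="/docs">the docs</a>.</p>', 'a', 'style', 'color:red;');
echo $out;
```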
