From this question: What regex pattern do I need for this? I've been using the following code:
function process($node, $replaceRules) {
if($node->hasChildNodes()) {
foreach ($node->childNodes as $childNode) {
if ($childNode instanceof DOMText) {
$text = preg_replace(
array_keys($replaceRules),
array_values($replaceRules),
$childNode->wholeText
);
$node->replaceChild(new DOMText($text),$childNode);
} else {
process($childNode, $replaceRules);
}
}
}
}
$replaceRules = array(
'/\b(c|C)olor\b/' => '$1olour',
'/\b(kilom|Kilom|M|m)eter/' => '$1etre',
);
$htmlString = "<p><span style='color:red'>The color of the sky is: gray</p>";
$doc = new DOMDocument();
$doc->loadHtml($htmlString);
process($doc, $replaceRules);
$string = $doc->saveHTML();
echo mb_substr($string,119,-15);
It works fine, but it fails (because the child node is replaced on the first match) if the HTML contains both text and child elements. So it works on
<div>The distance is four kilometers</div>
but not
<div>The distance is four kilometers<br>1000 meters to a kilometer</div>
or
<div>The distance is four kilometers<div class="guide">1000 meters to a kilometer</div></div>
Any ideas of a method that would work on such examples?
Calling $node->replaceChild inside the loop confuses the $node->childNodes iterator, because that node list is live. Copy the child nodes into an array first, and then process them:
function process($node, $replaceRules) {
if($node->hasChildNodes()) {
$nodes = array();
foreach ($node->childNodes as $childNode) {
$nodes[] = $childNode;
}
foreach ($nodes as $childNode) {
if ($childNode instanceof DOMText) {
$text = preg_replace(
array_keys($replaceRules),
array_values($replaceRules),
$childNode->wholeText);
$node->replaceChild(new DOMText($text),$childNode);
}
else {
process($childNode, $replaceRules);
}
}
}
}
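A more compact variant of the same idea, as a sketch: iterator_to_array() takes the snapshot of the live node list in one call. The function name replaceInTextNodes and the sample HTML are assumptions for illustration; the rules are the ones from the question.

```php
<?php
// Sketch: copy the live child list into an array before replacing nodes.
// replaceInTextNodes is a hypothetical name, not part of the original answer.
function replaceInTextNodes(DOMNode $node, array $replaceRules)
{
    if (!$node->hasChildNodes()) {
        return;
    }
    // iterator_to_array() snapshots the live DOMNodeList
    foreach (iterator_to_array($node->childNodes) as $childNode) {
        if ($childNode instanceof DOMText) {
            $text = preg_replace(
                array_keys($replaceRules),
                array_values($replaceRules),
                $childNode->nodeValue
            );
            $node->replaceChild(new DOMText($text), $childNode);
        } else {
            replaceInTextNodes($childNode, $replaceRules);
        }
    }
}

$replaceRules = array(
    '/\b(c|C)olor\b/' => '$1olour',
    '/\b(kilom|Kilom|M|m)eter/' => '$1etre',
);
$doc = new DOMDocument();
$doc->loadHTML('<div>The distance is four kilometers<br>1000 meters to a kilometer</div>');
replaceInTextNodes($doc, $replaceRules);
$result = $doc->saveHTML();
echo $result;
```

This sketch reads nodeValue rather than wholeText, so each text node is rewritten from its own content only.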
Related
I am trying to replace words that are in my terminology dictionary with an (HTML) anchor so that each one gets a tooltip. I have the replacement part working, but I can't get the result back into the DOMDocument object.
I've written a recursive function that walks the DOM: it visits every child node, searches for the words in my dictionary, and replaces each one with an anchor.
I had been doing this with an ordinary preg_match on the HTML string, but that runs into problems when the HTML gets complex.
The recursive function:
$terms = array(
'example'=>'explanation about example'
);
function iterate_html($doc, $original_doc = null)
{
global $terms;
if(is_null($original_doc)) {
self::iterate_html($doc, $doc);
}
foreach($doc->childNodes as $childnode)
{
$children = $childnode->childNodes;
if($children) {
self::iterate_html($childnode);
} else {
$regexes = '~\b' . implode('\b|\b',array_keys($terms)) . '\b~i';
$new_nodevalue = preg_replace_callback($regexes, function($matches) {
$doc = new DOMDocument();
$anchor = $doc->createElement('a', $matches[0]);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($matches[0])]);
return $doc->saveXML($anchor);
}, $childnode->nodeValue);
$dom = new DOMDocument();
$template = $dom->createDocumentFragment();
$template->appendXML($new_nodevalue);
$original_doc->importNode($template->childNodes, true);
$childnode->parentNode->replaceChild($template, $childnode);
}
}
}
echo iterate_html('this is just some example text.');
I expect the result to be:
this is just some <a class="text-info" data-toggle="tooltip" data-original-title="explanation about example">example</a> text
I don't think building a recursive function to walk the DOM is useful when you can use an XPath query. Also, I'm not sure that preg_replace_callback is the right function for this case; I prefer preg_split. Here is an example:
$html = 'this is just some example text.';
$terms = array(
'example'=>'explanation about example'
);
// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
$diff = mb_strlen($b) - mb_strlen($a);
return ($diff) ? $diff : strcmp($a, $b);
});
// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';
// prevent HTML parse errors from being displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);
// determine whether the HTML string already has a root html element; if not, add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
$dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);
// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');
// replacement
foreach ($textNodes as $textNode) {
$parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$fragment = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k&1) {
$anchor = $dom->createElement('a', $part);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
$fragment->appendChild($anchor);
} else {
$fragment->appendChild($dom->createTextNode($part));
}
}
$textNode->parentNode->replaceChild($fragment, $textNode);
}
// building of the result string
$result = '';
if ( $fakeRootElement ) {
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
} else {
$result = $dom->saveHTML();
}
echo $result;
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost and should be run each time the HTML is edited (not each time the HTML is displayed).
I thought I would write a simple function to visit all the nodes in a DOM tree. I wrote it, gave it a not-too-complex bit of XML to work on, but when I ran it I got only the top-level (DOMDocument) node.
Note that I am using PHP's Generator syntax:
http://php.net/manual/en/language.generators.syntax.php
Here's my function:
function DOMIterate($node)
{
yield $node;
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $subnode) {
// if($subnode != null) {
DOMIterate($subnode);
// }
}
}
}
And the testcase code that is supposed to print the results:
$doc = new DOMDocument();
$doc->loadXML($input);
foreach (DOMIterate($doc) as $node) {
$type = $node->nodeType;
if ($type == XML_ELEMENT_NODE) {
$tag = $node-> tagName;
echo "$tag\n";
}
else if ($type == XML_DOCUMENT_NODE) {
echo "document\n";
}
else if ($type == XML_TEXT_NODE) {
$text = $node->wholeText;
echo "text: $text\n";
} else {
$linenum = $node->getLineNo();
echo "unknown node type: $type at input line $linenum\n";
}
}
The input XML is the first 18 lines of
https://www.w3schools.com/xml/plant_catalog.xml
plus a closing
If you're using PHP7, you can try this:
<?php
$string = <<<EOS
<div level="1">
<div level="2">
<p level="3"></p>
<p level="3"></p>
</div>
<div level="2">
<span level="3"></span>
</div>
</div>
EOS;
$document = new DOMDocument();
$document->loadXML($string);
function DOMIterate($node)
{
yield $node;
if ($node->childNodes) {
foreach ($node->childNodes as $childNode) {
yield from DOMIterate($childNode);
}
}
}
foreach (DOMIterate($document) as $node) {
echo $node->nodeName . PHP_EOL;
}
Here's a working example - http://sandbox.onlinephpfunctions.com/code/ab4781870f8f988207da78b20093b00ea2e8023b
Keep in mind that you'll also get the text nodes that are contained within the tags.
Calling DOMIterate($subnode) recursively only creates a new generator object; nothing iterates it, so the values it yields never reach the caller of the original generator. You need to use yield from to propagate the values back.
function DOMIterate($node)
{
yield $node;
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $subnode) {
// if($subnode != null) {
yield from DOMIterate($subnode);
// }
}
}
}
This requires PHP 7. If you're using an earlier version, see Recursive generators in PHP
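For PHP 5.5 and 5.6, where yield from is not available, one common workaround is to loop over the inner generator and re-yield each value. This is a minimal sketch under that assumption; the name DOMIterateLegacy and the sample XML are made up for illustration.

```php
<?php
// Sketch for PHP 5.5/5.6: re-yield each value of the recursive generator by hand.
// DOMIterateLegacy is a hypothetical name used only for this example.
function DOMIterateLegacy($node)
{
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $subnode) {
            // Iterate the inner generator and pass its values up one level
            foreach (DOMIterateLegacy($subnode) as $descendant) {
                yield $descendant;
            }
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<a><b>text</b><c/></a>');

$names = array();
foreach (DOMIterateLegacy($doc) as $node) {
    $names[] = $node->nodeName;
}
echo implode(', ', $names), PHP_EOL;
```

Each level of recursion adds one more loop that every value has to pass through, so on very deep trees this is slower than yield from.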
How can I check if a p has a child node of iframe with DOMDocument?
For instance,
<p><iframe ....></p>
I want to print this only,
<iframe ....>
While,
<p>bla bla bal</p>
then do nothing or just print whatever inside the p,
<p>bla bla bal</p>
Or,
<p>bla bla <b>bal</b></p>
then do nothing or just print whatever inside the p,
<p>bla bla <b>bal</b></p>
my php,
$dom = new DOMDocument;
$dom->loadHTML($item_html);
if($dom->getElementsByTagName('p')->length > 1 )
{
...
}
else // if it is only a single paragraph... then do what I want above...
{
foreach ($dom->getElementsByTagName('p') as $node)
{
if ($node->hasChildNodes())
{
foreach( $dom->getElementsByTagName('iframe') as $iframe )
{
... something
}
}
else
{
...
}
}
}
is it possible?
You're trying to find all iframe elements that are the only childnodes of the p elements.
If found you want to replace their parent p element with them.
/** @var DOMElement $p */
foreach ($doc->getElementsByTagName('p') as $p) {
if ($p->childNodes->length !== 1) {
continue;
}
$child = $p->childNodes->item(0);
if (! $child instanceof DOMElement) {
continue;
}
if ($child->tagName !== 'iframe') {
continue;
}
$p->parentNode->insertBefore($child, $p);
$p->parentNode->removeChild($p);
}
This foreach loop iterates over all p elements and skips every one whose content is not exactly one child node that is a DOMElement with the iframe tag name (note: the tag name is always lowercase in the comparison).
If such a p element is found, the inner iframe is moved before it and the paragraph is then removed.
Usage Example:
<?php
/**
* @link http://stackoverflow.com/q/19021983/367456
*/
$html = '
<p><iframe src="...."></p>
<p>bla bla bal</p>
<p>bla bla <b>bal</b></p>
<p></p>
';
$doc = new DOMDocument();
$doc->loadHTML($html);
/** @var DOMElement[] $ps */
// $ps = $;
/** @var DOMElement $p */
foreach ($doc->getElementsByTagName('p') as $p) {
if ($p->childNodes->length !== 1) {
continue;
}
$child = $p->childNodes->item(0);
if (!$child instanceof DOMElement) {
continue;
}
if ($child->tagName !== 'iframe') {
continue;
}
$p->parentNode->insertBefore($child, $p);
$p->parentNode->removeChild($p);
}
// output
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
echo $doc->saveHTML($child);
}
Demo and Output:
<iframe src="...."></iframe>
<p>bla bla bal</p>
<p>bla bla <b>bal</b></p>
<p></p>
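The same selection can also be sketched with a single XPath query. DOMXPath::query() returns a result list that is not live, so removing paragraphs inside the loop does not disturb the iteration, whereas the list from getElementsByTagName() is live. The function name unwrapIframeParagraphs and the sample markup are assumptions for illustration.

```php
<?php
// Sketch: unwrap every <p> whose only child node is an <iframe>.
// unwrapIframeParagraphs is a hypothetical helper name.
function unwrapIframeParagraphs(DOMDocument $doc)
{
    $xpath = new DOMXPath($doc);
    $count = 0;
    // count(node()) = 1 -> exactly one child node; iframe -> that child is an iframe element
    foreach ($xpath->query('//p[count(node()) = 1 and iframe]') as $p) {
        $p->parentNode->insertBefore($p->firstChild, $p);
        $p->parentNode->removeChild($p);
        $count++;
    }
    return $count;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<div><p><iframe src="a"></iframe></p><p><iframe src="b"></iframe></p>'
    . '<p>bla bla <b>bal</b></p></div>'
);
$unwrapped = unwrapIframeParagraphs($doc);
echo $doc->saveHTML($doc->getElementsByTagName('div')->item(0)), PHP_EOL;
```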
Hope this is helpful.
So do this:
$dom = new DOMDocument;
$dom->loadHTML($item_html);
if($dom->getElementsByTagName('p')->length > 1 )
{
...
}
else // if it is only a single paragraph... then do what I want above...
{
foreach ($dom->getElementsByTagName('p') as $node)
{
if ($node->hasChildNodes())
{
if($node->getElementsByTagName('iframe')->length > 0 )
{
foreach( $node->getElementsByTagName('iframe') as $iframe )
{
... something
}
}
}
else
{
...
}
}
}
How do I turn the output into a variable so I can check whether it matches another variable I have set?
foreach ($nodes as $i => $node) {
echo $node->nodeValue;
}
I know this is incorrect and wouldn't work but:
foreach ($nodes as $i => $node) {
$target = $node->nodeValue;
}
$match = "some text"
if($target == $match) {
// Match - Do Something
} else {
// No Match - Do Nothing
}
Actually this solves my question but maybe not the right way about it:
libxml_use_internal_errors(true);
$dom = new DomDocument;
$dom->loadHTMLFile("http://www.example.com");
$xpath = new DomXPath($dom);
$nodes = $xpath->query("(//tr/td/a/span[@class='newprodtext' and contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'adidas')])[1]");
foreach ($nodes as $i => $node) {
echo $node->nodeValue, "\n";
$target[0] = $node->nodeValue;
}
$match = "adidas";
if($target == $match) {
// Match
} else {
// No Match
}
Your problem has more to do with a general understanding of loops, array assignment, and if conditionals in PHP than with XPath.
In your foreach loop, you assign each $node's nodeValue to the same index of the $target array, so $target will only ever hold one value (the last one).
In your if statement, you compare an array (or null if $nodes has no items, so you probably want to declare $target first) against the string 'adidas'; that comparison will never be true.
You probably want to do something like:
$matched = false;
$match = 'adidas';
foreach ($nodes as $i => $node) {
$nodeValue = trim(strip_tags($node->textContent));
if ($nodeValue === $match) {
$matched = true;
break;
}
}
if ($matched) {
// Match
} else {
// No Match
}
Update
I see that this xpath expression was given to you in another answer, that presumably already does the matching, so you just need to check the length property of $nodes
if ($nodes->length > 0) {
// Match
} else {
// No match
}
I have a string value from which I'm trying to extract list items. I'd like to extract the text and any subnodes; however, DOMDocument converts the entities to characters instead of leaving them in their original state.
I've tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. Note that I'm running on Win7 with PHP 5.2.17.
Example code is:
$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li></ul>';
echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;
$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($example);
$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;
for ($idx = 0; $idx < $count; $idx++) {
$value = trim(_get_inner_html($domNodeList->item($idx)));
/* remainder of processing and storing in database */
echo 'Saved '.$value.PHP_EOL;
}
function _get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
&frac12; ends up getting converted to ½ (the single-character UTF-8 version, not the entity version), which is not the desired format.
Solution for PHP versions before 5.3.6:
$html =<<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;
$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('li') as $node)
{
echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)), "\n";
}
Based on the answer provided by ajreal, I've expanded the example variable to handle more cases, and changed _get_inner_html() to make recursive calls and handle the entity conversion for text nodes.
It's probably not the best answer, since it makes some assumptions about the elements (such as no attributes). But since my particular needs don't require attributes to be carried across (yet.. I'm sure my sample data will throw that one at me later on), this solution works for me.
$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li>'.
'<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
'<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
'</ul>';
echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;
$doc = new DOMDocument();
$doc->resolveExternals = true;
$doc->substituteEntities = false;
$doc->loadHTML($example);
$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;
for ($idx = 0; $idx < $count; $idx++) {
$value = trim(_get_inner_html($domNodeList->item($idx)));
/* remainder of processing and storing in database */
echo 'Saved '.$value.PHP_EOL;
}
function _get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
echo 'Node type is '.$child->nodeType.PHP_EOL;
switch ($child->nodeType) {
case 3:
$innerHTML .= htmlentities(iconv('UTF-8', 'ISO-8859-1', $child->nodeValue));
break;
default:
echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
echo 'Node name '.$child->nodeName.PHP_EOL;
$innerHTML .= '<'.$child->nodeName.'>';
$innerHTML .= _get_inner_html( $child );
$innerHTML .= '</'.$child->nodeName.'>';
break;
}
}
return $innerHTML;
}
There is no need to iterate over the child nodes:
function innerHTML($node)
{
// serialize the node itself, then strip its own opening and closing tags
$html = $node->ownerDocument->saveXML($node);
return preg_replace("%^<{$node->nodeName}[^>]*>|</{$node->nodeName}>$%", '', $html);
}