XPath: how to iterate all text nodes?

How can I iterate over all text nodes inside a paragraph? They can be nested two or three levels deep.
For example, take the following paragraph:
<p>Lorem <i>ipsum dolor</i> sit <span>amet, <b><i>consectetur</i> adipisicing</b> elit</span>. Odit, sunt?</p>
I want to process every text node in it and put the result back in the same place.
$content = '<p>Lorem <i>ipsum dolor</i> sit <span>amet, <b><i>consectetur</i> adipisicing</b> elit</span>. Odit, sunt?</p>';
$html = new DOMDocument();
$html->loadHTML($content);
$xpath = new DOMXpath($html);
$elements = $xpath->query('//descendant-or-self::p//node()');
// My processor (not working...)
foreach ($elements as $element) {
// Processed, only text nodes (not working...)
if ( $element->nodeType == 3 ) {
function() {
return $element->nodeValue = '<span style="background-color: yellow;">' . $element->nodeValue . '</span>';
}
}
// return to the place
echo $element->C14N();
}
The expected result is:
<p>
<span style="background-color: yellow;">Lorem </span>
<i>
<span style="background-color: yellow;">ipsum dolor</span>
</i>
<span style="background-color: yellow;">sit </span>
<span>
<span style="background-color: yellow;">amet, </span>
<b><i>
<span style="background-color: yellow;">consectetur</span>
</i>
<span style="background-color: yellow;">adipisicing</span>
</b>
<span style="background-color: yellow;">elit</span>
</span>
<span style="background-color: yellow;">. Odit, sunt?</span>
</p>

This will wrap all the text nodes into span elements:
$content = '<p>Lorem <i>ipsum dolor</i> sit <span>amet, <b><i>consectetur</i> adipisicing</b> elit</span>. Odit, sunt?</p>';
$html = new DOMDocument();
$html->loadHTML($content);
$xpath = new DOMXpath($html);
$elements = $xpath->query('//descendant-or-self::p//text()');
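/** @var DOMText $element */
foreach ($elements as $element) {
// createTextNode() escapes special characters such as "&"; the value argument of createElement() does not
$span = $html->createElement("span");
$span->appendChild($html->createTextNode($element->nodeValue));
$span->setAttribute("style", "background-color: yellow;");
$element->parentNode->replaceChild($span, $element);
}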
echo $html->saveHTML();
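
Note that saveHTML() without arguments prints the whole document, including the doctype and the html/body wrappers that loadHTML() adds. A minimal sketch of a variant that prints only the paragraph, assuming whitespace-only text nodes should be skipped (wrapTextNode is a hypothetical helper name, not part of the DOM extension):

```php
<?php
$content = '<p>Lorem <i>ipsum dolor</i> sit <span>amet, <b><i>consectetur</i> adipisicing</b> elit</span>. Odit, sunt?</p>';

$html = new DOMDocument();
$html->loadHTML($content);
$xpath = new DOMXPath($html);

// Hypothetical helper: wraps a single text node in a highlighted <span>.
function wrapTextNode(DOMText $text): void
{
    $span = $text->ownerDocument->createElement('span');
    $span->setAttribute('style', 'background-color: yellow;');
    // Put the span where the text node was, then move the text node into it.
    $text->parentNode->replaceChild($span, $text);
    $span->appendChild($text);
}

// Assumption: whitespace-only text nodes are not worth wrapping.
foreach ($xpath->query('//p//text()[normalize-space()]') as $text) {
    wrapTextNode($text);
}

// Passing a node to saveHTML() serializes only that node and its subtree.
$paragraph = $xpath->query('//p')->item(0);
$result = $html->saveHTML($paragraph);
echo $result, "\n";
```

Moving the original text node into the span, rather than copying its value, keeps the text exactly as parsed and avoids any escaping issues.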

Related

Get H2 text and href values from inside all H2 tags on the page using xpath?

I know nothing, ZERO, about xpath or DOM.
In the end I need the href value and the content of the span from 12 H2 tags on the page. I have figured out how to get each item individually but getting them all in one shot isn't clicking, no matter how much I read. A little help?
<h2 class="make-it-pretty">
<a class="more-pretty" href="some-file-somewhere">
<span class="another-class">Product Name</span>
</a>
</h2>
Here is what I use to get them individually.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$htext = $xpath->query('//h2[contains(@class, "make-it-pretty")]')->item(0);
echo $htext->textContent;

I would probably use $doc->loadHTMLFile instead, but:
<?php
$html = '<html lang="en"><head><meta charset="UTF-8" /><title>Title Here</title></head>
<body>
<h2 class="make-it-pretty"><a class="more-pretty" href="some-file-somewhere"><span class="another-class">Product Name</span></a></h2>
</body></html>';
$doc = new DOMDocument();
$doc->loadHTML($html);
function getElementsByClassName($className, $withinNode = null){
global $doc;
$d = $withinNode ?? $doc;
$r = []; $a = $d->getElementsByTagName('*');
foreach($a as $n){
if($n->getAttribute('class') === $className)$r[] = $n;
}
return $r;
}
$anotherClass = getElementsByClassName('another-class');
// getElementsByClassName('make-it-pretty'); works as well, in this case
echo $anotherClass[0]->textContent;
?>

Try this without XPath:
<?php
$html ='<h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2>';
$dom = new DOMDocument("1.0", "utf-8");
if($dom->loadHTML($html, LIBXML_NOWARNING)){
$h2s = $dom->getElementsByTagName('h2');
foreach ($h2s as $h2) {
$as = $h2->getElementsByTagName('a');
echo '<pre>';
//print_r($as);
foreach($as as $a){
print_r('link :'.$a->getAttribute('href')."\n");
// Read the spans of this particular link, so each link is paired with its own content
foreach($a->getElementsByTagName('span') as $span){
print_r('content :'.$span->nodeValue."\n");
}
}
}
}
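
Since the question asks for XPath, here is a minimal sketch that collects the href and the span text of every matching h2 in one pass. The two sample products and their href values are made up for illustration; the class names come from the question.

```php
<?php
// Sample markup: the hrefs and product names are illustrative assumptions.
$html = '<h2 class="make-it-pretty"><a class="more-pretty" href="file-one"><span class="another-class">Product One</span></a></h2>'
      . '<h2 class="make-it-pretty"><a class="more-pretty" href="file-two"><span class="another-class">Product Two</span></a></h2>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$products = [];
// Select the link inside every h2 whose class attribute contains "make-it-pretty".
foreach ($xpath->query('//h2[contains(@class, "make-it-pretty")]/a') as $a) {
    $products[] = [
        'href' => $a->getAttribute('href'),
        // string(span) is evaluated relative to the current <a> element.
        'name' => trim($xpath->evaluate('string(span)', $a)),
    ];
}

print_r($products);
```

Passing the current node as the second argument keeps each inner expression relative to that link, so every href is paired with its own span.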

How to extract the text using XPath between tag and some end tag

I have given the following HTML. The class names are always the same. Only the text between the tags varies and has different length and content.
<a>
<span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text <q class="aaa">this not</q></span>
</a>
How do I extract the content between the tag with class "zzz" and the end of the line, without including the element with class "aaa" in the result? Is it possible?
The element with class "aaa" may or may not exist:
<a>
<span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text</span>
</a>
The expected result should be:
This is the required text
The part "the required text" may or may not exist either:
<a>
<span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span></span>
</a>
so the result should be:
This is
I am trying to do this in PHP using DOMXPath.

XPath solution :
$xml = <<<'XML'
<a><span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text</span></a>
XML;
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$elements = $xpath->query('//text()[parent::*[not(@class="aaa")]][preceding::span[@class="yyy"]][normalize-space()]');
foreach($elements as $element)
echo $element->nodeValue;
Output :
This is the required text
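
An alternative sketch that starts from the "zzz" element itself and only appends the node directly after it when that node is text. It assumes well-formed input for loadXML() and at most one "zzz" span; textFromZzz is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the "zzz" text plus the text node that directly follows it, if any.
function textFromZzz(string $xml): string
{
    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);

    // node()[1][self::text()] takes the nearest following sibling and keeps it
    // only if it is a text node, so a directly following <q> contributes nothing.
    $value = $xpath->evaluate(
        'concat(string(//span[@class="zzz"]), '
        . 'string(//span[@class="zzz"]/following-sibling::node()[1][self::text()]))'
    );

    return trim($value);
}

$withQ = '<a><span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text <q class="aaa">this not</q></span></a>';
$withoutQ = '<a><span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text</span></a>';
$zzzOnly = '<a><span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span></span></a>';

echo textFromZzz($withQ), "\n";
echo textFromZzz($withoutQ), "\n";
echo textFromZzz($zzzOnly), "\n";
```

Because string() of an empty node set is an empty string, the same expression covers the case where nothing follows the "zzz" element.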

I don't know how to do this with XPath, necessarily, but here is a way you could do it without XPath.
function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from walk($n);
}
}
}
$html = <<<'HTML'
<span class="xxx">
Not this text
<span class="yyy">not this text</span>
<span class="zzz">This is</span>
the required text
<q class="aaa">this not</q>
</span>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$count = 0;
foreach(walk($dom->firstChild) as $node) {
if (!($node instanceof DOMText) && $node->hasAttribute('class') && $node->getAttribute('class') === 'xxx') {
foreach(walk($node) as $n) {
if (isset($content)) {
$count++;
}
if (!($n instanceof DOMText) && $n->hasAttribute('class') && $n->getAttribute('class') === 'zzz') {
$content = $n->textContent;
}
if (isset($content) && $n instanceof DOMText && $count == 2) {
$content .= " " . $n->textContent;
break 2;
}
}
}
}
var_dump($content);
This gives you the desired result whether or not the "the required text" part is there.

PHP Simple HTML DOM Parser, Remove attributes from the TAG without any specific unique input

My input:
<div id='makeme' class='testme'>
<span id='whatspanID'>somthing</span>
<p class='ptagclass'></p>
</div>
My expected output:
<div>
<span></span>
<p></p>
</div>
To remove the content inside the tags, I can use the snippet below, but how do I remove the attributes from the tags?
$html = str_get_html($str);
foreach($html->find("text") as $ht) {
$ht->innertext = "";
}
$html->save();

Using DOM and Xpath allows you to select text and attribute nodes.
$html = <<<'HTML'
<div id='makeme' class='testme'>
<span id='whatspanID'>somthing</span>
<p class='ptagclass'></p>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
$div = $xpath->evaluate('//div[@id="makeme"]')->item(0);
$nodes = $xpath->evaluate('.//text()|@*|.//*/@*', $div);
foreach ($nodes as $node) {
if ($node instanceof DOMAttr) {
$node->parentNode->removeAttributeNode($node);
} else {
$node->parentNode->removeChild($node);
}
}
echo $dom->saveHtml($div);
Output:
<div>
<span></span><p></p>
</div>

Why the getElementsByTagName is not working in this example

I have a DOMElement with this content:
$cell = <td colspan=3>
<p class=5tablebody>
<span style='position:relative;top:14.0pt'>
<img width=300 height=220 src="forMerrin_files/image020.png">
</span>
</p>
</td>
From it, I am getting the p element with:
$paragraphs = $xpath->query('.//p', $cell);
My goal is to get the img element from the cell element.
I have tried:
$paragraph->getElementsByTagName('img')->item(0);
But I am getting null. Any idea why?
Thank you

Is this what you are after?
$htmlStr = '<td colspan=3>
<p class=5tablebody>
<span style=\'position:relative;top:14.0pt\'>
<img width=300 height=220 src="forMerrin_files/image020.png">
</span>
</p>
</td>';
$doc = new DOMDocument();
$doc->loadHTML($htmlStr);
$paragraphs = $doc->getElementsByTagName('img');
var_dump($paragraphs->item(0)->getAttribute('src'));
Outputs:
string 'forMerrin_files/image020.png' (length=28)

The second argument of DOMXPath::query() has to be a context node; you cannot just pass an HTML string. I suggest using DOMXPath::evaluate() anyway. Both methods take the same arguments, but query() is limited to XPath expressions that return a node list, while evaluate() also allows expressions that return scalars.
$html = <<<HTML
<td colspan=3>
<p class=5tablebody>
<span style='position:relative;top:14.0pt'>
<img width=300 height=220 src="forMerrin_files/image020.png">
</span>
</p>
</td>
HTML;
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
// for each td element
foreach ($xpath->evaluate('//td') as $cell) {
// for each img inside a p
foreach ($xpath->evaluate('.//p//img', $cell) as $img) {
var_dump($img->getAttribute('src'));
}
}
Output: https://eval.in/147576
string(28) "forMerrin_files/image020.png"
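
To illustrate the scalar results mentioned above, here is a minimal sketch using string(), count() and boolean() with a context node. The surrounding table and tr elements and the quoted attributes are assumptions added to keep the sample markup valid.

```php
<?php
// Assumption: the cell is wrapped in table/tr so the sample is valid HTML.
$html = <<<'HTML'
<table><tr><td colspan="3">
<p class="tablebody">
<span style="position:relative;top:14.0pt">
<img width="300" height="220" src="forMerrin_files/image020.png">
</span>
</p>
</td></tr></table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// A node-list expression still returns a DOMNodeList.
$cell = $xpath->evaluate('//td')->item(0);

// Scalar expressions return PHP scalars directly, relative to the context node.
$src = $xpath->evaluate('string(.//p//img/@src)', $cell);  // string
$count = $xpath->evaluate('count(.//img)', $cell);         // float
$hasImage = $xpath->evaluate('boolean(.//p//img)', $cell); // bool

var_dump($src, $count, $hasImage);
```

With string(), a missing image yields an empty string instead of a null item, which removes the need for an explicit null check before reading the attribute.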

PHP DOMDocument parse HTML

I have the following HTML markup
<div contenteditable="true" class="text"></div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p>-<span class='create'></span>
<a class='permalink' href=""></a>
</div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p><span class='create'></span><a class='permalink' href=""></a>
</div>
There can be more parent divs. To parse the information and insert it into the database, I'm using the following code:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div');
$i=0;
$q=1;
foreach($div as $book) {
$attr = $book->getAttribute('class');
//if div contenteditable
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
}
else {
$new = new DOMDocument();
$newxpath = new DOMXPath($new);
$avatar = $xpath->query("(//img[@class='avatar']/@src)[$q]");
$picture = $xpath->query("(//p/img[@class='pic']/@src)[$q]");
$fulltext = $xpath->query("(//p/span[@class='fulltext'])[$q]");
$permalink = $xpath->query("(//a[@class='permalink'])[$q]");
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
$q++;
}
$i++;
}
But I think that there's a better way for parsing the HTML. Is there? Thank you in advance

Note that DOMXPath::query() accepts a second parameter, the context node. You also don't need a second DOMDocument and DOMXPath inside the loop. Use:
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
to get the src attribute nodes of the <img> elements relative to each div node. If you follow this advice, your example should work.
Here is a version of your code that applies the above:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div');
foreach($divs as $book) {
$attr = $book->getAttribute('class');
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
} else {
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
$picture = $xpath->query("p/img[@class='pic']/@src", $book);
$fulltext = $xpath->query("p/span[@class='fulltext']", $book);
$permalink = $xpath->query("a[@class='permalink']", $book);
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
}
}
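
A variant sketch of the same idea that collects the values into an array, ready for a database insert, instead of echoing them. The sample values (avatar.png, pic.png, /post/1 and so on) are invented for illustration; evaluate() with string() is used so that a missing element yields an empty string rather than an error on item(0).

```php
<?php
// Sample markup: the attribute values and text are illustrative assumptions.
$xml = <<<'HTML'
<div contenteditable="true" class="text">Intro</div>
<div style="display: block;" class="ui-draggable">
<img class="avatar" src="avatar.png"/>
<p><img class="pic" src="pic.png"/><br>
<span class="fulltext" style="display:none">Full text</span></p>
<span class="create"></span>
<a class="permalink" href="/post/1">2014-05-01</a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);

$rows = [];
// Every expression below is relative to the current div (the context node).
foreach ($xpath->query('//div[@class="ui-draggable"]') as $book) {
    $rows[] = [
        'avatar'    => $xpath->evaluate('string(img[@class="avatar"]/@src)', $book),
        'picture'   => $xpath->evaluate('string(p/img[@class="pic"]/@src)', $book),
        'fulltext'  => $xpath->evaluate('string(p/span[@class="fulltext"])', $book),
        'permalink' => $xpath->evaluate('string(a[@class="permalink"]/@href)', $book),
        'date'      => $xpath->evaluate('string(a[@class="permalink"])', $book),
    ];
}

print_r($rows);
```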

As a matter of fact, you are doing it the right way: HTML should be parsed with a DOM object.
Some optimisation is still possible:
$div = $xpath->query('//div');
is rather heavy for this purpose; getElementsByTagName should be more appropriate:
$div = $dom->getElementsByTagName('div');
