How can I execute XPath queries on DOMElements using PHP?

I'm trying to run XPath queries on DOMElements, but it doesn't seem to work. Here is the HTML:
<html>
<div class="test aaa">
<div></div>
<div class="link">contains a link</div>
<div></div>
</div>
<div class="test bbb">
<div></div>
<div></div>
<div class="link">contains a link</div>
</div>
</html>
What I'm doing is this:
$dom = new DOMDocument();
$html = file_get_contents("file.html");
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//div[contains(@class,'test')]");
if (!$entries->length > 0) {
    echo "Nothing\n";
} else {
    foreach ($entries as $entry) {
        $link = $xpath->query('/div[@class=link]', $entry);
        echo $link->item(0)->nodeValue;
        // => PHP Notice: Trying to get property of non-object
    }
}
Everything works fine up to $xpath->query('/div[@class=link]', $entry);. I don't know how to use XPath on a particular DOMElement ($entry).
How can I use XPath queries on a DOMElement?

It looks like you're trying to mix CSS selectors with XPath. You want a predicate ([...]) that tests the value of the class attribute.
For example, your //div.link might look like //div[contains(concat(' ',normalize-space(@class),' '),' link ')].
Secondly, within the loop you pass a context node to the query but then ignore it by using an absolute location path (one that starts with a slash).
Updated to reflect changes to the question:
Your second XPath expression (/div[@class=link]) is still a) absolute, and b) has an incorrect condition: unquoted, link is read as a child element name rather than a string. You want to ask for matching elements relative to the specified context node ($entry) whose class attribute has the string value link.
So /div[@class=link] should become something like div[@class="link"], which searches the children of each $entry element (use .//div[...] or descendant::div[...] if you want to search deeper).
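As a minimal sketch of the corrected loop, assuming the markup from the question is inlined as a string (the $found array is only there for illustration):

```php
<?php
// Sample markup from the question, inlined so the sketch is self-contained.
$html = <<<HTML
<html>
<div class="test aaa">
<div></div>
<div class="link">contains a link</div>
<div></div>
</div>
<div class="test bbb">
<div></div>
<div></div>
<div class="link">contains a link</div>
</div>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$found = [];
foreach ($xpath->query("//div[contains(@class,'test')]") as $entry) {
    // No leading slash: the path is evaluated relative to $entry.
    $links = $xpath->query('div[@class="link"]', $entry);
    if ($links->length > 0) {
        $found[] = $links->item(0)->nodeValue;
    }
}
print_r($found);
```

Checking $links->length before calling item(0) avoids the "property of non-object" notice when an entry has no matching child.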

Related

Cleaning up deprecated HTML code with DOMXPath (convert nested <div> tags to <p> tags)

I'm trying to read Rich Text stored in an old MS Access database into a new PHP web app. The sanitised data will be displayed to users using CKEditor, which is quite strict on parsing standards compliant HTML code. However, the data stored in MS Access is often ill-formatted or uses deprecated HTML code.
Below is an example piece of data I am trying to sanitise:
<div align="right">Previous claim $ 935.00<div align="right"> This claim $1,572.50</div></div>
This data is meant to be two lines of text that are right-justified, however MS Access has used the deprecated align attribute to style the <div> tags instead of a style attribute, and has incorrectly nested them when in this scenario they should be sequential.
To turn this example data into two lines of text that are both right-justified and that CKEditor will read and display as intended (i.e. text appears as right justified), I am trying to replace the <div> tags with <p> tags, and inject an inline style attribute with right text-align to replace the deprecated align attribute.
I am using PHP's DOMXPath to clean up the data, with the following code:
$dom = new DOMDocument();
$dom->loadHTML($dataForCleaning, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[@align]') as $node) {
    $alignment = $node->getAttribute('align');
    $newNode = $dom->createElement('p');
    $newNode->setAttribute("style", "text-align:".$alignment);
    $node->parentNode->insertBefore($newNode, $node);
    foreach ($node->childNodes as $child) {
        $newNode->appendChild($child);
    }
    $node->parentNode->removeChild($node);
}
I am using insertBefore in lieu of appendChild in trying to keep the sequence of elements the same, but this is what's causing the issues in this nested data example.
For non-nested <div> tags as the input data to be cleaned, the sanitised output html is correct. However, in this nested <div> example, the output ends up being:
<p style="text-align:right">Previous claim $ 935.00</p>
Note that the second line of text (This claim...) has been removed, as it was within a nested <div> as a child to the parent <div>
I don't mind if the resultant <p> tags remain nested, as CKEditor ends up cleaning these up, but I do need to make sure I'm not losing data like this current code does.
Thanks in advance for any help and guidance.
-Mark
There are a couple of things I've changed. First, rather than appending the existing node, I clone it and append the copy (in $newNode->appendChild($child->cloneNode(true));). Second, because you are moving the enclosed node, I think the XPath result no longer points to the moved node. So instead, while copying the child nodes I check whether a child matches the same <div align="right"> pattern and, if so, create a new node in the new format and add that instead:
foreach ($xpath->query('//div[@align]') as $node) {
    $alignment = $node->getAttribute('align');
    $newNode = $dom->createElement('p');
    $newNode->setAttribute("style", "text-align:".$alignment);
    $node->parentNode->insertBefore($newNode, $node);
    foreach ($node->childNodes as $child) {
        // getAttribute() returns "" when align is absent, so this is safe for any <div>
        if ( $child instanceof DOMElement && $child->localName == "div"
                && $child->getAttribute("align") == "right" ) {
            $subNode = $dom->createElement('p', $child->nodeValue );
            $subNode->setAttribute("style", "text-align:".$alignment);
            $newNode->appendChild($subNode);
        }
        else {
            $newNode->appendChild($child->cloneNode(true));
        }
    }
    $node->parentNode->removeChild($node);
}
which for the sample you give will output...
<p style="text-align:right">
Previous claim $ 935.00
<p style="text-align:right"> This claim $1,572.50</p>
</p>
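An alternative sketch, not taken from the answer above: childNodes is a live list, so moving a child out of it inside a foreach makes the iteration skip the next sibling, which is one likely reason the nested <div> was lost. Draining the node with a while loop on firstChild sidesteps that. The function name convertAlignedDivs is hypothetical, and replaceChild is used here in place of the insertBefore/removeChild pair.

```php
<?php
// Hypothetical helper: turns every <div align="..."> into <p style="text-align:...">.
function convertAlignedDivs(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//div[@align]') as $node) {
        $newNode = $dom->createElement('p');
        $newNode->setAttribute('style', 'text-align:' . $node->getAttribute('align'));

        // Move (not copy) every child; firstChild changes as each one is moved,
        // so no sibling is skipped.
        while ($node->firstChild !== null) {
            $newNode->appendChild($node->firstChild);
        }

        // Swap the emptied <div> for the filled <p> in the same position.
        $node->parentNode->replaceChild($newNode, $node);
    }

    return $dom->saveHTML();
}

echo convertAlignedDivs(
    '<div align="right">Previous claim $ 935.00<div align="right"> This claim $1,572.50</div></div>'
);
```

The nested <div> is moved into the outer <p> first and then converted on a later iteration, so the resulting <p> tags stay nested, which the question says is acceptable.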

DOMXPath query does not return a node list

I've been working on a practice application that uses XPath to retrieve elements from another website.
I used DOMXPath for it, but it is not returning a result or node list.
Here's the code:
$DOM = new DOMDocument();
@$DOM->loadHTML($html);
$xpath = new DOMXPath($DOM);
$nodes = $xpath->query("//span[contains(@style,'font-size:25px;')]");
foreach ($nodes as $node) {
    echo $node->nodeValue;
}
The page source of the example:
<div class="front-view-content full-post">
<p>
<span style="font-size:25px; color:#98293D;">RED</span><br>
<span style="font-size:25px; color:#98293D;">BLUE</span><br>
<span style="font-size:25px; color:#98293D;">WHITE</span></p>
</div>
It doesn't return anything, just blank output.
There is no semicolon ; in the source, so the XPath doesn't match.
$nodes = $finder->query('//span[@style="font-size:25px"]');
Should work.
Trying to match attributes that contain a certain value is a little more complicated than just doing [@style="your search string"], as this will only match a style attribute that exactly matches your search string.
To my knowledge there are no shorthand selectors in XPath, similar to the ones in CSS for instance, that allow you to do [@style*="your search string"] or [@style~="your search string"], etc.
To test if a string contains another string, you use the contains() function. Your example XPath query would then have to be transformed to:
//span[contains(@style,"font-size:25px;")]
Be aware, though, that matching isolated strings at a word boundary (such as matching the class main in class="nav main", but not in class="nav maintenance") gets a little more complicated still. I'll refer you to this answer for such an example.
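A minimal sketch of the word-boundary case, using the common concat/normalize-space idiom; the markup is a made-up example built around the class names mentioned above:

```php
<?php
// Made-up markup: one element has the class token "main", the other "maintenance".
$html = '<ul><li class="nav main">A</li><li class="nav maintenance">B</li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Plain substring test: also matches "maintenance".
$loose = $xpath->query('//li[contains(@class, "main")]');

// Pad the normalized attribute with spaces, then look for " main " as a whole token.
$strict = $xpath->query(
    '//li[contains(concat(" ", normalize-space(@class), " "), " main ")]'
);

echo $loose->length, "\n";              // 2
echo $strict->length, "\n";             // 1
echo $strict->item(0)->nodeValue, "\n"; // A
```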

Is there an easy way to get subelements with DomDocument and DomXPath?

Suppose I have HTML like this:
<div id="container">
<li class="list">
Test text
</li>
</div>
And I want to get the contents of the li.
I can get the contents of the container div using this code:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo $dom->saveHTML($xpath->query("//div[@id='container']")->item(0));
I was hoping I could get the contents of the subelement by simply adding it to the query (like how you can do it in simpleHtmlDom):
echo $dom->saveHTML($xpath->query("//div[@id='container'] li[@class='list']")->item(0));
But a warning (followed by a fatal error) was thrown, saying:
Warning: DOMXPath::query(): Invalid expression ...
The only way I know of to do what I'm wanting is this:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
$dom2 = new \DomDocument;
$dom2->loadHTML(trim($dom->saveHTML($xpath->query("//div[@id='container']")->item(0))));
$xpath2 = new \DomXPath($dom2);
echo $xpath2->query("//li[@class='list']")->item(0)->nodeValue;
However, that's an awful lot of code just to get the contents of the li, and the problem is that as items are nested deeper (like if I want to get div#container ul.container li.list) I have to continue adding more and more code.
With simpleHtmlDom, all I would have had to do is:
$html->find('div#container li.list', 0);
Am I missing an easier way to do things with DomDocument and DomXPath, or is it really this hard?
You were close in your initial attempt; your syntax was just off by a character. Try the following XPath:
//div[@id='container']/li[@class='list']
You can see you had a space between the div node and the li node where there should be a forward slash.
SimpleHTMLDOM uses CSS selectors, not XPath. Almost anything that can be done with CSS selectors can be done with XPath, too. DOMXPath::query() only supports XPath expressions that return a node list, but XPath can return scalars, too (use DOMXPath::evaluate() for those).
In XPath, a / separates the steps of a location path, not a space. It has two additional meanings. A / at the start of a location path makes it absolute (it starts at the document, not the current context node). A double slash (//) is the abbreviated syntax for descending through the tree: it expands to /descendant-or-self::node()/, so //li matches li elements at any depth.
Try:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo trim($xpath->evaluate("string(//div[@id='container']//li[@class='list'])"));
Output:
Test text
In CSS selector sequences the space is a combinator for two selectors.
CSS: foo bar
Xpath short syntax: //foo//bar
Xpath full syntax: /descendant::foo/descendant::bar
Another combinator would be > for a child. This axis is the default one in Xpath.
CSS: foo > bar
Xpath short syntax: //foo/bar
Xpath full syntax: /descendant::foo/child::bar
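To make the difference between the two combinators concrete, here is a small sketch. The markup and the class name item are assumptions chosen so that one span is a direct child of the container and one is nested deeper:

```php
<?php
// Assumed markup: one span is a direct child of #container, one is nested in an inner div.
$html = '<div id="container">'
      . '<span class="item">direct</span>'
      . '<div><span class="item">nested</span></div>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// CSS "div#container span.item" (descendant combinator)
$descendants = $xpath->query("//div[@id='container']//span[@class='item']");
// CSS "div#container > span.item" (child combinator)
$children = $xpath->query("//div[@id='container']/span[@class='item']");

echo $descendants->length, "\n"; // 2
echo $children->length, "\n";    // 1

// evaluate() can return a scalar; query() only returns node lists.
$total = $xpath->evaluate("count(//div[@id='container']//span)");
echo $total, "\n";               // 2
```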

PHP DOMXPath problem

$xpath = new DOMXpath($doc);
$res = $xpath->query(".//*[@id='post2679883']/tr[2]/td[2]/div[2]");
foreach( $res as $obj ) {
    var_dump($obj->nodeValue);
}
I need to select all the elements whose id contains the word "post".
Example:
<div id="post2242424">trarata</div>
<div id="post114525">trarata</div>
<div id="post8568686">trarata</div>
Question number two:
I need to get these elements with their HTML tags, but $obj->nodeValue returns text without HTML tags.
You could use the XPath function starts-with() to filter the nodes if all the nodes you want have an id that starts with "post". For example:
$xpath->query(".//*[starts-with(@id, 'post')]/tr[2]/td[2]/div[2]");
The second part has, I think, been answered already: see "PHP DOMDocument stripping HTML tags".
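A minimal sketch covering both parts, assuming simplified markup based on the example divs (the <b> tag and the id other1 are added for illustration). Passing a node to DOMDocument::saveHTML() is one way to keep the tags; it is not necessarily what the linked answer suggests.

```php
<?php
// Simplified markup based on the question's example; the <b> tag and "other1" are illustrative.
$html = '<div id="post2242424"><b>trarata</b></div>'
      . '<div id="post114525">trarata</div>'
      . '<div id="other1">skip</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$markup = [];
foreach ($xpath->query("//*[starts-with(@id, 'post')]") as $node) {
    // saveHTML($node) serializes the node with its tags; nodeValue would give text only.
    $markup[] = $doc->saveHTML($node);
}
print_r($markup);
```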

What xPath should I use to display the requested data?

I am using the following script to get the post title and the content of an RSS feed. Its structure is as follows (I hope I haven't made any errors):
<div id="feedBody">
<div id="feedContent">
<div class="entry">
<h3>TITLE OF POST</h3>
<div base="http://feeds.feedburner.com/blogspot/hyMBI"
class="feedEntryContent"
> CONTENT OF POST </div>
</div>
</div>
</div>
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://feeds.feedburner.com/blogspot/hyMBI');
libxml_clear_errors();
$xPath = new DOMXPath($dom);
$links = $xPath->query('????????????????');
foreach($links as $link) {
printf("%s \n", $link->nodeValue);
}
?>
What XPath should I use to get the data? Is there any way of getting them separately?
Thanks a million, hopefully this is my last question on my project...
First, you should load the XML using load, not loadHTMLFile.
Judging by your variable name "$links", I guess you're wanting the values of the <link> elements inside the <item> elements. So construct an xpath query that says just that: //item/link.
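A sketch of that approach, assuming the feed is plain RSS 2.0 without namespaces. The inline feed below is a hypothetical stand-in for the real URL, and an Atom feed would additionally need DOMXPath::registerNamespace().

```php
<?php
// Hypothetical minimal RSS 2.0 document standing in for the real feed.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($rss); // for a remote feed, $dom->load($url) plays the same role
$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->query('//item/link') as $link) {
    $links[] = $link->nodeValue;
}
print_r($links);
```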
Basic XPath: //div[@class="entry"] gets you a node list of all entries. You can get the first (or only) entry with //div[@class="entry"][1]. With that as the context node, you can use h3 to get the title node and div[1] to get the contents (if it's guaranteed that there's only one, otherwise specify the class).
You can put them together like //div[@class="entry"][1]/h3 if you like, so that you only have to query from the root node. Otherwise, save the new node for the next query, like:
$entries = $xPath->query('//div[@class="entry"][1]');
foreach ($entries as $entry) {
    // string(...) makes evaluate() return the text instead of a node list
    $title = $xPath->evaluate('string(h3[1])', $entry);
    $post = $xPath->evaluate('string(div[1])', $entry);
}
If your RSS returns a whole group of posts, you can leave off the first [1] and loop through the whole group this way.
