Fetching value of specific text node using DOMXPath - php

From the following structure:
I'm trying to fetch the marked text with the following code:
$price_new='div/div[@class="cat_price"]/text()';
if ($price_new!=null && $node = $Website_Xpath->query ($price_new, $row )) {
$result [$value] ['Price'] = $node->item( 0 )->nodeValue;
} else {
$result [$value] ['Price'] = "";
}
but the node value is NULL. How do I fetch the number correctly?

You should provide the actual snippet, not just a screenshot of it. If I interpreted the screenshot correctly the snippet is something like:
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
"
64,9999"<span> - PKR</span>
</div>
</body>
XML;
The text node with the price is the following sibling of the div with the class was. So it is possible to fetch it using that axis:
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$expression = 'string(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1])';
var_dump($xpath->evaluate($expression));
Unlike DOMXpath::query(), DOMXpath::evaluate() can return scalar values depending on the expression. A string cast or a string function will return a string.
string(25) "
"
64,9999""
However, the result contains not only the number but also the quotes and some whitespace. translate() and normalize-space() can be used to clean it up:
$expression = 'normalize-space(
translate(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1], \'"\', " ")
)';
var_dump($xpath->evaluate($expression));
Output:
string(7) "64,9999"

Your $Website_Xpath looks like a DOMXPath object. If so, the main issue with your code is the XPath expression: 'div/div[@class="cat_price"]/text()'. It tries to fetch a div from nowhere. Either provide the full path from the root node (e.g. /html/body/div), or select all divs with the // prefix.
Example
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
64,9999<span> - PKR</span>
</div>
</body>
XML;
$doc = new DOMDocument();
$doc->loadXML($xml);
$text = '';
$xpath = new DOMXPath($doc);
// Select all text nodes within a <div> having class="cat_price"
if ($nodes = $xpath->query('//div[@class="cat_price"]/text()')) {
// Search for a node with some content, except spaces
foreach ($nodes as $n) {
if ($text = trim($n->nodeValue))
break;
}
}
var_dump($text);
Output
string(7) "64,9999"
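The question passes $row as the second argument to query(). With a context node, an expression without a leading slash is resolved relative to that node rather than the whole document. A minimal sketch of that behaviour, assuming each product sits in a wrapper element (the "row" class and the sample markup are made up for illustration, they are not from the question):

```php
<?php
// Hypothetical markup: the "row" wrapper is an assumption for this sketch.
$xml = <<<'XML'
<body>
  <div class="row">
    <div class="cat_price"><div class="was">67,000 - PKR</div>64,999<span> - PKR</span></div>
  </div>
</body>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$prices = [];
foreach ($xpath->query('//div[@class="row"]') as $row) {
    // No leading slash: the path starts at $row, the context node.
    // text()[normalize-space()][1] picks the first non-blank text node.
    $prices[] = $xpath->evaluate(
        'string(div[@class="cat_price"]/text()[normalize-space()][1])',
        $row
    );
}
var_dump($prices);
```

If the relative path does not match the structure below the context node, the string() cast simply yields an empty string, which is easy to mistake for a parsing problem.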

Related

How to query a xml file using xpath (php) ?

I am trying to query an XML file using XPath, but I get nothing in return. I think I wrote the query incorrectly.
XML
<subject id="Tom">
<relation unit="ITSupport" role="ITSupporter" />
</subject>
PHP
$xpath = new DOMXpath($doc);
$role = 'ITSupporter';
$elements = $xpath-> query("//subject/@id[../relation/@role='".$role."']");
foreach ($elements as $element) {
$name = $element -> nodeValue;
$arr[$i] = $name;
$i = $i + 1;
}
How can I get the id Tom? I want to save it to a variable, for example $var.
Building up the XPath expression:
Fetch any subject element: //subject
... with a child element relation: //subject[relation]
... that has a role attribute with the given text: //subject[relation/@role="ITSupporter"]
... and get the @id attribute of subject: //subject[relation/@role="ITSupporter"]/@id
Additionally the source could be cleaned up. PHP arrays can use the $array[] syntax to push new elements into them.
Put together:
$xml = <<<'XML'
<subject id="Tom">
<relation unit="ITSupport" role="ITSupporter" />
</subject>
XML;
$role = 'ITSupporter';
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$ids = [];
foreach ($xpath->evaluate("//subject[relation/@role='".$role."']/@id") as $idAttribute) {
$ids[] = $idAttribute->value;
}
var_dump($ids);
Output:
array(1) {
[0]=>
string(3) "Tom"
}
If you expect only a single result, you can cast it to a string in XPath:
$id = $xpath->evaluate(
"string(//subject[relation/@role='".$role."']/@id)"
);
var_dump($id);
Output:
string(3) "Tom"
XML Namespaces
Looking at the example posted in the comment your XML uses the namespace http://cpee.org/ns/organisation/1.0 without a prefix. The XML parser will resolve it so you can read the nodes as {http://cpee.org/ns/organisation/1.0}subject. Here are 3 examples that all resolve to this:
<subject xmlns="http://cpee.org/ns/organisation/1.0"/>
<cpee:subject xmlns:cpee="http://cpee.org/ns/organisation/1.0"/>
<c:subject xmlns:c="http://cpee.org/ns/organisation/1.0"/>
The same has to happen for the XPath expression. However, XPath does not have
a default namespace. You need to register and use a prefix of your choosing. This
allows the XPath engine to resolve something like //org:subject to //{http://cpee.org/ns/organisation/1.0}subject.
The PHP does not need to change much:
$xml = <<<'XML'
<subject id="Tom" xmlns="http://cpee.org/ns/organisation/1.0">
<relation unit="ITSupport" role="ITSupporter" />
</subject>
XML;
$role = 'ITSupporter';
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
// register a prefix for the namespace
$xpath->registerNamespace('org', 'http://cpee.org/ns/organisation/1.0');
$ids = [];
// address the elements using the registered prefix
$idAttributes = $xpath->evaluate("//org:subject[org:relation/@role='".$role."']/@id");
foreach ($idAttributes as $idAttribute) {
$ids[] = $idAttribute->value;
}
var_dump($ids);
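To see the effect of the registration, here is a small sketch that counts matches before and after registering the prefix. It reuses the sample document from this answer; the variable names are only illustrative.

```php
<?php
$xml = <<<'XML'
<subject id="Tom" xmlns="http://cpee.org/ns/organisation/1.0">
  <relation unit="ITSupport" role="ITSupporter" />
</subject>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// An unprefixed name in XPath 1.0 only matches elements in no namespace.
$without = $xpath->evaluate('count(//subject)');

// After registering a prefix for the namespace URI, the element is found.
$xpath->registerNamespace('org', 'http://cpee.org/ns/organisation/1.0');
$with = $xpath->evaluate('count(//org:subject)');

// Attributes without a prefix are in no namespace, so @role and @id stay unprefixed.
$id = $xpath->evaluate(
    'string(//org:subject[org:relation/@role="ITSupporter"]/@id)'
);

var_dump($without, $with, $id);
```

The prefix used in the expression does not have to match anything in the document; only the namespace URI has to be the same.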
Try this XPath:
//subject[relation/@role='".$role."']/@id
You were applying the predicate on the id attribute and not on the subject element.
Getting an element by its id works the same way as matching by the contents of $role.
For example:
$xpath->query("//*[@id='$id']")->item(0);
In other words, @id belongs inside the square brackets.

PHP DOM parsing text between <hr> tags

I am trying to parse some HTML to get the text between two <hr> tags using DOM with PHP but I don't get any output when I pass in hr into getElementsByTagName:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<hr>Text<hr>");
$hr = $dom->getElementsByTagName("hr");
for ($i=0; $i<$hr->length; $i++) {
echo "[". $i . "]" . $hr->item($i)->nodeValue . "</br>";
}
?>
When I run this code, it doesn't output anything. However, if I change "hr" to "*", it outputs:
[0]Text
[1]Text
[2]
[3]
(Why four lines of results?)
I run this code on a webserver which has PHP version 7.1.3 running. I can't use functions such as file_get_html or str_get_html because it returns an error about Undefined call to function ...
Why doesn't the hr tag produce results?
Perhaps what you're looking for is the contents of the text node between two <hr> elements? In that case we go looking for siblings with an XPath expression:
<?php
$dom = new DOMDocument();
$dom->loadHTML("Some text<hr>The text<hr>Other text");
$xp = new DomXPath($dom);
$result = $xp->query("//text()[(preceding-sibling::hr and following-sibling::hr)]");
foreach ($result as $i=>$node) {
echo "[$i]$node->textContent<br/>\n";
}
This happens because the <hr> element has no child nodes (text nodes are children too).
To get the text between the <hr> nodes, you have to iterate over all nodes on the same level and check if the current node is a text node (nodeType == 3), the previous sibling must be a HR node and the next sibling must be a HR node too.
<?php
$dom = new DOMDocument();
$dom->loadHTML("<hr>Text<hr>");
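// loadHTML() wraps the markup in <html><body>, so iterate the children of <body>
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $childNode) {
if (XML_TEXT_NODE !== $childNode->nodeType) { // XML_TEXT_NODE === 3
continue;
}
// compare case-insensitively: the HTML parser reports lowercase element names
if (!$childNode->previousSibling || (0 !== strcasecmp('hr', $childNode->previousSibling->nodeName))) {
continue;
}
if (!$childNode->nextSibling || (0 !== strcasecmp('hr', $childNode->nextSibling->nodeName))) {
continue;
}
echo "{$childNode->nodeValue}\n";
}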
But if you want to get anything between the hr nodes it will be more complicated.
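One possible way to handle that more general case is to start at the first <hr> and walk its following siblings until the next <hr>. This is only a sketch under the assumption that both <hr> elements share the same parent; the sample markup is made up for illustration.

```php
<?php
$dom = new DOMDocument();
// Explicit <html>/<body> so the resulting tree is predictable.
$dom->loadHTML('<html><body><hr>Text <b>bold</b> more<hr>after</body></html>');

$start = $dom->getElementsByTagName('hr')->item(0);

$between = '';
// Walk the siblings after the first <hr> until the next <hr> (or the end).
for ($node = $start->nextSibling; $node !== null; $node = $node->nextSibling) {
    if ($node->nodeType === XML_ELEMENT_NODE
        && strcasecmp($node->nodeName, 'hr') === 0) {
        break;
    }
    // textContent includes the text of nested elements such as <b>.
    $between .= $node->textContent;
}

var_dump($between);
```

To keep the markup instead of only the text, the loop could collect $dom->saveHTML($node) for each sibling rather than textContent.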

query html table using xpath - remove td from the result

I have a HTML table with class name list.
I'm using the following query to get the data.
$elements = $xpath->query("//table[@class='list']/tr/td");
$result = $dom_object->saveHTML($elements->item(0));
var_dump($result);
It works fine. Except that it adds the td in the result.
I mean the result look like this
<td>
result data
</td>
Can someone tell me how to remove the td tag from the result data?
Maybe you're looking for something like
<?php
$doc = new DOMDocument;
$doc->loadhtml( data() );
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//table[@class='list']/tr/td");
// 1)
$result = (string)$elements->item(0)->nodeValue;
var_dump($result);
// 2)
$frag = $doc->createDocumentFragment();
$node = $elements->item(0)->firstChild;
while( $node ) {
$frag->appendChild( $node->cloneNode(true) );
$node = $node->nextSibling;
}
$result = $doc->saveXML($frag);
var_dump($result);
function data() {
return <<< eoh
<html>
<head><title>...</title></head>
<body>
<table class="list">
<tr><td>result data<br />foo</td></tr>
<tr><td>...</td></tr>
</table>
</body>
</html>
eoh;
}
prints
string(14) "result datafoo"
string(19) "result data<br/>foo"
If there is only one text node per cell (ie. no other markup), you can go for
//table[@class='list']/tr/td/text()
which selects all text nodes inside the <td/>. If there is markup but still only a single text node like in <td><em>foo</em></td>, you could use
//table[@class='list']/tr/td//text()
If it contains more than one text node, you will receive multiple result nodes which are not grouped by table cell any more.
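If the grouping matters, one option is to select the cells first and then query the text nodes relative to each cell. A rough sketch, using made-up sample markup:

```php
<?php
$html = '<html><body><table class="list">'
      . '<tr><td>result data<br>foo</td></tr>'
      . '<tr><td><em>bar</em></td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$cells = [];
foreach ($xpath->query("//table[@class='list']/tr/td") as $td) {
    $texts = [];
    // ".//text()" is evaluated relative to the current cell ($td).
    foreach ($xpath->query('.//text()', $td) as $textNode) {
        $texts[] = $textNode->nodeValue;
    }
    $cells[] = $texts;
}

var_dump($cells);
```

Each entry of $cells then holds the text nodes of exactly one table cell, in document order.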

Getting text content with xpath

I have some HTML like this:
<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>
What's the xpath to get $58.00 back as one string?
I'm using PHP:
$xpath = '?????';
$result = $xml->xpath($xpath);
echo $result[0]; // want this to show $58.00, possible?
Both of these work in your case; see the links below for more detail:
$html = '<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>';
$dom = new DOMDocument();
$dom->loadXML($html);
$xpt = new DOMXpath($dom);
foreach ($xpt->query('//dd[@class="price"]') as $node) {
// outputs: $58.00
echo trim($node->nodeValue);
}
// or
$xml = new SimpleXMLElement($html);
$res1 = $xml->xpath('//dd[@class="price"]/sup');
$res2 = $xml->xpath('//dd[@class="price"]/span');
// outputs: $58.00
printf('%s%s%s', (string) $res1[0], (string) $res2[0], (string) $res1[1]);
DOMDocument
DOMXPath
SimpleXMLElement
data() will return all contents inside the current context. Try
//dd/data()
You haven't shown us your code so I don't know what platform you're using. If you have something that can evaluate non-node XPath expressions, then you can use this:
string(//dd[@class = 'price'])
if not, you can select the node,
//dd[@class = 'price']
and the API you're using should have a way of getting the inner text value of the selected node.
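In PHP, DOMXPath::evaluate() is one API that can return such a non-node result. A short sketch using the markup from the question; normalize-space() is added here to strip the line breaks around the value:

```php
<?php
$html = '<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>';

$dom = new DOMDocument();
$dom->loadXML($html);
$xpath = new DOMXPath($dom);

// string() concatenates all descendant text; normalize-space() trims it.
$price = $xpath->evaluate("normalize-space(string(//dd[@class = 'price']))");

var_dump($price);
```

SimpleXMLElement::xpath() only returns nodes, so with SimpleXML the second approach (select the node, then read its text) is the one that applies.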

How to cut off a portion of a html inside <div> and store it as html string by using xpath and domdocument?

I would like to cut out a portion of HTML. I can select it using XPath and DOMDocument, but the problem is that I need the result as an HTML code string. Normally I would use a regular expression for that, but I would rather not write a complicated search pattern that matches the beginning and the end of a tag.
That's the example input:
some html code before
<div>this <b>is</b> what I want</div>
some html after
and the output:
<div>this <b>is</b> what I want</div>
I tried something like this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div/*");
echo $result->saveHTML();
but I only got an error:
Call to undefined method DOMNodeList::saveHTML()
Does anyone know how to get the result as a html string by using DomDocument and XPath?
Thank you, gentlemen, for pointing out my misunderstanding about calling methods that are not available on a child object. But the line:
echo $doc->saveHTML($result->item(0));
generates only a warning (without the HTML string I want). Luckily I found another solution, and here it is:
<?php
$subject = '<html>
<head>
<title>A very short ebook</title>
<meta name="charset" value="utf-8" />
</head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
<p>A worthy continuation of a fine tradition.</p>
</body>
</html>';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
//echo $doc->saveHTML($result->item(0));
echo domNodeList_to_string($result);
function domNodeList_to_string($DomNodeList) {
$output = '';
$i = 0; // start at the first node in the list
$doc = new DOMDocument;
while ( $node = $DomNodeList->item($i) ) {
// import node
$domNode = $doc->importNode($node, true);
// append node
$doc->appendChild($domNode);
$i++;
}
$output = $doc->saveHTML();
$output = print_r($output, 1);
// I added this because xml output and ajax do not like each others
//$output = htmlspecialchars($output);
return $output;
}
?>
So with a query like this:
$result = $xpath->query("//div");
the output is the raw HTML string:
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
if the query is:
$result = $xpath->query("//p");
then output will be:
<p style="text-align:right">Written by Kovid Goyal</p><p>A very short ebook to demonstrate the use of XPath.</p><p>This is a truly fascinating chapter.</p><p>A worthy continuation of a fine tradition.</p>
Does anyone know a simpler method (built into PHP) to get the same result?
Try this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0)); //echoes what you want :)
The saveHTML function belongs to the DOMDocument object, you can't call it directly on the node (much less on the NodeList, which is what the query returns), but what you can do is pass it the node as a param.
Also, your query was wrong: what you want is the div element (i.e. //div), not its children (//div/*).
As per the PHP manual entry on DOMXPath::query, the function:
Returns a DOMNodeList containing all nodes matching the given XPath
expression. Any expression which does not return nodes will return an
empty DOMNodeList.
This means that the $result in the following code will be a DOMNodeList object. So if you want to get the HTML of an individual node out of it, you'll need to use the methods available on a DOMNodeList object. In this case, the item method:
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0));
$result->item(0) returns the first DOMNode in the DOMNodeList created by your xpath query.
Try this :
$subject = 'some html code before<div>this <b>is</b> what I want</div>some html after';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
$domNode = $docSave->importNode($node, true);
$docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
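When the query matches several nodes, a second document is not strictly needed: saveHTML() accepts a node argument, so the matches can be serialized one by one and concatenated. A minimal sketch with made-up sample markup; the trim() is an assumption to drop any line break the serializer may append after a block element:

```php
<?php
$subject = '<html><body><div>one <b>1</b></div><p>skipped</p><div>two</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXPath($doc);

$html = '';
foreach ($xpath->query('//div') as $node) {
    // saveHTML() with a node argument serializes just that node.
    $html .= trim($doc->saveHTML($node));
}

echo $html;
```

This keeps the nodes in their original document, which avoids the importNode() step at the cost of one saveHTML() call per match.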
