Suppose I have HTML like this:
<div id="container">
<li class="list">
Test text
</li>
</div>
And I want to get the contents of the li.
I can get the contents of the container div using this code:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo $dom->saveHTML($xpath->query("//div[@id='container']")->item(0));
I was hoping I could get the contents of the subelement by simply adding it to the query (like how you can do it in simpleHtmlDom):
echo $dom->saveHTML($xpath->query("//div[@id='container'] li[@class='list']")->item(0));
But a warning (followed by a fatal error) was thrown, saying:
Warning: DOMXPath::query(): Invalid expression ...
The only way I know of to do what I'm wanting is this:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
$dom2 = new \DomDocument;
$dom2->loadHTML(trim($dom->saveHTML($xpath->query("//div[@id='container']")->item(0))));
$xpath2 = new \DomXPath($dom2);
echo $xpath2->query("//li[@class='list']")->item(0)->nodeValue;
However, that's an awful lot of code just to get the contents of the li, and the problem is that as items are nested deeper (for example, if I want to get div#container ul.container li.list), I have to keep adding more and more code.
With simpleHtmlDom, all I would have had to do is:
$html->find('div#container li.list', 0);
Am I missing an easier way to do things with DomDocument and DomXPath, or is it really this hard?
You were close in your initial attempt; your syntax was just off by a character. Try the following XPath:
//div[@id='container']/li[@class='list']
You had a space between the div node and the li node where there should be a forward slash.
SimpleHTMLDOM uses CSS selectors, not XPath. Almost anything that can be expressed with CSS selectors can be done with XPath, too. DOMXPath::query() only supports XPath expressions that return a node list, but XPath can return scalars as well; DOMXPath::evaluate() handles those.
In XPath, the / separates the parts of a location path, not a space. It has two additional meanings. A / at the start of a location path makes it absolute (it starts at the document and not the current context node). A doubled slash (//) is the short syntax for the descendant axis (strictly, it abbreviates /descendant-or-self::node()/).
Try:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo trim($xpath->evaluate("string(//div[@id='container']//li[@class='list'])"));
Output:
Test text
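To illustrate the difference between node lists and scalars, here is a minimal sketch of evaluate() returning a number, a string, and a boolean. The sample markup and the variable names are made up for this example and are not part of the question.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$html = '<div id="container"><ul>'
      . '<li class="list">First</li><li class="list">Second</li>'
      . '</ul></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// count() yields an XPath number, which evaluate() returns as a PHP float.
$count = $xpath->evaluate("count(//div[@id='container']//li[@class='list'])");

// string() yields the text content of the first matching node.
$first = $xpath->evaluate("string(//div[@id='container']//li[@class='list'])");

// boolean() reports whether the location path matched anything at all.
$hasMissing = $xpath->evaluate("boolean(//li[@class='missing'])");

var_dump($count, $first, $hasMissing);
```

Passing any of these three expressions to query() instead would not give a usable result, because query() expects a node list.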
In CSS selector sequences the space is a combinator for two selectors.
CSS: foo bar
Xpath short syntax: //foo//bar
Xpath full syntax: /descendant::foo/descendant::bar
Another combinator would be > for a child. This axis is the default one in Xpath.
CSS: foo > bar
Xpath short syntax: //foo/bar
Xpath full syntax: /descendant::foo/child::bar
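The two combinators can be compared side by side. This is only a sketch with made-up markup (the span elements and the item class are assumptions for the example):

```php
<?php
// Hypothetical markup: one span nested inside a <p>, one directly in the div.
$html = '<div id="container">'
      . '<p><span class="item">nested</span></p>'
      . '<span class="item">direct</span>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// CSS "div#container span.item" (descendant combinator): matches both spans.
$descendants = $xpath->query("//div[@id='container']//span[@class='item']");

// CSS "div#container > span.item" (child combinator): matches only the direct child.
$children = $xpath->query("//div[@id='container']/span[@class='item']");

echo $descendants->length, "\n";
echo $children->length, "\n";
echo $children->item(0)->textContent, "\n";
```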
Related
I've been working on a practice application involving XPath and retrieving elements from another website.
I used DomXpath for it but it is not returning a result or nodelist.
Here's the code:
$DOM = new DOMDocument();
@$DOM->loadHTML($html);
$xpath = new DOMXPath($DOM);
$nodes = $xpath->query("//span[contains(@style,'font-size:25px;')]");
foreach ($nodes as $node) {
echo $node->nodeValue;
}
The page source of the example:
<div class="front-view-content full-post">
<p>
<span style="font-size:25px; color:#98293D;">RED</span><br>
<span style="font-size:25px; color:#98293D;">BLUE</span><br>
<span style="font-size:25px; color:#98293D;">WHITE</span></p>
</div>
It doesn't return anything, just a blank page.
There is no semicolon ; in the source, so the XPath doesn't match.
$nodes = $finder->query('//span[@style="font-size:25px"]');
That should work.
Trying to match attributes that contain a certain value is a little more complicated than just doing [@style="your search string"], as this will only match a style attribute that exactly matches your search string.
To my knowledge there are no shorthand selectors in XPath, similar to the ones in CSS for instance, that allow you to do [@style*="your search string"] or [@style~="your search string"], etc.
To test if a string contains another string, you use the contains() function. Your example XPath query would then have to be transformed to:
//span[contains(@style,"font-size:25px;")]
Be aware though that matching isolated strings, at the word boundary if you will, (such as matching the class main in class="nav main", but not in class="nav maintenance", for instance), gets a little more complicated, still. I'll refer you to this answer for such an example.
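A small sketch of the difference between exact matching, substring matching, and whole-word matching follows. The markup is invented for the example; the concat()/normalize-space() idiom is the usual XPath 1.0 way to test for a single class token.

```php
<?php
// Hypothetical markup with two similar-looking class values.
$html = '<ul>'
      . '<li class="nav main" style="font-size:25px; color:#98293D;">A</li>'
      . '<li class="nav maintenance" style="font-size:12px;">B</li>'
      . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison: the whole attribute value must equal the string.
$exact = $xpath->query('//li[@style="font-size:25px"]');

// Substring test: matches as soon as the value contains the string.
$substring = $xpath->query('//li[contains(@style, "font-size:25px")]');

// A substring test on class also matches "maintenance".
$loose = $xpath->query('//li[contains(@class, "main")]');

// Pad the normalized value with spaces to match "main" as a whole word only.
$token = $xpath->query(
    '//li[contains(concat(" ", normalize-space(@class), " "), " main ")]'
);

echo $exact->length, ' ', $substring->length, ' ',
     $loose->length, ' ', $token->length, "\n";
```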
I am trying to get the contents using XPATH in php.
<div class='post-body entry-content' id='post-body-37'>
<div style="text-align: left;">
<div style="text-align: center;">
Hi
</div></div></div>
I am using below php code to get the output.
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$xpath->registerPhpFunctions('preg_match');
$regex = 'post-(content|[a-z]+)';
$items = $xpath->query("div[ php:functionString('preg_match', '$regex', @class) > 0]");
dd($items);
It returns output as below
DOMNodeList {#580
+length: 0
}
Here is a working version that incorporates the advice from the comments:
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// you need to register the namespace "php" to make it available in the query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// add delimiters to your pattern
$regex = '~post-(content|[a-z]+)~';
// search your node anywhere in the DOM tree with "//"
$items = $xpath->query("//div[php:functionString('preg_match', '$regex', @class)>0]");
var_dump($items);
Obviously, this kind of pattern is useless since you can get the same result with available XPATH string functions like contains.
For a simple task like this (getting the div nodes whose class attribute starts with post- and contains content), you should use plain XPath queries:
$xp->query('//div[starts-with(@class,"post-") and contains(@class, "content")]');
Here,
- //div - get all divs that...
- starts-with(@class,"post-") - have "class" attribute starting with "post-"
- and - and...
- contains(@class, "content") - contain "content" substring in the class attribute value.
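The breakdown above can be tried out with a short script. This is a sketch only: the second div with class "sidebar content" is an invented counter-example, and $matches is a name chosen for illustration.

```php
<?php
// First div is from the question; the "sidebar content" div is a made-up
// counter-example that contains "content" but does not start with "post-".
$html = <<<HTML
<div class='post-body entry-content' id='post-body-37'>
<div style="text-align: left;">Hi</div>
</div>
<div class='sidebar content'>Other</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Both conditions must hold for the same class attribute.
$matches = $xp->query('//div[starts-with(@class,"post-") and contains(@class, "content")]');

foreach ($matches as $div) {
    echo $div->getAttribute('id'), "\n";
}
```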
To use the php:functionString you need to register the php namespace (with $xpath->registerNamespace("php", "http://php.net/xpath");) and the PHP functions (to register them all use $xp->registerPHPFunctions();).
For complex scenarios, when you need to analyze the values even deeper, you may want to create and register your own functions:
function example($attr) {
return preg_match('/post-(content|[a-z]+)/i', $attr) > 0;
}
and then inside XPath:
$divs = $xp->query("//div[php:functionString('example', @class)]");
Here, functionString passes the string contents of the @class attribute to the example function, not the object (as would be the case with php:function).
See IDEONE demo:
function example($attr) {
return preg_match('/post-(content|[a-z]+)/i', $attr) > 0;
}
$html = <<<HTML
<body>
<div class='post-body entry-content' id='post-body-37'>
<div style="text-align: left;">
<div style="text-align: center;">
Hi
</div></div></div>
</body>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions('example');
$divs = $xp->query("//div[php:functionString('example', @class)]");
foreach ($divs as $div) {
echo $div->nodeValue;
}
See also a nice article about using PHP functions inside XPath: Using PHP Functions in XPath Expressions.
I have a huge file with lots of entries, they have one thing in common, the first line. I want to extract all of the text from a paragraph where the first line is:
Type of document: Contract Notice
The HTML code I am working on is here:
<!-- other HTML -->
<p>
<b>Type of document:</b>
" Contract Notice" <br>
<b>Country</b> <br>
... rest of text ...
</p>
<!-- other HTML -->
I have put the HTML into a DOM like this:
$dom = new DOMDocument;
$dom->loadHTML($content);
I need to return all of the text in the paragraph node where the first line is 'Type of document: Contract Notice'. I am sure there is a simple way of doing this using DOM methods or XPath; please advise!
Speaking of XPath, try the following expression, which selects <p> elements:
- whose <b> child element (first one) has the value Type of document:
- whose next sibling text node (first one) contains the text Contract Notice
//p[
b[1][.="Type of document:"]
/following-sibling::text()[1][contains(., "Contract Notice")]
]
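As a rough sketch of how that expression could be used from PHP: the two-paragraph $content string below is invented for the example (the second paragraph is a counter-example), and the whitespace cleanup is just one possible way to tidy the result.

```php
<?php
// Hypothetical input: the first paragraph should match, the second should not.
$content = '<p><b>Type of document:</b> Contract Notice <br>'
         . '<b>Country</b> <br>rest of text</p>'
         . '<p><b>Type of document:</b> Award Notice <br>other text</p>';

$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

// Same expression as above, written on one line.
$expression = '//p[b[1][.="Type of document:"]'
            . '/following-sibling::text()[1][contains(., "Contract Notice")]]';
$paragraphs = $xpath->query($expression);

foreach ($paragraphs as $p) {
    // textContent concatenates all descendant text; collapse runs of whitespace.
    echo trim(preg_replace('/\s+/', ' ', $p->textContent)), "\n";
}
```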
With this XPath expression, you select the text of all children of the p element:
//b[text()="Type of document:"]/parent::p/*/text()
I don't like using DomDocument parsing unless I need to heavily parse a document, but if you want to do so then it could be something like:
//Using DomDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXpath($doc);
$matchedDoms = $xpath->query('//b[text()="Type of document:"]/parent::p//text()');
$data = '';
foreach($matchedDoms as $domMatch) {
$data .= $domMatch->data . ' ';
}
var_dump($data);
I would prefer a simple regex line to do it all, after all it's just one piece of the document you are looking for:
//Using a Regular Expression
preg_match('/<p>.*<b>Type of document:<\/b>.*Contract Notice(?<data>.*)<\/p>/si', $content, $matches);
var_dump($matches['data']); //If you want everything in there
var_dump(strip_tags($matches['data'])); //If you just want the text
I'm trying to do Xpath queries on DOMElements but it doesn't seem to work. Here is the code
<html>
<div class="test aaa">
<div></div>
<div class="link">contains a link</div>
<div></div>
</div>
<div class="test bbb">
<div></div>
<div></div>
<div class="link">contains a link</div>
</div>
</html>
What I'm doing is this:
$dom = new DOMDocument();
$html = file_get_contents("file.html");
#$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//div[contains(@class,'test')]");
if (!$entries->length > 0) {
echo "Nothing\n";
} else {
foreach ($entries as $entry) {
$link = $xpath->query('/div[@class=link]',$entry);
echo $link->item(0)->nodeValue;
// => PHP Notice: Trying to get property of non-object
}
}
Everything works fine up to $xpath->query('/div[@class=link]', $entry). I don't know how to use XPath on a particular DOMElement ($entry).
How can I use xpath queries on DOMElement?
It looks like you're trying to mix CSS selectors with XPath. You want to be using a predicate ([...]) looking at the value of the class attribute.
For example, your //div.link might look like //div[contains(concat(' ',normalize-space(@class),' '),' link ')].
Secondly, within the loop you try to make a query with a context node then ignore that by using an absolute location path (it starts with a slash).
Updated to reflect changes to the question:
Your second XPath expression (/div[@class=link]) is still a) absolute, and b) has an incorrect condition. You want to be asking for matching elements relative to the specified context node ($entry) with the class attribute having a string value of link.
So /div[@class=link] should become something like div[@class="link"], which searches children of the $entry elements (use .//div[...] or descendant::div[...] if you want to search deeper).
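A minimal sketch of relative queries against a context node is shown here. The markup is adapted from the question, with the second link nested one level deeper (an assumption made to show the difference between the child and descendant forms); $childCounts and $found are illustrative names.

```php
<?php
// Adapted markup: in the second block the link is nested one level deeper.
$html = '<html><body>'
      . '<div class="test aaa"><div></div><div class="link">first link</div></div>'
      . '<div class="test bbb"><div><div class="link">nested link</div></div></div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$entries = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' test ')]"
);

$childCounts = [];
$found = [];
foreach ($entries as $entry) {
    // Relative path: only direct children of $entry are considered.
    $childCounts[] = $xpath->query('div[@class="link"]', $entry)->length;

    // ".//" searches all descendants of $entry, not the whole document.
    $deep = $xpath->query('.//div[@class="link"]', $entry);
    if ($deep->length > 0) {
        $found[] = $deep->item(0)->nodeValue;
    }
}

print_r($childCounts);
print_r($found);
```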
Let's say I have the following HTML snippet:
<div abc:section="section1">
<p>Content...</p>
</div>
<div abc:section="section2">
<p>Another section</p>
</div>
How can I get a DOMNodeList (in PHP) with a DOMNode for each of the <div>s that has the abc:section attribute set?
Currently I have the following code
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('abc', 'http://xml.example.com/AbcDocument');
The following XPath expressions don't work:
$xpath->query('//@abc:section');
$xpath->query('//*[@abc:section]');
The loaded HTML is always just a snippet, I'm transforming this using the DOMDocument functions and feeding that to the template.
The loadHTML method triggers the HTML parser module of libxml. As far as I know, the resulting HTML tree will not contain namespaces, so querying them with XPath won't work here. You can do this instead:
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
foreach ($dom->getElementsByTagName('div') as $node) {
echo $node->getAttribute('abc:section');
}
echo $dom->saveHTML();
As an alternative, you can use //div/@* to fetch all attributes, and that would include the namespaced attributes. You cannot have a colon in the query, though, because that requires the namespace prefix to be registered and, as pointed out above, that doesn't work for an HTML tree.
Yet another alternative would be to use //@*[starts-with(name(), "abc:section")].
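The name()-based alternative might look like the sketch below. It assumes, as described above, that the HTML parser keeps abc:section as a plain attribute name without any namespace; the error suppression is a precaution, and $sections is an illustrative name.

```php
<?php
$html = '<div abc:section="section1"><p>Content...</p></div>'
      . '<div abc:section="section2"><p>Another section</p></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Compare the literal attribute name, so no namespace prefix needs registering.
$attributes = $xpath->query('//@*[name() = "abc:section"]');

$sections = [];
foreach ($attributes as $attr) {
    $sections[] = $attr->value;
}

print_r($sections);
```

To get the div elements rather than the attribute nodes, the same test can be moved into a predicate, for example //div[@*[name() = "abc:section"]].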