Cannot Use parentNode with DomXpath in PHP

I'm having difficulty using parentNode with DomXpath.
<?php
$html = <<<STR
<div id="bar">
<p>item1</p>
<ul>
<li class="foo">item2</li>
<li>item3</li>
<li>item4</li>
</ul>
</div>
STR;
$doc = new DOMDocument;
$doc->loadHTML( $html );
$xpath = new DomXpath($doc);
$nodeFoo = $xpath->query("//*[@id='bar']//*[@class='foo']");
echo $nodeFoo->item(0)->nodeValue;
$nodeClimb = $nodeFoo->parentNode; // causes an error
echo $nodeClimb.nodeName;
?>
I expected that the last line yields 'ul' which is the parent node name of the retrieved node, $nodeFoo. What am I doing wrong?

Firstly, you have a typo on your last line: echo $nodeClimb.nodeName; should be echo $nodeClimb->nodeName;
However, your main problem is something that you've spotted on one line but not on the next: the XPath query returns not a single DOMNode, but an instance of DOMNodeList containing all the matches for that query.
So just as you have selected the first item in the list to echo (echo $nodeFoo->item(0)->nodeValue;), you need to select an item to assign as the parent ($nodeClimb = $nodeFoo->item(0)->parentNode;).
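Putting both fixes together, a minimal sketch of the corrected script could look like this. The $nodes variable name and the null check are my additions, not part of the original code:

```php
<?php
$html = <<<STR
<div id="bar">
<p>item1</p>
<ul>
<li class="foo">item2</li>
<li>item3</li>
<li>item4</li>
</ul>
</div>
STR;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList, even when only one node matches
$nodes = $xpath->query("//*[@id='bar']//*[@class='foo']");

// Pick a single DOMNode out of the list before using node properties
$nodeFoo = $nodes->item(0);
if ($nodeFoo !== null) {
    echo $nodeFoo->nodeValue, "\n";            // item2
    echo $nodeFoo->parentNode->nodeName, "\n"; // ul
}
```

item(0) returns null when nothing matches, so the check avoids a property access on null if the markup ever changes.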

Related

Get content and class attribute value from child nodes of the DOM

First of all, sorry for my bad English!
This is my simple cURL result:
<li class="result">
<div class="song_info">
<span class="artist_name">art1</span>
<span class="song_name">name1</span>
<span class="views">100 time</span>
</div>
</li>
//again
<li class="result">
<div class="song_info">
<span class="artist_name">art2</span>
<span class="song_name">name2</span>
<span class="views">200 time</span>
</div>
</li>
and many more like that...
I used this code to extract the values from the HTML:
$classname = 'song_info';
$dom = new DOMDocument;
$dom->loadHTML($html); // my html result .
$xpath = new DOMXPath($dom);
$get = $xpath->query("//*[@class='" . $classname . "']");
$text = $get->item(0)->nodeValue;
echo $text;
This code gives me just the first result:
art1
name1
100time
I want to get all the results (preferably as JSON).
Can anyone help me?
The DOMXPath::query method returns a DOMNodeList. It implements the Traversable interface, so you can loop over it with foreach. Rename the $get variable to $nodes so that the name says what it holds. Then:
foreach ($nodes as $curNode) {
    foreach ($curNode->childNodes as $curChildNode) {
        // childNodes also contains whitespace text nodes, which have no attributes
        if (!$curChildNode instanceof DOMElement) {
            continue;
        }
        // use $curChildNode->textContent to get content
        // and $curChildNode->getAttribute('class') to get class name
    }
}
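Since the question asks for JSON, here is one way the loop could be turned into a complete script. This is only a sketch: the wrapping <ul>, the $songs array and the idea of keying each value by its class name are my assumptions, not something from the original post.

```php
<?php
// Sample markup from the question; the outer <ul> is assumed
$html = <<<HTML
<ul>
<li class="result">
<div class="song_info">
<span class="artist_name">art1</span>
<span class="song_name">name1</span>
<span class="views">100 time</span>
</div>
</li>
<li class="result">
<div class="song_info">
<span class="artist_name">art2</span>
<span class="song_name">name2</span>
<span class="views">200 time</span>
</div>
</li>
</ul>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$songs = [];
foreach ($xpath->query("//*[@class='song_info']") as $info) {
    $song = [];
    // "span" is evaluated relative to the current song_info node
    foreach ($xpath->query('span', $info) as $span) {
        $song[$span->getAttribute('class')] = trim($span->textContent);
    }
    $songs[] = $song;
}

echo json_encode($songs, JSON_PRETTY_PRINT), "\n";
```

Passing the current node as the second argument to query() keeps the inner lookup scoped to one result, so the values of different songs do not get mixed.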
I found my answer:
$text = $get->item(0)->nodeValue; // gives the first result
$text = $get->item(1)->nodeValue; // gives the second result
I wrote a loop and got all the results.

Is there an easy way to get subelements with DomDocument and DomXPath?

Suppose I have HTML like this:
<div id="container">
<li class="list">
Test text
</li>
</div>
And I want to get the contents of the li.
I can get the contents of the container div using this code:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo $dom->saveHTML($xpath->query("//div[@id='container']")->item(0));
I was hoping I could get the contents of the subelement by simply adding it to the query (like how you can do it in simpleHtmlDom):
echo $dom->saveHTML($xpath->query("//div[@id='container'] li[@class='list']")->item(0));
But a warning (followed by a fatal error) was thrown, saying:
Warning: DOMXPath::query(): Invalid expression ...
The only way I know of to do what I'm wanting is this:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
$dom2 = new \DomDocument;
$dom2->loadHTML(trim($dom->saveHTML($xpath->query("//div[@id='container']")->item(0))));
$xpath2 = new \DomXPath($dom2);
echo $xpath2->query("//li[@class='list']")->item(0)->nodeValue;
However, that's an awful lot of code just to get the contents of the li, and the problem is that as items are nested deeper (like if I want to get div#container ul.container li.list), I have to continue adding more and more code.
With simpleHtmlDom, all I would have had to do is:
$html->find('div#container li.list', 0);
Am I missing an easier way to do things with DomDocument and DomXPath, or is it really this hard?
You were close in your initial attempt; your syntax was just off by a character. Try the following XPath:
//div[@id='container']/li[@class='list']
You can see you had a space between the div node and the li node where there should be a forward slash.
SimpleHTMLDOM uses CSS selectors, not XPath. Almost anything that can be done with CSS selectors can be done with XPath, too. DOMXPath::query() only supports XPath expressions that return a node list, but XPath can return scalars, too (DOMXPath::evaluate() handles those).
In XPath, the / separates the parts of a location path, not a space. It has two additional meanings. A / at the start of a location path makes it absolute (it starts at the document and not the current context node). A double slash, //, is the short syntax for selecting descendants (strictly, it abbreviates /descendant-or-self::node()/).
Try:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo trim($xpath->evaluate("string(//div[@id='container']//li[@class='list'])"));
Output:
Test text
In CSS selector sequences the space is a combinator for two selectors.
CSS: foo bar
Xpath short syntax: //foo//bar
Xpath full syntax: /descendant::foo/descendant::bar
Another combinator would be > for a child. This axis is the default one in Xpath.
CSS: foo > bar
Xpath short syntax: //foo/bar
Xpath full syntax: /descendant::foo/child::bar
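To see the difference between the two combinators in practice, here is a small sketch. The markup and the variable names are made up for illustration and are not from the question:

```php
<?php
// Hypothetical markup: one <p> is a direct child of the container,
// the other sits one level deeper inside a nested <div>
$html = <<<HTML
<div id="container">
<p class="note">direct child</p>
<div class="inner"><p class="note">nested deeper</p></div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// CSS "div#container p.note": any descendant
$descendants = $xpath->query("//div[@id='container']//p[@class='note']");

// CSS "div#container > p.note": direct children only
$children = $xpath->query("//div[@id='container']/p[@class='note']");

echo $descendants->length, "\n"; // 2
echo $children->length, "\n";    // 1
echo trim($children->item(0)->nodeValue), "\n"; // direct child
```

Note that [@class='note'] compares the whole attribute value, so unlike the CSS .note selector it would not match an element with several classes.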

Count the Number of Specified Tags in the First Child Node in PHP

To count the number of a specified tag including nested tags, it's simple like this,
<?php
$html = <<<STR
<ul>
<li>item1</li>
<ul>
<li>item2</li>
<li>item3</li>
<li>item4</li>
</ul>
</ul>
STR;
$doc = new DOMDocument;
$doc->loadHTML( $html );
$nodeUl = $doc->getElementsByTagName('ul')->item(0);
echo $nodeUl->getElementsByTagName('li')->length;
?>
But if I want to count the li tag in this case only in the first child node, how can it be achieved? I mean in this case it should be only one, not four.
Maybe remove other tags and count it? Or is there a better way of doing it?
The trouble is that getElementsByTagName() returns all descendant elements (with the specified tag name), rather than just children.
There are a couple of different approaches that you could take, here are two of them.
Loop over child nodes and count the <li> elements
$count = 0;
foreach ($nodeUl->childNodes as $child) {
if ($child->nodeName === 'li') {
$count++;
}
}
Use XPath to query (and count) only child <li> elements
$xpath = new DOMXPath($doc);
$count = $xpath->evaluate('count(li)', $nodeUl);
Resources
childNodes property
nodeName property
DOMXPath class
count() XPath function
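For reference, both approaches can be run side by side against the markup from the question. This is just a sketch; the $loopCount and $xpathCount names are mine:

```php
<?php
$html = <<<STR
<ul>
<li>item1</li>
<ul>
<li>item2</li>
<li>item3</li>
<li>item4</li>
</ul>
</ul>
STR;

$doc = new DOMDocument;
$doc->loadHTML($html);
$nodeUl = $doc->getElementsByTagName('ul')->item(0);

// Approach 1: walk the direct children and count the <li> elements
$loopCount = 0;
foreach ($nodeUl->childNodes as $child) {
    if ($child->nodeName === 'li') {
        $loopCount++;
    }
}

// Approach 2: count(li) is evaluated relative to $nodeUl, so only its
// direct <li> children are counted; evaluate() returns a float here
$xpath = new DOMXPath($doc);
$xpathCount = (int) $xpath->evaluate('count(li)', $nodeUl);

echo $loopCount, "\n";  // 1
echo $xpathCount, "\n"; // 1
```

The loop silently skips the whitespace text nodes between the tags because their nodeName is #text rather than li.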
Try the following:
$doc = new DOMDocument;
$doc->loadHTML($html);
foreach($doc->getElementsByTagName('ul') as $ul) {
$count = $ul->getElementsByTagName('li')->length;
break;
}

why are these spans not getting treated as nodes by domdocument()?

the result of the following domdocument() call
$html = <<<EOT
<div class="list_item">
<div class="list_item_content">
<div class="list_item_title">
<a href="/link/goes/here">
INFO<br />
<span class="part2">More Info</span><br />
<span class="part3">Etc.</span>
</a>
</div>
</div>
EOT;
libxml_use_internal_errors(false);
$dom = new DOMDocument();
$dom->loadhtml($html);
$xpath = new DOMXPath($dom);
$titles_nodeList = $xpath->query('//div[@class="list_item"]/div[@class="list_item_content"]/div[@class="list_item_title"]/a');
foreach ($titles_nodeList as $title) {
$titles[] = $title->nodeValue;
}
echo("<pre>");
print_r($titles);
echo("</pre>");
?>
is
Array
(
[0] =>
INFOMore InfoEtc.
)
Why are data in these two spans inside the a element included in the result, when I am not specifying these spans in the path? I am interested only in retrieving data contained in the a element directly, not information contained in the spans inside the a element. I am wondering what I am doing wrong.
Try this xpath:
//div[@class="list_item"]/div[@class="list_item_content"]/div[@class="list_item_title"]/a/child::text()
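The reason the span text shows up is that nodeValue on an element returns the concatenated text of all of its descendants. Selecting the text nodes directly avoids that. A rough sketch, using a shortened version of the markup from the question (I dropped the outer divs to keep it brief, and the trim/skip step is my own addition):

```php
<?php
$html = <<<EOT
<div class="list_item_title">
<a href="/link/goes/here">
INFO<br />
<span class="part2">More Info</span><br />
<span class="part3">Etc.</span>
</a>
</div>
EOT;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$titles = [];
// text() matches only text nodes that are direct children of <a>
foreach ($xpath->query('//div[@class="list_item_title"]/a/text()') as $textNode) {
    $text = trim($textNode->nodeValue);
    if ($text !== '') { // skip whitespace-only nodes between the tags
        $titles[] = $text;
    }
}

print_r($titles);
```

With this markup $titles ends up holding just INFO, since the text inside the two spans belongs to the span elements and not to the a element itself.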
The nodes are there, but you are viewing them in HTML mode in a browser. Try viewing the page source, and/or doing:
echo("<pre>");
echo htmlspecialchars(print_r($titles, true));
echo("</pre>");
instead, which will encode the <> into &lt;&gt; and make them "visible".

extract value from web page

Hi I have a website's home page that I am reading in using Curl and I need to grab the number of pages that the site has.
The information is in a div:-
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
The value I need is 15 but this could be any number depending on the site but will always be in the same position.
How could I read this value easily and assign it to a variable in PHP.
Thanks
Jonathan
You can use PHP's DOM module for that. Read the page with DOMDocument::loadhtmlfile(), then create a DOMXPath object and query all span elements within the document having the class="page-numbers" attribute.
(edit: oops, that's not what you're looking for, see second code snippet)
$html = '<html><head><title>:::</title></head><body>
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
</body></html>';
$doc = new DOMDocument;
// since the content "is already here" we use loadhtml(content)
// instead of loadhtmlfile(url)
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//span[@class="page-numbers"]');
echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';
edit: does this
<span class="page-numbers">15</span>
(the second last a element) always point to the last page, i.e. does this link contain the value you're looking for?
Then you can use a XPath expression that selects the second but last a element and from there its child span element.
//div[@class="pager"] <- select each <div> where the attribute class equals "pager"
//div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
//div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second but last
//div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second but last <a> element in the pager <div>
( you might want to fetch a good XPath tutorial ;-) )
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
echo $nodelist->item(0)->nodeValue;
}
else {
echo 'not found';
}
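One caveat: the markup as quoted in the question contains no <a> elements, only spans directly inside the pager div. If the real page looks exactly like that, the same position()=last()-1 idea can be applied to the spans instead. A sketch under that assumption (the $lastPage name is mine, and I wrote the ellipsis as &hellip; to sidestep encoding issues):

```php
<?php
$html = <<<HTML
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">&hellip;</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
HTML;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Second-to-last <span> that is a direct child of the pager <div>
$nodelist = $xpath->query('//div[@class="pager"]/span[position()=last()-1]');

$lastPage = null;
if ($nodelist->length > 0) {
    $lastPage = (int) trim($nodelist->item(0)->nodeValue);
}

echo $lastPage === null ? 'not found' : $lastPage, "\n"; // 15
```

This relies on the "next" span always being last, as the question states. On the final page that span may be missing, so it is worth checking how the site renders that case.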
There is no direct function or easy way to do that. You need to build or use an existing HTML parser to do that.
You can parse it with a regular expression. First find all occurrences of <span class="page-numbers">, then select the last one:
// div html code should be in $div_html
preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers);
print_r(end($page_numbers[1])); // prints 15
This is something you might want to use XPath for, which requires loading the page as a DOM document object:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile("http://path/to/yourfile.html");
$xp = new DOMXPath($domDoc);
$nodes = $xp->query("//xpath/to/relevant/node");
$value = $nodes->item(0)->nodeValue;
I haven't written a good xpath in a while, so you should do some reading to figure out that part, but it shouldn't be too difficult.
perhaps
$nodes = $dom->getElementsByTagName("span");
$maxPageNum = 0;
foreach($nodes as $node)
{
if( $node.class == "page-numbers" && $node.value > $maxPageNum )
{
$maxPageNum = $node.value;
}
}
I don't know PHP, so maybe it's not that easy to access the class/inner text of a dom node, but there must be some way to get that info and the pseudocode here should work.
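For what it's worth, the pseudocode above translates to real PHP fairly directly. A possible version, assuming the pager markup from the question; splitting the class attribute and the ctype_digit() check are my additions, so that spans like "page-numbers current" are counted and the "next" label is skipped:

```php
<?php
$html = <<<HTML
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">&hellip;</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);

$maxPageNum = 0;
foreach ($dom->getElementsByTagName('span') as $node) {
    // getAttribute() reads the class; textContent holds the inner text
    $classes = preg_split('/\s+/', trim($node->getAttribute('class')));
    $value = trim($node->textContent);

    // Only consider page-number spans whose text is purely numeric
    if (in_array('page-numbers', $classes, true) && ctype_digit($value)) {
        $maxPageNum = max($maxPageNum, (int) $value);
    }
}

echo $maxPageNum, "\n"; // 15
```

Taking the maximum instead of relying on position means it keeps working even if the order of the spans changes.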
Just wanted to say a huge thank you to Volkerk for helping out - it worked really well. I had to make a few slight changes and ended up with this:-
function getusers($userurl)
{
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
$lastpage = $nodelist->item(0)->nodeValue;
$users = $lastpage * 35;
$userurl = $userurl.'?page='.$lastpage;
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
$users = $users + $nodelist->length;
echo 'there are ', $users , ' users';
}
else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
echo 'there are ', $nodelist->length, ' users';
}
}
