Get content and class attribute value from child nodes of the DOM - php

First of all, sorry for my bad English!
This is my simple cURL result:
<li class="result">
<div class="song_info">
<span class="artist_name">art1</span>
<span class="song_name">name1</span>
<span class="views">100 time</span>
</div>
</li>
//again
<li class="result">
<div class="song_info">
<span class="artist_name">art2</span>
<span class="song_name">name2</span>
<span class="views">200 time</span>
</div>
</li>
and many more like that...
I used this code to extract values from the HTML:
$classname = 'song_info';
$dom = new DOMDocument;
$dom->loadHTML($html); // my html result .
$xpath = new DOMXPath($dom);
$get = $xpath->query("//*[@class='" . $classname . "']");
$text = $get->item(0)->nodeValue;
echo $text;
This code gives me just the first result:
art1
name1
100time
I want to get all the results (preferably as JSON).
Can anyone help me?

DOMXPath::query() returns a DOMNodeList, which implements the Traversable interface, so you can loop over it with foreach. Rename the $get variable to $nodes so that the name says what it holds. Then:
foreach ($nodes as $curNode) {
    foreach ($curNode->childNodes as $curChildNode) {
        // childNodes also contains whitespace text nodes, which have no attributes
        if (!$curChildNode instanceof DOMElement) {
            continue;
        }
        // use $curChildNode->textContent to get content
        // and $curChildNode->getAttribute('class') to get class name
    }
}
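The loop above can be turned into the JSON output the question asks for. The following is a minimal sketch, assuming the markup looks exactly like the sample in the question; the wrapping <ul> and the array name $results are additions for illustration, and in practice $html would be the cURL response.

```php
<?php
// Sample markup copied from the question; the <ul> wrapper is an assumption.
$html = <<<HTML
<ul>
<li class="result">
<div class="song_info">
<span class="artist_name">art1</span>
<span class="song_name">name1</span>
<span class="views">100 time</span>
</div>
</li>
<li class="result">
<div class="song_info">
<span class="artist_name">art2</span>
<span class="song_name">name2</span>
<span class="views">200 time</span>
</div>
</li>
</ul>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$results = [];
foreach ($xpath->query("//*[@class='song_info']") as $node) {
    $row = [];
    foreach ($node->childNodes as $child) {
        // Skip whitespace text nodes; only elements carry a class attribute.
        if (!$child instanceof DOMElement) {
            continue;
        }
        // Use the class name as the key and the trimmed text as the value.
        $row[$child->getAttribute('class')] = trim($child->textContent);
    }
    $results[] = $row;
}

$json = json_encode($results);
echo $json, "\n";
```

Keying each row by class name keeps the JSON self-describing; if two children of one song_info block shared a class, the later one would overwrite the earlier one.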

I found my answer:
$text = $get->item(0)->nodeValue; // gives the first result
$text = $get->item(1)->nodeValue; // gives the second result
I wrote a loop and received all the results.

Related

Cannot Use parentNode with DomXpath in PHP

I'm having difficulty using parentNode with DomXpath.
<?php
$html = <<<STR
<div id="bar">
<p>item1</p>
<ul>
<li class="foo">item2</li>
<li>item3</li>
<li>item4</li>
</ul>
</div>
STR;
$doc = new DOMDocument;
$doc->loadHTML( $html );
$xpath = new DomXpath($doc);
$nodeFoo = $xpath->query("//*[@id='bar']//*[@class='foo']");
echo $nodeFoo->item(0)->nodeValue;
$nodeClimb = $nodeFoo->parentNode; // causes an error
echo $nodeClimb.nodeName;
?>
I expected that the last line yields 'ul' which is the parent node name of the retrieved node, $nodeFoo. What am I doing wrong?
Firstly, you have a typo on your last line: echo $nodeClimb.nodeName; should be echo $nodeClimb->nodeName;
However, your main problem is something that you've spotted on one line but not on the next: the XPath query returns not a single DOMNode, but an instance of DOMNodeList containing all the matches for that query.
So just as you have selected the first item in the list to echo (echo $nodeFoo->item(0)->nodeValue;), you need to select an item to assign as the parent ($nodeClimb = $nodeFoo->item(0)->parentNode;).
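Putting both corrections together, a minimal sketch of the fixed script might look like this. The variable names $matches, $first and $parentName are chosen for illustration, and the null check is an added safeguard for the case where the query matches nothing.

```php
<?php
$html = <<<HTML
<div id="bar">
<p>item1</p>
<ul>
<li class="foo">item2</li>
<li>item3</li>
<li>item4</li>
</ul>
</div>
HTML;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$matches = $xpath->query("//*[@id='bar']//*[@class='foo']");

// query() returns a DOMNodeList, so pick a node first and guard against no match.
$first = $matches->item(0);
$parentName = null;
if ($first !== null) {
    $parentName = $first->parentNode->nodeName;
    echo $first->nodeValue, "\n"; // text of the matched <li>
    echo $parentName, "\n";       // name of its parent element
}
```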

Traversing child nodes with PHP DOMXpath?

I'm having some trouble understanding what exactly is stored in childNodes. Ideally I'd like to do another xquery on each of the child nodes, but can't seem to get it straight. Here's my scenario:
Data:
<div class="something">
<h3>
Link text 1
</h3>
<div class="somethingelse">Something else text 1</div>
</div>
<div class="something">
<h3>
Link text 2
</h3>
<div class="somethingelse">Something else text 2</div>
</div>
<div class="something">
<h3>
Link text 3
</h3>
<div class="somethingelse">Something else text 3</div>
</div>
And the code:
$html = new DOMDocument();
$html->loadHtmlFile($local_file);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query("//div[@class='something']");
foreach ($nodelist as $n) {
    // Can I run another query here?
}
For each element of "something" (i.e., $n) I want to access the values of the two pieces of text and the href. I tried using childNode and another xquery but couldn't get anything to work. Any help would be greatly appreciated!
Yes, you can run another XPath query, something like this:
foreach ($nodelist as $n)
{
    $other_nodes = $xpath->query('div[@class="somethingelse"]', $n);
    echo $other_nodes->length;
}
This gets the inner div with the class somethingelse. The second argument of $xpath->query() tells the query to use that node as its context; see http://fr2.php.net/manual/en/domxpath.query.php
If I understand your question correctly, it worked when I used the descendant:: expression. Try this:
foreach ($nodelist as $n) {
$other_nodes = $xpath->query('descendant::div[@class="some-descendant"]', $n);
echo $other_nodes->length;
echo $other_nodes->item(0)->nodeValue;
}
Sometimes, though, it is enough to combine queries using the // path expression to narrow the search. At the start of an expression, // selects matching nodes anywhere in the document, regardless of the context node; between two steps it selects matching descendants of the preceding step.
$nodes = $xpath->query('//div[@class="some-descendant"]//div[@class="some-descendant-of-that-descendant"]');
Then loop through those for the stuff you need. Hope this helps.
Trexx had it but he missed the last sentence of the question:
foreach ($nodelist as $n){
$href = $xpath->query('h3/a', $n)->item(0)->getAttribute('href');
$a_text = $xpath->query('h3/a', $n)->item(0)->nodeValue;
$div_text = $xpath->query('div', $n)->item(0)->nodeValue;
}
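For reference, here is a self-contained sketch of the context-relative approach. The markup in the question as shown has no anchors, so the <a> elements and the example.com URLs below are assumptions added for illustration, as are the array keys in $rows.

```php
<?php
// Hypothetical markup: the anchors and example.com URLs are assumed for illustration.
$html = <<<HTML
<div class="something">
<h3><a href="http://example.com/1">Link text 1</a></h3>
<div class="somethingelse">Something else text 1</div>
</div>
<div class="something">
<h3><a href="http://example.com/2">Link text 2</a></h3>
<div class="somethingelse">Something else text 2</div>
</div>
HTML;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query("//div[@class='something']") as $n) {
    // Paths without a leading slash are evaluated relative to the context node $n.
    $a   = $xpath->query('h3/a', $n)->item(0);
    $div = $xpath->query("div[@class='somethingelse']", $n)->item(0);
    if ($a === null || $div === null) {
        continue; // skip blocks that lack either element
    }
    $rows[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->nodeValue),
        'else' => trim($div->nodeValue),
    ];
}
print_r($rows);
```

Querying each element once and checking for null avoids a fatal error when a block is missing its link or its inner div.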
Here is a code snippet that lets you access the information contained in each of the nodes with the class attribute "something":
$nodes_tracker = 0;
$nodes_array = array();
foreach ($nodelist as $n) {
    // A leading // searches the whole document even when $n is passed as context,
    // so the counter is what selects the match belonging to the current node.
    $info = $xpath->query('//h3//a', $n)->item($nodes_tracker)->nodeValue;
    $extra_info = $xpath->query('//div[@class="somethingelse"]', $n)->item($nodes_tracker)->nodeValue;
    array_push($nodes_array, $info . ' - ' . $extra_info . '<br>'); // add each info to the array
    $nodes_tracker++;
}
print_r($nodes_array);

why are these spans not getting treated as nodes by domdocument()?

the result of the following domdocument() call
$html = <<<EOT
<div class="list_item">
<div class="list_item_content">
<div class="list_item_title">
<a href="/link/goes/here">
INFO<br />
<span class="part2">More Info</span><br />
<span class="part3">Etc.</span>
</a>
</div>
</div>
EOT;
libxml_use_internal_errors(false);
$dom = new DOMDocument();
$dom->loadhtml($html);
$xpath = new DOMXPath($dom);
$titles_nodeList = $xpath->query('//div[@class="list_item"]/div[@class="list_item_content"]/div[@class="list_item_title"]/a');
foreach ($titles_nodeList as $title) {
$titles[] = $title->nodeValue;
}
echo("<pre>");
print_r($titles);
echo("</pre>");
?>
is
Array
(
[0] =>
INFOMore InfoEtc.
)
Why are data in these two spans inside the a element included in the result, when I am not specifying these spans in the path? I am interested only in retrieving data contained in the a element directly, not information contained in the spans inside the a element. I am wondering what I am doing wrong.
Try this xpath:
//div[@class="list_item"]/div[@class="list_item_content"]/div[@class="list_item_title"]/a/child::text()
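The reason the spans show up is that nodeValue on an element returns the concatenated text of all its descendants, not just its own text. A minimal sketch of reading only the direct text children follows; it assumes a trimmed-down version of the question's markup, and the names $own and $titles are chosen for illustration.

```php
<?php
// Trimmed-down version of the markup from the question.
$html = <<<HTML
<div class="list_item_title">
<a href="/link/goes/here">
INFO<br />
<span class="part2">More Info</span><br />
<span class="part3">Etc.</span>
</a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$titles = [];
foreach ($xpath->query('//div[@class="list_item_title"]/a') as $a) {
    $own = '';
    // text() selects only text nodes that are direct children of <a>.
    foreach ($xpath->query('text()', $a) as $textNode) {
        $own .= $textNode->nodeValue;
    }
    // The direct text nodes include the line breaks around the spans, so trim.
    $titles[] = trim($own);
}
print_r($titles);
```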
The nodes are there, but you are viewing them in HTML mode in a browser. Try viewing the page source, and/or doing:
echo("<pre>");
echo htmlspecialchars(print_r($titles, true));
echo("</pre>");
instead, which will encode the <> into &lt;&gt; and make them "visible".

Grabbing links using xpath in php

I am trying to grab links from the Google search page. I am using the XPath below to grab the links:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
XPather evaluates it and gives the result, but when I use it in my PHP code it doesn't show any result. Can someone please tell me what I am doing wrong? There is nothing wrong with the cURL.
Below is my code:
$dom = new DOMDocument();
@$dom->loadHTML($result);
$xpath = new DOMXPath($dom);
$elements = $xpath->evaluate("//div[@id='ires']/ol[@id='rso']/li/h3/a");
foreach ($elements as $element)
{
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
echo $link."<br>";
}
Sample Html provided by Robert Pitt
<li class="g w0">
<h3 class="r">
<em>LINK</em>
</h3>
<button class="ws" title=""></button>
<div class="s">
META
</div>
</li>
You can make life simpler by using the original XPath expression that you quoted:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
Then, loop over the matching attributes like:
$hrefs = $xpath->evaluate(...);
foreach ($hrefs as $href) {
echo $href->value . "<br>";
}
Be sure to check whether any attributes were matched (var_dump($hrefs->length) would suffice).
There's no element called href; that's an attribute:
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
You can just use
$link = $element->getAttribute('href');
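Both suggestions, selecting the @href attribute nodes directly or selecting the <a> elements and calling getAttribute(), can be compared side by side. This is a minimal sketch on hypothetical markup shaped to fit the XPath in the question; the example.com URLs and the variable names are placeholders, and real Google result pages may be structured differently.

```php
<?php
// Hypothetical markup shaped like the XPath in the question; URLs are placeholders.
$result = <<<HTML
<div id="ires"><ol id="rso">
<li class="g"><h3 class="r"><a href="http://example.com/a">First</a></h3></li>
<li class="g"><h3 class="r"><a href="http://example.com/b">Second</a></h3></li>
</ol></div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($result);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Variant 1: select the attribute nodes directly.
$fromAttributes = [];
foreach ($xpath->query("//div[@id='ires']/ol[@id='rso']/li/h3/a/@href") as $href) {
    $fromAttributes[] = $href->value;
}

// Variant 2: select the elements and read the attribute.
$fromElements = [];
foreach ($xpath->query("//div[@id='ires']/ol[@id='rso']/li/h3/a") as $a) {
    $fromElements[] = $a->getAttribute('href');
}

print_r($fromAttributes);
print_r($fromElements);
```

Using libxml_use_internal_errors(true) is an alternative to the @ operator for silencing warnings from malformed real-world HTML.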
Did you try
$element->getElementsByTagName("a")
instead of
$element->getElementsByTagName("href")

extract value from web page

Hi I have a website's home page that I am reading in using Curl and I need to grab the number of pages that the site has.
The information is in a div:-
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
The value I need is 15 but this could be any number depending on the site but will always be in the same position.
How could I read this value easily and assign it to a variable in PHP.
Thanks
Jonathan
You can use PHP's DOM module for that. Read the page with DOMDocument::loadhtmlfile(), then create a DOMXPath object and query all span elements within the document having the class="page-numbers" attribute.
(edit: oops, that's not what you're looking for, see second code snippet)
$html = '<html><head><title>:::</title></head><body>
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
</body></html>';
$doc = new DOMDocument;
// since the content "is already here" we use loadhtml(content)
// instead of loadhtmlfile(url)
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//span[@class="page-numbers"]');
echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';
edit: does this
<span class="page-numbers">15</span>
(the second-to-last a element) always point to the last page, i.e. does this link contain the value you're looking for?
Then you can use an XPath expression that selects the second-to-last a element and from there its child span element.
//div[@class="pager"] <- select each <div> where the attribute class equals "pager"
//div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
//div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second to last
//div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second-to-last <a> element in the pager <div>
( you might want to fetch a good XPath tutorial ;-) )
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
echo $nodelist->item(0)->nodeValue;
}
else {
echo 'not found';
}
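If the page numbers are direct <span> children of the pager div, as in the markup shown in the question, a variant of the same idea can filter on the exact class value instead of on position. This is a minimal sketch under that assumption; $lastPage is a name chosen for illustration.

```php
<?php
// Markup as shown in the question, where the page numbers are direct <span> children.
$html = <<<HTML
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">...</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
HTML;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Keep only spans whose class is exactly "page-numbers", then take the last of those.
// This skips the "current", "dots" and "next" spans, whose class values differ.
$nodes = $xpath->query('//div[@class="pager"]/span[@class="page-numbers"][last()]');
$lastPage = $nodes->length > 0 ? (int) $nodes->item(0)->nodeValue : null;
var_dump($lastPage);
```

The exact-match predicate would miss the last page if it ever carried an extra class (for example when the last page is the current one), so the positional expression may be the safer choice on such pages.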
There is no direct function or easy way to do that. You need to build or use an existing HTML parser to do that.
You can parse it with a regular expression. First find all occurrences of <span class="page-numbers">, then select the last one:
// div html code should be in $div_html
preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers);
print_r(end($page_numbers[1])); // prints 15
This is something you might want to use XPath for, which requires loading the page as a DOM document object:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile("http://path/to/yourfile.html");
$xp = new DOMXPath($domDoc);
$nodes = $xp->query("//xpath/to/relevant/node");
$value = $nodes->item(0)->nodeValue; // text of the first matching node
I haven't written a good XPath expression in a while, so you should do some reading to figure out that part, but it shouldn't be too difficult.
Perhaps:
// assumes $dom is a DOMDocument that already holds the page
$nodes = $dom->getElementsByTagName("span");
$maxPageNum = 0;
foreach ($nodes as $node)
{
    $pageNum = (int) $node->nodeValue;
    if ($node->getAttribute("class") == "page-numbers" && $pageNum > $maxPageNum)
    {
        $maxPageNum = $pageNum;
    }
}
I don't know PHP well, so there may be a tidier way to read the class and inner text of a DOM node, but the idea is to keep the largest number found in a span whose class is exactly "page-numbers".
Just wanted to say a huge thank you to Volkerk for helping out - it worked really well. I had to make a few slight changes and ended up with this:-
function getusers($userurl)
{
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
$lastpage = $nodelist->item(0)->nodeValue;
$users = $lastpage * 35;
$userurl = $userurl.'?page='.$lastpage;
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
$users = $users + $nodelist->length;
echo 'there are ', $users , ' users';
}
else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
echo 'there are ', $nodelist->length, ' users';
}
}
