DOM XPath Selector not grabbing classes - php

I was looking through the Stack Overflow question "Getting DOM Elements By Class name", and it suggested that I can get elements by class name with this code:
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'someclass someclass2';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
print "<pre>".print_r($nodes,true)."</pre>";
I also tried changing $classname to just one class:
$classname = 'someclass2';
I'm getting empty results. Any idea why?

You'll have to loop through the results, as print_r() will not print the members of a DOMNodeList. Like this:
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'someclass someclass2';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
// iterate through the result; print_r will not suffice
foreach($nodes as $node) {
echo $node->nodeValue;
}
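As a quick sanity check before looping, the length property of the result tells you whether the query matched anything, whatever print_r() happens to display. A minimal sketch using the single-class case from the question (the markup is the question's own; nothing else is assumed):

```php
<?php
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($text);
$finder = new DOMXPath($dom);

// Match on one class only; the padded spaces stop partial-name matches.
$classname = 'someclass2';
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

// length is the number of matched nodes
echo $nodes->length, "\n";

foreach ($nodes as $node) {
    echo $node->nodeName, ': ', $node->nodeValue, "\n";
}
```

If length is greater than zero, the query itself is fine and the "empty" output came from how the list was printed.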

Related

Iterate through all class blocks using DOM

I am scraping data from a web page using the DOM classes.
There are various div blocks, each with a review, image, date, rating, etc.
Here is the code that scrapes the data for a particular class, but it only returns the first match. How can I iterate so that I get the details from all matching elements?
Here is my code:
libxml_use_internal_errors(true);
$html= file_get_contents('http://www.yelp.com/biz/franchino-san-francisco?start=80');
$html = escapeshellarg($html) ;
$html = nl2br($html);
$classname = 'rating-qualifier';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
echo $review = $results->item(0)->nodeValue;
}
$classname = 'review_comment ieSucks';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
echo $review = $results->item(0)->nodeValue;
}
$meta = $dom->documentElement->getElementsByTagName("meta");
echo $meta->item(0)->getAttribute('content');
Output: http://codepad.viper-7.com/j0cTNi
UPDATE
http://codepad.viper-7.com/lHS9jk
Here I added:
$classname = 'review-wrapper';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
foreach($results as $node)
{
// scraping code here
}
But it scrapes the same value on every iteration. See the result: http://codepad.viper-7.com/lHS9jk
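One likely cause of getting the same value each time is running a document-wide query (starting with //) inside the loop, which always finds the first match in the whole page. A sketch of the usual fix: pass the current node as the second argument to query() and start the expression with a dot so it is evaluated relative to that node. The markup below is a made-up stand-in for the real page, and the $reviews array is just an illustrative way to collect the results.

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<html><body>'
    . '<div class="review-wrapper"><span class="rating-qualifier">1/1/2013</span>'
    . '<p class="review_comment">Great</p></div>'
    . '<div class="review-wrapper"><span class="rating-qualifier">2/2/2013</span>'
    . '<p class="review_comment">Okay</p></div>'
    . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$reviews = [];
foreach ($xpath->query("//*[@class='review-wrapper']") as $wrapper) {
    // ".//" searches only inside $wrapper, which is passed as the context node.
    $date    = $xpath->query(".//*[@class='rating-qualifier']", $wrapper)->item(0);
    $comment = $xpath->query(".//*[@class='review_comment']", $wrapper)->item(0);

    $reviews[] = [
        'date'    => $date ? trim($date->nodeValue) : null,
        'comment' => $comment ? trim($comment->nodeValue) : null,
    ];
}

print_r($reviews);
```

Note that there is no need to build a new DOMDocument and DOMXPath for every class name; one of each can serve all the queries.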

DOMDocument removing html elements

Here is my code:
$text = '<div class="cgus_post"><div class="imgbox"><img src="/cgmedia/default.gif"></div>
<h2 id="post-15055">
Willie Nelson Celebrates 80th Birthday Stoned and Auditioning for Gandalf</h2>
<p>This video pretty much sums up why Willie Nelson is fucking awesome. Willie decided to celebrate his 80th birthday by recording an ‘audition’ for Peter Jackson. Willie wants to take the reigns from Ian McKellan in The Hobbit 2, and decided to show off his acting skills and give some of his own wizardly advice. The result is hilarious. Watch …</p>
<br class="clear">
</div>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'cgus_post';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach($nodes as $node){
echo $node->nodeValue;
}
The problem I am having is that I am querying for the div with the class cgus_post, and it's returning just the text. How do I have it return the HTML elements as well?
Here's my innerHTML function that I always use:
function innerHTML(DOMNode $node, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($node->childNodes as $inner_node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($inner_node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
So then you do:
$dom = new DOMDocument();
$dom->loadHTML($html);
echo htmlentities(innerHTML($dom->documentElement->childNodes->item(0)->firstChild));
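A shorter variant, as a sketch: since PHP 5.3.6, DOMDocument::saveHTML() accepts a node argument, so the temporary document is not strictly needed. The helper name innerHtmlOf and the shortened markup are placeholders for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: concatenates the serialized child nodes of $node.
// Assumes $node belongs to a document (ownerDocument is not null).
function innerHtmlOf(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Shortened version of the markup from the question.
$text = '<div class="cgus_post"><h2 id="post-15055">Title</h2><p>Body</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($text);
$finder = new DOMXPath($dom);

$div = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' cgus_post ')]")->item(0);

$inner = innerHtmlOf($div);   // markup inside the div
$outer = $dom->saveHTML($div); // the div itself plus its contents

echo $inner, "\n";
echo $outer, "\n";
```

Use saveHTML($div) directly when you want the matched element's own tag included, and the loop over childNodes when you only want what is inside it.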

PHP nodevalue stripping html tags

I have seen similar solutions elsewhere, but I haven't been able to adapt them to my own code.
I have a function that splits an HTML string at the paragraph tags and returns the paragraphs in an array. The code is as follows...
$dom = new DOMDocument();
$dom->loadHTML($string);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//p");
$result = array();
foreach ($entries as $entry) {
$result[] = '<' . $entry->tagName . '>' . $entry->nodeValue . '</' . $entry->tagName . '>';
}
return $result;
Can someone help me replace the nodeValue part so that it returns the paragraph content with its HTML tags intact?
The html I am testing against is this: http://adam-makes-websites.com/tests/htmltest/test.html
A full test of what I'm doing with the code (as it stands, with the suggestion to use ownerDocument->saveHTML applied) is here: http://adam-makes-websites.com/tests/htmltest/runtest.txt
The output from the test can be seen here: http://adam-makes-websites.com/tests/htmltest/runtest.php
You need to call saveHTML on the ownerDocument property:
$result[] = $entry->ownerDocument->saveHTML($entry);
$dom = new DOMDocument();
$dom->loadHTML($string);
$entries = $dom->getElementsByTagName('p');
$new_dom = new DOMDocument();
foreach ($entries as $entry) {
$new_dom->appendChild($new_dom->importNode($entry, TRUE));
}
$result = $new_dom->saveHTML();
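To see the difference between nodeValue and the ownerDocument->saveHTML($entry) suggestion side by side, here is a small sketch. The function name splitParagraphs and the sample markup are made up for illustration; the real input is the test page linked in the question.

```php
<?php
// Hypothetical wrapper around the loop from the question,
// with nodeValue swapped for saveHTML() on the owner document.
function splitParagraphs(string $string): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($string);

    $result = [];
    foreach ($dom->getElementsByTagName('p') as $entry) {
        // Serializes the <p> element together with its nested tags.
        $result[] = $entry->ownerDocument->saveHTML($entry);
    }
    return $result;
}

$string = '<html><body><p>First <b>bold</b> paragraph</p>'
    . '<p>Second <a href="#x">link</a></p></body></html>';

$paragraphs = splitParagraphs($string);
print_r($paragraphs);

// For comparison: nodeValue keeps only the text content.
$dom = new DOMDocument();
$dom->loadHTML($string);
$textOnly = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $textOnly, "\n";
```

Because saveHTML($entry) already emits the opening and closing p tags, there is no need to glue them on by hand with tagName.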

PHP xpath contains class and does not contain class

The title sums it up. I'm trying to query an HTML file for all div tags that contain the class result and do not contain the class grid.
<div class="result grid">skip this div</div>
<div class="result">grab this one</div>
Thanks!
This should do it:
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query(
"//div[contains(@class, 'result') and not(contains(@class, 'grid'))]");
foreach ($nodeList as $node) {
echo $node->nodeName . "\n";
}
Your XPath would be //div[contains(concat(' ', @class, ' '), ' result ') and not(contains(concat(' ', @class, ' '), ' grid '))]
The XPath syntax would be...
//div[not(contains(@class, 'grid'))]
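The answers above differ in how strictly they match. A plain contains(@class, 'result') is a substring test, so it also matches a class such as results, while the space-padded form matches whole class names only. A sketch comparing the two; the third div with class results is an extra added here purely to show the difference:

```php
<?php
$html = '<html><body>'
    . '<div class="result grid">skip this div</div>'
    . '<div class="result">grab this one</div>'
    . '<div class="results">substring trap</div>' // added for illustration
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Substring match: also catches "results".
$loose = "//div[contains(@class, 'result') and not(contains(@class, 'grid'))]";

// Whole-name match: pad the attribute and the names with spaces.
$strict = "//div[contains(concat(' ', normalize-space(@class), ' '), ' result ')"
    . " and not(contains(concat(' ', normalize-space(@class), ' '), ' grid '))]";

$looseCount  = $xpath->query($loose)->length;
$strictCount = $xpath->query($strict)->length;

echo "loose: $looseCount, strict: $strictCount\n";
```

The third expression, //div[not(contains(@class, 'grid'))], only excludes grid and would return every other div on the page, so it needs the result condition as well.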

Using DOMDocument to extract from HTML document by class

In the DOMDocument class there are methods to get elements by id and by tag name (getElementById & getElementsByTagName) but not by class. Is there a way to do this?
As an example, how would I select the div from the following markup?
<html>
...
<body>
...
<div class="foo">
...
</div>
...
</body>
</html>
The simple answer is to use xpath:
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="foo"]')->item(0);
But that only matches when the class attribute is exactly foo, so it fails on elements with several space-separated classes. To select by one class among several, use this query:
//*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]
$html = '<html><body><div class="foo">Test</div><div class="foo">ABC</div><div class="foo">Exit</div><div class="bar"></div></body></html>';
$dom = new DOMDocument();
@$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
$allClass = $xpath->query("//@class");
$allClassBar = $xpath->query("//*[@class='bar']");
echo "There are " . $allClass->length . " with a class attribute<br>";
echo "There are " . $allClassBar->length . " with a class attribute of 'bar'<br>";
In addition to ircmaxell's answer, if you need to select by space-separated class:
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$classname='foo';
$div = $xpath->query("//table[contains(@class, '$classname')]")->item(0);
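Putting the answers together, one way to get something like a getElementsByClassName for DOMDocument is to wrap the space-padded query in a small function. This is only a sketch: the function name is hypothetical (DOMDocument has no such method in the versions discussed here), and it assumes $classname is a single class name with no quotes or whitespace in it.

```php
<?php
// Hypothetical helper, not a built-in DOMDocument method.
// Assumes $classname contains no quotes or whitespace.
function getElementsByClassName(DOMDocument $dom, string $classname): DOMNodeList
{
    $xpath = new DOMXPath($dom);
    return $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
    );
}

$html = '<html><body><div class="foo">Test</div><div class="foo">ABC</div>'
    . '<div class="foo">Exit</div><div class="bar"></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$foo = getElementsByClassName($dom, 'foo');
$bar = getElementsByClassName($dom, 'bar');
$fo  = getElementsByClassName($dom, 'fo'); // partial name, should not match

echo $foo->length, ' ', $bar->length, ' ', $fo->length, "\n";

foreach ($foo as $div) {
    echo $div->nodeValue, "\n";
}
```

The padded-space form is used here rather than contains(@class, '$classname') so that a partial name such as fo does not match foo.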
