How to parse body class with Xpath? - php

I'm trying to parse a page with XPath, but I can't get the body class.
Here is what I'm trying:
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach($nodes as $node) {
$canonical = $node->nodeValue;
}
$nodes = $xpath->query('//html/body/@class');
foreach($nodes as $node) {
$bodyclass = $node->nodeValue;
}
$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r ($output); echo '</pre>';
?>
Here is what I get:
Array
(
[canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
[bodyclass] =>
)
It works for many elements (title, canonical, div...) but not for the body class.
I've tested the XPath query with a Chrome extension and it seems well formed.
What is wrong?

Related

Why is an HTML attribute not displayed via XPath in PHP?

<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';
$badClasses = array('');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
$xPath = new DOMXpath($dom);
foreach($badClasses as $badClass){
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$domElemsToRemove = ''; // container of deleted elements
foreach ( $domNodeList as $domElement ) {
$domElemsToRemove .= $dom->saveHTML($domElement); // concat them
$domElement->parentNode->removeChild($domElement); // then remove
}
}
$content = $dom->saveHTML();
echo htmlentities($domElemsToRemove);
?>
Works: //div[@class="remove-me"] or //div[@class="remove-me"]/text()
Not working: //div[@class="remove-me"]/@id
Maybe there is an easier way?
The XPath //div[@class="remove-me"]/@id is correct, but it returns attribute nodes rather than elements, so you just need to loop over the returned nodes and add each nodeValue to a list of matching IDs:
$xPath = new DOMXpath($dom);
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$ids = []; // container of deleted elements
foreach ( $domNodeList as $domElement ) {
$ids[] = $domElement->nodeValue;
}
print_r($ids);
If the aim is to fetch the ID of every element with class "remove-me", which is how I interpret the question, then perhaps you can try something like this (untested, by the way):
// ... other code before
$xp=new DOMXpath( $dom );
$col= $xp->query( '//*[@class="remove-me"]' );
if( $col->length > 0 ){
foreach($col as $node){
$id=$node->hasAttribute('id') ? $node->getAttribute('id') : 'banana';
echo $id;
}
}
However, the code in the question suggests that you want to delete nodes. In that case, build an array of nodes (or use the node list) and iterate through it from the end to the front, i.e. backwards.
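A minimal sketch of that backwards removal, reusing the markup from the question. The variable names $removedIds and $remaining are only illustrative, and iterating from the end is done here as a precaution so that removing a node can never shift the index of one still to be visited:

```php
<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select the elements themselves, not their id attributes,
// because only element nodes can be removed from their parent.
$nodes = $xpath->query('//div[@class="remove-me"]');

$removedIds = [];                   // hypothetical container for the IDs of removed elements
for ($i = $nodes->length - 1; $i >= 0; $i--) {
    $node = $nodes->item($i);
    $removedIds[] = $node->getAttribute('id');
    $node->parentNode->removeChild($node);
}

$remaining = $dom->saveHTML();      // document without the removed divs
print_r($removedIds);
```

Reading the id with getAttribute() before removing the element keeps both pieces of information without a second query.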

Xpath nodeValue/textContent unable to see <BR> tag

HTML is as follows:
<a>ABC<BR>DEF</a>
However, both the nodeValue and textContent properties show "ABCDEF" as the value.
Is there any way to show or parse the <BR>?
Maybe this will help you: DOMNode::C14N
It returns the canonicalized markup of the node, including its child tags.
<?php
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
@$doc->loadHTML($a);
$finder = new DomXPath($doc);
$nodes = $finder->query("//a");
foreach ($nodes as $node) {
var_dump($node->c14n());
}
I know you have already solved your problem, but I wanted to add a more direct way of solving it: select the child nodes of the element and save the HTML of each one.
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
$doc->loadHTML($a);
$xp = new DomXPath($doc);
$nodes = $xp->query("//a/node()");
$text = '';
foreach ($nodes as $node) {
$text .= $doc->saveHTML($node);
}
echo $text;
Outputs...
ABC<br>DEF

Undefined property: DOMNodeList::$textContent when parsing a web page

My code1 parses the web page and gets the td content for me.
code1
<?php
$url='http://www.sse.com.cn/marketservices/tradingservice/shhksc/eligible/';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="hk_view"]//table[@class="tablestyle"]//tr//td[position()<4 and position()>1]');
foreach($nodes as $node){
echo $node->textContent.'<br>';}
?>
Now I have changed the code to parse the page in a different way.
code2
<?php
$url='http://www.sse.com.cn/marketservices/tradingservice/shhksc/eligible/';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="hk_view"]//table[@class="tablestyle"]//tr');
foreach($nodes as $node){
$sub =$xpath->query('//td[position()<4 and position()>1]' ,$node);
echo $sub->textContent.'<br>';}
?>
Is the XPath expression wrong?
$sub =$xpath->query('//td[position()<4 and position()>1]' ,$node);
It is the result of my code1.
Following har07's answer, I rewrote code2 as code3, but another problem remains. Please test it with my code3.
code3
<?php
$url='http://www.sse.com.cn/marketservices/tradingservice/shhksc/eligible/';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="hk_view"]//table[@class="tablestyle"]//tr');
foreach($nodes as $node){
$sub =$xpath->query('//td[position()<4 and position()>1]' ,$node);
foreach($sub as $s){
echo $s->textContent.'<br>';
}
}
?>
The error isn't caused by the XPath expression you use. As the error message suggests, query() returns a DOMNodeList, which doesn't have a textContent property. It is DOMNode that has textContent.
You need to iterate through the DOMNodeList to reach its individual DOMNode members and read the textContent property on each one. Note the leading dot in the inner expression, which makes it relative to the current row:
foreach($nodes as $node){
$sub = $xpath->query('.//td[position()<4 and position()>1]' ,$node);
foreach($sub as $s){
echo $s->textContent;
}
}
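To illustrate why the leading dot matters, here is a small sketch on a made-up two-row table (the markup is an assumption, not the real page). An expression starting with // always searches from the document root, even when a context node is passed as the second argument, while .// searches only below the context node:

```php
<?php
// Hypothetical sample table standing in for the real page.
$html = '<table>'
      . '<tr><td>a1</td><td>a2</td><td>a3</td><td>a4</td></tr>'
      . '<tr><td>b1</td><td>b2</td><td>b3</td><td>b4</td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$firstRow = $xpath->query('//tr')->item(0);

// '//td' ignores $firstRow and matches cells in every row.
$absolute = $xpath->query('//td[position()<4 and position()>1]', $firstRow);
// './/td' is evaluated relative to $firstRow only.
$relative = $xpath->query('.//td[position()<4 and position()>1]', $firstRow);

$relativeTexts = [];
foreach ($relative as $cell) {
    $relativeTexts[] = $cell->textContent;
}

echo $absolute->length, "\n";       // 4 (second and third cell of both rows)
echo $relative->length, "\n";       // 2 (second and third cell of the first row)
```

This would explain why a loop using //td prints the same cells again for every row.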

simple_html_dom for faster retrieval

I am trying to do some web scraping using simple_html_dom, but I only want the inner text of a span element. Do I have to load the entire page for that? It takes a lot of time since I am running it in a loop. What are the alternatives for doing this faster?
Here is what I am doing now:
$html = file_get_html($url);
foreach($html->find('span') as $element) {
    if($element->innertext=="some text") {
        $html->clear();
        unset($html);
        break;
    }
    else {
        //do something
    }
}
This is too slow when used inside a loop. Is there a faster way to do this?
I am not sure about the speed, but instead of using a foreach loop, you can pass an index to find() to get a single element directly:
$html->find( $selector, $idx )
<?php
$html = file_get_html( $url );
if ( is_object( $html ) ) {
if ( $span = $html->find( "span", 0 ) ) {
$span->innertext = "some text";
}
}
?>
You could give the following a try:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span")->item(0)->nodeValue;
echo $content;
Probably the fastest is to let XPath do the text filtering itself:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span[contains(text(), 'some text')]")->item(0)->nodeValue;
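One caveat with the snippets above: item(0) returns null when nothing matches, so reading nodeValue from it directly fails on pages without a matching span. A minimal sketch with a guard, using an inline HTML string as a stand-in for the downloaded page (the markup and the $match variable are assumptions for illustration):

```php
<?php
// Hypothetical markup standing in for the fetched page.
$html = '<div><span>other</span><span>some text here</span></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings from imperfect HTML
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// item(0) is null if no span contains the text.
$match = $xpath->query("//span[contains(text(), 'some text')]")->item(0);
$content = $match !== null ? $match->nodeValue : null;

echo $content ?? 'no match';
```

Note that this still parses the whole document; only the search for the span is delegated to XPath.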

Simple HTML DOM gets only 1 element

I'm following a simplified version of the scraping tutorial by NetTuts here, which basically finds all divs with class=preview
http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/comment-page-1/#comments
This is my code. The problem is that when I count $items I get only 1, so it's getting only the first div with class=preview, not all of them.
$articles = array();
$html = new simple_html_dom();
$html->load_file('http://net.tutsplus.com/page/76/');
$items = $html->find('div[class=preview]');
echo "count: " . count($items);
Try using DOMDocument and DOMXPath:
$file = file_get_contents('http://net.tutsplus.com/page/76/');
$dom = new DOMDocument();
@$dom->loadHTML($file);
$domx = new DOMXPath($dom);
$nodelist = $domx->evaluate("//div[@class='preview']");
foreach ($nodelist as $node) { print $node->nodeValue; }
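Keep in mind that @class='preview' only matches when the class attribute is exactly "preview". If the divs on the page carry additional classes, a token match is needed. The following sketch uses made-up markup (an assumption, not the real page) to show the difference:

```php
<?php
// Hypothetical markup: one exact class, one with an extra class, one lookalike.
$html = '<div class="preview">one</div>'
      . '<div class="preview featured">two</div>'
      . '<div class="previews">three</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings from imperfect HTML
$dom->loadHTML($html);
libxml_clear_errors();

$domx = new DOMXPath($dom);

// Exact comparison of the whole attribute value.
$exact = $domx->query("//div[@class='preview']");

// Token match: pad the normalized class list with spaces and look for ' preview '.
$token = $domx->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' preview ')]"
);

echo $exact->length, "\n";          // 1
echo $token->length, "\n";          // 2
```

The padding with spaces is what keeps "previews" from matching, which a plain contains(@class, 'preview') would not do.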
