PHP: Fetch content from a html page using xpath() - php

I'm trying to fetch the content of a div in an HTML page using XPath and DOMDocument. This is the structure of the page:
<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div class="div2"></div>
</div>
I want to get only the content of the p elements, not the spans and divs. I came across this XPath expression, .//*[@id='content']/p, but something is not right because I get only the first p. I tried other expressions with following-sibling and node(), but they all return only the first p.
.//*[@id='content']/span/following-sibling::p
.//*[@id='content']/node()[self::p]
This is how I use XPath:
$domDocument=new DOMDocument();
$domDocument->encoding = 'UTF-8';
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query($this->xpath);
$content = $this->GetHTMLFromDom($domNodeList);
And this is how I get the HTML from the nodes:
private function GetHTMLFromDom($domNodeList){
$domDocument = new DOMDocument();
$node = $domNodeList->item(0);
foreach($node->childNodes as $childNode)
$domDocument->appendChild($domDocument->importNode($childNode, true));
return $domDocument->saveHTML();
}

This XPath expression:
//div[@id='content']/p
selects the wanted node set (five p elements).
EDIT: Now it's clear what your problem is. GetHTMLFromDom() only reads item(0), the first matched node, so you need to iterate over the whole node list:
private function GetHTMLFromDom($domNodeList){
$domDocument = new DOMDocument();
foreach ($domNodeList as $node) {
$domDocument->appendChild($domDocument->importNode($node, true));
}
return $domDocument->saveHTML();
}
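As a self-contained illustration of iterating over the whole node list, here is a minimal sketch. The helper name fetchOuterHtml() and the sample markup in $page are assumptions made for this example, not part of the original code. It serializes each matched node with saveHTML($node) instead of importing it into a second document.

```php
<?php
// Hypothetical helper: returns the outer HTML of every node matched by $expr.
function fetchOuterHtml(string $html, string $expr): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $out = '';
    foreach ($xpath->query($expr) as $node) {
        $out .= $doc->saveHTML($node);  // serialize this one node, not the whole document
    }
    return $out;
}

// Assumed sample page, modelled on the structure in the question.
$page = '<div id="content"><div class="div1"></div><span class="span1"></span>'
      . '<p>one</p><p>two</p><div class="div2"></div></div>';

echo fetchOuterHtml($page, "//div[@id='content']/p"), "\n";
```

Because every node in the list is visited, both p elements end up in the result, while the span and the inner divs are left out.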

Related

Get h2 html using Simple HTML DOM Parser

I have the HTML web page with this code:
<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>
Now I want to use Simple HTML DOM Parser to get the text value of h2 in this div.
My code is:
$name = $html->find('h2[class="title-medium br-bottom"]');
echo $name;
But it always returns an error:
Notice: Array to string conversion in C:\xampp\htdocs\index.php on line 21
Array
How can I fix this error?
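With Simple HTML DOM, find() returns an array of elements, which is why echo prints "Array". Loop over the result and print each element's text:
foreach($html->find('h2') as $element){
echo $element->plaintext; // text content of the h2
}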
There are other ways to parse the HTML.
Method 1.
You can get the h2 tags with DOMDocument and getElementsByTagName:
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
@$dom->loadHTML($received_str);
$h2tags = $dom->getElementsByTagName('h2');
foreach ($h2tags as $_h2){
echo $_h2->getAttribute('class');
echo $_h2->nodeValue;
}
Method 2.
You can also parse it with XPath:
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($received_str);
$xpath = new DomXPath($dom);
$nodes = $xpath->query("//h2[@class='title-medium br-bottom']");
header("Content-type: text/plain");
foreach ($nodes as $i => $node) {
echo $node->nodeValue; // print the text of each matching h2
}
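Note that @class='title-medium br-bottom' compares the whole attribute string, so it stops matching if the classes are reordered or another class is added. A minimal sketch of matching a single class token instead is shown here; the helper name h2TextByClass() is an assumption for illustration, and it assumes the class name contains no quote characters.

```php
<?php
// Hypothetical helper: text of every <h2> that carries the given class token.
function h2TextByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Pad the class attribute with spaces so that only whole tokens match.
    $expr = sprintf(
        '//h2[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($expr) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';

print_r(h2TextByClass($received_str, 'br-bottom'));
```

Searching for 'br-bottom' finds the heading, while a partial token such as 'title' does not match.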

Attempted XPath query not showing any results

I'm currently working on a fantasy sports site, and I want to be able to pull basic stats from another site. (I don't have much experience with XML or pulling data from other sites).
I inspected the element to get its XPath, which gave me: //*[@id="cp1_ctl01_pnlPlayerStats"]/table[1]/tbody/tr[4]/td[18]
I've looked into a couple of methods for pulling the info, but I just end up with empty elements in the table on my site.
Here's my code:
$doc = new DOMDocument();
@$doc->loadHTMLFile($P_RotoLink);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//*[@id="cp1_ctl01_pnlPlayerStats"]/table[1]/tbody/tr[4]/td[18]');
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
A few things I've tried have thrown errors, and any time I finally get past them or suppress them I get empty content. I've tried a bunch of different formats, but none seem to give me the desired content.
Edit: Here's the source HTML, I want to grab the value within the td (13.0).
Edit 2: So this is what I'm trying now:
$html = file_get_contents($P_RotoLink);
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXpath( $doc);
foreach ($xpath->query('//*[@id="cp1_ctl01_pnlPlayerStats"]/table//tr[4]/td[18]') as $node) {
$ppg = substr($node->textContent,0,3);
echo $ppg;
}
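The problem is that the table in the screenshot doesn't have a tbody node, but your XPath expression includes tbody, which causes DOMXPath::query to return an empty node list. Browser inspectors show the parsed DOM, in which the browser adds an implied tbody, so an XPath copied from the inspector can contain a tbody step that is missing from the raw HTML. I suggest ignoring tbody and fetching the rows with //tr.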
Example
$html = <<<'HTML'
<div id="cp1_ctl01_pnlPlayerStats">
<table>
<tr></tr>
<tr>
<td><span>0.9</span>1.0<span>3.0</span></td><td>2.0</td>
</tr>
</table>
</div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$expr = '//*[@id="cp1_ctl01_pnlPlayerStats"]/table//tr[2]/td[1]/text()';
$td = $xp->query($expr);
if ($td->length) {
var_dump($td[0]->nodeValue);
}
Output
string(3) "1.0"
The text() node test selects all text node children of the context node, so the text inside the nested span elements is not included.
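To see the tbody effect in isolation, the following sketch counts matches for both styles of expression against markup with and without a tbody element. The helper name countMatches() and the two sample tables are assumptions for illustration only.

```php
<?php
// Hypothetical helper: number of nodes that $expr selects in $html.
function countMatches(string $html, string $expr): int
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    return (new DOMXPath($doc))->query($expr)->length;
}

// Assumed sample markup: the same table with and without an explicit tbody.
$plain = '<table><tr><td>13.0</td></tr></table>';
$tbody = '<table><tbody><tr><td>13.0</td></tr></tbody></table>';

foreach (['//table/tbody/tr/td', '//table//tr/td'] as $expr) {
    printf(
        "%-22s plain: %d, with tbody: %d\n",
        $expr,
        countMatches($plain, $expr),
        countMatches($tbody, $expr)
    );
}
```

DOMDocument::loadHTML() does not add a tbody that is absent from the source, so the expression with an explicit tbody step only matches the second table, while //table//tr/td matches both.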

How to find element in already parsed HTML data

Here I have some very simple code to grab all the 'div' elements with the classname 'info_block'. I am wondering how I would go about finding another element with the classname 'price' within each 'info_block' and displaying it instead of the whole 'info_block' element.
Main goal: find the price in each element with classname 'info_block', but do it inside the foreach, because I may need to find other elements.
<?php
$page = file_get_contents('example.com');
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
$div1 = $xpath->query('//div[@class="info_block"]');
foreach ($div1 as $var1){
//echo $dom->saveHTML($var1);
}
?>
There is an element in each 'info_block' with the classname 'price', and I would like to display only that element. Like so...
foreach ($div1 as $var1){
$dom2 = new DOMDocument();
$dom2->loadHTML($dom->saveHTML($var1));
$xpath2 = new DOMXPath($dom2);
$div2 = $xpath2->query('//div[@class="price"]');
$div2 = $div2->item(0);
echo $dom2->saveHTML($div2);
}
But instead of just giving me the price it returns the whole HTML for 'info_block' as it did before.
You could take each <div class="info_block"> found and search inside it for <div class="price"> by passing it as the second argument of ->query():
$div1 = $xpath->query('//div[@class="info_block"]');
foreach ($div1 as $var1){
$div2 = $xpath->query('./div[@class="price"]', $var1); // $var1 (each info_block div) is the context node
$div2 = $div2->item(0);
echo $dom->saveHTML($div2);
}
Note: You do not need to create another instance of DOM and DOMXpath.
This example assumes HTML structured like this:
<div class="info_block"> <!-- each info block -->
<div class="price">1</div> <!-- has a price inside it -->
</div>
<div class="info_block">
<div class="price">2</div>
</div>
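The leading dot in the relative expression matters. A minimal sketch of the difference follows; the helper name pricesPerBlock() is an assumption for illustration. An expression that starts with // searches the whole document even when a context node is passed, whereas .// stays inside the context node.

```php
<?php
$html = '<div class="info_block"><div class="price">1</div></div>'
      . '<div class="info_block"><div class="price">2</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: for each info_block, count the nodes that $inner
// selects when the block is passed as the context node.
function pricesPerBlock(DOMXPath $xpath, string $inner): array
{
    $counts = [];
    foreach ($xpath->query('//div[@class="info_block"]') as $block) {
        $counts[] = $xpath->query($inner, $block)->length;
    }
    return $counts;
}

// Relative to each block: one price per block.
print_r(pricesPerBlock($xpath, './/div[@class="price"]'));
// Absolute despite the context node: every price in the document, per block.
print_r(pricesPerBlock($xpath, '//div[@class="price"]'));
```

With the relative form each block reports one price; with the absolute form each block reports both prices in the document.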
You can combine queries in XPath with the | operator to find all the desired elements in one go:
$xpath->query('//div[@class="info_block"]|//div[@class="price"]');
You can also pass a DOM element as the optional second argument of $xpath->query() to run a relative XPath query:
<?php
$page = file_get_contents('example.com');
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
$div1 = $xpath->query('//div[@class="info_block"]');
foreach ($div1 as $var1){
$div2 = $xpath->query('.//*[@class="price"]', $var1); // the leading "." keeps the search inside $var1
foreach ($div2 as $var2) {
echo $var2->nodeValue. "\n";
}
}
?>
For more, see the DOMXPath::query documentation.

Get href value from matching anchor text

I'm pretty new to the DOMDocument class and can't seem to find an answer for what I'm trying to do.
I have a large HTML file and I want to grab the link from an element based on its anchor text.
For example:
$html = <<<HTML
<div class="main">
<a href="http://link.com"><span><font><img src="http://images.com/spacer.gif"/>Keyword</font></span></a>
other text
</div>
HTML;
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
I want to get the value of the href attribute of any element that has the text "Keyword". I hope that was clear.
$html = <<<HTML
<div class="main">
<a href="http://link.com"><span><font><img src="http://images.com/spacer.gif"/>Keyword</font></span></a>
other text
</div>
HTML;
$keyword = "Keyword";
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
if ($a->nodeValue === $keyword) {
echo $a->getAttribute('href'); // prints "http://link.com"
break;
}
}
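The strict comparison above fails if the anchor text carries surrounding whitespace. A variant that trims the text and collects every matching link could look like the sketch below; the helper name hrefsByAnchorText() and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: href of every <a> whose trimmed text equals $keyword.
function hrefsByAnchorText(string $html, string $keyword): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    // Only anchors that actually have an href attribute are considered.
    foreach ($xpath->query('//a[@href]') as $a) {
        if (trim($a->textContent) === $keyword) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// Assumed sample markup with whitespace around the anchor text.
$sample = '<div class="main"><a href="http://link.com"><span> Keyword </span></a>'
        . ' other text <a href="http://other.com">Something else</a></div>';

print_r(hrefsByAnchorText($sample, 'Keyword'));
```

Doing the text comparison in PHP rather than inside the XPath string also avoids having to escape quotes in the keyword.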

Iterate through elements with DOMDocument & DOMXPath

I am trying to iterate through every child element of the containing div:
$html = ' <div id="roothtml">
<h1>
Introduction</h1>
<p>text</p>
<h2>
text</h2>
<p>
test</p>
</div>';
And I have this PHP:
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->preserveWhitespace = false;
$xpath = new DOMXPath($dom);
$els = $xpath->query("/div");
print_r($els);
All I get though is DOMNodeList Object ( )
Having looked at the IBM tutorial I should be getting an array. What is it I am doing wrong?
Any help is appreciated.
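You're using the wrong query string; you should be using //div. loadHTML() wraps the fragment in html and body elements, so the document root is html and the absolute path /div matches nothing.
Iterate over the list like this: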
$els = $xpath->query("//div");
foreach( $els as $el) {
echo $el->textContent;
}
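To check this and to reach the child elements of the container rather than the container itself, a minimal sketch along these lines may help. The variable names are chosen for this example, and the markup is a compact version of the HTML in the question.

```php
<?php
$html = '<div id="roothtml"><h1>Introduction</h1><p>text</p><h2>text</h2><p>test</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The parser wrapped the fragment, so the root element is not the div.
$rootName  = $dom->documentElement->nodeName;
$atRoot    = $xpath->query('/div')->length;            // absolute path from the root
$underBody = $xpath->query('/html/body/div')->length;  // full path through the wrappers

// "/*" after the container selects its element children only.
$names = [];
foreach ($xpath->query('//div[@id="roothtml"]/*') as $child) {
    $names[] = $child->nodeName;
}

echo "root: $rootName, /div: $atRoot, /html/body/div: $underBody\n";
echo implode(', ', $names), "\n";
```

The /* step skips the whitespace text nodes between tags, so only the h1, p, and h2 elements are visited.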
