query html table using xpath - remove td from the result - php

I have an HTML table with the class name list.
I'm using the following query to get the data.
$elements = $xpath->query("//table[@class='list']/tr/td");
$result = $dom_object->saveHTML($elements->item(0));
var_dump($result);
It works fine, except that it includes the td tag in the result.
I mean the result looks like this:
<td>
result data
</td>
Can someone tell me how to remove the td tag from the result data?

Maybe you're looking for something like
<?php
$doc = new DOMDocument;
$doc->loadHTML( data() );
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//table[@class='list']/tr/td");
// 1)
$result = (string)$elements->item(0)->nodeValue;
var_dump($result);
// 2)
$frag = $doc->createDocumentFragment();
$node = $elements->item(0)->firstChild;
while( $node ) {
$frag->appendChild( $node->cloneNode(true) );
$node = $node->nextSibling;
}
$result = $doc->saveXML($frag);
var_dump($result);
function data() {
return <<< eoh
<html>
<head><title>...</title></head>
<body>
<table class="list">
<tr><td>result data<br />foo</td></tr>
<tr><td>...</td></tr>
</table>
</body>
</html>
eoh;
}
prints
string(14) "result datafoo"
string(19) "result data<br/>foo"
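If you want HTML rather than XML serialization, a shorter variant of option 2 is to serialize each child of the cell with saveHTML() and join the pieces. This is only a sketch: innerHtml() is a hypothetical helper name, not a built-in function, and it assumes a PHP version where saveHTML() accepts a node argument.

```php
<?php
// Hypothetical helper: serialize the children of a node and leave out
// the node's own opening and closing tag.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes only that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<table class="list"><tr><td>result data<br>foo</td></tr></table>');
$xpath = new DOMXPath($doc);
$cell = $xpath->query("//table[@class='list']/tr/td")->item(0);

echo innerHtml($cell), "\n";
```

Because the children are serialized one by one, the td tag itself never ends up in the string.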

If there is only one text node per cell (i.e. no other markup), you can go for
//table[@class='list']/tr/td/text()
which selects all text nodes that are direct children of the <td/>. If there is markup but still only a single text node like in <td><em>foo</em></td>, you could use
//table[@class='list']/tr/td//text()
If a cell contains more than one text node, you will receive multiple result nodes which are not grouped by table cell any more.
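To make the difference concrete, here is a small sketch with made-up sample markup (one cell with a <br>, one with an <em>). It compares both expressions and shows one way to keep the values grouped per cell by selecting the cells themselves and reading textContent.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<table class="list"><tr><td>a<br>b</td><td><em>c</em></td></tr></table>');
$xpath = new DOMXPath($doc);

// Direct text children only: "a" and "b". The text inside <em> is not
// a direct child of its <td>, so it is not matched.
$direct = $xpath->query("//table[@class='list']/tr/td/text()");

// All descendant text nodes: "a", "b" and "c", no longer grouped by cell.
$all = $xpath->query("//table[@class='list']/tr/td//text()");

// Select the cells and read textContent to get one string per cell.
$perCell = [];
foreach ($xpath->query("//table[@class='list']/tr/td") as $td) {
    $perCell[] = $td->textContent;
}

var_dump($direct->length, $all->length, $perCell);
```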

Related

Fetching value of specific text node using DOMXPath

From the following structure:
I'm trying to fetch the marked text with the following code:
$price_new='div/div[@class="cat_price"]/text()';
if ($price_new!=null && $node = $Website_Xpath->query ($price_new, $row )) {
$result [$value] ['Price'] = $node->item( 0 )->nodeValue;
} else {
$result [$value] ['Price'] = "";
}
but the node value is NULL. How do I fetch the number correctly?
You should provide the actual snippet, not just a screenshot of it. If I interpreted the screenshot correctly the snippet is something like:
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
"
64,9999"<span> - PKR</span>
</div>
</body>
XML;
The text node with the price is the following sibling of the div with the class was. So it is possible to fetch it using that axis:
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$expression = 'string(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1])';
var_dump($xpath->evaluate($expression));
Unlike DOMXpath::query(), DOMXpath::evaluate() can return scalar values depending on the expression. A string cast or a string function will return a string.
string(25) "
"
64,9999""
However, the result contains not only the number but also the quotes and some whitespace. translate() and normalize-space() could be used to clean it up:
$expression = 'normalize-space(
translate(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1], \'"\', " ")
)';
var_dump($xpath->evaluate($expression));
Output:
string(7) "64,9999"
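As a side-by-side illustration of query() versus evaluate(), the sketch below runs a few expressions against a simplified version of the snippet (the quotes around the price are left out here, which is an assumption about the real page).

```php
<?php
$xml = '<body><div class="cat_price"><div class="was">67,000 - PKR</div>64,9999<span> - PKR</span></div></body>';
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// query() always yields a node list (or false on a malformed expression).
$nodes = $xpath->query('//div[@class="cat_price"]/div');

// evaluate() returns a typed scalar when the expression produces one.
$count  = $xpath->evaluate('count(//div[@class="cat_price"]/div)');        // number
$price  = $xpath->evaluate('string(//div[@class="cat_price"]/text()[1])'); // string
$hasWas = $xpath->evaluate('boolean(//div[@class="was"])');                // boolean

var_dump($nodes->length, $count, $price, $hasWas);
```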
Your $Website_Xpath looks like a DOMXPath object. The main issue with your code is then the XPath expression: 'div/div[@class="cat_price"]/text()'. It is a relative path, so it only matches if the context node ($row) has a div child that in turn contains the cat_price div. Either provide the full path from the root node (e.g. /html/body/div), or select all divs with the // prefix.
Example
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
64,9999<span> - PKR</span>
</div>
</body>
XML;
$doc = new DOMDocument();
$doc->loadXML($xml);
$text = '';
$xpath = new DOMXPath($doc);
// Select all text nodes within a <div> having class="cat_price"
if ($nodes = $xpath->query('//div[@class="cat_price"]/text()')) {
// Search for a node with some content, except spaces
foreach ($nodes as $n) {
if ($text = trim($n->nodeValue))
break;
}
}
var_dump($text);
Output
string(7) "64,9999"
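If you do want to keep a relative expression and a context node, as in the question, the following sketch shows how that can work. The wrapper div with class="row" is a hypothetical structure, since the real markup is only known from a screenshot. Note that query() returns an empty list rather than null when nothing matches, so item(0) has to be checked before reading nodeValue.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<body>'
    . '<div class="row"><div class="cat_price">10</div></div>'
    . '<div class="row"><div class="cat_price">20</div></div>'
    . '<div class="row"></div>'
    . '</body>'
);
$xpath = new DOMXPath($doc);

$prices = [];
foreach ($xpath->query('//div[@class="row"]') as $row) {
    // No leading slash: the path is resolved relative to $row.
    $node = $xpath->query('div[@class="cat_price"]/text()', $row)->item(0);
    // item(0) is null when the row has no matching text node.
    $prices[] = $node !== null ? trim($node->nodeValue) : '';
}

var_dump($prices);
```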

get value of html table row tag using php

Table:
<table class="secondary">
<tr><td>BB:</td><td>112</td></tr>
<tr><td>CC:</td><td>99</td></tr>
<tr><td>DD:</td><td>1</td></tr>
</table>
For example, I want to get the third row of this table.
I know how to get values from div tag using ID, like:
$doc = new DomDocument();
$doc->loadHTMLFile('http://www.results.com');
$thediv = $doc->getElementById('result');
echo $thediv->textContent;
but how can we get the values from table row tag?
You can use DomXpath http://php.net/manual/en/class.domxpath.php
(Assuming for this example that your table is the first table on the page with class="secondary")
For example:
<?php
$doc = new DomDocument();
$doc->loadHTMLFile("filename.html");
$xpath = new DomXpath($doc);
$row = $xpath->query('//table[@class="secondary"][1]/tr[3]')->item(0);
// Get the html of the third row:
echo $doc->saveHTML($row);
// Get the values from the td's for the third row
foreach ($row->childNodes as $td) {
echo sprintf("nodeName: %s, nodeValue: %s<br>", $td->nodeName, $td->nodeValue);
}
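If you only need the cell values, a variant is to query the td elements relative to the row and collect them in an array. This is a sketch that uses the table from the question as an inline string instead of a file. It also writes the expression as (//table[...])[1], because //table[...][1] picks the first matching table within each parent rather than the first one in the whole document.

```php
<?php
$html = '<table class="secondary">'
    . '<tr><td>BB:</td><td>112</td></tr>'
    . '<tr><td>CC:</td><td>99</td></tr>'
    . '<tr><td>DD:</td><td>1</td></tr>'
    . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Third row of the first table in the document with class="secondary".
$row = $xpath->query('(//table[@class="secondary"])[1]/tr[3]')->item(0);

$values = [];
if ($row !== null) {
    // Relative path: only the <td> children of this row.
    foreach ($xpath->query('td', $row) as $td) {
        $values[] = trim($td->textContent);
    }
}

var_dump($values);
```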

Parsing HTML and removing specific td

I have html content like the following...
<table>
<tr>
<td>xyx...</td>
<td>abc....</td>
<td><span><h3>Downloads</h3></span><br>blah blah blah...</td>
</tr>
<tr>
<td><h3>Downloads</h3>again some content.</td>
<td>dddd</td>
<td>kkkl...</td>
</tr>
</table>
Now I am trying to delete every td that has the word 'Downloads' anywhere in its content. After some research on the internet I got something working, and the code is as follows...
$res_text = 'MY HTML';
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($res_text);
$selector = new DOMXPath($dom);
$results = $selector->query('//*[text()[contains(.,"Downloads")]]');
if($results->length){
foreach($results as $res){
$res->parentNode->removeChild($res);
}
}
This does delete the word 'Downloads' together with its direct parent node (<span> or <h3>), but I want the whole <td> to be deleted along with its content.
I tried...
$results = $selector->query('//td[text()[contains(.,"Downloads")]]');
but it's not working. Can someone tell me how I can do this?
You don't need the text() in your query, it should be:
$results = $selector->query('//td[contains(.,"Downloads")]');
The whole code:
$dom = new DOMDocument();
$dom->loadHTML($res_text);
$selector = new DOMXPath($dom);
$results = $selector->query('//td[contains(.,"Downloads")]');
if($results->length){
foreach($results as $res){
$res->parentNode->removeChild($res);
}
}
echo htmlentities($dom->saveHTML());
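One thing to keep in mind if you adapt this to getElementsByTagName() instead of XPath: that method returns a live list which changes while you remove nodes, so a plain foreach can skip elements. The result of an XPath query does not have this problem as far as I know. A sketch of the usual workaround, with made-up sample markup, is to copy the list into an array first:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<table><tr><td>a</td><td>Downloads</td><td>Downloads too</td><td>b</td></tr></table>'
);

// Snapshot the live list so that removals do not shift the iteration.
$cells = iterator_to_array($dom->getElementsByTagName('td'));

foreach ($cells as $td) {
    if (strpos($td->textContent, 'Downloads') !== false) {
        $td->parentNode->removeChild($td);
    }
}

$remaining = $dom->getElementsByTagName('td')->length;
var_dump($remaining);
```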

How can I get td values using dom and php

I have a table like this:
<table>
<tr>
<td>Values</td>
<td>5000</td>
<td>6000</td>
</tr>
</table>
I want to get the content of the td elements, but I could not manage it.
<?PHP
$dom = new DOMDocument();
$dom->loadHTML("figures.html");
$table = $dom->getElementsByTagName('table');
$tds=$table->getElementsByTagName('td');
foreach ($tds as $t){
echo $t->nodeValue, "\n";
}
?>
There are multiple problems with this code:
To load from an HTML file, you need to use DOMDocument::loadHTMLFile(), not loadHTML() as you have done. Use $dom->loadHTMLFile("figures.html").
You can't call getElementsByTagName() on a DOMNodeList as you have done (on $table). It is a method of DOMDocument and DOMElement, so pick a single element out of the list first.
You could do something like this:
$dom = new DOMDocument();
$dom->loadHTMLFile("figures.html");
$tables = $dom->getElementsByTagName('table');
// Find the correct <table> element you want, and store it in $table
// ...
// Assume you want the first table
$table = $tables->item(0);
// The <td> elements are children of <tr>, not of <table>, so search the table's descendants
foreach ($table->getElementsByTagName('td') as $td) {
echo $td->nodeValue, "\n";
}
Alternatively, you could just directly search for all elements with tag name td (though I'm sure you want to do that in a table-specific manner).
Another option is to give every td a distinct id attribute when the HTML is generated, so that each cell can be addressed individually.
For example:
for($i=1;$i<=10;$i++){
echo "<td id ='id_".$i."'>".$tdvalue."</td>";
}
You can then fetch the td values by looping again and calling getElementById() for each id.
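The td data can be found inside the childNodes of each row:
$dom = new DOMDocument;
$dom->loadHTMLFile("your-url");
// getElementsByTagName() is not available on a DOMNodeList, so query the document itself
$rows = $dom->getElementsByTagName('tr');
foreach ($rows as $row) {
foreach ($row->childNodes as $cell) {
// Skip whitespace text nodes between the cells
if ($cell->nodeName == 'td') {
echo $cell->nodeValue, "\n";
}
}
}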
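Putting the points from the answers together, here is a sketch that collects the cells grouped by row. It loads the table from the question as an inline string rather than from figures.html, so it can be run on its own.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<table><tr><td>Values</td><td>5000</td><td>6000</td></tr></table>');

// item(0) turns the node list into a single DOMElement, which does
// provide getElementsByTagName().
$table = $dom->getElementsByTagName('table')->item(0);

$rows = [];
foreach ($table->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = $td->nodeValue;
    }
    $rows[] = $cells;
}

print_r($rows);
```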

How to cut off a portion of a html inside <div> and store it as html string by using xpath and domdocument?

I would like to cut out a portion of HTML. I can select it with XPath and DOMDocument, but the problem is that I need the result as an HTML code string. Normally I would use a regular expression for that, but I would rather not write a complicated search pattern that matches the beginning and the end of the tag.
That's the example input:
some html code before
<div>this <b>is</b> what I want</div>
some html after
and the output:
<div>this <b>is</b> what I want</div>
I tried something like this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div/*");
echo $result->saveHTML();
but I only got an error:
Call to undefined method DOMNodeList::saveHTML()
Does anyone know how to get the result as a html string by using DomDocument and XPath?
Thank you, gentlemen, for pointing out my misunderstanding about calling methods that are not available on a child object. But the line:
echo $doc->saveHTML($result->item(0));
generates only a warning (without the HTML string I want to have). Luckily I found another solution and here it is:
<?php
$subject = '<html>
<head>
<title>A very short ebook</title>
<meta name="charset" value="utf-8" />
</head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
<p>A worthy continuation of a fine tradition.</p>
</body>
</html>';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
//echo $doc->saveHTML($result->item(0));
echo domNodeList_to_string($result);
function domNodeList_to_string($DomNodeList) {
$output = '';
$doc = new DOMDocument;
$i = 0; // start at the first node of the list
while ( $node = $DomNodeList->item($i) ) {
// import node
$domNode = $doc->importNode($node, true);
// append node
$doc->appendChild($domNode);
$i++;
}
$output = $doc->saveHTML();
$output = print_r($output, 1);
// I added this because xml output and ajax do not like each others
//$output = htmlspecialchars($output);
return $output;
}
?>
So with a query like this:
$result = $xpath->query("//div");
you get the raw HTML string as output:
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
if the query is:
$result = $xpath->query("//p");
then output will be:
<p style="text-align:right">Written by Kovid Goyal</p><p>A very short ebook to demonstrate the use of XPath.</p><p>This is a truly fascinating chapter.</p><p>A worthy continuation of a fine tradition.</p>
Does anyone know a simpler (built into PHP) method to get the same result?
Try this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0)); //echoes what you want :)
The saveHTML function belongs to the DOMDocument object, you can't call it directly on the node (much less on the NodeList, which is what the query returns), but what you can do is pass it the node as a param.
Also, your query was wrong: what you want is the div element (i.e. //div), not its children (//div/*).
As per the PHP manual entry for DOMXPath::query, the function:
Returns a DOMNodeList containing all nodes matching the given XPath
expression. Any expression which does not return nodes will return an
empty DOMNodeList.
This means that the $result in the following code will be a DOMNodeList object. So if you want to get individual HTML code out of it, you'll need to use the methods available on a DOMNodeList object. In this case, the item method:
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0));
$result->item(0) returns the first DOMNode in the DOMNodeList created by your xpath query.
Try this :
$subject = 'some html code before<div>this <b>is</b> what I want</div>some html after';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
$domNode = $docSave->importNode($node, true);
$docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
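If the goal is only a string with all matched nodes, a second document is not strictly needed. As a sketch (with shortened sample markup, and assuming a PHP version where saveHTML() accepts a node argument), you can serialize each node with the original document and join the results:

```php
<?php
$subject = '<div class="introduction"><p>One</p></div><p>Two</p>';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXPath($doc);

// Serialize every matched node on its own and concatenate the pieces.
$html = implode('', array_map(
    function (DOMNode $node) use ($doc) {
        return $doc->saveHTML($node);
    },
    iterator_to_array($xpath->query('//p'))
));

echo $html, "\n";
```

This keeps the nodes in document order and avoids importNode() altogether.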
