How to select text from HTML table using PHP DOM query? - php

How can I get the text from HTML table cells using a PHP DOM query?
The HTML table is:
<table>
<tr>
<th>Job Location:</th>
<td>Kabul
</td>
</tr>
<tr>
<th>Nationality:</th>
<td>Afghan</td>
</tr>
<tr>
<th>Category:</th>
<td>Program</td>
</tr>
</table>
I have the following query, but it doesn't work:
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($html);
$xmlPageXPath = new DOMXPath($xmlPageDom);
$value = $xmlPageXPath->query('//table td /text()');

get a complete table with php domdocument and print it
The answer is like this:
$html = "<table ID='myid'><tr><td>1</td><td>2</td></tr><tr><td>4</td><td>5</td></tr><tr><td>7</td><td>8</td></tr></table>";
$xml = new DOMDocument();
$xml->validateOnParse = true;
$xml->loadHTML($html);
$xpath = new DOMXPath($xml);
$table = $xpath->query("//*[@id='myid']")->item(0);
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) {
print $cell->nodeValue;
}
}
EDIT: Use this instead
$table = $xpath->query("//table")->item(0);
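For the table in the question, which pairs a <th> label with a <td> value in each row, the same idea can be sketched with XPath evaluated relative to each row. This is only an illustration: the $fields array and the trimming of the trailing colon are assumptions, not part of the original answer.

```php
<?php
// Sample markup copied from the question.
$html = '<table>
<tr><th>Job Location:</th><td>Kabul
</td></tr>
<tr><th>Nationality:</th><td>Afghan</td></tr>
<tr><th>Category:</th><td>Program</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$fields = []; // hypothetical result: label => value
foreach ($xpath->query('//table//tr') as $row) {
    // string(...) yields the text of the first matching node, or '' if there is none
    $label = trim($xpath->evaluate('string(th)', $row));
    $value = trim($xpath->evaluate('string(td)', $row));
    if ($label !== '') {
        $fields[rtrim($label, ':')] = $value;
    }
}
print_r($fields);
```

Passing the row as the second argument makes th and td resolve against that row only, so each label stays paired with its own value.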

Related

Datascraping With PHP

I am trying to take advantage of DOMDocument to scrape a table from another website. I am on shared hosting.
Here is what the html looks like:
<tbody>
<tr class="odd">
<td class="nightclub">Elleven</td>
<td class="city">Downtown Miami</td>
</tr>
<tr class="even">
<td class="night club">Story</td>
<td class="city">South Beach</td>
</tr>
</tbody>
I tried doing:
<?php
$domDoc = new \DOMDocument();
$url = "http://example.com/";
$html = file_get_contents($url);
$domDoc->loadHtml($html);
$domDoc->preserveWhiteSpace = false;
$tables = $domDoc->getElementsByTagName('tbody');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$columns = $row->getElementsByTagName('td');
print $columns->item(0)->nodeValue."/n";
print $columns->item(1)->nodeValue."/n";
print $columns->item(2)->nodeValue;
}
When I do this I get no result. I think the server is blocking my request.
Try Simple HTML DOM:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all tr
foreach($html->find('tr') as $element)
echo $element->innertext . '<br>';
It's a good library for parsing HTML.
What I did was use an open-source PHP package called Guzzle. It even lets you crawl the site you are fetching.
If you are on shared hosting, download Guzzle and upload it to your server.
github.com/guzzle/guzzle/releases
<?php
require 'vendor/autoload.php';
$client = new GuzzleHttp\Client();
$domDoc = new DOMDocument();
$url = 'http://example.com';
$res = $client->request('GET', $url, [
'auth' => ['user', 'pass']
]);
$html = (string)$res->getBody();
// The @ in front of $domDoc suppresses any parser warnings
$domHtml = @$domDoc->loadHTML($html);
//discard white space
$domDoc->preserveWhiteSpace = false;
//the table by its tag name
$tables = $domDoc->getElementsByTagName('tbody');
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $row)
{
// get each column by tag name
$columns = $row->getElementsByTagName('td');
// echo the values
echo $columns->item(0)->nodeValue.'<br />';
echo $columns->item(1)->nodeValue.'<br />';
echo $columns->item(2)->nodeValue;
}
?>
If you don't mind using a library, this is the simplest solution. Use Simple HTML DOM as shown below:
$html = file_get_html("WWW.YOURDOMAIN.COM");
$data = array();
foreach($html->find("table tr") as $tr){
$row = array();
foreach($tr->find("td") as $td){
$row[] = $td->plaintext;
}
$data[] = $row;
}
Your code is fine; just remove the \ in front of DOMDocument:
$domDoc = new \DOMDocument();
Try
$domDoc = new DOMDocument();
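Independently of how the page is fetched, note that the sample rows contain only two <td> cells, so item(2) returns null in the loops above, and "/n" is not a newline escape ("\n" is). A minimal sketch that sidesteps both by looping over whatever cells exist follows. The helper name extractRows and the inline sample markup are assumptions for illustration; with a real URL, it would also be worth checking that file_get_contents() did not return false before parsing.

```php
<?php
// Hypothetical helper: returns one array of cell texts per <tr>.
function extractRows(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->nodeValue);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Inline sample standing in for the downloaded page.
$html = '<table><tbody>
<tr class="odd"><td class="nightclub">Elleven</td><td class="city">Downtown Miami</td></tr>
<tr class="even"><td class="night club">Story</td><td class="city">South Beach</td></tr>
</tbody></table>';

foreach (extractRows($html) as $cells) {
    echo implode(' / ', $cells), "\n";
}
```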

Unable to get both child elements with xpath from xhtml using xquery in php to manipulate

This is the XHTML data I need to get the child nodes from. I don't need the children of the TH elements.
<table>some data</table>
<table>
<tr>
<td class="c2">PCI Signal Error (SERR#) Enable</td>
<td>Yes</td>
</tr>
<tr>
<td class="c1">Controller Type 1</td>
<td>CISS</td>
</tr>
<tr>
<td class="c2">bus type</td>
<td>CISS</td>
</tr>
<tr>
<th><a name="systempcibus5">PCI Bus 31</a></th>
<td>Device</td>
</tr>
</table>
Below is my latest attempt. I only want to get the textContent of the TDs in the above XML
so that I can build a MySQL statement to insert the data into MySQL.
I have tried many variations over the last week.
I won't bore you with everything I tried, but I believe this is the closest to what I want. I get this error:
PHP Notice: Trying to get property of non-object in C:\inetpub\wwwroot\reports\gec\test1.php on line 40
<?php
libxml_use_internal_errors(true);
$dom = new DomDocument;
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$nodes = $xpath->query('/html/body/table[2]/tr');
//$nodes = $xpath->query("//tr[contains(concat(' ', @class, ' '), ' head ') ");
//header("Content-type: text/plain");
$node_count=$nodes->length ;
for( $i = 1; $i <= intval($node_count); $i++)
{
$node_td1 = $xpath->query('/html/body/table[2]/tr[$i]/td[1]');
$node_td2 = $xpath->query('/html/body/table[2]/tr[$i]/td[2]');
$result1=$node_td1->textContent;
$result2=$node_td2->textContent;
echo $result1 . "," . $result2 . "<br>";
}
Alternatively, you could select the rows themselves, then filter their child nodes using ->tagName:
$dom = new DomDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DomXPath($dom);
$rows = $xpath->query('/html/body/table[2]/tr');
foreach ($rows as $row) {
foreach($row->childNodes as $col) {
if(isset($col->tagName) && $col->tagName != 'th') {
echo $col->textContent . '<br/>';
}
}
echo '<hr/>';
}
Or use XPath relative to each row:
foreach ($rows as $row) {
$col1 = $xpath->evaluate('string(./td[1])', $row);
$col2 = $xpath->evaluate('string(./td[2])', $row);
echo $col1 . '<br/>';
echo $col2 . '<br/>';
echo '<hr/>';
}
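As for why the original loop fails: in a single-quoted PHP string $i is not interpolated, so XPath receives a literal $i, which it treats as an undefined XPath variable. The query then most likely fails and returns false, which would explain the "non-object" notice. In addition, query() returns a DOMNodeList, so a node has to be picked with ->item(0) before reading textContent. A minimal sketch of the corrected loop, assuming a trimmed copy of the question's markup and a hypothetical $lines array:

```php
<?php
// Trimmed copy of the question's markup (first table made well-formed).
$html = '<table><tr><td>some data</td></tr></table>
<table>
<tr><td class="c2">PCI Signal Error (SERR#) Enable</td><td>Yes</td></tr>
<tr><td class="c1">Controller Type 1</td><td>CISS</td></tr>
<tr><th><a name="systempcibus5">PCI Bus 31</a></th><td>Device</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$count = $xpath->query('/html/body/table[2]/tr')->length;
$lines = [];
for ($i = 1; $i <= $count; $i++) {
    // Double quotes, so PHP substitutes $i before XPath sees the expression.
    $td1 = $xpath->query("/html/body/table[2]/tr[$i]/td[1]")->item(0);
    $td2 = $xpath->query("/html/body/table[2]/tr[$i]/td[2]")->item(0);
    // item(0) is null when the row has no such cell, e.g. a row with a <th>.
    $lines[] = ($td1 ? $td1->textContent : '') . ',' . ($td2 ? $td2->textContent : '');
}
echo implode("\n", $lines), "\n";
```

The relative queries shown earlier are usually the simpler choice, since they avoid rebuilding the absolute path for every row.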

Parse html table data from DOMXPath

I am scraping data from an external html table that is 100 rows by 3 columns. I want to parse the data into a 10x10 table where the data from each row is combined. Ex:
<tr>
<td>info1</td>
<td>info2</td>
<td>info3</td>
</tr>
<tr>
<td>info4</td>
<td>info5</td>
<td>info6</td>
</tr>
<tr>
<td>info7</td>
<td>info8</td>
<td>info9</td>
</tr>
...and so on
into
<tr>
<td>info1<br/>info2<br/>info3</td>
<td>info4<br/>info5<br/>info6</td>
<td>info7<br/>info8<br/>info9</td>
...7 more times
</tr>
...9 more times
I can output the data into a single column by using line breaks, but I have no idea how to do what I describe above. I also want to be able to style the data using CSS. Any help or direction is appreciated. Here is my code:
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$xpath = new DOMXPath($doc);
$table = $xpath->query('//table[@id="idTable"]')->item(0);
$rows = $table->getElementsByTagName("tr");
foreach($rows as $row)
{
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) print $cell->nodeValue . "<br/>";
}
Two (similar) ways you can do this:
1) Count the <tr>s and combine every 10 of them into one row, regardless of how many <td>s each contains:
$doc=new DOMDocument();
$doc->loadHTML($html);
$xpath=new DOMXPath($doc);
echo "<table>\n";
/* 10 is the row count */
for($i=0;$i<10;$i++)
{
echo "<tr>\n";
/* 10 is the column count */
foreach($xpath->query('//table[@id="myTable"]/tr[position()>'.($i*10).' and position()<'.(($i+1)*10+1).']') as $tr)
{
echo "\t<td>";// "\t" to make it look nice
$tds=array();
foreach($tr->childNodes as $td)
{
if($td->nodeName!="td") continue;
$tds[]=$td->firstChild->nodeValue;
}
echo implode("<br />",$tds);
echo "</td>\n";
}
echo "</tr>\n";
}
echo "</table>";
2) Count the <td>s, combine every 3 of them into a new <td> and every 30 of them into a new <tr>, ignoring the original <tr>s:
$doc=new DOMDocument();
$doc->loadHTML($html);
$xpath=new DOMXPath($doc);
echo "<table>\n";
$i=0;
$tds=array();
foreach($xpath->query('//table[@id="myTable"]/tr/td/text()') as $td)
{
/* 30 is each row's old-cell-count */
if($i%30==0) echo "<tr>\n";
$tds[]=$td->nodeValue;
/* 3 is each cell's old-cell-count */
if($i%3==2)
{
echo "\t<td>".implode("<br />",$tds)."</td>\n";
$tds=array();
}
if($i%30==29) echo "</tr>\n";
$i++;
}
echo "</table>";
Both outputs:
<table>
<tr>
<td>info0.1<br />info0.2<br />info0.3</td>
<td>info1.1<br />info1.2<br />info1.3</td>
<td>info2.1<br />info2.2<br />info2.3</td>
<td>info3.1<br />info3.2<br />info3.3</td>
<td>info4.1<br />info4.2<br />info4.3</td>
<td>info5.1<br />info5.2<br />info5.3</td>
<td>info6.1<br />info6.2<br />info6.3</td>
<td>info7.1<br />info7.2<br />info7.3</td>
<td>info8.1<br />info8.2<br />info8.3</td>
<td>info9.1<br />info9.2<br />info9.3</td>
</tr>
<tr>
<td>info10.1<br />info10.2<br />info10.3</td>
<td>info11.1<br />info11.2<br />info11.3</td>
<!-- ... -->
<td>info97.1<br />info97.2<br />info97.3</td>
<td>info98.1<br />info98.2<br />info98.3</td>
<td>info99.1<br />info99.2<br />info99.3</td>
</tr>
</table>
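A third variation, not taken from the answer above, is to collect the rows first and regroup them with array_chunk(). The sketch below generates its own 100-row sample table; the class name "grid" is a hypothetical hook for CSS styling.

```php
<?php
// Build a stand-in source table: 100 rows of 3 cells (hypothetical data).
$html = '<table id="myTable">';
for ($r = 0; $r < 100; $r++) {
    $html .= "<tr><td>info{$r}.1</td><td>info{$r}.2</td><td>info{$r}.3</td></tr>";
}
$html .= '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// One string per source row: its cell texts joined by <br />.
$cells = [];
foreach ($xpath->query('//table[@id="myTable"]/tr') as $tr) {
    $texts = [];
    foreach ($xpath->query('td', $tr) as $td) {
        $texts[] = htmlspecialchars(trim($td->textContent));
    }
    $cells[] = implode('<br />', $texts);
}

// Regroup: every 10 source rows become one output row.
$out = "<table class=\"grid\">\n";
foreach (array_chunk($cells, 10) as $group) {
    $out .= "<tr>\n";
    foreach ($group as $cell) {
        $out .= "\t<td>$cell</td>\n";
    }
    $out .= "</tr>\n";
}
$out .= "</table>";
echo $out;
```

Separating extraction from output keeps the grouping size in one place (the second argument of array_chunk) and leaves the markup free to carry classes for styling.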

Preserving <br> tags when parsing HTML text content

I have a little issue.
I want to parse a simple HTML Document in PHP.
Here is the simple HTML :
<html>
<body>
<table>
<tr>
<td>Colombo <br> Coucou</td>
<td>30</td>
<td>Sunny</td>
</tr>
<tr>
<td>Hambantota</td>
<td>33</td>
<td>Sunny</td>
</tr>
</table>
</body>
</html>
And this is my PHP code :
$dom = new DOMDocument();
$html = $dom->loadHTMLFile("test.html");
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
echo $cols->item(0)->nodeValue.'<br />';
echo $cols->item(1)->nodeValue.'<br />';
echo $cols->item(2)->nodeValue;
}
But as you can see, the first cell contains a <br> tag that I need, and when my PHP code runs, the tag is removed.
Can anybody explain to me how I can keep it?
I would recommend you to capture the values of the table cells with help of XPath:
$values = array();
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $row) {
$row_values = array();
foreach($xpath->query('td', $row) as $cell) {
$row_values[] = innerHTML($cell);
}
$values[] = $row_values;
}
Also, I've had the same problem as you with <br> tags being stripped out of fetched content for the reason that they themselves are considered empty nodes; unfortunately they're not automatically replaced with a newline character (\n);
So what I've done is designed my own innerHTML function that has proved invaluable in many projects. Here I share it with you:
function innerHTML(DOMElement $element, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($element->childNodes as $node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
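A shorter variant of the same idea, offered only as a sketch, relies on DOMDocument::saveHTML() accepting a node argument (available since PHP 5.3.6), which avoids the temporary document. The sample markup is the one from the question; the $values layout is an assumption.

```php
<?php
$html = '<html><body><table>
<tr><td>Colombo <br> Coucou</td><td>30</td><td>Sunny</td></tr>
<tr><td>Hambantota</td><td>33</td><td>Sunny</td></tr>
</table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$values = [];
foreach ($dom->getElementsByTagName('tr') as $row) {
    $rowValues = [];
    foreach ($row->getElementsByTagName('td') as $cell) {
        $inner = '';
        foreach ($cell->childNodes as $child) {
            // saveHTML($node) serializes just that node, tags included,
            // so a <br> child survives instead of being dropped.
            $inner .= $dom->saveHTML($child);
        }
        $rowValues[] = trim($inner);
    }
    $values[] = $rowValues;
}
print_r($values);
```

Unlike nodeValue, which returns only the concatenated text, this keeps any markup nested inside the cell.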

Getting element length with php DOM

I'm stuck with this.
I am trying to use PHP DOM to parse some HTML code.
How can I find out how many children the current element has while I iterate through the rows in a for loop?
<?php
$str='
<table id="tableId">
<tr>
<td>row1 cell1</td>
<td>row1 cell2</td>
</tr>
<tr>
<td>row2 cell1</td>
<td>row2 cell2</td>
</tr>
</table>
';
$DOM = new DOMDocument;
$DOM->loadHTML($str); // loading page contents
$table = $DOM->getElementById('tableId'); // getting the table that I need
$DOM->loadHTML($table);
$tr = $DOM->getElementsByTagName('tr'); // getting rows
echo $tr->item(0)->nodeValue; // outputs row1 cell1 row1 cell2 - exactly as I expect with both rows
echo "<br>";
echo $tr->item(1)->nodeValue; // outputs row2 cell1 row2 cell2
// now I need to iterate through each row to build an array with cells that it has
for ($i = 0; $i < $tr->length; $i++)
{
echo $tr->item($i)->length; // outputs no value. But how can I get it?
echo $i."<br />";
}
?>
This will give you all childnodes:
$tr->item($i)->childNodes->length;
... but: it will contain DOMText nodes with whitespace etc (so the count is 4). If you don't necessarily need the length, just want to iterate over all the nodes, you can do this:
foreach($tr->item($i)->childNodes as $node){
if($node instanceof DOMElement){
var_dump($node->ownerDocument->saveXML($node));
}
}
If you need only a length of elements, you can do this:
$x = new DOMXPath($DOM);
var_dump($x->evaluate('count(*)',$tr->item($i)));
And you can do this:
foreach($x->query('*',$tr->item($i)) as $child){
var_dump($child->nodeValue);
}
foreach-ing through ->childNodes is my preference for simple 'array-building'. Keep in mind that you can just foreach through DOMNodeLists as if they were arrays, which saves a lot of hassle.
Building a simple array from a table:
$DOM = new DOMDocument;
$DOM->loadHTML($str); // loading page contents
$table = $DOM->getElementById('tableId');
$result = array();
foreach($table->childNodes as $row){
if(!($row instanceof DOMElement) || strtolower($row->tagName) != 'tr') continue; // skip whitespace text nodes
$rowdata = array();
foreach($row->childNodes as $cell){
if(!($cell instanceof DOMElement) || strtolower($cell->tagName) != 'td') continue; // skip whitespace text nodes
$rowdata[] = $cell->textContent;
}
$result[] = $rowdata;
}
var_dump($result);
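If only the number of cells per row is needed, one more option is to read ->length from getElementsByTagName('td') on each row. This is a sketch with a hypothetical $counts array; note that it would also count <td> elements of nested tables, if there were any.

```php
<?php
$str = '<table id="tableId">
<tr><td>row1 cell1</td><td>row1 cell2</td></tr>
<tr><td>row2 cell1</td><td>row2 cell2</td></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($str);
$table = $dom->getElementById('tableId');

$counts = []; // row index => number of <td> cells
foreach ($table->getElementsByTagName('tr') as $i => $tr) {
    // length counts matching <td> elements only, never whitespace text nodes
    $counts[$i] = $tr->getElementsByTagName('td')->length;
}
print_r($counts);
```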
