Getting element length with php DOM - php

I'm stuck with this.
I'm trying to use PHP DOM to parse some HTML.
How can I find out how many children the current element has while I iterate over the rows in a for loop?
<?php
$str='
<table id="tableId">
<tr>
<td>row1 cell1</td>
<td>row1 cell2</td>
</tr>
<tr>
<td>row2 cell1</td>
<td>row2 cell2</td>
</tr>
</table>
';
$DOM = new DOMDocument;
$DOM->loadHTML($str); // loading page contents
$table = $DOM->getElementById('tableId'); // getting the table that I need
$DOM->loadHTML($table);
$tr = $DOM->getElementsByTagName('tr'); // getting rows
echo $tr->item(0)->nodeValue; // outputs row1 cell1 row1 cell2 - exactly as I expect with both rows
echo "<br>";
echo $tr->item(1)->nodeValue; // outputs row2 cell1 row2 cell2
// now I need to iterate through each row to build an array with cells that it has
for ($i = 0; $i < $tr->length; $i++)
{
echo $tr->item($i)->length; // outputs no value. But how can I get it?
echo $i."<br />";
}
?>

This will give you all child nodes:
$tr->item($i)->childNodes->length;
... but it will also contain DOMText nodes with whitespace etc. (so the count is 4). If you don't need the length itself and just want to iterate over the element nodes, you can do this:
foreach($tr->item($i)->childNodes as $node){
if($node instanceof DOMElement){
var_dump($node->ownerDocument->saveXML($node));
}
}
If you only need the number of child elements, you can do this:
$x = new DOMXPath($DOM);
var_dump($x->evaluate('count(*)',$tr->item($i)));
You can also iterate over just the child elements with XPath:
foreach($x->query('*',$tr->item($i)) as $child){
var_dump($child->nodeValue);
}
Iterating over ->childNodes with foreach is my preference for simple array building. Keep in mind that you can foreach over a DOMNodeList as if it were an array, which saves a lot of hassle.
Building a simple array from a table:
$DOM = new DOMDocument;
$DOM->loadHTML($str); // loading page contents
$table = $DOM->getElementById('tableId');
$result = array();
foreach($table->childNodes as $row){
if(strtolower($row->nodeName) != 'tr') continue; // nodeName exists on text nodes too; tagName does not
$rowdata = array();
foreach($row->childNodes as $cell){
if(strtolower($cell->nodeName) != 'td') continue;
$rowdata[] = $cell->textContent;
}
$result[] = $rowdata;
}
var_dump($result);
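If only the number of child elements matters, a small helper can do the filtering. The following is a minimal sketch rather than part of the original answer: countElementChildren is a hypothetical name, and the inline HTML is a compact stand-in for the table in the question.

```php
<?php
// Hypothetical helper: count only element children, skipping text nodes.
function countElementChildren(DOMNode $node): int
{
    $count = 0;
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $count++;
        }
    }
    return $count;
}

// Compact stand-in for the question's table (assumption: same structure).
$html = '<table id="tableId">'
      . '<tr><td>row1 cell1</td><td>row1 cell2</td></tr>'
      . '<tr><td>row2 cell1</td><td>row2 cell2</td></tr>'
      . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$counts = array();
foreach ($dom->getElementsByTagName('tr') as $row) {
    $counts[] = countElementChildren($row);
}
print_r($counts);
```

Because the helper checks nodeType, whitespace text nodes between the cells never affect the count.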

Related

PHP - Extract a cell value of a table with a match expression

I want to extract the value of a specific cell from a table in a web page. First I search for a string (here a player's name), and then I want to get the value of the associated <td> cell (here 94).
I can connect to the web page, find the table by its id and get all the values. I can also search for a specific string with preg_match, but I can't extract the value of the <td> cell.
What is the best way to extract a value from a table with a match expression?
Here is my script:
<?php
// Connect to the web page
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
// Extract the table by its id
$table = $xpath->query("//*[@id='nba']")->item(0);
// See result in HTML
//$tableResult = $doc->saveHTML($table);
//print $tableResult;
// Get elements by tags and build a string
$str = "";
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) {
$str .= $cell->nodeValue;
}
}
// Search a specific string (here a player's name)
$player = preg_match('/LeBron James(.*)/', $str, $matches);
// Get the value
$playerValue = intval(array_pop($matches));
print $playerValue;
?>
Here is the HTML structure of the table:
<table id="nba">
<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>
...
<tr>
<td>5.</td>
<td><strong>LeBron James</strong></td>
<td>94</td>
</tr>
...
</table>
DOM manipulation solution.
Search all the cells and stop as soon as a cell contains the value LeBron James.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTMLFile('https://www.basketball-reference.com/leaders/trp_dbl_career.html');
$xpath = new DOMXPath($doc);
$table = $xpath->query("//*[@id='nba']")->item(0);
$str = "";
$rows = $table->getElementsByTagName("tr");
$trpDbl = null;
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
foreach ($cells as $cell) {
if (preg_match('/LeBron James/', $cell->nodeValue, $matches)) {
$trpDbl = $cell->nextSibling->nodeValue;
break 2; // leave both loops once the value is found
}
}
}
print($trpDbl);
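Regex matching the whole cell that contains the name LeBron James:
$player = preg_match('/<td>(.*LeBron James.*)<\/td>/', $str, $matches);
If you also want to capture the value 94 from the next cell, you can use this expression:
$player = preg_match('/<td>(.*LeBron James.*)<\/td>\s*<td>(.*)<\/td>/', $str, $matches);
It returns two groups: the first holds the cell with the player's name and the second holds the value. Note that both patterns need $str to contain the table's HTML (for example the output of $doc->saveHTML($table)), not the concatenated node values, which contain no <td> tags.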
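The lookup can also be done in a single XPath expression instead of nested loops or a regex. This is a sketch under stated assumptions: the inline HTML mirrors the structure shown in the question rather than the live page, $player is a hypothetical variable, and the name is assumed to contain no single quote (it is interpolated into the XPath string).

```php
<?php
// Sample markup mirroring the question's table structure (assumption).
$html = '<table id="nba">'
      . '<thead><tr><th>Rank</th><th>Player</th><th>Trp Dbl</th></tr></thead>'
      . '<tr><td>5.</td><td><strong>LeBron James</strong></td><td>94</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$player = 'LeBron James'; // must not contain a single quote

// Find the cell containing the name, then take the next <td> sibling.
$query = sprintf(
    "string(//table[@id='nba']//td[contains(., '%s')]/following-sibling::td[1])",
    $player
);
$value = $xpath->evaluate($query);   // string result because of string()
$tripleDoubles = (int) $value;

print $tripleDoubles;
```

Using following-sibling::td[1] skips any whitespace text nodes between the cells, which nextSibling does not.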

Parse html table data from DOMXPath

I am scraping data from an external html table that is 100 rows by 3 columns. I want to parse the data into a 10x10 table where the data from each row is combined. Ex:
<tr>
<td>info1</td>
<td>info2</td>
<td>info3</td>
</tr>
<tr>
<td>info4</td>
<td>info5</td>
<td>info6</td>
</tr>
<tr>
<td>info7</td>
<td>info8</td>
<td>info9</td>
</tr>
...and so on
into
<tr>
<td>info1<br/>info2<br/>info3</td>
<td>info4<br/>info5<br/>info6</td>
<td>info7<br/>info8<br/>info9</td>
...7 more times
</tr>
...9 more times
I can output the data into a single column by using line breaks, but I have absolutely no idea how to do what I describe above. I also want to be able to style the data using CSS. Any help or direction is appreciated. Here is my code:
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$xpath = new DOMXPath($doc);
$table = $xpath->query('//table[@id="idTable"]')->item(0);
$rows = $table->getElementsByTagName("tr");
foreach($rows as $row)
{
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) print $cell->nodeValue . "<br/>";
}
Two (similar) ways you can do this:
1) By counting the <tr>s and combining every 10 of them, regardless of how many <td>s each one contains:
$doc=new DOMDocument();
$doc->loadHTML($html);
$xpath=new DOMXPath($doc);
echo "<table>\n";
/* 10 is the row count */
for($i=0;$i<10;$i++)
{
echo "<tr>\n";
/* 10 is the column count */
foreach($xpath->query('//table[@id="myTable"]/tr[position()>'.($i*10).' and position()<'.(($i+1)*10+1).']') as $tr)
{
echo "\t<td>";// "\t" to make it look nice
$tds=array();
foreach($tr->childNodes as $td)
{
if($td->nodeName!="td") continue;
$tds[]=$td->firstChild->nodeValue;
}
echo implode("<br />",$tds);
echo "</td>\n";
}
echo "</tr>\n";
}
echo "</table>";
2) By counting the <td>s: combine every 3 of them into a new <td> and every 30 of them into a new <tr>, disregarding the original <tr>s:
$doc=new DOMDocument();
$doc->loadHTML($html);
$xpath=new DOMXPath($doc);
echo "<table>\n";
$i=0;
$tds=array();
foreach($xpath->query('//table[@id="myTable"]/tr/td/text()') as $td)
{
/* 30 is each row's old-cell-count */
if($i%30==0) echo "<tr>\n";
$tds[]=$td->nodeValue;
/* 3 is each cell's old-cell-count */
if($i%3==2)
{
echo "\t<td>".implode("<br />",$tds)."</td>\n";
$tds=array();
}
if($i%30==29) echo "</tr>\n";
$i++;
}
echo "</table>";
Both produce this output:
<table>
<tr>
<td>info0.1<br />info0.2<br />info0.3</td>
<td>info1.1<br />info1.2<br />info1.3</td>
<td>info2.1<br />info2.2<br />info2.3</td>
<td>info3.1<br />info3.2<br />info3.3</td>
<td>info4.1<br />info4.2<br />info4.3</td>
<td>info5.1<br />info5.2<br />info5.3</td>
<td>info6.1<br />info6.2<br />info6.3</td>
<td>info7.1<br />info7.2<br />info7.3</td>
<td>info8.1<br />info8.2<br />info8.3</td>
<td>info9.1<br />info9.2<br />info9.3</td>
</tr>
<tr>
<td>info10.1<br />info10.2<br />info10.3</td>
<td>info11.1<br />info11.2<br />info11.3</td>
<!-- ... -->
<td>info97.1<br />info97.2<br />info97.3</td>
<td>info98.1<br />info98.2<br />info98.3</td>
<td>info99.1<br />info99.2<br />info99.3</td>
</tr>
</table>
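A third variant, swapping in array_chunk for the position arithmetic: first collapse every source row into one cell string, then split the list of cells into groups of 10. This is a sketch, not taken from the answer above. The generated 100x3 table and the id myTable are stand-ins for the scraped HTML.

```php
<?php
// Build sample input: 100 rows x 3 columns (stand-in for the scraped table).
$html = '<table id="myTable">';
for ($r = 0; $r < 100; $r++) {
    $html .= "<tr><td>info{$r}.1</td><td>info{$r}.2</td><td>info{$r}.3</td></tr>";
}
$html .= '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Step 1: one string per source row, cells joined with <br />.
$cells = array();
foreach ($xpath->query('//table[@id="myTable"]//tr') as $tr) {
    $parts = array();
    foreach ($xpath->query('td', $tr) as $td) {
        $parts[] = htmlspecialchars(trim($td->textContent));
    }
    $cells[] = implode('<br />', $parts);
}

// Step 2: every 10 combined cells become one output row.
$out = "<table>\n";
foreach (array_chunk($cells, 10) as $chunk) {
    $out .= "<tr>\n";
    foreach ($chunk as $cell) {
        $out .= "\t<td>" . $cell . "</td>\n";
    }
    $out .= "</tr>\n";
}
$out .= "</table>";

echo $out;
```

Since the output is plain <table>, <tr> and <td> markup, a class attribute can be added to any of those tags for CSS styling.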

PHP DOM nodeValue doesn't work

I am trying to parse an HTML table with DOM and it works fine, but when a cell contains HTML it doesn't work properly.
Here is the sample HTML table:
<tr>
<td>Razon Social: </td>
<td>Circulo Inmobiliaria Sur (Casa Central)</td>
</tr>
<tr>
<td>Email: </td>
<td> <img src="generateImage.php?email=myemail@domain.com"/> </td>
</tr>
And PHP Code:
$rows = $dom->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cells = $row->getElementsByTagName('td');
if(strpos($cells->item(0)->textContent, "Razon") > 0)
{
$_razonSocial = $cells->item(1)->textContent;
}
else if(strpos($cells->item(0)->textContent, "Email") > 0)
{
$_email = $cells->item(1)->textContent;
}
}
echo "Razon Social: $_razonSocial<br>Email: $_email";
OUTPUT:
Razon Social: Circulo Inmobiliaria Sur (Casa Central)
Email:
Email is empty, but it should be:
<img src="generateImage.php?email=myemail@domain.com"/>
I have even tried
$cells->item(1)->nodeValue;
instead of
$cells->item(1)->textContent;
But that doesn't work either. How can I make it return the HTML value?
Give your table the id item_specification:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$rows = $x->query("//*[@id='item_specification']/tr");
foreach ($rows as $row) {
$atr_name = $row->getElementsByTagName('td')->item(0)->nodeValue;
$atr_val = $row->getElementsByTagName('td')->item(1)->nodeValue;
echo " {$atr_name} - {$atr_val} <br />"; // print every row, not just the last one
}
It's working fine.
As I already mentioned, <img src="generateImage.php?email=myemail@domain.com"/> is not text. It's a child element of the cell, so the cell's textContent holds no visible text. Try this instead:
if(strpos($cells->item(0)->textContent, "Razon") !== false) {
$_razonSocial = $cells->item(1)->textContent;
} else if(strpos($cells->item(0)->textContent, "Email") !== false) {
$count = 0;
// here we get all child nodes of td.
// space before img-tag is also a child node, but it has type DOMText
// so we skip it.
foreach ($cells->item(1)->childNodes as $child) {
if (++$count == 2)
$_email = $child->getAttribute('src');
}
// now in $_email you have full src value and can somehow extract email
}
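To go one step further, the sketch below pulls the address out of the src attribute and also shows how to get the cell's markup back via saveHTML. It is an illustration under assumptions: the one-row table is a stand-in for the real page, the src value is taken to be generateImage.php?email=myemail@domain.com, and the variable names are hypothetical.

```php
<?php
// One-row stand-in for the question's table (assumption).
$html = '<table><tr><td>Email: </td>'
      . '<td> <img src="generateImage.php?email=myemail@domain.com"/> </td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$cells = $dom->getElementsByTagName('td');

// Look the <img> up by tag name instead of counting child nodes,
// so surrounding whitespace text nodes do not matter.
$img = $cells->item(1)->getElementsByTagName('img')->item(0);

$email = null;
$imgHtml = '';
if ($img !== null) {
    $src = $img->getAttribute('src');

    // Split off the query string and parse it into key/value pairs.
    $pieces = explode('?', $src, 2);
    $params = array();
    parse_str(isset($pieces[1]) ? $pieces[1] : '', $params);
    $email = isset($params['email']) ? $params['email'] : null;

    // saveHTML() with a node argument serializes just that node.
    $imgHtml = $dom->saveHTML($img);
}

echo $email, "\n";
echo $imgHtml, "\n";
```

Passing a node to saveHTML is what answers the original question of getting the HTML value rather than the text value.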

Preserving <br> tags when parsing HTML text content

I have a small problem.
I want to parse a simple HTML document in PHP.
Here is the HTML:
<html>
<body>
<table>
<tr>
<td>Colombo <br> Coucou</td>
<td>30</td>
<td>Sunny</td>
</tr>
<tr>
<td>Hambantota</td>
<td>33</td>
<td>Sunny</td>
</tr>
</table>
</body>
</html>
And this is my PHP code :
$dom = new DOMDocument();
$html = $dom->loadHTMLFile("test.html");
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
echo $cols->item(0)->nodeValue.'<br />';
echo $cols->item(1)->nodeValue.'<br />';
echo $cols->item(2)->nodeValue;
}
But as you can see, the first cell contains a <br> tag that I need to keep, and my PHP code removes it.
Can anybody explain to me how I can keep it?
I would recommend capturing the values of the table cells with the help of XPath:
$values = array();
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $row) {
$row_values = array();
foreach($xpath->query('td', $row) as $cell) {
$row_values[] = innerHTML($cell);
}
$values[] = $row_values;
}
I've had the same problem with <br> tags being stripped from fetched content: they are considered empty nodes, and unfortunately they are not automatically replaced with a newline character (\n).
So I wrote my own innerHTML function, which has proved invaluable in many projects. Here it is:
function innerHTML(DOMElement $element, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($element->childNodes as $node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
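A shorter variant of the same idea, offered as a sketch rather than a replacement: DOMDocument::saveHTML accepts a node argument, so each child can be serialized through its owner document without a temporary DOMDocument. The name innerHtmlOf and the inline sample table are assumptions for illustration.

```php
<?php
// Hypothetical helper: serialize the children of an element as HTML.
function innerHtmlOf(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) dumps just that node, tags included.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

// Compact stand-in for the first row of test.html (assumption).
$source = '<table><tr><td>Colombo <br> Coucou</td><td>30</td><td>Sunny</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($source);

$values = array();
foreach ($dom->getElementsByTagName('td') as $cell) {
    $values[] = innerHtmlOf($cell);
}

print_r($values);
```

The first entry keeps its <br> tag, while plain cells such as 30 and Sunny come back as ordinary text.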

convert a nodevalue into a string

I'm working with DOM and HTML. I want to convert a node value to a string:
$html = @$dom->loadHTMLFile('url');
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('body');
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $text =>$row)
{
$t=1;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
//getting values
$rr = @$cols->item(0)->nodeValue;
print $rr; // prints the values of all 'td' tags fine
}
print $rr; // prints nothing, but this is where I want it to print
?>
I want the node values converted into strings for further manipulation.
Every time you go through the foreach loop you overwrite the value of the $rr variable. The second print $rr will print the value from the last row: if that cell is empty, it will print nothing.
If what you are trying to do is print all the values, write them to an array instead:
$rr = array();
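foreach($rows as $row) {
$cols = $row->getElementsByTagName('td'); // fetch this row's cells
if ($cols->length === 0) continue; // skip rows without <td> cells
$rr[] = $cols->item(0)->nodeValue;
}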
print_r($rr);
// new dom object
$dom = new DOMDocument();
//load the html
$html = @$dom->loadHTMLFile('http://webapp-da1-01.corp.adobe.com:8300/cfusion/bootstrap/');
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('head');
//get all rows from the table
$la=array();
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
$array = array();
foreach ($rows as $text =>$row)
{
$t=1;
$tt=$text;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
// echo the values
// echo @$cols->item(0)->nodeValue.'';
// echo @$cols->item(1)->nodeValue.'';
$array[$row] = @$cols->item($t)->nodeValue;
}
print_r ($array);
It prints Array
(
)
and nothing more. I also tried "$cols->item(0)->nodeValue;".
Use DOMDocument::saveXML or DOMDocument::saveHTML to convert a node to a string.
Did you try @$cols->item(0)->textContent?
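Putting the suggestions in this thread together, the values can be collected into an array and then joined into one string for further manipulation. This is a minimal sketch: the two-row table, the variable names and the ', ' separator are assumptions for illustration.

```php
<?php
// Two-row stand-in for the page being scraped (assumption).
$html = '<table>'
      . '<tr><td>a</td><td>1</td></tr>'
      . '<tr><td>b</td><td>2</td></tr>'
      . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$values = array();
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length === 0) {
        continue; // header rows or empty rows have no <td>
    }
    // nodeValue is already a plain PHP string.
    $values[] = trim($cols->item(0)->nodeValue);
}

// Still available after the loop, unlike a variable overwritten per row.
$joined = implode(', ', $values);
print $joined;
```

The array keys are plain integers here; using the row object itself as a key, as in $array[$row], is not a valid array offset in PHP.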
