I am trying to parse an HTML table with DOM and it works fine but when some cell contains html it doesn't work properly.
Here is the Sample HTML Table
<tr>
<td>Razon Social: </td>
<td>Circulo Inmobiliaria Sur (Casa Central)</td>
</tr>
<tr>
<td>Email: </td>
<td> <img src="generateImage.php?email=myemail#domain.com"/> </td>
</tr>
And PHP Code:
$rows = $dom->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cells = $row->getElementsByTagName('td');
if(strpos($cells->item(0)->textContent, "Razon") > 0)
{
$_razonSocial = $cells->item(1)->textContent;
}
else if(strpos($cells->item(0)->textContent, "Email") > 0)
{
$_email = $cells->item(1)->textContent;
}
}
echo "Razon Social: $_razonSocial<br>Email: $_email";
OUTPUT:
Razon Social: Circulo Inmobiliaria Sur (Casa Central)
Email:
Email is empty, it must be:
<img src="generateImage.php?email=myemail#domain.com"/>
I have even tried
$cells->item(1)->nodeValue;
instead of
$cells->item(1)->textContent;
But that doesn't work either. How can I make it return the HTML value?
Give your table the id item_specification:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$rows = $x->query("//*[@id='item_specification']/tr");
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$atr_name = $cells->item(0)->nodeValue;
$atr_val = $cells->item(1)->nodeValue;
// echo inside the loop so that every row is printed, not only the last one
echo " {$atr_name} - {$atr_val} <br />";
}
It's working fine.
As I already mentioned, <img src="generateImage.php?email=myemail#domain.com"/> is not text. It's another HTML element, so textContent has nothing to return for it. So try this:
if(strpos($cells->item(0)->textContent, "Razon") !== false) {
$_razonSocial = $cells->item(1)->textContent;
} else if(strpos($cells->item(0)->textContent, "Email") !== false) {
// here we get all child nodes of td.
// the whitespace before the img tag is also a child node, but it is a DOMText,
// so we skip everything that is not an img element.
foreach ($cells->item(1)->childNodes as $child) {
if ($child instanceof DOMElement && $child->tagName === 'img')
$_email = $child->getAttribute('src');
}
// now in $_email you have full src value and can somehow extract email
}
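To get from the src value to the address itself, something like the following might work. This is only a sketch: the $src string is a made-up example, and it assumes the address is passed URL-encoded in an email query parameter, which may not match the real page.

```php
<?php
// Hypothetical src value; in the loop above it would come from
// $child->getAttribute('src').
$src = 'generateImage.php?email=user%40example.com';

// parse_url() isolates the query string of the (relative) URL,
// parse_str() decodes that query string into an array.
$query = parse_url($src, PHP_URL_QUERY);
$params = [];
if (is_string($query)) {
    parse_str($query, $params);
}

// Fall back to an empty string when there is no email parameter.
$email = $params['email'] ?? '';
echo $email, PHP_EOL;
```

parse_str() also takes care of URL decoding, so no extra urldecode() call is needed.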
Related
How can I get text from HTML table cells using PHP DOM query?
HTML table is:
<table>
<tr>
<th>Job Location:</th>
<td>Kabul
</td>
</tr>
<tr>
<th>Nationality:</th>
<td>Afghan</td>
</tr>
<tr>
<th>Category:</th>
<td>Program</td>
</tr>
</table>
I have following query but it doesn't work:
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($html);
$xmlPageXPath = new DOMXPath($xmlPageDom);
$value = $xmlPageXPath->query('//table td /text()');
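One likely reason the query above fails is the space between table and td: XPath steps have to be joined with / or //, so //table//td/text() would be a valid form. A minimal sketch, assuming the goal is a label => value array built from each th/td pair (the $result name and the trimming of the trailing colon are my own choices):

```php
<?php
$html = '<table>
<tr><th>Job Location:</th><td>Kabul
</td></tr>
<tr><th>Nationality:</th><td>Afghan</td></tr>
<tr><th>Category:</th><td>Program</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//table//tr') as $row) {
    // Relative queries: the second argument is the context node.
    $th = $xpath->query('th', $row)->item(0);
    $td = $xpath->query('td', $row)->item(0);
    if ($th && $td) {
        $label = rtrim(trim($th->textContent), ':');
        $result[$label] = trim($td->textContent);
    }
}
print_r($result);
```

The trim() calls remove the line break that follows "Kabul" in the first cell.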
get a complete table with php domdocument and print it
The answer is like this:
$html = "<table ID='myid'><tr><td>1</td><td>2</td></tr><tr><td>4</td><td>5</td></tr><tr><td>7</td><td>8</td></tr></table>";
$xml = new DOMDocument();
$xml->validateOnParse = true;
$xml->loadHTML($html);
$xpath = new DOMXPath($xml);
$table = $xpath->query("//*[@id='myid']")->item(0);
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) {
print $cell->nodeValue;
}
}
EDIT: Use this instead
$table = $xpath->query("//table")->item(0);
I have a table with 3 columns where each of the columns could contain a link or data like this one:
<tr><td><a href='link1'>value1</a></td><td><a href='link2'>value2</a></td><td><a href='link3'>value3</a></td></tr>
<tr><td><a href='link4'>value4</a></td><td>value5</td><td>value6</td></tr>
<tr><td>value7</td><td><a href='link8'>value8</a></td><td>value9</td></tr>
<tr><td>value10</td><td>value11</td><td><a href='link12'>value12</a></td></tr>
<tr><td>value13</td><td>value14</td><td>value15</td></tr>
I am able to get the data for each cell of the table using the following code:
$data = file_get_contents('pathtomyfile');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//tr');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
foreach ($cols as $col) {
echo $col->nodeValue;
}
echo "\n";
}
I am trying to output the table in a different format and am wondering how I can get the value of the href in addition to the value of the table cell for the cells where a link exists. For example, for the first table cell I'd like to get "link1" and "value1".
Alternatively, you could check inside the inner loop (the one that iterates each cols) whether a link exists inside it (since some of them don't have it):
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
foreach ($cols as $col) {
echo 'value = ' . $col->nodeValue;
if($xpath->evaluate('count(./a)', $col) > 0) { // check if an anchor exists
echo ' | link = ' . $xpath->evaluate('string(./a/@href)', $col); // if there is, then echo the href value
}
echo '<br/>';
}
echo "<br/>";
}
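If you want the data in a different format rather than echoed, the same check can fill an array instead. This is a rough, self-contained sketch with a shortened two-row table; the 'value' and 'link' keys are just names I picked:

```php
<?php
$data = "<table>
<tr><td><a href='link1'>value1</a></td><td>value2</td></tr>
<tr><td>value3</td><td><a href='link4'>value4</a></td></tr>
</table>";

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($data);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$table = [];
foreach ($xpath->query('//tr') as $row) {
    $cells = [];
    foreach ($xpath->query('td', $row) as $col) {
        // string() yields '' when the cell has no <a> child.
        $href = $xpath->evaluate('string(a/@href)', $col);
        $cells[] = [
            'value' => trim($col->textContent),
            'link'  => $href !== '' ? $href : null,
        ];
    }
    $table[] = $cells;
}
print_r($table);
```

Cells without a link get null, so the consumer can tell "no link" apart from an empty href.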
I have code that tries to extract the Event SKU from the Robot Events page; here is an example. The code I am using doesn't find any SKU on the page. The SKU is on line 411, in a div with the class "product-sku". My code doesn't even find the div on the page and just downloads all the events. Here is my code:
<?php
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]);
$html->load($htmldown);
echo "Downloaded";
foreach ($html->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
?>
Can anyone help me fix my code?
This code uses PHP's DOMDocument class. It works for the sample HTML below. Please try it.
// new dom object
$dom = new DOMDocument();
// HTML string
$html_string = '<html>
<body>
<div class="product-sku1" name="div_name">The this the div content product-sku</div>
<div class="product-sku2" name="div_name">The this the div content product-sku</div>
<div class="product-sku" name="div_name">The this the div content product-sku</div>
</body>
</html>';
//load the html
$html = $dom->loadHTML($html_string);
// preserve white space
$dom->preserveWhiteSpace = TRUE;
// get all the divs by tag name
$divs = $dom->getElementsByTagName('div');
// loop over the all DIVs
foreach ($divs as $div) {
if ($div->hasAttributes()) {
foreach ($div->attributes as $attribute){
if($attribute->name === 'class' && $attribute->value == 'product-sku'){
// Print DIV class name and content
echo 'DIV Class Name: '.$attribute->value.PHP_EOL;
echo 'DIV Content: '.$div->nodeValue.PHP_EOL;
}
}
}
}
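The attribute loop above only matches when the class attribute is exactly "product-sku". If the div might carry more than one class, an XPath test on the class tokens could be used instead. A small sketch with made-up sample markup (the "featured" class is only there for illustration):

```php
<?php
$html = '<html><body>
<div class="product-sku1">first</div>
<div class="featured product-sku">second</div>
<div class="product-sku">third</div>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Pad the class list with spaces so that only whole class names match:
// "product-sku1" is rejected, "featured product-sku" is accepted.
$expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' product-sku ')]";

$found = [];
foreach ($xpath->query($expr) as $div) {
    $found[] = trim($div->textContent);
}
print_r($found);
```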
I would use a regex (regular expression) to accomplish pulling skus out.
The regex:
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
See php regex docs.
New code:
<?php
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = curl_init($event[4]);
curl_setopt($htmldown, CURLOPT_RETURNTRANSFER, true);
$html=curl_exec($htmldown);
curl_close($htmldown);
echo "Downloaded";
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
foreach ($matches as $row) {
echo $row;
}
}
?>
And actually in this case (using that webpage) being that there is only one sku...
instead of:
foreach ($matches as $row) {
echo $row;
}
You could just use: echo $matches[1]; (The reason for array index 1 is because the whole regex pattern plus the sku will be in $matches[0] but just the subgroup containing the sku is in $matches[1].)
try to use
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]); // $event[4] is a URL; str_get_html() only parses an HTML string
echo "Downloaded";
foreach ($htmldown->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
And if the class "product-sku" is only used on divs, then you can use
$htmldown->find('.product-sku')
I have a little issue.
I want to parse a simple HTML Document in PHP.
Here is the simple HTML :
<html>
<body>
<table>
<tr>
<td>Colombo <br> Coucou</td>
<td>30</td>
<td>Sunny</td>
</tr>
<tr>
<td>Hambantota</td>
<td>33</td>
<td>Sunny</td>
</tr>
</table>
</body>
</html>
And this is my PHP code :
$dom = new DOMDocument();
$html = $dom->loadHTMLFile("test.html");
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
echo $cols->item(0)->nodeValue.'<br />';
echo $cols->item(1)->nodeValue.'<br />';
echo $cols->item(2)->nodeValue;
}
But as you can see, I have a <br> tag and I need it, but when my PHP code runs, it removes this tag.
Can anybody explain to me how I can keep it?
I would recommend you to capture the values of the table cells with help of XPath:
$values = array();
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $row) {
$row_values = array();
foreach($xpath->query('td', $row) as $cell) {
$row_values[] = innerHTML($cell);
}
$values[] = $row_values;
}
Also, I've had the same problem as you with <br> tags being stripped out of fetched content: they are empty elements with no text of their own, so they contribute nothing to nodeValue, and unfortunately they're not automatically replaced with a newline character (\n).
So I wrote my own innerHTML function, which has proved invaluable in many projects. Here it is:
function innerHTML(DOMElement $element, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($element->childNodes as $node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
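On PHP 5.3.6 or later, DOMDocument::saveHTML() accepts a node argument, so a shorter variant without the temporary document may be enough. This is a sketch, not a drop-in replacement for the function above: innerHtmlOf is a name I made up, and it skips the entity decoding step.

```php
<?php
// Serialize every child of $element and join the pieces.
function innerHtmlOf(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) dumps just that node, tags included.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML('<table><tr><td>Colombo <br> Coucou</td><td>30</td></tr></table>');
libxml_clear_errors();

$cells = $dom->getElementsByTagName('td');
echo innerHtmlOf($cells->item(0)), PHP_EOL; // the <br> tag is kept in the output
echo innerHtmlOf($cells->item(1)), PHP_EOL;
```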
I'm stuck with this.
I try to use php dom to parse some html code.
How can I find out how many children the current element has while I iterate through the rows in a for loop?
<?php
$str='
<table id="tableId">
<tr>
<td>row1 cell1</td>
<td>row1 cell2</td>
</tr>
<tr>
<td>row2 cell1</td>
<td>row2 cell2</td>
</tr>
</table>
';
$DOM = new DOMDocument;
$DOM->loadHTML($str); // loading page contents
$table = $DOM->getElementById('tableId'); // getting the table that I need
$DOM->loadHTML($table);
$tr = $DOM->getElementsByTagName('tr'); // getting rows
echo $tr->item(0)->nodeValue; // outputs row1 cell1 row1 cell2 - exactly as I expect with both rows
echo "<br>";
echo $tr->item(1)->nodeValue; // outputs row2 cell1 row2 cell2
// now I need to iterate through each row to build an array with cells that it has
for ($i = 0; $i < $tr->length; $i++)
{
echo $tr->item($i)->length; // outputs no value. But how can I get it?
echo $i."<br />";
}
?>
This will give you all childnodes:
$tr->item($i)->childNodes->length;
... but: it will contain DOMText nodes with whitespace etc (so the count is 4). If you don't necessarily need the length, just want to iterate over all the nodes, you can do this:
foreach($tr->item($i)->childNodes as $node){
if($node instanceof DOMElement){
var_dump($node->ownerDocument->saveXML($node));
}
}
If you need only a length of elements, you can do this:
$x = new DOMXPath($DOM);
var_dump($x->evaluate('count(*)',$tr->item($i)));
And you can do this:
foreach($x->query('*',$tr->item($i)) as $child){
var_dump($child->nodeValue);
}
foreach-ing through ->childNodes has my preference for simple 'array-building'. Keep in mind you can just foreach through a DOMNodeList as if it were an array, which saves a lot of hassle.
Building a simple array from a table:
$DOM = new DOMDocument;
$DOM->loadHTML($str); // loading page contents
$table = $DOM->getElementById('tableId');
$result = array();
foreach($table->childNodes as $row){
if(strtolower($row->nodeName) != 'tr') continue; // nodeName also exists on text nodes, tagName does not
$rowdata = array();
foreach($row->childNodes as $cell){
if(strtolower($cell->nodeName) != 'td') continue; // skip whitespace text nodes between cells
$rowdata[] = $cell->textContent;
}
$result[] = $rowdata;
}
var_dump($result);