Convert HTML table to PHP array - problem with merged text - php

I have a small problem. I need to convert an HTML table to a PHP array or JSON. The table always has two columns and N rows. I use:
$xml = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$xml->loadHTML('<?xml encoding="utf-8" ?>'.$content);
$xpath = new DOMXPath($xml);
$table = $xpath->query("//*[@class='".$autoAttributeHtmlClass."']");
$length = $table->length;
$j = 0;
$attrArr = array();
for ($i=0; $i <= $length-1; $i++) {
$element = $table->item($i);
$rows = $element->getElementsByTagName("tr");
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
$attrArr[$j]['attr'] = rtrim($cols->item(0)->nodeValue, ':');
$attrArr[$j]['val'] = htmlspecialchars($cols->item(1)->nodeValue);
$j++;
}
}
echo json_encode($attrArr);
Everything works as long as a column contains only plain text. When a column contains additional HTML (for example <div>, <span>, <p>, <li>, etc.), the inner texts are merged.
Example HTML table:
<table class="test">
<tbody>
<tr>
<td>Col1</td>
<td>Micro Tower</div></td>
</tr>
<tr>
<td>Col2</td>
<td>
<p>Micro-ATX</p>
<p>Mini-ITX</p>
</td>
</tr>
<tr>
<td>Col3</td>
<td>
<div>
<span>Test1</span>
</div>
<div>
<span>Test2</span>
</div>
</td>
</tr>
</tbody>
</table>
For the second row, nodeValue (PHP) returns the merged text: Micro-ATXMini-ITX
For the third row, nodeValue (PHP) returns the merged text: Test1Test2
Any ideas? I need a separator (space, comma or semicolon) between the texts, because the merged result is not readable.

Try .textContent instead of .nodeValue
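In PHP's DOM extension, textContent on an element, like nodeValue, returns the concatenated text of all its descendants, so a separator may still have to be added by hand. The following is a minimal sketch of one way to do that, not a definitive implementation: it collects the text nodes below each cell with XPath and joins them. The helper name cellText and the ', ' separator are assumptions made for illustration, and the sample table is a shortened version of the one in the question.

```php
<?php
// Hypothetical helper (not part of the DOM extension):
// join all non-empty text nodes below $cell with $separator.
function cellText(DOMXPath $xpath, DOMNode $cell, string $separator = ', '): string
{
    $parts = [];
    // .//text() selects every text node below the cell, in document order
    foreach ($xpath->query('.//text()', $cell) as $textNode) {
        $text = trim($textNode->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode($separator, $parts);
}

$content = '<table class="test"><tbody>
<tr><td>Col2</td><td><p>Micro-ATX</p><p>Mini-ITX</p></td></tr>
<tr><td>Col3</td><td><div><span>Test1</span></div><div><span>Test2</span></div></td></tr>
</tbody></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

$attrArr = [];
foreach ($xpath->query("//table[@class='test']//tr") as $row) {
    $cols = $row->getElementsByTagName('td');
    $attrArr[] = [
        'attr' => rtrim(cellText($xpath, $cols->item(0)), ':'),
        'val'  => cellText($xpath, $cols->item(1)),
    ];
}
echo json_encode($attrArr);
```

Trimming each text node and dropping the empty ones keeps the indentation whitespace between tags out of the result.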

Related

I have to display image and data from xml, how can I do it in php?

Each time it loops, the only text it shows is the Product_URL. I am really confused about how to solve this problem. I guess there is something wrong with the loop.
<html>
<head>
<title>Display main Image</title>
</head>
<body>
<table>
<tr>
<th>Thumbnail Image</th>
<th>Product Name</th>
<th>Product Description</th>
<th>Price</th>
<th>Weight</th>
<th>Avail</th>
<th>Product URL</th>
</tr>
<tr>
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->Load('xml_feeds7.xml');
$xpath = new DOMXPath($doc);
$listquery = array('//item/thumbnail_url', '//item/productname', '//item/productdesciption', '//item/price', '//item/weight', '//item/avail', '//item/product_url');
foreach ($listquery as $queries) {
$entries = $xpath->query($queries);
foreach ($entries as $entry) { ?>
<tr>
<td>
<img src="<?php echo $entry->nodeValue; ?>" width="100px" height="100px">
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php
$price_value = $entry->nodeValue;
echo str_replace($price_value, ".00", "");
?>
</td>
<td>
<?php
$weight_value = $entry->nodeValue;
echo str_replace($weight_value, ".00", "");
?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
</tr>
}
}
</tr>
</table>
</body>
</html>
The table should be displaying:
---------------------------------------------------------------------------------
| Thumbnail | Product Name | Description | Price | Weight | Avail | Product_URL |
---------------------------------------------------------------------------------
XPath can return scalar values (strings and numbers) directly, but you have to do the typecast in the expression and use DOMXPath::evaluate().
You should iterate over the items and then use each item as the context for the detail expressions. Building separate lists can result in invalid data (if an element is missing in one of the items).
Finally, you can use DOM methods to create the HTML table. That way it will take care of escaping and closing the tags.
$xml = <<<'XML'
<items>
<item>
<thumbnail_url>image.png</thumbnail_url>
<productname>A name</productname>
<productdescription>Some text</productdescription>
<price currency="USD">42.21</price>
<weight unit="g">23</weight>
<avail>10</avail>
<product_url>page.html</product_url>
</item>
</items>
XML;
$document = new DOMDocument;
$document->preserveWhiteSpace = false;
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$fields = [
'Thumbnail' => 'string(thumbnail_url)',
'Product Name' => 'string(productname)',
'Description' => 'string(productdescription)',
'Price' => 'number(price)',
'Weight' => 'number(weight)',
'Availability' => 'string(avail)',
'Product_URL' => 'string(product_url)'
];
$html = new DOMDocument();
$table = $html->appendChild($html->createElement('table'));
$row = $table->appendChild($html->createElement('tr'));
// add table header cells
foreach ($fields as $caption => $expression) {
$row
->appendChild($html->createElement('th'))
->appendChild($html->createTextNode($caption));
}
// iterate the items in the XML
foreach ($xpath->evaluate('//item') as $item) {
// add a new table row
$row = $table->appendChild($html->createElement('tr'));
// iterate the field definitions
foreach ($fields as $caption => $expression) {
// fetch the value using the expression in the item context
$value = $xpath->evaluate($expression, $item);
switch ($caption) {
case 'Thumbnail':
// special handling for the thumbnail field
$image = $row
->appendChild($html->createElement('td'))
->appendChild($html->createElement('img'));
$image->setAttribute('src', $value);
break;
case 'Price':
case 'Weight':
// number format for price and weight values
$row
->appendChild($html->createElement('td'))
->appendChild(
$html->createTextNode(
number_format($value, 2, '.', ',')
)
);
break;
default:
$row
->appendChild($html->createElement('td'))
->appendChild($html->createTextNode($value));
}
}
}
$html->formatOutput = TRUE;
echo $html->saveHtml();
Output:
<table>
<tr>
<th>Thumbnail</th>
<th>Product Name</th>
<th>Description</th>
<th>Price</th>
<th>Weight</th>
<th>Availability</th>
<th>Product_URL</th>
</tr>
<tr>
<td><img src="image.png"></td>
<td>A name</td>
<td>Some text</td>
<td>42.21</td>
<td>23.00</td>
<td>10</td>
<td>page.html</td>
</tr>
</table>
I've changed it to use SimpleXML as this is a fairly simple data structure - but this fetches each <item> and then displays the values from there. I've only done this with a few values, but hopefully this shows the idea...
$doc = simplexml_load_file('xml_feeds7.xml');
foreach ( $doc->xpath("//item") as $item ) {
echo "<tr>";
echo "<td><img src=\"{$item->thumbnail_url}\" width=\"100px\" height=\"100px\"></td>";
echo "<td>{$item->productname}</td>";
echo "<td>{$item->productdesciption}</td>";
// Other fields...
$price_value = str_replace(".00", "",(string)$item->price);
echo "<td>{$price_value}</td>";
// Other fields...
echo "</tr>";
}
Rather than using XPath for each value, it uses $item->elementName, so $item->productname is the productname. This is a much simpler way of referring to each field.
Note that because the price field is processed further, you have to cast it to a string to ensure it is handled correctly.
Update:
If you need to access data in a namespace in SimpleXML, you can use XPath, or in this case there is a simple (if slightly roundabout) way. The ->children() method accepts the namespace of the elements you want and returns a new SimpleXMLElement with all the elements in that namespace.
$extraData = $item->children('g',true);
echo "<td>{$extraData->productname}</td>";
Now - $extraData will have any element with g as the namespace prefix, and they can be referred to in the same way as before, but instead of $item you use $extraData.
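As a self-contained illustration of the ->children() approach, here is a small sketch. The feed, the namespace URI http://example.com/ns/g and the element values are made up for the example; only the g prefix and the productname element come from the answer above.

```php
<?php
// Hypothetical feed: the namespace URI and the values are assumptions for illustration.
$xml = <<<'XML'
<items xmlns:g="http://example.com/ns/g">
  <item>
    <productname>Plain name</productname>
    <g:productname>Namespaced name</g:productname>
  </item>
</items>
XML;

$doc = simplexml_load_string($xml);
foreach ($doc->item as $item) {
    // The second argument (true) means 'g' is a prefix, not a namespace URI
    $extraData = $item->children('g', true);
    $plain = (string) $item->productname;            // element without a namespace
    $namespaced = (string) $extraData->productname;  // element with the g: prefix
    echo "<td>{$plain}</td><td>{$namespaced}</td>";
}
```

Passing the prefix ties the code to how the feed happens to declare the namespace; passing the namespace URI as the first argument (and omitting the second) is the more robust variant when the URI is known.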

Extract text and image src with PHP DomDocument

I'm trying to extract the img src and the text of the TDs inside the div id="Ajax", but I'm unable to extract the img with my code. It just ignores the img src. How can I also extract the img src and add it to the array?
HTML:
<div id="Ajax">
<table cellpadding="1" cellspacing="0">
<tbody>
<tr id="comment_1">
<td>20:28</td>
<td class="color">
</td>
<td class="last_comment">
Text<br/>
</td>
</tr>
<tr id="comment_2">
<td>20:25</td>
<td class="color">
</td>
<td class="comment">
Text 2<br/>
</td>
</tr>
<tr id="comment_3">
<td>20:24</td>
<td class="color">
<img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/>
</td>
<td class="comment">
Text 3<br/>
</td>
</tr>
<tr id="comment_4">
<td>20:23</td>
<td class="color">
<img src="http://url.ext/img/image01.jpeg" alt="img alt"/>
</td>
<td class="comment">
Text 4<br/>
</td>
</tr>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$contentArray = array();
$doc = $doc->getElementById('Ajax');
$text = $doc->getElementsByTagName ('td');
foreach ($text as $t)
{
$contentArray[] = $t->nodeValue;
}
print_r ($contentArray);
Thanks.
You're using $t->nodeValue to obtain the content of a node. An <img> tag is empty, thus has nothing to return. The easiest way to get the src attribute would be XPath.
Example:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$expression = "//div[@id='Ajax']//tr";
$nodes = $xpath->query($expression); // Get all rows (tr) in the div
$imgSrcExpression = ".//img/@src";
$firstTdExpression = "./td[1]";
foreach($nodes as $node){ // loop over each row
// select the first td node
$tdNodes = $xpath->query($firstTdExpression ,$node);
$tdVal = null;
if($tdNodes->length > 0){
$tdVal = $tdNodes->item(0)->nodeValue;
}
// select the src attribute of the img node
$imgNodes = $xpath->query($imgSrcExpression,$node);
$imgVal = null;
if($imgNodes ->length > 0){
$imgVal = $imgNodes->item(0)->nodeValue;
}
}
(Caution: Code may contain typos)

How can I parse a website to get the links out of a table?

I am trying to figure out how to parse a website to get the links out of a table. In my particular case there are two tables, but I only want the links from the second table (Link5 & Link6). Here is the HTML I am trying to parse.
<html>
<head>
</head>
<body>
Link1<br>
<br>
<table>
<tbody>
<tr>
<td>Link2</td>
<td>dog</td>
<td>fish</td>
</tr>
<tr>
<td>Link3</td>
<td>cat</td>
<td>bird</td>
</tr>
</tbody>
</table>
<br>
Link4<br>
<br>
<table>
<tbody>
<tr>
<td>Link5</td>
<td>cow</td>
</tr>
<tr>
<td>Link6</td>
<td>horse</td>
</tr>
</tbody>
</table>
<br>
Link7<br>
</body>
</html>
I have read that DOM is a good way to parse data from the web, so here is the code I have been working on.
<?php
$link = array();
//new dom object
$dom = new DOMDocument();
//load the html
$html = $dom->loadHTMLFile('http://www.example.com');
//discard white space
$dom->preserveWhiteSpace = false;
//get the table by its tag name
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(1)->getElementsByTagName('tr');
$i = 0;
//loop over the table rows
foreach ($rows as $row)
{
$links = $row->getElementsByTagName('a');
//put node value into an array
$link[] = $links->item(0)->nodeValue;
// echo the values
echo $link[$i] . '<br />';
$i++;
}
?>
This code gives the following output:
Link5
Link6
But what I would like to achieve is:
http://www.example.com/link5.html
http://www.example.com/link6.html
Any help would be greatly appreciated.
I guess the problem is that you want the href, not the node's value, so you should use getAttribute:
$link[] = $links->item(0)->getAttribute("href");
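For reference, a minimal self-contained sketch of the same idea follows. The markup is a shortened stand-in for the page in the question, and the href values are assumptions, since the HTML shown above does not include them. It also skips rows that contain no link, because item(0) returns null in that case.

```php
<?php
// Hypothetical markup: the href values are assumed for illustration.
$html = '<html><body>
<table><tr><td><a href="http://www.example.com/link2.html">Link2</a></td><td>dog</td></tr></table>
<table>
<tr><td><a href="http://www.example.com/link5.html">Link5</a></td><td>cow</td></tr>
<tr><td><a href="http://www.example.com/link6.html">Link6</a></td><td>horse</td></tr>
</table>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$link = [];
// item(1) is the second table in document order
$rows = $dom->getElementsByTagName('table')->item(1)->getElementsByTagName('tr');
foreach ($rows as $row) {
    $anchors = $row->getElementsByTagName('a');
    // Skip rows without a link instead of calling a method on null
    if ($anchors->length > 0) {
        $link[] = $anchors->item(0)->getAttribute('href');
    }
}
echo implode("\n", $link), "\n";
```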

PHP textContent removing HTML?

I have the following script which loops through a HTML table and gets the values from it then returns the value of the table in a td.
$tds = $dom->getElementsByTagName('td');
// New dom
$dom2 = new DOMDocument;
$x = 1;
// Loop through all the tds printing the value with a new class
foreach($tds as $t) {
if($x%2 == 1)
print "</tr><tr>";
$class = ($x%2 == 1) ? "odd" : "even";
var_dump($t->textContent);
print "<td class='$class'>".$t->textContent."</td>";
$x++;
}
But textContent seems to be stripping the HTML tags (for example, a wrapping <p></p> tag). How can I get it to give me just the value, markup included?
Or is there another way of doing this? I have the following html
<table>
<tr>
<td>q1</td>
<td>a1</td>
</tr>
<tr>
<td>q2</td>
<td>a2</td>
</tr>
</table>
and I need to make it look like
<table>
<tr>
<td class="odd">q1</td>
<td class="even">a1</td>
</tr>
<tr>
<td class="odd">q2</td>
<td class="even">a2</td>
</tr>
</table>
It will always look the exact same way (minus extra element rows and the values which change).
Any help?
According to MDN this is the expected behaviour of textContent.
You can just add the class to the tds in the DomDocument
$tds = $dom->getElementsByTagName('td');
$x = 1;
foreach($tds as $td) {
if($x%2 == 1){
$td->setAttribute('class', 'odd');
}
else{
$td->setAttribute('class', 'even');
}
$x++;
}
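Because the class is set on the existing nodes, any markup inside the cells stays intact, and the table can then be serialized directly. A short sketch under the assumption that the table is loaded from a string (the sample cell wrapped in <p> is made up to show that inner tags survive):

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<table><tr><td>q1</td><td><p>a1</p></td></tr><tr><td>q2</td><td>a2</td></tr></table>');

$x = 1;
foreach ($dom->getElementsByTagName('td') as $td) {
    // Odd cells get "odd", even cells get "even", as in the answer above
    $td->setAttribute('class', $x % 2 == 1 ? 'odd' : 'even');
    $x++;
}

// Passing a node to saveHTML() serializes only that node, without the html/body wrapper
$table = $dom->getElementsByTagName('table')->item(0);
$result = $dom->saveHTML($table);
echo $result;
```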

php DomXPath - how to get image in current node only and not in child nodes?

I need to get only the image that is directly in the current node, not the images in its child nodes.
I want only the green/yellow/red/black images, without the not_important.gif image.
I could use the query './/table/tr/td/img',
but I need to do it inside the loop.
<?php
/////////////////////////////////////////////////////////////////////
$html='
<table>
<tr>
<td colspan="2">
<span>
<img src="not_important.gif" />
</span>
<img src="green.gif" />
</td>
</tr>
<tr>
<td>
<span>yellow</span>
<img src="yellow.gif" />
</td>
<td>
<span>red</span>
<img src="red.gif" />
</td>
</tr>
</table>
<table>
<tr>
<td>
<span>
<img src="not_important.gif" />
</span>
<img src="black.gif" />
</td>
</tr>
</table>
';
/////////////////////////////////////////////////////////////////////
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
/////////////////////////////////////////////////////////////////////
$query = $xpath->query('.//table/tr/td');
for( $x=0,$results=''; $x<$query->length; $x++ )
{
$x1=$x+1;
$image = $query->item($x)->getELementsByTagName('img')->item(0)->getAttribute('src');
$results .= "image $x1 is : $image<br/>";
}
echo $results;
/////////////////////////////////////////////////////////////////////
?>
Can I do it through $query->item()-> ?
I tried has_attributes, getElementsByTagNameNS and getElementById,
but without success.
Replace:
$image = $query->item($x)->getELementsByTagName('img')->item(0)->getAttribute('src');
...with:
$td = $query->item($x); // grab the td element
$img = $xpath->query('./img',$td)->item(0); // grab the first direct img child element
$image = $img->getAttribute('src'); // grab the source of the image
In other words, use the XPath object again to query, but now for ./img, relative to the context node you provide as the second argument to query(). The context node being one of the elements (td) of the earlier result.
The query //table/tr/td/img should work just fine as the unwanted images all reside in <span> elements.
Your loop would look like
$images = $xpath->query('//table/tr/td/img');
$results = '';
for ($i = 0; $i < $images->length; $i++) {
$results .= sprintf('image %d is: %s<br />',
$i + 1,
$images->item($i)->getAttribute('src'));
}
