Convert HTML table to XML using PHP (DOMDocument?)

I'm looking to convert the below HTML Table markup into an XML format.
<table class='tbl-class'>
<thead>
<tr>
<th>Island</th>
<th>Number of nights</th>
</tr>
</thead>
<tbody>
<tr>
<td>Guadeloupe</td>
<td>1</td>
</tr>
<tr>
<td>Antigua</td>
<td>5</td>
</tr>
</tbody>
</table>
I would ideally like the XML output to be something like this:
<location>
<island>Guadeloupe</island>
<nights>1</nights>
</location>
<location>
<island>Antigua</island>
<nights>5</nights>
</location>
I'm currently attempting to use DOMDocument to do this, but I don't have enough experience with it to get anywhere. So far I've done the following. I think there's much more I need to be doing in the foreach loop, but I'm unsure what.
$doc = new DOMDocument();
$doc->load($convertedString);
$classname = 'tbl-class';
$finder = new DomXPath($doc);
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
$convertedString = $doc->saveHTML();

I find that using SimpleXML is, as its name implies, simpler. This code reads the XML and, as you have, finds the <table> element.
Then, using foreach(), it relies on SimpleXML's ability to expose the element hierarchy as objects, so $table[0]->tbody->tr refers to the <tr> elements in the <tbody> section of the table.
It then combines each of the <td> elements with the corresponding label from $headers:
$xml = simplexml_load_string($convertedString);
$classname = 'tbl-class';
$table = $xml->xpath("//*[contains(@class, '$classname')]");
$headers = ["island", "nights"];
$out = new SimpleXMLElement("<locations />");
foreach ( $table[0]->tbody->tr as $tr ){
$location = $out->addChild("location");
$key = 0;
foreach ( $tr->td as $td ) {
$location->addChild($headers[$key++], (string)$td);
}
}
echo $out->asXML();
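Since the question asks about DOMDocument, here is a rough sketch of the same conversion done with DOMDocument and DOMXPath instead of SimpleXML. It assumes the table markup from the question (with the closing </tbody> in place), a wrapping <locations> root element as in the answer above, and the element names island and nights taken from the desired output. loadHTML() is used because it tolerates markup that is not well-formed XML.

```php
<?php
// Sample input copied from the question, with the closing </tbody> in place.
$html = <<<EOT
<table class='tbl-class'>
<thead><tr><th>Island</th><th>Number of nights</th></tr></thead>
<tbody>
<tr><td>Guadeloupe</td><td>1</td></tr>
<tr><td>Antigua</td><td>5</td></tr>
</tbody>
</table>
EOT;

$src = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$src->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($src);
$rows  = $xpath->query("//table[contains(@class, 'tbl-class')]/tbody/tr");

// Output element names in column order (an assumption based on the desired XML).
$headers = array('island', 'nights');

$out = new DOMDocument('1.0', 'UTF-8');
$out->formatOutput = true;
$root = $out->appendChild($out->createElement('locations'));

foreach ($rows as $tr) {
    $location = $root->appendChild($out->createElement('location'));
    // Query the <td> children relative to the current row.
    foreach ($xpath->query('td', $tr) as $i => $td) {
        if (!isset($headers[$i])) {
            break; // ignore columns we have no name for
        }
        $el = $location->appendChild($out->createElement($headers[$i]));
        // createTextNode() escapes characters such as & and < for us.
        $el->appendChild($out->createTextNode(trim($td->textContent)));
    }
}

$result = $out->saveXML();
echo $result;
```

Building the output in a second DOMDocument keeps the source HTML untouched and lets saveXML() take care of escaping.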

Related

How to get child element of DOMDocument

Trying to get the specific value from a table with tr and td elements...
HTML:
<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>
PHP:
$html = 'http://www.example.com'; // edited
$dom = new DOMDocument;
@$dom->loadHTML($html);
$data = $dom->getElementsByTagName('tr:nth-child(3n)');
foreach ($data as $datas){
echo $link->nodeValue;
}
Using such or different approach, how to get the value of specific td element... ?

getElementsByTagName() returns a list of the matching tags below your starting point, so once you've found the table, you can call the same function on it to get the <td> tags. You can then just pick out the elements you're after:
$data = "<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>";
$dom = new DOMDocument;
$dom->loadHTML($data);
$table = $dom->getElementsByTagName('table');
$td = $table[0]->getElementsByTagName('td'); // Fetch all td elements in the first table
echo $td[2]->nodeValue; // Echo out the value of the 3rd item (zero based arrays)
Prints out..
value3

XPath can be used to get a particular element. Try the following code to get the value of the 3rd td from the given HTML.
$html = '<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$table = $dom->getElementsByTagName('table')->item(0);
$query = 'tr/td[3]';
$entries = $xpath->query($query, $table);
echo $entries[0]->nodeValue;
Read about DOMXpath query()
Update: Using file_get_contents() is also simple. You can retrieve the HTML/XML as a string in the $html variable, and the rest of the process is the same:
$html = file_get_contents("path/to/file/x.html"); // target path
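The two approaches above can be folded into one small function. This is only a sketch: cellValue() is a made-up helper name, the HTML is given inline rather than fetched, and the row and column numbers follow XPath's 1-based counting. It returns null instead of failing when the requested cell does not exist.

```php
<?php
// Hypothetical helper: text of the $col-th <td> in the $row-th <tr> of the
// first table in $html (both 1-based), or null if there is no such cell.
function cellValue(string $html, int $row, int $col): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // The parentheses make [1] apply to the whole node set, i.e. the first table.
    $nodes = $xpath->query("(//table)[1]//tr[$row]/td[$col]");

    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}

$html = '<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>';

echo cellValue($html, 1, 3), "\n";
var_dump(cellValue($html, 1, 4)); // there is no 4th cell
```

Checking $nodes->length before calling item(0) avoids an error on pages where the table is shorter than expected.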

I think this should help you.
Just assign a class name to the td elements and then select by it.
For example (note that this is jQuery, which runs in the browser, not PHP):
<td class="two">value2</td>
$(this).closest('tr').children('td.two').text();

PHP DOM/xpath check element span class value

From a curl request I get an HTML table with the structure below. I now want to extract only the table rows that contain a span element with an empty class, not the ones with class="subcomponent".
I successfully used XPath to find the span elements with the empty class, but how do I get the entire <tr> or, even better, the specific <td> nodes that contain Version and Partnumber?
Thanks in advance.
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td>
<span class="">Product</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
<tr>
<td></td>
<td></td>
<td>
<span class="subcomponent">Component</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
</tbody>
My PHP code
$doc = new DOMdocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$doc->saveHTML();
$xpath = new DOMXpath($doc);
$query = '//span[@class=""]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo $entry->C14N();
}

To access the table rows themselves using SimpleXML, you can use the following:
$sxml = simplexml_load_string('<table>...</table>');
$rows = $sxml->xpath('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
echo "Version: ", $row->td[3], ", Partnumber: ", $row->td[4];
}
The XPath works by selecting all <tr> tags that have a child <td>, which itself has a child <span> with a blank class.
In the loop, you need to access the child cells of each row by number, since your sample doesn't indicate that they're labelled any other way. I'm assuming a table structure won't change too often though, so that should be fine.
See https://eval.in/860169 for an example.
Alternative DOMDocument Version
If you're fetching a full webpage, which won't necessarily be well-formed, you might need to use DOMDocument as you have in your first example. It's a bit less clean to access the child-elements, but something like the following will work:
$doc = new DOMdocument;
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXpath($doc);
$rows = $xpath->query('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$version = $cells->item(3)->nodeValue;
$partNumber = $cells->item(4)->nodeValue;
echo "Version: {$version}, Part Number: {$partNumber}", PHP_EOL;
}
See https://eval.in/860217
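For reference, here is a self-contained sketch of the DOMDocument variant that can be run without the original page. The cell values (1.0, PN-100 and so on) are placeholders invented for the example, since the question only shows the labels Version and Partnumber. Instead of getElementsByTagName(), it reads the 4th and 5th cells with XPath expressions evaluated relative to each matching row.

```php
<?php
// Stand-in for the fetched page; the cell values are made-up placeholders.
$page = <<<EOT
<table>
<tbody>
<tr><td></td><td></td><td><span class="">Product</span></td><td>1.0</td><td>PN-100</td></tr>
<tr><td></td><td></td><td><span class="subcomponent">Component</span></td><td>2.0</td><td>PN-200</td></tr>
</tbody>
</table>
EOT;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = array();

// Rows that have a <td> child containing a <span> whose class attribute is empty.
foreach ($xpath->query('//tr[td/span[@class=""]]') as $row) {
    $found[] = array(
        // XPath positions are 1-based: td[4] is the fourth cell of this row.
        'version'    => trim($xpath->evaluate('string(td[4])', $row)),
        'partnumber' => trim($xpath->evaluate('string(td[5])', $row)),
    );
}

print_r($found);
```

evaluate() with string() returns a plain PHP string, and an empty one if the cell is missing, so no null check is needed here.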

I would use the following XPath expression:
//td[text()="Version"] | //td[text()="Partnumber"]
Which gives me:
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
Element='<td>Version</td>'
Element='<td>Partnumber</td>'

How to parse table inside table?

I have an html content, that look like this:
<table>
<tbody>
<tr>
<td>blabla</td>
<td>blabla</td>
</tr>
<tr>
<td>blabla</td>
<td>blabla</td>
</tr>
<tr>
<td>blabla</td>
<td><table>THIS IS MY TABLE CONTENT</table></td>
</tr>
</tbody>
</table>
I want to parse the THIS IS MY TABLE CONTENT and ONLY this table, the outer table is irrelevant for me.
I'm using Simple HTML DOM parser and right now my code look like this:
$table = $html->find('table');
foreach ($table->find('table') as $tbl){
foreach ($tbl->find('tr') as $tr){
foreach ($tr->find('td') as $td){
// some logic
}
}
}
My problem is, that this way I'm not getting any result. How can I perform this parsing the right way?
Thank you very much for the help!

What about using DOM with XPath?
$str = '<table><tbody><tr><td>blabla</td><td>blabla</td></tr><tr><td>blabla</td><td>blabla</td></tr><tr><td>blabla</td><td><table>THIS IS MY TABLE CONTENT</table></td></tr></tbody></table>';
$dom = new DOMDocument();
$dom->loadHTML($str); // $str is your html string
$xpath = new DOMXPath($dom);
$tables = $xpath->query('.//td/table'); // fetch all tables inside td
foreach($tables as $table){
// Do your stuff with each table
echo $table->nodeValue; // $table is your current $table
}

The inner table would be:
$html->find('table table', 0);

xpath looping to unknown number of nodes

I've a xpath that looks like this:
$path = '//*[@id="page-content"]/table/tbody/tr[3]/td['.$i.']/div/a';
where $i goes from 1 to X. I would normally use:
for($i=1; $i<X;$i++){
$path = '//*[@id="page-content"]/table/tbody/tr[3]/td['.$i.']/div/a';
$nodelist = $xpath->query($path);
$result = $nodelist->item(0)->nodeValue;
};
However, in this case, I don't know what X is. Is there any way to loop through this without knowing X?

Why not just stack em? Something like (fragile code, add your checks):
// first xpath for the outer node-list
$tds = $xpath->query('//*[@id="page-content"]/table/tbody/tr[3]/td');
foreach ($tds as $td)
{
// fetch the included values with a relative xpath to the current node
$nodelist = $xpath->query('./div/a', $td);
...
}
And actually you won't even need that inner node list, because all you want in the end are the node values. However, I leave this here to show what you can do by using an XPath relative to a concrete node.
So if you need the first <a> element inside any <div> inside any <td> of the third <tr> of any table inside any node with the id "page-content", you can write that directly as a single query:
//*[@id="page-content"]/table/tbody/tr[3]/td/div/a[1]
A predicate (the part in brackets) applies only to the step of the path it is attached to, so the [1] applies only to the a at the end, just as the [3] applies only to the tr.
Code Example:
$as = $xpath->query('//*[@id="page-content"]/table/tbody/tr[3]/td/div/a[1]');
foreach ($as as $a)
{
echo $a->nodeValue, "\n";
}
So this would give you the result as a single node-list, you do not need to run a second xpath query.
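To illustrate how a predicate binds, here is a small sketch with a made-up document in which each <div> of the third row holds two links. It compares a[1] at the end of the path (the first <a> within each div) with the same path wrapped in parentheses, where [1] applies to the whole resulting node set. The sample XML and the variable names are assumptions for the demonstration only.

```php
<?php
// Made-up sample: the third <tr> has two cells with two links each.
$xml = '<div id="page-content"><table><tbody>
<tr><td><div><a>r1</a></div></td></tr>
<tr><td><div><a>r2</a></div></td></tr>
<tr>
  <td><div><a>A1</a><a>A2</a></div></td>
  <td><div><a>B1</a><a>B2</a></div></td>
</tr>
</tbody></table></div>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$base = '//*[@id="page-content"]/table/tbody/tr[3]/td/div/a';

// Collect the text of every node in a node list.
$texts = function (DOMNodeList $list) {
    $out = array();
    foreach ($list as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
};

// [1] belongs to the last step only: the first <a> inside each <div>.
$firstPerDiv = $texts($xpath->query($base . '[1]'));

// Parentheses first build the whole node set, then [1] picks its first node.
$firstOverall = $texts($xpath->query('(' . $base . ')[1]'));

echo implode(', ', $firstPerDiv), "\n";  // A1, B1
echo implode(', ', $firstOverall), "\n"; // A1
```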

If I'm understanding your question, you're asking how to loop up until the max number of <td> elements under your XPath?
You could retrieve the number of nodes using:
count(//*[@id="page-content"]/table/tbody/tr[3]/td) and store it in a temporary variable, then use it in your loop. XPath positions start at 1, so the loop has to include the last index:
for ($i = 1; $i <= $numberOfTdElements; $i++) {
$path = '//*[@id="page-content"]/table/tbody/tr[3]/td['.$i.']/div/a';
$nodelist = $xpath->query($path);
$result = $nodelist->item(0)->nodeValue;
};
In response to hakre's suggestion:
$tbody = $doc->getElementsByTagName('tbody')->item(0);
// our query is relative to the tbody node
$query = 'count(tr[3]/td)';
$tdcount = $xpath->evaluate($query, $tbody);
echo "There are $tdcount elements under tr[3]\n";
And then combine it all in:
for ($i = 1; $i <= $tdcount; $i++) { // <= because td[1] .. td[$tdcount] are all valid
$path = '//*[@id="page-content"]/table/tbody/tr[3]/td['.$i.']/div/a';
$nodelist = $xpath->query($path);
$result = $nodelist->item(0)->nodeValue;
};

I think what you are trying to do is fetch every a element that is a child of a div, which in turn is a child of any td element that, in turn, is a child of the third tr element, and so on. If that is correct, you can simply fetch these with this query:
<?php
$doc = new DOMDocument();
$doc->loadXML( $xml );
$xpath = new DOMXPath( $doc );
$nodes = $xpath->query( '//*[@id="page-content"]/table/tbody/tr[3]/td/div/a' );
foreach( $nodes as $node )
{
echo $node->nodeValue . '<br>';
}
Where $xml is a document, similar to this:
<?php
$xml = <<<XML
<?xml version="1.0" encoding="utf-8" ?>
<result>
<div id="page-content">
<table>
<tbody>
<tr>
<td>
<div><a>This one shouldn't be fetched</a></div>
</td>
</tr>
<tr>
<td>
<div><a>This one shouldn't be fetched</a></div>
</td>
</tr>
<tr>
<td>
<div><a>This one should be fetched</a></div>
</td>
<td>
<div><a>This one should be fetched</a></div>
</td>
<td>
<div><a>This one should be fetched</a></div>
</td>
<td>
<div><a>This one should be fetched</a></div>
</td>
<td>
<div><a>This one should be fetched</a></div>
</td>
</tr>
<tr>
<td>
<div><a>This one shouldn't be fetched</a></div>
</td>
</tr>
</tbody>
</table>
</div>
</result>
XML;
In other words, there is no need to loop through all these td elements. You can fetch them all in one go, resulting in a DOMNodeList with all the required nodes.

$doc = new DOMDocument();
$doc->loadXML( $xml );
$xpath = new DOMXPath( $doc );
$nodes = $xpath->query( '/result/div[@id="page-content"]/table/tbody/tr[3]/td/div/a');
foreach( $nodes as $node )
{
echo $node->nodeValue . '<br>';
}

php domdocument or domxpath: how to extract TRs and save html

I have been struggling with this all day.
I have an html table in a string.
<TABLE>
<TBODY>
<TR CLASS=dna1>
<TD></TD><TD></TD><TD></TD><TD></TD>
</TR>
<TR CLASS=dna2>
<TD></TD><TD></TD><TD></TD><TD></TD>
</TR>
repeat...
Inside the <TD> are some <DIV> and <SPAN> that I need to work with.
I need to extract each <TR> (both classes) and save the html in an array where each <TR> is an array element.
Creating a node list array is easy enough, but how do I get the actual html?

If you must save the HTML as a string, there is DOMDocument::saveHTML():
$array = array(); // one HTML string per <tr>
$elems = $xpath->query('//tr');
foreach ($elems as $elem) {
$array[] = $doc->saveHTML($elem);
}
(Note that the parameter for saveHTML is available as of PHP 5.3.6.)
I'd recommend saving the nodes themselves, though, and converting them to string only shortly before you output them.
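A rough sketch of that recommendation: keep the DOMElement objects in an array, work with them as nodes, and only call saveHTML() at the point where strings are actually needed. The sample markup below is a filled-in stand-in for the table in the question, and the variable names are made up for the example.

```php
<?php
// Stand-in for the table from the question, with some cell content added.
$html = '<TABLE><TBODY>
<TR CLASS=dna1><TD><DIV>a</DIV></TD><TD><SPAN>b</SPAN></TD></TR>
<TR CLASS=dna2><TD><DIV>c</DIV></TD><TD><SPAN>d</SPAN></TD></TR>
</TBODY></TABLE>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Keep the nodes themselves: one DOMElement per <tr>.
$rows = iterator_to_array($xpath->query('//tr'));

// While they are still nodes, attributes and children stay easy to reach.
$classes = array();
foreach ($rows as $row) {
    $classes[] = $row->getAttribute('class');
}

// Convert to strings only when the HTML is actually needed.
$htmlRows = array_map(function (DOMElement $row) use ($doc) {
    return $doc->saveHTML($row);
}, $rows);

print_r($classes);
print_r($htmlRows);
```

Note that the HTML parser normalizes tag and attribute names to lower case, so the serialized rows will not match the upper-case input character for character.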

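Alternatively, using DOMDocument only:
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$rows = array();
if ($table = $dom->getElementsByTagName('table')->item(0)) {
// Collect the HTML of every row. The <tr> elements sit inside <tbody>,
// so look them up by tag name instead of iterating $table->childNodes.
foreach ($table->getElementsByTagName('tr') as $row) {
$rows[] = $dom->saveHTML($row);
}
}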
