Convert HTML table to text

I'm working on a project that requires converting HTML email into text. Below is a simplified version of the HTML code:
<table>
<tr>
<td width="10%"></td>
<td width="60%"> test product </td>
<td width="20%">5</td>
<td width="10%"> £50.00 </td>
</tr>
<tr>
<td></td>
<td colspan="3" width="100%"> Project Name: Test Project </td>
</tr>
<tr>
<td width="10%"> </td>
<td colspan="2" width="80%"> Page 1 : 01 New York 1.jpg </td>
<td width="10%"> £0.00 </td>
</tr>
</table>
The expected outcome should look like this in a text file (with columns aligned nicely):
test product 5 £50.00
Project Name: Test Project
Page 1 : 01 New York 1.jpg £0.00
My idea is to parse the HTML content with DOMDocument. I would set a default width for the table (e.g. 100 spaces), then convert the width of each column from a percentage to a number of spaces (based on the colspan and width attributes of each <td> tag). Then I would subtract the strlen of the data in each column from that column's width to get the number of spaces I need to pad on the right so that everything aligns vertically.
I have been working that way and haven't achieved what I want yet. I'm wondering whether the approach is stupid; if anyone knows a better way, please help me out.
Also, when it comes to multibyte languages (Japanese, Korean, etc.), I don't think my approach would work, because their characters are wider than one space and the result ends up a mess.
Can someone help me out please?

Don't reinvent the wheel. Table rendering is difficult, rendering tables using only text is even more difficult.
To clarify the complexity of a text-based table renderer that offers all the features of HTML, take a look at w3m, which is open source:
these 3000 lines of code are there only to display html tables.
Transform HTML to Text
There are text-based browsers that can be used from the command line, like lynx.
You could fwrite your HTML table into a file, pass that file to the text-based browser and take its output.
Note: text-based browsers are generally used in a shell, which generally displays in monospace. A monospace font remains a prerequisite for the columns to line up.
lynx and w3m are both available on Windows and you don't need to "install" them; you just need to have the executables and the permission to run them from PHP.
code example:
<?php
$table = '<table><tr><td>foo</td><td>bar</td></tr></table>'; // this contains your table
$html = "<html><body>$table</body></html>";
// write the HTML to a temporary file
$tmpfname = tempnam(sys_get_temp_dir(), "tblemail");
file_put_contents($tmpfname, $html);
// the temp file has no .html extension, so tell w3m the content type explicitly
$myTextTable = shell_exec("w3m.exe -dump -T text/html " . escapeshellarg($tmpfname));
unlink($tmpfname);
w3m.exe needs to be in your working directory.
(didn't try it)
Render a Text table
If you want a native PHP solution, there's also at least one framework (https://github.com/c9s/CLIFramework) aimed at PHP console applications which has a table renderer.
It doesn't transform HTML to text, but it helps you build a text-formatted table with support for multiline cells (which seems to be the most complicated part).
Using CLIFramework you would need code like this to render your table:
<?php
require 'vendor/autoload.php';
use CLIFramework\Component\Table\Table;
$table = new Table;
$table->addRow(array(
"test product", "5", "£50.00"
));
$table->addRow(array(
"Project Name: Test Project", "", ""
));
$table->addRow(array(
"Page 1 : 01 New York 1.jpg", "", "£0.00"
));
$myTextTable = $table->render();
The CLIFramework table renderer doesn't seem to support anything similar to "colspan" however.
Here's the documentation for the table component: https://github.com/c9s/CLIFramework/wiki/Using-Table-Component
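The padding idea from the question can also be sketched in plain PHP. This is only a minimal sketch, assuming the mbstring extension is loaded and that every cell already carries its percentage width (for a colspan cell, the combined width). padCell and renderRow are hypothetical helper names, not part of any library. mb_strwidth() counts East Asian wide characters as two columns, which addresses the multibyte concern as long as the output is shown in a monospace font.

```php
<?php
// Hypothetical helper: pad $text with spaces up to $width display columns.
// mb_strwidth() counts East Asian wide characters as 2 columns.
function padCell(string $text, int $width): string
{
    $text = trim($text);
    $pad = max(0, $width - mb_strwidth($text, 'UTF-8'));
    return $text . str_repeat(' ', $pad);
}

// Hypothetical helper: render one row from [text, percentWidth] pairs,
// assuming the whole table is $total columns wide.
function renderRow(array $cells, int $total = 100): string
{
    $line = '';
    foreach ($cells as [$text, $percent]) {
        $line .= padCell($text, (int) round($total * $percent / 100));
    }
    return rtrim($line);
}

// Rows taken from the question; widths are the width attributes.
echo renderRow([['', 10], ['test product', 60], ['5', 20], ['£50.00', 10]]), "\n";
echo renderRow([['', 10], ['Page 1 : 01 New York 1.jpg', 80], ['£0.00', 10]]), "\n";
```

Cells that are longer than their column are left untouched rather than truncated, so a very long value will still push the following columns to the right.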

Related

Splitting a Value from a String and Appending it to the end of that Value in PHP

I don't know if this is possible as I'm unable to find how I can do this and I'm very new to PHP but here's an overview of my issue:
I have a script which reads a CSV file. One of the columns contains cells which contain HTML tables. At varying positions within all of the tables there exists a table row which contains <td>Retail</td> and then the price such as <td>$300</td> for example. An example is below which I have formatted so that it's easier for you to read, but this is returned as a continuous string from the CSV file normally:
<table>
<tr>
<td>Designer</td>
<td>Hermes</td>
</tr>
<tr>
<td>Size inch</td>
<td>5.9 x 4.3 x 2.4</td>
</tr>
<tr>
<td>Material</td>
<td>Cotton</td>
</tr>
<tr>
<td>Retail</td>
<td>$300.00</td>
</tr>
<tr>
<td>Made in</td>
<td>France</td>
</tr>
</table>
These tables are then required to have the CAD [Canadian Dollars] retail price added to them. Example below of the desired end result:
<table>
<tr>
<td>Designer</td>
<td>Hermes</td>
</tr>
<tr>
<td>Size inch</td>
<td>5.9 x 4.3 x 2.4</td>
</tr>
<tr>
<td>Material</td>
<td>Cotton</td>
</tr>
<tr>
<td>Retail USD</td>
<td>$300.00</td>
</tr>
<tr>
<td>Retail CAD</td>
<td>$410.00</td>
</tr>
<tr>
<td>Made in</td>
<td>France</td>
</tr>
</table>
I have looked at using substr() but it looks as though you need to specify the length of characters that will be ignored from the start of the string which isn't possible for me here as the data varies.
So therefore my question is whether it's at all possible to specifically split the price out from the string and then append it back in after the </tr> so that the result is as above. If you could point me in the right direction of the functions that I would need to use to achieve this then I would really appreciate it. Please bear in mind I am already using str_replace() to rename Retail to Retail USD and I already have a variable created ready to convert USD price to a CAD price which uses a finance API.
Thank you in advance for any insight you can offer me here.

I have looked at using substr() but it looks as though you need to specify the length of characters that will be ignored from the start of the string which isn't possible for me here as the data varies.
So use stripos to find the start of the string you want to replace.
However, the more I dig into this, the quicker it becomes a mess. It would be better to edit the CSV generator rather than trying to mutate your CSV. In an ideal world it would also be better if your CSV contained only data and not HTML.
Apologies, the following became a large and probably unwieldy answer.
To do it, you need to isolate this CSV column into a variable $csvData, then work with it directly:
$csvData = "<table data from your question>";
// mark the end of every cell so the cells can be split apart later
$csvData = str_replace("</td>","*!*</td>",$csvData);
// remove all the HTML junk
$csvDataClean = strip_tags($csvData);
// form an array.
$csvDataArray = explode("*!*",$csvDataClean);
// trim contents of the array.
$csvDataArray = array_map('trim', $csvDataArray);
// remove empty array values (keys are preserved).
$csvDataArray = array_filter($csvDataArray);
// build new contents array.
$csvArray = array();
foreach($csvDataArray as $key=>$value){
    if($key%2 == 0){
        // even key: it's a content header.
        $value = str_replace(" ","_",$value);
        $lastHeader = preg_replace("/[^a-z0-9-_]/i","",$value);
    }
    else {
        // odd key: it's the value for the last header.
        $csvArray[$lastHeader] = $value;
    }
}
// tidy up.
unset($key,$value,$lastHeader,$csvDataArray,$csvDataClean);
print_r($csvArray);
This will now output an array of headers and values from your HTML table. You can then easily reference values from this array and recompile them into an HTML table as necessary.
Using phpsandbox I can output:
Array
(
[Designer] => Hermes
[Size_inch] => 5.9 x 4.3 x 2.4
[Material] => Cotton
[Retail] => $300.00
[Made_in] => France
)
So you can then take $csvArray['Retail'] and process this value to get the other currency values, and add them to this array. Then you can run this array through another process to rebuild a table to save into the CSV (although this isn't recommended; it's better to save the array as a CSV itself, but I don't know your requirements).
So:
// keep the original price as the USD value
$csvArray['Retail_USD'] = $csvArray['Retail'];
// whatever system you currently use to get the conversion.
$csvArray['Retail_CAD'] = convert_currency($csvArray['Retail']);
unset($csvArray['Retail']);
And now rebuild the HTML table:
$csvOutput = "";
foreach($csvArray as $key=>$value){
    $csvOutput .= "<tr><td>".str_replace("_"," ",$key)."</td><td>".$value."</td></tr>\n";
}
unset($key,$value);
$csvOutput = "<table>".$csvOutput."</table>";
print_r($csvOutput);
You can also manually delete and re-add the Made_in array key if you want to keep it as the final array value:
// keep the original price as the USD value
$csvArray['Retail_USD'] = $csvArray['Retail'];
// whatever system you currently use to get the conversion.
$csvArray['Retail_CAD'] = convert_currency($csvArray['Retail']);
....
$value = $csvArray['Made_in'];
unset($csvArray['Made_in']);
$csvArray['Made_in'] = $value;
This is a hacky but quick way of keeping the "made in" column after the new Retail columns added above.

What you pasted here is a html table, not csv.
Anyway, there are several ways to manipulate strings. str_replace() is one of the most basic ones, so you got that already. In your case, you're probably best off using regular expressions. It's like str_replace but much more powerful. There are plenty of tutorials out there.
If you want to do a lot and more complex manipulation of html or xml data, you may want to have a look at XSLT.
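A regular expression version of the question's task could look like the sketch below. It is only an illustration under some assumptions: the Retail row has exactly the shape shown in the question, the price is a dollar amount, and addCadWithRegex and the $toCad callback are hypothetical names standing in for whatever conversion the finance API provides.

```php
<?php
// Hypothetical helper: rename the Retail row to "Retail USD" and add a
// "Retail CAD" row right after it. $toCad converts a float USD amount.
function addCadWithRegex(string $html, callable $toCad): string
{
    // "~" is used as delimiter so the "/" in closing tags needs no escaping.
    $pattern = '~<tr>\s*<td>Retail</td>\s*<td>\$([0-9.,]+)</td>\s*</tr>~i';
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($toCad): string {
            $usd = (float) str_replace(',', '', $m[1]);
            $cad = number_format($toCad($usd), 2);
            return '<tr><td>Retail USD</td><td>$' . $m[1] . '</td></tr>'
                 . '<tr><td>Retail CAD</td><td>$' . $cad . '</td></tr>';
        },
        $html
    );
    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```

Tables without a matching Retail row pass through unchanged, which makes it easy to run the function over every CSV cell.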

I had to deal with a similar scenario once. What I would do is:
1.- First build your desired output block in a variable $output_block, i.e.:
<td>Retail USD</td><td>$300.00</td></tr><tr><td>Retail CAD</td><td>$410.00</td>
Note: you don't need the first opening tr tag or the last closing one, because you already have those in your original output.
2.- Find the position of <td>Retail</td>
(use strpos)
3.- Save the substring you have before that position in a variable, i.e. $first_part
4.- Find the position of <td>Made in</td>
5.- Save the substring you have after this in a variable: $last_part
6.- Your final output: $final_output = $first_part . $output_block . $last_part;
easy cake... ;)
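The steps above can be sketched roughly as follows. One deliberate difference, stated as an assumption: instead of searching for <td>Made in</td>, the sketch looks for the </tr> that closes the Retail row, because the question says the Retail row sits at varying positions. addCadRow is a hypothetical name and $cadPrice is assumed to be formatted already.

```php
<?php
// Hypothetical helper: insert a "Retail CAD" row after the Retail row of $html.
// $cadPrice is assumed to be formatted already, e.g. '$410.00'.
function addCadRow(string $html, string $cadPrice): string
{
    $label = '<td>Retail</td>';
    $start = strpos($html, $label);
    if ($start === false) {
        return $html; // no Retail row: leave the table untouched
    }
    // Find the end of the row that contains the Retail label.
    $rowEnd = strpos($html, '</tr>', $start);
    if ($rowEnd === false) {
        return $html;
    }
    $rowEnd += strlen('</tr>');

    $labelEnd  = $start + strlen($label);
    $firstPart = substr($html, 0, $start);                     // everything before the label
    $retailRow = substr($html, $labelEnd, $rowEnd - $labelEnd); // price cell and closing </tr>
    $lastPart  = substr($html, $rowEnd);                       // everything after the row

    return $firstPart
        . '<td>Retail USD</td>' . $retailRow
        . '<tr><td>Retail CAD</td><td>' . $cadPrice . '</td></tr>'
        . $lastPart;
}
```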

PHP resorting tr's in a table by class

I have a pretty large table which I put on the page with a php call
<?php include('7c2dsf12c24-4441e-532ded8-88dsc7-4fsd2c8.txt'); ?>
That file has thousands of TR's and TD's within.
The text file is dynamically created and updated every couple of hours.
Some of the rows have a "featuredRow" class on them, which helps with styling.
However, they appear in a random order in that text file.
I need to sort them so that the featured rows go first. Basically take all the rows, and put all the featuredRows at the top of the table, followed by all the other rows.
I already have JavaScript code that sorts the table alphabetically by different td's, but it is front-end sorting and the table consists of thousands of tr's (the text file is 7 MB of text), so doing the initial filtering that way is quite a strain on Internet Explorer. The user expects a long wait when reordering the entire table alphabetically, but should not have to wait 30 seconds for the initial ordering (what takes 2-3 seconds in Chrome takes 20-30 seconds in IE).
Therefore I figured that doing it on the back end and displaying a reordered text file right away would be better than using the DOM to create huge arrays and lag out the user's browser.
TL/DR
As an example, the file's structure looks like something like this:
<tr><td></td><td></td></tr>
<tr class="featuredRow"><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr class="featuredRow"><td></td><td></td></tr>
<tr class="featuredRow"><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr class="featuredRow"><td></td><td></td></tr>
<tr><td></td><td></td></tr>
I need to take that file and reorder the structure to this:
<tr class="featuredRow"><td></td><td></td></tr>
<tr class="featuredRow"><td></td><td></td></tr>
<tr class="featuredRow"><td></td><td></td></tr>
<tr class="featuredRow"><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
And I do not want to use JS since there are thousands of rows, megabytes of data, and it will take a long time on IE to do it on the front-end.
What is the easiest way to make it work in PHP?
Thank you
P.S.
Here is how the html/php looks like now - jsfiddle.net/1pggwuah
Here is another link on how two trs of the text file look like (there are about 3,000-4,000 of those trs in the text file) jsfiddle.net/a308w8b6

$rows = file_get_contents('/path/to/rows.html');
$rows = explode('<tr', $rows);
sort($rows);
$rows = implode('<tr', $rows);
Demo: https://ideone.com/bITopF
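Note that sort() also reorders the rows inside each group alphabetically. If the original order inside each group matters, a partition along the following lines might work. It is only a sketch that assumes every row starts with a literal <tr and that the class attribute uses double quotes; featuredFirst is a hypothetical name.

```php
<?php
// Hypothetical helper: move rows carrying class="featuredRow" to the top,
// keeping the original relative order inside each group.
function featuredFirst(string $rowsHtml): string
{
    // Split right before each "<tr" without consuming it.
    $rows = preg_split('~(?=<tr\b)~i', $rowsHtml, -1, PREG_SPLIT_NO_EMPTY);
    $featured = [];
    $others = [];
    foreach ($rows as $row) {
        if (preg_match('~^<tr\b[^>]*\bclass="[^"]*\bfeaturedRow\b~i', $row)) {
            $featured[] = $row;
        } else {
            $others[] = $row;
        }
    }
    return implode('', $featured) . implode('', $others);
}
```

The result could be written back with file_put_contents() whenever the text file is regenerated, so the include itself stays unchanged.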

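$domd = new DOMDocument();
// @ suppresses the warnings libxml raises for imperfect HTML
@$domd->loadHTMLFile('7c2dsf12c24-4441e-532ded8-88dsc7-4fsd2c8.txt');
$masterele = $domd->getElementsByTagName("table")->item(0);
// copy the live node list into an array first; moving nodes while iterating it can skip rows
foreach (iterator_to_array($domd->getElementsByTagName("tr")) as $trele) {
    if ($trele->getAttribute("class") !== "featuredRow") { continue; }
    $trele->parentNode->removeChild($trele);
    $masterele->insertBefore($trele, $masterele->firstChild);
}
echo $domd->saveHTML();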
EDIT: the original code would put "featuredRow"'s at the bottom, not the top, sorry, fixed it now (use insertBefore instead of append)

Php_simple_html_dom on a table

I would like to extract data from a website, whose code is written like this:
...
<tr>
<td class="something1"><a class="whatever" href="#">NAME</a> </td>
<td class="something2">DATA</td>
<td class="something3">NUMERIC DATA</td>
</tr>
...
In particular, I have my NAME list from my MySQL database, and if my NAME is equal to a NAME on this website, I want to print the corresponding NUMERIC DATA on my website.
I know I can do something with php_simple_html_dom but I cannot really achieve this action. Can you please help me?
Thanks!

So you want to read NAME first and, if it is relevant, read the rest? You can fetch a website's HTML as explained here: How do I get the HTML code of a web page in PHP?
$html = file_get_contents('http://pathToTheWebsite.com/thePage');
Now let's parse $html with a regex (you can use that library too; its documentation tells you how to do it):
preg_match('~<td class="something1"><a class="whatever" href="#">(?<name>\w+)</a> </td>~', $html, $matches);
Now $matches['name'] will contain the NAME. You can do the same for the rest, and maybe clean up that regex a little; this was just an example.
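If the markup varies more than a fixed regex can handle, PHP's built-in DOM extension is another option. The sketch below is an illustration only: it uses DOMDocument and DOMXPath rather than php_simple_html_dom, assumes the class names from the question, and findNumericData is a hypothetical name.

```php
<?php
// Hypothetical helper: return the text of td.something3 for the row whose
// link text equals $name, or null when no such row exists.
function findNumericData(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//tr[td[@class="something1"]/a]') as $row) {
        $link = $xpath->query('td[@class="something1"]/a', $row)->item(0);
        if ($link !== null && trim($link->textContent) === $name) {
            $cell = $xpath->query('td[@class="something3"]', $row)->item(0);
            return $cell !== null ? trim($cell->textContent) : null;
        }
    }
    return null;
}
```

Each NAME coming from the MySQL result could be passed to this function in a loop, with the page fetched once beforehand.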

Testing for a string in a particular element node

in PHP Unit (using Selenium Server) i'm trying to check if a particular element node in xpath has a certain string value, for instance
<table>
<thead></thead>
<tbody>
<tr>
<th>1</th>
<td>value 1</td>
<td>value 2</td>
<td>value c</td>
</tr>
<tr>
<th>2</th>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
with the above code, using the xpath
isElementPresent('//table/tbody/tr[last()-0]/td[last()-1]');
it would return true, if i changed tr[last()-0] to tr[last()-1] it would still return true,
naturally, the isElementPresent would be in a loop with the xpath generated in the loop as well (substituting the integer for $i and $j which are used in the for() loop), and as it is, that would be fine. however, what i want to check is that the <td> has nothing in it
using the same html code above, if i change the xpath
isElementPresent('//table/tbody/tr[last()-$i]/td[last()-1 and text()="${nbsp}"]');
you would think that it would return true at //table/tbody/tr[last()-0]/td[last()-1 and text()="${nbsp}"] and false at //table/tbody/tr[last()-1]/td[last()-1 and text()="${nbsp}"] however here is the kicker
using Selenium IDE 1.10.0 Plugin for Firefox to check the xpath by putting it in the Target Box and hitting find (to check that it will locate the xpath when it should), //table/tbody/tr[last()-0]/td[last()-1 and text()="${nbsp}"] doesn't highlight the 2nd last td in the last tr, it highlights the first td in the last tr, as if the xpath was //table/tbody/tr[last()-0]/td[last()-2]
from my experiments, it seems to be treating the xpath like //table/tbody/tr[last()-0]/td[text()="${nbsp}"] which would only be the FIRST instance in which text is a blank space, not good if the 2nd tr was like
<tr>
<th>2</th>
<td>cows are my friends</td>
<td>let's go to my room pig!</td>
<td> </td>
</tr>
and i was to use isElementPresent('//table/tbody/tr[last()-0]/td[last()-1 and text()="${nbsp}"]'); it would still return true as its not looking at last()-1 but last()-0
so my question is, how can i check if a particular element node has a certain string
NOTE 1: i use last()-# because this page http://www.w3schools.com/xpath/xpath_syntax.asp says that in IE5 and later [0] is the first node, and not [1] as in Firefox or Chrome, which in a sense makes sense since that's how an array's index works. for full compatibility, i start from the last node and work backwards: last() works with IE5 and later, and the logic of moving backwards through the nodes should be the same (unless microsoft wants to redefine that logic)
NOTE 2: i am well aware that a simple fix is to add title or id attributes to the table however the page i'm making the test for was done by someone else, i would like to avoid modifying the page just to suit test cases
NOTE 3: the table i'm testing is populated using a JSON string so my test is when there is no data for the table, is it blank, if not, than the JSON string is adding data to the table when it shouldn't
EDIT 1: seems like ${nbsp} doesn't work in php, only in the Selenium IDE 1.10.0 Plugin seemed to recognize it however inserting a space by holding Alt and typing in 0160 worked just as fine
EDIT 2: for the time being i have added id attributes to the tags to get this working and it works perfectly fine with checking @id=[VALUE] and text()=[VALUE], but it would still be good to get this question answered: while i add id, title and/or class attributes to all my html tags, the person who originally made the table i was testing obviously didn't, and as i said in NOTE 2, 'i would like to avoid modifying the page just to suit test cases'

@Memor-X Use SimpleXML to load the xml. Then it should be easy to test in phpunit. If I was you I would build in dependency tests to validate the xml structure before testing the contents.
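A small SimpleXML experiment may help explain the behaviour described in the question. In XPath 1.0 a predicate is positional only when it evaluates to a number; "last()-1 and text()=..." evaluates to a boolean, so the position part just becomes "true" for any non-zero number. The sketch below is an illustration, not a Selenium test: countMatches is a hypothetical helper and the row is a shortened version of the one in the question.

```php
<?php
// Hypothetical helper: count the nodes an XPath expression selects in $xml.
function countMatches(string $xml, string $xpath): int
{
    $doc = simplexml_load_string($xml);
    return count($doc->xpath($xpath));
}

$nbsp = "\u{00A0}"; // non-breaking space, the character behind &nbsp;
$xml = '<table><tbody><tr><th>2</th>'
     . '<td>cows are my friends</td><td>second</td><td>' . $nbsp . '</td>'
     . '</tr></tbody></table>';

// "last()-1 and ..." is a boolean test, so any td whose text is nbsp matches.
$loose = '//tbody/tr[last()]/td[last()-1 and text()="' . $nbsp . '"]';
// Two predicates: first pick the second-to-last td, then test its text.
$strict = '//tbody/tr[last()]/td[position()=last()-1][text()="' . $nbsp . '"]';

echo countMatches($xml, $loose), "\n";
echo countMatches($xml, $strict), "\n";
```

Here the loose expression should still find the last, empty cell, while the strict one finds nothing because the second-to-last cell contains text. Whether Selenium's own XPath engine behaves identically in every browser is not something this sketch verifies.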

Parsing information between known variables

I dont mean to be a bother and I know this has been asked a thousand times before but i'm just not understanding the concept. I was wondering if somebody could walk me through it, Here is what i'm trying to do:
I have a set of information inside an html file. The file is uploaded to the server and i need to parse information out of the file inside of set parameters (demo code to follow). I have been reading on parsing for over a week and understand some of it but just not grasping the concept, i guess i just need somebody to do one on this demo for me to understand and if you could, break down the search variables please. Here's the demo:
<hr>
<a id="Operating_System"></a>
<table WIDTH="100%" BORDER="0" CELLSPACING="0" ALIGN="CENTER">
<CAPTION ALIGN="TOP"><FONT size="5">Operating System</FONT></CAPTION>
<tr><td>Top</td></tr>
<TR ALIGN="LEFT" BGCOLOR="#00FF00">
<TH>Property</TH>
<TH>Value</TH>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Name</TD>
<TD>Windows 7 Professional x64 Service Pack 1</TD>
</TR>
<TR>
<TD>Features</TD>
<TD>Terminal Services in Remote Admin Mode, 64 Bit Edition, Media Center Edition, Multiprocessor Free</TD>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Up Time</TD>
<TD>5 Days 22 Hours 4 Minutes 26 seconds</TD>
</TR>
<!-- Operating System Duration: 1.853 seconds -->
</table>
<hr>
<a id="Installed_Updates"></a>
<table WIDTH="100%" BORDER="0" CELLSPACING="0" ALIGN="CENTER">
<CAPTION ALIGN="TOP"><FONT size="5">Installed Updates</FONT></CAPTION>
and here is what i'm trying to accomplish. On this demo, i need the information parsed but only certain information to come back. there is a lot more information here, but i only need about 30 things in total from each document. first i need to search from Operating_System to Installed_Updates; this gives me the first area i need to gather information from (there are other groups too, so i'll make one search for each group of information). Then i need to make the search more specific, such as from <TR> to </TR>, which gives me the actual information set i need. After that, just grab the first 'name' and 'value' to store in a database.
Again, i know it's out there but i'm just not getting the whole concept of simple expressions. After i do it a few times on an actual document, i'll get the hang of it i think.
Thank you all so much for the help, i really appreciate it.

This only works for fixed HTML with few variations. But if you just want a simple example, here is one:
preg_match('#<TD>Up Time</TD>.*?<TD>([\w ]+)</TD>#is', $html, $match);
print $match[1]; // the text captured by ([\w ]+)
See also https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for some tools. And http://regular-expressions.info/ to learn the syntax.
But as said, if you want to extract a lot of values, there are easier options.
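One such option is PHP's DOM extension. The following is a minimal sketch, assuming each section is a table that follows an <a id="..."> anchor as in the demo and that data rows contain exactly two td cells; readSection is a hypothetical name. libxml's HTML parser lowercases tag names, so the upper-case TR and TD tags in the file are matched by lower-case XPath names.

```php
<?php
// Hypothetical helper: return [property => value] for the first table that
// follows <a id="$anchorId"> in $html.
function readSection(string $html, string $anchorId): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Only rows with exactly two <td> cells are property/value rows;
    // this skips the "Top" row and the <th> header row.
    $rows = $xpath->query(
        '//a[@id="' . $anchorId . '"]/following::table[1]//tr[count(td)=2]'
    );
    $result = [];
    foreach ($rows as $row) {
        $cells = $xpath->query('td', $row);
        $result[trim($cells->item(0)->textContent)] = trim($cells->item(1)->textContent);
    }
    return $result;
}
```

Calling readSection($html, 'Operating_System') would then give an array whose 'Name' and 'Up Time' entries can be stored in the database, and the same call with another anchor id covers the other groups.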
