I recently read about the DOM module in PHP and am now trying to use it to parse an HTML document. The page said this is a much better solution than using preg, but I'm having a hard time figuring out how to use it.
The page contains a table with dates and a variable number of events for each date.
First I need to get the text (a date) from a tr with valign="bottom", and then I need all the column values from every tr with valign="top" that follows it, up until the next tr with valign="bottom" (the next date). The number of tr elements with column data is unknown; it can be zero or many.
This is what the HTML on the page looks like:
<table>
<tr valign="bottom">
<td colspan="4">2009-02-26</td>
</tr>
<tr valign="top">
<td>21:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>23:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="bottom">
<td colspan="4">2009-02-27</td>
</tr>
<tr valign="top">
<td>06:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>10:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>13:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
</table>
So far I've been able to get the first two dates (I'm only interested in the first two) but I don't know how to go from here.
The XPath query I use to get the date trs is
$result = $xpath->query('//tr[@valign="bottom"][position()<3]');
Now I need a way to connect all the events for a day to its date, i.e. select all the tds and all the column values up until the next date tr.
$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();
$html = new DOMDocument();
$html->loadHtmlFile('http://url/table.html');
$xpath = new DOMXPath( $html );
$elements = $xpath->query( "//table/tr" );
foreach ( $elements as $item ) {
    $newDom = new DOMDocument;
    $newDom->appendChild($newDom->importNode($item,true));
    $xpath = new DOMXPath( $newDom );
    foreach ($item->attributes as $attribute) {
        for ($node = $item->firstChild; $node !== NULL;
             $node = $node->nextSibling) {
            if (($attribute->nodeName =='valign') && ($attribute->nodeValue=='top'))
            {
                print($node->nodeValue);
            }
            else
            {
                print("<br>".$node->nodeValue);
            }
        }
        print("<br>");
    }
}
libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );
Use the following-sibling axis.
These XPath expressions
/table/tr/td[@colspan=4]
or
/table/tr[@valign='bottom']/td
result in a node set with the date cells.
How do you get the cells between two marks?
/table/tr/td[not(@colspan=4)][preceding::td[@colspan=4][1]='2009-02-26']
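As an alternative to a single XPath expression, the grouping can also be done in a plain loop over the rows. The following is a minimal sketch, assuming the markup from the question (shortened here) and that a row with valign="bottom" always starts a new date; `$eventsByDate` and `$currentDate` are illustrative names, not part of the original code.

```php
<?php
// Sample markup based on the question, shortened to two dates.
$html = '<table>
<tr valign="bottom"><td colspan="4">2009-02-26</td></tr>
<tr valign="top"><td>21:00</td><td>Column data</td><td>Column data</td><td>Column data</td></tr>
<tr valign="top"><td>23:00</td><td>Column data</td><td>Column data</td><td>Column data</td></tr>
<tr valign="bottom"><td colspan="4">2009-02-27</td></tr>
<tr valign="top"><td>06:00</td><td>Column data</td><td>Column data</td><td>Column data</td></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical result structure: date string => list of rows, each a list of cell texts.
$eventsByDate = [];
$currentDate = null;

// Walk the rows in document order; a "bottom" row starts a new date group.
foreach ($xpath->query('//table//tr') as $row) {
    if ($row->getAttribute('valign') === 'bottom') {
        $currentDate = trim($row->textContent);
        $eventsByDate[$currentDate] = [];
    } elseif ($currentDate !== null) {
        $cells = [];
        foreach ($xpath->query('./td', $row) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $eventsByDate[$currentDate][] = $cells;
    }
}

print_r($eventsByDate);
```

If only the first two dates are of interest, `array_slice($eventsByDate, 0, 2, true)` would keep just those groups. A date with no event rows simply ends up with an empty list.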
Related
I have a string, which consists of any html elements.
For example, I have this string:
$htmlString = '<p>Test</p>
<h2>Test2</h2>
<table>
<thead>
<tr>
<td>Header 1</td>
<td>Header 2</td>
</tr>
</thead>
<tbody>
<tr>
<td>Col 1</td>
<td>Col 2</td>
</tr>
</tbody>
</table>
<span>Test span </span>
';
As you can see, the string consists of <p>, <h2>, <table>, <span> tags, and it could also contain other html tags.
My question is: is there a way to remove every element from the string except the <table>? You can assume that the table element contains no tags other than thead, tr, td, and tbody.
This will probably be closed as a duplicate, but before that happens here’s some quick code to help you with your specific HTML. Instead of “removing” everything except your target text, we are “extracting” our target text. The code itself is pretty straightforward so I didn’t see a need to comment things as much as I usually do.
<?php
$htmlString = '<p>Test</p>
<h2>Test2</h2>
<table>
<thead>
<tr>
<td>Header 1</td>
<td>Header 2</td>
</tr>
</thead>
<tbody>
<tr>
<td>Col 1</td>
<td>Col 2</td>
</tr>
</tbody>
</table>
<span>Test span </span>
';
$dom = new DOMDocument();
// Set before loading; assigning it after loadHTML() cannot affect parsing (true is the default anyway)
$dom->preserveWhiteSpace = true;
$dom->loadHTML($htmlString, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$tables = $dom->getElementsByTagName('table');
foreach ($tables as $table) {
    var_dump($dom->saveHTML($table));
}
Demo here: https://3v4l.org/YjkdT
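This may be the solution you are looking for:
https://www.php.net/manual/en/function.strip-tags.php
<?php
// The allow-list must name every tag to keep, not only <table>.
// Note that strip_tags() removes tags only; the text inside removed tags stays.
$stripped = strip_tags($htmlString, '<table><thead><tbody><tr><td>');
?>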
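One caveat with strip_tags() is easier to see in action: it drops the tags themselves but keeps their text content. A minimal sketch with a shortened, single-line version of the sample string (the shortened string is an assumption made for readability):

```php
<?php
$htmlString = '<p>Test</p><table><tr><td>Col 1</td></tr></table><span>Test span</span>';

// Allow every tag that may appear inside the table, not just <table> itself.
$stripped = strip_tags($htmlString, '<table><thead><tbody><tr><td>');

// The <p> and <span> tags are gone, but their text is still in the string.
echo $stripped;
```

So if the surrounding text must disappear as well, extracting the table with DOMDocument is likely the better fit.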
I have a question about how to underline table cells according to column data. The example code below shows the problem I am facing:
If the Underline column is 1, the first name in that row should be underlined; if it is 0, the first name should not be underlined. The sample below is hardcoded. In the real situation I have too many rows to add text-decoration: underline; to each td one by one, so I hope someone can guide me on how to solve this. I am using PHP code to set a variable that defines the underline.
<!-- Below is only the logic I have in mind, because I don't know how to read the value of the Underline column -->
<?php
// $underline is a placeholder for the value in the row's Underline column
if ($underline == 1) {
    $add_underline = "text-decoration: underline;";
}
if ($underline == 0) {
    $add_underline = "";
}
?>
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Underline</th>
</tr>
<tr>
<td style="<?php echo $add_underline;?> ">Jill</td>
<td>Smith</td>
<td>1</td>
</tr>
<tr>
<td style="<?php echo $add_underline;?>">Eve</td>
<td>Jackson</td>
<td>0</td>
</tr>
<tr>
<td style="<?php echo $add_underline;?>">John</td>
<td>Doe</td>
<td>1</td>
</tr>
</table>
My current output looks like the picture below:
My expected result looks like the picture below, with Jill and John underlined:
Why not use JavaScript to achieve this? No matter what the server sends, it will check whether 1 is set and underline accordingly. You would have to use classes to find the table cells holding the values: I added class='name' to the name <td> tags and class='underline' to the underline <td> tags.
// get the elements with a class of 'name'
let names = document.getElementsByClassName('name');
// get the elements with a class of 'underline'
let underline = document.getElementsByClassName('underline');
// loop over the elements with for and use the index to get and set values
// `i` will iterate until it reaches the length of the list of elements with class of underline
for (let i = 0; i < underline.length; i++) {
    // use the index to get the text content; Number() converts the string to a number for strict comparison
    if (Number(underline[i].textContent) === 1) {
        // underline the name in rows where the value is 1
        names[i].style.textDecoration = "underline";
    }
}
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Underline</th>
</tr>
<tr>
<td class="name">Jill</td>
<td>Smith</td>
<td class='underline'>1</td>
</tr>
<tr>
<td class="name">Eve</td>
<td>Jackson</td>
<td class='underline'>0</td>
</tr>
<tr>
<td class="name">John</td>
<td>Doe</td>
<td class='underline'>1</td>
</tr>
</table>
Or using the td child values...
let tr = document.querySelectorAll("tr");
// start at 1 to skip the header row
for (let i = 1; i < tr.length; i++) {
    if (Number(tr[i].lastElementChild.innerHTML) === 1) {
        tr[i].firstElementChild.style.textDecoration = "underline";
    }
}
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Underline</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>1</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>0</td>
</tr>
<tr>
<td>John</td>
<td>Doe</td>
<td>1</td>
</tr>
</table>
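Since the question asks about PHP, the same decision can also be made on the server while the rows are printed. This is a minimal sketch, assuming the rows are available as an array (for example from a database query); the `$rows` data and the `renderRow()` helper are hypothetical and not part of the original code.

```php
<?php
// Hypothetical row data; in practice this would come from a database query.
$rows = [
    ['firstname' => 'Jill', 'lastname' => 'Smith',   'underline' => 1],
    ['firstname' => 'Eve',  'lastname' => 'Jackson', 'underline' => 0],
    ['firstname' => 'John', 'lastname' => 'Doe',     'underline' => 1],
];

// renderRow() is an illustrative helper that builds one <tr> as a string.
function renderRow(array $row): string
{
    // Decide the style per row from that row's own "underline" value.
    $style = ((int) $row['underline'] === 1) ? 'text-decoration: underline;' : '';

    return sprintf(
        "<tr><td style=\"%s\">%s</td><td>%s</td><td>%d</td></tr>\n",
        $style,
        htmlspecialchars($row['firstname']),
        htmlspecialchars($row['lastname']),
        $row['underline']
    );
}

$tableHtml = "<table style=\"width:100%\">\n"
    . "<tr><th>Firstname</th><th>Lastname</th><th>Underline</th></tr>\n";
foreach ($rows as $row) {
    $tableHtml .= renderRow($row);
}
$tableHtml .= "</table>\n";

echo $tableHtml;
```

The key point is that the style is computed inside the loop, once per row, instead of once for the whole table.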
Let's say I have the following HTML table:
<table>
<tbody>
<tr>
<th>Column 1</th>
<th>Column 2</th>
<th>Column 3</th>
<th>Column 4</th>
<th>Column 5</th>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
<td>Column 4</td>
<td>Column 5</td>
</tr>
</tbody>
</table>
How would I use the Simple HTML DOM parser to return the table without columns 4 and 5, i.e. only the first three columns of the table? I've started with this to get each row:
foreach ($html->find('tr') as $row) {
}
Would I run another for loop to iterate through the cells?
You need to remove the 4th and 5th columns of each row (if they exist, and I just assume they do). Please note children of an element are zero-indexed:
foreach ($html->find('tr') as $row) {
    // remove the 4th column
    $row->children(3)->outertext = '';
    // remove the 5th column
    $row->children(4)->outertext = '';
}
print $html; // Outputs the table without the 4th and 5th column of each row
For a related answer, see "Simple HTML Dom: How to remove elements?"
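For comparison, the same column removal can be sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. This is a different library from the one the question uses, and the XPath expression below is one possible way to select the extra cells, shown here under the assumption that every cell is a direct td or th child of its row.

```php
<?php
$html = '<table><tbody>
<tr><th>Column 1</th><th>Column 2</th><th>Column 3</th><th>Column 4</th><th>Column 5</th></tr>
<tr><td>Column 1</td><td>Column 2</td><td>Column 3</td><td>Column 4</td><td>Column 5</td></tr>
</tbody></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select every header or data cell beyond the third one in its row.
// position() is 1-based in XPath, unlike children() in Simple HTML DOM.
$extraCells = $xpath->query('//tr/*[self::td or self::th][position() > 3]');

// Copy to an array first so that removing nodes cannot disturb the iteration.
foreach (iterator_to_array($extraCells) as $cell) {
    $cell->parentNode->removeChild($cell);
}

$table = $dom->getElementsByTagName('table')->item(0);
echo $dom->saveHTML($table);
```

Unlike the fixed children(3) and children(4) calls, this also copes with rows that have fewer than five cells, because rows without extra cells simply match nothing.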
I'm trying to replace all TD tags within a THEAD to TH tags.
I figured using the PHP DOM extension would be best. I'm fairly new at it so I apologise for my lack of knowledge.
I did some searching and found how to replace tag names. However, I couldn't figure out how to only replace tag names within a parent (in this case the THEAD tag). I want to leave the TD's within the TBODY as is.
Here is my code to narrow down to the TD's within the THEAD. That's where I get lost.
How would I change the tag names in THEAD to TH?
$html = '<table>
<thead>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</thead>
<tbody>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</tbody>
</table>';
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// Get theads
$theads = $document->getElementsByTagName('thead');
// Loop through theads (in case there is more than one!)
for ($i = 0; $i < $theads->length; $i++) {
    $thead = $theads->item($i);
    // Loop through TR
    foreach ($thead->childNodes as $tr) {
        if ($tr->nodeName == 'tr') {
            // Loop through TD
            foreach ($tr->childNodes as $td) {
                if ($td->nodeName == 'td') {
                    // Replace this tag
                }
            }
        }
    }
}
The manual documents a ->replaceChild() method that you can use to replace the td tags with th tags:
$html = '<table>
<thead>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</thead>
<tbody>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</tbody>
</table>';
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// get the first thead
$thead = $document->getElementsByTagName('thead')->item(0);
foreach ($thead->childNodes as $tr) { // loop thead rows `tr`
    if (!$tr instanceof DOMElement) {
        continue; // skip text nodes, which have no getElementsByTagName()
    }
    $tds = $tr->getElementsByTagName('td'); // get tds inside trs
    // the node list is live, so iterate backwards while replacing
    $i = $tds->length - 1;
    while ($i > -1) {
        $td = $tds->item($i); // td
        $text = $td->nodeValue; // text content of the td
        $th = $document->createElement('th', $text); // th element with td node value
        $td->parentNode->replaceChild($th, $td); // replace
        $i--;
    }
}
echo $document->saveHTML();
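One limitation of copying nodeValue is that attributes and nested markup inside the td are lost. The sketch below is a variation on the same replaceChild() idea that scopes the search to thead with XPath and moves attributes and child nodes across; the sample markup (with a class attribute and a nested b tag) is an assumption added to make the difference visible.

```php
<?php
$html = '<table>
<thead><tr><td class="first">Column <b>1</b></td><td>Column 2</td></tr></thead>
<tbody><tr><td>Cell 1</td><td>Cell 2</td></tr></tbody>
</table>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Only cells inside a <thead> match; <tbody> cells are left alone.
foreach (iterator_to_array($xpath->query('//thead//td')) as $td) {
    $th = $document->createElement('th');

    // Copy attributes such as class or colspan.
    foreach ($td->attributes as $attribute) {
        $th->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }

    // Move the children (text and nested tags) instead of flattening them to text.
    while ($td->firstChild !== null) {
        $th->appendChild($td->firstChild);
    }

    $td->parentNode->replaceChild($th, $td);
}

$result = $document->saveHTML($document->getElementsByTagName('table')->item(0));
echo $result;
```

Because the XPath query matches every thead in the document, this also covers the case of more than one thead without an extra loop.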
Code sample:
$html = <<<END
<tr>
<td>Text-1</td>
</tr>
<tr>
<td>Blah 1</td>
<td>Blah 2</td>
<td>Blah 3</td>
<td>Blah 4</td>
</tr>
<tr>
<td>Text-2</td>
</tr>
<tr>
<td>Blah 1</td>
<td>Blah 2</td>
<td>Blah 3</td>
<td>Blah 4</td>
</tr>
END;
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadhtml($html);
$xpath = new DOMXPath($dom);
// Grab the text
$nodes = $xpath->query('//td[contains(text(), "Text-2")]|//td[contains(text(), "Blah 1")]/following-sibling::td');
echo $nodes->item(0)->textContent;
I'm trying to grab the Blah 2 under Text-2. The problem is that it grabs the Blah 2 under Text-1 instead.
First, find the <tr> that has a <td> child containing "Text-2", then move through its following sibling rows until you find a <td> that contains "Blah 1". The answer is the next sibling of that cell.
//tr[contains(td/text(), "Text-2")]/following-sibling::tr/td[contains(text(), "Blah 1")]/following-sibling::td
Found the answer:
//*[text()='Text-2']/following::td[text()='Blah 1']/following-sibling::td
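To check which row such an expression really lands on, it can help to run it and count the rows before the match. A minimal sketch, assuming the rows from the question wrapped in a table element (the wrapper is an addition so the parser keeps the row structure); `$rowsBefore` is an illustrative name.

```php
<?php
$html = <<<END
<table>
<tr><td>Text-1</td></tr>
<tr><td>Blah 1</td><td>Blah 2</td><td>Blah 3</td><td>Blah 4</td></tr>
<tr><td>Text-2</td></tr>
<tr><td>Blah 1</td><td>Blah 2</td><td>Blah 3</td><td>Blah 4</td></tr>
</table>
END;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// following:: only looks at nodes after the Text-2 cell, so the first data row is skipped.
$query = "//*[text()='Text-2']/following::td[text()='Blah 1']/following-sibling::td";
$nodes = $xpath->query($query);

$first = $nodes->item(0);
echo $first->textContent, "\n";

// Count the rows before the matched cell's row to confirm which row it came from.
$rowsBefore = (int) $xpath->evaluate('count(../preceding-sibling::tr)', $first);
echo $rowsBefore, "\n";
```

The original query failed because the union operator returns nodes in document order, so item(0) was simply the first match in the document rather than the one below Text-2.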