I have a string that can contain arbitrary HTML elements.
For example:
$htmlString = '<p>Test</p>
<h2>Test2</h2>
<table>
<thead>
<tr>
<td>Header 1</td>
<td>Header 2</td>
</tr>
</thead>
<tbody>
<tr>
<td>Col 1</td>
<td>Col 2</td>
</tr>
</tbody>
</table>
<span>Test span </span>
';
As you can see, the string contains <p>, <h2>, <table> and <span> tags, and it could also contain other HTML tags.
My question: is there a way to remove every element from the string except the <table>? You can assume the table element contains no tags other than thead, tbody, tr and td.
This will probably be closed as a duplicate, but before that happens here’s some quick code to help you with your specific HTML. Instead of “removing” everything except your target text, we are “extracting” our target text. The code itself is pretty straightforward so I didn’t see a need to comment things as much as I usually do.
<?php
$htmlString = '<p>Test</p>
<h2>Test2</h2>
<table>
<thead>
<tr>
<td>Header 1</td>
<td>Header 2</td>
</tr>
</thead>
<tbody>
<tr>
<td>Col 1</td>
<td>Col 2</td>
</tr>
</tbody>
</table>
<span>Test span </span>
';
$dom = new DOMDocument();
// preserveWhiteSpace only takes effect if it is set before the document is loaded
$dom->preserveWhiteSpace = true;
$dom->loadHTML($htmlString, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$tables = $dom->getElementsByTagName('table');
foreach($tables as $table) {
var_dump($dom->saveHTML($table));
}
Demo here: https://3v4l.org/YjkdT
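strip_tags() may be enough, with two caveats: the allowed list must also name the tags inside the table (otherwise thead, tbody, tr and td are stripped as well), and only the tags are removed, so the text inside <p>, <h2> and <span> stays in the result:
https://www.php.net/manual/en/function.strip-tags.php
<?php
$stripped = strip_tags($htmlString, '<table><thead><tbody><tr><td>');
?>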
Related
Here is the sample html:
$html = '<table>
<tbody>
<tr>
<td><span style="background-color: #f1c40f;">Cell 1</span></td>
<td>Cell 2</td>
</tr>
</tbody>
</table>';
The goal is to retrieve everything between the "td" tags, including the "span" tag, if any. Here is the expected result:
<span style="background-color: #f1c40f;">Cell 1</span>
Cell 2
I tried calling saveHTML() on each node returned by DOMDocument::getElementsByTagName("td"), and also reading nodeValue, without success.
Here is the PHP code I tried:
$html = '<table>
<tbody>
<tr>
<td><span style="background-color: #f1c40f;">Cell 1</span></td>
<td>Cell 2</td>
</tr>
</tbody>
</table>';
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$tds = $doc->getElementsByTagName("td");
foreach($tds as $td){
dump('saveHTML: '.$doc->saveHTML($td));
dump('nodeValue: '.$td->nodeValue);
}
die();
Here is the output:
^ "saveHTML: <td><span style="background-color: #f1c40f;">Cell 1</span></td>"
^ "nodeValue: Cell 1"
^ "saveHTML: <td>Cell 2</td>"
^ "nodeValue: Cell 2"
Did I do anything wrong? Thanks a lot!
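A possible way to get there, as a minimal sketch: saveHTML($td) serializes the td element itself, while nodeValue returns only the text, so neither gives the inner markup. Serializing each child of the td and concatenating the results should give the content between the tags. The helper name innerHtml() is made up for this illustration and is not part of the DOM extension.

```php
<?php
$html = '<table>
<tbody>
<tr>
<td><span style="background-color: #f1c40f;">Cell 1</span></td>
<td>Cell 2</td>
</tr>
</tbody>
</table>';

// Hypothetical helper: serialize every child of a node and join the pieces.
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadHTML($html);

$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = innerHtml($td);
}

print_r($cells);
```

Each entry of $cells then holds the markup of one cell without the surrounding td tags.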
Let's say I have the following HTML table:
<table>
<tbody>
<tr>
<th>Column 1</th>
<th>Column 2</th>
<th>Column 3</th>
<th>Column 4</th>
<th>Column 5</th>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
<td>Column 4</td>
<td>Column 5</td>
</tr>
</tbody>
</table>
How would I use the Simple HTML DOM parser to return only the first three columns of the table, i.e. the table without columns 4 and 5? I've started with this to get each row:
foreach ($html->find('tr') as $row) {
}
Would I run another for loop to iterate through the cells?
You need to remove the 4th and 5th columns of each row (if they exist, and I just assume they do). Please note children of an element are zero-indexed:
foreach ($html->find('tr') as $row) {
// remove the 4th column
$row->children(3)->outertext = '';
// remove the 5th column
$row->children(4)->outertext = '';
}
print $html; // outputs the table without the 4th and 5th column of each row
A related answer: "Simple HTML Dom: How to remove elements?"
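For comparison, here is a rough sketch of the same idea with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a swapped-in technique, not what the answer above uses). It assumes every row should keep only its first three cells, whether they are th or td.

```php
<?php
$html = '<table>
<tbody>
<tr><th>Column 1</th><th>Column 2</th><th>Column 3</th><th>Column 4</th><th>Column 5</th></tr>
<tr><td>Column 1</td><td>Column 2</td><td>Column 3</td><td>Column 4</td><td>Column 5</td></tr>
</tbody>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select every header or data cell that sits at position 4 or later in its row.
$extraCells = $xpath->query('//tr/*[self::td or self::th][position() > 3]');

// Copy the matches into an array first, then detach them from their rows.
foreach (iterator_to_array($extraCells) as $cell) {
    $cell->parentNode->removeChild($cell);
}

$table = $dom->getElementsByTagName('table')->item(0);
$result = $dom->saveHTML($table);
echo $result;
```

Unlike the outertext approach, this one does not fail on rows that have fewer than five cells, because the XPath simply matches nothing for them.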
I'm trying to replace all TD tags within a THEAD to TH tags.
I figured using the PHP DOM extension would be best. I'm fairly new at it so I apologise for my lack of knowledge.
I did some searching and found how to replace tag names. However, I couldn't figure out how to only replace tag names within a parent (in this case the THEAD tag). I want to leave the TD's within the TBODY as is.
Here is my code to narrow down to the TD's within the THEAD. That's where I get lost.
How would I change the tag names in THEAD to TH?
$html = '<table>
<thead>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</thead>
<tbody>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</tbody>
</table>';
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// Get theads
$theads = $document->getElementsByTagName('thead');
// Loop through theads (in case there is more than one!)
for($i=0;$i<$theads->length;$i++) {
$thead = $theads->item($i);
// Loop through TR
foreach ($thead->childNodes AS $tr) {
if ($tr->nodeName == 'tr') {
// Loop through TD
foreach ($tr->childNodes AS $td) {
if ($td->nodeName == 'td') {
// Replace this tag
}
}
}
}
}
The manual documents a ->replaceChild() method that you can use to swap each td for a th:
$html = '<table>
<thead>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</thead>
<tbody>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</tbody>
</table>';
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// Get theads
$thead = $document->getElementsByTagName('thead')->item(0); // get the first thead tag
foreach ($thead->childNodes as $tr) { // loop thead rows `tr`
    if (!$tr instanceof DOMElement) {
        continue; // skip whitespace text nodes, which have no getElementsByTagName()
    }
    $tds = $tr->getElementsByTagName('td'); // get tds inside trs
    // the node list is live and shrinks on every replace, so walk it backwards
    $i = $tds->length - 1;
    while ($i > -1) {
        $td = $tds->item($i); // td
        $text = $td->nodeValue; // text content of the td
        $th = $document->createElement('th', $text); // th element with td node value
        $td->parentNode->replaceChild($th, $td); // replace
        $i--;
    }
}
echo $document->saveHTML();
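A variation on the same idea, as a sketch only: copying nodeValue into the new th drops any nested markup and attributes the td had. If those matter, the children can be moved over instead. The sample HTML below (with a class attribute and a b tag) is made up for this illustration.

```php
<?php
$html = '<table>
<thead><tr><td class="first">Column <b>1</b></td><td>Column 2</td></tr></thead>
<tbody><tr><td>Value 1</td><td>Value 2</td></tr></tbody>
</table>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Only tds below a thead are selected; tds in the tbody are left alone.
foreach ($xpath->query('//thead//td') as $td) {
    $th = $document->createElement('th');

    // Copy the attributes of the td onto the new th.
    foreach ($td->attributes as $attribute) {
        $th->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }

    // appendChild() moves a node, so firstChild advances until the td is empty.
    while ($td->firstChild !== null) {
        $th->appendChild($td->firstChild);
    }

    $td->parentNode->replaceChild($th, $td);
}

$result = $document->saveHTML();
echo $result;
```

The XPath result is a static list rather than a live one, so replacing nodes while looping over it should be safe here.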
Code sample:
$html = <<<END
<tr>
<td>Text-1</td>
</tr>
<tr>
<td>Blah 1</td>
<td>Blah 2</td>
<td>Blah 3</td>
<td>Blah 4</td>
</tr>
<tr>
<td>Text-2</td>
</tr>
<tr>
<td>Blah 1</td>
<td>Blah 2</td>
<td>Blah 3</td>
<td>Blah 4</td>
</tr>
END;
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Grab the text
$nodes = $xpath->query('//td[contains(text(), "Text-2")]|//td[contains(text(), "Blah 1")]/following-sibling::td');
echo $nodes->item(0)->textContent;
I'm trying to grab the Blah 2 under Text-2; the problem is that the query grabs the Blah 2 under Text-1 instead.
First, find the <tr> that has a <td> child containing "Text-2", then move to its following sibling rows until you find a <td> that contains "Blah 1". The answer is the next sibling of that cell.
//tr[contains(td/text(), "Text-2")]/following-sibling::tr/td[contains(text(), "Blah 1")]/following-sibling::td
Found the answer:
//*[text()='Text-2']/following::td[text()='Blah 1']/following-sibling::td
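To check an expression like this, a small sketch along the following lines may help. It assumes the rows are wrapped in a table element, and the "(second)" suffix on the cells of the second group is added here only so the two groups can be told apart; neither is in the original sample. The [1] predicates narrow the match to the row right after Text-2 and to the first cell after "Blah 1".

```php
<?php
$html = <<<END
<table>
<tr><td>Text-1</td></tr>
<tr><td>Blah 1</td><td>Blah 2</td><td>Blah 3</td><td>Blah 4</td></tr>
<tr><td>Text-2</td></tr>
<tr><td>Blah 1</td><td>Blah 2 (second)</td><td>Blah 3 (second)</td><td>Blah 4 (second)</td></tr>
</table>
END;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Row containing "Text-2" -> the row right after it -> its "Blah 1" cell -> the next cell.
$query = '//tr[td[contains(text(), "Text-2")]]'
    . '/following-sibling::tr[1]'
    . '/td[contains(text(), "Blah 1")]'
    . '/following-sibling::td[1]';

$nodes = $xpath->query($query);
$value = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

var_dump($value);
```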
I just recently read about the DOM module in PHP and now I'm trying to use it for parsing a HTML document. The page said that this was a much better solution than using preg but I'm having a hard time figuring out how to use it.
The page contains a table with dates and X number of events for the date.
First I need to get the text (a date) from a tr with valign="bottom", and then I need all the column values from every tr with valign="top" below that tr, up until the next tr with valign="bottom" (the next date). The number of tr elements with column data is unknown; it can be zero or many.
This is what the HTML on the page looks like:
<table>
<tr valign="bottom">
<td colspan="4">2009-02-26</td>
</tr>
<tr valign="top">
<td>21:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>23:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="bottom">
<td colspan="4">2009-02-27</td>
</tr>
<tr valign="top">
<td>06:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>10:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>13:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
</table>
So far I've been able to get the first two dates (I'm only interested in the first two) but I don't know how to go from here.
The xpath query I use to get the date trs is
$result = $xpath->query('//tr[@valign="bottom"][position()<3]');
Now I need a way to connect all the events for that day to the date, i.e. select all the tds and all the column values up until the next date tr.
$oldSetting = libxml_use_internal_errors(true);
libxml_clear_errors();
$html = new DOMDocument();
$html->loadHTMLFile('http://url/table.html');
$xpath = new DOMXPath($html);
$elements = $xpath->query("//table/tr");
foreach ($elements as $item) {
    // date rows carry valign="bottom", event rows carry valign="top"
    $isEventRow = $item->getAttribute('valign') === 'top';
    for ($node = $item->firstChild; $node !== null; $node = $node->nextSibling) {
        if ($isEventRow) {
            print($node->nodeValue);
        } else {
            // start a new line before each date
            print("<br>" . $node->nodeValue);
        }
    }
    print("<br>");
}
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);
Use the following-sibling axis.
This XPath expression
/table/tr/td[@colspan=4]
or
/table/tr[@valign='bottom']/td
results in a node set with the date cells.
How to get the cells between two date rows?
/table/tr/td[not(@colspan=4)][preceding::td[@colspan=4][1]='2009-02-26']
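Put together in PHP, this could look roughly like the sketch below. It is a variation that groups whole rows rather than single cells, uses a shortened copy of the table from the question, and starts the paths with // because loadHTML() wraps the markup in html and body elements. The variable name $eventsByDate is made up for the example.

```php
<?php
$html = '<table>
<tr valign="bottom"><td colspan="4">2009-02-26</td></tr>
<tr valign="top"><td>21:00</td><td>Column data</td></tr>
<tr valign="top"><td>23:00</td><td>Column data</td></tr>
<tr valign="bottom"><td colspan="4">2009-02-27</td></tr>
<tr valign="top"><td>06:00</td><td>Column data</td></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$eventsByDate = [];

// The first two date rows, as in the question.
foreach ($xpath->query('//tr[@valign="bottom"][position() < 3]') as $dateRow) {
    $date = trim($dateRow->textContent);
    $eventsByDate[$date] = [];

    // Event rows whose nearest preceding date row holds this date.
    $rows = $xpath->query(sprintf(
        '//tr[@valign="top"][preceding-sibling::tr[@valign="bottom"][1]/td[1] = "%s"]',
        $date
    ));

    foreach ($rows as $row) {
        $cells = [];
        foreach ($xpath->query('td', $row) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $eventsByDate[$date][] = $cells;
    }
}

print_r($eventsByDate);
```

The date is inserted into the XPath string as-is, which is fine for values like 2009-02-26 but would need care if the text could contain double quotes.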