DOMXPath Match text

Code sample:
$html = <<<END
<tr>
<td>Text-1</td>
</tr>
<tr>
<td>Blah 1</td>
<td>Blah 2</td>
<td>Blah 3</td>
<td>Blah 4</td>
</tr>
<tr>
<td>Text-2</td>
</tr>
<tr>
<td>Blah 1</td>
<td>Blah 2</td>
<td>Blah 3</td>
<td>Blah 4</td>
</tr>
END;
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Grab the text
$nodes = $xpath->query('//td[contains(text(), "Text-2")]|//td[contains(text(), "Blah 1")]/following-sibling::td');
echo $nodes->item(0)->textContent;
I'm trying to grab the "Blah 2" under Text-2, but the query returns the "Blah 2" under Text-1 instead.

First, find the <tr> that has a <td> child containing "Text-2", then move through its following sibling rows until you find a <td> that contains "Blah 1". The answer is the next sibling of that cell.
//tr[contains(td/text(), "Text-2")]/following-sibling::tr/td[contains(text(), "Blah 1")]/following-sibling::td

Found the answer:
//*[text()='Text-2']/following::td[text()='Blah 1']/following-sibling::td
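
As a rough sketch of how this expression can be used from PHP, the snippet below runs a slightly tightened variant against markup modeled on the question. The <table> wrapper, the id attributes, and the [1] predicates are assumptions added for illustration; they are not part of the original posts.

```php
<?php
// Markup modeled on the question. The <table> wrapper and the id attributes
// are assumptions, added so the two "Blah 2" cells can be told apart.
$html = <<<HTML
<table>
<tr><td>Text-1</td></tr>
<tr><td>Blah 1</td><td id="first">Blah 2</td><td>Blah 3</td></tr>
<tr><td>Text-2</td></tr>
<tr><td>Blah 1</td><td id="second">Blah 2</td><td>Blah 3</td></tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// 1. Find the cell whose text is exactly "Text-2".
// 2. Take the first "Blah 1" cell that follows it in document order.
// 3. Take that cell's immediate next sibling cell.
$query = '//td[text()="Text-2"]/following::td[text()="Blah 1"][1]/following-sibling::td[1]';
$nodes = $xpath->query($query);
$cell = $nodes->item(0);

echo $cell->textContent, ' (id: ', $cell->getAttribute('id'), ')', PHP_EOL;
```

Without the trailing [1], following-sibling::td returns every later cell in the row, so item(0) is still the cell right after "Blah 1", but the node list also contains the remaining cells.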

Related

PHP function that removes anything else but the specified part of string

I have a string, which consists of any html elements.
For example, I have this string:
$htmlString = '<p>Test</p>
<h2>Test2</h2>
<table>
<thead>
<tr>
<td>Header 1</td>
<td>Header 2</td>
</tr>
</thead>
<tbody>
<tr>
<td>Col 1</td>
<td>Col 2</td>
</tr>
</tbody>
</table>
<span>Test span </span>
';
As you can see, the string consists of <p>, <h2>, <table>, <span> tags, and it could also contain other html tags.
Is there a way to remove every element from the string except the <table>? You can assume that the table element contains no tags other than thead, tbody, tr, and td.

This will probably be closed as a duplicate, but before that happens here’s some quick code to help you with your specific HTML. Instead of “removing” everything except your target text, we are “extracting” our target text. The code itself is pretty straightforward so I didn’t see a need to comment things as much as I usually do.
<?php
$htmlString = '<p>Test</p>
<h2>Test2</h2>
<table>
<thead>
<tr>
<td>Header 1</td>
<td>Header 2</td>
</tr>
</thead>
<tbody>
<tr>
<td>Col 1</td>
<td>Col 2</td>
</tr>
</tbody>
</table>
<span>Test span </span>
';
$dom = new DOMDocument();
// Parser settings only take effect if they are set before loading
$dom->preserveWhiteSpace = true;
$dom->loadHTML($htmlString, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Extract every <table> element and dump its markup
$tables = $dom->getElementsByTagName('table');
foreach ($tables as $table) {
    var_dump($dom->saveHTML($table));
}
Demo here: https://3v4l.org/YjkdT

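This may be the solution you are looking for:
https://www.php.net/manual/en/function.strip-tags.php
Note that strip_tags() removes only the tags and keeps the text inside them, and the allowed list must name every tag you want to keep, including the ones inside the table:
<?php
$stripped = strip_tags($htmlString, '<table><thead><tbody><tr><td>');
?>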

Php Dom getting nodeValue without stripping inline element tags

Here is the sample html:
$html = '<table>
<tbody>
<tr>
<td><span style="background-color: #f1c40f;">Cell 1</span></td>
<td>Cell 2</td>
</tr>
</tbody>
</table>';
The goal is to retrieve everything between the "td" tags, including the "span" tag, if any. Here is the expected result:
<span style="background-color: #f1c40f;">Cell 1</span>
Cell 2
I tried DOMDocument's getElementsByTagName("td") together with saveHTML() and nodeValue, without success.
Here is the PHP code I tried:
$html = '<table>
<tbody>
<tr>
<td><span style="background-color: #f1c40f;">Cell 1</span></td>
<td>Cell 2</td>
</tr>
</tbody>
</table>';
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$tds = $doc->getElementsByTagName("td");
foreach ($tds as $td) {
    dump('saveHTML: ' . $doc->saveHTML($td));
    dump('nodeValue: ' . $td->nodeValue);
}
die();
Here is the output:
^ "saveHTML: <td><span style="background-color: #f1c40f;">Cell 1</span></td>"
^ "nodeValue: Cell 1"
^ "saveHTML: <td>Cell 2</td>"
^ "nodeValue: Cell 2"
Did I do anything wrong? Thanks a lot!
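
One possible way to get the inner markup of each cell is to serialize the cell's child nodes one by one instead of the cell itself. The following is a minimal sketch under that assumption; the $inner array is a name introduced here for illustration only.

```php
<?php
$html = '<table>
<tbody>
<tr>
<td><span style="background-color: #f1c40f;">Cell 1</span></td>
<td>Cell 2</td>
</tr>
</tbody>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// $inner is a hypothetical name: one string of inner markup per <td>.
$inner = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $markup = '';
    // saveHTML($td) would include the <td> tags themselves, so serialize
    // each child node (element or text) and concatenate the results.
    foreach ($td->childNodes as $child) {
        $markup .= $doc->saveHTML($child);
    }
    $inner[] = $markup;
}

print_r($inner);
```

In this sketch the first entry should keep the span element with its style attribute, while the second entry is just the text of the cell.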

Remove empty tags with line breaks from HTML

I have the following HTML:
<body>Summary: <br>
<table class="stats data tablesorter marg-bottom">
<thead><tr><th>Team</th><th>Wins</th><th>Losses</th><th>Ties</th><th>Win %</th></tr></thead>
<tbody>
<tr>
<td>Team 1</td>
<td>95</td>
<td>74</td>
<td>0</td>
<td>56.21</td>
</tr>
<tr>
<td>Team 2</td>
<td>74</td>
<td>95</td>
<td>0</td>
<td>43.79</td>
</tr>
</tbody>
</table>
<div>
</div>
</body>
And I want this as result:
<body>Summary: <br>
<table class="stats data tablesorter marg-bottom">
<thead><tr><th>Team</th><th>Wins</th><th>Losses</th><th>Ties</th><th>Win %</th></tr></thead>
<tbody>
<tr>
<td>Team 1</td>
<td>95</td>
<td>74</td>
<td>0</td>
<td>56.21</td>
</tr>
<tr>
<td>Team 2</td>
<td>74</td>
<td>95</td>
<td>0</td>
<td>43.79</td>
</tr>
</tbody>
</table>
</body>
The easiest fix would be to generate the markup correctly in the first place. Unfortunately, it comes out of a very old version of CKEditor that I can't upgrade (due to other implications).
What preg_replace call, recursive function, or loop can I run to remove the empty <div> tags and the unneeded empty lines?

Assuming you have this HTML in a variable called $html:
// Replace empty <div> tags with nothing
$html = preg_replace("/<div>\s*<\/div>/", "", $html);
// Replace multiple newlines in a row with a single newline
$html = preg_replace("/\n+/", "\n", $html);
echo $html;
EDIT
Full working code, including output:
<?php
$html = <<<END
<body>Summary: <br>
<table class="stats data tablesorter marg-bottom">
<thead><tr><th>Team</th><th>Wins</th><th>Losses</th><th>Ties</th><th>Win %</th></tr></thead>
<tbody>
<tr>
<td>Team 1</td>
<td>95</td>
<td>74</td>
<td>0</td>
<td>56.21</td>
</tr>
<tr>
<td>Team 2</td>
<td>74</td>
<td>95</td>
<td>0</td>
<td>43.79</td>
</tr>
</tbody>
</table>
<div>
</div>
</body>
END;
// Replace empty <div> tags with nothing
$html = preg_replace("/<div>\s*<\/div>/", "", $html);
// Replace multiple newlines in a row with a single newline
$html = preg_replace("/\n+/", "\n", $html);
echo $html;
// OUTPUT:
// <body>Summary: <br>
// <table class="stats data tablesorter marg-bottom">
// <thead><tr><th>Team</th><th>Wins</th><th>Losses</th><th>Ties</th><th>Win %</th></tr></thead>
// <tbody>
// <tr>
// <td>Team 1</td>
// <td>95</td>
// <td>74</td>
// <td>0</td>
// <td>56.21</td>
// </tr>
// <tr>
// <td>Team 2</td>
// <td>74</td>
// <td>95</td>
// <td>0</td>
// <td>43.79</td>
// </tr>
// </tbody>
// </table>
// </body>
?>
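
If the empty div might carry attributes, or a regular expression feels too brittle, a DOM-based variant may be worth considering. The sketch below is an alternative technique, not the approach from the answer above; the shortened sample markup and the XPath predicate are assumptions.

```php
<?php
// Shortened sample markup (an assumption): one empty div, one div with content.
$html = '<body>Summary: <br>'
      . '<table><tr><td>Team 1</td><td>95</td></tr></table>'
      . "\n<div>\n</div>\n"
      . '<div class="keep">Footer text</div></body>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select divs that have no child elements and no text other than whitespace.
$emptyDivs = $xpath->query('//div[not(*) and normalize-space(.) = ""]');
$removed = $emptyDivs->length;

// The XPath result is a static list, so removing nodes while looping is safe.
foreach ($emptyDivs as $div) {
    $div->parentNode->removeChild($div);
}

$remaining = $dom->getElementsByTagName('div')->length;
echo "Removed $removed empty div(s), $remaining div(s) left", PHP_EOL;
```

Note that saveHTML() on the whole document would also emit a doctype and html wrapper, so serializing only the body element may be closer to the desired result.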

PHP DOM - Replacing TD tags in a THEAD to TH tags

I'm trying to replace all TD tags within a THEAD to TH tags.
I figured using the PHP DOM extension would be best. I'm fairly new at it so I apologise for my lack of knowledge.
I did some searching and found how to replace tag names. However, I couldn't figure out how to only replace tag names within a parent (in this case the THEAD tag). I want to leave the TD's within the TBODY as is.
Here is my code to narrow down to the TD's within the THEAD. That's where I get lost.
How would I change the tag names in THEAD to TH?
$html = '<table>
<thead>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</thead>
<tbody>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</tbody>
</table>';
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// Get theads
$theads = $document->getElementsByTagName('thead');
// Loop through theads (in case there is more than one!)
for ($i = 0; $i < $theads->length; $i++) {
    $thead = $theads->item($i);
    // Loop through TR
    foreach ($thead->childNodes as $tr) {
        if ($tr->nodeName == 'tr') {
            // Loop through TD
            foreach ($tr->childNodes as $td) {
                if ($td->nodeName == 'td') {
                    // Replace this tag
                }
            }
        }
    }
}

The manual documents a ->replaceChild() method that you can use to replace the td tags with th tags:
$html = '<table>
<thead>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</thead>
<tbody>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
</tr>
</tbody>
</table>';
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// get the first thead element
$thead = $document->getElementsByTagName('thead')->item(0);
foreach ($thead->childNodes as $tr) { // loop over the thead rows
    if (!$tr instanceof DOMElement) {
        continue; // skip whitespace text nodes, if the parser kept any
    }
    $tds = $tr->getElementsByTagName('td'); // live list of tds inside this tr
    // iterate backwards because the live list shrinks as tds are replaced
    $i = $tds->length - 1;
    while ($i > -1) {
        $td = $tds->item($i); // td
        $text = $td->nodeValue; // text content of the td
        $th = $document->createElement('th', $text); // th element with the td's text
        $td->parentNode->replaceChild($th, $td); // replace
        $i--;
    }
}
echo $document->saveHTML();
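
Building the th from nodeValue keeps only the text of each cell. If the header cells may contain attributes or nested markup, a variant along the following lines could preserve them. This is a sketch, not the answer's code; the sample markup, the class attribute, and the use of DOMXPath are assumptions.

```php
<?php
// Shortened sample markup (an assumption) with an attribute and nested markup.
$html = '<table><thead><tr><td class="first">Column <b>1</b></td><td>Column 2</td></tr></thead>'
      . '<tbody><tr><td>Data 1</td><td>Data 2</td></tr></tbody></table>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// An XPath result is a static list, so nodes can be replaced while looping.
foreach ($xpath->query('//thead//td') as $td) {
    $th = $document->createElement('th');

    // Copy the attributes of the td onto the new th.
    foreach ($td->attributes as $attribute) {
        $th->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }

    // Move (not copy) every child node from the td into the th.
    while ($td->firstChild !== null) {
        $th->appendChild($td->firstChild);
    }

    $td->parentNode->replaceChild($th, $td);
}

$table = $document->getElementsByTagName('table')->item(0);
echo $document->saveHTML($table), PHP_EOL;
```

Only cells below a thead are matched by the query, so the td elements in the tbody stay untouched.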

Need help with PHP DOM XPath parsing table

I just recently read about the DOM module in PHP and now I'm trying to use it for parsing a HTML document. The page said that this was a much better solution than using preg but I'm having a hard time figuring out how to use it.
The page contains a table with dates and X number of events for the date.
First I need to get the text (a date) from a tr with valign="bottom", and then I need all the column values from every tr with valign="top" that follows it, up until the next tr with valign="bottom" (the next date). The number of tr elements with column data is unknown; it can be zero or many.
This is what the HTML on the page looks like:
<table>
<tr valign="bottom">
<td colspan="4">2009-02-26</td>
</tr>
<tr valign="top">
<td>21:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>23:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="bottom">
<td colspan="4">2009-02-27</td>
</tr>
<tr valign="top">
<td>06:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>10:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
<tr valign="top">
<td>13:00</td>
<td>Column data</td>
<td>Column data</td>
<td>Column data</td>
</tr>
</table>
So far I've been able to get the first two dates (I'm only interested in the first two) but I don't know how to go from here.
The XPath query I use to get the date trs is
$result = $xpath->query('//tr[@valign="bottom"][position()<3]');
Now I need a way to connect all the events for that day to the date, i.e. select all the tds and all the column values up until the next date tr.

// Suppress libxml warnings about malformed HTML while parsing
$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();
$html = new DOMDocument();
$html->loadHTMLFile('http://url/table.html');
$xpath = new DOMXPath( $html );
$elements = $xpath->query( "//table/tr" );
foreach ( $elements as $item ) {
    $newDom = new DOMDocument;
    $newDom->appendChild($newDom->importNode($item,true));
    $xpath = new DOMXPath( $newDom );
    foreach ($item->attributes as $attribute) {
        // Walk over the child nodes (cells) of the row
        for ($node = $item->firstChild; $node !== NULL;
             $node = $node->nextSibling) {
            if (($attribute->nodeName =='valign') && ($attribute->nodeValue=='top'))
            {
                // Event row: print the cell values one after another
                print($node->nodeValue);
            }
            else
            {
                // Any other row (here, a date row): start a new line first
                print("<br>".$node->nodeValue);
            }
        }
        print("<br>");
    }
}
// Restore the previous libxml error handling
libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );

Use the following-sibling axis.

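These XPath expressions
/table/tr/td[@colspan=4]
or
/table/tr[@valign='bottom']/td
each select a node set containing the date cells. (If the markup was loaded with loadHTML(), start the paths with //table instead, because the parser wraps the table in html and body elements.)
How do you get the cells between two date rows?
/table/tr/td[not(@colspan=4)][preceding::td[@colspan=4][1]='2009-02-26']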
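
To tie the pieces together, here is a minimal sketch of one way to group the event rows under their date using the following-sibling axis. The shortened sample table, the $schedule array, and the counting predicate are assumptions made for illustration, and the sketch presumes that all rows are siblings in a single table.

```php
<?php
// Shortened sample table (an assumption) in the shape described in the question.
$html = <<<HTML
<table>
<tr valign="bottom"><td colspan="4">2009-02-26</td></tr>
<tr valign="top"><td>21:00</td><td>Event A</td></tr>
<tr valign="top"><td>23:00</td><td>Event B</td></tr>
<tr valign="bottom"><td colspan="4">2009-02-27</td></tr>
<tr valign="top"><td>06:00</td><td>Event C</td></tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$schedule = []; // hypothetical result: date => list of rows => list of cell values
$dateNumber = 0;

foreach ($xpath->query('//tr[@valign="bottom"]') as $dateRow) {
    $dateNumber++;
    $date = trim($dateRow->textContent);
    $schedule[$date] = [];

    // Event rows after this date row that have exactly $dateNumber date rows
    // before them, i.e. the rows up to (but not including) the next date row.
    $query = sprintf(
        'following-sibling::tr[@valign="top"][count(preceding-sibling::tr[@valign="bottom"]) = %d]',
        $dateNumber
    );

    foreach ($xpath->query($query, $dateRow) as $eventRow) {
        $cells = [];
        foreach ($xpath->query('td', $eventRow) as $td) {
            $cells[] = trim($td->textContent);
        }
        $schedule[$date][] = $cells;
    }
}

print_r($schedule);
```

A date without events simply ends up with an empty list, which matches the case in the question where the number of event rows can be zero.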
