In a simple HTML table, I would like to remove the last column:
<table>
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
<th rowspan="3">I want to remove this</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td rowspan="3">I want to remove this</td>
</tr>
I am using this code, but I am still left with the content and with the th and td rowspan cells:
$myTable = preg_replace('#</?td rowspan[^>]*>#i', '', $myTable);
echo $myTable;
Question: how do I remove the last column and its content?
<?php
// Create a new DOMDocument and load the HTML
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
// Create a new XPath query
$xpath = new DOMXPath($dom);
// Find all elements with a rowspan attribute
$result = $xpath->query('//*[@rowspan]');
// Loop the results and remove them from the DOM
foreach ($result as $cell) {
$cell->parentNode->removeChild($cell);
}
// Save back to a string
$newhtml = $dom->saveHTML();
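If the column to drop is not always marked with a rowspan attribute, a position-based variant may be closer to "remove the last column". The following is only a sketch under assumptions: the $html string is a hypothetical copy of the question's table with closing tags added, and every row is assumed to physically contain the same number of cells (with real rowspans, later rows have fewer cells, so this would remove the wrong ones).

```php
<?php
// Hypothetical input modeled on the table in the question (closing tags added).
$html = <<<'HTML'
<table>
<tbody>
<tr><th>1</th><th>2</th><th>3</th><th rowspan="3">I want to remove this</th></tr>
<tr><td>1</td><td>2</td><td>3</td><td rowspan="3">I want to remove this</td></tr>
</tbody>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// For each row, select the last child that is a th or a td.
$lastCells = $xpath->query('//tr/*[self::th or self::td][last()]');

// Copy the node list into an array before changing the tree.
foreach (iterator_to_array($lastCells) as $cell) {
    $cell->parentNode->removeChild($cell);
}

$newHtml = $dom->saveHTML();
echo $newHtml;
```

Because the cell is removed as a DOM node, its content goes with it, which is the part the regex in the question left behind.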
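I guess this will do it:
$myTable = preg_replace("/<(?:td|th)[^>]*>.*?<\/(?:td|th)>\s+<\/tr>/i", "</tr>", $myTable);
This assumes each row ends with a closing tr tag (</tr>), unlike in your example, and that each cell sits on its own line: the dot does not match newlines here, so the match cannot span several lines, and \s+ requires whitespace between the last cell and </tr>.
Edit: this will remove the last <td> or <th> element before each closing </tr>, whether or not it has attributes.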
Using XPath to webscrape.
The structure is:
<table>
<tbody>
<tr>
<th>
<td>
But one of those tr elements contains just one th or one td:
<table>
<tbody>
<tr>
<th>
So I want to scrape a TR only if it contains two tags. I am using the path
$route = $path->query("//table[count(tr) > 1]//tr/th");
or
$route = $path->query("//table[count(tr) > 1]//tr/td");
But it's not working.
I am giving the original table's link here. The last two TRs of the first table have just one TD each. That is causing the problem. The 2nd and 3rd tables have the same issue as well.
https://www.daiwahouse.co.jp/mansion/kanto/tokyo/y35/gaiyo.html
$route = $path->query("//tr[count(*) >= 2]/th");
foreach ($route as $th){
$property[] = trim($th->nodeValue);
}
$route = $path->query("//tr[count(*) >= 2]/td");
foreach ($route as $td){
$value[] = trim($td->nodeValue);
}
I am trying to select TH and TD at the same time, BUT if a TR contains only one TD, it causes a problem: in the end the TD count and the TH count are not the same, because I am scraping more TDs than THs.
This XPath,
//table[count(.//tr) > 1]//th
will select all th elements within all table elements that have more than one tr descendant (regardless of whether tbody is present).
This XPath,
//tr[count(*) > 1]/*
will select all children of tr elements with more than one child.
This XPath,
//tr[count(th) = count(td)]/*
will select all children of tr elements where the number of th children equals the number of td children.
OP posted a link to the site. The root element is in the xmlns="http://www.w3.org/1999/xhtml" namespace.
See How does XPath deal with XML namespaces?
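To keep the TH and TD lists aligned, one option is to select the rows first and read both cells from the same row. This is a minimal sketch, not the exact code for the linked page: the $html string is a made-up stand-in for the real table, and the predicate assumes each wanted row has exactly one th and one td.

```php
<?php
// Hypothetical stand-in for the real table: the middle row has a single cell.
$html = <<<'HTML'
<table>
<tbody>
<tr><th>Name</th><td>Example</td></tr>
<tr><td>Only one cell</td></tr>
<tr><th>Price</th><td>100</td></tr>
</tbody>
</table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Only rows that hold exactly one th and one td.
$rows = $xpath->query('//tr[count(th) = 1 and count(td) = 1]');

$pairs = [];
foreach ($rows as $tr) {
    // Relative queries: the second argument is the context node.
    $th = $xpath->query('th', $tr)->item(0);
    $td = $xpath->query('td', $tr)->item(0);
    $pairs[trim($th->textContent)] = trim($td->textContent);
}

print_r($pairs);
```

Since each property and value come from the same tr, a row with a lone td can no longer shift the two lists against each other.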
If I understand correctly, you want th elements in trs that contain two elements? I think that this is what you need:
//th[count(../*) = 2]
I've included a more explicit path in my answer, using a union (|) to count both TH and TD elements:
$html = '
<html>
<body>
<table>
<tbody>
<tr>
<th>I am Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am ignored</th>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am also Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
</body>
</html>
';
$doc = new DOMDocument();
$doc->loadHTML( $html );
$xpath = new DOMXPath( $doc );
$result = $xpath->query("//table[ count( tbody/tr/td | tbody/tr/th ) > 1 ]/tbody/tr");
foreach( $result as $node )
{
var_dump( $doc->saveHTML( $node ) );
}
// string(88) "<tr><th>I am Included</th><td>I am a column</td></tr>"
// string(93) "<tr><th>I am also Included</th><td>I am a column</td></tr>"
You can also use this for any depth descendants
//table[ count( descendant::td | descendant::th ) > 1]//tr
Change the xpath after the condition (square bracketed part) to change what you return.
I am fetching html from a website with file_get_contents. I have a table (with a class name) inside html, and I want to get the data inside html tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</tbody>
</table>
Is there a way to search DOM elements in PHP like we do in jQuery? So that I can access the values 1, 2 (first td) and the div's value inside the second td.
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.
Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.
You can easily express the listed conditions with an XPath expression like this:
//table[@class="space"]//tr[count(td) = 2]/td
where
- //table[@class="space"] selects all table elements in the document whose class attribute is equal to the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
if (++$i % 2) {
$number = $td->nodeValue;
} else {
$planet = trim($td->textContent);
printf("%d: %s\n", $number, $planet);
}
}
Output
1: Mars
2: Earth
The code above should be treated as a sample rather than as something to use in practice, as it is not very scalable: the logic depends on the XPath expression selecting exactly two cells for each row. In practice, you may want to select the rows, iterate over them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[@class="space"]//tr');
foreach ($rows as $tr) {
$cells = $xpath->query('.//td', $tr);
if ($cells->length < 2) {
continue;
}
$number = $cells[0]->nodeValue;
$planet = trim($cells[1]->textContent);
printf("%d: %s\n", $number, $planet);
}
Here DOMXPath::query() is called with an XPath expression relative to the current row ($tr), and the code then checks that the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
You can also use the SimpleXML extension, which supports XPath as well, but it is much less flexible than the DOM extension.
For huge documents, use a streaming (pull) parser such as XMLReader.
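For completeness, here is roughly what the SimpleXML route mentioned above could look like. It is a sketch under assumptions: the HTML is still parsed with DOMDocument (SimpleXML expects well-formed XML) and then handed over with simplexml_import_dom(); the $html string is a shortened copy of the sample table.

```php
<?php
// Shortened copy of the sample table; the third row has only one cell.
$html = <<<'HTML'
<table class="space">
<tbody>
<tr><td class="marsia">1</td><td class="mars"><div>Mars</div></td></tr>
<tr><td class="earthia">2</td><td class="earth"><div>Earth</div></td></tr>
<tr><td class="earthia">3</td></tr>
</tbody>
</table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Wrap the parsed document in a SimpleXMLElement.
$sx = simplexml_import_dom($doc);

$planets = [];
foreach ($sx->xpath('//table[@class="space"]//tr[count(td) = 2]') as $tr) {
    $number = (int) (string) $tr->td[0];      // text of the first cell
    $planet = trim((string) $tr->td[1]->div); // text of the div in the second cell
    $planets[$number] = $planet;
}

print_r($planets);
```

The property-style access ($tr->td[1]->div) is shorter than the DOM calls, but it is tied to the exact nesting of the markup.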
I keep trying different methods of extracting the data from the HTML table, such as using XPath. The table(s) do not contain any classes, so I am not sure how to use XPath without a class or id. This data is being retrieved from an RSS XML file. I am currently using DOM. After I extract the data, I will try to sort the tables by Job Title.
Here is my php code
$html='';
$xml= simplexml_load_file($url) or die("ERROR: Cannot connect to url\n check if report still exist in the Gradleaders system");
/*What we do here in this loop is retrieve all content inside the encoded content,
*which includes the CDATA information. This is where the HTML and styling is included.
*/
foreach($xml->channel->item as $cont){
$html=''.$cont->children('content',true)->encoded.'<br>'; //actual tag name is encoded
}
$htmlParser= new DOMDocument(); //to parse html using DOMDocument
libxml_use_internal_errors(true); // your HTML gives parser warnings, keep them internal
$htmlParser->loadHTML($html); //Loaded the html string we took from simple xml
$htmlParser->preserveWhiteSpace = false;
$tables= $htmlParser->getElementsByTagName('table');
$rows= $tables->item(0)->getElementsByTagName('tr');
foreach($rows as $row){
$cols = $row->getElementsByTagName('td');
echo $cols;
}
This is the HTML I am extracting info from
<table cellpadding='1' cellspacing='2'>
<tr>
<td><b>Job Title:</b></td>
<td>Job Example </td>
</tr>
<tr>
<td><b>Job ID:</b></td>
<td>23992</td>
</tr>
<tr>
<td><b>Job Description:</b></td>
<td>Just a job example </td>
</tr>
<tr>
<td><b>Job Category:</b></td>
<td>Work-study Position</td>
</tr>
<tr>
<td><b>Position Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Applicant Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Status:</b></td>
<td>Active</td>
</tr>
<tr>
<td colspan='2'><b><a href='https://www.myjobs.com/tuemp/job_view.aspx?token=I1iBwstbTs2pau+SjrYfWA%3d%3d'>Click to View More</a></b></td>
</tr>
</table>
You can use xpath to query('//td') and retrieve the td html using C14N(), something like:
$dom = new DOMDocument();
$dom->loadHtml($html);
$x = new DOMXpath($dom);
foreach($x->query('//td') as $td){
echo $td->C14N();
//if just need the text use:
//echo $td->textContent;
}
Output:
<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...
C14N();
Returns canonicalized nodes as a string or FALSE on failure
Update:
Another question, how can I grab individual Table Data? For example, just grab Job ID
Use XPath contains, i.e.:
foreach($x->query('//td[contains(., "Job ID:")]') as $td){
echo $td->textContent;
}
Update V2:
How can I get the next Table Data after that (to actually get the Job Id)?
Use following-sibling::*[1], i.e.:
echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
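If all label/value pairs are needed rather than a single one, the same idea can be applied row by row. This is a sketch, assuming every data row has exactly two td cells as in the question; the $html string is a shortened, hypothetical copy of the table (the link target is a placeholder), and $job is a name introduced here for illustration.

```php
<?php
// Shortened, hypothetical copy of the job table from the question.
$html = <<<'HTML'
<table cellpadding='1' cellspacing='2'>
<tr><td><b>Job Title:</b></td><td>Job Example </td></tr>
<tr><td><b>Job ID:</b></td><td>23992</td></tr>
<tr><td colspan='2'><b><a href='#'>Click to View More</a></b></td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$x = new DOMXPath($dom);

$job = [];
// Skip the single-cell "Click to View More" row.
foreach ($x->query('//tr[count(td) = 2]') as $tr) {
    $cells = $x->query('td', $tr);
    // "Job ID:" becomes the key "Job ID".
    $label = rtrim(trim($cells->item(0)->textContent), ':');
    $job[$label] = trim($cells->item(1)->textContent);
}

print_r($job);
```

With one such array per feed item, sorting by Job Title becomes an ordinary usort() on the 'Job Title' key.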
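$xpathParser = new DOMXPath($htmlParser);
$tableDataNodes = $xpathParser->evaluate("//table/tr/td");
// DOMNodeList exposes ->length and ->item(); a node cannot be echoed directly
for ($x = 0; $x < $tableDataNodes->length; $x++) {
    echo $tableDataNodes->item($x)->textContent, "\n";
}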
We can use the childNodes property and the item() function to locate a child element from a parent node. However, if the path between the parent and the child is long, a lot of childNodes->item() calls have to be written, as in the PHP listed below. I want to find the content inside the P tag, starting from the table node (see the variable $sentence1). I don't know if this is the only way to deal with this situation. Is there a better way to do this?
HTML
<table>
<tr>
<td>
<p>Sentence 1</p>
</td>
</tr>
<tr>
<td>
<a>Click 1</a>
</td>
</tr>
</table>
<table>
<tr>
<td>
<p>Sentence 2</p>
</td>
</tr>
<tr>
<td>
<a>Click 2</a>
</td>
</tr>
</table>
Here is my PHP code
$content = file_get_contents('./a.html');
$dom = new \domDocument;
@$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new \DOMXPath($dom);
$table_list = $xpath->query('//table');
foreach($table_list as $k=>$table_node){
$sentence1 = $table_node->childNodes->item(0)->childNodes->item(0)->childNodes->item(0)->nodeValue;
}
I need the value inside the P tag, and as you can see, the code to get it is really long. Is there any way to shorten it, for example with something like
$sentence1 = $table_node->FIND('//tr/td/p')
Instead of writing childNodes->item() repeatedly?
$content = file_get_contents('./a.html');
$dom = new \domDocument;
@$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new \DOMXPath($dom);
$table_list = $xpath->query('//table/tr/td/p');
$sentences = array();
foreach($table_list as $k=>$table_node){
$sentences[] = $table_node->nodeValue;
}
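If I understand you correctly, you can simply do this: $xpath->query('//table//p'). It will return all p elements that are direct or indirect descendants of any table element in the document. Note that /table//p would only match a table that is the document's root element, which is not the case after loadHTML(), because the parser wraps the content in html and body elements.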
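The closest thing to the $table_node->FIND(...) call from the question is probably a relative XPath query: DOMXPath::query() accepts a context node as its second argument. A small sketch, using an inline $html string in place of the a.html file (an assumption for the sake of a self-contained example):

```php
<?php
// Inline stand-in for the contents of a.html.
$html = <<<'HTML'
<table>
<tr><td><p>Sentence 1</p></td></tr>
<tr><td><a>Click 1</a></td></tr>
</table>
<table>
<tr><td><p>Sentence 2</p></td></tr>
<tr><td><a>Click 2</a></td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$sentences = [];
foreach ($xpath->query('//table') as $table_node) {
    // The leading dot makes the path relative to $table_node.
    $p = $xpath->query('.//p', $table_node)->item(0);
    if ($p !== null) {
        $sentences[] = $p->nodeValue;
    }
}

print_r($sentences);
```

Without the leading dot (for example '//p'), the query would start from the document root again and ignore the context node.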
This is the code I have so far:
<?php
$url=" SOME HTML URL ";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
echo $tag->getAttribute('href');
}
?>
I have HTML pages with tables, so I want the link, the title and the date. Example of the HTML code:
<TR>
<TD align="center" vAlign="top" bgColor="#ffffff" class="smalltext">3</TD>
<TD class="plaintext" >THIS IS THE TITLE </TD>
<TD align="center" class="plaintext" >THIS IS DATE</TD>
</TR>
It works fine for the link, but I don't know how to get the others.
Thanks.
Where you are doing this:
$tags = $doc->getElementsByTagName('a');
You are getting back all the A tags. There only happens to be one.
If you want to get the text "THIS IS DATE", you aren't going to get it by looking in A tags, because the text is not inside an A tag: it is in a TD tag.
$tds = $doc->getElementsByTagName('td');
... would work to get all the TD elements, or you could assign an ID to the element you want to target and use getElementById instead.
Basically, though, this information is all in the documentation, which you absolutely should read before asking questions. Happy reading!
Once again, that's: http://php.net/manual/en/class.domdocument.php
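As a rough illustration of walking the TD elements row by row, here is a sketch. The $html string is hypothetical: it assumes the link sits inside the title cell (the A tag is not visible in the snippet from the question), and /item/3 is a made-up href.

```php
<?php
// Hypothetical row: the <a> inside the title cell is an assumption.
$html = <<<'HTML'
<table>
<TR>
<TD align="center" class="smalltext">3</TD>
<TD class="plaintext"><a href="/item/3">THIS IS THE TITLE</a></TD>
<TD align="center" class="plaintext">THIS IS DATE</TD>
</TR>
</table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$items = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = $tr->getElementsByTagName('td');
    if ($cells->length < 3) {
        continue; // not a data row
    }
    // First link inside the title cell, if there is one.
    $link = $cells->item(1)->getElementsByTagName('a')->item(0);
    $items[] = [
        'href'  => $link !== null ? $link->getAttribute('href') : null,
        'title' => trim($cells->item(1)->textContent),
        'date'  => trim($cells->item(2)->textContent),
    ];
}

print_r($items);
```

Reading the cells by position keeps the link, title and date of one row together, instead of collecting all A tags and all TD tags in separate passes.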