Using XPath to webscrape.
The structure is:
<table>
<tbody>
<tr>
<th>
<td>
but one of those tr elements contains just one th or one td:
<table>
<tbody>
<tr>
<th>
So I only want to scrape a TR if it contains two tags. I am using the path
$route = $path->query("//table[count(tr) > 1]//tr/th");
or
$route = $path->query("//table[count(tr) > 1]//tr/td");
But it's not working.
Here is a link to the original tables. The last two TRs of the first table have just one TD each, and that is causing the problem. The 2nd and 3rd tables have the same issue.
https://www.daiwahouse.co.jp/mansion/kanto/tokyo/y35/gaiyo.html
$route = $path->query("//tr[count(*) >= 2]/th");
foreach ($route as $th){
$property[] = trim($th->nodeValue);
}
$route = $path->query("//tr[count(*) >= 2]/td");
foreach ($route as $td){
$value[] = trim($td->nodeValue);
}
I am trying to select TH and TD at the same time, but if a TR contains only one TD it causes a problem: in the end the TD count and the TH count are not the same, because I scrape more TDs than THs.
This XPath,
//table[count(.//tr) > 1]//th
will select all th elements within all table elements that have more than one tr descendant (regardless of whether tbody is present).
This XPath,
//tr[count(*) > 1]/*
will select all children of tr elements with more than one child.
This XPath,
//tr[count(th) = count(td)]/*
will select all children of tr elements where the number of th children equals the number of td children.
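To keep the TH and TD lists aligned, one option is to select the rows first and read both cells from each row. This is only a sketch: the sample markup and the $pairs array are made up for illustration and are not taken from the linked page.

```php
<?php
// Hypothetical sample: two normal rows and one row with a single <td>.
$html = '<table><tbody>
<tr><th>Name</th><td>Alpha</td></tr>
<tr><th>Area</th><td>Beta</td></tr>
<tr><td>Footnote spanning the row</td></tr>
</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$pairs = [];
// Only rows that have exactly one <th> and one <td> are selected.
foreach ($xpath->query('//tr[count(th) = 1 and count(td) = 1]') as $tr) {
    // The second argument makes the expression relative to this row.
    $th = $xpath->query('th', $tr)->item(0);
    $td = $xpath->query('td', $tr)->item(0);
    $pairs[trim($th->nodeValue)] = trim($td->nodeValue);
}
print_r($pairs);
```

Because each TH is read together with the TD from the same row, the two can never get out of step.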
OP posted a link to the site. The root element is in the xmlns="http://www.w3.org/1999/xhtml" namespace.
See How does XPath deal with XML namespaces?
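A small sketch of the namespace point, assuming the page is parsed as XML with loadXML(). As far as I know, DOMDocument::loadHTML() does not put elements in a namespace, so this mostly matters for XML parsing. The prefix x is an arbitrary name chosen for this example.

```php
<?php
// Hypothetical XHTML fragment with a default namespace on the root.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
     . '<tr><th>A</th><td>B</td></tr>'
     . '</table></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Unprefixed names in XPath 1.0 match elements in no namespace only.
$plainCount = $xpath->query('//tr')->length;

// Bind a prefix to the XHTML namespace and use it in the expression.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixedCount = $xpath->query('//x:tr')->length;

printf("plain: %d, prefixed: %d\n", $plainCount, $prefixedCount);
```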
If I understand correctly, you want th elements in trs that contain two elements? I think that this is what you need:
//th[count(../*) = 2]
I've included a more explicit path in my answer, using the union operator (|) to count TH and TD elements together
$html = '
<html>
<body>
<table>
<tbody>
<tr>
<th>I am Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am ignored</th>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am also Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
</body>
</html>
';
$doc = new DOMDocument();
$doc->loadHTML( $html );
$xpath = new DOMXPath( $doc );
$result = $xpath->query("//table[ count( tbody/tr/td | tbody/tr/th ) > 1 ]/tbody/tr");
foreach( $result as $node )
{
var_dump( $doc->saveHTML( $node ) );
}
// string(88) "<tr><th>I am Included</th><td>I am a column</td></tr>"
// string(93) "<tr><th>I am also Included</th><td>I am a column</td></tr>"
You can also use this for any depth descendants
//table[ count( descendant::td | descendant::th ) > 1]//tr
Change the xpath after the condition (square bracketed part) to change what you return.
Related
Well, I have a HTML File with the following structure:
<h3>Heading 1</h3>
<table>
<!-- contains a <thead> and <tbody> which also contain several columns/lines-->
</table>
<h3>Heading 2</h3>
<table>
<!-- contains a <thead> and <tbody> which also contain several columns/lines-->
</table>
I want to get JUST the first table with all its content. So I'll load the HTML File
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('http://www.example.com'));
libxml_clear_errors();
?>
All tables have the same classes and NO specific IDs. That's why the only way I could think of was to grab the h3 tag with the value "Heading 1". I already found this one, which works well for me. (Since other tables and captions could be added later, that solution is not ideal.)
How can I grab the h3 tag WITH the value "Heading 1", and how can I select the table that follows it?
EDIT#1: I don't have access to the HTML File, so I can't edit it.
EDIT#2: My Solution (thanks to Martin Henriksen) for now is:
<?php
$doc = new DOMDocument('1.0');
libxml_use_internal_errors(true);
$doc->loadHTML(file_get_contents('http://example.com'));
libxml_clear_errors();
foreach ($doc->getElementsByTagName('h3') as $element) {
    // Skip every heading except the one we are looking for
    if ($element->nodeValue != 'exampleString') {
        continue;
    }
    // The first nextSibling is assumed to be the whitespace text node
    // after </h3>, so the table is the sibling after that
    $table = $element->nextSibling->nextSibling;
    $innerHTML = '';
    foreach ($table->childNodes as $child) {
        $innerHTML .= $child->ownerDocument->saveXML($child);
    }
    echo $innerHTML;
    file_put_contents("test.xml", $innerHTML);
}
?>
You can find any tag in HTML using the simple_html_dom.php class. You can download the file from https://sourceforge.net/projects/simplehtmldom/?source=typ_redirect
Then:
<?php
include_once('simple_html_dom.php');
$htm = "**YOUR HTML CODE**";
$html = str_get_html($htm);
$h3_tag = $html->find("h3", 0)->innertext;
echo "HTML code in h3 tag";
print_r($h3_tag);
?>
You can fetch all the DOMElements with the tag h3 and check the value each one holds through nodeValue. Once you have found the right h3, you can select the next node in the DOM tree with nextSibling.
foreach($dom->getElementsByTagName('h3') as $element)
{
if($element->nodeValue == 'Heading 1')
$table = $element->nextSibling;
}
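One caveat with nextSibling is that it can return a whitespace text node rather than the table. A possible alternative, shown here only as a sketch, is to let XPath pick the first table that follows the heading. The sample markup is invented to mirror the structure in the question.

```php
<?php
// Hypothetical markup mirroring the question: h3, table, h3, table.
$html = '<h3>Heading 1</h3>
<table><tr><td>first</td></tr></table>
<h3>Heading 2</h3>
<table><tr><td>second</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// normalize-space() ignores surrounding whitespace in the heading text;
// following-sibling::table[1] is the nearest table after that heading.
$table = $xpath->query(
    '//h3[normalize-space(.) = "Heading 1"]/following-sibling::table[1]'
)->item(0);

$text = $table !== null ? trim($table->textContent) : null;
echo $text, "\n";
```

This does not depend on whether there is whitespace or another element between the heading and the table.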
I am fetching html from a website with file_get_contents. I have a table (with a class name) inside html, and I want to get the data inside html tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</tbody>
</table>
Is there a way to search DOM elements in PHP like we do in jQuery, so that I can access the values 1, 2 (first td) and the div's value inside the second td?
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.
Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.
You can easily set the listed conditions with an XPath expression like this:
//table[@class="space"]//tr[count(td) = 2]/td
where
- //table[@class="space"] selects all table elements in the document whose class attribute value equals the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
if (++$i % 2) {
$number = $td->nodeValue;
} else {
$planet = trim($td->textContent);
printf("%d: %s\n", $number, $planet);
}
}
Output
1: Mars
2: Earth
The code above is supposed to be considered as a sample rather than an instruction for practical use, as it is not very scalable. The logic is bound to the fact that the XPath expression selects exactly two cells for each row. In practice, you may want to select the rows, iterate them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[@class="space"]//tr');
foreach ($rows as $tr) {
$cells = $xpath->query('.//td', $tr);
if ($cells->length < 2) {
continue;
}
$number = $cells[0]->nodeValue;
$planet = trim($cells[1]->textContent);
printf("%d: %s\n", $number, $planet);
}
DOMXPath::query() is called with an XPath expression relative to the current row ($tr), then checks if the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
You can also use SimpleXML extension, which also supports XPath. But the extension is much less flexible as compared to the DOM extension.
For huge documents, use a streaming parser such as XMLReader, which does not load the whole document into memory.
My Smarty code is shown below. I want to add a new tr after every two td elements. $k is a counter variable. With this code I cannot add a new tr after every two td elements.
{section name="sec" loop=$dataArray}
<tr>
{if ($k%2) == 0}
<td>{$dataArray[sec].itemNm}</td>
<td>{$dataArray[sec].rate}</td>
<td>{$dataArray[sec].unitId}</td>
<td>{$dataArray[sec].packing}</td>
</tr>
{/if}
{/section}
My Php Select Query is like
$selectdata = "SELECT *,itemNm FROM price
JOIN item ON item.itemId = price.itemId
WHERE price.companyId = ".$companyId;
$selectdataRes = mysql_query($selectdata);
while($dataRow = mysql_fetch_array($selectdataRes))
{
$dataArray[$k]['priceId'] = $dataRow['priceId'];
$dataArray[$k]['itemNm'] = $dataRow['itemNm'];
$dataArray[$k]['rate'] = $dataRow['rate'];
$dataArray[$k]['unitId'] = $dataRow['unitId'];
$dataArray[$k]['packing'] = $dataRow['packing'];
$k++;
}
You increment $k on every iteration of the mysql_fetch_array loop, which means $k is the row number (a TR in your HTML). Suppose your SQL query returns 4 rows, each containing a priceId, itemNm, rate, unitId and packing.
Normally, to show that in an ordinary HTML table, you would have 4 TR elements (one per row), each with one TD per value you want to display:
{section name="sec" loop=$dataArray}
<tr>
<td>{$dataArray[sec].itemNm}</td>
<td>{$dataArray[sec].rate}</td>
<td>{$dataArray[sec].unitId}</td>
<td>{$dataArray[sec].packing}</td>
</tr>
{/section}
If you test $k%2 == 0, the condition is true on every second row (every second TR), not every second TD. If you want to close the TR and open a new one after every two TDs, you have to loop over the TDs with a second counter, like this (plain PHP rather than Smarty, just to show the idea of the algorithm; $myData is assumed to hold one numerically indexed array per row):
for ($k = 0; $k < $numLines; $k++)
{
    echo "<tr>";
    for ($l = 0; $l < $numColumns; $l++)
    {
        // After every two cells, close the current row and open a new one
        if ($l > 0 && $l % 2 == 0)
        {
            echo "</tr><tr>";
        }
        echo "<td>" . $myData[$k][$l] . "</td>";
    }
    echo "</tr>";
}
Hoping it helps :)
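Another way to get the same grouping in plain PHP is array_chunk(), which splits a list into pieces of a fixed size. This is a sketch with made-up cell values, not the asker's real data.

```php
<?php
// Hypothetical cell values for one database row.
$cells = ['Item A', '10.50', 'kg', 'box'];

$rowsHtml = '';
// array_chunk() yields groups of two cells; each group becomes one <tr>.
foreach (array_chunk($cells, 2) as $pair) {
    $rowsHtml .= '<tr>';
    foreach ($pair as $cell) {
        $rowsHtml .= '<td>' . htmlspecialchars($cell) . '</td>';
    }
    $rowsHtml .= "</tr>\n";
}
echo $rowsHtml;
```

If the number of cells is odd, the last row simply ends up with a single td.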
Your question is very hard to understand...
Do you want to split every data row in two table rows?
Since you have a static template, you can add table rows inside your loop wherever you want. I see no need to use multiple loops.
{section name="sec" loop=$dataArray}
<tr>
<td>{$dataArray[sec].itemNm}</td>
<td>{$dataArray[sec].rate}</td>
</tr>
<tr>
<td>{$dataArray[sec].unitId}</td>
<td>{$dataArray[sec].packing}</td>
</tr>
{/section}
If this does not fit your needs, could you provide an example of the output you need?
We can use the childNodes property and the item() method to locate a child element from a parent node. However, if the path between the parent and the child is long, a lot of chained childNodes->item() calls are needed, as in the PHP below. I want to get the content of the P tag starting from the table node (see the variable $sentence1). Is this the only way to handle this situation, or is there a better way?
HTML
<table>
<tr>
<td>
<p>Sentence 1</p>
</td>
</tr>
<tr>
<td>
<a>Click 1</a>
</td>
</tr>
</table>
<table>
<tr>
<td>
<p>Sentence 2</p>
</td>
</tr>
<tr>
<td>
<a>Click 2</a>
</td>
</tr>
</table>
Here is my PHP code
$content = file_get_contents('./a.html');
$dom = new \domDocument;
@$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new \DOMXPath($dom);
$table_list = $xpath->query('//table');
foreach($table_list as $k=>$table_node){
$sentence1 = $table_node->childNodes->item(0)->childNodes->item(0)->childNodes->item(0)->nodeValue;
}
I need to get the value inside P tag, you can see the code is really long to just get the value inside the P tag, is there any way I can shorten the code, like using
$sentence1 = $table_node->FIND('//tr/td/p')
Instead of writing childNodes->item() repeatedly?
$content = file_get_contents('./a.html');
$dom = new \domDocument;
@$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new \DOMXPath($dom);
$table_list = $xpath->query('//table/tr/td/p');
$sentences = array();
foreach($table_list as $k=>$table_node){
$sentences[] = $table_node->nodeValue;
}
If I understand you correctly, you can simply do this: $xpath->query('//table//p'). It will return all p elements that are descendants (direct or indirect) of any table element in the document.
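If you need the paragraph per table, DOMXPath::query() also accepts a context node as its second argument, so a relative expression can replace the chain of childNodes->item() calls. A minimal sketch, using inline markup modelled on the question instead of a.html:

```php
<?php
// Inline markup standing in for a.html (assumed structure from the question).
$html = '<table><tr><td><p>Sentence 1</p></td></tr>'
      . '<tr><td><a>Click 1</a></td></tr></table>'
      . '<table><tr><td><p>Sentence 2</p></td></tr>'
      . '<tr><td><a>Click 2</a></td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$sentences = [];
foreach ($xpath->query('//table') as $table) {
    // The leading dot keeps the search inside the current table.
    $p = $xpath->query('.//p', $table)->item(0);
    if ($p !== null) {
        $sentences[] = trim($p->nodeValue);
    }
}
print_r($sentences);
```

Without the leading dot, '//p' would search the whole document again on every iteration.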
In a simple HTML table I would like to remove the last column:
<table>
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
<th rowspan="3">I want to remove this</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td rowspan="3">I want to remove this</td>
</tr>
I am using this code, but I am still left with the content and with the th and td rowspan cells
$myTable = preg_replace('#</?td rowspan[^>]*>#i', '', $myTable);
echo $myTable;
Question: how do I remove the last column and its content?
<?php
// Create a new DOMDocument and load the HTML
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
// Create a new XPath query
$xpath = new DOMXPath($dom);
// Find all elements with a rowspan attribute
$result = $xpath->query('//*[@rowspan]');
// Loop the results and remove them from the DOM
foreach ($result as $cell) {
$cell->parentNode->removeChild($cell);
}
// Save back to a string
$newhtml = $dom->saveHTML();
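If the column to drop is not always marked with rowspan, a variation is to select the last cell of every row by position. This is a sketch on a simplified, made-up table. Note that with real rowspans the last cell of a row is not necessarily in the last visual column, so this only suits tables where every row has the same number of cells.

```php
<?php
// Hypothetical table in which every row has the same number of cells.
$html = '<table><tbody>
<tr><th>1</th><th>2</th><th>drop</th></tr>
<tr><td>1</td><td>2</td><td>drop</td></tr>
</tbody></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// For each row, keep only th/td children and pick the last one.
foreach ($xpath->query('//tr/*[self::th or self::td][last()]') as $cell) {
    $cell->parentNode->removeChild($cell);
}

$remaining = $xpath->query('//th | //td')->length;
$newHtml = $dom->saveHTML();
echo $newHtml;
```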
I guess this will do it
preg_replace("/<(?:td|th)[^>]*>.*?<\/(?:td|th)>\s+<\/tr>/i", "</tr>", $myTable);
This assumes you have a closing tr tag (</tr>) at the end of each row, unlike in your example.
Edit: this will remove the last <td> or <th> element before each closing </tr>, regardless of whether it has any attributes.