Locating a TD by its TH with DomCrawler - PHP

I am trying to scrape a table's td tag, but first I need to check the th. For example, let's say the table structure is like the one below.
<tbody>
<tr>
<th>color</th>
<td>red</td>
</tr>
<tr>
<th>price</th>
<td>23.267$</td>
</tr>
<tr>
<th>brand</th>
<td>mustang</td>
</tr>
</tbody>
In this table I need to scrape the value mustang, but I can't use $crawler->filter('table td')->eq(3); for that because the position is always changing. So I need to find the value by its th: if the th's value is brand, then get its td.
What is the best way to do this?

I'm not sure it's the best solution, but I solved it with this:
$props = $node->filter("table th")->each(function($th, $i){
return $th->text();
});
$vals = $node->filter("table td")->each(function($td, $i){
return $td->text();
});
$items = [
"brand" => "",
"color" => "",
];
for ($a=0; $a < count($props); $a++) {
switch ($props[$a]) {
case 'brand':
$items["brand"] = $vals[$a];
break;
}
}
If there is another or a much better way to achieve this, please feel free to post it here. Thank you.
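One possible alternative is to let XPath do the matching: find the th by its text and take the td that follows it in the same row. The sketch below uses plain DOMDocument and DOMXPath so that it runs on its own; the variable names are illustrative. With DomCrawler, the same expression can probably be passed to $crawler->filterXPath(), but that is not shown here.

```php
<?php
// Sample markup from the question; the row order does not matter.
$html = <<<'HTML'
<table>
<tbody>
<tr><th>color</th><td>red</td></tr>
<tr><th>price</th><td>23.267$</td></tr>
<tr><th>brand</th><td>mustang</td></tr>
</tbody>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Find the th whose trimmed text is "brand", then take the first td
// that follows it inside the same tr.
$cells = $xpath->query('//th[normalize-space() = "brand"]/following-sibling::td[1]');

// Fall back to null when no such row exists.
$brand = $cells->length > 0 ? trim($cells->item(0)->textContent) : null;

echo $brand, PHP_EOL;
```

Because the lookup is keyed on the heading text, it keeps working when rows are added, removed, or reordered.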

Related

XPath for td/th based on tr count

Using XPath to webscrape.
The structure is:
<table>
<tbody>
<tr>
<th>
<td>
But some of those tr elements contain just one th or one td:
<table>
<tbody>
<tr>
<th>
So I only want to scrape a TR if it contains two tags. I am using the path
$route = $path->query("//table[count(tr) > 1]//tr/th");
or
$route = $path->query("//table[count(tr) > 1]//tr/td");
But it's not working.
Here is a link to the original tables. The last two TRs of the first table have just one TD each, which causes the problem. The 2nd and 3rd tables have the same issue as well.
https://www.daiwahouse.co.jp/mansion/kanto/tokyo/y35/gaiyo.html
$route = $path->query("//tr[count(*) >= 2]/th");
foreach ($route as $th){
$property[] = trim($th->nodeValue);
}
$route = $path->query("//tr[count(*) >= 2]/td");
foreach ($route as $td){
$value[] = trim($td->nodeValue);
}
I am trying to select TH and TD at the same time, but if a TR contains only one TD, it causes a problem: in the end the TD count and the TH count are not the same, because I am scraping more TDs than THs.
This XPath,
//table[count(.//tr) > 1]//th
will select all th elements within all table elements that have more than one tr descendant (regardless of whether tbody is present).
This XPath,
//tr[count(*) > 1]/*
will select all children of tr elements with more than one child.
This XPath,
//tr[count(th) = count(td)]/*
will select all children of tr elements where the number of th children equals the number of td children.
OP posted a link to the site. The root element is in the xmlns="http://www.w3.org/1999/xhtml" namespace.
See How does XPath deal with XML namespaces?
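A minimal sketch of what the namespace remark means in PHP, assuming the page is parsed as XML with loadXML(). The prefix x and the sample markup are made up for illustration. As far as I know, loadHTML() does not put elements into a namespace, so the prefix should only be needed when the document is parsed as XML.

```php
<?php
// Hypothetical XHTML sample with a default namespace on the root element.
$xml = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tr><th>Name</th><td>Example</td></tr>
<tr><td>Footnote only</td></tr>
</table>
</body>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Bind an arbitrary prefix ("x" is an assumption) to the XHTML namespace.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

// Unprefixed names only match elements in no namespace, so this finds nothing.
$plain = $xpath->query('//tr[count(th) = count(td)]/*');

// The prefixed version matches the th and td of the first row.
$prefixed = $xpath->query('//x:tr[count(x:th) = count(x:td)]/*');

echo $plain->length, ' ', $prefixed->length, PHP_EOL;
```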
If I understand correctly, you want th elements in trs that contain two elements? I think that this is what you need:
//th[count(../*) = 2]
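To keep headings and values aligned, one option is to select the rows rather than the cells and read both cells from each row. This is only a sketch with made-up sample data, and it assumes the interesting rows have exactly one th and one td.

```php
<?php
// Hypothetical sample: two regular rows and one row with a single td.
$html = <<<'HTML'
<table>
<tbody>
<tr><th>Name</th><td>Example Tower</td></tr>
<tr><th>Floors</th><td>12</td></tr>
<tr><td>Note without a heading</td></tr>
</tbody>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$pairs = [];
// Only rows with exactly one th and one td are considered.
foreach ($xpath->query('//tr[count(th) = 1 and count(td) = 1]') as $tr) {
    // Relative queries are evaluated against the current row.
    $th = $xpath->query('th', $tr)->item(0);
    $td = $xpath->query('td', $tr)->item(0);
    $pairs[trim($th->textContent)] = trim($td->textContent);
}

print_r($pairs);
```

Since each heading is read together with its own value, a row with a single cell can no longer shift the two lists against each other.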
I've included a more explicit path in my answer, using the union operator (|) to count TH and TD elements together.
$html = '
<html>
<body>
<table>
<tbody>
<tr>
<th>I am Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am ignored</th>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am also Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
</body>
</html>
';
$doc = new DOMDocument();
$doc->loadHTML( $html );
$xpath = new DOMXPath( $doc );
$result = $xpath->query("//table[ count( tbody/tr/td | tbody/tr/th ) > 1 ]/tbody/tr");
foreach( $result as $node )
{
var_dump( $doc->saveHTML( $node ) );
}
// string(88) "<tr><th>I am Included</th><td>I am a column</td></tr>"
// string(93) "<tr><th>I am also Included</th><td>I am a column</td></tr>"
You can also use this for any depth descendants
//table[ count( descendant::td | descendant::th ) > 1]//tr
Change the xpath after the condition (square bracketed part) to change what you return.

Getting DOM elements of html from file_get_contents

I am fetching HTML from a website with file_get_contents. The HTML contains a table (with a class name), and I want to get the data inside its tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</tbody>
</table>
Is there a way to search DOM elements in PHP like we do in jQuery, so that I can access the values 1 and 2 (first td) and the div's value inside the second td?
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.
Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.
You can easily set the listed conditions with an XPath expression like this:
//table[@class="space"]//tr[count(td) = 2]/td
where
- //table[@class="space"] selects all table elements in the document whose class attribute value equals the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
if (++$i % 2) {
$number = $td->nodeValue;
} else {
$planet = trim($td->textContent);
printf("%d: %s\n", $number, $planet);
}
}
Output
1: Mars
2: Earth
The code above should be treated as a sample rather than something for practical use, as it is not very scalable: the logic depends on the XPath expression selecting exactly two cells per row. In practice, you may want to select the rows, iterate over them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[@class="space"]//tr');
foreach ($rows as $tr) {
$cells = $xpath->query('.//td', $tr);
if ($cells->length < 2) {
continue;
}
$number = $cells[0]->nodeValue;
$planet = trim($cells[1]->textContent);
printf("%d: %s\n", $number, $planet);
}
DOMXPath::query() is called with an XPath expression relative to the current row ($tr); the code then checks that the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
You can also use the SimpleXML extension, which also supports XPath, but it is much less flexible than the DOM extension.
For huge documents, use a streaming parser such as XMLReader.
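A rough sketch of the XMLReader approach mentioned above, assuming the input is well-formed XML (XMLReader does not repair broken HTML the way loadHTML() does). It streams through the document and only materializes one tr at a time; the $planets array is an illustrative name.

```php
<?php
// The sample table from the question, written as well-formed XML.
$xml = <<<'XML'
<table class="space">
<tbody>
<tr><td class="marsia">1</td><td class="mars"><div>Mars</div></td></tr>
<tr><td class="earthia">2</td><td class="earth"><div>Earth</div></td></tr>
<tr><td class="earthia">3</td></tr>
</tbody>
</table>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$planets = [];
while ($reader->read()) {
    // Skip everything that is not the start of a tr element.
    if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== 'tr') {
        continue;
    }
    // Load just this row into SimpleXML; the rest of the document stays unparsed.
    $row = simplexml_load_string($reader->readOuterXml());
    if ($row->td->count() < 2) {
        continue; // ignore rows with fewer than two cells
    }
    $planets[(int) $row->td[0]] = trim((string) $row->td[1]->div);
}
$reader->close();

print_r($planets);
```

For a document loaded from disk, $reader->open($path) would replace the XML() call.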

Parse json and put into many tables

I was trying to print this JSON in some tables, but I can't get it to work well. I hope you can help me. This is the JSON that I get via AJAX:
<?php
$json = array(
'teams'=>array(
array(
'item'=>'tabla_clasif',
'rows'=>array(
array('No'=>'1','logo1'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'2','logo2'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'3','logo3'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'4','logo4'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'5','logo5'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'6','logo6'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'7','logo7'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'8','logo8'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'9','logo9'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'10','logo10'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'11','logo11'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'12','logo12'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'13','logo13'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'14','logo14'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'15','logo15'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12'),
array('No'=>'16','logo16'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','pg'=>'3','pe'=>'0','pp'=>'0','gf'=>'9','gc'=>'2','dg'=>'7','pt'=>'12')
)
),
array(
'item'=>'goles_marca',
'rows'=>array(
array('No'=>'1','logo1'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'2','logo2'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'3','logo3'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'4','logo4'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'5','logo5'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'6','logo6'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'7','logo7'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'8','logo8'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'9','logo9'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'10','logo10'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'11','logo11'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'12','logo12'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'13','logo13'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'14','logo14'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'15','logo15'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3'),
array('No'=>'16','logo16'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gf'=>'3')
)
),
array(
'item'=>'goles_recib',
'rows'=>array(
array('No'=>'1','logo1'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'2','logo2'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'3','logo3'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'4','logo4'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'5','logo5'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'6','logo6'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'7','logo7'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'8','logo8'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'9','logo9'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'10','logo10'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'11','logo11'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'12','logo12'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'13','logo13'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'14','logo14'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'15','logo15'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3'),
array('No'=>'16','logo16'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','gc'=>'3')
)
),
array(
'item'=>'efect_gol',
'rows'=>array(
array('No'=>'1','logo1'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'2','logo2'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'3','logo3'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'4','logo4'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'5','logo5'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'6','logo6'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'7','logo7'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'8','logo8'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'9','logo9'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'10','logo10'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'11','logo11'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'12','logo12'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'13','logo13'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'14','logo14'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'15','logo15'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3'),
array('No'=>'16','logo16'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','eg'=>'3')
)
),
array(
'item'=>'remate_total',
'rows'=>array(
array('No'=>'1','logo1'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'2','logo2'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'3','logo3'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'4','logo4'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'5','logo5'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'6','logo6'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'7','logo7'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'8','logo8'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'9','logo9'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'10','logo10'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'11','logo11'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'12','logo12'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'13','logo13'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'14','logo14'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'15','logo15'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3'),
array('No'=>'16','logo16'=>'images/lobos.png','team1'=>'atlante','pj'=>'3','rt'=>'3')
)
),
)
);
echo json_encode($json);
?>
I need to put them into tables, one table per item. I have tried nesting several each loops to reach every row, but it got really hard.
What is the best way to do this?
onSuccess : function(data) {}
An example of the HTML I am looking for would be:
<table>
<thead>
<tr>
<td>No</td>
<td>Logo</td>
<td>Team</td>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><img src="images/lobos.png" /></td>
<td>atlante</td>
</tr>
</tbody>
</table>
Here is another suggestion, using the Element constructor. You should also take a look at the HtmlTable class in MooTools More, which is probably a good alternative here if you can re-format the JSON a bit.
Anyway, this is how I would do what you are looking for:
json = JSON.parse(json);
json.teams.each(function (team) {
var newTable = new Element('table', {
'class': 'hidden myWidget'
}).inject(document.body);
var thead = new Element('thead').inject(newTable);
var titleRow = new Element('tr');
for (var title in team.rows[0]) new Element('td', {
'html': title
}).inject(titleRow);
titleRow.inject(thead);
var tbody = new Element('tbody').inject(newTable);
team.rows.each(function (row) {
var newRow = new Element('tr');
for (var value in row) new Element('td', {
'html': row[value]
}).inject(newRow);
newRow.inject(tbody);
});
});
Example: http://jsfiddle.net/2KjSn/
P.S. When you find a solution yourself, you should post it as an answer rather than editing the question. That way it might be useful for others as well.
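If building the markup on the server is an option, the same idea can be sketched in PHP instead of MooTools. This is an assumption about the setup rather than what the question asked for; renderTable() is a hypothetical helper, and treating every key that starts with "logo" as an image path is a guess based on the sample data.

```php
<?php
// Hypothetical helper: turns one entry of $json['teams'] into a table.
function renderTable(array $team): string
{
    $rows = $team['rows'];
    if ($rows === []) {
        return '';
    }

    // Header cells come from the keys of the first row.
    $html = '<table id="' . htmlspecialchars($team['item']) . '"><thead><tr>';
    foreach (array_keys($rows[0]) as $heading) {
        $html .= '<td>' . htmlspecialchars((string) $heading) . '</td>';
    }
    $html .= '</tr></thead><tbody>';

    foreach ($rows as $row) {
        $html .= '<tr>';
        foreach ($row as $key => $value) {
            $cell = htmlspecialchars((string) $value);
            // Assumption: keys such as logo1, logo2, ... hold image paths.
            if (strpos((string) $key, 'logo') === 0) {
                $cell = '<img src="' . $cell . '" />';
            }
            $html .= '<td>' . $cell . '</td>';
        }
        $html .= '</tr>';
    }

    return $html . '</tbody></table>';
}

// A one-row excerpt of the data from the question.
$json = ['teams' => [[
    'item' => 'goles_marca',
    'rows' => [
        ['No' => '1', 'logo1' => 'images/lobos.png', 'team1' => 'atlante', 'pj' => '3', 'gf' => '3'],
    ],
]]];

$tables = array_map('renderTable', $json['teams']);
echo implode("\n", $tables), PHP_EOL;
```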

Digging deeper into DOMElement

I've used Zend_Dom_Query to extract some <tr> elements, and now I want to loop through them and do some more. Each <tr> looks like this, so how can I print the title (Title1) and the id of the second cell (id=categ-113)?
<tr class="sometr">
<th><a class="title">Title1</a></th>
<td class="category" id="categ-113"></td>
<td class="somename">Title 1 name</td>
</tr>
You should just play around with the results. I've never worked with it, but this is how far I got (and I'm fairly new to Zend myself):
$dom = new Zend_Dom_Query($html);
$res = $dom->query('.sometr');
foreach ($res as $tr) { // each result is a DOMElement
$a = $tr->getElementsByTagName('a');
echo $a->item(0)->textContent; // the title
$td = $tr->getElementsByTagName('td');
echo $td->item(0)->getAttribute('id'); // the id, e.g. categ-113
}
With this I think you're set to go. For more on the properties and methods available on each result, look up DOMElement ( http://php.net/manual/de/class.domelement.php ). With that information you should be able to grab everything you need. But my question is:
Why do this in such a complicated way? I don't really see the use case. Shouldn't the title and everything else come from the database? And if it's XML, there are better solutions than relying on Dom_Query.
Anyway, if this was helpful to you, please accept and/or upvote the answer.
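For comparison, here is a sketch of the same extraction with plain DOMDocument and DOMXPath, without Zend_Dom_Query. The wrapping table element is added only so the fragment parses cleanly; the variable names are illustrative.

```php
<?php
// The row from the question, wrapped in a table so that it parses cleanly.
$html = <<<'HTML'
<table>
<tr class="sometr">
<th><a class="title">Title1</a></th>
<td class="category" id="categ-113"></td>
<td class="somename">Title 1 name</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$found = [];
foreach ($xpath->query('//tr[@class="sometr"]') as $tr) {
    // evaluate() with string() returns the text directly, relative to this row.
    $title = $xpath->evaluate('string(th/a[@class="title"])', $tr);
    $id = $xpath->evaluate('string(td[@class="category"]/@id)', $tr);
    $found[] = [$title, $id];
    echo $title, ' ', $id, PHP_EOL;
}
```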

php: parsing table structure with SimpleXML

I'm trying to read in an xml file that for some reason has been modeled in a table structure like so:
<tr id="1">
<td name="Date">10/01/2009</td>
<td name="PromoName">Sample Promo Name</td>
<td name="PromoCode">Sample Promo Code</td>
<td name="PromoLevel" />
</tr>
This is just one sample row, the file has multiple <tr> blocks and it's all surrounded by <table>.
How can I read in the values, given that every element is a <td> distinguished only by its name attribute?
You could use SimpleXML with an XPath expression.
$xml = simplexml_load_file('myFile.xml');
$values = $xml->xpath('//td[@name]');
foreach($values as $v) {
echo "Found $v<br />";
}
This would give you all the TD node values that have a name attribute, e.g.
Found 10/01/2009
Found Sample Promo Name
Found Sample Promo Code
Found <nothing cuz PromoLevel is empty>
Edit To get through all the Table Rows, you could do something like this:
$rows = $xml->xpath('//tr');
foreach($rows as $row) {
echo $row['id'];
foreach($row->td as $td) {
if($td['name']) {
echo $td['name'],':',$td,'<br/>',PHP_EOL;
}
}
}
You might also want to have a look at this article.
Edit Fixed the XPath expression, as Josh suggested.
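Building on the loop above, the rows can also be collected into associative arrays keyed by the name attribute. This is a sketch that inlines the sample row from the question instead of reading myFile.xml; $records is an illustrative name.

```php
<?php
// Sample row from the question, wrapped in the surrounding <table>.
$xmlString = <<<'XML'
<table>
<tr id="1">
<td name="Date">10/01/2009</td>
<td name="PromoName">Sample Promo Name</td>
<td name="PromoCode">Sample Promo Code</td>
<td name="PromoLevel" />
</tr>
</table>
XML;

$xml = simplexml_load_string($xmlString);

$records = [];
foreach ($xml->xpath('//tr') as $row) {
    $record = [];
    foreach ($row->td as $td) {
        if (isset($td['name'])) {
            // Cast to string; otherwise SimpleXMLElement objects are stored.
            $record[(string) $td['name']] = (string) $td;
        }
    }
    // Index each record by the row's id attribute.
    $records[(string) $row['id']] = $record;
}

print_r($records);
```

An empty element such as PromoLevel ends up as an empty string, so every record keeps the same set of keys.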
