Using Xpath expression and checking element's count - php

I am scraping a table with XPath and matching the tds of each TR, but there is a problem: some of the TRs have only one td, so I need to eliminate those. That elimination is giving me quite a bit of trouble.
For example:
$getTR = $path->query("//table[@class='bgc_line']/tr");
foreach ($getTR as $tr) {
    if ($tr->length == 2) {
        $route = $path->query("//table[@class='bgc_line']/tr/td[1]");
        foreach ($route as $td1) {
            $property[] = trim($td1->nodeValue);
        }
        $route = $path->query("//table[@class='bgc_line']/tr/td[2]");
        foreach ($route as $td2) {
            $value[] = trim($td2->nodeValue);
        }
    }
}
So my usage of if isn't exactly right. Is there another way to do this? I have two expressions, and the first XPath's result count is different from the second's. That's why I can't match the data with each other. You can see the table here.

You can use
table[@class='bgc_line']/tr[count(td) > 1]
to get table rows only if they have more than one child td.
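As a minimal sketch of how that predicate could be used, the snippet below filters the rows first and then reads both cells relative to each row, so the two arrays always stay the same length. The sample markup and its cell values are made up for illustration; the real table is not shown in the question.

```php
<?php
// Hypothetical sample markup; the real page's table is not shown in the question.
$html = <<<HTML
<table class="bgc_line">
  <tr><td>Section heading</td></tr>
  <tr><td>Route</td><td>Istanbul</td></tr>
  <tr><td>Duration</td><td>2 hours</td></tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$path = new DOMXPath($doc);

$property = [];
$value = [];

// Select only the rows that have more than one td child.
foreach ($path->query("//table[@class='bgc_line']/tr[count(td) > 1]") as $tr) {
    // Passing $tr as the context node keeps both cells tied to the same row.
    $property[] = trim($path->query('td[1]', $tr)->item(0)->nodeValue);
    $value[]    = trim($path->query('td[2]', $tr)->item(0)->nodeValue);
}

print_r(array_combine($property, $value));
```

Because the second argument of query() is the context node, td[1] and td[2] are resolved inside the current row rather than across the whole table.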

Related

PHP Simple Dom - Loop through one class at a time?

Is there any way to do this? Currently I'm using this method to go through the tds of every table in a Wikipedia entry that has a class of wikitable.
foreach ($html->find('table.wikitable td') as $key => $info)
{
$text = $info->innertext;
}
However, what I want to do is have separate loops for each table that shares the class name wikitable. I can't figure out how to do this.
Is there some kind of syntax? I want to do something like this
$first_table = $html->find('table.wikitable td', 0); // return all the td's for the first table
$second_table = $html->find('table.wikitable td', 1); // second one
I might not fully understand your question but it seems that $html->find simply returns an array, in your case an array of tables:
$tables = $html->find('table.wikitable');
You can then loop through your tables and find the td's in each table:
foreach( $tables as $table )
{
$tds = $table->find('td');
foreach( $tds as $td )
{
...
}
}
If you only want to target the second table you can use:
$table = $tables[1];
Or something like that.
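For comparison, here is a rough sketch of the same per-table loop written with PHP's built-in DOM extension instead of Simple HTML DOM. This is a swapped-in technique, not the library the question uses, and the sample markup and the $cellsByTable name are assumptions for illustration.

```php
<?php
// Hypothetical sample markup standing in for a Wikipedia page.
$html = <<<HTML
<table class="wikitable sortable"><tr><td>a1</td><td>a2</td></tr></table>
<table class="infobox"><tr><td>skip me</td></tr></table>
<table class="wikitable"><tr><td>b1</td></tr></table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match "wikitable" as a whole class token, even when other classes are present.
$tables = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]"
);

$cellsByTable = [];
foreach ($tables as $table) {
    $cells = [];
    // ".//td" is evaluated relative to the current table only.
    foreach ($xpath->query('.//td', $table) as $td) {
        $cells[] = trim($td->textContent);
    }
    $cellsByTable[] = $cells;
}

print_r($cellsByTable);
```

Each entry of $cellsByTable then holds the cells of one table, so $cellsByTable[1] corresponds to the second wikitable.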

php domdocument parse nested tables

I got a table which looks like this: http://pastebin.com/jjZxeNHF
I got it as a PHP-DOMDocument.
Now I want to "parse" this table.
If I am correct, something like the following is not going to work because $superTable->getElementsByTagName('tr') is not only going to get outer tr's but also the inner ones.
foreach ($superTable->getElementsByTagName('tr') as $superRow) {
foreach ($superRow->getElementsByTagName('td') as $superCol) {
foreach ($superCol->getElementsByTagName('table') as $table) {
foreach ($table->getElementsByTagName('tr') as $row) {
foreach ($row->getElementsByTagName('td') as $col) {
}
}
}
}
}
How can I go through all the tables, field by field, as described in the second snippet?
This is my solution:
foreach ($raumplan->getElementsByTagName('tr') as $superRow) {
if ($superRow->getElementsByTagName('table')->length > 0) {
foreach ($superRow->getElementsByTagName('td') as $superCol) {
if ($superCol->getElementsByTagName('table')->length > 0) {
foreach ($superCol->getElementsByTagName('table') as $table) {
foreach ($table->getElementsByTagName('tr') as $row) {
foreach ($row->getElementsByTagName('td') as $col) {
}
}
}
}
}
}
}
It checks if you are in the outer table by looking if there is a table contained in the element.
You could use XPath to eliminate a lot of the blatantly low-level iteration and reduce the apparent complexity of all this...
$xpath = new DOMXPath($document);
foreach ($xpath->query('//selector/for/superTable//table') as $table) {
// in case you really wanted them...
$superCol = $table->parentNode;
$superRow = $superCol->parentNode;
foreach ($table->getElementsByTagName('td') as $col) {
$row = $col->parentNode;
// do your thing with each cell here
}
}
You could drill down further than this, if you wanted -- if you just wanted every cell in the inner tables, you could reduce it to one loop over //selector/for/superTable//table//td.
Of course, if you're dealing with valid HTML, then you could just loop over each element's children as well. It all depends on what the HTML will look like, and exactly what you need from it.
Edit: If you can't use XPath for some reason, you could do something like
// I assume you've found $superTable already
foreach ($superTable->getElementsByTagName('table') as $table) {
$superCol = $table->parentNode;
$superRow = $superCol->parentNode;
foreach ($table->getElementsByTagName('td') as $col) {
$row = $col->parentNode;
// do your thing here
}
}
Note that neither solution bothers to iterate over the rows etc. That's a big part of what obviates the need to get only rows in the current table. You're only looking for tables within the table, which by definition (1) will be the sub-tables and (2) will be within a column within a row within the main table, and you can get the parent row and column from the table element itself.
Of course, both solutions assume you're only nesting tables one level deep. If it's more than that, you're going to want to look at a recursive solution and DOMElement's childNodes property. Or, a more narrowly focused XPath query.
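One possible shape for such a recursive solution is sketched below. It uses child-axis XPath queries (rather than childNodes directly) so that each level only sees its own rows and cells. The readTable function, the outer id, and the sample markup are hypothetical names introduced for this illustration.

```php
<?php
// Hypothetical sample: an outer table whose second cell holds an inner table.
$html = <<<HTML
<table id="outer">
  <tr>
    <td>plain</td>
    <td><table><tr><td>inner 1</td><td>inner 2</td></tr></table></td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

/**
 * Collect the text of each cell, descending into nested tables recursively.
 * Rows and cells are selected as direct children only, so an inner table's
 * rows are never mistaken for rows of the table that contains it.
 */
function readTable(DOMXPath $xpath, DOMNode $table): array
{
    $rows = [];
    // "tr | tbody/tr" covers markup with and without an explicit tbody.
    foreach ($xpath->query('tr | tbody/tr', $table) as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $inner = $xpath->query('table', $td);
            $cells[] = $inner->length > 0
                ? readTable($xpath, $inner->item(0)) // nested table: recurse
                : trim($td->textContent);            // plain cell: keep its text
        }
        $rows[] = $cells;
    }
    return $rows;
}

$superTable = $xpath->query("//table[@id='outer']")->item(0);
print_r(readTable($xpath, $superTable));
```

A cell that contains a table is represented by a nested array of that table's rows, so the structure works at any nesting depth. This sketch only follows the first table found directly inside a cell.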

Simple html dom - all tr except the first one

I would like to find all <tr> elements starting from the second, but I don't know how to get it right.
$items = $html->find('tr');
That piece of code gets all trs, but I want every one except the first because that one contains <th>.
Just cut off the first element.
$items = array_slice($html->find('tr'), 1);
When you get your list with $html->find('tr'), make a loop that skips the first index/row.
If Simple HTML DOM works like jQuery, try something like this:
$items = $html->find('tr:not(:has(th))');
As PoulsQ suggests you CAN do it like
$firstTr = true;
foreach($html->find('tr') as $tr) {
if(!$firstTr) {
// YOUR LOGIC FOR A TR HERE
}
else {
$firstTr = false;
}
}
But I think it would be nicer code if you query the DOM to ignore the first element.
You can get all trs from $html->find('tr'); then add a condition that checks whether a row's child element is a th tag and, if it is, skip that tr.
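If querying the DOM directly is an option, a rough equivalent with PHP's built-in DOMXPath (a swapped-in technique, not Simple HTML DOM) could look like the sketch below. The sample table and the $names variable are assumptions for illustration.

```php
<?php
// Hypothetical sample table with a header row.
$html = <<<HTML
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.90</td></tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$names = [];
// "tr[not(th)]" keeps only rows without a th child, wherever the header sits.
foreach ($xpath->query('//tr[not(th)]') as $tr) {
    $names[] = trim($xpath->query('td[1]', $tr)->item(0)->textContent);
}

print_r($names);
```

Filtering on the absence of th avoids relying on the header always being the first row.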

XPath query is sometimes not showing the right elements

I am using XPath, and this is my query:
$elements = $xpath->query('//div/div/div/div/div/div[@id="con1"]/table/tr/td');
And everything works fine.
Then I change the condition in the div, and the query is like this:
$elements = $xpath->query('//div/div/div/div/div/div[@id="con2"]/table/tr/td');
And I do see what I must see.
But later, if I do this:
$elements = $xpath->query('//div/div/div/div/div/div[@id="con1" or @id="con2"]/table/tr/td');
I see again only the elements of con1. Why is that?
The full code is below:
$elements = $xpath->query('//div/div/div/div/div/div[@id="con1" or @id="con2"]/table/tr/td');
foreach ( $elements as $element ) {
$str1=$element->getAttribute('class');
$str2="first-td";
$str3="status";
if (strcmp($str1,$str2)==0) {
var_dump( $element->nodeValue);
}
if (strcmp($str1,$str3)==0) {
echo $element->childNodes->item(0)->getAttribute('class'). "<br />";
}
}
To sum up: If my condition is only con1, I see the correct results. If it's only con2, I see the correct results. The problem comes when I use the or. In that case, I see the results only from con1. It's as if it stops after fulfilling the first condition. They are at the same level of the DOM tree.
What you are trying to do is to retrieve <div id="con1"> and <div id="con2"> in the same expression, but what you are actually doing is to retrieve a div which either has an attribute id="con1" or id="con2". The first expression of the condition returns true and then you get the <div id="con1"> node. It makes sense.
To get both nodes you need something like:
//div[@id="con1"]|//div[@id="con2"]
Note: //div[@id="con1"] finds the <div id="con1"> node anywhere in the tree, and an id has to be unique within a document, so it's not necessary to spell out the whole path down to it.
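A small sketch of the union expression in use, with the rest of the path appended to each branch. The markup is a made-up stand-in for the asker's page, which is not shown.

```php
<?php
// Hypothetical markup: two containers at the same level of the tree.
$html = <<<HTML
<div id="con1"><table><tr><td class="first-td">one</td></tr></table></div>
<div id="con2"><table><tr><td class="first-td">two</td></tr></table></div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The "|" operator merges the node sets selected by both paths.
$elements = $xpath->query(
    '//div[@id="con1"]/table/tr/td | //div[@id="con2"]/table/tr/td'
);

$values = [];
foreach ($elements as $element) {
    $values[] = $element->nodeValue;
}

print_r($values);
```

The merged node set is returned in document order, so the cells of con1 come before those of con2 here.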

Accounting for missing array keys, within PHP foreach loop

I'm parsing a document for several different values, with PHP and Xpath. I'm throwing the results/matches of my Xpath queries into an array. So for example, I build my $prices array like this:
$prices = array();
$result = $xpath->query("//div[@class='the-price']");
foreach ($result as $object) {
    $prices[] = $object->nodeValue;
}
Once I have my array built, I loop through and throw the values into some HTML like this:
$i = 0;
foreach ($links as $link) {
echo <<<EOF
<div class="the-product">
<div class="the-name"><a title="{$names[$i]}" href="{$link}" target="blank">{$names[$i]}</a></div>
<br />
<div class="the-image"><a title="{$names[$i]}" href="{$link}" target="blank"><img src="{$images[$i]}" /></a></div>
<br />
<div class="the-current-price">Price is: <br> {$prices[$i]}</div>
</div>
EOF;
$i++; }
The problem is, some items in the original document that I'm parsing don't have a price, as in, they don't even contain <div class='the-price'>, so my Xpath isn't finding a value, and isn't inserting a value into the $prices array. I end up returning 20 products, and an array which contains only 17 keys/values, leading to Notice: Undefined offset errors all over the place.
So my question is, how can I account for items that are missing key values and throwing off my arrays? Can I insert dummy values into the array for these items? I've tried as many different solutions as I can think of. Mainly, IF statements within my foreach loops, but nothing seems to work.
Thank you
I suggest you look for an element inside your html which is always present in your "price"-loop. After you find this object you start looking for the "price" element, if there is none, you insert an empty string, etc. into your array.
Instead of directly looking for the the-price elements, look for the containing the-product. Loop on those, then do a subquery using those nodes as the starting context. That way you get all of the the-product nodes, plus the prices for those that have them.
e.g.
$prices = array();
$products = $xpath->query("//div[@class='the-product']");
foreach ($products as $product) {
    // The leading "." restricts the search to the current product node;
    // a bare "//" would search the whole document again.
    $price = $xpath->query(".//div[@class='the-price']", $product);
    if ($price->length > 0) {
        $prices[] = $price->item(0)->nodeValue;
    } else {
        $prices[] = ''; // placeholder for products without a price
    }
}
If you don't want to show the products that don't have a price attached to them you could check if $prices[$i] is set first.
$i = 0;
foreach ($links as $link) {
    if (isset($prices[$i])) {
        // echo content
    }
    $i++;
}
Or if you wanted to fill it with dummy values you could say
$prices = array_merge($prices,
array_fill(count($prices), count($links)-count($prices),0));
And that would insert 0 as a dummy value for any remaining values. array_fill starts off by taking the first index of the array (so we start one after the amount of keys in $prices), then how many we need to fill, so we subtract how many are in $prices from how many are in $links, then we fill it with the dummy value 0.
Alternatively you could use the same logic in the first example and just apply that by saying:
echo isset($prices[$i]) ? $prices[$i] : '0';
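To make the array_fill padding above concrete, here is a small sketch with made-up $links and $prices values. Note that the padding is appended at the end of the array, so it only silences the undefined-offset notices. If the product without a price sits in the middle of the list, the later prices still end up next to the wrong products.

```php
<?php
// Hypothetical data: five products, but only three prices were found.
$links  = ['/p/1', '/p/2', '/p/3', '/p/4', '/p/5'];
$prices = ['1.00', '2.00', '3.00'];

// Number of missing entries; max() guards against a negative count.
$missing = max(0, count($links) - count($prices));

// Append one dummy value (0) for every missing price.
$prices = array_merge($prices, array_fill(count($prices), $missing, 0));

print_r($prices);
```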
Hard to understand the relation between $links and $prices with the code shown. Since you are building the $prices array without any relation to the $links array, I don't see how you would do this.
Is $links also built via xpath? If so, is 'the-price' div always nested within the DOM element used to populate $links?
If it is you could nest your xpath query to find the price within the query used to find the links and use a counter to match the two.
i.e.
$links_result = $xpath->query('path-to-link');
$i = 0;
foreach ($links_result as $link_object) {
    $links[$i] = $link_object->nodeValue;
    // pass $link_object as context reference to xpath query looking for price
    $price_result = $xpath->query('path-to-price-within-link-node', $link_object);
    // query() returns a DOMNodeList, so check that it actually contains a node
    if (false !== $price_result && $price_result->length > 0) {
        $prices[$i] = $price_result->item(0)->nodeValue;
    } else {
        $prices[$i] = 0; // or whatever value you want to show to indicate that no price was available.
    }
    $i++;
}
Obviously, there could be additional handling in there to verify that only one price value exists per link node and so forth, but that is the basic idea.
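Putting the context-node idea together, a minimal end-to-end sketch might look like this. The sample markup, the 'n/a' placeholder, and the assumption that every product is wrapped in a the-product div are illustrative, not taken from the asker's document.

```php
<?php
// Hypothetical markup: the second product has no price element at all.
$html = <<<HTML
<div class="the-product"><div class="the-name">Lamp</div><div class="the-price">19.99</div></div>
<div class="the-product"><div class="the-name">Chair</div></div>
<div class="the-product"><div class="the-name">Desk</div><div class="the-price">120.00</div></div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$names = [];
$prices = [];

// Loop over the containers, then look inside each one for its own fields.
foreach ($xpath->query("//div[@class='the-product']") as $product) {
    $name  = $xpath->query(".//div[@class='the-name']", $product);
    $price = $xpath->query(".//div[@class='the-price']", $product);

    // Every product adds exactly one entry to each array, found or not.
    $names[]  = $name->length > 0 ? trim($name->item(0)->nodeValue) : '';
    $prices[] = $price->length > 0 ? trim($price->item(0)->nodeValue) : 'n/a';
}

print_r(array_combine($names, $prices));
```

Since both arrays grow by one entry per product, index $i in $names always refers to the same product as index $i in $prices.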
