<?php
// load SimpleXML
$entry = new SimpleXMLElement('http://bit.ly/c3IqMF', null, true);
echo <<<EOF
<table>
<tr>
<th>Title</th>
<th>Image</th>
</tr>
EOF;
foreach($entry as $item) //
{
echo <<<EOF
<tr>
<td>{$item->title}</td>
<td><img src="{$item->children('im', true)->image}"></td>
</tr>
EOF;
}
echo '</table>';
?>
The PHP above works, but somehow I get 8 empty table rows above the result:
<tr>
<td></td>
<td><img src=""></td>
</tr>
What's wrong with the code? How do I get rid of the empty table rows?
The way you have it now, the loop also picks up the <id>, <title>, <updated> elements from the start of the XML. What you actually need is all the entry elements in the XML, so it should be $entry->entry:
foreach($entry->entry as $item) //
{
echo <<<EOF
<tr>
<td>{$item->title}</td>
<td><img src="{$item->children('im', true)->image}"></td>
</tr>
EOF;
}
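To make the difference concrete, here is a minimal sketch that runs against an inline stand-in for the feed. The element names mirror the question, but the titles, image URLs and the namespace URI bound to the "im" prefix are made up for illustration, so treat it as a sketch rather than the real feed's structure.

```php
<?php
// Inline stand-in for the feed. Titles, URLs and the "im" namespace URI
// are placeholders, not values taken from the real feed.
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://example.com/im">
  <id>http://example.com/feed</id>
  <title>Sample feed</title>
  <updated>2010-01-01T00:00:00Z</updated>
  <entry>
    <title>First app</title>
    <im:image>http://example.com/first.png</im:image>
  </entry>
  <entry>
    <title>Second app</title>
    <im:image>http://example.com/second.png</im:image>
  </entry>
</feed>
XML;

$feed = new SimpleXMLElement($xml);

// Looping over $feed itself would also visit <id>, <title> and <updated>.
// $feed->entry restricts the loop to the <entry> children.
$rows = [];
foreach ($feed->entry as $item) {
    $rows[] = [
        'title' => (string) $item->title,
        // children('im', true) selects child elements by namespace prefix
        'image' => (string) $item->children('im', true)->image,
    ];
}

foreach ($rows as $row) {
    echo $row['title'], ' => ', $row['image'], "\n";
}
```

Only the two <entry> elements produce rows here; the feed-level <id>, <title> and <updated> are never visited.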
Honestly, I think you are approaching this the wrong way. Since it seems that you are trying to parse an Atom feed, try using something designed for that, like Magpie RSS. It will probably save you a lot of time.
I am trying to parse an HTML table to get the content of the <td> ID HERE </td> cells using XPath and PHP.
Executing the following line
$doc->loadHTMLFile($file);
gives me warnings like this:
PHP Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : tr in...
That's why I am using the following block of code:
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
Trying to parse this: (the entire page here)
<table class="object-table" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th width="8%">something here</th>
<th width="89%">something here</th>
<th width="3%">something here</th>
</tr>
<tr class="normal-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
<tr class="odd-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
</tbody>
</table>
with the following code:
$file = "http://www.sportsporudy.gov.ua/catalog/#c[1]=1";
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$query = '//tr[@class="odd-row"]';
$elements = $xpath->query($query);
printf("Size of array: %d\n", sizeof($elements));
printElements($elements);
I also tried different queries like
//table[@class="object-table"]/tbody/tr ...
but none of them seems to give me the td tags I need. Maybe that's because of the broken HTML.
Thanks for your advice.
Your code is essentially fine.
The only error I found is in printing the length of $elements: $elements is a DOMNodeList, not an array, so to retrieve its length you have to use this syntax:
printf( "Size of array: %d\n", $elements->length );
But the major problem with your page is that the HTML contains only one table with one row: the remaining data is filled in by JavaScript, so you can't retrieve it directly through DOMXPath.
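For reference, here is a small sketch of the same query run against static markup, where the rows really are present in the HTML. The table content is invented for illustration; only the class names follow the question.

```php
<?php
// Made-up markup shaped like the table in the question.
$htmlSource = <<<HTML
<table class="object-table">
  <tr><th>ID</th><th>Name</th></tr>
  <tr class="normal-row"><td>101</td><td>Stadium A</td></tr>
  <tr class="odd-row"><td>102</td><td>Stadium B</td></tr>
  <tr class="odd-row"><td>103</td><td>Stadium C</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($htmlSource);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Attribute tests in XPath use "@". The query returns a DOMNodeList, not an array.
$rows = $xpath->query('//tr[@class="odd-row"]');
printf("Rows found: %d\n", $rows->length);

$ids = [];
foreach ($rows as $row) {
    // td[1] is the first <td> child of the current row (XPath positions start at 1)
    $ids[] = trim($xpath->evaluate('string(td[1])', $row));
}
print_r($ids);
```

If the same script reports zero rows against the live page, that is consistent with the rows being injected by JavaScript after the page loads.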
Why does find('tr')[0] get table row 2 instead of table row 1?
This is my HTML; all tables have the same class and layout.
<table class="tablemenu">
<tbody>
<tr>
<td><b>hello</b></td>
<td><b>hi</b></td>
</tr>
<tr>
<td>hey</td>
<td>Alright</td>
<td>Good</td>
<td>Good</td>
<td><a>Date</a></td>
</tr>
</tbody>
</table>
<table class="tablemenu">
<tbody>
<tr>
<td><b>hello</b></td>
<td><b>hi</b></td>
</tr>
<tr>
<td>hey</td>
<td>Alright</td>
<td>Good</td>
<td>Good</td>
<td><a>Date</a></td>
</tr>
</tbody>
</table>
<table class="tablemenu">
<tbody>
<tr>
<td><b>hello</b></td>
<td><a>hi</a></td>
</tr>
<tr>
<td>hey</td>
<td>Alright</td>
<td>Good</td>
<td>Good</td>
<td><a>LINK</a></td>
</tr>
</tbody>
</table>
This is my php
<?php
include("simpleHtmlDom/simple_html_dom.php");
$html = new simple_html_dom();
// Load a file
$html->load_file('http://mySite.net/');
foreach($html->find('table[class=tablemenu]') as $element){
$Link = $element->find('tr')[0]->find('td')[4]->find('a')[0];
echo($Link->text());
echo '<br />';
}
?>
At first, to get the word 'Date', I tried
$Link = $element->find('tr')[1]->find('td')[4]->find('a')[0];
But that didn't work; it said undefined index.
Then, just messing around, I tried this and it works:
$Link = $element->find('tr')[0]->find('td')[4]->find('a')[0];
This gets the word Date for some reason, and I don't understand why. I do need that, but
although it works, I now can't access table row 1 to grab, say, the word "hi".
I see one issue (originally two):
Your first <tr> only has 2 <td>s, so $element->find('tr')[0]->find('td')[4] should throw an exception.
Edit: OP fixed the pasted code.
Fix your markup. You're not properly closing your <tr> elements:
<table class="tablemenu">
<tbody>
<tr>
<td><b>hello</b></td>
<td><b>hi</b></td>
</tr> <!-- close this! -->
<tr>
<td>hey</td>
<td>Alright</td>
<td>Good</td>
<td><a>Date</a></td>
</tr> <!-- close this! -->
</tbody>
</table>
The indexing is wrong because you are not closing the tr tags properly.
The link should be at index 1 (the second row) instead of index 0:
$Link = $element->find('tr')[1]->find('td')[4]->find('a')[0];
To print hi try
echo $element->find('tr')[0]->find('td')[1]->find('b')[0]->text();
Full code
foreach($html->find('table[class=tablemenu]') as $element){
$Link = $element->find('tr')[1]->find('td')[4]->find('a')[0];
echo($Link->text());
echo '<br />';
echo $element->find('tr')[0]->find('td')[1]->find('b')[0]->text();
}
If the above does not work, then look for the tr inside tbody, like this:
$Link = $element->find('tbody', 0)->find('tr')[1]->find('td')[4]->find('a')[0];
Also for debugging, try this
foreach($html->find('table[class=tablemenu]') as $element){
echo '<pre>';
var_dump($element);// find the object here
echo '</pre>';
}
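A more compact way to debug this kind of indexing problem is to print how many cells the parser sees in each row. The sketch below uses PHP's built-in DOMDocument rather than Simple HTML DOM (a swapped-in technique, so the method names differ), and the markup is a shortened copy of one table from the question.

```php
<?php
// Shortened copy of one table from the question.
$htmlSource = <<<HTML
<table class="tablemenu">
  <tr><td><b>hello</b></td><td><b>hi</b></td></tr>
  <tr><td>hey</td><td>Alright</td><td>Good</td><td>Good</td><td><a>Date</a></td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($htmlSource);
libxml_clear_errors();

$rows = $doc->getElementsByTagName('tr');

// Record and print the number of <td> cells per row; indexes start at 0.
$summary = [];
for ($i = 0; $i < $rows->length; $i++) {
    $cells = $rows->item($i)->getElementsByTagName('td');
    $summary[$i] = $cells->length;
    printf("row %d has %d cells\n", $i, $cells->length);
}

// With well-formed rows, "hi" is row 0 / cell 1 and "Date" is row 1 / cell 4.
$hi   = $rows->item(0)->getElementsByTagName('td')->item(1)->textContent;
$date = $rows->item(1)->getElementsByTagName('td')->item(4)->textContent;
echo $hi, ' / ', $date, "\n";
```

If the counts differ from what the markup suggests (for example, one row reporting all seven cells), the parser has merged rows, which points back at unclosed <tr> tags.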
OK, I've been reading up on PHP Simple HTML DOM and so far it works great.
I have a table which I'm trying to convert to a MySQL DB.
I'm using this:
foreach($html->find('TR') as $row) {
// ... process each row here ...
}
My table:
<TR BGCOLOR="CCDDFF">
<TD valign="top">
</TD>
</TR>
But how do I get the bgcolor from the tr?
Did you try the $row->getAttribute('bgcolor') method?
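As a self-contained illustration of reading an attribute off a row, here is a sketch using PHP's built-in DOMDocument, which also has a getAttribute() method on elements. This is a stand-in for the Simple HTML DOM call above, not that library's API, and the cell text is a placeholder.

```php
<?php
// The row from the question, wrapped in a <table> so it parses as a table row.
// The cell text "cell" is a placeholder.
$htmlSource = '<table><TR BGCOLOR="CCDDFF"><TD valign="top">cell</TD></TR></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($htmlSource);
libxml_clear_errors();

$colors = [];
// The HTML parser lowercases tag and attribute names, so 'tr' and 'bgcolor' match.
foreach ($doc->getElementsByTagName('tr') as $row) {
    if ($row->hasAttribute('bgcolor')) {
        $colors[] = $row->getAttribute('bgcolor');
    }
}
print_r($colors);
```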
I am trying to get the text of child elements using the PHP DOM.
Specifically, I am trying to get only the first <a> tag within every <tr>.
The HTML is like this...
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>
My sad attempt at it involved using foreach() loops, but would only return Array() when doing a print_r() on the $aVal.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(returnURLData($url));
libxml_use_internal_errors(false);
$tables = $dom->getElementsByTagName('table');
$aVal = array();
foreach ($tables as $table) {
foreach ($table as $tr){
$trVal = $tr->getElementsByTagName('tr');
foreach ($trVal as $td){
$tdVal = $td->getElementsByTagName('td');
foreach($tdVal as $a){
$aVal[] = $a->getElementsByTagName('a')->nodeValue;
}
}
}
}
Am I on the right track or am I completely off?
Put this code in test.php
require 'simple_html_dom.php';
$html = file_get_html('test1.php');
foreach ($html->find('table tr') as $row)
{
    // find('a', 0) returns the first <a> in the row, or null if there is none
    $link = $row->find('a', 0);
    if ($link !== null)
    {
        echo $link->plaintext;
    }
}
and put your html code in test1.php
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>
I am pretty sure I am late, but a better way would be to iterate through all the "tr" elements with getElementsByTagName and, for each node in the returned node list, call getElementsByTagName("a"). There is no need to iterate through that second node list: just pick its first element with item(0). That's it! Another way is to use XPath.
I personally don't like SimpleHtmlDom because it loads a lot of extra features when only a small piece of functionality is required. With heavy scraping, memory management issues can also hold you back; it's better to do the DOM analysis yourself rather than depend on a third-party library.
Just my opinion. I used SHD initially too, but later realized this.
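The approach described above can be sketched roughly as follows. The markup is invented for illustration: the link text loosely follows the question, while the href values and the "B" suffix are placeholders.

```php
<?php
// Made-up markup; href values are placeholders.
$htmlSource = <<<HTML
<table>
  <tr>
    <td><a href="/a1">1st Link</a></td>
    <td><a href="/a2">2nd Link</a></td>
  </tr>
  <tr>
    <td><a href="/b1">1st Link B</a></td>
    <td><a href="/b2">2nd Link B</a></td>
  </tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($htmlSource);
libxml_clear_errors();

$firstLinks = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    // item(0) is the first <a> anywhere inside this row, or null if the row has none
    $a = $tr->getElementsByTagName('a')->item(0);
    if ($a !== null) {
        $firstLinks[] = trim($a->textContent);
    }
}
print_r($firstLinks);
```

Note that getElementsByTagName() returns a node list, so nodeValue has to be read from a single node such as item(0), not from the list itself.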
You're not setting $trVal and $tdVal, yet you're looping over them?
I am parsing an HTML DOM string with the ganon DOM parser and want to get the plain text of the next element when a match is found on the previous element, e.g. my HTML looks like this:
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
I have used the following code for now
$html = str_get_dom('html string here');
foreach ($html('th.label') as $elem){
if($elem->getPlainText()=='SKU'){ //this is right
echo $elem->getSibling(1)->getPlainText(); // this is not working
}
}
If the th with class label and inner HTML SKU is found, then get the inner HTML of the next sibling, which is the SKU value.
Please help me sort this out.
It's probably a bug in "ganon" or in the HTML. If you take your example HTML:
$html = '<table>
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
</table>';
$html = str_get_dom($html);
For some reason, because of the newline in the HTML, "ganon" thinks that the next element is a text element, and only after it comes the desired td, so you have to do this:
foreach ($html('th.label') as $elem){
if($elem->getPlainText()=='SKU'){
//elem -> text node -> td node
echo($elem->getSibling(1)->getSibling(1)->getPlainText());
}
}
If you organize your HTML like this (without the newline):
$html = '<table>
<tr class="last even">
<th class="label">SKU</th><td class="data last">some sku here i want to get </td>
</tr>
</table>';
Then your original code will work: $elem->getSibling(1)->getPlainText()
Maybe consider using the PHP Simple HTML DOM class: it's much more intuitive, uses full OOP methods, is a jQuery-like DOM parser, and doesn't use this awful var-function method :):
require('simple_html_dom.php');
$html = '<table>
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
</table>';
$dom = str_get_html($html);
foreach($dom->find('th.label') as $el){
if($el->plaintext == 'SKU'){
echo($el->next_sibling()->plaintext);
}
}
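If you would rather not depend on whether a parser reports the whitespace between </th> and <td> as a text node, one option is to walk the siblings until the next element is reached. Below is a sketch with PHP's built-in DOMDocument; nextElement() is a hypothetical helper written for this example, not part of ganon, Simple HTML DOM or the DOM extension.

```php
<?php
// Hypothetical helper: returns the next sibling that is an element,
// skipping any text or comment nodes in between, or null if there is none.
function nextElement(DOMNode $node): ?DOMElement
{
    for ($n = $node->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n instanceof DOMElement) {
            return $n;
        }
    }
    return null;
}

$htmlSource = <<<HTML
<table>
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($htmlSource);
libxml_clear_errors();

$sku = null;
foreach ($doc->getElementsByTagName('th') as $th) {
    if (trim($th->textContent) === 'SKU') {
        // Works whether or not a whitespace text node sits between <th> and <td>
        $cell = nextElement($th);
        $sku  = $cell !== null ? trim($cell->textContent) : null;
    }
}
echo $sku, "\n";
```

The loop gives the same result for both layouts shown above, with or without the newline between the th and the td.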