<tr class='Jed01'>
<td height='20' class='JEDResult'>1</td>
<td height='30' class='JEDResult'>26.04.2013</td>
<td height='30' class='JEDResult'>19:43</td>
<td height='30' class='JEDResult'>Processing</td>
<td height='30' class='JEDResult'><a href="#" pressed="GetInfo(1233);" title=''>Jeddah</a></td>
</tr>
Result = first step - date - time - state - place
First of all, I am new to PHP, and I am trying to parse this data into my web page with PHP DOM, as was recommended to me earlier on Stack Overflow. In the code below I check each class to get the data, but I get no output at all, even though no error is reported. Where could my problem be?
Thanks in advance.
<?php
$input = "www.kalkatawi.com/luai.html"
$html = new DOMDocument();
$html->loadHTML($input);
foreach($html->getElementsByTagName('tr') as $tr)
{
if($tr->getAttribute('class') == 'Jed01')
{
foreach($html->getElementsByTagName('td') as $td)
{
if($td->getAttribute('class') == 'JEDResult')
{
echo ($td->nodeValue);
}
}
}
}
?>
Don't forget those semicolons ;)
Try this:
<?php
$input = file_get_contents("http://www.kalkatawi.com/luai.html");
$html = new DOMDocument();
$html->loadHTML($input);
foreach($html->getElementsByTagName('tr') as $tr)
{
if($tr->getAttribute('class') == 'Jed01')
{
foreach($tr->getElementsByTagName('td') as $td)
{
if($td->getAttribute('class') == 'JEDResult')
{
echo ($td->nodeValue);
echo '<br/>';
}
}
}
echo '<br/><br/>';
}
?>
Should output:
1
26.04.2013
19:43
Processing
Jeddah
2
26.04.2013
20:43
Printed
RIY
There are several problems with this code.
Loading the HTML
$input = 'MyLink';
$html = new DOMDocument();
$html->loadHTML($input);
This code attempts to treat the string 'MyLink' as HTML, which obviously it is not. If that's your actual code then nothing would work beyond this point. Either provide proper HTML input or use loadHTMLFile to load HTML from a file.
Comparisons are case-sensitive
On the one hand there is this:
<tr class='Jed01'>
And on the other this:
if($tr->getAttribute('class') == 'JED01')
Since 'Jed01' != 'JED01' this will never be true. Either fix the casing or use a case-insensitive comparison such as strcasecmp to compare the classes.
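For illustration, here is a minimal sketch of the case-insensitive check using PHP's strcasecmp. The helper name hasClassIgnoringCase and the one-row sample markup are made up for this example; it also assumes the class attribute holds a single class name.

```php
<?php
// Hypothetical helper (not from the question): true when the element's class
// attribute equals $expected, ignoring letter case.
function hasClassIgnoringCase(DOMElement $element, string $expected): bool
{
    // strcasecmp() returns 0 when the two strings match case-insensitively.
    return strcasecmp($element->getAttribute('class'), $expected) === 0;
}

// Made-up one-row table, just enough to have a <tr class='Jed01'> to test.
$doc = new DOMDocument();
$doc->loadHTML("<table><tr class='Jed01'><td>1</td></tr></table>");
$row = $doc->getElementsByTagName('tr')->item(0);

var_dump(hasClassIgnoringCase($row, 'JED01'));    // case-insensitive comparison
var_dump($row->getAttribute('class') == 'JED01'); // plain == is case-sensitive
```

Fixing the casing in the comparison is still the simpler option if you control the code; the helper is only worth it when the markup's casing is inconsistent.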
Objects cannot be printed
This results in a fatal error:
echo ($td);
What it should be instead: most likely echo $td->nodeValue, but other possibilities are open depending on what you want to do.
But you could do it much more easily with XPath
$xpath = new DOMXPath($html);
$query = "//tr[@class='Jed01']//td[@class='JEDResult']"; // google XPath syntax
foreach ($xpath->query($query) as $node) {
print_r($node->nodeValue);
}
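If you want something you can paste and run, here is a self-contained sketch of the XPath approach. The sample markup is modelled on the row from the question (the second row is invented), and cellsByClass is a hypothetical helper name. Note that XPath attribute tests are written with @, as in @class.

```php
<?php
// Sample markup modelled on the question's row; the 'Other' row is made up
// to show that non-matching rows are skipped.
$sample = <<<HTML
<table>
<tr class='Jed01'>
<td class='JEDResult'>1</td>
<td class='JEDResult'>26.04.2013</td>
<td class='JEDResult'><a href="#">Jeddah</a></td>
</tr>
<tr class='Other'>
<td class='JEDResult'>ignored</td>
</tr>
</table>
HTML;

// Hypothetical helper: text of every matching cell inside matching rows.
function cellsByClass(string $markup, string $rowClass, string $cellClass): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // The class names are hard-coded by the caller here; do not build
    // queries like this from untrusted input.
    $query = "//tr[@class='$rowClass']//td[@class='$cellClass']";

    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

print_r(cellsByClass($sample, 'Jed01', 'JEDResult'));
```

Returning an array instead of echoing inside the loop makes it easier to format the cells however you like afterwards.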
Trying to scrape data out of a table on a website. I got the following PHP written but it isn't working.
Following error received: Notice: Trying to get property of non-object in DataScraping.php on line 27
//Sets the HTML DOM Library
require_once 'C:/xampp/php/lib/SimpleHTMLDOM/simple_html_dom.php';
$html = new simple_html_dom();
$html = file_get_html('https://www.flightradar24.com/data/flights/british-airways-ba-baw');
foreach($html->find('table[id=tbl-datatable]') as $datatable) {
foreach($datatable->find('tr') as $tr) {
foreach($tr->find('td') as $td) {
if(strpos($td->find('a', 0)->href, 'https://www.flightradar24.com/data/flights/') !== false) {
echo $td->find('a', 0)->innertext .", " .$td->find('a', 0)->href;
}
}
}
}
Also worth mentioning: this data is publicly available and it is only for personal use. Please don't comment about copyright infringement; there is nothing wrong with what I want to do.
I'm simply trying to scrape the flight number only, both the inner text and the URL that sits behind it. Any help on where I'm going wrong?
Additional test provides the data I need but with the same error in between rows:
foreach($html->find('table[id=tbl-datatable]') as $datatable) {
foreach($datatable->find('tr') as $tr) {
foreach($tr->find('td') as $td) {
if (strpos($td->find('a', 0)->href, '/data/flights/') !== false) {
$test = $td->find('a', 0)->href;
$test2 = $td->find('a', 0)->innertext;
echo $test .", " .$test2;
}
}
}
}
You're trying to access elements of a null reference in your if statement itself, because not all of the <TD> tags have <A> tags in them. When there's no <A> tag in $td, $td->find('a', 0) is null, so
$td->find('a', 0)->href
is just what your error message said: "trying to get [a] property of [a] non-object".
You can fix this by checking the result of find() for null with an if:
$atag = $td->find('a', 0);
if ($atag) {
// ...
}
And you can fold this into your single if statement with the && operator. I found a couple of other problems when running your code:
The hrefs in the table in that site's source are all relative, not absolute, so checking for 'https://www.flightradar24.com' matches none of them.
You're not adding a newline at the end of your echo.
So to summarize my suggestions, something like this seems to work:
foreach($tr->find('td') as $td) {
$atag = $td->find('a', 0);
if($atag && strpos($atag->href, '/data/flights/') !== false) {
echo $atag->innertext . ", " . $atag->href . "\n";
}
}
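The same guard applies if you use PHP's built-in DOM extension rather than simple_html_dom. As a rough sketch (this swaps in DOMDocument, it is not the simple_html_dom API; the sample rows and the flightLinks name are invented for the example): DOMNodeList::item(0) returns null when a cell has no <a>, so test it before reading attributes.

```php
<?php
// Made-up rows: only the first cell of each row contains a link.
$sample = '<table id="tbl-datatable">
<tr><td><a href="/data/flights/ba1">BA1</a></td><td>no link here</td></tr>
<tr><td><a href="/data/airports/lhr">LHR</a></td><td>also none</td></tr>
</table>';

// Hypothetical helper: "text, href" for every cell whose first link points
// at a flight page.
function flightLinks(string $markup): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parse warnings out of the output
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        // item(0) is null when the cell has no <a>, so check before using it.
        $a = $td->getElementsByTagName('a')->item(0);
        if ($a !== null && strpos($a->getAttribute('href'), '/data/flights/') !== false) {
            $links[] = $a->textContent . ', ' . $a->getAttribute('href');
        }
    }
    return $links;
}

echo implode("\n", flightLinks($sample)), "\n";
```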
I'm having a problem with code working on one PHP install and not working on the other. Maybe one install is more forgiving when it comes to the error.
When I upload to production I receive the following error:
PHP Fatal error: Call to undefined method DOMText::getElementsByTagName() in ...
The line causing the error is:
$tds = $tr->getElementsByTagName('td');
I have a feeling the issue is related to calling the getElementsByTagName method from within DOMText:: instead of DOMDocument:: (the docs seem to make this obvious) but out of my lack of understanding of what I've done wrong, I am not sure how to address the issue.
Here's my code:
<?php
// The HTML
$table_html = '<table>
<thead>
<tr>
<td>AAA</td>
<td>BBB</td>
</tr>
</thead>
<tbody>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</tbody>
</table>';
// Create DOM Document
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->formatOutput = true;
@$document->loadHTML($table_html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOEMPTYTAG);
// Change TD's to TH's in THEAD's
$theads = $document->getElementsByTagName('thead')->item(0);
if ($theads) {
foreach($theads->childNodes AS $tr) {
$tds = $tr->getElementsByTagName('td'); // <---- This is where the error occurs
if ($tds->length > 0) {
$i = $tds->length - 1;
while($i > -1) {
$td = $tds->item($i); // td
$text = $td->nodeValue; // text node
$th = $document->createElement('th', $text); // th element with td node value
$td->parentNode->replaceChild($th, $td); // replace
$i--;
}
}
}
}
// Output
echo $document->saveHTML();
Problem is, childNodes includes the whitespace text nodes between each tag.
To get the <tr> tags in $theads, use getElementsByTagName, eg
foreach ($theads->getElementsByTagName('tr') as $tr) {
// ...
}
Alternatively, if you're after all the <td> elements in the first <thead>, try XPath
$xpath = new DOMXPath($document);
$tds = $xpath->query('//thead[1]/tr/td'); // xpath indexes are 1-based
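Putting the getElementsByTagName suggestion together, a complete version of the td-to-th conversion might look like the sketch below. The function name headerCellsToTh is made up, and the table is the question's markup with the whitespace removed. One deliberate difference from the question's code: it moves the child nodes into the new th instead of copying nodeValue, so any nested markup is kept.

```php
<?php
// The question's table, written without whitespace between tags.
$tableHtml = '<table><thead><tr><td>AAA</td><td>BBB</td></tr></thead>'
    . '<tbody><tr><td>aaa</td><td>bbb</td></tr></tbody></table>';

// Hypothetical helper: returns the markup with every <td> inside the first
// <thead> replaced by a <th>.
function headerCellsToTh(string $markup): string
{
    $document = new DOMDocument();
    $document->loadHTML($markup, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $thead = $document->getElementsByTagName('thead')->item(0);
    if ($thead === null) {
        return $document->saveHTML(); // nothing to convert
    }

    // getElementsByTagName() only yields elements, so no text nodes show up.
    foreach ($thead->getElementsByTagName('tr') as $tr) {
        // The node list is live, so snapshot it before replacing nodes.
        $cells = iterator_to_array($tr->getElementsByTagName('td'));
        foreach ($cells as $td) {
            $th = $document->createElement('th');
            // Move the children across rather than copying nodeValue.
            while ($td->firstChild) {
                $th->appendChild($td->firstChild);
            }
            $td->parentNode->replaceChild($th, $td);
        }
    }
    return $document->saveHTML();
}

echo headerCellsToTh($tableHtml);
```

Taking a snapshot with iterator_to_array is an alternative to walking the list backwards as the question does; either way the point is not to iterate a live list forwards while removing its members.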
I have a simple HTML construct:
<table>
<tr>
<td class="myStyle">
<a href="URLa">Name of URL a</a>
</td>
</tr>
<tr>
<td class="myStyle">
<a href="URLb">Name of URL b</a>
</td>
</tr>
</table>
Now I want to find out with PHP DOM how to get this URL and perhaps the name.
while($table = $tables->item($i++))
{
$class_node = $table->attributes->getNamedItem('class');
if($class_node)
{
if ($table->attributes->getNamedItem('class')->value == "myStyle") {
$links = $tables->item($i)->getElementsByTagName('a');
foreach ($links as $link) {
echo "<br>" . $link->nodeName;
}
//echo "Class is : " . $table->attributes->getNamedItem('class')->value . PHP_EOL;
}
}
}
So far I can print out every row that has the class "myStyle", but I can't get access to any value or href from there.
I know XPath is much more convenient, but I want to try it with DOM first.
With XPath I know my position when traversing the DOM. Here with DOM I guess my loop isn't at the position I want.
Proceed like this..
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('td') as $tag) {
if ($tag->getAttribute('class') === 'myStyle') {
foreach ($tag->getElementsByTagName('a') as $atag) {
echo $atag->getAttribute('href');
}
}
}
OUTPUT :
URLa
URLb
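Since the question also asks for the name, here is a small sketch that collects both the href and the link text. It assumes each myStyle cell wraps its text in a link such as <a href="URLa">, which is what the output above implies; linksInCells is a hypothetical helper name.

```php
<?php
// Assumed markup: each cell holds one link whose href is URLa / URLb.
$html = '<table>
<tr><td class="myStyle"><a href="URLa">Name of URL a</a></td></tr>
<tr><td class="myStyle"><a href="URLb">Name of URL b</a></td></tr>
</table>';

// Hypothetical helper: href and visible text of every link found inside
// a <td> carrying the given class.
function linksInCells(string $markup, string $class): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($markup);

    $links = [];
    foreach ($dom->getElementsByTagName('td') as $td) {
        if ($td->getAttribute('class') !== $class) {
            continue; // not one of the cells we are after
        }
        foreach ($td->getElementsByTagName('a') as $a) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'name' => trim($a->nodeValue), // text between <a> and </a>
            ];
        }
    }
    return $links;
}

foreach (linksInCells($html, 'myStyle') as $link) {
    echo $link['href'], ' => ', $link['name'], "\n";
}
```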
I want to get all the td of the ninth table element in an HTML page.
I started with this, but I don't know how to finish it:
define('GLPI_ROOT', '..');
$content = GLPI_ROOT . "/front/yourpage.html";
$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$attbs = $xpath->query("//table td");
foreach($attbs as $a) {
print $a->nodeValue;
}
I have also tried this one, but it didn't work:
$dom = new DOMDocument();
$dom->loadHTMLFile("yourpage.html");
$tables = $dom->getElementsByTagName('table');
$table = $tables->item(8);
foreach ($table->childNodes as $td) {
if ($td->nodeName == 'td') {
echo $td->nodeValue, "\n";
echo "<script type=\"text/javascript\"> alert('".$td->nodeValue."');</script>";
}
}
I'm getting this error :
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in yourpage.html, line: 33
Your first one doesn't work because you don't seem to be understanding how XPath works, but your second one is probably a better option.
That said, when has <td> ever been a first-level child of <table>? You could use getElementsByTagName again on the $table, that'd work quite well.
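Roughly, that could look like the sketch below. The helper name cellsOfTable and the two-table sample are invented; for the ninth table on a real page you would pass index 8. Using libxml_use_internal_errors(true) is one way to keep parser warnings such as the htmlParseEntityRef one (typically caused by an unescaped & in the page) out of your output.

```php
<?php
// Hypothetical helper: trimmed text of every <td> inside the table at the
// given zero-based position, or an empty array if there is no such table.
function cellsOfTable(DOMDocument $dom, int $index): array
{
    $table = $dom->getElementsByTagName('table')->item($index);
    if ($table === null) {
        return [];
    }

    $values = [];
    // Searches all descendants, so cells nested under <tr> are found.
    foreach ($table->getElementsByTagName('td') as $td) {
        $values[] = trim($td->nodeValue);
    }
    return $values;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing
// Made-up page with two tables; a real page would come from loadHTMLFile().
$dom->loadHTML(
    '<table><tr><td>a</td></tr></table>'
    . '<table><tr><td>b</td><td>c &amp; d</td></tr></table>'
);
libxml_clear_errors();

print_r(cellsOfTable($dom, 1)); // second table here; use 8 for the ninth
```

Keep in mind that getElementsByTagName('table') counts nested tables too, in document order, so the ninth table element is not necessarily the ninth top-level one.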
I wrote some PHP code to extract the content of certain elements from other websites. The elements are addressed by XPath. The code worked for one website but failed for another, so I am sure the code is not entirely wrong.
By the way, I got each element's XPath by using 'Inspect Element' in Firefox, right-clicking the element and choosing 'Copy XPath'.
What is wrong for the second website?
Thanks.
Here is the code:
//MyCode.PHP
<html>
<head>
<title>This is the title</title>
</head>
<body>
<?php
class EmDIV
{
public $url="";
public $content="";
public $name="";
public $query="";
public function EmDIV($CdivName,$Curl,$CQuery)
{
$this->name=$CdivName;
$this->url=$Curl;
$this->query=$CQuery;
$html = new DOMDocument();
@$html->loadHtmlFile($this->url);
$bodies = $html->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($this->query);
//echo @$body->saveHTML();
if($nodelist->length==1)
{
$this->content=$nodelist->item(0)->textContent;
//sanitizing
//$this->content=Jsoup.clean($this->content, Whitelist.basic());
}
//echo $nodelist->item(0)->nodeName;
foreach ($nodelist as $node)
echo $node->getNodePath()."\n";
}
}
$emdiv=array(
//new EmDIV('parsmalaysia','http://www.parsmalaysia.com/exchange.html','/html/body/div/div[5]/div/div/div/div/div/div/div/div/div/div/table/tbody/tr/td[4]/text()'),
new EmDIV('atlas-exchange','http://atlas-exchange.com/','/html/body/div/div/table/tr/td/table/tr/td/div/div[10]/div/div/div/table/tr[3]/td[2]/text()'),
new EmDIV('usunmalaysia','http://www.usunmalaysia.com/Home.aspx','/html/body/form/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[4]/td/table/tbody/tr/td/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody/tr[4]/td[2]/text()'),
);
?>
<table border="1">
<tr>
<td>Site</td>
<td>RM Price</td>
</tr>
<?php
foreach ($emdiv as $ed)
{
echo "<tr>";
echo "<td>".$ed->name."</td>";
echo "<td>".$ed->content."</td>";
echo "</tr>";
}
?>
</table>
</body>
</html>
The second page has no tbody elements, so you have to remove all the tbody steps from the path. Firefox shows tbody elements because they are part of its internal structure for tables, but they are not in the markup you are retrieving.
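If you'd rather not edit the copied paths by hand, a small helper can drop the tbody steps for you. This is only a sketch: stripTbodySteps is a made-up name, the sample table is invented, and it assumes the retrieved markup really has no tbody tags (if the page's source does contain them, stripping the steps would break the path instead).

```php
<?php
// Hypothetical helper: removes every "/tbody" or "/tbody[n]" location step
// from an XPath copied out of the browser.
function stripTbodySteps(string $path): string
{
    return preg_replace('#/tbody(\[\d+\])?(?=/|$)#', '', $path);
}

// Made-up markup with no <tbody>, like the page as it is retrieved.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><table><tr><td>1</td><td>3.25</td></tr></table></body></html>');
$xpath = new DOMXPath($doc);

// A path as the browser would report it, with the tbody it adds internally.
$copied = '/html/body/table/tbody/tr/td[2]/text()';

var_dump($xpath->query($copied)->length);                  // path as copied
var_dump($xpath->query(stripTbodySteps($copied))->length); // tbody steps removed
```

A shorter relative query such as //table//tr[3]/td[2] is often more robust than a long absolute path, since it keeps working when the page's outer layout changes.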