How to get the hyperlink with PHP DOM? - php

I have a simple HTML construct:
<table>
<tr>
<td class="myStyle">
<a href="URLa">Name of URL a</a>
</td>
</tr>
<tr>
<td class="myStyle">
<a href="URLb">Name of URL b</a>
</td>
</tr>
</table>
Now I want to find out with PHP DOM how to get this URL and perhaps the name.
while($table = $tables->item($i++))
{
$class_node = $table->attributes->getNamedItem('class');
if($class_node)
{
if ($table->attributes->getNamedItem('class')->value == "myStyle") {
$links = $tables->item($i)->getElementsByTagName('a');
foreach ($links as $link) {
echo "<br>" . $link->nodeName;
}
//echo "Class is : " . $table->attributes->getNamedItem('class')->value . PHP_EOL;
}
}
}
So far I can print out every row that has the class "myStyle", but I can't get access to any value or href from there.
I know XPath is much more convenient, but I want to try it with DOM first.
With XPath I know my position while traversing the DOM. Here, with plain DOM, I suspect my loop is not at the position I want.

Loop over the td elements, check the class attribute, and read the href of each a element inside the matching cells:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('td') as $tag) {
if ($tag->getAttribute('class') === 'myStyle') {
foreach ($tag->getElementsByTagName('a') as $atag) {
echo $atag->getAttribute('href') . PHP_EOL;
}
}
}
OUTPUT :
URLa
URLb
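Since the question also asks for the link name, here is a minimal sketch that collects both the href and the link text. It assumes the cells contain anchors such as `<a href="URLa">`; the `$links` array and its keys are illustrative names, not part of the original answer.

```php
<?php
// Sample markup assumed from the question; the href values are placeholders.
$html = '<table>
<tr><td class="myStyle"><a href="URLa">Name of URL a</a></td></tr>
<tr><td class="myStyle"><a href="URLb">Name of URL b</a></td></tr>
</table>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    // Skip cells that do not carry the class we are interested in.
    if ($td->getAttribute('class') !== 'myStyle') {
        continue;
    }
    // Search only inside the current cell, not the whole document.
    foreach ($td->getElementsByTagName('a') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'name' => trim($a->textContent),
        ];
    }
}

foreach ($links as $link) {
    echo $link['href'], ' => ', $link['name'], PHP_EOL;
}
```

The key point is that `getElementsByTagName('a')` is called on the current cell, so the inner loop stays at the position the outer loop has reached.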

Related

Unable to get both child elements with xpath from xhtml using xquery in php to manipulate

This is the XHTML data I need to get the child nodes from. I don't need the children of the TH nodes.
<table>some data</table>
<table>
<tr>
<td class="c2">PCI Signal Error (SERR#) Enable</td>
<td>Yes</td>
</tr>
<tr>
<td class="c1">Controller Type 1</td>
<td>CISS</td>
</tr>
<tr>
<td class="c2">bus type</td>
<td>CISS</td>
</tr>
<tr>
<th><a name="systempcibus5">PCI Bus 31</a></th>
<td>Device</td>
</tr>
</table>
Below is my latest attempt. I only want the textContent of the TDs in the above XML,
so that I can build a MySQL statement to insert the data.
I have tried many variations over the last week. I won't bore you with all of them, but I believe this is the closest to what I want.
I get this error:
PHP Notice: Trying to get property of non-object in C:\inetpub\wwwroot\reports\gec\test1.php on line 40
<?php
libxml_use_internal_errors(true);
$dom = new DomDocument;
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$nodes = $xpath->query('/html/body/table[2]/tr');
//$nodes = $xpath->query("//tr[contains(concat(' ', @class, ' '), ' head ') ");
//header("Content-type: text/plain");
$node_count=$nodes->length ;
for( $i = 1; $i <= intval($node_count); $i++)
{
$node_td1 = $xpath->query('/html/body/table[2]/tr[$i]/td[1]');
$node_td2 = $xpath->query('/html/body/table[2]/tr[$i]/td[2]');
$result1=$node_td1->textContent;
$result2=$node_td2->textContent;
echo $result1 . "," . $result2 . "<br>";
}
Alternatively, you could select each row itself, then filter its child nodes using ->tagName:
$dom = new DomDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DomXPath($dom);
$rows = $xpath->query('/html/body/table[2]/tr');
foreach ($rows as $row) {
foreach($row->childNodes as $col) {
if(isset($col->tagName) && $col->tagName != 'th') {
echo $col->textContent . '<br/>';
}
}
echo '<hr/>';
}
Or use XPath relative to each row:
foreach ($rows as $row) {
$col1 = $xpath->evaluate('string(./td[1])', $row);
$col2 = $xpath->evaluate('string(./td[2])', $row);
echo $col1 . '<br/>';
echo $col2 . '<br/>';
echo '<hr/>';
}
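For reference, the notice in the question most likely has two causes: the query string is single-quoted, so `$i` is never interpolated, and `query()` returns a DOMNodeList rather than a single node. The sketch below shows the index-based approach with both points addressed; the shortened sample markup is an assumption based on the question.

```php
<?php
// Shortened sample markup, assumed from the question.
$html = '<table><tr><td>some data</td></tr></table>
<table>
<tr><td class="c2">bus type</td><td>CISS</td></tr>
<tr><th><a name="systempcibus5">PCI Bus 31</a></th><td>Device</td></tr>
</table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$i = 1;
// Double quotes let PHP interpolate $i; in single quotes the literal
// text "$i" would be passed to the XPath engine.
$cells = $xpath->query("/html/body/table[2]/tr[$i]/td[1]");

// query() returns a DOMNodeList, so pick a node with item() first.
$first = $cells->length > 0 ? $cells->item(0)->textContent : null;
echo $first, PHP_EOL;
```

Iterating over the rows directly, as in the answer above, avoids building index-based paths altogether.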

Simple HTML DOM Not Finding DIV

I have code that tries to extract the event SKU from the Robot Events page (here is an example). The code I am using doesn't find any SKU on the page. The SKU is on line 411, in a div with the class "product-sku". My code doesn't even find the div on the page and just downloads all the events. Here is my code:
<?php
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]);
$html->load($htmldown);
echo "Downloaded";
foreach ($html->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
?>
Can anyone help me fix my code?
This code uses PHP's DOMDocument class. It works for the sample HTML below. Please try it.
// new dom object
$dom = new DOMDocument();
// HTML string
$html_string = '<html>
<body>
<div class="product-sku1" name="div_name">The this the div content product-sku</div>
<div class="product-sku2" name="div_name">The this the div content product-sku</div>
<div class="product-sku" name="div_name">The this the div content product-sku</div>
</body>
</html>';
//load the html
$html = $dom->loadHTML($html_string);
//preserve white space (TRUE is the default, so this line is optional)
$dom->preserveWhiteSpace = TRUE;
//get all the divs by tag name
$divs = $dom->getElementsByTagName('div');
// loop over the all DIVs
foreach ($divs as $div) {
if ($div->hasAttributes()) {
foreach ($div->attributes as $attribute){
if($attribute->name === 'class' && $attribute->value == 'product-sku'){
// Print the DIV class name and content
echo 'DIV Class Name: '.$attribute->value.PHP_EOL;
echo 'DIV Content: '.$div->nodeValue.PHP_EOL;
}
}
}
}
I would use a regex (regular expression) to pull the SKUs out.
The regex:
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
See php regex docs.
New code:
<?php
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = curl_init($event[4]);
curl_setopt($htmldown, CURLOPT_RETURNTRANSFER, true);
$html=curl_exec($htmldown);
curl_close($htmldown);
echo "Downloaded";
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
foreach ($matches as $row) {
echo $row;
}
}
?>
And actually, in this case (using that webpage) there is only one SKU, so
instead of:
foreach ($matches as $row) {
echo $row;
}
You could just use: echo $matches[1]; (The reason for array index 1 is because the whole regex pattern plus the sku will be in $matches[0] but just the subgroup containing the sku is in $matches[1].)
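To illustrate how `$matches` is filled, here is a small sketch. The HTML snippet and the SKU value `ABC-123` are made up for the example; the real page may be structured differently, and a regex like this breaks as soon as the markup changes.

```php
<?php
// Hypothetical snippet shaped like the markup the pattern expects.
$html = '<div class="product-sku"><b>Event Code:</b> ABC-123</div>';

$pattern = '~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~';

$sku = null;
if (preg_match($pattern, $html, $matches)) {
    // $matches[0] holds the whole matched text,
    // $matches[1] holds only the first capture group.
    $sku = trim($matches[1]);
    echo $sku, PHP_EOL;
}
```

Checking the return value of `preg_match` before reading `$matches[1]` avoids an undefined index notice when the page does not contain the expected div.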
Try parsing the downloaded page directly instead of loading it into a second object:
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
// $event[4] is a URL, so fetch and parse it with file_get_html()
$htmldown = file_get_html($event[4]);
echo "Downloaded";
foreach ($htmldown->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
and if class "product-sku" is only for div's then you can use
$htmldown->find('.product-sku')
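If you would rather stay with PHP's built-in DOM extension, a class lookup can be sketched with DOMXPath as below. The sample markup and the SKU text are assumptions for illustration. The `concat`/`normalize-space` idiom matches `product-sku` as a whole class token, so `product-sku1` is not matched, while a div with several classes still is.

```php
<?php
// Hypothetical markup; the real page may differ.
$html = '<html><body>
<div class="product-sku1">not this one</div>
<div class="box product-sku">Event Code: ABC-123</div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class attribute with spaces and look for " product-sku ",
// which matches the class as a whole word only.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' product-sku ')]";

$skus = [];
foreach ($xpath->query($query) as $div) {
    $skus[] = trim($div->textContent);
}
print_r($skus);
```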

PHP DOM nodeValue doesn't work

I am trying to parse an HTML table with DOM and it works fine but when some cell contains html it doesn't work properly.
Here is the Sample HTML Table
<tr>
<td>Razon Social: </td>
<td>Circulo Inmobiliaria Sur (Casa Central)</td>
</tr>
<tr>
<td>Email: </td>
<td> <img src="generateImage.php?email=myemail@domain.com"/> </td>
</tr>
And PHP Code:
$rows = $dom->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cells = $row->getElementsByTagName('td');
if(strpos($cells->item(0)->textContent, "Razon") > 0)
{
$_razonSocial = $cells->item(1)->textContent;
}
else if(strpos($cells->item(0)->textContent, "Email") > 0)
{
$_email = $cells->item(1)->textContent;
}
}
echo "Razon Social: $_razonSocial<br>Email: $_email";
OUTPUT:
Razon Social: Circulo Inmobiliaria Sur (Casa Central)
Email:
The email is empty; it should be:
<img src="generateImage.php?email=myemail@domain.com"/>
I have even tried
$cells->item(1)->nodeValue;
instead of
$cells->item(1)->textContent;
But that doesn't work either. How can I make it return the HTML value?
Give your table the id item_specification, then:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$rows = $x->query("//*[@id='item_specification']/tr");
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$atr_name = $cells->item(0)->nodeValue;
$atr_val = $cells->item(1)->nodeValue;
// echo inside the loop so every row is printed, not only the last one
echo " {$atr_name} - {$atr_val} <br />";
}
It works fine for me.
As I already mentioned, <img src="generateImage.php?email=myemail@domain.com"/> is not text. It is another HTML element, so textContent and nodeValue have no text to return for it. Try this:
if(strpos($cells->item(0)->textContent, "Razon") !== false) {
$_razonSocial = $cells->item(1)->textContent;
} else if(strpos($cells->item(0)->textContent, "Email") !== false) {
$count = 0;
// here we get all child nodes of td.
// space before img-tag is also a child node, but it has type DOMText
// so we skip it.
foreach ($cells->item(1)->childNodes as $child) {
if (++$count == 2)
$_email = $child->getAttribute('src');
}
// now in $_email you have full src value and can somehow extract email
}
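If the goal is the cell's markup itself rather than one attribute, one option is to serialize the cell's child nodes with `DOMDocument::saveHTML()`, which accepts a node argument. This is a sketch under assumptions: the one-row sample table and the `user@example.com` address are placeholders, and the serializer may normalize the markup (for example, it may drop the self-closing slash).

```php
<?php
// Placeholder markup modeled on the question's second row.
$html = '<table><tr><td>Email: </td>'
      . '<td> <img src="generateImage.php?email=user@example.com"/> </td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Second td of the first row.
$cell = $dom->getElementsByTagName('td')->item(1);

// Serialize each child node (text and elements) to build the inner HTML.
$inner = '';
foreach ($cell->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}
$inner = trim($inner);

echo $inner, PHP_EOL;
```

Unlike the `$count == 2` check above, this does not depend on how many whitespace text nodes precede the image.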

Dom parsing issue empty result

<tr class='Jed01'>
<td height='20' class='JEDResult'>1</td>
<td height='30' class='JEDResult'>26.04.2013</td>
<td height='30' class='JEDResult'>19:43</td>
<td height='30' class='JEDResult'>Processing</td>
<td height='30' class='JEDResult'><a href="#" pressed="GetInfo(1233);" title=''>Jeddah</a></td>
</tr>
Result = first step - date - time - state - place
First of all, I am new to PHP, and I am trying to parse this data into my web page via PHP DOM, as was recommended to me earlier on Stack Overflow. In the code below I check all the classes to get the data, but I get no result even though no error is reported. Where could the problem be?
Thanks in advance.
<?php
$input = "www.kalkatawi.com/luai.html"
$html = new DOMDocument();
$html->loadHTML($input);
foreach($html->getElementsByTagName('tr') as $tr)
{
if($tr->getAttribute('class') == 'Jed01')
{
foreach($html->getElementsByTagName('td') as $td)
{
if($td->getAttribute('class') == 'JEDResult')
{
echo ($td->nodeValue);
}
}
}
}
?>
Don't forget those semicolons ;)
Try this:
<?php
$input = file_get_contents("http://www.kalkatawi.com/luai.html");
$html = new DOMDocument();
$html->loadHTML($input);
foreach($html->getElementsByTagName('tr') as $tr)
{
if($tr->getAttribute('class') == 'Jed01')
{
foreach($tr->getElementsByTagName('td') as $td)
{
if($td->getAttribute('class') == 'JEDResult')
{
echo ($td->nodeValue);
echo '<br/>';
}
}
}
echo '<br/><br/>';
}
?>
Should output:
1
26.04.2013
19:43
Processing
Jeddah
2
26.04.2013
20:43
Printed
RIY
There are several problems with this code.
Loading the HTML
$input = 'MyLink';
$html = new DOMDocument();
$html->loadHTML($input);
This code attempts to treat the string 'MyLink' as HTML, which obviously it is not. If that's your actual code then nothing would work beyond this point. Either provide proper HTML input or use loadHTMLFile to load HTML from a file.
Comparisons are case-sensitive
On the one hand there is this:
<tr class='Jed01'>
And on the other this:
if($tr->getAttribute('class') == 'JED01')
Since 'Jed01' != 'JED01' this will never be true. Either fix the casing or use some other mechanism such as strcasecmp to compare the classes case-insensitively.
Objects cannot be printed
This results in a fatal error:
echo ($td);
What it should be instead: most likely echo $td->nodeValue, but other possibilities are open depending on what you want to do.
But you could do it much more easily with XPath
$xpath = new DOMXPath($html);
$query = "//tr[@class='Jed01']//td[@class='JEDResult']"; // google XPath syntax
foreach ($xpath->query($query) as $node) {
print_r($node->nodeValue);
}

My codes cannot extract the XPath correctly in PHP

I wrote some PHP code to extract the content of certain elements from other websites. The elements are addressed by XPath. The code works for one website but fails for another, so I am sure the code is not entirely wrong.
By the way, I got each element's XPath by using 'Inspect Element' in Firefox, right-clicking the element and choosing 'Copy XPath'.
What is wrong for the second website?
Thanks.
Here is the code:
//MyCode.PHP
<html>
<head>
<title>This is the title</title>
</head>
<body>
<?php
class EmDIV
{
public $url="";
public $content="";
public $name="";
public $query="";
public function EmDIV($CdivName,$Curl,$CQuery)
{
$this->name=$CdivName;
$this->url=$Curl;
$this->query=$CQuery;
$html = new DOMDocument();
@$html->loadHTMLFile($this->url);
$bodies = $html->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($this->query);
//echo @$body->saveHTML();
if($nodelist->length==1)
{
$this->content=$nodelist->item(0)->textContent;
//sanitizing
//$this->content=Jsoup.clean($this->content, Whitelist.basic());
}
//echo $nodelist->item(0)->nodeName;
foreach ($nodelist as $node)
echo $node->getNodePath()."\n";
}
}
$emdiv=array(
//new EmDIV('parsmalaysia','http://www.parsmalaysia.com/exchange.html','/html/body/div/div[5]/div/div/div/div/div/div/div/div/div/div/table/tbody/tr/td[4]/text()'),
new EmDIV('atlas-exchange','http://atlas-exchange.com/','/html/body/div/div/table/tr/td/table/tr/td/div/div[10]/div/div/div/table/tr[3]/td[2]/text()'),
new EmDIV('usunmalaysia','http://www.usunmalaysia.com/Home.aspx','/html/body/form/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[4]/td/table/tbody/tr/td/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody/tr[4]/td[2]/text()'),
);
?>
<table border="1">
<tr>
<td>Site</td>
<td>RM Price</td>
</tr>
<?php
foreach ($emdiv as $ed)
{
echo "<tr>";
echo "<td>".$ed->name."</td>";
echo "<td>".$ed->content."</td>";
echo "</tr>";
}
?>
</table>
</body>
</html>
The second page has no tbody elements in its markup, so you have to remove every tbody step from the path. Firefox shows tbody elements because it adds them to its internal table structure, but they are not in the markup you are retrieving.
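The effect can be reproduced with a small sketch. The one-row table and the value `4.25` are made up for the example, and the observation that DOMDocument's parser keeps the table as written, without inserting tbody, reflects my understanding of libxml's HTML parser rather than anything stated in the question.

```php
<?php
// Made-up markup: a table written without tbody, as on the second site.
$html = '<html><body><table><tr><td>RM</td><td>4.25</td></tr></table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Path as copied from the browser, including the tbody step.
$withTbody = $xpath->query('/html/body/table/tbody/tr/td[2]/text()');

// Same path with the tbody step removed.
$withoutTbody = $xpath->query('/html/body/table/tr/td[2]/text()');

echo 'with tbody: ', $withTbody->length, PHP_EOL;
echo 'without tbody: ', $withoutTbody->length, PHP_EOL;

// A descendant step works whether or not tbody is present in the source.
$either = $xpath->query('//table//tr/td[2]/text()');
echo 'descendant step: ', $either->length, PHP_EOL;
```

Using `//table//tr` instead of an absolute path copied from the browser makes the query less sensitive to such differences between the browser's tree and the raw markup.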
