Extract all <a> from table with DOMdocument - php

I want to extract the multiple <a> tags from this html markup:
<table align="center" width="100%" cellpadding="0" cellspacing="0"><tr>
<td align="left" width="60%" valign="top">
<font size="5" color="#939390">title</font><br><font size="5" color="#939390">LINKS</font>
<a style="color:#000000;" title="title Link 1" href="/page/242808/1/44643.html"> <b style="background:#ff6633">Link 1</b></a>
<a style="color:#000000;" title="title Link 2" href="/page/242808/2/erewe.html"> <b style="background:#ff6633">Link 2</b></a>
</td>
</tr>
</table>
Here is my code:
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html); // load HTML you can add $html
$selector = new DOMXPath($doc);
$a = $selector->query('//table[1]//a')->item(0);
echo $doc->saveHTML($a);
This gets the first <a>, but what I want is to get all the <a> tags in the document.

To get more than one, you'll need to loop through the results instead of just printing the first one:
$a = $selector->query('//table[1]//a');
foreach ($a as $current) {
    echo $doc->saveHTML($current);
}
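If you need the link targets and labels rather than the raw markup, the same loop can fill an array. This is only a sketch under assumptions: the `$html` string is a shortened copy of the markup from the question, and the `$links` array with its `href`/`text` keys is a name chosen here for illustration.

```php
<?php
// Shortened stand-in for the markup in the question.
$html = '<table><tr><td>
<a title="title Link 1" href="/page/242808/1/44643.html"> <b>Link 1</b></a>
<a title="title Link 2" href="/page/242808/2/erewe.html"> <b>Link 2</b></a>
</td></tr></table>';

libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$links = []; // hypothetical result structure
foreach ($selector->query('//table[1]//a') as $a) {
    // Each $a is a DOMElement: read the href attribute and the trimmed text.
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}
print_r($links);
```

Because query() returns a DOMNodeList, foreach visits every match, not only item(0).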

Related

HTML parsing with php

Can anyone help me parse this part of an HTML page? I am using PHP and PHP DOM.
I would like to get the Klassifikation and Schlagwörter values in one PHP string.
How is this done?
Thanks
<tr style="display:table-row;">
<td id="TREFWOORD" class="onOffLink"></td>
<td class="rec_lable"><div>
<span>Schlagwörter</span><span>: </span>
</div></td>
<td class="rec_title"><div>
<span>*</span><span><a class="
link_gen
" href="MAT=/NOMAT=T/REL?PPN=106189719">Recht</a></span><span>
</span><span><a href="http://"
target=""><img src="http://"
alt="Subject" title="Subject" class="img_link"></a></span><span> / </span>
<span><a class="
link_gen
" href="MAT=/NOMAT=T/CMD?
ACT=SRCHA&IKT=5040&TRM=Wo%CC%88rterbuch">Wörterbuch</a></span>
</div></td>
</tr>
<tr style="display:table-row;">
<td></td>
<td class="rec_lable"><div><span>Klassifikation: </span></div></td>
<td class="rec_title"><div>
<span>Basisklassifikation: </span><span><a class="
link_gen
" target=""><img
src="http://" alt="Subject"
title="Subject" class="img_link"></a></span>
</div></td>
</tr>
I tried this without success:
<?php
$url='http://...';
$easycurlcmd=sprintf("curl '%s' -o ./libbvhtml.txt", $url);
printf("Execute: CURL1 ".$easycurlcmd."\n");
exec($easycurlcmd);
$html=file_get_contents('./libbvhtml.txt');
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$rec_lable = $xpath->query("//tr/*[contains(@class, 'rec_lable')]/div/span[1]");
echo $rec_lable->item(0)->nodeValue; // Schlagwörter
echo $rec_lable->item(1)->nodeValue; // Klassifikation
It turned out the problem was that curl has to be called with the redirect option (-L).
Thanks to all.
Use DOMDocument::loadHTML to parse the HTML and DOMXPath::query to search the DOM.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$rec_lable = $xpath->query("//tr/*[contains(@class, 'rec_lable')]/div/span[1]");
echo $rec_lable->item(0)->nodeValue; // Schlagwörter
echo $rec_lable->item(1)->nodeValue; // Klassifikation
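Since the question asks for one string, here is a rough sketch of how the labels and their values could be joined. It rests on assumptions: the `$html` below is a reduced copy of the question's markup with a meta tag added to declare UTF-8, and `$parts` / `$result` are names made up for this example.

```php
<?php
// Reduced sample; the meta tag tells libxml the input is UTF-8 so umlauts survive.
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body><table>
<tr><td class="rec_lable"><div><span>Schlagwörter</span><span>: </span></div></td>
<td class="rec_title"><div><span><a class="link_gen" href="#">Recht</a></span><span> / </span><span><a class="link_gen" href="#">Wörterbuch</a></span></div></td></tr>
<tr><td class="rec_lable"><div><span>Klassifikation: </span></div></td>
<td class="rec_title"><div><span>Basisklassifikation: </span></div></td></tr>
</table></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$parts = [];
// Visit every row that has a label cell.
foreach ($xpath->query("//tr[td[contains(@class, 'rec_lable')]]") as $row) {
    // normalize-space() trims the text and collapses runs of whitespace.
    $label = $xpath->evaluate("normalize-space(td[contains(@class, 'rec_lable')])", $row);
    $value = $xpath->evaluate("normalize-space(td[contains(@class, 'rec_title')])", $row);
    $parts[] = $label . ' ' . $value;
}
$result = implode('; ', $parts);
echo $result, "\n";
```

The second argument to evaluate() is the context node, so the relative paths are resolved inside the current row only.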

Extract text and image src with PHP DomDocument

I'm trying to extract the img src and the text of the TDs inside the div with id="Ajax", but my code ignores the img src. How can I also extract the img src and add it to the array?
HTML:
<div id="Ajax">
<table cellpadding="1" cellspacing="0">
<tbody>
<tr id="comment_1">
<td>20:28</td>
<td class="color">
</td>
<td class="last_comment">
Text<br/>
</td>
</tr>
<tr id="comment_2">
<td>20:25</td>
<td class="color">
</td>
<td class="comment">
Text 2<br/>
</td>
</tr>
<tr id="comment_3">
<td>20:24</td>
<td class="color">
<img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/>
</td>
<td class="comment">
Text 3<br/>
</td>
</tr>
<tr id="comment_4">
<td>20:23</td>
<td class="color">
<img src="http://url.ext/img/image01.jpeg" alt="img alt"/>
</td>
<td class="comment">
Text 4<br/>
</td>
</tr>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$contentArray = array();
$doc = $doc->getElementById('Ajax');
$text = $doc->getElementsByTagName ('td');
foreach ($text as $t)
{
$contentArray[] = $t->nodeValue;
}
print_r ($contentArray);
Thanks.
You're using $t->nodeValue to obtain the content of a node. An <img> tag is empty, so it has nothing to return. The easiest way to get the src attribute is XPath.
Example:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$expression = "//div[@id='Ajax']//tr";
$nodes = $xpath->query($expression); // Get all rows (tr) in the div
$imgSrcExpression = ".//img/@src";
$firstTdExpression = "./td[1]";
foreach ($nodes as $node) { // loop over each row
    // select the first td node
    $tdNodes = $xpath->query($firstTdExpression, $node);
    $tdVal = null;
    if ($tdNodes->length > 0) {
        $tdVal = $tdNodes->item(0)->nodeValue;
    }
    // select the src attribute of the img node
    $imgNodes = $xpath->query($imgSrcExpression, $node);
    $imgVal = null;
    if ($imgNodes->length > 0) {
        $imgVal = $imgNodes->item(0)->nodeValue;
    }
}
(Caution: Code may contain typos)
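To actually get the values into the array the question asks about, something like the following might do. It is a sketch with assumptions: the `$html` string holds two rows copied from the question, and the `time` / `img` / `text` keys are invented for the example. It uses string() in XPath, which yields an empty string when nothing matches, instead of the length checks above.

```php
<?php
// Two rows from the question's markup: the first has no image, the second has one.
$html = '<div id="Ajax"><table><tbody>
<tr id="comment_1"><td>20:28</td><td class="color"></td><td class="last_comment">Text<br/></td></tr>
<tr id="comment_3"><td>20:24</td><td class="color"><img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/></td><td class="comment">Text 3<br/></td></tr>
</tbody></table></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$contentArray = [];
foreach ($xpath->query("//div[@id='Ajax']//tr") as $row) {
    // string(...) returns '' when the path matches no node in this row.
    $contentArray[] = [
        'time' => trim($xpath->evaluate('string(./td[1])', $row)),
        'img'  => $xpath->evaluate('string(.//img/@src)', $row),
        'text' => trim($xpath->evaluate('string(./td[3])', $row)),
    ];
}
print_r($contentArray);
```

Rows without an image simply end up with an empty `img` entry, so the array keeps one element per table row.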

DomDocument php extract info and images

Hello, I am having a problem with DomDocument. I need to write a script that extracts all the information from the tables with a certain id.
So I did:
$link = "WEBSITE URL";
$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$context_nodes = $xpath->query('//table[@id="news"]/tr[position()>0]/td');
So I get all the <td>s and their text, but the <img> tags are not extracted by the script. How can I extract all the information in the tables, both text and image HTML tags?
The html code from which I want to extract the info is:
<table id="news" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="539" height="35"><span><strong>Info to Extract</strong></span></td>
</tr>
<tr>
<td height="35" class="texto10">Martes, 02 de Octubre de 2012 | Autor: Trovert" rel="author"></a></td>
</tr>
<tr>
<td height="35" class="texto12Gris"><p><strong>Info To extract</strong></p>
<p><strong> </strong></p>
<p><strong>Casa de Gobierno: (a 9 cuadras del hostel)</strong></p>
<img title="title" src="../images/theimage.jpg" width="400" height="266" />
</td>
</tr>
</table>
This is how I am iterating the extracted elements:
foreach ($context_nodes as $node) {
echo $node->nodeValue . '<br/>';
}
Thanks
If you need more than text, nodeValue/textContent is not enough; you have to walk through the DOM branch of each target node:
function walkNode($node)
{
    $str = "";
    if ($node->nodeType == XML_TEXT_NODE) {
        $str .= $node->nodeValue;
    } elseif (strtolower($node->nodeName) == "img") {
        /* This is just a demonstration;
         * You'll have to extract the info in the way you want
         * */
        $str .= '<img src="' . $node->attributes->getNamedItem("src")->nodeValue . '" />';
    }
    if ($node->firstChild) $str .= walkNode($node->firstChild);
    if ($node->nextSibling) $str .= walkNode($node->nextSibling);
    return $str;
}
This is a simple, straightforward recursive function. So now you can do this:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//table[@id="news"]//tr[position()>0]/td');
foreach ($tds as $td) {
    echo walkNode($td->firstChild);
    echo "\n";
}
(Please note that I "fixed" your HTML a little, as it doesn't seem valid, and also indented it a little.)
This outputs something like this:
Info to Extract
Martes, 02 de Octubre de 2012 | Autor: Trovert
Info To extract
Casa de Gobierno: (a 9 cuadras del hostel)
<img src="../images/theimage.jpg" />
Try this....
foreach ($context_nodes as $node) {
echo $doc->saveHTML($node) . '<br/>';
}
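Note that saveHTML($node) serializes the <td> element itself along with its content. If only the content is wanted, one option is to serialize the child nodes one by one. The sketch below assumes a reduced version of the question's table, and `innerHTML()` is a hypothetical helper defined here, not a built-in DOM function.

```php
<?php
// Hypothetical helper: serialize the children of a node without the node's own tag.
function innerHTML(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

// Reduced stand-in for the table in the question.
$html = '<table id="news"><tr><td><p><strong>Info</strong></p><img title="title" src="../images/theimage.jpg" width="400" height="266" /></td></tr></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$cells = [];
foreach ($xpath->query('//table[@id="news"]//td') as $td) {
    $cells[] = innerHTML($td); // text and <img> markup, but no surrounding <td>
}
echo implode("\n", $cells), "\n";
```

This keeps the <img> tags in the output, which nodeValue drops.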

How to extract hyperlink using php

I have searched online and thought this would work, but for some reason it doesn't. I'm trying to extract a hyperlink that displays only its URL from an HTML document. I only want the URL inside the td align="center". Here is a sample of the HTML doc I'm trying to extract from:
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
And here is my PHP code to extract it from the td align="center":
<?php
//$searchURL = "site";
include 'simple_html_dom.php';
$site = 'website';
$html = file_get_html($site);
$tabledata = array();
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->href . '<br>';
?>
I know the code works because it can extract everything if it is just the td within the brackets.
So you have identified the <td> elements themselves, but you did not go down to the next nesting level to grab the href from the <a> elements. You might do that like this:
foreach($html->find('td[align=center]') as $e)
echo $e->children(0)->href . '<br>';
Use the DOM and XPath:
Select all td elements in the document
//td
Only if the align attribute equals "center"
//td[@align="center"]
Get the a sub elements
//td[@align="center"]//a
Get the href attribute nodes of those a elements
//td[@align="center"]//a/@href
Source example:
$html = <<<'HTML'
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//td[@align="center"]//a/@href');
foreach ($nodes as $node) {
var_dump($node->value);
}
You selected the td element. The anchor element is the child of the td element.
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->firstChild()->getAttribute('href') . '<br>';

php DomXPath - how to get image in current node only and not in child nodes?

I need to get only the image in the current node and not the ones in child nodes.
I want to get only the green/yellow/red/black images, without the not_important.gif image.
I can use the query './/table/tr/td/img',
but I need it inside a loop.
<?php
/////////////////////////////////////////////////////////////////////
$html='
<table>
<tr>
<td colspan="2">
<span>
<img src="not_important.gif" />
</span>
<img src="green.gif" />
</td>
</tr>
<tr>
<td>
<span>yellow</span>
<img src="yellow.gif" />
</td>
<td>
<span>red</span>
<img src="red.gif" />
</td>
</tr>
</table>
<table>
<tr>
<td>
<span>
<img src="not_important.gif" />
</span>
<img src="black.gif" />
</td>
</tr>
</table>
';
/////////////////////////////////////////////////////////////////////
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
/////////////////////////////////////////////////////////////////////
$query = $xpath->query('.//table/tr/td');
for( $x=0,$results=''; $x<$query->length; $x++ )
{
$x1=$x+1;
$image = $query->item($x)->getELementsByTagName('img')->item(0)->getAttribute('src');
$results .= "image $x1 is : $image<br/>";
}
echo $results;
/////////////////////////////////////////////////////////////////////
?>
Can I do it through $query->item()-> ?
I tried has_attributes, getElementsByTagNameNS and getElementById,
but I failed.
Replace:
$image = $query->item($x)->getELementsByTagName('img')->item(0)->getAttribute('src');
...with:
$td = $query->item($x); // grab the td element
$img = $xpath->query('./img',$td)->item(0); // grab the first direct img child element
$image = $img->getAttribute('src'); // grab the source of the image
In other words, use the XPath object again to query, but now for ./img, relative to the context node you provide as the second argument to query(). The context node is one of the elements (td) of the earlier result.
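Put together as a complete loop, the context-node approach could look like this. It is a sketch assuming the first table from the question as input; the `$sources` array is a name picked for the example, and the null check is an addition for cells that have no direct img child.

```php
<?php
// First table from the question.
$html = '<table>
<tr><td colspan="2"><span><img src="not_important.gif" /></span><img src="green.gif" /></td></tr>
<tr><td><span>yellow</span><img src="yellow.gif" /></td><td><span>red</span><img src="red.gif" /></td></tr>
</table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$sources = [];
foreach ($xpath->query('//table//td') as $td) {
    // './img' matches only img elements that are direct children of this td,
    // so the image nested inside the <span> is not picked up.
    $img = $xpath->query('./img', $td)->item(0);
    if ($img !== null) { // skip cells without a direct img child
        $sources[] = $img->getAttribute('src');
    }
}
print_r($sources);
```

The difference from getElementsByTagName('img') is that the latter searches all descendants, which is why it returned not_important.gif first.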
The query //table/tr/td/img should work just fine as the unwanted images all reside in <span> elements.
Your loop would look like
$images = $xpath->query('//table/tr/td/img');
$results = '';
for ($i = 0; $i < $images->length; $i++) {
$results .= sprintf('image %d is: %s<br />',
$i + 1,
$images->item($i)->getAttribute('src'));
}
