DomDocument php extract info and images - php

Hello, I am having a problem with DOMDocument. I need to write a script that extracts all the information from tables with a certain id.
So I did:
$link = "WEBSITE URL";
$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$context_nodes = $xpath->query('//table[@id="news"]/tr[position()>0]/td');
This gets all the <td>s and their text, but the <img> tags are not extracted by the script. How can I extract everything in the table cells, both the text and the image HTML tags?
The html code from which I want to extract the info is:
<table id="news" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="539" height="35"><span><strong>Info to Extract</strong></span></td>
</tr>
<tr>
<td height="35" class="texto10">Martes, 02 de Octubre de 2012 | Autor: Trovert" rel="author"></a></td>
</tr>
<tr>
<td height="35" class="texto12Gris"><p><strong>Info To extract</strong></p>
<p><strong> </strong></p>
<p><strong>Casa de Gobierno: (a 9 cuadras del hostel)</strong></p>
<img title="title" src="../images/theimage.jpg" width="400" height="266" />
</td>
</tr>
</table>
This is how I am iterating the extracted elements:
foreach ($context_nodes as $node) {
echo $node->nodeValue . '<br/>';
}
Thanks

If you need more than text, nodeValue/textContent is not enough; you have to walk through the target node's DOM branch:
function walkNode($node)
{
    $str = "";
    if ($node->nodeType == XML_TEXT_NODE)
    {
        $str .= $node->nodeValue;
    }
    elseif (strtolower($node->nodeName) == "img")
    {
        /* This is just a demonstration;
         * You'll have to extract the info in the way you want
         * */
        $str .= '<img src="' . $node->attributes->getNamedItem("src")->nodeValue . '" />';
    }
    // Recurse into the children first, then continue with the following siblings
    if ($node->firstChild) $str .= walkNode($node->firstChild);
    if ($node->nextSibling) $str .= walkNode($node->nextSibling);
    return $str;
}
This is a simple, straightforward recursive function. So now you can do this:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//table[@id="news"]//tr[position()>0]/td');
foreach ($tds as $td)
{
    echo walkNode($td->firstChild);
    echo "\n";
}
(Please note that I "fixed" your HTML a little, as it doesn't seem valid, and also indented it.)
This outputs something like this:
Info to Extract
Martes, 02 de Octubre de 2012 | Autor: Trovert
Info To extract
Casa de Gobierno: (a 9 cuadras del hostel)
<img src="../images/theimage.jpg" />

Try this....
foreach ($context_nodes as $node) {
echo $doc->saveHTML($node) . '<br/>';
}
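If you would rather have the data than the markup, here is a minimal sketch that collects the text and the image URLs of each cell separately. The sample markup is a shortened, hypothetical version of the table in the question, and the `$cells` array layout is my own choice, not something DOMDocument prescribes.

```php
<?php
// Hypothetical sample markup, shortened from the question's table.
$html = '<table id="news"><tr><td><p><strong>Casa de Gobierno</strong></p>'
      . '<img title="title" src="../images/theimage.jpg" width="400" height="266" />'
      . '</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$cells = [];
foreach ($xpath->query('//table[@id="news"]//td') as $td) {
    $images = [];
    // The leading "." makes the query relative to the current <td>.
    foreach ($xpath->query('.//img/@src', $td) as $src) {
        $images[] = $src->nodeValue;
    }
    $cells[] = ['text' => trim($td->textContent), 'images' => $images];
}
print_r($cells);
```

Passing `$td` as the second argument of `query()` is what restricts the image lookup to one cell; without it the expression would be evaluated against the whole document.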

Related

DOM object - special characters will not shown correctly

I have the following php code:
$html = '<table>
<tr>
<td data-label="Date">übermittelt</td>
<td data-label="Location">xxx</td>
</tr>
<tr>
<td data-label="Date">xD2</td>
<td data-label="Location">xxx</td>
</tr>
</table>';
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $html; // NO PROBLEM WITH SPECIAL CHARACTERS
$nodes = $dom->getElementsByTagName('td');
echo $nodes->item(0)->nodeValue; // PROBLEM WITH SPECIAL CHARACTERS
My problem is that the last echo shows the result garbled, like this:
Ã¼bermittelt
The echo $html shows the result correctly like this:
übermittelt
What can I do to solve this issue?
Thanks for your support. The solution was to declare the charset when loading the HTML, like this:
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
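For anyone who wants to try this, a small self-contained sketch of the same fix follows. It assumes the script file itself is saved as UTF-8; the sample table is cut down from the one in the question.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$html = '<table><tr><td data-label="Date">übermittelt</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
// The prefixed charset declaration tells the parser that the input is UTF-8.
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

$value = $dom->getElementsByTagName('td')->item(0)->nodeValue;
echo $value, "\n";
```

With the declaration in place, the value read back from the node should match the original string byte for byte.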

Extract all <a> from table with DOMdocument

I want to extract the multiple <a> tags from this html markup:
<table align="center" width="100%" cellpadding="0" cellspacing="0"><tr>
<td align="left" width="60%" valign="top">
<font size="5" color="#939390">title</font><br><font size="5" color="#939390">LINKS</font>
<a style="color:#000000;" title="title Link 1" href="/page/242808/1/44643.html"> <b style="background:#ff6633">Link 1</b></a>
<a style="color:#000000;" title="title Link 2" href="/page/242808/2/erewe.html"> <b style="background:#ff6633">Link 2</b></a>
</td>
</tr>
</table>
and here is my code :
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html); // load HTML you can add $html
$selector = new DOMXPath($doc);
$a = $selector->query('//table[1]//a')->item(0);
echo $doc->saveHTML($a);
This gets the first <a>, but what I want is to get all the <a> tags in the document.
To get more than one, you'll need to loop through the results instead of just printing the first one:
$a = $selector->query('//table[1]//a');
foreach($a as $current) {
echo $doc->saveHTML($current);
}
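If you need the link targets rather than the markup, a variation along these lines may help. It is only a sketch: the markup is a trimmed copy of the sample in the question, and the `$links` array layout is an arbitrary choice for illustration.

```php
<?php
// Trimmed copy of the question's markup.
$html = '<table><tr><td>'
      . '<a title="title Link 1" href="/page/242808/1/44643.html"> <b>Link 1</b></a>'
      . '<a title="title Link 2" href="/page/242808/2/erewe.html"> <b>Link 2</b></a>'
      . '</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$links = [];
foreach ($selector->query('//table[1]//a') as $a) {
    // getAttribute() returns an empty string if the attribute is missing.
    $links[] = ['href' => $a->getAttribute('href'), 'text' => trim($a->textContent)];
}
print_r($links);
```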

How to extract hyperlink using php

I have searched online and thought this would work, but for some reason it doesn't. I'm trying to extract a hyperlink from an HTML document and display only its URL. I only want the URL inside the td align="center". Here is a sample of the HTML document I'm trying to extract from:
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
And here is my PHP code to extract it from the td align="center":
<?php
//$searchURL = "site";
include 'simple_html_dom.php';
$site = 'website';
$html = file_get_html($site);
$tabledata = array();
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->href . '<br>';
?>
I know the code works because it extracts everything when the selector in the brackets is just td.
So you have identified the <td> elements themselves, but you did not go down to the next nesting level to grab the href from the <a> elements. You might do that like this:
foreach($html->find('td[align=center]') as $e)
echo $e->children(0)->href . '<br>';
Use the DOM and XPath:
Select all td elements in the document
//td
Only if the align attribute equals "center"
//td[@align="center"]
Get the a sub elements
//td[@align="center"]//a
Get the href attribute nodes of those a elements
//td[@align="center"]//a/@href
Source example:
$html = <<<'HTML'
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//td[@align="center"]//a/@href');
foreach ($nodes as $node) {
var_dump($node->value);
}
You selected the td element. The anchor element is the child of the td element.
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->firstChild()->getAttribute('href') . '<br>';

DOMDocument Query

Hi, I am very new to DOMDocument and still learning, and I am looking for an XPath query to use with it. The HTML sometimes changes, so preg_match is not a good idea. I need to get some values from an HTML file. This is the part of the HTML I want to read. I would be happy if you could help me.
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<table cellspacing="0" cellpadding="0" align="center" class="results">
<tr class="header" bgcolor="#0000FF">
<td>
</td>
<td>Name/AKAs</td>
<td>Age</td>
<td>Location</td>
<td>Possible Relatives</td>
</tr>
<tr>
<td>1.</td>
<td>
<a class="LN" href=""><b>Iron, Man E</b></a>
</td>
<td align="center">54</td>
<td>
Canada, AK<br />
California, AK<br />
</td>
<td>
</td>
<td>
View Details
</td>
</tr>
<tr><td>2.</td>
<td>
<a class="LN" href=""><b>Bat, Man E</b></a></td>
<td align="center">26</td>
<td>
Gotham, IA
<br /></td>
<td>
View Details</td></tr>
</table>');
$xpath = new DOMXPath($doc);
$xquery = '//a[@class="LN"]';
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
?>
How do I get the following output? So far I can only get the names, Iron, Man E and Bat, Man E:
Iron, Man E | 54 | Canada, AK;California, AK
Bat, Man E | 26 | Gotham, IA
My answer is not about DOMDocument queries, but it can solve your problem easily.
There is a library named SIMPLEHTMLDOM that makes this kind of extraction simple.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
The full documentation shows what else the library can do.
Try this,
$xquery = '//a'; // you will get all anchor tags now
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
Try this to get each row on a single line:
$xpath = new DOMXPath($doc);
$xquery = '//tr[td[a]]';
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
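To get exactly the "name | age | locations" format from the question, something like the following might work. It is a sketch under two assumptions that the question does not guarantee: the age is always the third cell of a row and the locations are always the fourth. The markup is a condensed copy of the question's table.

```php
<?php
// Condensed copy of the question's table.
$html = '<table class="results">
<tr class="header"><td></td><td>Name/AKAs</td><td>Age</td><td>Location</td></tr>
<tr><td>1.</td><td><a class="LN" href=""><b>Iron, Man E</b></a></td>
<td align="center">54</td><td>Canada, AK<br />California, AK<br /></td></tr>
<tr><td>2.</td><td><a class="LN" href=""><b>Bat, Man E</b></a></td>
<td align="center">26</td><td>Gotham, IA<br /></td></tr>
</table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$lines = [];
// Only rows that contain a name link; this skips the header row.
foreach ($xpath->query('//tr[td/a[@class="LN"]]') as $row) {
    $name = trim($xpath->evaluate('string(td/a[@class="LN"])', $row));
    $age  = trim($xpath->evaluate('string(td[3])', $row)); // assumed column position
    $places = [];
    // The <br /> tags split the locations into separate text nodes.
    foreach ($xpath->query('td[4]/text()', $row) as $text) {
        $place = trim($text->nodeValue);
        if ($place !== '') {
            $places[] = $place;
        }
    }
    $lines[] = $name . ' | ' . $age . ' | ' . implode(';', $places);
}
echo implode("\n", $lines), "\n";
```

If the column order can change, selecting the cells by something more stable than their position would be safer, but the sample markup offers little to hook onto besides align="center".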

PHP DOMDocument value of element without child-elements

I got a DOMDocument which looks like this:
<font size="6" face="Arial">
CONTENT
<font size="5" face="Arial">...</font>
<br>
<table cellspacing="1" cellpadding="1" border="3" bgcolor="#E7E7E7" rules="all">...</table>
<table cellspacing="1" cellpadding="1">...</table>
<font size="3" face="Arial" color="#000000">...</font>
</font>
Now I want to get just CONTENT and not all the other child-elements.
How can I do that?
What you can do is grab the first DOMText node that's a child of the first <font> tag.
// Get the first <font> tag
$font = $doc->getElementsByTagName( 'font')->item(0);
// Find the first DOMText element
$first_text = null;
foreach( $font->childNodes as $child) {
if( $child->nodeType === XML_TEXT_NODE) {
$first_text = $child;
break;
}
}
if( $first_text != null) {
echo 'OUTPUT: ' . $first_text->textContent;
}
You can see from the demo that this prints:
OUTPUT: CONTENT
Shorter:
$output = $xml->getElementsByTagName("font")->item(1)->firstChild->textContent;
nickb's solution works too and is even better if the CONTENT comes after one of the child elements. But since that is not the case for me, this one is shorter.
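Another option, in case the text can appear anywhere among the children, is to ask XPath for the element's direct text nodes only. This is a sketch with hypothetical, simplified markup; here the outer <font> is the first one in the document, so adjust the index to match your own page.

```php
<?php
// Hypothetical, simplified markup modelled on the question.
$html = '<font size="6" face="Arial">CONTENT'
      . '<font size="5" face="Arial">inner</font><br>'
      . '<font size="3" face="Arial">more</font></font>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$parts = [];
// text() matches only text nodes that are direct children, not text inside nested elements.
foreach ($xpath->query('(//font)[1]/text()') as $text) {
    $value = trim($text->nodeValue);
    if ($value !== '') {
        $parts[] = $value;
    }
}
echo implode(' ', $parts), "\n";
```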
