DOMDocument Query

Hi, I am very new to DOMDocument and still learning, and I'm looking for how to use an XPath query with it. The HTML sometimes changes, so preg_match is not a good idea. I need to get some values from an HTML file. This is the part of the HTML I want to read. I would be happy if you could help me.
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<table cellspacing="0" cellpadding="0" align="center" class="results">
<tr class="header" bgcolor="#0000FF">
<td>
</td>
<td>Name/AKAs</td>
<td>Age</td>
<td>Location</td>
<td>Possible Relatives</td>
</tr>
<tr>
<td>1.</td>
<td>
<a class="LN" href=""><b>Iron, Man E</b></a>
</td>
<td align="center">54</td>
<td>
Canada, AK<br />
California, AK<br />
</td>
<td>
</td>
<td>
View Details
</td>
</tr>
<tr><td>2.</td>
<td>
<a class="LN" href=""><b>Bat, Man E</b></a></td>
<td align="center">26</td>
<td>
Gotham, IA
<br /></td>
<td>
View Details</td></tr>
</table>');
$xpath = new DOMXPath($doc);
$xquery = '//a[@class="LN"]';
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
?>
How do I get the following output? So far I can only get "Iron, Man E" and "Bat, Man E".
Iron, Man E | 54 | Canada, AK;California, AK
Bat, Man E | 26 | Gotham, IA

My answer is not about DOMDocument queries, but it can solve your problem easily.
There is a library named Simple HTML DOM (simple_html_dom.php) that you can do great things with.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
The library's full documentation shows what else it can do.

Try this:
$xquery = '//a'; // selects every anchor tag in the document
$links = $xpath->query($xquery);
foreach ($links as $el) {
    echo strip_tags($doc->saveHTML($el)).'<br/>';
}
To get each row on a single line, select the rows that contain a link instead:
$xpath = new DOMXPath($doc);
$xquery = '//tr[td[a]]'; // rows with a cell that has an <a> child
$links = $xpath->query($xquery);
foreach ($links as $el) {
    echo strip_tags($doc->saveHTML($el)).'<br/>';
}
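To get exactly the "name | age | locations" format from the question, one option is to query each cell relative to its row. The following is only a sketch: the markup is trimmed from the question, and the column positions (td[3] for the age, td[4] for the locations) are assumptions based on that sample.

```php
<?php
// Sample markup trimmed from the question; the header row has no <a class="LN">.
$html = <<<'HTML'
<table class="results">
<tr class="header"><td></td><td>Name/AKAs</td><td>Age</td><td>Location</td></tr>
<tr><td>1.</td>
<td><a class="LN" href=""><b>Iron, Man E</b></a></td>
<td align="center">54</td>
<td>Canada, AK<br />California, AK<br /></td></tr>
<tr><td>2.</td>
<td><a class="LN" href=""><b>Bat, Man E</b></a></td>
<td align="center">26</td>
<td>Gotham, IA<br /></td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$lines = [];
// Only rows that contain a name link, which skips the header row.
foreach ($xpath->query('//tr[td/a[@class="LN"]]') as $row) {
    $name = trim($xpath->evaluate('string(td/a[@class="LN"])', $row));
    $age  = trim($xpath->evaluate('string(td[3])', $row)); // assumed column position
    // Each location is its own text node because of the <br /> tags.
    $places = [];
    foreach ($xpath->query('td[4]/text()', $row) as $text) {
        $place = trim($text->nodeValue);
        if ($place !== '') {
            $places[] = $place;
        }
    }
    $lines[] = $name . ' | ' . $age . ' | ' . implode(';', $places);
}
echo implode("\n", $lines), "\n";
```

Passing the row as the second argument to query() and evaluate() makes the expressions relative to that row, so the name, age, and locations always come from the same record.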

Related

php preg_match with get between two div value

How can I get the values from this text?
Desired output:
Year: 2012
KM: 69.000
Color: Blue
Price: 29.9000
preg_match('#</div></td><td class=\"searchResultsAttributeValue\">(.*?)<\/td>#si',$string,$val);
$string = '<div class="classifiedSubtitle">Opel > Astra > 1.4 T Sport</div>
</td>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
<td class="searchResultsDateValue">
<span>21 Nov</span>
<br/>
<span>2016</span>
</td>
<td class="searchResultsLocationValue">
USA<br/>Texas</td>';

Regex isn't the best tool for this. You should do it with the DOM.
$dom = new DOMDocument();
$dom->loadHTML($string);
$xPath = new DOMXpath($dom);
$tdValue = $xPath->query('//td[@class="searchResultsAttributeValue"]')->item(0)->nodeValue;
This way you'll get the first td element with the class searchResultsAttributeValue. Of course you should check that the element really exists (item(0) returns null when nothing matches) and add whatever other validation you need, but that's the general approach.
Hope I was helpful.
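As a rough sketch of the same idea extended to every value in the question: the snippet below collects all searchResultsAttributeValue cells plus the price cell. The wrapping table and tr tags, and the assumption that the cells appear in Year, KM, Color order, are mine and not part of the original markup.

```php
<?php
// Fragment from the question, wrapped in <table><tr> (an assumption) so it parses as a row.
$string = <<<'HTML'
<table><tr>
<td><div class="classifiedSubtitle">Opel &gt; Astra &gt; 1.4 T Sport</div></td>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
</tr></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($string);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Collect every attribute cell, trimming the surrounding newlines.
$values = [];
foreach ($xpath->query('//td[@class="searchResultsAttributeValue"]') as $td) {
    $values[] = trim($td->nodeValue);
}

// Assumed cell order, taken from the desired output in the question.
$labels = ['Year', 'KM', 'Color'];
$result = [];
foreach ($values as $i => $value) {
    $result[$labels[$i] ?? "Field $i"] = $value;
}

// item(0) returns null when nothing matches, so check before reading nodeValue.
$priceNode = $xpath->query('//td[@class="searchResultsPriceValue"]')->item(0);
if ($priceNode !== null) {
    $result['Price'] = trim($priceNode->nodeValue);
}

foreach ($result as $label => $value) {
    echo $label . ': ' . $value . "\n";
}
```

If the page adds or reorders cells, the label mapping would need to change; the class-based queries themselves do not depend on cell position.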

Extract text and image src with PHP DomDocument

I'm trying to extract the img src and the text of the TDs inside the div with id="Ajax", but my code doesn't extract the img; it just ignores the img src. How can I also extract the img src and add it to the array?
HTML:
<div id="Ajax">
<table cellpadding="1" cellspacing="0">
<tbody>
<tr id="comment_1">
<td>20:28</td>
<td class="color">
</td>
<td class="last_comment">
Text<br/>
</td>
</tr>
<tr id="comment_2">
<td>20:25</td>
<td class="color">
</td>
<td class="comment">
Text 2<br/>
</td>
</tr>
<tr id="comment_3">
<td>20:24</td>
<td class="color">
<img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/>
</td>
<td class="comment">
Text 3<br/>
</td>
</tr>
<tr id="comment_4">
<td>20:23</td>
<td class="color">
<img src="http://url.ext/img/image01.jpeg" alt="img alt"/>
</td>
<td class="comment">
Text 4<br/>
</td>
</tr>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$contentArray = array();
$doc = $doc->getElementById('Ajax');
$text = $doc->getElementsByTagName ('td');
foreach ($text as $t)
{
$contentArray[] = $t->nodeValue;
}
print_r ($contentArray);
Thanks.

You're using $t->nodeValue to obtain the content of a node. An <img> tag is empty, thus has nothing to return. The easiest way to get the src attribute would be XPath.
Example:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$expression = "//div[@id='Ajax']//tr";
$nodes = $xpath->query($expression); // Get all rows (tr) in the div
$imgSrcExpression = ".//img/@src";
$firstTdExpression = "./td[1]";
$contentArray = array();
foreach($nodes as $node){ // loop over each row
    // select the first td node
    $tdNodes = $xpath->query($firstTdExpression, $node);
    $tdVal = null;
    if($tdNodes->length > 0){
        $tdVal = $tdNodes->item(0)->nodeValue;
    }
    // select the src attribute of the img node
    $imgNodes = $xpath->query($imgSrcExpression, $node);
    $imgVal = null;
    if($imgNodes->length > 0){
        $imgVal = $imgNodes->item(0)->nodeValue;
    }
    // keep both values for this row; img stays null when the row has no image
    $contentArray[] = array('text' => $tdVal, 'img' => $imgVal);
}
(Caution: Code may contain typos)

DomDocument php extract info and images

Hello, I am having a problem with DOMDocument. I need to write a script that extracts all the information from tables with a certain id.
So I did:
$link = "WEBSITE URL";
$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$context_nodes = $xpath->query('//table[@id="news"]/tr[position()>0]/td');
This gets all the <td>s and their text, but the <img> tags are not extracted by the script. How can I extract everything in the tables, both text and image HTML tags?
The html code from which I want to extract the info is:
<table id="news" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="539" height="35"><span><strong>Info to Extract</strong></span></td>
</tr>
<tr>
<td height="35" class="texto10">Martes, 02 de Octubre de 2012 | Autor: Trovert" rel="author"></a></td>
</tr>
<tr>
<td height="35" class="texto12Gris"><p><strong>Info To extract</strong></p>
<p><strong> </strong></p>
<p><strong>Casa de Gobierno: (a 9 cuadras del hostel)</strong></p>
<img title="title" src="../images/theimage.jpg" width="400" height="266" />
</td>
</tr>
</table>
This is how I am iterating the extracted elements:
foreach ($context_nodes as $node) {
echo $node->nodeValue . '<br/>';
}
Thanks

If you need more than text, nodeValue/textContent alone is not enough; you have to walk the DOM branch under each target node:
function walkNode($node)
{
    $str="";
    if($node->nodeType==XML_TEXT_NODE)
    {
        $str.=$node->nodeValue;
    }
    elseif(strtolower($node->nodeName)=="img")
    {
        /* This is just a demonstration;
         * You'll have to extract the info in the way you want
         * */
        $str.='<img src="'.$node->attributes->getNamedItem("src")->nodeValue.'" />';
    }
    if($node->firstChild) $str.=walkNode($node->firstChild);
    if($node->nextSibling) $str.=walkNode($node->nextSibling);
    return $str;
}
This is a simple, straightforward recursive function. So now you can do this:
$dom=new DOMDocument();
$dom->loadHTML($html);
$xpath=new DOMXPath($dom);
$tds=$xpath->query('//table[@id="news"]//tr[position()>0]/td');
foreach($tds as $td)
{
    echo walkNode($td->firstChild);
    echo "\n";
}
(Please note that I "fixed" your HTML a little, as it doesn't seem valid, and also re-indented it a bit.)
This outputs something like this:
Info to Extract
Martes, 02 de Octubre de 2012 | Autor: Trovert
Info To extract
Casa de Gobierno: (a 9 cuadras del hostel)
<img src="../images/theimage.jpg" />

Try this....
foreach ($context_nodes as $node) {
echo $doc->saveHTML($node) . '<br/>';
}

How to extract hyperlink using php

I have searched online and thought this would work, but it doesn't for some reason. I'm trying to extract a hyperlink from an HTML document so that only its URL is displayed. I'm only trying to extract the URL within the td align="center". Here is a sample of the HTML doc I'm trying to extract from:
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
And here is my PHP code to extract it from the td align="center":
<?php
//$searchURL = "site";
include 'simple_html_dom.php';
$site = 'website';
$html = file_get_html($site);
$tabledata = array();
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->href . '<br>';
?>
I know the code works, because it extracts everything if only td is within the brackets.

So you have identified the <td> elements themselves, but you did not go down to the next nesting level to grab the href from the <a> elements. You might do that like this:
foreach($html->find('td[align=center]') as $e)
echo $e->children(0)->href . '<br>';

Use the DOM and XPath:
Select all td elements in the document:
//td
Keep only those whose align attribute equals "center":
//td[@align="center"]
Get the a elements inside them:
//td[@align="center"]//a
Get the href attribute nodes of those a elements:
//td[@align="center"]//a/@href
Source example:
$html = <<<'HTML'
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//td[@align="center"]//a/@href');
foreach ($nodes as $node) {
var_dump($node->value);
}
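The sample markup above contains no a element, so the loop has nothing to print for it. The sketch below runs the same expression against a row that does contain a link. The href value and the wrapping table and tr tags are made-up placeholders, since the original link is not shown.

```php
<?php
// The href below is a hypothetical placeholder; the real link is not in the sample.
$html = <<<'HTML'
<table><tr>
<td align="right">Arsenal</td>
<td align="center"><a href="http://example.com/match/1">1-3</a></td>
<td>Aston Villa</td>
</tr></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXpath($dom);

// Each result is a DOMAttr node; its value property holds the URL.
$hrefs = [];
foreach ($xpath->evaluate('//td[@align="center"]//a/@href') as $attr) {
    $hrefs[] = $attr->value;
}
print_r($hrefs);
```

Cells without align="center", or without a link inside, simply contribute nothing to the result.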

You selected the td element. The anchor element is the child of the td element.
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->firstChild()->getAttribute('href') . '<br>';

PHP Regex HTML - Extract URL

I am trying to extract multiple URLs from an HTML file with regex.
There are other URLs in the file, so the only pattern I have is "tableentries." and ""
HTML code example:
<tr class="tableentries2">
<td>
Click Here
</td>
PHP I wrote:
$html = "value of the code above"
if(preg_match_all('/<td>.*</td>/', $html, $match)){
foreach($match[0] as $x){
echo $x . "<br>";
}}

Why not just look for href values? (Updated because the edited code now has quotation marks.)
preg_match_all('/href="([^\s"]+)/', $html, $match);
Then the first URI would be in $match[1][0], and $match[1] holds all of the captured URIs.

You really shouldn't use regex to parse HTML. DOMDocument is actually very easy to use for this type of thing. Here is a simple example.
<?php
error_reporting(E_ALL);
$html = "
<table>
<tr>
<td>
<a href='http://www.test1-1.com'>test1-1</a>
</td>
<td>
<a href='http://www.test1-2.com'>test1-2</a>
</td>
<td>
<a href='http://www.test1-3.com'>test1-3</a>
</td>
</tr>
<tr>
<td>
<a href='http://www.test2-1.com'>test2-1</a>
</td>
<td>
<a href='http://www.test2-2.com'>test2-2</a>
</td>
<td>
<a href='http://www.test2-3.com'>test2-3</a>
</td>
</tr>
</table>";
$DOM = new DOMDocument();
//load the html string into the DOMDocument
$DOM->loadHTML($html);
//get a list of all <A> tags
$a = $DOM->getElementsByTagName('a');
//loop through all <A> tags
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br />';
}
?>
This would output:
http://www.test1-1.com
http://www.test1-2.com
http://www.test1-3.com
http://www.test2-1.com
http://www.test2-2.com
http://www.test2-3.com
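Since the question mentions that the file also contains other URLs and that the wanted rows have a class beginning with "tableentries", the DOM approach can be narrowed with XPath. This is only a sketch under that assumption; the URLs and the extra row classes are invented for illustration.

```php
<?php
// Hypothetical markup: the URLs and the "other" row are made up for this example.
$html = <<<'HTML'
<table>
<tr class="tableentries1"><td><a href="http://example.com/one">Click Here</a></td></tr>
<tr class="tableentries2"><td><a href="http://example.com/two">Click Here</a></td></tr>
<tr class="other"><td><a href="http://example.com/skip">Elsewhere</a></td></tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Only links inside rows whose class attribute starts with "tableentries".
$urls = [];
foreach ($xpath->query('//tr[starts-with(@class, "tableentries")]//a/@href') as $attr) {
    $urls[] = $attr->nodeValue;
}
print_r($urls);
```

The starts-with() test is plain XPath 1.0, so it works with DOMXPath without any extensions.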

<?php
preg_match_All("#<a\s[^>]*href\s*=\s*[\'\"]??\s*?(?'path'[^\'\"\s]+?)[\'\"\s]{1}[^>]*>(?'name'[^>]*)<#simU", $html, $hrefs, PREG_SET_ORDER);
foreach ($hrefs AS $urls){
print $urls['path']."<br>";
}
?>
