Retrieve data from html page using xpath and php - php

I know there are similar question, but, trying to study PHP I met this error and I want understand why this occurs.
<?php
$url = 'http://aice.anie.it/quotazione-lme-rame/';
echo "hello!\r\n";
$html = new DOMDocument();
#$html->loadHTML($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[#id='table33']/tbody/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
?>
this prints just "hello!". I want to print the value extracted with the xpath, but the last echo doesn't do anything.

You have some errors in your code :
You try to get the table from the url http://aice.anie.it/quotazione-lme-rame/, but it's actually in an iframe located at http://www.aiceweb.it/it/frame_rame.asp, so get the iframe url directly.
You use the function loadHTML(), which load an HTML string. What you need is the loadHTMLFile function, which takes the link of an HTML document as a parameter (See http://www.php.net/manual/fr/domdocument.loadhtmlfile.php)
You assume there is a tbody element on the page but there is no one. So remove that from your query filter.
Working code :
$url = 'http://www.aiceweb.it/it/frame_rame.asp';
echo "hello!\r\n";
$html = new DOMDocument();
#$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[#id='table33']/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}

Related

php domdocument get node info

I am working with php and I am trying to get certain data from a webpage
everything works till i get to this part:
<a class="cleanthis" href="https://www.web.com" id="1122" rel="#1122" style="display: inline-block;"><strong>the data i want</strong></a>
As you can see i want the data in strong but i cant get it. I only get blank lines
code i use:
foreach($as as $a) {
if ($a->getAttribute('class') === 'cleanthis') {
$strong = $a->getElementsByTagName('strong');
echo $strong->nodeValue;;
}
You should be seeing this error message:
Undefined property: DOMNodeList::$nodeValue
That is because $strong = $a->getElementsByTagName('strong'); will put a DOMNodeList in $string. You either need to iterate the list or retrieve the actual node from it, e.g.
echo $strong->item(0)->nodeValue;
Or you can just use XPath:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//a[#class="cleanthis"]/strong/text()') as $element) {
echo $element->nodeValue, PHP_EOL;
}

DOMDocument grab html between two p tags [duplicate]

I'm trying to replace video links inside a string - here's my code:
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach ($doc->getElementsByTagName("a") as $link)
{
$url = $link->getAttribute("href");
if(strpos($url, ".flv"))
{
echo $link->outerHTML();
}
}
Unfortunately, outerHTML doesn't work when I'm trying to get the html code for the full hyperlink like <a href='http://www.myurl.com/video.flv'></a>
Any ideas how to achieve this?
As of PHP 5.3.6 you can pass a node to saveHtml, e.g.
$domDocument->saveHtml($nodeToGetTheOuterHtmlFrom);
Previous versions of PHP did not implement that possibility. You'd have to use saveXml(), but that would create XML compliant markup. In the case of an <a> element, that shouldn't be an issue though.
See http://blog.gordon-oheim.biz/2011-03-17-The-DOM-Goodie-in-PHP-5.3.6/
You can find a couple of propositions in the users notes of the DOM section of the PHP Manual.
For example, here's one posted by xwisdom :
<?php
// code taken from the Raxan PDI framework
// returns the html content of an element
protected function nodeContent($n, $outer=false) {
$d = new DOMDocument('1.0');
$b = $d->importNode($n->cloneNode(true),true);
$d->appendChild($b); $h = $d->saveHTML();
// remove outter tags
if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
return $h;
}
?>
The best possible solution is to define your own function which will return you outerhtml:
function outerHTML($e) {
$doc = new DOMDocument();
$doc->appendChild($doc->importNode($e, true));
return $doc->saveHTML();
}
than you can use in your code
echo outerHTML($link);
Rename a file with href to links.html or links.html to say google.com/fly.html that has flv in it or change flv to wmv etc you want href from if there are other href
it will pick them up as well
<?php
$contents = file_get_contents("links.html");
$domdoc = new DOMDocument();
$domdoc->preservewhitespaces=“false”;
$domdoc->loadHTML($contents);
$xpath = new DOMXpath($domdoc);
$query = '//#href';
$nodeList = $xpath->query($query);
foreach ($nodeList as $node){
if(strpos($node->nodeValue, ".flv")){
$linksList = $node->nodeValue;
$htmlAnchor = new DOMElement("a", $linksList);
$htmlURL = new DOMAttr("href", $linksList);
$domdoc->appendChild($htmlAnchor);
$htmlAnchor->appendChild($htmlURL);
$domdoc->saveHTML();
echo ("<a href='". $node->nodeValue. "'>". $node->nodeValue. "</a><br />");
}
}
echo("done");
?>

DomXPath with DOMDocument to get <img> Class URL

I am writing a little scraper script that will find the image URL that has a particular class name. I know that my cURL and DOMDocument is functioning okay, and even the DomXPath really (as far as I can tell, there are no errors) But I am struggling to work out how to get the URL of the xpath query results.
My code so far:
$dom = new DOMDocument();
#$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[#class="productImage"]');
var_dump($div);
echo $div->item(0);
If I var_dump($x) the page outputs no problem. So the CURL is working fine. But I do not know how to get the data that is contained in the $div. I am trying to find an Image with a class of 'productImage' which looks like:
<img src="/uploads/5W/yP/5WyPP4l7Z-jmZRzu_MJ6zg/1077-d.jpg" border="1" alt="Album" class="productImage">
I want the source of that image tag.
Any suggestions?
$dom = new DOMDocument();
$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$imgs = $xpath->query('//*[#class="productImage"]');
foreach($imgs as $img)
{
echo 'ImgSrc: ' . $img->getAttribute('src') .'<br />' . PHP_EOL;
}
Try that...
== EDIT: Additional Info ==
The reason I use a loop here is because you may find more than one img. If you know there is only one element (or you want the first dom node found) you can access the elelement from the domnodelist via the item method of domnodelist - like so:
$dom = new DOMDocument();
$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$img = $xpath->query('//*[#class="productImage"]');
echo 'ImgSrc: ' . $img->item(0)->getAttribute('src') .'<br />' . PHP_EOL;
You don't actually need to use XPath here, because it seems that you're just after images and that can be done by using DOMDocument::getElementsByTagName(), followed by a simple filter:
foreach ($dom->getElementsByTagName('img') as $image) {
$class = $image->getAttribute('class');
if (strpos(" $class ", " productImage ") !== false) {
$url = $image->getAttribute('src');
// do stuff
}
}
Then, you can get the src attribute by using DOMElement::getAttribute():
echo $image->getAttribute('src');

How can I grab div content to display in another page?

I've searched around for solutions to this question, but each one i find, and try doesn't work.
I'm trying to grab the content of a div from a forum topic.
I've tried using preg_match and that only displayed "Array" then I tried using this method
$html = file_get_contents("http://www.lcs-server.co.uk/forum/index.php/topic,$id_topic");
$dom = new DOMDocument;
$dom->loadHTML($html);
$element = $dom->getElementById("msg_$id_msg");
var_dump($element);
This will show "object(DOMElement)#1 (0) { } "
The $id_topic and $id_msg are defined above this code, taken from the forum database. I did try taking the message from the forum database, but it displayed BB code tags, I'd like it to grab the post content, and display it in HTML, as it's displayed on the forum post itself.
This is the code I'm using now and giving me "Fatal error: Cannot redeclare DOMinnerHTML()"
$html = file_get_contents("http://www.lcs-server.co.uk/forum/index.php/topic,$id_topic");
$dom = new DOMDocument;
$dom->loadHTML($html);
$domelement = $dom->getElementById("msg_$id_msg");
foreach ($domelement as $element)
{
echo DOMinnerHTML($element);
}
function DOMinnerHTML($DOMelement)
{
$innerHTML = "";
$children = $DOMelement->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
getElementById returns a DOM node object. It does not return the HTML of the node. For that, you have to get the node's "innerHTML". This properly is not officially supported by PHP's dom object for some reason, but can be faked using this answer: How to get innerHTML of DOMNode?

Echoing only a div with php

I'm attempting to make a script that only echos the div that encolose the image on google.
$url = "http://www.google.com/";
$page = file($url);
foreach($page as $theArray) {
echo $theArray;
}
The problem is this echos the whole page.
I want to echo only the part between the <div id="lga"> and the next closest </div>
Note: I have tried using if's but it wasn't working so I deleted them
Thanks
Use the built-in DOM methods:
<?php
$page = file_get_contents("http://www.google.com");
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($page);
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$lga = $domx->query("//*[#id='lga']")->item(0);
$domd2 = new DOMDocument();
$domd2->appendChild($domd2->importNode($lga, true));
echo $domd2->saveHTML();
In order to do this you need to parse the DOM and then get the ID you are looking for. Check out a parsing library like this http://simplehtmldom.sourceforge.net/manual.htm
After feeding your html document into the parser you could call something like:
$html = str_get_html($page);
$element = $html->find('div[id=lga]');
echo $element->plaintext;
That, I think, would be your quickest and easiest solution.

Categories