php domdocument get node info - php

I am working with php and I am trying to get certain data from a webpage
everything works till i get to this part:
<a class="cleanthis" href="https://www.web.com" id="1122" rel="#1122" style="display: inline-block;"><strong>the data i want</strong></a>
As you can see i want the data in strong but i cant get it. I only get blank lines
code i use:
foreach($as as $a) {
if ($a->getAttribute('class') === 'cleanthis') {
$strong = $a->getElementsByTagName('strong');
echo $strong->nodeValue;;
}

You should be seeing this error message:
Undefined property: DOMNodeList::$nodeValue
That is because $strong = $a->getElementsByTagName('strong'); will put a DOMNodeList in $string. You either need to iterate the list or retrieve the actual node from it, e.g.
echo $strong->item(0)->nodeValue;
Or you can just use XPath:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//a[#class="cleanthis"]/strong/text()') as $element) {
echo $element->nodeValue, PHP_EOL;
}

Related

PHP undefined method DOMNodeList::setAttribute()

I get this error message from PHP: "undefined method DOMNodeList::setAttribute()" from line 9. I am trying to change the src of an image in my HTML at my server and so far this is my code:
<?php
if (isset($_POST['id']) && isset($_POST['name']))
{
$id = $_POST['id'];
$name = $_POST['name'];
$html = $_POST['html'];
$dom = new domDocument;
$dom->loadHTML($html);
$node = $dom->getElementsByTagName( 'img' );
$node ->setAttribute('src', 'images/' . $name);//line 9
echo $dom->saveHTML();
}
echo 'error';
exit;
//html
<div><img id="picture" src=""></div>
The variable 'id' is the HTML id of the specific line of HTML, name is the the name of the image and HTML is the line of HTML.
As far as I can understand from researching I select a specific line of HTML which I then load to my DOM variable. I then specify the element ie: "img" which I can then edit through the use of setAttribute however this does not work. I only want to change the source of this one img with the ID of "picture".
But the DOMNodeList doesn't have that method.
getElementsByTagName is part of the DOMDocument class.
You don't need to cast anything, just call the method:
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$spans = $link->getElementsByTagName('span');
}
And by the way, DOMElement is a subclass of DOMNode. If you were talking about a DOMNodeList, then accessing the elements in such a list can be done, be either the method presented above, with a foreach loop, either by using the item() method of DOMNodeList
$link_0 = $dom->getElementsByTagName('a')->item(0);
getElementsByTagName returns a list of nodes. Just try with:
$nodes = $dom->getElementsByTagName('img');
foreach ($nodes as $node) {
$node->setAttribute('src', 'images/' . $name);
}

Retrieve data from html page using xpath and php

I know there are similar question, but, trying to study PHP I met this error and I want understand why this occurs.
<?php
$url = 'http://aice.anie.it/quotazione-lme-rame/';
echo "hello!\r\n";
$html = new DOMDocument();
#$html->loadHTML($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[#id='table33']/tbody/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
?>
this prints just "hello!". I want to print the value extracted with the xpath, but the last echo doesn't do anything.
You have some errors in your code :
You try to get the table from the url http://aice.anie.it/quotazione-lme-rame/, but it's actually in an iframe located at http://www.aiceweb.it/it/frame_rame.asp, so get the iframe url directly.
You use the function loadHTML(), which load an HTML string. What you need is the loadHTMLFile function, which takes the link of an HTML document as a parameter (See http://www.php.net/manual/fr/domdocument.loadhtmlfile.php)
You assume there is a tbody element on the page but there is no one. So remove that from your query filter.
Working code :
$url = 'http://www.aiceweb.it/it/frame_rame.asp';
echo "hello!\r\n";
$html = new DOMDocument();
#$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[#id='table33']/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}

DomXPath with DOMDocument to get <img> Class URL

I am writing a little scraper script that will find the image URL that has a particular class name. I know that my cURL and DOMDocument is functioning okay, and even the DomXPath really (as far as I can tell, there are no errors) But I am struggling to work out how to get the URL of the xpath query results.
My code so far:
$dom = new DOMDocument();
#$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[#class="productImage"]');
var_dump($div);
echo $div->item(0);
If I var_dump($x) the page outputs no problem. So the CURL is working fine. But I do not know how to get the data that is contained in the $div. I am trying to find an Image with a class of 'productImage' which looks like:
<img src="/uploads/5W/yP/5WyPP4l7Z-jmZRzu_MJ6zg/1077-d.jpg" border="1" alt="Album" class="productImage">
I want the source of that image tag.
Any suggestions?
$dom = new DOMDocument();
$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$imgs = $xpath->query('//*[#class="productImage"]');
foreach($imgs as $img)
{
echo 'ImgSrc: ' . $img->getAttribute('src') .'<br />' . PHP_EOL;
}
Try that...
== EDIT: Additional Info ==
The reason I use a loop here is because you may find more than one img. If you know there is only one element (or you want the first dom node found) you can access the elelement from the domnodelist via the item method of domnodelist - like so:
$dom = new DOMDocument();
$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$img = $xpath->query('//*[#class="productImage"]');
echo 'ImgSrc: ' . $img->item(0)->getAttribute('src') .'<br />' . PHP_EOL;
You don't actually need to use XPath here, because it seems that you're just after images and that can be done by using DOMDocument::getElementsByTagName(), followed by a simple filter:
foreach ($dom->getElementsByTagName('img') as $image) {
$class = $image->getAttribute('class');
if (strpos(" $class ", " productImage ") !== false) {
$url = $image->getAttribute('src');
// do stuff
}
}
Then, you can get the src attribute by using DOMElement::getAttribute():
echo $image->getAttribute('src');

How can I grab div content to display in another page?

I've searched around for solutions to this question, but each one i find, and try doesn't work.
I'm trying to grab the content of a div from a forum topic.
I've tried using preg_match and that only displayed "Array" then I tried using this method
$html = file_get_contents("http://www.lcs-server.co.uk/forum/index.php/topic,$id_topic");
$dom = new DOMDocument;
$dom->loadHTML($html);
$element = $dom->getElementById("msg_$id_msg");
var_dump($element);
This will show "object(DOMElement)#1 (0) { } "
The $id_topic and $id_msg are defined above this code, taken from the forum database. I did try taking the message from the forum database, but it displayed BB code tags, I'd like it to grab the post content, and display it in HTML, as it's displayed on the forum post itself.
This is the code I'm using now and giving me "Fatal error: Cannot redeclare DOMinnerHTML()"
$html = file_get_contents("http://www.lcs-server.co.uk/forum/index.php/topic,$id_topic");
$dom = new DOMDocument;
$dom->loadHTML($html);
$domelement = $dom->getElementById("msg_$id_msg");
foreach ($domelement as $element)
{
echo DOMinnerHTML($element);
}
function DOMinnerHTML($DOMelement)
{
$innerHTML = "";
$children = $DOMelement->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
getElementById returns a DOM node object. It does not return the HTML of the node. For that, you have to get the node's "innerHTML". This properly is not officially supported by PHP's dom object for some reason, but can be faked using this answer: How to get innerHTML of DOMNode?

PHP DOMDocument error handling

I'm having trouble trying to write an if statement for the DOM that will check if $html is blank. However, whenever the HTML page does end up blank, it just removes everything that would be below DOM (including what I had to check if it was blank).
$html = file_get_contents("http://example.com/");
$dom = new DOMDocument;
#$dom->loadHTML($html);
$links = $dom->getElementById('dividhere')->getElementsByTagName('img');
foreach ($links as $link)
{
echo $link->getAttribute('src');
}
All this does is grab an image URL in the specified div, which works perfectly until the page is a blank HTML page.
I've tried using SimpleHTMLDOM, which didn't work either (it didn't even fetch the image on working pages). Did I happen to miss something with this one or am I just missing something in both?
include_once('simple_html_dom.php')
$html = file_get_html("http://example.com/");
foreach($html->find('div[id="dividhere"]') as $div)
{
if(empty($div->src))
{
continue;
}
echo $div->src;
}
Get rid on the $html variable and just load the file into $dom by doing #$dom->loadHTMLFile("http://example.com/");, then have an if statement below that to check if $dom is empty.

Categories