PHP DOMXPath Query - php

I am trying to read the href of an "a" tag and the img inside that tag using a PHP DOMXPath query.
I am using the code below to get the "a" tags:
$showPage = file_get_contents($url);
$dom = new DOMDocument();
$dom->validateOnParse = true;
$dom->loadHTML($showPage);
$dom->preserveWhiteSpace = false;
$allBollyList = new DOMXPath($dom);
$allBollyTableHTML = $allBollyList->query('//div[contains(@class, "covers")]//a');
foreach($allBollyTableHTML as $item) {
$sourceLink = $item->getAttribute("href");
}
However, the "a" in the HTML is as below.
<img src="http://test.com/test.jpg" alt="Song Name"><div></div>
I want to read the "img" tag and get the "src" and "alt" attributes of that "img" tag.
Can anyone please help? I am very new to PHP.
Thanks.

Never mind. I finally found the solution below to read the "img" tag inside the "a" tag.
foreach($item->getElementsByTagName('img') as $img){
echo $img->getAttribute('src') . "\r\n";
}
So the new foreach loop over $allBollyTableHTML will be as below.
foreach($allBollyTableHTML as $item) {
$sourceLink = $item->getAttribute("href");
foreach($item->getElementsByTagName('img') as $img){
echo $img->getAttribute('src');
}
}
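If both "src" and "alt" are needed, a relative XPath query can also be run against each matched "a" node. The following is only a minimal sketch: the function name extractCoverLinks and the sample markup in $sampleHtml are made up for illustration and are not from the original page.

```php
<?php
// Hypothetical helper: returns the href of every <a> under a div whose class
// attribute contains "covers", together with the src/alt of each img inside it.
function extractCoverLinks(string $html): array
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $results = [];
    foreach ($xpath->query('//div[contains(@class, "covers")]//a') as $a) {
        // The leading "." makes the query relative to the current <a> node.
        foreach ($xpath->query('.//img', $a) as $img) {
            $results[] = [
                'href' => $a->getAttribute('href'),
                'src'  => $img->getAttribute('src'),
                'alt'  => $img->getAttribute('alt'),
            ];
        }
    }
    return $results;
}

// Sample markup invented for this example.
$sampleHtml = '<div class="covers"><a href="/song/1">'
            . '<img src="http://test.com/test.jpg" alt="Song Name"></a></div>';

print_r(extractCoverLinks($sampleHtml));
```

Passing the "a" node as the second argument of query() keeps the search inside that link, so images elsewhere on the page are not picked up.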

Related

Image pulling script not working PHP

I'm not sure why the code below is not working. It takes the else branch of the if statement, which says that no img tags were found on the page, but I'm sure they are there. Any advice or guidance would be appreciated.
// This variable will contain all the HTML source code of the sample page
$htmlContent = file_get_contents('https://www.instagram.com/ken_flavius/');
var_dump($htmlContent);
// We'll add all the images in this array
$images = [];
// Instantiate a new object of class DOMDocument
$doc = new DOMDocument();
// Load the HTML doc into the object
$doc->loadHTML($htmlContent);
// Get all the IMG tags in the document
$elements = $doc->getElementsByTagName('img');
// If we get at least one result
if($elements->length > 0)
{
// Loop on all of the IMG tags
foreach($elements as $element)
{
// Get the attribute SRC of the IMG tag (this is the link of the image)
$src = $element->getAttribute('src');
if (strlen($src) > 0) {
// Add the link to the array containing all the links
array_push($images, $src);
}
}
//show all links
echo '<pre>'."\r\n";
print_r($images);
echo '</pre>'."\r\n";
} else {
// No result, it means that there were no IMG tags
echo 'no img tag found in the HTML source provided!';
}
Edited it to show the exact example that I'm using.
$url="http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo $tag->getAttribute('src');
}
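The @ prefix only hides the warnings that loadHTML() raises on real-world markup. An alternative is libxml_use_internal_errors(), sketched below on an inline sample. The function name collectImageSources and the sample string are assumptions for illustration. Note also that DOMDocument only sees the HTML the server returns; if a page builds its images with JavaScript after loading, which may be the case for the page in the question, there are no img tags in the fetched source to find.

```php
<?php
// Hypothetical helper: collects the non-empty src values of all img tags.
function collectImageSources(string $html): array
{
    // Buffer libxml warnings instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $images[] = $src;
        }
    }
    return $images;
}

// Sample markup invented for this example; the second img has no src.
$sampleHtml = '<html><body><img src="a.png"><img><img src="b.jpg"></body></html>';

print_r(collectImageSources($sampleHtml));
```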

Retrieve data from html page using xpath and php

I know there are similar questions, but while studying PHP I ran into this problem and I want to understand why it occurs.
<?php
$url = 'http://aice.anie.it/quotazione-lme-rame/';
echo "hello!\r\n";
$html = new DOMDocument();
@$html->loadHTML($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[@id='table33']/tbody/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
?>
This prints just "hello!". I want to print the value extracted with the XPath query, but the last echo doesn't output anything.
You have some errors in your code:
You try to get the table from the URL http://aice.anie.it/quotazione-lme-rame/, but it is actually in an iframe located at http://www.aiceweb.it/it/frame_rame.asp, so load the iframe URL directly.
You use the loadHTML() function, which loads HTML from a string. What you need is loadHTMLFile(), which takes the path or URL of an HTML document as its parameter (see http://www.php.net/manual/fr/domdocument.loadhtmlfile.php).
You assume there is a tbody element on the page, but there is none. So remove it from your query.
Working code:
$url = 'http://www.aiceweb.it/it/frame_rame.asp';
echo "hello!\r\n";
$html = new DOMDocument();
@$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[@id='table33']/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
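The tbody point is easy to trip over, because browser developer tools usually show a tbody even when the page source has none. The sketch below uses an invented inline table (not the real page) and a hypothetical helper named queryValues to compare the direct path from the answer with a more tolerant variant that uses // so it matches whether or not a tbody is present.

```php
<?php
// Hypothetical helper: runs an XPath query and returns the node values.
function queryValues(string $html, string $query): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);   // loadHTML() takes markup; loadHTMLFile() takes a path or URL
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
}

// Sample table invented for this example; note that it has no tbody.
$sampleHtml = '<table id="table33">'
            . '<tr><td>Date</td><td>Unit</td><td>Price</td></tr>'
            . '<tr><td>Mon</td><td>EUR</td><td><b>123</b></td></tr>'
            . '</table>';

// Direct child path, as in the working code above.
$direct = queryValues($sampleHtml, "//*[@id='table33']/tr[2]/td[3]/b");

// Tolerant path: "//tr" matches rows with or without an intermediate tbody.
$tolerant = queryValues($sampleHtml, "//*[@id='table33']//tr[2]/td[3]/b");

print_r($direct);
print_r($tolerant);
```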

DOMDocument grab html between two p tags [duplicate]

I'm trying to replace video links inside a string - here's my code:
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach ($doc->getElementsByTagName("a") as $link)
{
$url = $link->getAttribute("href");
if(strpos($url, ".flv"))
{
echo $link->outerHTML();
}
}
Unfortunately, outerHTML doesn't work when I'm trying to get the html code for the full hyperlink like <a href='http://www.myurl.com/video.flv'></a>
Any ideas how to achieve this?
As of PHP 5.3.6 you can pass a node to saveHtml, e.g.
$domDocument->saveHtml($nodeToGetTheOuterHtmlFrom);
Previous versions of PHP did not implement that possibility. You'd have to use saveXml(), but that would create XML compliant markup. In the case of an <a> element, that shouldn't be an issue though.
See http://blog.gordon-oheim.biz/2011-03-17-The-DOM-Goodie-in-PHP-5.3.6/
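Applied to the loop from the question, this could look like the sketch below. The function name flvLinksOuterHtml and the sample markup are assumptions for illustration. It also compares the strpos() result against false, because strpos() returns 0 (which is falsy) when the match is at the start of the string.

```php
<?php
// Hypothetical helper: returns the outer HTML of every link whose href contains ".flv".
function flvLinksOuterHtml(string $html): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $found = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $url = $link->getAttribute('href');
        if (strpos($url, '.flv') !== false) {
            // Passing a node to saveHTML() serializes just that node (PHP 5.3.6+).
            $found[] = $doc->saveHTML($link);
        }
    }
    return $found;
}

// Sample markup invented for this example.
$sampleHtml = '<p><a href="http://www.myurl.com/video.flv">Video</a> '
            . '<a href="/page.html">Page</a></p>';

print_r(flvLinksOuterHtml($sampleHtml));
```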
You can find a couple of propositions in the users notes of the DOM section of the PHP Manual.
For example, here's one posted by xwisdom :
<?php
// code taken from the Raxan PDI framework
// returns the html content of an element
protected function nodeContent($n, $outer=false) {
$d = new DOMDocument('1.0');
$b = $d->importNode($n->cloneNode(true),true);
$d->appendChild($b); $h = $d->saveHTML();
// remove outer tags
if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
return $h;
}
?>
One simple solution is to define your own function that returns the outer HTML of a node:
function outerHTML($e) {
$doc = new DOMDocument();
$doc->appendChild($doc->importNode($e, true));
return $doc->saveHTML();
}
Then you can use it in your code:
echo outerHTML($link);
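Save a file that contains the links as links.html, or change "links.html" in the code to another source (say google.com/fly.html) that has .flv links in it. To match a different extension, change ".flv" to ".wmv", etc.
Note that the query selects every href attribute, so if there are other elements with an href it will pick them up as well.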
<?php
$contents = file_get_contents("links.html");
$domdoc = new DOMDocument();
$domdoc->preserveWhiteSpace = false;
$domdoc->loadHTML($contents);
$xpath = new DOMXpath($domdoc);
$query = '//#href';
$nodeList = $xpath->query($query);
foreach ($nodeList as $node){
if(strpos($node->nodeValue, ".flv")){
$linksList = $node->nodeValue;
$htmlAnchor = new DOMElement("a", $linksList);
$htmlURL = new DOMAttr("href", $linksList);
$domdoc->appendChild($htmlAnchor);
$htmlAnchor->appendChild($htmlURL);
$domdoc->saveHTML();
echo ("<a href='". $node->nodeValue. "'>". $node->nodeValue. "</a><br />");
}
}
echo("done");
?>

DomXPath with DOMDocument to get <img> Class URL

I am writing a little scraper script that finds the URL of an image with a particular class name. I know that my cURL and DOMDocument code is working, and as far as I can tell DOMXPath is too (there are no errors), but I am struggling to work out how to get the URL from the XPath query results.
My code so far:
$dom = new DOMDocument();
@$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="productImage"]');
var_dump($div);
echo $div->item(0);
If I var_dump($x) the page outputs no problem. So the CURL is working fine. But I do not know how to get the data that is contained in the $div. I am trying to find an Image with a class of 'productImage' which looks like:
<img src="/uploads/5W/yP/5WyPP4l7Z-jmZRzu_MJ6zg/1077-d.jpg" border="1" alt="Album" class="productImage">
I want the source of that image tag.
Any suggestions?
$dom = new DOMDocument();
$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$imgs = $xpath->query('//*[@class="productImage"]');
foreach($imgs as $img)
{
echo 'ImgSrc: ' . $img->getAttribute('src') .'<br />' . PHP_EOL;
}
Try that...
== EDIT: Additional Info ==
The reason I use a loop here is that you may find more than one img. If you know there is only one element (or you only want the first DOM node found), you can access the element through the item() method of the DOMNodeList, like so:
$dom = new DOMDocument();
$dom->loadHTML($x);
$xpath = new DomXpath($dom);
$img = $xpath->query('//*[@class="productImage"]');
echo 'ImgSrc: ' . $img->item(0)->getAttribute('src') .'<br />' . PHP_EOL;
You don't actually need to use XPath here, because it seems that you're just after images and that can be done by using DOMDocument::getElementsByTagName(), followed by a simple filter:
foreach ($dom->getElementsByTagName('img') as $image) {
$class = $image->getAttribute('class');
if (strpos(" $class ", " productImage ") !== false) {
$url = $image->getAttribute('src');
// do stuff
}
}
Then, you can get the src attribute by using DOMElement::getAttribute():
echo $image->getAttribute('src');
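If XPath is preferred after all, the same whole-word class test can be written inside the query, so that an image with several classes still matches while a class such as "productImageLarge" does not. This is a sketch under assumptions: the function name productImageSources and the sample markup are invented for illustration.

```php
<?php
// Hypothetical helper: returns the src of every img that has "productImage"
// as one of its space-separated class names.
function productImageSources(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Pad the class list with spaces so only whole class names match.
    $query = '//img[contains(concat(" ", normalize-space(@class), " "), " productImage ")]/@src';

    $sources = [];
    foreach ($xpath->query($query) as $srcAttr) {
        // The query selects attribute nodes, so nodeValue is the URL itself.
        $sources[] = $srcAttr->nodeValue;
    }
    return $sources;
}

// Sample markup invented for this example.
$sampleHtml = '<img src="/a/1077-d.jpg" class="thumb productImage" alt="Album">'
            . '<img src="/a/large.jpg" class="productImageLarge" alt="Other">';

print_r(productImageSources($sampleHtml));
```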

How to get a div via PHP?

I get a page using file_get_contents from a remote server, but I want to filter that page and get a DIV from it that has class "text" using PHP. I started with DOMDocument but I'm lost now.
Any help?
$file = file_get_contents("xx");
$elements = new DOMDocument();
$elements->loadHTML($file);
foreach ($elements as $element) {
if( !is_null($element->attributes)) {
foreach ($element->attributes as $attrName => $attrNode) {
if( $attrName == "class" && $attrNode== "text") {
echo $element;
}
}
}
}
Once you have loaded the document to a DOMDocument instance, you can use XPath queries on it -- which might be easier than going yourself through the DOM.
For that, you can use the DOMXpath class.
For example, you should be able to do something like this :
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="text"]');
foreach ($tags as $tag) {
var_dump($tag->textContent);
}
(Not tested, so you might need to adapt the XPath query a bit...)
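For comparison, the manual approach from the question can be made to work too. Iterating over the DOMDocument object itself does not visit its elements, so the sketch below walks getElementsByTagName('div') instead and returns the markup of each matching div rather than only its text. The function name findDivsWithClass and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the outer HTML of every div that has
// $className as one of its space-separated class names.
function findDivsWithClass(string $html, string $className): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $matches = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        // Split the class attribute so "text intro" matches "text" but "textual" does not.
        $classes = preg_split('/\s+/', trim($div->getAttribute('class')));
        if (in_array($className, $classes, true)) {
            $matches[] = $doc->saveHTML($div);
        }
    }
    return $matches;
}

// Sample markup invented for this example.
$sampleHtml = '<div class="nav">Menu</div>'
            . '<div class="text intro">Hello <b>world</b></div>';

print_r(findDivsWithClass($sampleHtml, 'text'));
```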
Personally, I like Simple HTML Dom Parser.
include "lib.simple_html_dom.php";
$html = file_get_html('http://scrapeyoursite.com');
// find() with an index returns a single element; without it, an array
echo $html->find('div.text', 0)->plaintext;
Pretty simple, huh? It accommodates selectors like jQuery :)
You can use simple_html_dom as described in the simple_html_dom documentation, or use code like this:
include "simple_html_dom.php";
$html = new simple_html_dom();
$html->load_file('http://www.yoursite.com');
// find('div', 0) returns the first div in the document
$con_div = $html->find('div', 0);
// print the content of that div as plain text
echo $con_div->plaintext;
This finds the first div (index 0 in find('div', 0)) and shows it as plain text.
I hope this helps.
