Getting element inside other element by class php DOMDocument

Hi guys, I have this HTML code:
<div class="post-thumbnail2">
<a href="http://example.com" title="Title">
<img src="http://linkimgexample/image.png" alt="Title"/>
</a>
</div>
I want to get the value of the image src (http://linkimgexample/image.png) and the value of the link href (http://example.com) using PHP DOMDocument.
What I did to get the link was something like this:
$divs = $dom->getElementsByTagName("div");
foreach($divs as $div) {
$cl = $div->getAttribute("class");
if ($cl == "post-thumbnail2") {
$links = $div->getElementsByTagName("a");
foreach ($links as $link)
echo $link->getAttribute("href")."<br/>";
}
}
I could do the same for the img src:
$imgs = $div->getElementsByTagName("img");
foreach ($imgs as $img)
echo $img->getAttribute("src")."<br/>";
But sometimes there is no image on the website, and the HTML code looks like this:
<div class="post-thumbnail2">
</div>
So my question is: how can I get both values at the same time, so that when there is no image I show a message instead?
To be more clear, this is an example:
<div class="post-thumbnail2">
<a href="http://example1.com" title="Title">
<img src="http://linkimgexample/image1.png" alt="Title"/>
</a>
</div>
<div class="post-thumbnail2">
</div>
<div class="post-thumbnail2">
<a href="http://example3.com" title="Title">
<img src="http://linkimgexample/image2.png" alt="Title"/>
</a>
</div>
I want the result to be:
http://example1.com - http://linkimgexample/image1.png
http://example2.com - there is no image here !
http://example3.com - http://linkimgexample/image2.png

DOMElement::getElementsByTagName returns a DOMNodeList, which means you can find out whether an img element was found by checking its length property.
$imgs = $div->getElementsByTagName("img");
if($imgs->length > 0) {
foreach ($imgs as $img)
echo $img->getAttribute("src")."<br/>";
} else {
echo "there is no image here!<br/>";
}
You should think about using XPath - it makes your life traversing the DOM a bit easier:
$doc = new DOMDocument();
if($doc->loadHtml($xmlData)) {
$xpath = new DOMXPath($doc);
$postThumbLinks = $xpath->query("//div[@class='post-thumbnail2']/a");
foreach($postThumbLinks as $link) {
$imgList = $xpath->query("./img", $link);
$imageLink = "there is no image here!";
if($imgList->length > 0) {
$imageLink = $imgList->item(0)->getAttribute('src');
}
echo $link->getAttribute('href'), " - ", $link->getAttribute('title'),
" - ", $imageLink, "<br/>", PHP_EOL;
}
} else {
echo "can't load HTML document!", PHP_EOL;
}
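Note that the XPath query above only visits divs that contain a link, so an empty post-thumbnail2 div produces no output line at all. A minimal sketch of one way to get one row per div, assuming the same class name as in the question (the function name thumbnailPairs and the fallback messages are made up for illustration):

```php
<?php
// Returns one row per div.post-thumbnail2, with null where the link
// or the image is missing. thumbnailPairs() is a hypothetical helper.
function thumbnailPairs(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query("//div[@class='post-thumbnail2']") as $div) {
        // string() yields '' when the attribute does not exist
        $href = $xpath->evaluate("string(./a/@href)", $div);
        $src  = $xpath->evaluate("string(./a/img/@src)", $div);
        $rows[] = [
            'href' => $href !== '' ? $href : null,
            'src'  => $src !== '' ? $src : null,
        ];
    }
    return $rows;
}

$sample = '<div class="post-thumbnail2"><a href="http://example1.com" title="Title">'
    . '<img src="http://linkimgexample/image1.png" alt="Title"/></a></div>'
    . '<div class="post-thumbnail2"></div>'
    . '<div class="post-thumbnail2"><a href="http://example3.com" title="Title">'
    . '<img src="http://linkimgexample/image2.png" alt="Title"/></a></div>';

foreach (thumbnailPairs($sample) as $row) {
    echo $row['href'] ?? 'there is no link here!', ' - ',
         $row['src'] ?? 'there is no image here!', PHP_EOL;
}
```

Iterating over the divs rather than the links is the design choice here: each div always yields a row, and the missing pieces are filled in when printing.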

Related

PHP DomXpath xpath query of Child Node

I'm trying to use xpath to query some HTML:
<a target="_blank" class="dx-smart-widget-grid-item_113_20" href="https://link.com" title="Rules for the Road to One Source of Truth' with Jaguar Land Rover and Spark44">
<div class="dx-smart-widget-grid-info_113_20">
<img class="dx-smart-widget-report-cover_113_20" src="https://imagelink.com/preview.png" alt="The Alternative Text"/>
<div class="dx-smart-widget-grid-text_113_20">
<div class="dx-smart-widget-grid-title_113_20">The Alternative Text</div>
</div>
<span class="dx-smart-widget-report-assettype_113_20">On-Demand Webinar</span>
<img class="dx-smart-widget-partner-logo_113_20" src="https://logopath/logo.png" alt="censhare"/>
</div>
</a>
This is the code I'm using:
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//a[contains(@class,'dx-smart-widget-grid-item_113_20')]");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<strong>Link: </strong>". $element->getAttribute('href'). "<br />";
echo "<strong>Title: </strong>". $element->getAttribute('title'). "<br />";
$images = $xpath->query("//img[contains(@class,'dx-smart-widget-report-cover_113_20')]", $element);
echo "<strong>Image: </strong>".$images->getAttribute('src'). "<br />";
}
}
I'm getting the href and title fine... but trying to query the image just isn't working. It actually breaks.
Any help would be appreciated.
Assuming there is only 1 matching image, you can use XPath's evaluate() and string() in the XPath expression to extract the value in one go. Note the leading dot, which keeps the search relative to $element instead of starting again from the document root:
$images = $xpath->evaluate("string(.//img[contains(@class,'dx-smart-widget-report-cover_113_20')]/@src)", $element);
echo "<strong>Image: </strong>".$images. "<br />";
You are almost there. You just need to iterate over $images in a foreach loop. So replace
echo "<strong>Image: </strong>".$images->getAttribute('src'). "<br />";
with
foreach ($images as $image) {
echo "<strong>Image: </strong>".$image->getAttribute('src'). "<br />";
};
and it should work.
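One thing to watch with context nodes: in XPath, a path that starts with // searches from the document root even when a context node is passed, so each link would see every image in the document. A small sketch of the relative form, using shortened class names (grid-item, report-cover) and a helper name (coverImagesByLink) that are assumptions for illustration only:

```php
<?php
// Maps each link's href to the src of the cover image inside that link.
// coverImagesByLink() and the short class names are hypothetical.
function coverImagesByLink(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->query("//a[contains(@class, 'grid-item')]") as $a) {
        // ".//" restricts the search to descendants of $a
        $src = $xpath->evaluate(
            "string(.//img[contains(@class, 'report-cover')]/@src)",
            $a
        );
        $result[$a->getAttribute('href')] = $src;
    }
    return $result;
}

$html = '<a class="grid-item" href="https://one.example"><span>'
    . '<img class="report-cover" src="one.png"/>'
    . '<img class="partner-logo" src="logo.png"/></span></a>'
    . '<a class="grid-item" href="https://two.example"><span>'
    . '<img class="report-cover" src="two.png"/></span></a>';

print_r(coverImagesByLink($html));
```

With "//img[...]" in place of ".//img[...]", both links would report the first cover image in the document.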

PHP string search and replace - possible use of DOM Needed

I can't seem to figure out how to achieve my goal.
I want to find a link with a specific class in output generated from an RSS feed and replace its href (I need the option to replace it later, no matter what link is there).
Example HTML:
<a class="epclean1" href="#">
WHAT IT SHOULD LOOK LIKE:
<a class="epclean1" href="google.com">
I may need to get the element through the DOM, since the full PHP already creates a document. If that is the case, I would need to know how to find the link by class and add the href URL that way.
FULL PHP:
<?php
$rss = new DOMDocument();
$feed = array();
$urlArray = array(array('url' => 'https://feeds.megaphone.fm')
);
foreach ($urlArray as $url) {
$rss->load($url['url']);
foreach ($rss->getElementsByTagName('item') as $node) {
$item = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue
);
array_push($feed, $item);
}
}
usort( $feed, function ( $a, $b ) {
return strcmp($a['title'], $b['title']);
});
$limit = sizeof($feed);
$previous = null;
$count_firstletters = 0;
for ($x = 0; $x < $limit; $x++) {
$firstLetter = substr($feed[$x]['title'], 0, 1); // Getting the first letter from the Title you're going to print
if($previous !== $firstLetter) { // If the first letter is different from the previous one then output the letter and start the UL
if($count_firstletters != 0) {
echo '</ul>'; // Closing the previously open UL only if it's not the first time
echo '</div>';
}
echo '<button class="glanvillecleancollapsible">'.$firstLetter.'</button>';
echo '<div class="glanvillecleancontent">';
echo '<ul style="list-style-type: none">';
$previous = $firstLetter;
$count_firstletters ++;
}
$title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
echo '<li>';
echo '<a class="epclean'.$i++.'" href="#" target="_blank">'.$title.'</a>';
echo '</li>';
}
echo '</ul>'; // Close the last UL
echo '</div>';
?>
</div>
</div>
The full PHP above renders on the site like this (shortened, as there are 200+ entries):
<div class="modal-glanvillecleancontent">
<span class="glanvillecleanclose">×</span>
<p id="glanvillecleaninstruct">Select the first letter of the episode that you wish to get clean version for:</p>
<br>
<button class="glanvillecleancollapsible">8</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean1" href="#" target="_blank">80's Video Vixen Tawny Kitaen 044</a></li>
</ul>
</div>
<button class="glanvillecleancollapsible">A</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean2" href="#" target="_blank">Abby Stern</a></li>
<li><a class="epclean3" href="#" target="_blank">Actor Nick Hounslow 104</a></li>
<li><a class="epclean4" href="#" target="_blank">Adam Carolla</a></li>
<li><a class="epclean5" href="#" target="_blank">Adrienne Janic</a></li>
</ul>
</div>
You're not very clear about how your question relates to the code shown, but I don't see any attempt to replace the attribute within the DOM code. You'd want to look at XPath to find the desired elements:
function change_clean($content) {
$dom = new DomDocument;
$dom->loadXML($content);
$xpath = new DomXpath($dom);
$nodes = $xpath->query("//a[@class='epclean1']");
foreach ($nodes as $node) {
if ($node->getAttribute("href") === "#") {
$node->setAttribute("href", "https://google.com/");
}
}
return $dom->saveXML();
}
$xml = '<?xml version="1.0"?><foo><bar><a class="epclean1" href="#">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>';
echo change_clean($xml);
Output:
<foo><bar><a class="epclean1" href="https://google.com/">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>
Hmm. I think your pattern and replacement might be your problem.
What you have
$pattern = 'class="epclean1 href="(.*?)"';
$replacement = 'class="epclean1 href="google.com"';
Fix
$pattern = '/class="epclean1" href=".*"/';
$replacement = 'class="epclean1" href="google.com"';
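Since the generated markup is HTML rather than XML, a DOM-based variant might look like the sketch below. It assumes the class names are plain tokens such as epclean1 without quotes, wraps the fragment in a temporary div so it can be serialized back without the html/body wrapper that loadHTML adds, and uses a made-up function name (setHrefByClass):

```php
<?php
// Sets the href of every <a> whose class attribute equals a key of
// $urlsByClass. setHrefByClass() is a hypothetical helper.
function setHrefByClass(string $html, array $urlsByClass): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($urlsByClass as $class => $url) {
        // Assumes $class contains no quote characters
        foreach ($xpath->query("//a[@class='" . $class . "']") as $a) {
            $a->setAttribute('href', $url);
        }
    }

    // Serialize only the children of the temporary wrapper
    $out = '';
    $wrapper = $xpath->query("//div[@id='wrapper']")->item(0);
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$fragment = '<ul><li><a class="epclean1" href="#">First</a></li>'
    . '<li><a class="epclean2" href="#">Second</a></li></ul>';

echo setHrefByClass($fragment, ['epclean1' => 'https://google.com/']), PHP_EOL;
```

Keeping the class-to-URL pairs in an array means the replacement can be changed later without touching the code that generates the list.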

html DOM program to find href value

I am a newbie in php and I have been assigned with a project to fetch the HREF value from the following HTML snippet:
<p class="title">
<a href="http://canon.com/">Canon Pixma iP100 + Accu Kit
</a>
</p>
Now for this am using the following code:
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach($dom->getElementsByTagName('p') as $link) {
# Show the <a href>
foreach($link->getElementsByTagName('a') as $link)
{
echo $link->getAttribute('href');
echo "<br />";
}
}
This code gives me the HREF value of all <a href> from all the <P> tag in that page. I want to parse the <P> with the class "title" only...I can't use Simple_HTML_DOM or any kind of library here.
Thanks in advance.
Alternatively, you could use DOMXpath for this one. Like this:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
// target p tags with a class with "title" with an anchor tag
$target_element = $xpath->query('//p[@class="title"]/a');
if($target_element->length > 0) {
foreach($target_element as $link) {
echo $link->getAttribute('href'); // http://canon.com/
}
}
Or, if you want to traverse the DOM yourself, you have to search it manually:
foreach($dom->getElementsByTagName('p') as $p) {
// if p tag has a "title" class
if($p->getAttribute('class') == 'title') {
foreach($p->childNodes as $child) {
// if it has an anchor child (nodeName is used because text nodes have no tagName)
if($child->nodeName == 'a' && $child->hasAttribute('href')) {
echo $child->getAttribute('href'); // http://canon.com/
}
}
}
}
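Both variants above compare the whole class attribute, so they miss an element like <p class="title featured">. A common XPath 1.0 idiom matches a single class token instead; here is a minimal sketch (the function name titleLinks and the sample markup are assumptions for illustration):

```php
<?php
// Returns the href of every <a> that is a direct child of a <p> whose
// class list contains the token "title". titleLinks() is hypothetical.
function titleLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Pad the class list with spaces so " title " matches whole tokens only
    $query = "//p[contains(concat(' ', normalize-space(@class), ' '), ' title ')]/a[@href]";

    $hrefs = [];
    foreach ($xpath->query($query) as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

$html = '<p class="title featured"><a href="http://canon.com/">Canon Pixma iP100</a></p>'
    . '<p class="subtitle"><a href="http://other.example/">Other</a></p>'
    . '<p><a href="http://third.example/">Third</a></p>';

print_r(titleLinks($html));
```

The space padding is what keeps "subtitle" from matching, which a plain contains(@class, 'title') would not do.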

Get img src with PHP Simple HTML DOM

I need to get the image src from the following code
HTML
<div class="avatar profile_CF48B2B4A31B43EC96F0561F498CE6BF ">
<a onclick="">
<img id="lazyload_-247847544_0" height="74" width="74" class="avatar potentialFacebookAvatar avatarGUID:CF48B2B4A31B43EC96F0561F498CE6BF" src="http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg" />
</a>
</div>
I tried writing this code:
foreach($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF] a img') as $element) {
$img = $element->getAttribute('src');
echo $img;
}
But it reports that the src key doesn't exist. How can I scrape the review avatar images?
UPDATE:
The image url is not found when I looked at the page source, But firebug shows the image url:
<img id='lazyload_1953171323_17' height='24' alt='4 helpful votes' width='25' class='icon lazy'/>
Here is my page's source code:
<div class="col1of2">
<div class="member_info">
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-SRC_175428572" class="memberOverlayLink" onmouseover="ta.trackEventOnPage('Reviews','show_reviewer_info_window','user_name_photo'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', 0, (new Element(this)).getElement('.avatar')&&(new Element(this)).getElement('.avatar').getStyle('border-radius')=='100%'?-10:0);">
<div class="avatar profile_3E0FAF58557D3375508A9E5D9A7BD42F ">
<a onclick=>
<img id='lazyload_1953171323_15' height='74' width='74' class='avatar potentialFacebookAvatar avatarGUID:3E0FAF58557D3375508A9E5D9A7BD42F'/>
</a>
</div>
<div class="username mo">
<span class="expand_inline scrname hvrIE6 mbrName_3E0FAF58557D3375508A9E5D9A7BD42F" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">Prataspeles</span>
</div>
</div>
<div class="location">
Latvia
</div>
</div>
<div class="memberBadging">
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-CONT" class="totalReviewBadge badge no_cpu" onclick="ta.trackEventOnPage('Reviews','show_reviewer_info_window','review_count'); ta.util.cookie.setPIDCookie('15984'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', -10, -50);">
<div class="reviewerTitle">Reviewer</div>
<img id='lazyload_1953171323_16' height='24' alt='4 reviews' width='25' class='icon lazy'/>
<span class="badgeText">4 reviews</span>
</div>
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-HV" class="helpfulVotesBadge badge no_cpu" onclick="ta.trackEventOnPage('Reviews','show_reviewer_info_window','helpful_count'); ta.util.cookie.setPIDCookie('15983'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', -22, -50);">
<img id='lazyload_1953171323_17' height='24' alt='4 helpful votes' width='25' class='icon lazy'/>
<span class="badgeText">4 helpful votes</span>
</div>
</div>
</div>
Is there any problem because of using lazyload?
UPDATE 2
Lazyload makes the images load only after the page has loaded. I tried getting the image IDs and comparing them with the lazyload JS array, but the IDs don't match the ones in the lazyload array.
Question:
How to get this js array from this JSON?
Example:
{"id":"lazyload_-205858383_0","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg"}
, {"id":"lazyload_-205858383_1","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
, {"id":"lazyload_-205858383_2","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/01/2a/fd/98/avatar.jpg"}
, {"id":"lazyload_-205858383_3","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
, {"id":"lazyload_-205858383_4","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/5e/avatar036.jpg"}
, {"id":"lazyload_-205858383_5","tagType":"img","scroll":false,"priority":100,"data":"http://c1.tacdn.com/img2/badges/badge_helpful.png"}
You are having difficulty because JavaScript is used to lazy-load the image once the page is loaded. Use the PHP DOM parser to find the id of the element, and then use a regular expression to find the relevant image based on this id.
To achieve this, try something like :
$json = json_decode("<JSONSTRING HERE>", true); // true: decode to arrays so $lazy["id"] works
foreach($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF] a img') as $element) {
$imgId = $element->getAttribute('id');
foreach ($json as $lazy)
{
if ($lazy["id"] == $imgId) echo $lazy["data"];
}
}
The above is untested, so you will need to resolve the kinks. The key is to extract the relevant JavaScript and convert it to JSON.
Alternatively, you can use string search functions to get the row which contains the information about the img, and extract the required value.
If you're looking for all IDs that contain the substring, "lazyload", you might try the wildcard selector and upon a hit look at the 'src' property of the element found. See the jsfiddle below. Good luck!
$(document.body).find('img[id*=lazyload]').each(function() {
console.log($(this).prop('src'));
});
Jsfiddle
Try this -
foreach($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF ] a img') as $element) {
$img = $element->getAttribute('src');
echo $img;
}
There is a space after the class name in the HTML, so you have to add a space at the end of the class name in the selector.
OR
use the full class name:
$html->find('div[class=avatar profile_CF48B2B4A31B43EC96F0561F498CE6BF ] a img')
Use jQuery selectors i.e. $('#lazyload_-247847544_0') and you can get the image source using this
var src = $('#lazyload_-247847544_0').attr('src');
Or more specifically
$('.profile_CF48B2B4A31B43EC96F0561F498CE6BF #lazyload_-247847544_0').attr('src');
Thanks
function getReviews(){
$url = 'http://www.tripadvisor.com/Hotel_Review-g274965-d952833-Reviews-Ezera_Maja-Liepaja_Kurzeme_Region.html';
$html = new simple_html_dom();
$html = file_get_html($url);
$array = array();
$i = 0;
// IMG ID
foreach($html->find('div[class=avatar] a img') as $element) { $array[$i]['id'] = $element->getAttribute('id'); $i++;} unset($i);$i = 0;
// IMG SRC
$p1 = strpos( $html, 'var lazyImgs =' ) + 14;
$p2 = strpos( $html, ']', $p1 );
$raw = substr( $html, $p1, $p2 - $p1 ) . ']';
$images = json_decode($raw);
foreach ($images as $image){
$id = $image->id;
$data = $image->data;
foreach ($array as $element){
if ( isset($element['id']) && $element['id'] == $id){
$array[$i]['image'] = $data;
$i++;
}
}
}
$html->clear();
unset($html);
return $array;
}
Get the IMG IDs into an array. Then extract the lazyImgs variable from the page source and decode it as JSON. Then compare the two arrays and, if an ID matches, add the data to the array.
Thanks to everybody!
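The nested loop above can be avoided by building an id-to-URL lookup first. A rough sketch, assuming the page source contains a "var lazyImgs = [...]" array of objects with id and data keys as in the JSON shown in the question, and that no URL inside it contains a ] character (the function name lazyImageMap is made up):

```php
<?php
// Builds an [id => image URL] map from the lazy-load array embedded in
// the page source. lazyImageMap() is a hypothetical helper.
function lazyImageMap(string $pageSource): array
{
    // Capture from the opening [ to the first ] after "var lazyImgs ="
    if (!preg_match('/var\s+lazyImgs\s*=\s*(\[.*?\])/s', $pageSource, $m)) {
        return [];
    }
    $entries = json_decode($m[1], true); // true: decode objects as arrays
    if (!is_array($entries)) {
        return [];
    }
    $map = [];
    foreach ($entries as $entry) {
        if (isset($entry['id'], $entry['data'])) {
            $map[$entry['id']] = $entry['data'];
        }
    }
    return $map;
}

$source = '<script>var lazyImgs = ['
    . '{"id":"lazyload_1","tagType":"img","data":"http://example.com/a.jpg"}'
    . ', {"id":"lazyload_2","tagType":"img","data":"http://example.com/b.png"}'
    . '];</script>';

$map = lazyImageMap($source);
echo $map['lazyload_1'] ?? 'no image for this id', PHP_EOL;
```

Each img id found with the DOM parser can then be looked up directly in the map.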

PHP regex to check if image is wrapped with a tag

I am creating a wordpress function and need to determine whether an image in the content is wrapped with an a tag that contains a link to a PDF or DOC file e.g.
<img src="../images/image.jpg" />
How would I go about doing this with PHP?
Thanks
I would very strongly advise against using a regular expression for this. Besides being more error prone and less readable, it also does not give you the ability to manipulate the content easily.
You would be better off loading the content into a DOMDocument, retrieving all <img> elements and checking whether or not their parents are <a> elements. All you would have to do then is check whether or not the value of the href attribute ends with the desired extension.
A very crude implementation would look a bit like this:
<?php
$sHtml = <<<HTML
<html>
<body>
<img src="../images/image.jpg" />
<img src="../images/image.jpg" />
<img src="../images/image.jpg" />
<p>this is some text <a href="site.com/doc.pdf"> more text</p>
</body>
</html>
HTML;
$oDoc = new DOMDocument();
$oDoc->loadHTML($sHtml);
$oNodeList = $oDoc->getElementsByTagName('img');
foreach($oNodeList as $t_oNode)
{
if($t_oNode->parentNode->nodeName === 'a')
{
$sLinkValue = $t_oNode->parentNode->getAttribute('href');
$sExtension = substr($sLinkValue, strrpos($sLinkValue, '.'));
echo '<li>I am wrapped in an anchor tag '
. 'and I link to a ' . $sExtension . ' file '
;
}
}
?>
I'll leave an exact implementation as an exercise for the reader ;-)
Here is a DOM parse based code that you can use:
$html = <<< EOF
<img src="../images/image.jpg" />
<img src="../images/image1.jpg" />
<IMG src="../images/image2.jpg" />
<img src="../images/image3.jpg" />
My PDF
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$nodeList = $doc->getElementsByTagName('a');
for($i=0; $i < $nodeList->length; $i++) {
$node = $nodeList->item($i);
$children = $node->childNodes;
$hasImage = false;
foreach ($children as $child) {
if ($child->nodeName == 'img') {
$hasImage = true;
break;
}
}
if (!$hasImage)
continue;
if ($node->hasAttributes())
foreach ($node->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
if ($attr->nodeName == 'href' &&
preg_match('/\.(doc|pdf)$/i', $attr->nodeValue)) {
echo $attr->nodeValue .
" - Image is wrapped in a link to a PDF or DOC file\n";
break;
}
}
}
Live Demo: http://ideone.com/dwJNAj
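The same check can also be written with XPath plus parse_url(), which ignores query strings and letter case when comparing the extension. This is only a sketch under those assumptions; the function name imagesLinkedToDocuments and the sample markup are invented for illustration:

```php
<?php
// Lists images that sit inside a link whose path ends in one of the
// given extensions. imagesLinkedToDocuments() is a hypothetical helper.
function imagesLinkedToDocuments(string $html, array $extensions = ['pdf', 'doc']): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $matches = [];
    foreach ($xpath->query('//a[@href][.//img]') as $a) {
        $href = $a->getAttribute('href');
        // Look only at the path part, so "file.doc?download=1" still counts
        $path = (string) parse_url($href, PHP_URL_PATH);
        $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (!in_array($ext, $extensions, true)) {
            continue;
        }
        foreach ($xpath->query('.//img', $a) as $img) {
            $matches[] = ['src' => $img->getAttribute('src'), 'href' => $href];
        }
    }
    return $matches;
}

$html = '<a href="files/report.PDF"><img src="a.jpg"/></a>'
    . '<a href="page.html"><img src="b.jpg"/></a>'
    . '<img src="c.jpg"/>'
    . '<a href="notes.doc?download=1"><img src="d.jpg"/></a>';

print_r(imagesLinkedToDocuments($html));
```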
