Get img src inside an a href html dom parser - php

I am using the code below to get some data from an HTML page with PHP Simple HTML DOM Parser.
Almost everything works. The one issue I am facing is that I can't grab the img src. My code is:
foreach($html->find('article') as $article) {
$item['title'] = $article->find('.post-title', 0)->plaintext;
$item['thumb'] = $article->find('.post-thumbnail', 0)->plaintext;
$item['details'] = $article->find('.entry p', 0)->plaintext;
echo "<strong>img url:</strong> " . $item['thumb'];
echo "</br>";
}
My Posts structure:
<article class="item-list item_1">
<h2 class="post-title">my demo post 1</h2>
<p class="post-meta">
<span class="tie-date">2 mins ago</span>
<span class="post-comments">
</span>
</p>
<div class="post-thumbnail">
<a href="http://localhost/mydemosite/category/sports/demo-post/" title="my demo post 1" rel="bookmark">
<img width="300" height="160" src="http://localhost/mydemosite/wp-content/uploads/demo-post-300x160.jpg" class="attachment-tie-large wp-post-image" alt="my demo post 1">
</a>
</div>
<!-- post-thumbnail /-->
<div class="entry">
<p>Hello world... this is a demo post description, so if you want to read more...</p>
<a class="more-link" href="http://localhost/mydemosite/category/sports/demo-post">Read More »</a>
</div>
<div class="clear"></div>
</article>

When you use .post-thumbnail you are getting the div element.
To get the src of the img element, use this:
$item['imgurl'] = $article->find('.post-thumbnail img', 0)->src;
I added img to the selector and read its src attribute directly into the variable.
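If Simple HTML DOM is not available, roughly the same lookup can be sketched with PHP's built-in DOM extension. This is a swapped-in technique, not the parser used in the question. The function name thumbnail_urls is hypothetical, and the markup is assumed to follow the post structure shown in the question.

```php
<?php
// Hypothetical helper: collect the src of the first <img> inside each
// article's .post-thumbnail block, using the built-in DOM extension.
function thumbnail_urls(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Collect parser warnings (e.g. for HTML5 tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath 1.0 idiom for "has the class post-thumbnail", relative to one article.
    $query = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' post-thumbnail ')]//img";

    $urls = [];
    foreach ($xpath->query('//article') as $article) {
        $img = $xpath->query($query, $article)->item(0);
        if ($img instanceof DOMElement && $img->hasAttribute('src')) {
            $urls[] = $img->getAttribute('src');
        }
    }
    return $urls;
}
```

Articles without a thumbnail image are simply skipped, so the result only contains URLs that were actually found.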


PHP replace image src and add a new attribute in image tag from a string containing different html tags [duplicate]

I have a site where I get product descriptions from the database, decode the HTML like this in PHP, and display it on the frontend:
$data['description'] = html_entity_decode($product_info['description'], ENT_QUOTES, 'UTF-8');
It returns html like the following:
<div class="container">
<div class="textleft">
<p>
<span style="font-size:medium">
<strong>Product Name:</strong>
</span>
<br />
<span style="font-size:14px">Some description here Click here to see full details.</span>
</p>
</div>
<div class="imageblock">
<a href="some-link">
<img src="http://myproject.com/image/catalog/image1.jpg" style="width: 500px; height: 150px;" />
</a>
</div>
<div style="clear:both">
</div>
<div class="container">
<div class="textleft">
<p>
<span style="font-size:medium">
<strong>Product Name:</strong>
</span>
<br />
<span style="font-size:14px">Some description here Click here to see full details.</span>
</p>
</div>
<div class="imageblock">
<a href="some-link">
<img src="http://myproject.com/image/catalog/image2.jpg" style="width: 500px; height: 150px;" />
</a>
</div>
<div style="clear:both">
</div>
There could be many images in the product description. I have added just 2 in my example. What I need to do is replace src of every image with src="image/catalog/blank.gif" for all images and add a new attribute
data-src="http://myproject.com/image/catalog/image1.jpg"
for image 1 and
data-src="http://myproject.com/image/catalog/image2.jpg"
for image 2. data-src attribute should get the original src value of each image. How can I achieve that?
I have tried preg_replace like following
$data['description'] = preg_replace('((\n)?src="\b.*?")', 'src="image/catalog/blank.gif', $data['description']);
It replaces the src attribute of every image, but how can I add data-src with the original image path? I need this before page load, so is there any way to do it with PHP?
Simply adjust your regular expression. Capture the text you want using (parentheses), then refer to that first group in the replacement as $1 or \1.
preg_replace('(src="(.*?)")', 'src="image/catalog/blank.gif" data-src="$1"', $data['description']);
Demo: https://repl.it/repls/SpottedZanyDiscussion
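Note that the pattern above rewrites every src="..." in the string, including ones on script or iframe tags. A slightly narrower sketch, assuming double-quoted src attributes, only touches img tags. The function name lazify_img_src is hypothetical.

```php
<?php
// Hypothetical helper: swap src for a placeholder on <img> tags only,
// keeping the original URL in data-src. Assumes double-quoted src values.
function lazify_img_src(string $html): string
{
    // $1 = "<img" plus anything before the src attribute, $2 = the original URL.
    // Requiring whitespace before "src" keeps an existing data-src from matching.
    $pattern = '/(<img\b[^>]*?\s)src="([^"]*)"/i';
    $replacement = '$1src="image/catalog/blank.gif" data-src="$2"';

    // preg_replace() returns null on a regex engine error; fall back to the input.
    return preg_replace($pattern, $replacement, $html) ?? $html;
}
```

Regular expressions remain fragile on arbitrary HTML (single-quoted or unquoted attributes are not handled here), so this is only reasonable when the markup is known to be regular.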
I think this might be what you are looking for:
http://php.net/manual/en/domdocument.getelementsbytagname.php
$data['description'] = html_entity_decode($product_info['description'], ENT_QUOTES, 'UTF-8');
$doc = new DOMDocument();
$doc->loadHTML($data['description']);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
$old_src = $tag->getAttribute('src');
$new_src_url = 'image/catalog/blank.gif';
$tag->setAttribute('src', $new_src_url);
$tag->setAttribute('data-src', $old_src);
}
$data['description'] = $doc->saveHTML();
I haven't tested this though, so don't just copy and paste.
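One thing to watch with the DOMDocument approach: loadHTML() wraps a fragment in html and body elements, and saveHTML() with no argument serializes that whole document. A minimal sketch that returns only the fragment could look like this. The function name lazify_images and the wrapper div are my own assumptions, and the input is assumed to be plain ASCII (non-ASCII text needs an explicit encoding hint for loadHTML()).

```php
<?php
// Hypothetical helper: DOM-based variant that returns only the fragment,
// not the <html>/<body> wrapper that loadHTML() adds.
function lazify_images(string $html, string $placeholder = 'image/catalog/blank.gif'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a single <div> so it has exactly one root element.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        // Keep the original URL in data-src, then point src at the placeholder.
        $img->setAttribute('data-src', $img->getAttribute('src'));
        $img->setAttribute('src', $placeholder);
    }

    // Serialize the children of the wrapper <div>, skipping the wrapper itself.
    $root = $doc->getElementsByTagName('body')->item(0)->firstChild;
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Usage would then be something like $data['description'] = lazify_images($data['description']);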

How to get data attribute value?

I have a url within a data-attribute and I need to get the first one:
<div class="carousel-cell">
<img onerror="this.parentNode.removeChild(this)"; class="carousel-cell-image" data-flickity-lazyload="http://esportareinsvizzera.com/site/wp-content/uploads/8.jpg">
</div>
<div class="carousel-cell">
<img onerror="this.parentNode.removeChild(this);" class="carousel-cell-image" data-flickity-lazyload="http://www.finanziamentiprestitimutui.com/wp-content/uploads/2014/09/esportazioni-finanziamento-credito.jpg">
</div>
<div class="carousel-cell">
<img onerror="this.parentNode.removeChild(this);" class="carousel-cell-image" data-flickity-lazyload="http://www.infologis.biz/wp-content/uploads/2013/09/Export.jpg">
</div>
<div class="carousel-cell">
<img onerror="this.parentNode.removeChild(this);" class="carousel-cell-image" data-flickity-lazyload="http://www.cigarettespedia.com/images/2/25/Esportazione_horizontal_name_ks_20_s_green_italy.jpg">
</div>
I have been reading lots of answers like this one and this one but I am not a php guy.
I was using this to get the first img but now I need the actual data attribute value instead
<?php
$custom_image = usp_get_meta(false, 'usp-custom-4');
$custom_image = htmlspecialchars_decode($custom_image);
$custom_image = nl2br($custom_image);
$custom_image = preg_replace('/<br \/>/iU', '', $custom_image);
preg_match('/<img.+src=[\'"](?P<src>.+?)[\'"].*>/i',$custom_image, $image);
?>
<img src="<?php echo $image['src']; ?>" alt="<?php the_title(); ?>">
Use DOMDocument to parse the HTML, get the elements corresponding to img tags and get the data-flickity-lazyload attribute of the first img tag:
...
$DOM = new DOMDocument;
$DOM->loadHTML($custom_image);
$items = $DOM->getElementsByTagName('img');
$mySrc = $items->item(0)->getAttribute('data-flickity-lazyload');
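The snippet above fails with an error if the markup contains no img tag, because item(0) returns null. A slightly more defensive sketch, with a hypothetical function name, returns null in that case and skips images that lack the attribute:

```php
<?php
// Hypothetical helper: return the data-flickity-lazyload value of the first
// <img> that has one, or null when there is no such image.
function first_lazyload_url(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input on recent PHP versions
    }

    $doc = new DOMDocument();
    // Collect parser warnings for sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('data-flickity-lazyload')) {
            return $img->getAttribute('data-flickity-lazyload');
        }
    }
    return null;
}
```

In the template, the result would replace $image['src'], ideally passed through htmlspecialchars() before being echoed into the src attribute.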

xpath not returning text if p tag is followed by any other tag

I want to get all the text inside the <p> and <h3> tags of the following HTML
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want all of the "I want this TEXT" strings. I used this XPath query
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag
It looks like you have div elements inside your p elements, which is not valid HTML and is what breaks the query. If you use a var_dump in the loop you can see that it does actually pick up the node, but the nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[class=bodyText]") as $parent) {
foreach($parent->children() as $child) {
if ($child->tag == 'p' || $child->tag == 'h3') {
// remove the inner text of divs contained within this p element
foreach($child->find('div') as $e)
$e->innertext = '';
echo $child->plaintext . '<br>';
}
}
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a path location is not allowed in your processor, then you can use your syntax as before or, a little simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
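Both expressions above rely on XPath 2.0 constructs (a parenthesized union as a path step, and a sequence comparison), while PHP's DOMXPath only implements XPath 1.0. A minimal XPath 1.0 sketch of the same idea follows. The function name is hypothetical, and the sample it is meant for uses inline elements inside the paragraphs, since block elements inside a p may be relocated by the HTML parser before the query ever runs.

```php
<?php
// Hypothetical helper: return the trimmed text nodes that are direct children
// of <p> or <h3> elements inside div.bodyText, using XPath 1.0 only.
function body_text_chunks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // self::p or self::h3 replaces the XPath 2.0 union step;
    // [normalize-space()] drops whitespace-only text nodes.
    $query = "//div[contains(@class, 'bodyText')]/*[self::p or self::h3]/text()[normalize-space()]";

    $chunks = [];
    foreach ($xpath->query($query) as $text) {
        $chunks[] = trim($text->nodeValue);
    }
    return $chunks;
}
```

Because only direct text children are selected, text inside nested spans, captions, or links is left out.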

php remove tags before a specified tag

I want to remove all image tags that appear before the headline starts, but they are not always nested the same way. After that, I want to remove the tags left empty.
<div class="c2">
<img src="image/file" width="480" height="360" alt="Image" />
</div>
<div class="c2">
<div class="headline">
headline
</div>
<div class="headline">
headline2
</div>
</div>
and different nested tags like
<div class="c2">
<p>
<img src="image/A.JPG" width="480" height="319" alt="Image" />
</p>
<div class="headline">
A headline
</div>
</div>
I think this could be solved recursively, but I don't know how.
Thanks for your help!
EDIT: if you want to remove only <img> followed by <div><div class="headline"> or <div class="headline">, use this xpath:
$imgs = $xpath->query("//img[../following-sibling::div[1]/div/@class='headline' or ../following-sibling::div[1]/@class='headline']");
see it working: http://codepad.viper-7.com/QhprLP
Do it like this:
$doc = new DOMDocument();
$doc->loadHTML($x); // assuming HTML in $x
$xpath = new DOMXpath($doc);
$imgs = $xpath->query("//img"); // select all <img> nodes
foreach ($imgs as $img) { // loop through list of all <img> nodes
$parent = $img->parentNode;
$parent->removeChild($img); // delete <img> node
if (trim($parent->textContent) === '' && $parent->getElementsByTagName('*')->length === 0) // if parent node of <img> is now empty, delete it
$parent->parentNode->removeChild($parent);
}
echo htmlentities($doc->saveHTML()); // display the new HTML
see it working: http://codepad.viper-7.com/350Hw6
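For the "recursive" part of the question, here is a rough sketch that removes every img that comes before some div.headline in document order and then walks upwards, deleting each ancestor that has become empty. The function name and the __fragment_root wrapper id are hypothetical, and the input is assumed to be an HTML fragment like the ones in the question.

```php
<?php
// Hypothetical helper: drop <img> tags that precede a div.headline and
// prune any wrapper elements that are left empty afterwards.
function remove_images_before_headline(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Hypothetical wrapper id, used to find the fragment again after parsing.
    $doc->loadHTML('<div id="__fragment_root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // following:: matches nodes after the <img> in document order.
    $query = "//img[following::div[contains(concat(' ', normalize-space(@class), ' '), ' headline ')]]";

    foreach (iterator_to_array($xpath->query($query)) as $img) {
        $node = $img;
        do {
            // Remove the node, then check whether its parent is now empty too.
            $parent = $node->parentNode;
            $parent->removeChild($node);
            $node = $parent;
        } while ($node instanceof DOMElement
            && $node->getAttribute('id') !== '__fragment_root'
            && trim($node->textContent) === ''
            && $node->getElementsByTagName('*')->length === 0);
    }

    // Serialize only the children of the wrapper.
    $root = $xpath->query("//div[@id='__fragment_root']")->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

An element counts as empty here when it has no element children and only whitespace text, which is my reading of "remove the empty tags" in the question.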

how do I get sets of data with xpath

The code below retrieves a series of images from the search results of a site, along with the corresponding age data. It works fine; however, I get a list of images followed by a list of the information in the age field.
img img img img age age age age and so on.
How do I combine these so I can display them in sets: img age img age img age
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
$image = $tag->getAttribute('src');
echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img source and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>
It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
you need to find each div and then find the image inside each div. getElementsByTagName() will give you a list even if there's only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
$tags = $node->getElementsByTagName('img');
$image = $tags->item(0)->getAttribute('src');
echo '<img src="'. $image .'" alt="image" ><br>';
echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
$node->nextSibling // a property, not a method; it may be a whitespace text node
As a general point, trace through the HTML and think about how to get from A to B: go forwards, backwards, up to the parent, across to the next node and down again?
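For the sample page posted in the question, the image and the age block are both inside the same div.search-result, so one way to pair them is to loop over the result blocks and run relative XPath queries inside each one. A minimal sketch, with a hypothetical function name, assuming each result contains at most one thumbnail and one div.age:

```php
<?php
// Hypothetical helper: pair each result's thumbnail src with the title
// attribute of its div.age, one row per div.search-result.
function extract_results(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = $xpath->query("//div[contains(concat(' ', normalize-space(@class), ' '), ' search-result ')]");

    $rows = [];
    foreach ($results as $result) {
        // The second argument makes each query relative to this result block.
        $img = $xpath->query('.//img', $result)->item(0);
        $age = $xpath->query(".//div[@class='age']", $result)->item(0);
        $rows[] = [
            'src' => $img instanceof DOMElement ? $img->getAttribute('src') : null,
            'age' => $age instanceof DOMElement ? $age->getAttribute('title') : null,
        ];
    }
    return $rows;
}
```

Each row can then be echoed as one img / age pair, with both values passed through htmlspecialchars() first.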
