Get background image from webpage using DOM XPATH - php

I'm reading a webpage using PHP DOM/XPath and I've managed to get the text I need, but I can't get the src of the main image.
To complicate things, the page source differs from what the inspector shows.
Here is the source:
<div id="bg">
<img src="https://example.com/image.jpg" alt=""/>
</div>
And here is the element in the inspector:
<div class="media-player" id="media-player-0" style="width: 320px; height: 320px; background: url("https://example.com/image.jpg") center center / cover no-repeat rgb(208, 208, 208);" currentmouseover="16">
I've tried:
$img = $xpath->evaluate('substring-before(substring-after(//div[@id=\'bg\']/img, "\')")');
and
$img = $xpath->evaluate('substring-before(substring-after(//div[@class=\'media-player\']/@style, "background: url(\'"), "\')")');
but get nothing from either.
Here is my complete code:
$html = file_get_contents($externalurl);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$allChildNodesFromDiv = $xpath->query('//h1[@class="artist"]');
$releasetitle = $allChildNodesFromDiv->item(0)->textContent;
echo "</br>Title: " . $releasetitle;
$img = $xpath->evaluate('substring-before(substring-after(//div[@class=\'media-player\']/@style, "background: url(\'"), "\')")');
echo $image;
$img = $xpath->evaluate('substring-before(substring-after(//div[@id=\'bg\']/img, "\')")');
echo $image;

This is not something I would normally suggest, but the content you are after is loaded by JavaScript, and the value is present in the page's <script> tags, so it may be easy to extract with a regex. From your comment...
Ah yes, it appears in: poster :
'https://284fc2d5f6f33a52cd9f-ce476c3c56a27f320262daffab84f1af.ssl.cf3.rackcdn.com/artwork_5e74a44e1e004_CHAMPDL879D_5e74a44e4672b.jpg'
So this code looks for the value of poster : '...',.
$html = file_get_contents($externalurl);
// Capture everything between the single quotes after "poster : "
if (preg_match("/poster : '([^']*)',/", $html, $matches)) {
    echo $matches[1];
}
This is fragile if the page's HTML changes, but it may work for now.
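Because file_get_contents returns the raw page source rather than the DOM the inspector shows, an XPath query has to target the markup as it appears in the source. A minimal sketch, assuming the fetched HTML really does contain the <div id="bg"> block quoted in the question (the $html string below is a stand-in for the fetched page, not the real site):

```php
<?php
// Stand-in for the fetched page: the raw source quoted in the question.
$html = '<div id="bg"><img src="https://example.com/image.jpg" alt=""/></div>';

// Collect libxml warnings instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string() yields the first matching attribute value, or '' if nothing matches.
$src = $xpath->evaluate('string(//div[@id="bg"]/img/@src)');
echo $src, PHP_EOL;
```

If the image URL only ever appears inside a script block, this query returns an empty string and the regex approach above is the fallback.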

Related

preg_replace img src, width, height stack overflow

This is my code:
$content = '<p><img src="http://localhost/contents/uploads/sdadaasa.jpg" width="1500" height="900"></p>';
$content = preg_replace('/<p><img.+src=[\'"]([^\'"]+)[\'"].*>/i', "<p class=\"the-image\"><img class=\"lazy-load\" src=\"$1\" width=\"\" height=\"\"/></p>", $content);
return $content;
My code adds a class to the <p> tag and the <img> tag.
Now I also want to keep the width and height from $content, because my code removes the width and height attributes.
To parse HTML, it's better to use a library such as DOMDocument:
$content = '<p><img src="http://localhost/contents/uploads/sdadaasa.jpg" width="1500" height="900"></p>';
$dom = new DomDocument();
$dom->loadHTML($content);
$p = $dom->getElementsByTagName('p')->item(0);
$p->setAttribute('class', 'the-image');
$img = $p->getElementsByTagName('img')->item(0);
$img->setAttribute('class', 'lazy-load');
echo $dom->saveHTML($p);
// <p class="the-image"><img src="http://localhost/contents/uploads/sdadaasa.jpg" width="1500" height="900" class="lazy-load"></p>
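The question also asks how to read the width and height. With DOMDocument they stay on the element, so they can be read with getAttribute. A short sketch using the same $content string; the $width and $height variable names are only illustrative:

```php
<?php
$content = '<p><img src="http://localhost/contents/uploads/sdadaasa.jpg" width="1500" height="900"></p>';

$dom = new DOMDocument();
$dom->loadHTML($content);

$img = $dom->getElementsByTagName('img')->item(0);
// getAttribute() returns an empty string when the attribute is absent.
$width  = $img->getAttribute('width');
$height = $img->getAttribute('height');

echo $width, 'x', $height, PHP_EOL;
```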

Replace empty alt tag on <img> tag

I would like to replace my empty alt tags on images in a string. I have a string that contains all the text for a certain page. The text also contains images, and a lot of them have empty alt tags (old data), but most of the time they do have title tags.
For example:
<img src="assets/img/test.png" alt="" title="I'am a title tag" width="100" height="100" />
What I wish to have:
<img src="assets/img/test.png" alt="" title="I'am a title tag" alt="I'am a title tag" width="100" height="100" />
So:
I need to find all the images in my string, loop through the images, find title tags, find alt tags, and replace the empty alt tags with the title tags that do have a value.
This is what I tried:
preg_match_all('/<img[^>]+>/i',$return, $text);
if(isset($text)) {
foreach( $text as $itemImg ) {
foreach( $itemImg as $item ) {
$array = array();
preg_match( '/title="([^"]*)"/i', $item, $array );
if(isset($array[1])) {
//So $array[1] is a title tag, now what?
}
}
}
}
I don't know how to complete the code, and I think there must be an easier fix for this. Suggestions?
Using regex is not a good approach; you should use DOMDocument for parsing HTML. Here we query for the elements whose alt attribute is empty, which is what the question asks for.
Try this code snippet:
<?php
ini_set('display_errors', 1);
$string=<<<HTML
<img src="assets/img/test1.png" alt="" title="I'am a title tag" width="100" height="100" />
<img src="assets/img/test2.png" alt="" title="I'am a title tag" width="100" height="100" />
<img src="assets/img/test3.png" alt="" title="I'am a title tag" width="100" height="100" />
HTML;
$domDocument = new DOMDocument();
$domDocument->loadHTML($string,LIBXML_HTML_NODEFDTD);
$domXPath = new DOMXPath($domDocument);
$results = $domXPath->query('//img[@alt=""]');
foreach($results as $result)
{
$title=$result->getAttribute("title");
$result->setAttribute("alt",$title);
echo $domDocument->saveHTML($result);
echo PHP_EOL;
}
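The question only wants titles that actually have a value, so the XPath predicate can also require a non-empty title. A sketch under that assumption; the second image (with no title) and the $updated array are added here purely for illustration:

```php
<?php
$string = '<img src="assets/img/test1.png" alt="" title="I\'am a title tag" width="100" height="100" />'
        . '<img src="assets/img/test2.png" alt="" width="100" height="100" />';

$domDocument = new DOMDocument();
$domDocument->loadHTML($string, LIBXML_HTML_NODEFDTD);
$domXPath = new DOMXPath($domDocument);

// Match images whose alt is missing or empty and whose title contains text.
$results = $domXPath->query('//img[not(@alt) or @alt=""][normalize-space(@title) != ""]');

$updated = [];
foreach ($results as $result) {
    $result->setAttribute('alt', $result->getAttribute('title'));
    $updated[] = $result->getAttribute('src');
}
print_r($updated);
```

Images without a usable title keep their empty alt instead of being touched.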
Maybe you could use JavaScript with jQuery for this kind of thing, like:
$('img').each(function(){
$(this).attr('alt', $(this).attr('title'));
});
hope it helps
Regards.
What you want here is an HTML parser library that can manipulate HTML and then save it again. By using regular expressions to modify HTML markup, you're setting yourself up for a mess.
The DOM module built into PHP offers this functionality: http://php.net/manual/en/book.dom.php
Here's an example (cribbed from this article):
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$image->setAttribute('src', 'http://example.com/' . $image->getAttribute('src'));
}
$html = $dom->saveHTML();
You can use DOMDocument to achieve this. Below is some sample code for your reference:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
?>
Please try it below:
function img_title_in_alt($full_img_tag){
$doc = new DOMDocument();
$doc->loadHTML($full_img_tag);
$imageTags = $doc->getElementsByTagName('img');
foreach($imageTags as $tag) {
if($tag->getAttribute('src')!==''){
return '<img src="'.$tag->getAttribute('src').'" width="'.$tag->getAttribute('width').'" height="'.$tag->getAttribute('height').'" alt="'.$tag->getAttribute('title').'" title="'.$tag->getAttribute('title').'" />';
}
}
}
Now call the function with the full HTML of your image tag. See the example:
$image = '<img src="assets/img/test.png" alt="" title="I\'am a title tag" width="100" height="100" />';
print img_title_in_alt($image);
Let me know if you do not understand anything.

getting img src within a link

I am having trouble retrieving the src of an image that is part of a link. For example, with this I would like to retrieve the src of the img between the <a> tags.
<img src="http://example.com/picture1234.jpg" id="pic_1234" />
I will need to do this for a couple of the links on the page that are all laid out the same. So what I tried so far is this:
$dom = new DOMDocument;
@$dom->loadHTML($html);
$i = 0;
$links = $dom->getElementsByTagName('a');
//Get images
foreach ($links as $link){
$test = $link->getAttribute('href');
if (strpos($test,'/video') !== false) {
$XV_IMG[$i] = $link->nodeValue;
$i++;
}
}
If the link contains plain text instead of only an img tag, this works just fine. Is there any way to get the src?
Just keep using getElementsByTagName on the node, like this:
foreach ($link->getElementsByTagName('img') as $img) {
$XV_IMG[] = $img->getAttribute('src');
}
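Put together with the href filter from the question, the nested lookup might look like the sketch below. The markup is hypothetical: the href values are assumptions for illustration, since the original link markup is not shown.

```php
<?php
// Hypothetical markup: the href values are assumptions for illustration.
$html = '<a href="/video/1234"><img src="http://example.com/picture1234.jpg" id="pic_1234" /></a>'
      . '<a href="/about">About</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$XV_IMG = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // Skip links that do not point at a /video URL.
    if (strpos($link->getAttribute('href'), '/video') === false) {
        continue;
    }
    // Collect the src of every image nested inside this link.
    foreach ($link->getElementsByTagName('img') as $img) {
        $XV_IMG[] = $img->getAttribute('src');
    }
}
print_r($XV_IMG);
```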
Try using preg_match_all:
$html= '<img src="http://example.com/picture1234.jpg" id="pic_1234" />
<img src="http://example.com/picture1224.jpg" id="pic_1224" />
<img src="http://example.com/picture1434.jpg" id="pic_1434" />
<img src="http://example.com/picture1554.jpg" id="pic_1554" />
<img src="http://example.com/picture1334.jpg" id="pic_1334" />';
preg_match_all('/<a href="(.*)"><img src="(.*)" id="pic_[0-9]{1,7}" \/><\/a>/i',$html,$out);
unset($out[0]);
unset($out[1]);
print_r($out);

php: replace every img src in some condition

I am trying to replace every img src that does not contain a full URL with the full image URL.
For example:
<?php
$html_str = "<html>
<body>
Hi, this is the first image
<img src='image/example.jpg' />
this is the second image
<img src='http://sciencelakes.com/data_images/out/14/8812836-green-light-abstract.jpg' />
and this is the last image
<img src='image/last.png' />
</body>
</html>";
?>
and after the replacement it becomes this:
<?php
$html_str = "<html>
<body>
Hi, this is the first image
<img src='http://example.com/image/example.jpg' />
this is the second image
<img src='http://sciencelakes.com/data_images/out/14/8812836-green-light-abstract.jpg' />
and this is the last image
<img src='http://example.com/image/last.png' />
</body>
</html>";
?>
So how can I check every img src that is not a full link and replace it? ($html_str is dynamic, loaded from MySQL.)
Please give me a solution for this problem.
Thanks.
I'd do it properly using a DOM library, e.g.
$doc = new DOMDocument();
$doc->loadHTML($html_str);
$xp = new DOMXPath($doc);
$images = $xp->query('//img[not(starts-with(@src, "http:") or starts-with(@src, "https:") or starts-with(@src, "data:"))]');
foreach ($images as $img) {
$img->setAttribute('src',
'http://example.com/' . ltrim($img->getAttribute('src'), '/'));
}
$html = $doc->saveHTML($doc->documentElement);
Demo here - http://ideone.com/4K9pyD
Try this:
You can get the image source using the following code:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate("string(//img/@src)");
After that, check whether the string contains http and act accordingly.
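The check described above could be sketched as a small helper. The function name absolutize_src and the default base URL are hypothetical, chosen only for illustration:

```php
<?php
// Hypothetical helper name and base URL, chosen for illustration only.
function absolutize_src(string $src, string $base = 'http://example.com/'): string
{
    // Leave sources alone when they already carry an http(s) scheme.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    // Otherwise prefix the base URL, avoiding a doubled slash.
    return $base . ltrim($src, '/');
}

echo absolutize_src('image/example.jpg'), PHP_EOL;
echo absolutize_src('http://sciencelakes.com/data_images/out/14/8812836-green-light-abstract.jpg'), PHP_EOL;
```

Testing for the scheme at the start of the string, rather than for "http" anywhere in it, avoids misclassifying a relative path that merely contains those letters.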

How to use preg in php to add html properties

I'm looking to write a script in PHP that scans an HTML document and adds new markup to an element based on what it finds. More specifically, I want it to scan the document and, for every element, search for the CSS "float: right/left"; if it finds it, it should add align="right/left" (based on what it finds).
Example:
<img alt="steve" src="../this/that" style="height: 12px; width: 14px; float: right"/>
becomes
<img alt="steve" src="../this/that" align="right" style="height: 12px; width: 14px; float: right"/>
$dom = new DOMDocument();
$dom->loadHTML($htmlstring);
$x = new DOMXPath($dom);
foreach($x->query("//img[contains(@style,'float: right')]") as $node) $node->setAttribute('align','right');
foreach($x->query("//img[contains(@style,'float: left')]") as $node) $node->setAttribute('align','left');
edit:
When the amount of space between 'float:' and 'right' is not certain, there are several options:
Use the XPath 1.0: //img[starts-with(normalize-space(substring-after(@style,'float:')),'right')]
Just do a simple check for float like //img[contains(@style,'float:')], and check with $node->getAttribute() what actually comes afterwards.
Bring preg_match into the equation (which was just recently pointed out to me (thanks Gordon), but in this case it is imho the least favorite solution):
$dom = new DOMDocument();
$dom->loadHTML($htmlstring);
$x = new DOMXPath($dom);
$x->registerNamespace("php", "http://php.net/xpath");
$x->registerPHPFunctions('preg_match');
foreach($x->query("//img[php:functionString('preg_match','/float\s*:\s*right/',@style)]") as $node) $node->setAttribute('align','right');
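The first option in the list above (pure XPath 1.0 with normalize-space) can be tried out as follows. This is a sketch on a sample string with extra spaces after "float:"; the loop over both sides is an illustrative addition:

```php
<?php
$htmlstring = '<img alt="steve" src="../this/that" style="height: 12px; width: 14px; float:   right"/>';

$dom = new DOMDocument();
$dom->loadHTML($htmlstring);
$x = new DOMXPath($dom);

foreach (['left', 'right'] as $side) {
    // Take the text after "float:", trim its whitespace, and test how it starts.
    $query = "//img[starts-with(normalize-space(substring-after(@style,'float:')),'$side')]";
    foreach ($x->query($query) as $node) {
        $node->setAttribute('align', $side);
    }
}

$img = $dom->getElementsByTagName('img')->item(0);
echo $dom->saveHTML($img), PHP_EOL;
```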
Please please, don't use a regexp to parse HTML.
Use simple_html_dom instead.
$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("[style=float: left],[style=float: right]") as $fragment)
{
if ($fragment[0]->style == 'float:left')
{
$fragment[0]->align='left';
$fragment[0]->style = '';
}
...
}
echo $dom;
