Retrieve all image paths from a string with PHP

How do I get all image paths from a string?
Note that I only want the paths containing the word "media".
For example, given this string (part of the DOM):
<div class="my-class">
<img src="http://my-website.com/cache/media/2017/10/img67.jpeg" class="" alt="test" width="120" height="100">
<img src="http://my-website.com/cache/2017/10/img68.png" class="" alt="test" width="120" height="100">
<img src="http://my-website.com/cache/media/2017/10/img69.jpg" class="" alt="test" width="120" height="100">
<h2 class="uk-margin-top-remove">About us</h2>
</div>
I want an array containing a result like this:
array(
[0] => "http://my-website.com/cache/media/2017/10/img67.jpeg"
[1] => "http://my-website.com/cache/media/2017/10/img69.jpg"
);
I don't want the second img because its src attribute doesn't contain the word "media".

You could use preg_match_all() to get URLs but it is even better to use a DOM reader.
$str = '<div class="my-class">
<img src="http://my-website.com/cache/media/2017/10/img67.jpeg" class="" alt="test" width="120" height="100">
<img src="http://my-website.com/cache/2017/10/img68.png" class="" alt="test" width="120" height="100">
<img src="http://my-website.com/cache/media/2017/10/img69.jpg" class="" alt="test" width="120" height="100">
<h2 class="uk-margin-top-remove">About us</h2>
</div>' ;
$matches = [] ;
preg_match_all('~(http\://my-website\.com/cache/media/(.*?))"~i', $str, $matches) ;
var_dump($matches[1]);
This returns:
array(2) {
[0]=>
string(52) "http://my-website.com/cache/media/2017/10/img67.jpeg"
[1]=>
string(51) "http://my-website.com/cache/media/2017/10/img69.jpg"
}

Some boilerplate code to get you started:
<?php
$data = <<<DATA
<div class="my-class">
<img src="http://my-website.com/cache/media/2017/10/img67.jpeg" class="" alt="test" width="120" height="100">
<img src="http://my-website.com/cache/2017/10/img68.png" class="" alt="test" width="120" height="100">
<img src="http://my-website.com/cache/media/2017/10/img69.jpg" class="" alt="test" width="120" height="100">
<h2 class="uk-margin-top-remove">About us</h2>
</div>
DATA;
# set up the dom
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);# | LIBXML_COMPACT | LIBXML_NOENT );
# set up the xpath
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//img[contains(@src, '/media/')]/@src") as $image) {
echo $image->nodeValue . "\n";
}
Which yields
http://my-website.com/cache/media/2017/10/img67.jpeg
http://my-website.com/cache/media/2017/10/img69.jpg
This loads the DOM and runs an XPath query that selects the src attribute of every image whose src contains /media/; the loop then prints each match.
If for some reason (why?) you are unable to use a DOM parser, you could use the second-best option, a regular expression:
<img
(?s:(?!>).)+?
src=(['"])
(?P<src>(?:(?!\1).)+?/media/.*?\1)
Use the src group (the pattern is laid out over several lines, so it needs the x flag); see a demo on regex101.com.
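Since the question asks for an array rather than printed lines, here is a minimal sketch of the DOM approach wrapped in a function. The function name mediaImagePaths() and its $needle parameter are made up for illustration, and the sample markup is a shortened copy of the one above.

```php
<?php
// Hypothetical helper: returns the src of every <img> whose src contains $needle.
function mediaImagePaths(string $html, string $needle = '/media/'): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paths = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (strpos($src, $needle) !== false) {
            $paths[] = $src;
        }
    }
    return $paths;
}

$html = '<div class="my-class">
<img src="http://my-website.com/cache/media/2017/10/img67.jpeg" alt="test">
<img src="http://my-website.com/cache/2017/10/img68.png" alt="test">
<img src="http://my-website.com/cache/media/2017/10/img69.jpg" alt="test">
</div>';

$paths = mediaImagePaths($html);
print_r($paths);
```

Filtering in PHP with strpos() instead of in the XPath expression keeps the needle out of the query string, so it does not need any quoting.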

Related

php DOMDocument preg_replace fail detect

Basically, I want to replace keyword tags in my content with hyperlinks wherever a matching keyword is detected.
The replacement must happen outside of any existing caption/image/figure/figcaption/iframe/a, because putting a hyperlink inside these breaks the formatting.
My PHP:
$html_content= '税务调查。
[caption id="attachment_111" align="aligncenter" width="100"]<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登与儿子。" width="100" height="100" class="size-full wp-image" /> 拜登与儿子。[/caption]
他在声明中说:“我会非常认真地调查,往来。”
<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登总统" width="100" height="100" class="aligncenter size-full wp-image" />
<div style="position:relative; overflow:hidden"> <iframe src="https://cdn.google.com/players/VM.html" width="100" height="100" frameborder="0" scrolling="auto" title="大促销 拜登的美国" style="position:absolute;"></iframe> </div>
<iframe style="border: none; overflow: hidden;" src="https://www.facebook.com/plugins/video.php?height=100&href=https%3A%2F%2Fwww.facebook.com;width=100&t=0" width="100" height="100" frameborder="0" allowfullscreen="allowfullscreen"></iframe>
<iframe src="https://www.facebook.com/plugins/video.php?height=400&href=https%3A%2F%2&show_text=false&width=100&t=0" width="100" height="100" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowfullscreen="true" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture; web-share" allowFullScreen="true"></iframe>
<b>更多热点</b>
<p>halo拜登也指美国经济不会衰退</p>
<figure id="attachment_279" style="width: 100px" class="wp-caption alignnone"><img class="size-full wp-imag" src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="修理厂商总会拜登城" width="100" height="100" /><figcaption class="wp-caption-text">修理厂商总会拜登城</figcaption></figure>
go to google
<span style="color: #ff6600;"><strong>另外,拜登声明中说</strong></span>';
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
//echo $node->parentNode->nodeName . "<Br>";
if ($node instanceof DOMText && !in_array($node->parentNode->nodeName, $excludeParents)) {
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
$dom = new DOMDocument;
$internalErrors = libxml_use_internal_errors(true);
$dom->loadHTML( mb_convert_encoding($html_content, 'HTML-ENTITIES', "UTF-8"), LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED );
$tags = array("拜登","认真");
foreach($tags as $tag){
$tagurl= '<span class="article-tag"><a class="mytag" href="http://outside.com" >'.$tag.'</a></span>';
preg_replace_dom('/'.$tag.'/i', $tagurl, $dom->documentElement, array('a','image','iframe','figure','figcaption','caption'));
$test_tag = '['.$tag.']';
//preg_replace_dom('/'.$tag.'/i', $test_tag, $dom->documentElement, array('a','image','iframe','figure','figcaption','caption'));
}
function getLink($tag){
$arr = array(
"拜登"=>"http://bai.com",
"认真"=>"http://ren.com",
);
return $arr[$tag];
}
$output = mb_substr($dom->saveHTML(), 0, null, "UTF-8");
//echo $output;
echo html_entity_decode($output);
Now I am facing 2 issues.
I want to exclude [caption id=...] ... [/caption] blocks from the hyperlink replacement,
but my regex fails to do that.
Currently it displays like this...
The DOMDocument loadHTML method adds extra paragraph tags at seemingly random places. I could post-process the output by removing ALL paragraph tags, but then the final content is no longer the original: some input content already contains paragraph tags, and those would be removed as well.
(solved) I want the replacement to show up as a clickable hyperlink in the browser,
but echo $output shows the raw hyperlink markup as plain text, which cannot be clicked.
Update on issue 2: the values saved into $node->nodeValue are escaped, which turns them into plain text. I added echo html_entity_decode($output); to unescape them, and it now displays correctly.
Desired output
$output= '税务调查。
[caption id="attachment_111" align="aligncenter" width="100"]<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登与儿子。" width="100" height="100" class="size-full wp-image" /> 拜登与儿子。[/caption]
他在声明中说:“我会非常<span class="article-tag"><a class="mytag" href="http://outside.com" >认真</a></span>地调查,往来。”
<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登总统" width="100" height="100" class="aligncenter size-full wp-image" />
<div style="position:relative; overflow:hidden"> <iframe src="https://cdn.google.com/players/VM.html" width="100" height="100" frameborder="0" scrolling="auto" title="大促销 拜登的美国" style="position:absolute;"></iframe> </div>
<iframe style="border: none; overflow: hidden;" src="https://www.facebook.com/plugins/video.php?height=100&href=https%3A%2F%2Fwww.facebook.com;width=100&t=0" width="100" height="100" frameborder="0" allowfullscreen="allowfullscreen"></iframe>
<iframe src="https://www.facebook.com/plugins/video.php?height=400&href=https%3A%2F%2&show_text=false&width=100&t=0" width="100" height="100" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowfullscreen="true" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture; web-share" allowFullScreen="true"></iframe>
<b>更多热点</b>
<p>halo<span class="article-tag"><a class="mytag" href="http://outside.com" >拜登</a></span>也指美国经济不会衰退</p>
<figure id="attachment_279" style="width: 100px" class="wp-caption alignnone"><img class="size-full wp-imag" src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="修理厂商总会拜登城" width="100" height="100" /><figcaption class="wp-caption-text">修理厂商总会拜登城</figcaption></figure>
go to google
<span style="color: #ff6600;"><strong>另外,<span class="article-tag"><a class="mytag" href="http://outside.com" >拜登</a></span>声明中说</strong></span>';
I tried very, VERY hard to implement a DOMDocument+Xpath solution, but I came unstuck while trying to disqualify the text node within the square-tagged caption block. I couldn't manage to isolate the whole caption block to be able to exclude it. In the end, here is a caveman's regex approach to serve as a band-aid until someone smarter can solve this problem properly.
The regex matches the blacklisted tags in the text and discards them; it only replaces text that is not disqualified.
Code: (Demo)
$tags = ["拜登", "认真"];
$blacklisted = implode(
'|',
array_map(
fn($tag) => "<{$tag}[ >].+?" . ($tag === 'img' ? "/>" : "</$tag>"),
['a', 'img', 'iframe', 'figure', 'figcaption']
)
);
echo preg_replace(
sprintf('~(?:\[caption[ \]].+?\[/caption]|%s)(*SKIP)(*FAIL)|%s~us', $blacklisted, implode('|', $tags)),
'<span class="article-tag"><a class="mytag" href="http://outside.com">$0</a></span>',
$html_content // the input string from the question
);
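To see the (*SKIP)(*FAIL) idea in isolation, here is a reduced sketch. The sample text and the keyword "cat" are invented for the illustration; only the pattern shape is taken from the code above.

```php
<?php
// Hypothetical sample: the keyword "cat" appears inside and outside an <a> tag.
$text = 'cat <a href="#">cat</a> cat';

// Left branch: match a whole <a>...</a> block, then (*SKIP)(*FAIL) throws that
// match away and resumes scanning after it.
// Right branch: the keyword, which can therefore only match outside such blocks.
$pattern = '~<a[ >].+?</a>(*SKIP)(*FAIL)|cat~us';

$result = preg_replace($pattern, '[$0]', $text);
echo $result, "\n"; // prints: [cat] <a href="#">cat</a> [cat]
```

Every blacklisted construct is added as another alternative on the left-hand side, which is what the implode('|', ...) call does in the full version.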

Get parts of strings with PHP

I'm trying to extract parts of a string stored in a variable with PHP, based on a specific text condition: class="image-GET_THIS_NUMBER", and store them in an array. The variable is something like:
$content = '<p>
<img src="#" alt="" class="image-1">
<img src="#" alt="" class="image-96">
<img src="#" alt="" class="image-12231">
<img src="#" alt="" class="image-444312">
</p>';
And I need to get this:
$images = array(1, 96, 12231, 444312);
I don't really know if it's possible. I hope you can help me.
You can do it with a regex.
Here you are:
$content = '<p>
<img src="#" alt="" class="image-1">
<img src="#" alt="" class="image-96">
<img src="#" alt="" class="image-12231">
<img src="#" alt="" class="image-444312">
</p>';
preg_match_all("/class=\"image-([0-9]+)\"/is", $content, $matches);
$images = $matches[1];
To find all matches in your html string, a regex search will be the ticket.
preg_match_all('/class="image-(\d+)/', $content, $matches);
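Note that preg_match_all() captures the numbers as strings, while the question asks for array(1, 96, 12231, 444312). A small sketch of the extra step, assuming integer values are really what is wanted:

```php
<?php
$content = '<p>
<img src="#" alt="" class="image-1">
<img src="#" alt="" class="image-96">
<img src="#" alt="" class="image-12231">
<img src="#" alt="" class="image-444312">
</p>';

preg_match_all('/class="image-(\d+)"/', $content, $matches);

// $matches[1] holds the captured digits as strings; cast each one to an integer.
$images = array_map('intval', $matches[1]);
var_dump($images);
```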

Make all img tags' src attributes contain absolute paths

I am trying to get/replace the image source links in a page.
Some pages use a relative path such as src='image/abc.png', so my regex fails.
What I want to do is prepend the main URL to the path when an absolute path is not given,
i.e. if src='image/abc.png' and the main URL is http://example.com,
then it should be transformed to http://example.com/image/abc.png
Note: some users may enter the URL with a trailing slash, like http://example.com/, so if I append as above it will give
http://example.com//image/abc.png, which is wrong.
Can someone point me in the right direction for building the correct absolute path of an image?
My code:
<?php
function get_logo($html, $url) {
if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
echo "First:";
return $matches[0][0];
} else {
if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))b~im', $html, $matches)) {
echo "Second: ";
echo $matches[0][0];
return url_to_absolute($url, $matches[0][0]);
//return $matches[0][0];
} else
return null;
}
}
Definitely don't use regex for this task. Using a combination of DOMDocument and XPath will make quick work of this task and the syntax is rather intuitive. If the src attribute of any <img> tag does not start with your pre-declared domain, then trim any forward slashes from the front of the src value and prepend the domain to form the absolute path.
Code: (Demo)
$html = <<<HTML
<div>
<img src="image/abc.png" alt="test" width="50" height="50">
<img src="http://example.com/image/abc.png" alt="test" width="50" height="50">
<img src="/image/abc.png" alt="test" width="50" height="50">
<iframe src="image/abc.png" alt="test" width="50" height="50"></iframe>
</div>
HTML;
$base = "http://example.com/";
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//img[not(starts-with(@src, '$base'))]") as $node) {
$node->setAttribute('src', $base . ltrim($node->getAttribute('src'), '/'));
}
echo $dom->saveHTML();
Output:
<div>
<img src="http://example.com/image/abc.png" alt="test" width="50" height="50">
<img src="http://example.com/image/abc.png" alt="test" width="50" height="50">
<img src="http://example.com/image/abc.png" alt="test" width="50" height="50">
<iframe src="image/abc.png" alt="test" width="50" height="50"></iframe>
</div>
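The double-slash worry from the question (http://example.com versus http://example.com/) can be handled by trimming both sides before joining. A minimal sketch follows; joinUrl() is a hypothetical helper name, and it deliberately ignores harder cases such as ../ segments or protocol-relative //host paths.

```php
<?php
// Hypothetical helper: join a base URL and a path with exactly one slash between them.
function joinUrl(string $base, string $path): string
{
    // Leave values that already carry an http:// or https:// scheme untouched.
    if (preg_match('~^https?://~i', $path)) {
        return $path;
    }
    return rtrim($base, '/') . '/' . ltrim($path, '/');
}

echo joinUrl('http://example.com', 'image/abc.png'), "\n";
echo joinUrl('http://example.com/', '/image/abc.png'), "\n";
echo joinUrl('http://example.com/', 'http://cdn.example.org/x.png'), "\n";
```

The scheme check also covers images hosted on a different domain, which a plain "does not start with my base" test would otherwise rewrite.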

Using regex to wrap images in tags

I've been using regex to wrap my images in < a > tags and altering their paths etc.
I know using dom for this is better, having read a lot of threads about wrapping, but I'm unable to understand how to.
This is what I'm using:
$comments = (preg_replace('#(<img.+src=[\'"]/uploads/userdirs/admin)(?:.*?/)(.+?)\.(.+?)([\'"].*?>)#i', '<a class="gallery" rel="'.$pagelink.'" href=/uploads/userdirs/'.$who.'/$2.$3>$1/mcith/mcith_$2.$3$4</a>', $comments));
It successfully wraps each image in the tags I want. But only if the string provided ($comments) has the right markup.
<p><img src="/uploads/userdirs/admin/1160501362291.png" alt="" width="1280" height="960" /></p>
<p><img src="/uploads/userdirs/admin/100_Bullets_68_1280x1024.jpg" alt="" width="1280" height="1024" /></p>
When presented like this, it works. I'm using tinymce so it wraps in < p > when I do a linebreak with enter. But when I don't do that, when I just insert images one after another so the HTML looks like this, it won't:
<p><img src="/uploads/userdirs/admin/1160501362291.png" alt="" width="1280" height="960" /><img src="/uploads/userdirs/admin/100_Bullets_68_1280x1024.jpg" alt="" width="1280" height="1024" /></p>
It will instead wrap those 2 images in the same < a > tag. Making the output look like this:
<p><a class="gallery" rel="test" href="/uploads/userdirs/admin/100_Bullets_68_1280x1024.jpg">
<img src="/uploads/userdirs/admin/1160501362291.png" alt="" width="1280" height="960">
<img src="/uploads/userdirs/admin/mcith/mcith_100_Bullets_68_1280x1024.jpg" alt="" width="1280" height="1024">
</a></p>
Which is wrong. The output I want is this:
<p><a class="gallery" rel="test2" href="/uploads/userdirs/admin/100_Bullets_68_1280x1024.jpg"><img src="/uploads/userdirs/admin/mcith/mcith_100_Bullets_68_1280x1024.jpg" alt="" width="1280" height="1024"></a></p>
<p><a class="gallery" rel="test2" href="/uploads/userdirs/admin/1154686260226.jpg"><img src="/uploads/userdirs/admin/mcith/mcith_1154686260226.jpg" alt="" width="1280" height="800"></a></p>
I've left out a few details, but here's how I would do it using DOMDocument:
$s = <<<EOM
<p><img src="/uploads/userdirs/admin/1160501362291.png" alt="" width="1280" height="960" /></p>
<p><img src="/uploads/userdirs/admin/100_Bullets_68_1280x1024.jpg" alt="" width="1280" height="1024" /></p>
EOM;
$d = new DOMDocument;
$d->loadHTML($s);
foreach ($d->getElementsByTagName('img') as $img) {
$img_src = $img->attributes->getNamedItem('src')->nodeValue;
if (0 === strncasecmp($img_src, '/uploads/userdirs/admin', 23)) {
$a = $d->createElement('a');
$a->setAttribute('class', 'gallery');
$a->setAttribute('rel', 'whatever');
// $img_src is already a full path, so append only the file name to the user directory
$a->setAttribute('href', '/uploads/userdirs/username/' . basename($img_src));
// disconnect image tag from parent
$img->parentNode->replaceChild($a, $img);
// and move to anchor
$a->appendChild($img);
}
}
echo $d->saveHTML();
You should replace .* in your regular expression with [^>]*. The latter means: any character except >. Regular expression quantifiers are greedy by default, so they match as much as possible. Without this additional restriction, a single match ends up spanning two <img> tags.
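A reduced sketch of that difference, using made-up file names and a much simpler pattern than the one in the question:

```php
<?php
$html = '<p><img src="/a/one.png" alt=""><img src="/a/two.jpg" alt=""></p>';

// Greedy ".+" runs on to the last possible src=, so one match spans both tags.
preg_match_all('#<img.+src="[^"]*"#', $html, $greedy);

// "[^>]+" cannot cross the ">" that closes a tag, so each <img> matches on its own.
preg_match_all('#<img[^>]+src="[^"]*"#', $html, $bounded);

echo count($greedy[0]), "\n";  // prints: 1
echo count($bounded[0]), "\n"; // prints: 2

// The same restriction lets each image be wrapped separately.
$wrapped = preg_replace('#<img[^>]+>#', '<a class="gallery">$0</a>', $html);
echo $wrapped, "\n";
```

This still assumes no attribute value contains a literal >, which is one more reason the DOM approach is the safer choice.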

Remove IMG tag from String for a specific image with PHP

I have a block of text that has a particular image that I want to strip out. The problem is that the tag can be with different styles
For example:
<img src="myimage.png" alt="" class=""/>
or
<img alt="" class="" src="myimage.png"/>
or
<img class="" alt ="" src="myimage.png"/>
Now how can I remove that particular image tag from my string using PHP?
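Something like this, maybe?
$str = 'Lorem <img alt="" class="" src="myimage.png"/> ipsum <img class="" alt="" src="myimage.png"/> dolor <img src="myimage.png"/> sit...';
// [^>]*? keeps each match inside a single tag; the dot in the file name is escaped
echo preg_replace('!<img[^>]*?src="myimage\.png"[^>]*?/>!i', '', $str);
// output: "Lorem  ipsum  dolor  sit..." (the spaces around each removed tag remain)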
If you meant to extract the attribute instead, try:
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about imperfect markup
$xpath = new DOMXPath($dom);
$src = $xpath->evaluate("string(//img/@src)");
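For the removal itself, a DOM-based sketch does not care about attribute order or quoting style. The wrapper <div> and the second image other.png are assumptions added for the example.

```php
<?php
$str = 'Lorem <img alt="" class="" src="myimage.png"/> ipsum <img class="" alt="" src="other.png"/> dolor';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
// Wrap the fragment in a <div> so it has a single root element.
$dom->loadHTML('<div>' . $str . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// Copy the node list into an array before removing nodes from the document.
foreach (iterator_to_array($xpath->query('//img[@src="myimage.png"]')) as $img) {
    $img->parentNode->removeChild($img);
}

// Serialize only the children of the wrapper <div>, not the wrapper itself.
$result = '';
foreach ($dom->documentElement->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result, "\n";
```

Only the image whose src is exactly myimage.png is dropped; the other image and the surrounding text are kept.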
