Regex to extract images from HTML - how to get only JPGs? - php

I am using this PHP function to grab all <img> tags within any given HTML.
function extract_images($content)
{
$img = strip_tags(html_entity_decode($content),'<img>');
$regex = '~src="[^"]*"~';
preg_match_all($regex, $img, $all_images);
return $all_images;
}
This works and returns all images (gif, png, jpg, etc).
Anyone know how to change the regex...
~src="[^"]*"~
in order to only get files with JPG or JPEG extension?
Thanks a bunch.

Sooner or later the Regex Enforcement Agency will show up. It might as well be me :)
The proper way to do this is with a proper HTML DOM parser. Here's a DOMDocument solution. The usefulness of this is in that it's more robust than parsing the HTML by regex, and also gives you the ability to access or modify other HTML attributes on your <img> nodes at the same time.
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
// Get all images
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img) {
// Check the src attr of each img
$src = "";
$src = $img->getAttribute("src");
if (preg_match("/\.jp[e]?g$/i", $src) {
// Add it onto your $links array.
$links[] = $src;
}
See other answers for the simple regex solution, or adapt from the regex inside my foreach loop.

/src="[^"]*\.(jpg|jpeg)"/i
i -> case insensitive match

Related

PHP regex replace content in img src attribute?

I'm new to PHP and have following issue. I have to sanitize images and need to replace all img src if it starts with http:// with /files/images/.
Is that possible to do with regex?
I found a similar solution here but it does't fulfil everything I need.
A modification of the thread you linked to do what you are requesting:
$dom = new DOMDocument();
$dom->loadHTML($content);
foreach ($dom->getElementsByTagName('img') as $img) {
$current_src = $img->getAttribute('src');
$new_src = preg_replace('http://', '/files/images', $current_src);
$img->setAttribute( 'src', $new_src );
}
$content = $dom->saveHTML();
I don't have a machine with PHP readily available to test this for you (not at home), so let me know if you have any issues with it and I would be happy to troubleshoot for you.

How can i replace a multiple img-elements with plain text?

I am want to create an output text filter to replaces all the <img> elements in the DOM with the following text "no images allowed".
I.e.: If the user creates this HTML markup:
<p><img src="/image.jpg" /></p>
the following HTML is rendered:
<p>no images allowed</p>
Please note that I cannot use preg_replace. The question is simplified and I need to parse the DOM to to find what images to disallow.
Thanks to this answer, I found that getElementsByTagName() returns "live" iterator, so you need two steps, so I have this:
foreach ($elements as $element) {
$domArray[] = $element;
$src= $element->getAttribute('src');
$frag= $dom->createElement('p');
$frag->nodeValue = 'no images allowed';
$element->parentNode->appendChild($frag);
}
// loop through the array and delete each node
$nodes = iterator_to_array($dom->getElementsByTagName('img'));
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
$newtext = $dom->saveHTML();
It almost do what I want, but I get this:
<p><p>no images allowed</p></p>
I would fetch the elements with xpath, then replace with newly created text nodes.
$xp = new DOMXPath($dom);
$elements = $xp->query('//img');
foreach ($elements as $element) {
$frag= $dom->createTextNode('no images allowed');
$element->parentNode->insertBefore($frag, $element);
$element->parentNode->removeChild($element);
}
echo $dom->saveHtml();
Demo here: http://codepad.org/w9uj0ez9
To remove HTML self-enclosed img tag you may use a simple regular expression:
<?php
function no_images_allowed($text) {
return preg_replace('/<img[^>]*>/', 'no images allowed', $text);
}
print no_images_allowed('<p><img src="/image.jpg" /></p>');
It is simpler and should be much more efficient, you do not need to travers over every DOM element, just process plain text.
Regex in example above will only work for self-enclosed img tag:
<img src="..."/>
<img src="...">
Please note that it will not work for example with:
<img src="..."></img>
<IMG SRC="..."/>
<img src="...">invalid content</img>
If you want to include every possible case (even invalid ones) then proposed regex should be modified.

PHP Regex to find images with specific src attribute

I have a variable with HTML source and I need to find images within the variable that contain images with specific src attributes.
For example my image:
<img src="/path/img1.svg">
I have tried the below but doesnt work, any suggestions?
$hmtl = '<div> some stuff <img src="/path/img1.svg"/> </div><div>other stuff</div>';
preg_match_all('/<img src="/path/img1.svg"[^>]+>/i',$v, $images);
You should make use of DOMDocument Class, not regular expressions when it comes to parsing HTML.
<?php
$html='<img src="/path/img1.svg">';
$dom = new DOMDocument;
#$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('img') as $tag) {
echo $tag->getAttribute('src'); //"prints" /path/img1.svg
}

Identify image links in PHP

I have a PHP page which looks through some HTML for links and replaces them with links to a local PHP page; the problem is finding image links. I currently use this code:
$data = preg_replace('|(<a\s*[^>]*href=[\'"]?)|','\1newjs.php?url=', $data);
Which matches things like
Google
and will replace them with
Google
I am looking to do a similar things with image files (jpg, gif, png) and replace something like:
Image
With this:
Image
Note the '&image=1' in the new URL. Is it possible for me to do this using PHP, preferably with regular expressions?
As per usual with anything involving regexes and HTML: https://stackoverflow.com/a/1732454/118068
The proper solution is to use DOM operations:
$dom = new DOMDocument();
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$anchors = $xp->query('//a');
foreach($anchors as $a) {
$href = $a->getAttribute('href');
if (is_image_link($href)) { //
$a->setAttribute('href', ... new link here ...);
}
}

Get all images url from string [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
How to extract img src, title and alt from html using php?
Hi,
I have found solution to get first image from string:
preg_match('~<img[^>]*src\s?=\s?[\'"]([^\'"]*)~i',$string, $matches);
But I can't manage to get all images from string.
One more thing... If image contains alternative text (alt attribute) how to get it too and save to another variable?
Thanks in advance,
Ilija
Don't do this with regular expressions. Instead, parse the HTML. Take a look at Parse HTML With PHP And DOM. This is a standard feature in PHP 5.2.x (and probably earlier). Basically the logic for getting images is roughly:
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
This should be trivial to adapt to finding images.
This is what I tried but can't get it print value of src
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($html);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** the table by its tag name ***/
$images = $dom->getElementsByTagName('img');
/*** loop over the table rows ***/
foreach ($images as $img)
{
/*** get each column by tag name ***/
$url = $img->getElementsByTagName('src');
/*** echo the values ***/
echo $url->nodeValue;
echo '<hr />';
}
EDIT: I solved this problem
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($string);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach($images as $img)
{
$url = $img->getAttribute('src');
$alt = $img->getAttribute('alt');
echo "Title: $alt<br>$url<br>";
}
Note that Regular Expressions are a bad approach to parsing anything that involves matching braces.
You'd be better off using the DOMDocument class.
You assume that you can parse HTML using regular expressions. That may work for some sites, but not all sites. Since you are limiting yourself to only a subset of all web pages, it would be interesting to know how you limit yourself... maybe you can parse the HTML in a quite easy way from php.
Look at preg_match_all to get all matches.

Categories