I have a variable containing HTML source, and I need to find the images in it that have a specific src attribute.
For example my image:
<img src="/path/img1.svg">
I have tried the code below, but it doesn't work. Any suggestions?
$hmtl = '<div> some stuff <img src="/path/img1.svg"/> </div><div>other stuff</div>';
preg_match_all('/<img src="/path/img1.svg"[^>]+>/i',$v, $images);
When it comes to parsing HTML, you should use the DOMDocument class rather than regular expressions.
<?php
$html='<img src="/path/img1.svg">';
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
foreach ($dom->getElementsByTagName('img') as $tag) {
echo $tag->getAttribute('src'); //"prints" /path/img1.svg
}
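To narrow the loop above to one specific src, as the question asks, the attribute can be compared inside the loop. This is a minimal sketch; the helper name find_images_by_src is hypothetical, and it assumes an exact string match on src is what is wanted.

```php
<?php
// Hypothetical helper: returns the <img> elements whose src equals $src exactly.
function find_images_by_src(string $html, string $src): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $matches = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        if ($img->getAttribute('src') === $src) {
            $matches[] = $img;
        }
    }
    return $matches;
}

$html = '<div> some stuff <img src="/path/img1.svg"/> </div><div>other stuff</div>';
$found = find_images_by_src($html, '/path/img1.svg');
echo count($found), "\n"; // prints 1
```

Each entry in $found is a DOMElement, so its other attributes can be read or changed directly.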
Related
I want to create an output text filter that replaces all the <img> elements in the DOM with the text "no images allowed".
I.e.: If the user creates this HTML markup:
<p><img src="/image.jpg" /></p>
the following HTML is rendered:
<p>no images allowed</p>
Please note that I cannot use preg_replace. The question is simplified, and I need to parse the DOM to find which images to disallow.
Thanks to this answer, I found that getElementsByTagName() returns a "live" iterator, so two steps are needed. This is what I have:
foreach ($elements as $element) {
$domArray[] = $element;
$src= $element->getAttribute('src');
$frag= $dom->createElement('p');
$frag->nodeValue = 'no images allowed';
$element->parentNode->appendChild($frag);
}
// loop through the array and delete each node
$nodes = iterator_to_array($dom->getElementsByTagName('img'));
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
$newtext = $dom->saveHTML();
It almost does what I want, but I get this:
<p><p>no images allowed</p></p>
I would fetch the elements with xpath, then replace with newly created text nodes.
$xp = new DOMXPath($dom);
$elements = $xp->query('//img');
foreach ($elements as $element) {
$frag= $dom->createTextNode('no images allowed');
$element->parentNode->insertBefore($frag, $element);
$element->parentNode->removeChild($element);
}
echo $dom->saveHTML();
Demo here: http://codepad.org/w9uj0ez9
To remove a self-closing HTML img tag you may use a simple regular expression:
<?php
function no_images_allowed($text) {
return preg_replace('/<img[^>]*>/', 'no images allowed', $text);
}
print no_images_allowed('<p><img src="/image.jpg" /></p>');
It is simpler and should be much more efficient: you do not need to traverse every DOM element, you just process plain text.
The regex in the example above only works for a self-closing img tag:
<img src="..."/>
<img src="...">
Please note that it will not work with, for example:
<img src="..."></img>
<IMG SRC="..."/>
<img src="...">invalid content</img>
If you want to cover every possible case (even invalid ones), the proposed regex has to be modified.
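One possible modification is sketched below. The function name no_images_allowed_loose is hypothetical. It adds the i flag for uppercase tags and optionally consumes a closing </img> that follows immediately. It deliberately does not handle content between <img> and </img>, and a > inside an attribute value would still break it.

```php
<?php
// Hypothetical variant of no_images_allowed() with a slightly more tolerant pattern.
function no_images_allowed_loose(string $text): string
{
    // \b stops tags such as <imgfoo> from matching; the i flag covers <IMG ...>;
    // the optional group swallows a </img> that directly follows the tag.
    $result = preg_replace('~<img\b[^>]*>(?:\s*</img>)?~i', 'no images allowed', $text);
    return $result ?? $text; // fall back to the input if the regex engine fails
}

echo no_images_allowed_loose('<p><IMG SRC="/image.jpg"/></p>'), "\n";      // <p>no images allowed</p>
echo no_images_allowed_loose('<p><img src="/image.jpg"></img></p>'), "\n"; // <p>no images allowed</p>
```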
This is my Regex to fetch all tags with class:
preg_match_all('/<\s*\w*\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>/',file,$matches);
It matches all tags with class like <a class="abc">
The problem is that if a tag contains an extra attribute before class, this regex fails to match it.
E.g.: <a id="fig_3_1" class="figure-contents">
I want to get <a class="figure-contents"> and ignore fig_3_1.
Any idea how to exclude it?
<\s*\w*.*?\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>
This will probably work,
but you would be better off using simple_html_dom.
Take a look at this amazing SO post and reconsider.
You will most likely be better off using an HTML parser instead. You can do so using the DOM model.
A simple sample of how it can be used is below.
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$image->setAttribute('src', 'http://example.com/' .$image->getAttribute('src'));
}
$html = $dom->saveHTML();
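Applied to the class question above, the same DOM approach could look like the sketch below. The sample markup is made up for illustration. The XPath expression //*[@class] selects every element that has a class attribute, regardless of which attributes precede it.

```php
<?php
// Illustrative input: one tag with an id before class, one without class, one with two classes.
$html = '<a id="fig_3_1" class="figure-contents">x</a><p>no class</p><div class="abc def">y</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classes = [];
foreach ($xpath->query('//*[@class]') as $node) {
    // Attribute order in the source does not matter here.
    $classes[] = $node->nodeName . ': ' . $node->getAttribute('class');
}
print_r($classes); // a: figure-contents, div: abc def
```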
I want to create a regex in PHP that matches the text between the opening and closing angle brackets of an HTML img tag. Let's say I have the HTML text in the variable $searchThis:
$searchThis = "<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>";
I want to match the content inside each img tag. The result must be the following matches:
src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
This is how I imagine the pattern should look, but it doesn't work for me:
$pattern = "<img([^\/]+)\/>";
Never try to parse HTML with regex. For parsing HTML, use a DOM parser. Consider code like this:
$html = <<< EOF
<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$src = $node->attributes->getNamedItem('src')->nodeValue;
echo "src='$src'\n";
}
OUTPUT:
src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
Try:
preg_match_all("`<img (.*)/>`Uis", $searchThis, $results);
print_r($results);
Printing the structure of $results will show you its content.
Note: If you wish to be more accurate, I would suggest including src= in your search and matching up to the closing quote mark, so that only the image address is captured. You can then add the missing text (src=) back afterwards.
This way, you still get the relative path even when the image tag doesn't look as expected (i.e. there are other attributes in the tag, like alt="Smiley face" height="42" width="42").
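The note above can be sketched as follows. The sample input adds extra attributes for illustration, and the pattern assumes the src value is always quoted. It captures the opening quote and uses a backreference to stop at the matching closing quote.

```php
<?php
// Illustrative input: the first tag has extra attributes around src.
$searchThis = "<html><div></div><img alt='Smiley face' src='/relative/path/img1.png' height='42'/></div>
<img src=\"/relative/path/img2.png\"/></html>";

// Group 1 is the quote character, group 2 is the path; \1 requires the same quote to close.
preg_match_all('~<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1~is', $searchThis, $results);
$paths = $results[2];

foreach ($paths as $path) {
    echo "src='$path'\n"; // add the src= text back afterwards
}
```

This is still a regex over HTML, so it shares the usual limits: an unquoted src value or a > inside another attribute would not be handled.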
Example of parsing with simplehtmldom:
<?php
include("simplehtmldom/simple_html_dom.php");
// Create DOM from URL or file
$html = str_get_html("<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>");
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
?>
Let's say I have many images in a string and I only want to get the src of the image with a specific class:
<img src="image1.jpg"/>
<img src="image2.jpg"/>
<img src="image3.jpg" class="main"/>
I want to get the src of the third one, which has the class main. How do I do that?
$pattern = '/< *img[^>]*src *= *["\']?([^"\']*)/i';
preg_match($pattern,$response,$matches);
This one matches all img tags.
Don't use a regex to parse HTML. Use DOMDocument instead.
Here's some code:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$imgs = $xp->query("//img[@class='main']");
$imgs now holds a DOMNodeList of the images whose class attribute is exactly main. (I think so; I haven't used DOMXPath much.)
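The query above only matches when the class attribute is exactly main. If an image may carry several classes, a common XPath 1.0 idiom is to pad the attribute with spaces and look for the padded token. A sketch, with sample markup extended by a second class for illustration:

```php
<?php
// Illustrative input: the third image has two classes.
$html = '<img src="image1.jpg"/><img src="image2.jpg"/><img src="image3.jpg" class="main large"/>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
// Matches "main" as a whole class token, not as part of e.g. "mainframe".
$query = "//img[contains(concat(' ', normalize-space(@class), ' '), ' main ')]";

$srcs = [];
foreach ($xp->query($query) as $img) {
    $srcs[] = $img->getAttribute('src');
}
print_r($srcs); // image3.jpg
```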
I am writing a regex find/replace that will insert a <span> into every <a href> in a file where a <span> does not already exist. It will allow other tags to be in the <a href> like <img>, <b>, etc.
Currently I have this regex:
Find: (<a[^>]+?style=".*?color:#(\w{6}).*?".*?>)(.+?)(<\/a>)
Replace: '$1<span style="color:#$2;">$3</span>$4'
It works great, except that if I run it over the same file again, it inserts a <span> inside a <span> and things get messy.
Target Example:
We want it to ignore this:
<span style="color:#bfbcba;">Howdy</span>
But not this:
Howdy
Or this:
<img src="myimg.gif" />Howdy
--EDIT--
Using the PHP DOM library as suggested in the comments, this is what I have so far:
$doc = new DOMDocument();
$doc->loadHTML($input);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
$spancount = $tag->getElementsByTagName("span")->length;
if($spancount == 0){
$element = $doc->createElement('span');
$tag->appendChild($element);
}
}
echo $doc->saveHTML();
Currently it detects whether there is a span inside an anchor, and if there isn't, it appends an empty span inside the anchor. However, I have yet to figure out how to get the original contents of the anchor inside the span.
Don't use regex for this; it's not well suited to HTML.
Use a DOM library and getElementsByTagName('a'), then iterate through each anchor and check whether it contains a nested span element with getElementsByTagName('span'), using the length property. If it doesn't, create a new span with createElement('span'), move the anchor's child nodes into it, and append it to the anchor with appendChild.
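A minimal sketch of those steps with PHP's DOMDocument follows. The sample input is made up from the target examples in the question, and setting the span's color style is left out. It relies on appendChild moving a node that is already in the document, so the anchor's children end up inside the new span.

```php
<?php
// Illustrative input: the first anchor has no span, the second already has one.
$input = '<p><a href="#"><img src="myimg.gif">Howdy</a> '
       . '<a href="#"><span style="color:#bfbcba;">Howdy</span></a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($input);
libxml_clear_errors();

// getElementsByTagName() is live, so copy it to an array before changing the tree.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
    if ($a->getElementsByTagName('span')->length > 0) {
        continue; // already wrapped, so running this twice stays safe
    }
    $span = $doc->createElement('span');
    while ($a->firstChild !== null) {
        $span->appendChild($a->firstChild); // moves the child out of the anchor
    }
    $a->appendChild($span);
}

$out = $doc->saveHTML();
echo $out;
```

Because anchors that already contain a span are skipped, running the script over its own output leaves the markup unchanged.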
EDIT: As for grabbing the inner html of the anchor, if there are lots of nodes inside, try using this:
<?php
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
$html = innerHTML( $anchorRef );
This may also help you out: Change innerHTML of a php DOMElement