I have no idea about PHP regex. I wish to extract all image tags such as <img src="www.google.com/exampleimag.jpg"> from my HTML. How can I do this using preg_match_all?
Thanks, SO community, for your precious time.
Well, my scenario is like this: there is not a whole HTML DOM, just a variable containing img tags: $text="this is a new text <img="sfdsfdimg/pfdg.fgh" > there is another iamh <img src="sfdsfdfsd.png"> hjkdhfsdfsfsdfsd kjdshfsd dummy text
Don't use regular expressions to parse HTML. Instead, use something like the DOMDocument that exists for this very reason:
$html = 'Sample text. Image: <img src="foo.jpg" />. <img src="bar.png" />';
$doc = new DOMDocument();
$doc->loadHTML( $html );
$images = $doc->getElementsByTagName("img");
for ( $i = 0; $i < $images->length; $i++ ) {
    // Outputs: foo.jpg bar.png
    echo $images->item( $i )->attributes->getNamedItem( 'src' )->nodeValue;
}
You could also get the image HTML itself if you like:
// <img src="foo.jpg" />
echo $doc->saveHTML ( $images->item(0) );
You can't parse HTML with regex. You're much better off using the DOM classes. They make it trivially easy to extract the images from a valid HTML tree.
$doc = new DOMDocument();
$doc->loadHTML($html);
$images = $doc->getElementsByTagName('img'); // This will generate a collection of DOMElement objects that contain the image tags
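For the scenario in the question (a text fragment in a variable rather than a full document), a minimal sketch might look like the following. It assumes the first tag is written with a normal src attribute, shortens the sample text, and uses libxml_use_internal_errors() so that parser warnings about the fragment are not printed; the variable name $sources is only illustrative.

```php
<?php
// Assumption: a shortened version of the question's $text, with both tags
// written as <img src="...">.
$text = 'this is a new text <img src="sfdsfdimg/pfdg.fgh"> there is another '
      . '<img src="sfdsfdfsd.png"> dummy text';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($text);   // libxml wraps the fragment in html/body itself
libxml_clear_errors();

// Gather the src value of every img element that has one.
$sources = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('src')) {
        $sources[] = $img->getAttribute('src');
    }
}

print_r($sources);
```

The hasAttribute() check skips image tags that carry no src at all, which would otherwise yield empty strings.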
I want to create an output text filter that replaces all the <img> elements in the DOM with the text "no images allowed".
I.e.: If the user creates this HTML markup:
<p><img src="/image.jpg" /></p>
the following HTML is rendered:
<p>no images allowed</p>
Please note that I cannot use preg_replace. The question is simplified, and I need to parse the DOM to find which images to disallow.
Thanks to this answer, I found that getElementsByTagName() returns a "live" iterator, so two steps are needed. This is what I have:
foreach ($elements as $element) {
    $domArray[] = $element;
    $src= $element->getAttribute('src');
    $frag= $dom->createElement('p');
    $frag->nodeValue = 'no images allowed';
    $element->parentNode->appendChild($frag);
}
// loop through the array and delete each node
$nodes = iterator_to_array($dom->getElementsByTagName('img'));
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}
$newtext = $dom->saveHTML();
It almost does what I want, but I get this:
<p><p>no images allowed</p></p>
I would fetch the elements with XPath, then replace them with newly created text nodes.
$xp = new DOMXPath($dom);
$elements = $xp->query('//img');
foreach ($elements as $element) {
    $frag = $dom->createTextNode('no images allowed');
    $element->parentNode->insertBefore($frag, $element);
    $element->parentNode->removeChild($element);
}
echo $dom->saveHTML();
Demo here: http://codepad.org/w9uj0ez9
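In case the demo link is unavailable, here is a self-contained sketch of the same idea. Two things differ from the answer and are my own assumptions: it uses replaceChild() instead of insertBefore() plus removeChild(), and it passes LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD so that the output is not wrapped in html/body tags.

```php
<?php
$html = '<p><img src="/image.jpg" /></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Assumption: these flags keep libxml from adding a doctype and html/body.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xp = new DOMXPath($dom);

// DOMXPath::query() returns a static node list, so replacing nodes while
// iterating does not skip elements the way a live list can.
foreach ($xp->query('//img') as $img) {
    $replacement = $dom->createTextNode('no images allowed');
    $img->parentNode->replaceChild($replacement, $img);
}

$newtext = trim($dom->saveHTML());
echo $newtext, "\n";
```

Because the replacement is a text node rather than a new p element, the nested <p><p>...</p></p> from the question does not occur.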
To remove a self-closing HTML img tag, you may use a simple regular expression:
<?php
function no_images_allowed($text) {
    return preg_replace('/<img[^>]*>/', 'no images allowed', $text);
}
print no_images_allowed('<p><img src="/image.jpg" /></p>');
It is simpler and should be much more efficient: you do not need to traverse every DOM element, you just process plain text.
The regex in the example above will only work for a self-closing img tag:
<img src="..."/>
<img src="...">
Please note that it will not work with, for example:
<img src="..."></img>
<IMG SRC="..."/>
<img src="...">invalid content</img>
If you want to cover every possible case (even invalid ones), then the proposed regex needs to be modified.
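One possible modification, offered only as a sketch: add the i flag for upper-case tags and optionally consume a closing </img> that follows immediately. The function name no_images_allowed_ci is hypothetical. It still does not handle content between <img> and </img>, nor a > character inside an attribute value.

```php
<?php
// Hypothetical variant of no_images_allowed() from the answer above.
function no_images_allowed_ci($text) {
    // <img ...>   any case, any attributes (no '>' allowed inside them)
    // (?:...)?    optionally swallow a directly following </img>
    return preg_replace(
        '/<img\b[^>]*>(?:\s*<\/img\s*>)?/i',
        'no images allowed',
        $text
    );
}

$upper  = no_images_allowed_ci('<p><IMG SRC="/image.jpg"/></p>');
$closed = no_images_allowed_ci('<p><img src="/image.jpg"></img></p>');

print $upper . "\n";
print $closed . "\n";
```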
I have the following HTML code stored in a PHP variable:
<p>This is a sample paragraph</p>
<img src="img/1.jpg">
<h1>This is my header</h1>
<img src="img/2.jpg">
<p>I hope someone can help me</p>
<img src="img/3.jpg">
And I have a PHP array which has three elements, exactly as many as there are image elements in the HTML string:
Array(3){
[0]<img src="img/new1.jpg">
[1]<img src="img/new2.jpg">
[2]<img src="img/new3.jpg">
}
I'm trying to write a function which will replace the first img tag in the html string with the first array element, the second one with the second in the array and the third with the third in the array.
So that at the end I get this:
<p>This is a sample paragraph</p>
<img src="img/new1.jpg">
<h1>This is my header</h1>
<img src="img/new2.jpg">
<p>I hope someone can help me</p>
<img src="img/new3.jpg">
Believe me, I have no idea how to do this. If I had an idea I would try it, but I can't come up with any logic for solving this problem.
Any help would be really great.
If you're able to have your new image tag array hold only the path/file info rather than entirely new HTML tags then something similar to the following should work:
$html = <<<'HTML'
<p>This is a sample paragraph</p>
<img src="img/1.jpg">
<h1>This is my header</h1>
<img src="img/2.jpg">
<p>I hope someone can help me</p>
<img src="img/3.jpg">
HTML;
$newImages = ['img/new1.jpg', 'img/new2.jpg', 'img/new3.jpg'];
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
for ($i = 0; $i < $images->length; $i++)
    $images->item($i)->setAttribute('src', $newImages[$i]);
// Your updated HTML is now in $html
$html = $dom->saveHTML();
Note: You can modify your new image array to have only path/image using preg_replace or str_replace.
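As a rough illustration of that note, the tag array from the question could be reduced to bare paths like this. The name $newTags is my own, and the pattern assumes each tag has a double-quoted src attribute.

```php
<?php
// Assumption: the replacement array holds complete tags, as in the question.
$newTags = [
    '<img src="img/new1.jpg">',
    '<img src="img/new2.jpg">',
    '<img src="img/new3.jpg">',
];

// Pull the double-quoted src value out of each tag; null if there is none.
$newImages = array_map(function ($tag) {
    return preg_match('/\bsrc="([^"]*)"/i', $tag, $m) ? $m[1] : null;
}, $newTags);

print_r($newImages);
```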
UPDATE BASED ON YOUR REPLY AND REQUEST FOR IMPROVEMENT IN ANSWER BELOW:
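I forgot this in my earlier reply, but as of PHP 5.4 and Libxml 2.6, loadHTML() accepts Libxml parameters. Note that the two flags used below need a newer Libxml: LIBXML_HTML_NOIMPLIED requires 2.7.7 and LIBXML_HTML_NODEFDTD requires 2.7.8. You can drop all the str_replace() stuff (see code).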
There's no need to copy your argument into a local variable if $content is a simple string, as it'll be passed by value anyway (the original won't be modified).
I would not suppress errors using @; use libxml_use_internal_errors and libxml_get_errors instead, in this instance.
As I mentioned in my comment, I don't see $imagetag_arr being passed to the function, declared global, or in front of $this. I've added it to the arg list of the function with a more descriptive name.
Updated code:
function replaceTags($content, $newImages)
{
    $dom = new DOMDocument();
    $dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $images = $dom->getElementsByTagName('img');
    for ($i = 0; $i < $images->length; $i++)
    {
        $dom2 = new DOMDocument();
        $dom2->loadHTML($newImages[$i]);
        $newImg = $dom2->getElementsByTagName('img')
            ->item(0);
        $images->item($i)->setAttribute('src', $newImg->getAttribute('src'));
    }
    return $dom->saveHTML();
}
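The error-handling advice above (libxml_use_internal_errors and libxml_get_errors rather than the @ operator) could be sketched as follows. The sample markup in $content and the variable $imageCount are illustrative assumptions, and what you do with the collected errors is up to you.

```php
<?php
// Assumption: deliberately sloppy markup with a stray closing tag.
$content = '<p>Broken <img src="img/1.jpg"></p></div>';

// Switch libxml to internal error collection and remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($content);

// Each entry is a LibXMLError object with message, line, level, and so on.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

$imageCount = $dom->getElementsByTagName('img')->length;
```

Unlike @, this keeps the parser diagnostics available for logging while still preventing them from being emitted as PHP warnings.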
$a=Array (
'img/1.jpg'=>'img/new1.jpg',
'img/2.jpg'=>'img/new2.jpg',
'img/3.jpg'=>'img/new3.jpg');
$replace=array_values($a);
$find=array_keys($a);
$html=str_replace($find, $replace, $html);
If you want to replace at a higher level in the HTML object tree, you'll need to use a DOM parser, or you will run into problems with the likes of:
<img
src='img/1.jpg'
>
Try this:
$source = <<<'EOD'
<p>This is a sample paragraph</p>
<img src="img/1.jpg">
<h1>This is my header</h1>
<img src="img/2.jpg">
<p>I hope someone can help me</p>
<img src="img/3.jpg">
EOD;
$new = [
'<img src="img/new1.jpg">',
'<img src="img/new2.jpg">',
'<img src="img/new3.jpg">',
];
$i = 0;
$result = preg_replace_callback(
    '/<img src="img\/[^.]+\.jpg">/',
    function($matches) use ($new, &$i) {
        return $new[$i++];
    },
    $source
);
It's tested and it works.
But maybe somebody could find a more elegant way of using $new and $i?
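One alternative, as a sketch rather than a definitive answer: treat the replacements as a queue and shift one off per match, which removes the separate counter. The names $queue and the looser pattern /<img\b[^>]*>/i are my own choices, and the sample input is shortened; when the queue runs out, the original tag is left untouched.

```php
<?php
$source = '<p>A</p><img src="img/1.jpg"><h1>B</h1><img src="img/2.jpg">';
$new = [
    '<img src="img/new1.jpg">',
    '<img src="img/new2.jpg">',
];

// Work on a copy so $new itself is not consumed.
$queue = $new;

$result = preg_replace_callback(
    '/<img\b[^>]*>/i',
    function ($matches) use (&$queue) {
        // Take the next replacement; keep the original tag if none is left.
        $next = array_shift($queue);
        return $next !== null ? $next : $matches[0];
    },
    $source
);

echo $result, "\n";
```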
Thanks to all for your answers. I've more or less combined all the answers and came up with the following solution. It's not the best one from a coding perspective, and I feel bad about writing this "amateur" script, but it works under all conditions. If someone can improve this code, I would be very thankful.
$content is my html string.
$imagetag_arr is the array with the new image tags.
function replaceTags($content){
$html = $content; //content is my html string.
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
for ($i = 0; $i < $images->length; $i++) {
    $dom2 = new DOMDocument();
    @$dom2->loadHTML($imagetag_arr[$i]);
    $search = $dom2->getElementsByTagName('img');
    foreach ($search as $item) {
        $newsrc=$item->getAttribute('src');
    }
    $images->item($i)->setAttribute('src', $newsrc);
}
// my updated HTML is now in $html
$html = $dom->saveHTML();
//because I've saved it as html now the dom object is appending all the html
//and body tags to my $final result so I've replaced
// them all with empty strings.I hate myself
// for doing it like this.
$finalhtml=preg_replace('/<![^>]+>/i','',$html);
$finalhtml=str_replace('<html>','',$finalhtml);
$finalhtml=str_replace('<body>','',$finalhtml);
$finalhtml=str_replace('</body>','',$finalhtml);
$finalhtml=str_replace('</html>','',$finalhtml);
return $finalhtml;
}
I have a variable with HTML source and I need to find images within it that have specific src attributes.
For example my image:
<img src="/path/img1.svg">
I have tried the code below but it doesn't work. Any suggestions?
$hmtl = '<div> some stuff <img src="/path/img1.svg"/> </div><div>other stuff</div>';
preg_match_all('/<img src="/path/img1.svg"[^>]+>/i',$v, $images);
You should make use of the DOMDocument class, not regular expressions, when it comes to parsing HTML.
<?php
$html='<img src="/path/img1.svg">';
$dom = new DOMDocument;
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('img') as $tag) {
    echo $tag->getAttribute('src'); // prints /path/img1.svg
}
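To select only images with one specific src, as the question asks, an XPath attribute predicate is one option. The following is a sketch: the second image in the sample markup and the variable names $wanted and $found are assumptions added for illustration.

```php
<?php
// Assumption: sample markup with one matching and one non-matching image.
$html = '<div> some stuff <img src="/path/img1.svg"/> </div>'
      . '<div>other stuff <img src="/path/other.svg"></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Note: $wanted is placed inside double quotes in the expression, so it
// must not itself contain a double quote.
$wanted = '/path/img1.svg';
$matches = $xpath->query('//img[@src="' . $wanted . '"]');

// Serialize each matching element back to HTML.
$found = [];
foreach ($matches as $img) {
    $found[] = $dom->saveHTML($img);
}

print_r($found);
```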
I want to create a regex that matches the text between the opening and the matching closing angle bracket of an HTML img tag with PHP. Let's say I have the HTML text in the variable $searchThis:
$searchThis = "<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>";
I want to match the content inside each img tag. The result must be the following matches:
src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
This is how I imagine the pattern should look, but it doesn't actually work for me:
$pattern = "<img([^\/]+)\/>";
Never try to parse HTML with regex. For parsing HTML, use a DOM parser. Consider code like this:
$html = <<< EOF
<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img");
for($i=0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $src = $node->attributes->getNamedItem('src')->nodeValue;
    echo "src='$src'\n";
}
OUTPUT:
src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
Try:
preg_match_all("`<img (.*)/>`Uis", $searchThis, $results);
print_r($results);
Printing the structure of $results will show you its content.
Note: If you wish to be more accurate, I would suggest including src= in your search and matching up to the closing quote mark, in order to select only the image address. You can then add the missing text (src=) afterwards.
This way, you still get the relative path even when the image tag doesn't look as expected (i.e. there is other content in the tag, such as alt="Smiley face" height="42" width="42").
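A sketch of what that more accurate pattern might look like. The pattern is my own illustration of the note, not the answerer's code, and the sample string adds a tag with extra attributes; it accepts single or double quotes and still shares the usual limits of regex on HTML.

```php
<?php
// Assumption: sample input with one plain tag and one with extra attributes.
$searchThis = "<div><img src='/relative/path/img1.png'/></div>"
            . '<img alt="Smiley face" src="/relative/path/img2.png" height="42" width="42">';

// Group 1 captures the opening quote, group 2 the address, and \1 requires
// the same quote character to close it.
preg_match_all(
    '/<img\b[^>]*?\bsrc\s*=\s*([\'"])(.*?)\1/is',
    $searchThis,
    $results
);

$sources = $results[2];

// Add the "src=" text back afterwards, as the note suggests.
foreach ($sources as $src) {
    echo "src='$src'\n";
}
```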
Example Parsing With simplehtmldom
<?php
include("simplehtmldom/simple_html_dom.php");
// Create DOM from URL or file
$html = str_get_html("<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>");
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
?>
I use SimpleHTMLDOM to grab content from another web page, but I have a problem: how do I get only the URLs of anchor tags that wrap an image? The page contains plain link anchors as well as image anchors, and I only want the href value of the image anchors.
<a href="I DO NOT NEED THIS VALUE"></a>
<a href="I NEED THIS VALUE"><img src="xxxx"></a>
But when I query the DOM, it returns all the href URLs, including those of plain link anchors. I just need the URLs of the image anchors.
I use this code to query:
$hrefl = $html->find('a');
$count = 1;
for( $i = 0; $i < 50; $i++){
    echo $hrefl[$count]->href;
    $count++;
}
You need the href attribute of every link that contains an image tag. With XPath it's quite simple:
//a/img/../@href
You wrote that you use DOM, but your code looks like it was written with Simple HTML DOM. That library is limited and nowadays no longer needed, because PHP has the DOMDocument and DOMXPath classes. I think Simple HTML DOM has no XPath support.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query('//a/img/../@href');
$count = $hrefs->length;
foreach($hrefs as $href)
{
    echo $href->nodeValue, "\n";
}
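A self-contained sketch of this approach follows. It uses the equivalent predicate form //a[img]/@href, which is my own substitution for the parent-step expression in the answer, and the sample markup and the $hrefs array are assumptions for illustration.

```php
<?php
// Assumption: sample markup with one plain link and one image link.
$html = '<a href="/plain-link">text</a>'
      . '<a href="/image-link"><img src="xxxx"></a>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the href attribute of every <a> that has an <img> child.
// Use //a[.//img]/@href instead if the image may be nested deeper.
$hrefs = [];
foreach ($xpath->query('//a[img]/@href') as $href) {
    $hrefs[] = $href->nodeValue;
}

print_r($hrefs);
```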
You are probably using the simplehtmldom library for parsing.
I am not very familiar with it; I use DOMDocument for all my parsing.
A very quick solution I can suggest is to check whether the anchor tag has an image inside it: if it does, get the value; otherwise skip it.
Something like this:
<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile($urlofhtmlpage);
foreach($doc->getElementsByTagName('a') as $a){
    foreach($a->getElementsByTagName('img') as $img){
        echo $a->getAttribute('href');
    }
}
?>
Try this:
$hrefl = $html->find('a');
$count = 1;
for( $i = 0; $i < 50; $i++){
    $img = $hrefl[$count]->find('img');
    // check if var exists and is valid
    if ($img ... ) {
        echo $hrefl[$count]->href;
    }
    $count++;
}