RegEx for linked images of certain class - php

I don't have access to an HTML parser on my server, so I need to do this via RegEx and PHP. I want to match all occurrences of linked images of a certain class within a large content string.
Here's a sample taken out of the larger content string that I want to match:
<a href='url'><img width="150" height="150" src="url" class="attachment-thumbnail" alt="Description" /></a>
This seems to match class="attachment-thumbnail":
(class=("|"([^"]*)\s)attachment-thumbnail("|\s([^"]*)"))
This seems to match everything from the opening <a> tag to the closing </a>, but it also catches other linked images in the larger content string that don't have class="attachment-thumbnail":
/(<a[^>]*)(href=)([^>]*?)(><img[^>]*></a>)/igm
How can I combine the above two to match only those HREFed images of class="attachment-thumbnail"?
Thanks for your help.

Try something like the following:
$html = '<img width="150" height="150" src="url" class="attachment-thumbnail" alt="Description" />';
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('img') as $item) {
    // Only report images whose class attribute is exactly 'attachment-thumbnail'
    if ($item->getAttribute('class') == 'attachment-thumbnail') {
        echo $item->getAttribute('src');
    }
}
To remove all elements that match the class 'attachment-thumbnail':
$html = '<img width="150" height="150" src="url" class="attachment-thumbnail" alt="Description" />';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select <img> elements (not <div>) whose class attribute contains the class name
foreach ($xpath->query('//img[contains(attribute::class,"attachment-thumbnail")]') as $elem) {
    $elem->parentNode->removeChild($elem);
}
echo $dom->saveHTML($dom->documentElement);
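The snippets above look at every image, whereas the question asks only for images that are wrapped in a link. A minimal sketch of that narrower selection, assuming DOMDocument is available after all, might look like the following. The function name findLinkedImages is hypothetical, and the class name is assumed to contain no double quotes because it is placed inside an XPath string literal.

```php
<?php
// Hypothetical helper: returns href/src pairs for images of a given class
// that are direct children of a link.
function findLinkedImages(string $html, string $class): array
{
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad both the attribute value and the class name with spaces so that
    // only a whole class name matches (assumes $class has no double quotes).
    $query = sprintf(
        '//a[@href]/img[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $found = [];
    foreach ($xpath->query($query) as $img) {
        $found[] = [
            'href' => $img->parentNode->getAttribute('href'),
            'src'  => $img->getAttribute('src'),
        ];
    }
    return $found;
}
```

Calling findLinkedImages($content, 'attachment-thumbnail') should return one entry per linked thumbnail and skip both unlinked thumbnails and linked images of other classes.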

Related

Add specific class to img tags if missing

I am using Nette PHP (the framework shouldn't matter), and I'm trying to rewrite parts of an HTML string: if an image tag already has a class attribute, image-responsive should be added to it, and if not, the tag should get a new attribute class="image-responsive".
I'm getting that HTML as a string, which will be saved in database!
This is my current code. It can find the strings, but what I need help with is replacing parts of the html.
public static function ImageAddClass($string)
{
// Match Img with class="$1 (group 1 here)"
$regex_img = '/(<img)([^>]*[^>]*)(\/>)/mi';
$regex_imgClass = '/(<img[^>]* )(class=\")([^\"]*\"[^>]*>)/mi';
$html = $string;
if (preg_match_all($regex_img, $html, $matches)) {
for ($x = 0; $x < count($matches[0]); $x++) {
bdump($matches[0]);
bdump($matches[0][$x]);
bdump($x);
if (preg_match($regex_imgClass, $matches[0][$x])) {
$html = preg_replace($regex_imgClass, '$1class="image-responsiveO $3', $html);
} else if (preg_match($regex_img, $matches[0][$x])) {
$html = preg_replace($regex_img, '$1 class="image-responsiveN" $2$3', $html);
}
}
return $html;
}
}
Covering all scenarios where an img tag might have no class attribute, an orphaned class attribute, a blank class attribute, a class attribute with one or more other words, and a class attribute that already contains image-responsive -- I prefer to use XPath to filter the elements.
Not only is parsing HTML with a legitimate DOM parser like DOMDocument more robust/reliable than regex, the accompanying XPath syntax is highly intuitive.
Pay close attention to how the XPath query pads the haystack class and the needle class with spaces as a means to ensure whole word matching.
Any images that are iterated will have the desired value added to the element's class attribute.
Code: (Demo)
$html = <<<HTML
<div>
<img src="">
<img src="" class>
<img src="" class="image-responsive">
<img src="" class="">
<img src="image-responsive" class="classy">
<img src="" class="image-responsiveness">
<span class='NOT-responsive'></span>
<img src="" class = "foo image-responsive">
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img[not(contains(concat(" ", @class, " "), " image-responsive "))]') as $img) {
    $img->setAttribute('class', ltrim($img->getAttribute('class') . ' image-responsive'));
}
echo $dom->saveHTML();
Output:
<div>
<img src="" class="image-responsive">
<img src="" class="image-responsive">
<img src="" class="image-responsive">
<img src="" class="image-responsive">
<img src="image-responsive" class="classy image-responsive">
<img src="" class="image-responsiveness image-responsive">
<span class="NOT-responsive"></span>
<img src="" class="foo image-responsive">
</div>
Related content:
Replace empty alt in wordpress post content with filter
Xpath syntax for "and not contains"
Parsing HTML with PHP To Add Class Names
How can I match on an attribute that contains a certain string?
As a slight variation, you can access all img tags without XPath, then use preg_match() calls to determine which tags should receive the new class. The word boundary character \b is not useful in this case because class names may contain non-word characters.
Code: (Demo)
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('img') as $img) {
    $class = $img->getAttribute('class');
    if (!preg_match('/(?:^| )image-responsive(?: |$)/', $class)) {
        $img->setAttribute('class', ltrim("$class image-responsive"));
    }
}
echo $dom->saveHTML();
// same output as first snippet
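To illustrate the remark about \b, here is a small sketch comparing the two patterns on a class attribute string. Both function names are hypothetical and exist only for the comparison. The hyphen is a non-word character, so \b sees a boundary right after "responsive" in a class such as image-responsive-lg.

```php
<?php
// Hypothetical helper: word-boundary version, shown only to demonstrate the problem.
function matchesWithWordBoundary(string $classAttr): bool
{
    return preg_match('/\bimage-responsive\b/', $classAttr) === 1;
}

// Hypothetical helper: the space-or-edge pattern used in the snippet above.
function matchesWholeClass(string $classAttr): bool
{
    return preg_match('/(?:^| )image-responsive(?: |$)/', $classAttr) === 1;
}
```

With the class value image-responsive-lg, the \b version reports a match even though image-responsive is not one of the classes, while the space-delimited version does not.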

Replace all images in HTML with text

I am trying to replace all images in some HTML which meet specific requirements with the appropriate text. The specific requirements are that they are of class "replaceMe" and the image src filename is in $myArray. Upon searching for solutions, it appears that some sort of PHP DOM technique is appropriate, however, I am very new with this. For instance, given $html, I wish to return $desired_html. At the bottom of this post is my attempted implementation which currently doesn't work. Thank you
$myArray=array(
'goodImgage1'=>'Replacement for Good Image 1',
'goodImgage2'=>'Replacement for Good Image 2'
);
$html = '<div>
<p>Random text and an <img src="goodImgage1.png" alt="" class="replaceMe">. More random text.</p>
<p>Random text and an <img src="goodImgage2.png" alt="" class="replaceMe">. More random text.</p>
<p>Random text and an <img src="goodImgage2.png" alt="" class="dontReplaceMe">. More random text.</p>
<p>Random text and an <img src="badImgage1.png" alt="" class="replaceMe">. More random text.</p>
</div>';
$desiredHtml = '<div>
<p>Random text and an Replacement for Good Image 1. More random text.</p>
<p>Random text and an Replacement for Good Image 2. More random text.</p>
<p>Random text and an <img src="goodImgage2.png" alt="" class="dontReplaceMe">. More random text.</p>
<p>Random text and an <img src="badImgage1.png" alt="" class="replaceMe">. More random text.</p>
</div>';
Below is what I am attempting to do..
libxml_use_internal_errors(true); // Temporarily suppress errors caused by malformed HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
//What does this do for me?
$imgs= $doc->getElementsByTagName('img');
foreach ($imgs as $img){}
$xpath = new DOMXPath($doc);
foreach( $xpath->query( '//img') as $img) {
if(true){ //How do I check class and image name?
$new = $doc->createTextNode("New Attribute");
$img->parentNode->replaceChild($new,$img);
}
}
$html=$doc->saveHTML();
libxml_use_internal_errors(false);
You were on the right track. Do it like this:
$myArray=array(
'goodImgage1.png'=>'Replacement for Good Image 1',
'goodImgage2.png'=>'Replacement for Good Image 2'
);
$html = '<div>
<p>Random text and an <img src="goodImgage1.png" alt="" class="replaceMe">. More random text.</p>
<p>Random text and an <img src="goodImgage2.png" alt="" class="replaceMe">. More random text.</p>
<p>Random text and an <img src="goodImgage2.png" alt="" class="dontReplaceMe">. More random text.</p>
<p>Random text and an <img src="badImgage1.png" alt="" class="replaceMe">. More random text.</p>
</div>';
$classesToReplace = array('replaceMe');
libxml_use_internal_errors(true); // Temporarily suppress errors caused by malformed HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $img) {
    // Split the class attribute into an array of class names
    $classes = explode(' ', $img->getAttribute('class'));
    $classMatches = array_intersect($classes, $classesToReplace);
    // The src value is used directly as the $myArray key
    $imageName = $img->getAttribute('src');
    if (isset($myArray[$imageName]) && $classMatches) {
        $new = $doc->createTextNode($myArray[$imageName]);
        $img->parentNode->replaceChild($new, $img);
    }
}
var_dump($html = $doc->saveHTML());
Please note the following:
I made the code check for images that have the replaceMe class, potentially in addition to other classes
I added the full image file names to your $myArray keys, basically for simplicity.
likeitlikeit was faster, but I'll post my answer anyway because it differs in some details: XPath does the job of selecting only <img> elements with the appropriate class attribute, and pathinfo() extracts the filename without its extension.
$doc = new DOMDocument();
$doc->loadHTML($h); // assume HTML in $h
$xpath = new DOMXPath($doc);
$imgs = $xpath->query("//img[@class = 'replaceMe']");
foreach ($imgs as $img) {
    $imgfile = pathinfo($img->getAttribute("src"), PATHINFO_FILENAME);
    if (array_key_exists($imgfile, $myArray)) {
        $replacement = $doc->createTextNode($myArray[$imgfile]);
        $img->parentNode->replaceChild($replacement, $img);
    }
}
echo "<pre>" . htmlentities($doc->saveHTML()) . "</pre>";
see it working: http://codepad.viper-7.com/11XZt7
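The two answers above can be combined: one tolerates extra classes on the image, the other keeps the original $myArray keys without file extensions. The following is only a sketch of that combination. The function name replaceImagesWithText is hypothetical, the class name is assumed to contain no double quotes, and the function returns the full document that saveHTML() produces (including the doctype and html/body wrapper).

```php
<?php
// Hypothetical helper combining the class check and the pathinfo() lookup.
function replaceImagesWithText(string $html, array $replacements, string $class): string
{
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Whole-class match, so class="replaceMe big" also qualifies.
    $query = sprintf(
        '//img[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    foreach ($xpath->query($query) as $img) {
        // Look up by file name without extension: "goodImgage1.png" -> "goodImgage1"
        $name = pathinfo($img->getAttribute('src'), PATHINFO_FILENAME);
        if (array_key_exists($name, $replacements)) {
            $text = $doc->createTextNode($replacements[$name]);
            $img->parentNode->replaceChild($text, $img);
        }
    }
    return $doc->saveHTML();
}
```

For example, replaceImagesWithText($html, $myArray, 'replaceMe') should leave the dontReplaceMe image and the badImgage1.png image untouched.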

preg_match move selection above paragraph

I'm wanting to move images above their container paragraphs in a large body of text using preg_replace.
So, I might have
$body = '<p><img src="a" alt="image"></p><img src="b" alt="image"><p>something here<img src="c" alt="image"> text</p>'
What I want (apart from the 40' yacht, etc.):
<img src="a" alt="image"><p></p><img src="b" alt="image"><img src="c" alt="image"><p>something here text</p>
I've got this, which isn't working:
$body = preg_replace('/(<p>.*\s*)(<img.*\s*?image">)(.*\s*?<\/p>)/', '$2$1$3',$body);
It results in:
<img src="c" alt="image"><p><img src="a" alt="image"></p><img src="b" alt="image"><p>something here text</p>
You should load the HTML with DOMDocument and use its operations to move nodes around:
$content = <<<EOM
<p><img src="a" alt="image"></p>
<img src="b" alt="image"><p>something here<img src="c" alt="image"> text</p>
EOM;
$doc = new DOMDocument;
$doc->loadHTML($content);
$xp = new DOMXPath($doc);
// find images that are direct children of a paragraph
foreach ($xp->query('//p/img') as $img) {
    $parent = $img->parentNode;
    // move the image so it becomes the previous sibling of its parent paragraph
    $parent->parentNode->insertBefore($img, $parent);
}
echo $doc->saveHTML();
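Note that saveHTML() without arguments returns a whole document, including the doctype and the html/body wrapper that loadHTML() adds. If only the rearranged fragment is wanted, one possible approach is to serialize the children of body one by one. This is a sketch under that assumption; the function name moveImagesAboveParagraphs is hypothetical.

```php
<?php
// Hypothetical helper: moves each <img> that is a direct child of a <p>
// in front of that paragraph and returns only the body content.
function moveImagesAboveParagraphs(string $html): string
{
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//p/img') as $img) {
        $paragraph = $img->parentNode;
        $paragraph->parentNode->insertBefore($img, $paragraph);
    }

    // Serialize only the children of <body>, skipping the added wrapper.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Images that sit deeper inside a paragraph (for example inside a span) are not matched by //p/img and stay where they are.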

How to get img tag value inside a specific div and specific anchor tag using regular expression

I am new to regular expressions and have tried many times to get the value of an image tag inside an anchor tag.
This is my HTML:
<div class="smallSku" id="ctl00_ContentPlaceHolder1_smallImages">
<a title="" name="http://www.playg.in/productImages/med/PNC000051_PNC000051.jpg" href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051.jpg" onclick="return showPic(this)" onmouseover="return showPic(this)">
<img border="0" alt="" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051.jpg"></a> <a title="PNC000051_PNC000051_1.jpg" name="http://www.playg.in/productImages/med/PNC000051_PNC000051_1.jpg" href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051_1.jpg" onclick="return showPic(this)" onmouseover="return showPic(this)">
<img border="0" alt="PNC000051_PNC000051_1.jpg" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051_1.jpg"></a>
</div>
I want to return only the src value of each image tag. I tried this pattern with preg_match_all():
"#<div[\s\S]class="smallSku"[\s\S]id="ctl00_ContentPlaceHolder1_smallImages"\><a title=\"\" name="[\w\W]" href="[\w\W]" onclick=\"[\w\W]" onmouseover="[\w\W]"\><img[\s\S]src="(.*)"[\s\S]></a><\/div>#"
Please help. I have tried many times, including the approach from the question "Match image tag not nested in an anchor tag using regular expression".
Regular expressions are not the right tool for parsing HTML. See this FAQ: How to parse and process HTML/XML?
Here is an example of how to get the src attribute using your example:
$doc = new DOMDocument();
$doc->loadHTML($your_html_string);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="smallSku"]/a/img/@src') as $attr) {
    $src = $attr->value;
    print $src;
}
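The query above compares the whole class attribute, so it stops matching if the div ever gains a second class. Since the sample div also carries an id, selecting by id may be more stable. The following is a sketch of that variation; the function name imageSourcesInsideId is hypothetical, and the id is assumed to contain no double quotes.

```php
<?php
// Hypothetical helper: collects the src of every linked <img> inside the
// element with the given id.
function imageSourcesInsideId(string $html, string $id): array
{
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumes $id has no double quotes, since it sits inside an XPath literal.
    $query = sprintf('//*[@id="%s"]//a/img/@src', $id);

    $sources = [];
    foreach ($xpath->query($query) as $attr) {
        $sources[] = $attr->value; // each result is a DOMAttr node
    }
    return $sources;
}
```

For the markup in the question, the id to pass would be ctl00_ContentPlaceHolder1_smallImages.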
Try this, sunith:
$content = file_get_contents('your url');
preg_match_all("|<div class='items'>.*</div>|", $content, $arr, PREG_PATTERN_ORDER);
preg_match_all("/src='([^']+)'/", $arr[0][0], $arrr, PREG_PATTERN_ORDER);
echo '<pre>';
print_r($arrr);

Getting a specific URL using simple_html_dom based on the end of the URL

I need to grab a URL using simple_html_dom based on the end of the URL. The URL has no specific class to make it unique. The only thing unique about it is that it ends with a specific set of numbers. I just cannot figure out the proper syntax to grab that specific URL and then print it.
Any help?
EXAMPLE:
<table class="findList">
<tr class="findResult odd"> <td class="primary_photo"> <a href="/title/tt0080487/?ref_=fn_al_tt_1" ><img src="http://ia.media-imdb.com/images/M/MV5BNzk2OTE2NjYxNF5BMl5BanBnXkFtZTYwMjYwNDQ5._V1_SY44_CR0,0,32,44_.jpg" height="44" width="32" /></a> </td>
That is the code for the beginning of the table. That first href is the one I want to grab. The table continues with more links, etc, but that's not relevant to what I want.
For the first a with a href ending in 1:
$dom->find('a[href$="1"]', 0);
You can simply use DOMDocument:
<?php
$html = '
<table class="findList">
<tr class="findResult odd">
<td class="primary_photo">
<a href="/title/tt0080487/?ref_=fn_al_tt_1" ><img src="http://ia.media-imdb.com/images/M/MV5BNzk2OTE2NjYxNF5BMl5BanBnXkFtZTYwMjYwNDQ5._V1_SY44_CR0,0,32,44_.jpg" height="44" width="32" /></a>
</td>
';
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
foreach ($dom->getElementsByTagName('td') as $td) {
    if ($td->getAttribute('class') == 'primary_photo') {
        $a = $td->getElementsByTagName('a')->item(0)->getAttribute('href');
    }
}
echo $a; // /title/tt0080487/?ref_=fn_al_tt_1
// Or, if you're looking to get the img tag's src:
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('td') as $td) {
    if ($td->getAttribute('class') == 'primary_photo') {
        $a = $td->getElementsByTagName('img')->item(0)->getAttribute('src');
    }
}
echo $a; // http://ia.media-imdb.com/images/M/MV5BNzk2OTE2NjYxNF5BMl5BanBnXkFtZTYwMjYwNDQ5._V1_SY44_CR0,0,32,44_.jpg
?>
Assuming you have your HTML in a file called "tables.html", this will work. It reads the file, finds all the 'a' links, and puts them into an array; the first one ($anchors[0]) is the one you want. Then you get the href from it with $anchors[0]->href.
$html = new simple_html_dom();
$html->load_file('tables.html');
$anchors = $html->find("a");
echo $anchors[0]->href;
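None of the DOMDocument snippets above actually tests how the URL ends, which is what the question asks for. XPath 1.0 (the version DOMXPath uses) has starts-with() and contains() but no ends-with(), so one option is to select every link and check the suffix in PHP. This is a sketch under that assumption; the function name firstHrefEndingWith is hypothetical.

```php
<?php
// Hypothetical helper: returns the first href that ends with $suffix, or null.
function firstHrefEndingWith(string $html, string $suffix): ?string
{
    if ($suffix === '') {
        return null; // an empty suffix would match every link, so treat it as "no match"
    }

    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        // Compare the tail of the href with the wanted suffix.
        if (substr($href, -strlen($suffix)) === $suffix) {
            return $href;
        }
    }
    return null;
}
```

A longer suffix such as '_tt_1' is less likely than '1' to match an unrelated link.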
