I'm trying to get the CSS background-image URL from an HTML attribute using Simple HTML DOM.
This is the code:
$String=' <div class="Wrapper"><i style="background-image: url(https://example.com/backgroud-1035.jpg);" class="uiMediaThumbImg"></i></div>';
$html = new simple_html_dom();
$html->load($String);
foreach($html->find('i') as $a0)
$src[$i++]=$a0->style;
foreach( $src as $css )
print($css);
The output looks like this:
background-image: url(https://example.com/backgroud-1035.jpg);
All I want is to strip the background URL out of the rest of the CSS declaration, like this:
https://example.com/backgroud-1035.jpg
You can use a regex to pull out the text between the parentheses.
foreach($html->find('i') as $a0){
$style = $a0->style;
preg_match('/\(([^)]+)\)/', $style, $match);
$src[$i++] = $match[1];
//echo $match[1];
}
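The pattern above grabs whatever sits inside the first pair of parentheses. As a hedged variation, the sketch below anchors on `url(` and tolerates optional quotes around the URL. The helper name `extractBackgroundUrl` is hypothetical and is not part of Simple HTML DOM; it only works on the style string that `$a0->style` returns.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns the first
// url(...) value found in a style string, or null when there is none.
function extractBackgroundUrl(string $style): ?string
{
    // "url(", optional whitespace, optional quote, then capture everything
    // up to the closing quote or parenthesis.
    if (preg_match('/url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i', $style, $match)) {
        return trim($match[1]);
    }
    return null;
}

$style = 'background-image: url(https://example.com/backgroud-1035.jpg);';
echo extractBackgroundUrl($style), "\n";
```

Inside the loop this would be used as `$src[] = extractBackgroundUrl($a0->style);`. It assumes the URL itself contains no quotes or closing parentheses.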
Maybe you have already found the answer, but for anyone who has not:
I will use the explode() function to break the string into an array; see the documentation for explode() for details.
First I split the $css variable on "url(". For this input:
background-image: url(https://example.com/backgroud-1035.jpg);
the array looks like this:
Array ( [0] => background-image: [1] => https://example.com/backgroud-1035.jpg); )
Then I split element [1] on ")", which gives:
Array ( [0] => https://example.com/backgroud-1035.jpg [1] => ; )
Finally, print element [0].
// it will output https://example.com/backgroud-1035.jpg
full code:
$String = '<div class="Wrapper"><i style="background-image: url(https://example.com/backgroud-1035.jpg);" class="uiMediaThumbImg"></i></div>';
$html = new simple_html_dom();
$html->load($String);
$src = array();
foreach ($html->find('i') as $a0) {
    $src[] = $a0->style; // collect each style attribute
}
foreach ($src as $css) {
    $explode1 = explode("url(", $css);      // [1] holds everything after "url("
    $explode2 = explode(")", $explode1[1]); // [0] holds everything before ")"
    print_r($explode2[0]);
}
I have this string:
&lt;img src=images/imagename.gif alt='descriptive text here'&gt;
and I am trying to split it up into the following two strings (an array of two strings, or whatever, just broken up):
imagename.gif
descriptive text here
Note that yes, it's actually &lt; and not <. The same goes for the &gt; at the end of the string.
I know regex is the answer, but I'm not good enough at regex to know how to pull it off in PHP.
Try this:
<?php
$s="<img src=images/imagename.gif alt='descriptive text here'>";
preg_match("/^[^\/]+\/([^ ]+)[^']+'([^']+)/", $s, $a);
print_r($a);
Output:
Array
(
[0] => <img src=images/imagename.gif alt='descriptive text here
[1] => imagename.gif
[2] => descriptive text here
)
It is better to use DOM with XPath rather than a regex:
<?php
$your_string = html_entity_decode("<img src=images/imagename.gif alt='descriptive text here'>");
$dom = new DOMDocument;
$dom->loadHTML($your_string);
$x = new DOMXPath($dom);
foreach($x->query("//img") as $node)
{
echo $node->getAttribute("src");
echo $node->getAttribute("alt");
}
?>
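The loop above prints the full src value (images/imagename.gif), while the question asks for the bare file name. A minimal sketch, assuming the entities have already been decoded, that uses basename() to drop the directory part; the variable names `$input` and `$parts` are illustrative only:

```php
<?php
$input = "<img src=images/imagename.gif alt='descriptive text here'>";

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($input);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$parts = [];
foreach ($dom->getElementsByTagName('img') as $node) {
    $parts[] = basename($node->getAttribute('src')); // file name without the directory
    $parts[] = $node->getAttribute('alt');
}
print_r($parts);
```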
I can't seem to figure out the proper regular expression for extracting just specific numbers from a string. I have an HTML string that has various img tags in it. There are a bunch of img tags in the HTML that I want to extract a portion of the value from. They follow this format:
<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />
So, varying lengths of numbers before what 'usually' is a .jpg (it may be a .gif, .png, or something else too). I want to only extract the number from that string.
The 2nd part of this is that I want to use that number to look up an entry in a database and grab the alt/title tag for that specific id of image. Lastly, I want to add that returned database value into the string and throw it back into the HTML string.
Any thoughts on how to proceed with it would be great...
Thus far, I've tried:
$pattern = '/img src="http://domain.com/images/[0-9]+\/.jpg';
preg_match_all($pattern, $body, $matches);
var_dump($matches);
I think this is the best approach:
1. Use an HTML parser to extract the image tags
2. Use a regular expression (or perhaps string manipulation) to extract the ID
3. Query for the data
4. Use the HTML parser to insert the returned data
Here is an example. There are improvements I can think of, such as using string manipulation instead of a regex.
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$doc = new DOMDocument;
$doc->loadHtml( $html);
foreach( $doc->getElementsByTagName('img') as $img)
{
$src = $img->getAttribute('src');
preg_match( '#/images/([0-9]+)\.#i', $src, $matches);
$id = $matches[1];
echo 'Fetching info for image ID ' . $id . "\n";
// Query stuff here
$result = 'Got this from the DB';
$img->setAttribute( 'title', $result);
$img->setAttribute( 'alt', $result);
}
$newHTML = $doc->saveHtml();
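The string-manipulation alternative mentioned above could look roughly like the following sketch. It relies only on the built-in parse_url(), pathinfo() and ctype_digit() functions; the helper name `imageIdFromSrc` is hypothetical, and it assumes the ID is the whole file name before the extension.

```php
<?php
// Hypothetical helper: returns the numeric file name of an image URL,
// or null when the file name is not purely numeric.
function imageIdFromSrc(string $src): ?int
{
    $path = parse_url($src, PHP_URL_PATH);      // e.g. "/images/1249.jpg"
    if (!is_string($path)) {
        return null;
    }
    $name = pathinfo($path, PATHINFO_FILENAME); // e.g. "1249"
    return ctype_digit($name) ? (int) $name : null;
}

var_dump(imageIdFromSrc('http://domain.com/images/1249.jpg'));
```

In the loop above, `$id = imageIdFromSrc($src);` would replace the preg_match call, with a null check before querying the database.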
Using regular expressions, you can get the number really easily. The third argument for preg_match_all is a by-reference array that will be populated with the matches that were found.
preg_match_all('/<img src="http:\/\/domain.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);
print_r($matches);
This would contain all of the stuff that it found.
Consider using preg_replace_callback.
Use this regex: (images/([0-9]+)[^"]+")
Then, as the callback argument, use an anonymous function. Result:
$output = preg_replace_callback(
"(images/([0-9]+)[^\"]+\")",
function($m) {
// $m[1] is the number.
$t = getTitleFromDatabase($m[1]); // do whatever you have to do to get the title
return $m[0]." title=\"".$t."\"";
},
$input
);
Use preg_match_all:
preg_match_all('#<img.*?/(\d+)\.#', $str, $m);
print_r($m);
output:
Array
(
[0] => Array
(
[0] => <img src="http://domain.com/images/59.
[1] => <img src="http://domain.com/images/549.
[2] => <img src="http://domain.com/images/1249.
[3] => <img src="http://domain.com/images/6.
)
[1] => Array
(
[0] => 59
[1] => 549
[2] => 1249
[3] => 6
)
)
This regex should match the number parts:
\/images\/(?P<digits>[0-9]+)\.[a-z]+
Your $matches['digits'] should have all of the digits you want as an array.
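A short sketch of how that named group could be used with preg_match_all. The `#` delimiters and the `i` flag are my own choices here, not part of the answer; with `#` as the delimiter the slashes need no escaping.

```php
<?php
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.png" class="something" />';

// The named group "digits" is available both by name and by index 1.
preg_match_all('#/images/(?P<digits>[0-9]+)\.[a-z]+#i', $html, $matches);
print_r($matches['digits']);
```

Note that the captured values are strings, so cast them with (int) before using them as database IDs.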
Regular expressions alone are on rather shaky ground when it comes to parsing messy HTML. DOMDocument's HTML handling copes well with tag soup, so use it to load the markup, XPath to select your image srcs, and a simple sscanf to extract the number:
$ids = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
if (sscanf($src, '%*[^0-9]%d', $number)) {
$ids[] = $number;
}
}
Because that only gives you an array, why not encapsulate it?
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$imageNumbers = new ImageNumbers($html);
var_dump((array) $imageNumbers);
Which gives you:
array(4) {
[0]=>
int(59)
[1]=>
int(549)
[2]=>
int(1249)
[3]=>
int(6)
}
This is the function above, nicely wrapped into an ArrayObject:
class ImageNumbers extends ArrayObject
{
public function __construct($html) {
parent::__construct($this->extractFromHTML($html));
}
private function extractFromHTML($html) {
$numbers = array();
$doc = new DOMDocument();
$preserve = libxml_use_internal_errors(TRUE);
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
if (sscanf($src, '%*[^0-9]%d', $number)) {
$numbers[] = $number;
}
}
libxml_use_internal_errors($preserve);
return $numbers;
}
}
If your HTML is so malformed that even DOMDocument::loadHTML() can't handle it, you only need to deal with that internally in the ImageNumbers class.
$matches = array();
preg_match_all('/[[:digit:]]+/', $htmlString, $matches);
Then loop through the matches array both to reconstruct the HTML and to do your lookup in the database.
I have this code that extracts the first image from an article in joomla:
<?php preg_match('/<img (.*?)>/', $this->article->text, $match); ?>
<?php echo $match[0]; ?>
Is there a way to extract all the images that are available in the article and not only one?
I would suggest first that you not use regular expressions to parse HTML. You should use an appropriate parser such as DOMDocument::loadHTML, which uses libxml.
Then you may query for the desired tags you want. Something like this may work (untested):
$doc = new DOMDocument;
$doc->loadHTML($htmlSource);
$xpath = new DOMXPath($doc);
$query = '//img';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
// $entry->getAttribute('src')
}
Use preg_match_all. And you'll want to modify the pattern like so to take into account the trailing '/' inside the img tag.
$str = '<img src="asdf" />stuff more stuff <img src="qwerty" />';
preg_match_all('/<img (.*?)\/>/', $str, $matches);
print_r($matches);
Array
(
[0] => Array
(
[0] => <img src="asdf" />
[1] => <img src="qwerty" />
)
[1] => Array
(
[0] => src="asdf"
[1] => src="qwerty"
)
)
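The pattern above only matches tags that end in `/>`. As a hedged sketch, the variation below matches img tags with or without the trailing slash and in any letter case. It assumes no attribute value contains a literal `>`, which a regex like this cannot handle.

```php
<?php
$str = '<img src="asdf" />stuff <img src="qwerty"> more <IMG alt="x" src="zxcv"/>';

// \b keeps "<imgfoo" from matching; [^>]* runs up to the end of the tag.
preg_match_all('/<img\b[^>]*>/i', $str, $matches);
print_r($matches[0]);
```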
I would like to create a page where all images which reside on my website are listed with title and alternative representation.
I already wrote a little program to find and load all HTML files, but now I am stuck at how to extract src, title and alt from this HTML:
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
I guess this should be done with some regex, but since the order of the tags may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard char by char way, but that's painful).
$url="http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo $tag->getAttribute('src');
}
EDIT: now that I know better
Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better to use an HTML parser.
Solution with regexp
In that case it's better to split the process into two parts:
1. get all the img tags
2. extract their metadata
I will assume your document is not strict XHTML, so you can't use an XML parser. For example, with this web page's source code:
/* preg_match_all matches the regexp against the whole $html string and outputs
everything as an array in $result. The "i" option makes it case insensitive */
preg_match_all('/<img[^>]+>/i',$html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
Then we get all the img tag attributes with a loop :
$img = array();
foreach( $result as $img_tag)
{
preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]);
}
print_r($img);
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading from / saving to a text file.
How does this stuff work?
First, we use preg_match_all, a function that finds every string matching the pattern and outputs the matches in its third parameter.
The regexps:
<img[^>]+>
We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains characters other than ">", and ends with a ">".
(alt|title|src)=("[^"]*")
We apply it to each img tag in turn. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a ' " ', a run of characters that are not ' " ', and a closing ' " '. The parentheses isolate the sub-strings.
Finally, whenever you work with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.
EDIT : answer to the first comment.
It's true that I did not think about the (hopefully few) people using single quotes.
Well, if you use only ', just replace every " with '.
If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^ø] to replace [^"].
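For mixed quoting, one possible approach (a sketch of my own, not the pattern from the answer) is to capture the opening quote and require the same character to close the value with a backreference. The variable names are illustrative only.

```php
<?php
$tag = <<<'HTML'
<img src='/image/a.jpg' title="Harvey's bunny" alt="cute">
HTML;

// Group 2 captures the opening quote; \2 requires the same quote to close,
// so an apostrophe inside a double-quoted value does not end the match.
preg_match_all('/\b(alt|title|src)\s*=\s*(["\'])(.*?)\2/is', $tag, $sets, PREG_SET_ORDER);

$attributes = [];
foreach ($sets as $set) {
    $attributes[strtolower($set[1])] = $set[3];
}
print_r($attributes);
```

This still ignores unquoted values and would also pick up names such as data-alt, so it remains a workaround rather than a substitute for a parser.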
Just to give a small example of using PHP's XML functionality for the task:
$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}
I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using xpath and the xpath results simpler.
If it's XHTML, as your example is, you only need SimpleXML.
<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>
Output:
object(SimpleXMLElement)#1 (1) {
["@attributes"]=>
array(3) {
["src"]=>
string(22) "/image/fluffybunny.jpg"
["title"]=>
string(16) "Harvey the bunny"
["alt"]=>
string(26) "a cute little fluffy bunny"
}
}
I used preg_match to do it.
In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress and I was trying to get the src attribute so I could run it through timthumb.
// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);
// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);
To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex isn't good enough to grab all three (alt/title/src) in one pass.
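Grabbing all three in one pass is possible with an alternation and preg_match_all. The following is a minimal sketch that assumes double-quoted attributes, as in the patterns above; the sample tag and the variable names are made up for illustration.

```php
<?php
$image = '<img width="50" src="/a.jpg" alt="A" title="T">';

// $m[1] holds the attribute names, $m[2] the matching values.
preg_match_all('/\b(src|alt|title)="([^"]*)"/i', $image, $m);

// Pair names with values: ['src' => ..., 'alt' => ..., 'title' => ...]
$attrs = array_combine(array_map('strtolower', $m[1]), $m[2]);
print_r($attrs);
```

An attribute that is missing from the tag is simply absent from `$attrs`, so check with isset() before reading it.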
You may use simplehtmldom. Most of the jQuery selectors are supported in simplehtmldom. An example is given below
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
The script must be edited like this:
foreach( $result[0] as $img_tag)
because preg_match_all returns an array of arrays.
I have read the many comments on this page that complain that using a dom parser is unnecessary overhead. Well, it may be more expensive than a mere regex call, but the OP has stated that there is no control over the order of the attributes in the img tags. This fact leads to unnecessary regex pattern convolution. Beyond that, using a dom parser provides the additional benefits of readability, maintainability, and dom-awareness (regex is not dom-aware).
I love regex and I answer lots of regex questions, but when dealing with valid HTML there is seldom a good reason to regex over a parser.
In the demonstration below, see how easily and cleanly DOMDocument handles img tag attributes in any order with a mixture of quoting (and no quoting at all). Also notice that tags without a targeted attribute are not disruptive at all: an empty string is provided as the value.
Code: (Demo)
$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;
libxml_use_internal_errors(true); // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
echo "IMG#{$i}:\n";
echo "\tsrc = " , $img->getAttribute('src') , "\n";
echo "\ttitle = " , $img->getAttribute('title') , "\n";
echo "\talt = " , $img->getAttribute('alt') , "\n";
echo "---\n";
}
Output:
IMG#0:
src = /image/fluffybunny.jpg
title = Harvey the bunny
alt = a cute little fluffy bunny
---
IMG#1:
src = /image/pricklycactus.jpg
title = Roger the cactus
alt = a big green prickly cactus
---
IMG#2:
src = /image/noisycockatoo.jpg
title = Polly the cockatoo
alt = an annoying white cockatoo
---
IMG#3:
src = somethingelse
title = something
alt =
---
Using this technique in professional code will leave you with a clean script, fewer hiccups to contend with, and fewer colleagues that wish you worked somewhere else.
Here's a PHP function I cobbled together from all of the above info for a similar purpose, namely adjusting image tag width and height properties on the fly. It's a bit clunky, perhaps, but seems to work dependably:
function ReSizeImagesInHTML($HTMLContent,$MaximumWidth,$MaximumHeight) {
// find image tags
preg_match_all('/<img[^>]+>/i',$HTMLContent, $rawimagearray,PREG_SET_ORDER);
// put image tags in a simpler array
$imagearray = array();
for ($i = 0; $i < count($rawimagearray); $i++) {
array_push($imagearray, $rawimagearray[$i][0]);
}
// put image attributes in another array
$imageinfo = array();
foreach($imagearray as $img_tag) {
preg_match_all('/(src|width|height)=("[^"]*")/i',$img_tag, $imageinfo[$img_tag]);
}
// combine everything into one array
$AllImageInfo = array();
foreach($imagearray as $img_tag) {
$ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
$OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
$OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);
$NewWidth = $OrignialWidth;
$NewHeight = $OrignialHeight;
$AdjustDimensions = "F";
if($OrignialWidth > $MaximumWidth) {
$diff = $OrignialWidth-$MaximumWidth;
$percnt_reduced = (($diff/$OrignialWidth)*100);
$NewHeight = floor($OrignialHeight-(($percnt_reduced*$OrignialHeight)/100));
$NewWidth = floor($OrignialWidth-$diff);
$AdjustDimensions = "T";
}
if($OrignialHeight > $MaximumHeight) {
$diff = $OrignialHeight-$MaximumHeight;
$percnt_reduced = (($diff/$OrignialHeight)*100);
$NewWidth = floor($OrignialWidth-(($percnt_reduced*$OrignialWidth)/100));
$NewHeight= floor($OrignialHeight-$diff);
$AdjustDimensions = "T";
}
$thisImageInfo = array('OriginalImageTag' => $img_tag , 'ImageSource' => $ImageSource , 'OrignialWidth' => $OrignialWidth , 'OrignialHeight' => $OrignialHeight , 'NewWidth' => $NewWidth , 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
array_push($AllImageInfo, $thisImageInfo);
}
// build array of before and after tags
$ImageBeforeAndAfter = array();
for ($i = 0; $i < count($AllImageInfo); $i++) {
if($AllImageInfo[$i]['AdjustDimensions'] == "T") {
$NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
$NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);
$thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'] , 'NewImageTag' => $NewImageTag);
array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
}
}
// execute search and replace
for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
$HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'],$ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
}
return $HTMLContent;
}
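The proportional scaling inside the function can also be expressed with a single scale factor. The sketch below is an alternative formulation, not the author's code; the helper name `fitWithin` is hypothetical and it assumes positive integer dimensions.

```php
<?php
// Hypothetical helper: scales a width/height pair down so it fits inside
// the maximum box while keeping the aspect ratio. Never enlarges.
function fitWithin(int $width, int $height, int $maxWidth, int $maxHeight): array
{
    // The smallest ratio satisfies both limits; capping at 1 avoids upscaling.
    $scale = min(1, $maxWidth / $width, $maxHeight / $height);
    return [(int) floor($width * $scale), (int) floor($height * $scale)];
}

print_r(fitWithin(800, 600, 400, 400));
```

Because both limits are applied at once, an image that is too wide and too tall is scaled a single time instead of being adjusted in two separate steps.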
Here is THE solution, in PHP:
Just download QueryPath, and then do as follows:
$doc= qp($myHtmlDoc);
foreach($doc->xpath('//img') as $img) {
$src= $img->attr('src');
$title= $img->attr('title');
$alt= $img->attr('alt');
}
That's it, you're done !