I have some sample code:
$content = 'I have a image <img border="0" alt="581.jpg - 58.03 KB" src="581.jpg">';
And php
preg_match('/<img.+src=[\'"](?P<src>.+)[\'"].*>/i', $content, $image);
echo $image[0];
The result is: 581.jpg" border="0" alt="581.jpg - . How can I fix it?
Writing a regex for this is ... problematic to say the least. I would recommend using this:
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach ($dom->getElementsByTagName('img') as $node) {
echo $node->getAttribute('src') . PHP_EOL;
}
Explanation:
The reason you shouldn't use regex for this is that HTML markup varies. The position of the src attribute can differ, the value may use single quotes instead of double quotes (some HTML attributes don't need quotes at all; for example, <img class=logo /> is valid), the tag may be uppercase, and there are probably other issues I can't think of right now.
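To make that concrete, here is a minimal sketch of the same DOMDocument loop run against markup that varies in exactly those ways. The sample tags and the $sources variable are made up for illustration, and libxml_use_internal_errors() is only there to keep parser warnings quiet.

```php
<?php
// Hypothetical sample markup: each tag writes its src a different way.
$content = <<<HTML
<div>
<img border="0" alt="581.jpg - 58.03 KB" src="581.jpg">
<img src='single.jpg' alt='single quotes'>
<IMG SRC=unquoted.jpg>
</div>
HTML;

// Collect libxml warnings quietly instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($content);

// Gather every src value, whatever the quoting, case or attribute order.
$sources = [];
foreach ($dom->getElementsByTagName('img') as $node) {
    $sources[] = $node->getAttribute('src');
}

print_r($sources);
```

A single regex would need a separate branch for each of these spellings, whereas the parser normalises them before the loop ever sees them.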
Extra info:
Grabbing the href attribute of an A element
I have some HTML content that I need to parse to get all the images. I then need to print out the whole content, but call a PHP class method at every occurrence of an image.
This is the content
<?php $content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">'; ?>
I need to be able to get the images and run a class method with the output.
So the result would be something like
<?php echo 'Some text
<p>A paragraph</p>';
$this->Image('image1.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);
echo 'More text';
$this->Image('image2.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);
But obviously I imagine it would have to be a loop or something that does it automatically.
To convert the entire HTML snippet to TcPDF as you mentioned in your comment, you'll need to parse the snippet with DOMDocument and loop through each child node deciding how to handle them appropriately.
The catch with the snippet you've provided above is that it isn't a complete HTML document, thus DOMDocument will wrap it in <html> and <body> tags when parsing it, loading the following structure internally:
<html>
<body>
Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">
</body>
</html>
This caveat is easily worked around, however, by building on @hakre's answer in the thread I linked to below. My recommendation would be something along the lines of the following:
// Load the snippet into a DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
// Use DOMXPath to retrieve the body content of the snippet
$xpath = new DOMXPath($doc);
$data = $xpath->evaluate('//html/body');
// <body> is the first node in $data, so for readability we do this
$body = $data->item(0);
// Now we loop through the elements in your original snippet
foreach ($body->childNodes as $node) {
    switch ($node->nodeName) {
        case 'img':
            // Get the value of the src attribute from the img element
            $src = $node->attributes->getNamedItem('src')->nodeValue;
            $this->Image($src, PDF_MARGIN_LEFT, $y_offset, 116, 85);
            break;
        default:
            // Pass the line to TcPDF as a normal paragraph
            break;
    }
}
This way, you can easily add additional case 'blah': blocks to handle other elements which may appear in your $content snippets and handle them appropriately, and the content will be processed in the correct order without breaking the original flow of the text. :)
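As a rough, standalone illustration of that ordering, the sketch below walks the child nodes and records what it would do instead of calling TcPDF. Two things here are assumptions of the sketch, not part of the answer above: the snippet is wrapped in a <div> so every piece of content has one known parent (the parser may treat bare leading text differently), and the $actions array stands in for the real PDF calls.

```php
<?php
// Sample snippet from the question, wrapped in a <div> (an assumption of this
// sketch) so that every piece of content has one known parent element.
$content = '<div>Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200"></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);

$root = $doc->getElementsByTagName('div')->item(0);

// Stand-in for the PDF calls: record what would be done, in document order.
$actions = [];
foreach ($root->childNodes as $node) {
    if ($node->nodeName === 'img') {
        $actions[] = 'image: ' . $node->getAttribute('src');
        continue;
    }
    // textContent covers both bare text nodes and elements such as <p>
    $text = trim($node->textContent);
    if ($text !== '') {
        $actions[] = 'text: ' . $text;
    }
}

print_r($actions);
```

In real code each 'image: ...' entry would become a $this->Image() call and each 'text: ...' entry would be written to the PDF as a paragraph.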
-- Original answer. Will work if you just want to extract the image sources and process them elsewhere independently of the rest of the content.
You can match all the <img> tags in your $content string by using a regular expression:
/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i
A live breakdown of the regex which you can play with to see how it works is here: http://regex101.com/r/tS5xY9
You can use this regex with preg_match_all() to retrieve all of the image tags from within your $content variable as follows:
$matches = array();
$num = preg_match_all('/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i', $content, $matches, PREG_SET_ORDER);
The PREG_SET_ORDER constant tells preg_match_all() to store its results in a manner which is more easily looped through when producing output, as the first index on the array (i.e., $matches[0], $matches[1], etc) will contain the complete set of matches from the regular expression. In the case of the regex above, $matches[0] will contain the following:
array(
0 => '<img src="image1.jpg" width="200" height="200">',
1 => 'image1.jpg',
)
You can now loop through $matches as $key => $match and pass $match[1] to your $this->Image() method.
Alternatively, if you don't want to loop through, you can just access each src attribute directly from $matches as $matches[0][1], $matches[1][1], etc.
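Put together, a minimal sketch of that loop might look like this. It uses the $content from the question; the $sources array is just a placeholder for whatever you do with each src, such as the $this->Image() call.

```php
<?php
$content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">';

$pattern = '/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i';
$matches = array();
$num = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$sources = array();
foreach ($matches as $match) {
    // $match[0] is the whole tag, $match[1] is the captured src value
    $sources[] = $match[1];
}

print_r($sources);
```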
If you need to be able to access the other attributes within the tags, then I recommend using the DOMDocument method provided by @hakre on Get img src with PHP. If you just need to access the src attribute, then using preg_match_all() is faster and more efficient as it does not need to load the entire DOM of the snippet into memory as objects to provide you with the data you need.
You could build a lexer or parser to find out where the images are.
You're looking for two tokens at the beginning: <img and the respective closing >. A starting point for this could be something like this:
$text = "hello <img src='//first.jpg'> there <img src='//second.jpg'>";
$pos = 0;
while (($opening = strpos($text, '<img', $pos)) !== FALSE) {
    // Find the next closing bracket's location
    $closing = strpos($text, '>', $opening);
    if ($closing === FALSE) {
        break; // No closing '>' left, so stop scanning
    }
    $length = ($closing - $opening) + 1; // Add one for the closing '>'
    $img_tag = substr($text, $opening, $length);
    var_dump($img_tag);
    // Update the loop position with our closing tag to advance the lexer
    $pos = $closing;
}
You're going to have to then build methods to scan for the img tags. You can also add your PDF methods in the loop, too.
Another more manageable approach could be to build a class that walks through each character. It'd first look for an opening '<' character, then check if the next three are 'img', and if so proceed to scan for the src, height, width attributes respectively. This is more work but is way more flexible – you'll be able to scan for much more than just your image tags.
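As a first step in that direction, here is a rough sketch that extends the strpos() loop to pull the src value out of each tag. The function name scan_img_sources() is hypothetical. It only handles quoted src values, and it would also pick up an attribute such as data-src, so treat it as a starting point, not a finished scanner.

```php
<?php
// Hypothetical helper: scan $text for <img ...> tags and return each src value.
function scan_img_sources($text)
{
    $sources = array();
    $pos = 0;
    while (($opening = stripos($text, '<img', $pos)) !== false) {
        $closing = strpos($text, '>', $opening);
        if ($closing === false) {
            break; // unterminated tag: stop rather than loop forever
        }
        $tag = substr($text, $opening, $closing - $opening + 1);

        // Look for src= followed by a quote character inside this tag only
        $attr = stripos($tag, 'src=');
        if ($attr !== false) {
            $quote = $tag[$attr + 4];
            if ($quote === '"' || $quote === "'") {
                $start = $attr + 5;
                $end = strpos($tag, $quote, $start);
                if ($end !== false) {
                    $sources[] = substr($tag, $start, $end - $start);
                }
            }
        }
        // Continue scanning after this tag
        $pos = $closing;
    }
    return $sources;
}

print_r(scan_img_sources("hello <img src='//first.jpg'> there <img src='//second.jpg'>"));
```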
I have a web page source code that I want to use in my project. I want to use an image link in this code. So, I want to reach this link using regex in PHP.
That's it:
img src="http://imagelinkhere.com" class="image"
There is only one line like this.
My logic is to get the string between
="
and
" class="image"
characters.
How can I do that with REGEX? Thank you very much.
Don't use regex for HTML; try DOMDocument instead:
$html = '<html><img src="http://imagelinkhere.com" class="image" /></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$img = $dom->getElementsByTagName("img");
foreach ( $img as $v ) {
if ($v->getAttribute("class") == "image")
print($v->getAttribute("src"));
}
Output
http://imagelinkhere.com
preg_match('~(http://.*?)"~', $text, $matches);
var_dump($matches);
The link would be in $matches[1].
Using
/.*="(.*?)" .*/
with preg_replace() gives you only the URL, from the first regex group ($1).
So the complete code would look like
$str='img src="http://imagelinkhere.com" class="image"';
$str=preg_replace('/.*="(.*?)" .*/','$1',$str);
echo $str;
-->
http://imagelinkhere.com
Edit:
Or just follow Baba's advice and use DOM Parser. I'll remember that regex will give you headaches when parsing html with it.
There are several ways to do so:
1. You can use SimpleHTML Dom Parser, which I prefer for simple HTML.
2. You can also use preg_match:
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" class="image" />';
$array = array();
preg_match( '/src="([^"]*)"/i', $foo, $array ) ;
see this thread
I can hear the sound of hooves, so I have gone with DOM parsing instead of regex.
$dom = new DOMDocument();
$dom->loadHTMLFile('path/to/your/file.html');
foreach ($dom->getElementsByTagName('img') as $img)
{
if ($img->hasAttribute('class') && $img->getAttribute('class') == 'image')
{
echo $img->getAttribute('src');
}
}
This will echo only the src attribute of an img tag with a class="image"
Try using preg_match_all, like this:
preg_match_all('/img src="([^"]*)"/', $source, $images);
That should put the URLs of all the images in $images[1] ($images[0] holds the full matches). The regex finds every img src="..." occurrence in the code and captures the part between the quotes.
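For example, with a made-up $source string, the result array is laid out like this:

```php
<?php
// Hypothetical sample input
$source = '<p><img src="a.png" alt=""> and <img src="b.png"></p>';

preg_match_all('/img src="([^"]*)"/', $source, $images);

// $images[0] holds the full matches, $images[1] holds the captured URLs
print_r($images[1]);
```

Note that this pattern only matches when src is the first attribute directly after img.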
I have a string of data that is set as $content, an example of this data is as follows
This is some sample data which is going to contain an image in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">. It will also contain lots of other text and maybe another image or two.
I am trying to grab just the <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and save it as another string for example $extracted_image
I have this so far....
if( preg_match_all( '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/', $content, $extracted_image ) ) {
$new_content .= 'NEW CONTENT IS '.$extracted_image.'';
All it is returning is...
NEW CONTENT IS Array
I realise my attempt is probably completely wrong, but can someone tell me where I am going wrong?
Your first problem is that preg_match_all() (http://php.net/manual/en/function.preg-match-all.php) places an array into its third argument, so you should be outputting the individual item(s) from the array. $extracted_image[0] is itself an array of the full matches, so try $extracted_image[0][0] to start.
You need to use a different function, if you only want one result:
preg_match() returns the first and only the first match.
preg_match_all() returns an array with all the matches.
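A small sketch of the difference, using the pattern from the question on a made-up $content string with two images (the second URL is invented for the example):

```php
<?php
$content = 'Text <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> '
    . 'more text <img src="http://www.example.com/second.jpg">.';
$pattern = '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/';

// preg_match(): $first[0] is the first full tag, $first[1] its src value
preg_match($pattern, $content, $first);
$extracted_image = $first[0];

// preg_match_all(): $all[0] lists every full tag, $all[1] every src value
preg_match_all($pattern, $content, $all);

echo 'NEW CONTENT IS ' . $extracted_image . PHP_EOL;
print_r($all[1]);
```

Concatenating $first or $all directly into a string is what produces the "Array" output from the question.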
Using regex to parse valid html is ill-advised. Because there can be unexpected attributes before the src attribute, because non-img tags can trick the regular expression into false-positive matching, and because attribute values can be quoted with single or double quotes, you should use a dom parser. It is clean, reliable, and easy to read.
Code: (Demo)
$string = <<<HTML
This is some sample data which is going to contain an image
in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">.
It will also contain lots of other text and maybe another image or two
like this: <img alt='another image' src='http://www.example.com/randomfolder/randomimagename.jpg'>
HTML;
$srcs = [];
$dom=new DOMDocument;
$dom->loadHTML($string);
foreach ($dom->getElementsByTagName('img') as $img) {
$srcs[] = $img->getAttribute('src');
}
var_export($srcs);
Output:
array (
0 => 'http://www.randomdomain.com/randomfolder/randomimagename.jpg',
1 => 'http://www.example.com/randomfolder/randomimagename.jpg',
)
I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg">
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark
As a suggestion, rather than using regex, you may be better off using something like the SimpleXML class to traverse the HTML. That way you can find the img tags and their src attributes and change them easily, rather than trying to parse a whole document with regex. Once you have the src value, you can explode the string on the "/" delimiter and use the last element of the resulting array as the new src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php
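Here is a minimal sketch of the explode() idea. Note that it swaps in DOMDocument for SimpleXML, on the assumption that the input is ordinary HTML and not well-formed XML (an unclosed <img> tag is not valid XML). The sample $html is made up.

```php
<?php
// Hypothetical sample input
$html = '<p>Intro</p><img src="dir1/dir2/dir3/image1.jpg">';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('img') as $img) {
    // Split the path on "/" and keep only the last segment (the file name)
    $parts = explode('/', $img->getAttribute('src'));
    $img->setAttribute('src', end($parts));
}

$newSrc = $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
echo $newSrc, PHP_EOL;
```

Calling $dom->saveHTML() afterwards gives you the modified markup; basename() would be a shorter alternative to the explode()/end() pair.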
This is a tutorial on how to change all links in an HTML document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
$fd = FluentDOM($html, 'html')->find('//img[@src]')->each(
function ($node) {
$item = FluentDOM($node);
$item->attr('src', basename($item->attr('src')));
}
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>
If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);
I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
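To see why the brackets matter: [img] and [src] are character classes, so each matches a single character from the set and not the literal word. A quick sketch comparing the two patterns on the tag from the question (the variable names are just for the example):

```php
<?php
$introtext = '<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" />';

// [img] and [src] are character classes: each matches ONE character from the set
$broken = '/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i';
$fixed  = '/< *img[^>]*src *= *["\']?([^"\']*)/i';

preg_match($broken, $introtext, $before);
preg_match($fixed, $introtext, $after);

echo $before[1], PHP_EOL; // the "r" of border= satisfies [src], so this is 0
echo $after[1], PHP_EOL;  // images/stories/otakuzoku1.jpg
```

In the broken pattern the greedy [^\>]* backtracks from the end of the tag and stops at the last character from [src] that is followed by "=", which is the final "r" in border.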
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
if ($val['tag'] == 'IMG') {
$first_src = $val['attributes']['SRC'];
break;
}
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
if($parser->iNodeName == 'img') {
echo $parser->iNodeAttributes['src'];
break;
}
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be executed by a dom parser because regex is dom-ignorant.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.
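One caveat with the chained call above is that item(0) returns null when the HTML contains no img tag, so calling getAttribute() on it would fail. A minimal sketch of a guarded version, where first_img_src() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the first img src, or null when there is none.
function first_img_src($html)
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // loadHTML() rejects an empty string, so bail out early in that case
    if ($html === '' || !$dom->loadHTML($html)) {
        return null;
    }
    $img = $dom->getElementsByTagName('img')->item(0);
    return $img === null ? null : $img->getAttribute('src');
}

var_dump(first_img_src('<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'));
var_dump(first_img_src('<p>no image here</p>'));
```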