I have some HTML content that I need to parse to get all the images. Then I want to print out the whole content, but call a PHP class method at every occurrence of an image.
This is the content:
<?php $content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">'; ?>
I need to be able to get the images and run a class method with the output.
So the result would be something like
<?php echo 'Some text
<p>A paragraph</p>';
$this->Image('image1.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);
echo 'More text';
$this->Image('image2.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);
But obviously I imagine it would have to be a loop or something that does it automatically.
To convert the entire HTML snippet to TcPDF as you mentioned in your comment, you'll need to parse the snippet with DOMDocument and loop through each child node deciding how to handle them appropriately.
The catch with the snippet you've provided above is that it isn't a complete HTML document, thus DOMDocument will wrap it in <html> and <body> tags when parsing it, loading the following structure internally:
<html>
<body>
Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">
</body>
</html>
This caveat is easily worked around, however, by building on @hakre's answer in the thread I linked to below. My recommendation would be something along the lines of the following:
// Load the snippet into a DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
// Use DOMXPath to retrieve the body content of the snippet
$xpath = new DOMXPath($doc);
$data = $xpath->evaluate('//html/body');
// <body> is now $data[0], so for readability we do this
$body = $data[0];
// Now we loop through the elements in your original snippet
foreach ($body->childNodes as $node) {
    switch ($node->nodeName) {
        case 'img':
            // Get the value of the src attribute from the img element
            $src = $node->attributes->getNamedItem('src')->nodeValue;
            $this->Image($src, PDF_MARGIN_LEFT, $y_offset, 116, 85);
            break;
        default:
            // Pass the line to TcPDF as a normal paragraph
            break;
    }
}
This way, you can easily add additional case 'blah': blocks to handle other elements which may appear in your $content snippets and handle them appropriately, and the content will be processed in the correct order without breaking the original flow of the text. :)
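To see the order in which the pieces come out, here is a minimal self-contained sketch of the same loop. It assumes you only want a flat list of segments; the function name splitIntoSegments() and the 'type'/'value' array shape are made up for illustration, and the 'image' branch is where the $this->Image() call would go in real code. Note that the parser may wrap leading bare text in a <p>, so the 'html' segments are whatever DOMDocument produced, not necessarily the exact input.

```php
<?php
// Hypothetical helper: walk the top-level nodes of an HTML snippet in order
// and return a list of ['type' => 'image'|'html', 'value' => string].
function splitIntoSegments(string $content): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($content);
    libxml_clear_errors();

    $segments = [];
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $segments;
    }

    foreach ($body->childNodes as $node) {
        if ($node->nodeName === 'img') {
            $segments[] = ['type' => 'image', 'value' => $node->getAttribute('src')];
            continue;
        }
        // Serialize any other node back to HTML; skip whitespace-only nodes.
        $html = trim($doc->saveHTML($node));
        if ($html !== '') {
            $segments[] = ['type' => 'html', 'value' => $html];
        }
    }
    return $segments;
}
```

Looping over the returned array and switching on 'type' keeps the PDF calls out of the parsing code, which makes the parsing part easy to test on its own.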
-- Original answer. Will work if you just want to extract the image sources and process them elsewhere independently of the rest of the content.
You can match all the <img> tags in your $content string by using a regular expression:
/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i
A live breakdown of the regex which you can play with to see how it works is here: http://regex101.com/r/tS5xY9
You can use this regex with preg_match_all() to retrieve all of the image tags from within your $content variable as follows:
$matches = array();
$num = preg_match_all('/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i', $content, $matches, PREG_SET_ORDER);
The PREG_SET_ORDER constant tells preg_match_all() to group its results by match, which is easier to loop through when producing output: each element of the array (i.e., $matches[0], $matches[1], etc.) holds one complete match together with its captured groups. In the case of the regex above, $matches[0] will contain the following:
array(
0 => '<img src="image1.jpg" width="200" height="200">',
1 => 'image1.jpg',
)
You can now loop through $matches as $key => $match and pass $match[1] to your $this->Image() method.
Alternatively, if you don't want to loop through, you can just access each src attribute directly from $matches as $matches[0][1], $matches[1][1], etc.
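Put together, the loop could look roughly like this. It is only a sketch using the $content from the question; the $sources array is just there for illustration, and in your code the loop body would call $this->Image() instead.

```php
<?php
$content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">';

$pattern = '/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i';
$matches = array();
$num = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$sources = array();
foreach ($matches as $key => $match) {
    // $match[0] is the whole <img> tag, $match[1] is the captured src value.
    $sources[] = $match[1];
}
print_r($sources);
```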
If you need to be able to access the other attributes within the tags, then I recommend using the DOMDocument method provided by @hakre on Get img src with PHP. If you just need to access the src attribute, then using preg_match_all() is faster and more efficient as it does not need to load the entire DOM of the snippet into memory as objects to provide you with the data you need.
You could build a lexer or parser to find out where the images are.
You're looking for two tokens at the beginning: <img and the respective closing >. A starting point for this could be something like this:
$text = "hello <img src='//first.jpg'> there <img src='//second.jpg'>";
$pos = 0;
while (($opening = strpos($text, '<img', $pos)) !== FALSE) {
    // Find the next closing bracket's location
    $closing = strpos($text, '>', $opening);
    if ($closing === FALSE) {
        break; // Unterminated tag: stop rather than loop forever
    }
    $length = ($closing - $opening) + 1; // Add one for the closing '>'
    $img_tag = substr($text, $opening, $length);
    var_dump($img_tag);
    // Update the loop position with our closing tag to advance the lexer
    $pos = $closing;
}
You're going to have to then build methods to scan for the img tags. You can also add your PDF methods in the loop, too.
Another more manageable approach could be to build a class that walks through each character. It'd first look for an opening '<' character, then check if the next three are 'img', and if so proceed to scan for the src, height, width attributes respectively. This is more work but is way more flexible – you'll be able to scan for much more than just your image tags.
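As a rough idea of the next step, the scanner can be extended to pull the src value out of each tag it finds. This is a sketch only: the function name scanImageSources() is made up, it assumes the src value is quoted, and it will be fooled by things like a data-src attribute or a '>' inside an attribute value.

```php
<?php
// Hypothetical helper: return the src value of every <img ...> tag using
// only string functions, in the spirit of a hand-written scanner.
function scanImageSources(string $text): array
{
    $sources = [];
    $pos = 0;
    while (($opening = stripos($text, '<img', $pos)) !== false) {
        $closing = strpos($text, '>', $opening);
        if ($closing === false) {
            break; // Unterminated tag: stop scanning.
        }
        $tag = substr($text, $opening, $closing - $opening + 1);

        // Look for src= followed by a quote character inside this tag only.
        $srcPos = stripos($tag, 'src=');
        if ($srcPos !== false) {
            $quote = $tag[$srcPos + 4] ?? '';
            if ($quote === '"' || $quote === "'") {
                $start = $srcPos + 5;
                $end = strpos($tag, $quote, $start);
                if ($end !== false) {
                    $sources[] = substr($tag, $start, $end - $start);
                }
            }
        }
        // Continue after this tag's closing bracket.
        $pos = $closing + 1;
    }
    return $sources;
}
```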
Related
I have a sample code:
$content = 'I have a image <img border="0" alt="581.jpg - 58.03 KB" src="581.jpg">';
And php
preg_match('/<img.+src=[\'"](?P<src>.+)[\'"].*>/i', $content, $image);
echo $image[0];
The result is: 581.jpg" border="0" alt="581.jpg - . How can I fix it?
Writing a regex for this is ... problematic to say the least. I would recommend using this:
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach ($dom->getElementsByTagName('img') as $node) {
    echo $node->getAttribute('src') . PHP_EOL;
}
Explanation:
The reason you shouldn't use a regex for this is that HTML markup varies. The position of the src attribute can differ; it may use single quotes instead of double quotes (some attribute values don't need quotes at all; for example, <img class=logo /> is valid); it may be uppercase; and there are probably other issues I can't think of right now.
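To illustrate, here is a small sketch with made-up markup that mixes the variations listed above (uppercase tag and attribute names, single quotes, an unquoted value, src not in first position). The file names are invented for the example. The same DOMDocument loop handles all of them.

```php
<?php
$content = <<<HTML
<IMG SRC='upper.jpg' alt="a">
<img alt="b" class=logo src=bare.png>
<img src="plain.gif" />
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$sources = [];
foreach ($dom->getElementsByTagName('img') as $node) {
    $sources[] = $node->getAttribute('src');
}
print_r($sources);
```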
Extra info:
Grabbing the href attribute of an A element
I have a string of data that is set as $content, an example of this data is as follows
This is some sample data which is going to contain an image in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">. It will also contain lots of other text and maybe another image or two.
I am trying to grab just the <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and save it as another string for example $extracted_image
I have this so far....
if( preg_match_all( '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/', $content, $extracted_image ) ) {
$new_content .= 'NEW CONTENT IS '.$extracted_image.'';
All it is returning is...
NEW CONTENT IS Array
I realise my attempt is probably completely wrong, but can someone tell me where I am going wrong?
Your first problem is that preg_match_all() (http://php.net/manual/en/function.preg-match-all.php) places an array into its third argument, here $extracted_image, so you should be outputting the individual item(s) from the array. Try $extracted_image[0][0] (the first full match) to start.
You need to use a different function, if you only want one result:
preg_match() returns the first and only the first match.
preg_match_all() returns an array with all the matches.
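A quick side-by-side sketch of the two functions, using the pattern from the question. The sample string and the example.com URLs are invented for the illustration.

```php
<?php
$content = 'Text <img src="http://www.example.com/a.jpg"> more text '
         . '<img src="http://www.example.com/b.jpg"> end.';
$pattern = '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/';

// preg_match(): $first[0] is the full tag, $first[1] the src value.
preg_match($pattern, $content, $first);

// preg_match_all(): $all[0] lists every full tag, $all[1] every src value.
preg_match_all($pattern, $content, $all);

// A string can be concatenated; an array would just print "Array".
$new_content = 'NEW CONTENT IS ' . $first[0];
echo $new_content . PHP_EOL;
```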
Using regex to parse valid html is ill-advised. Because there can be unexpected attributes before the src attribute, because non-img tags can trick the regular expression into false-positive matching, and because attribute values can be quoted with single or double quotes, you should use a dom parser. It is clean, reliable, and easy to read.
Code: (Demo)
$string = <<<HTML
This is some sample data which is going to contain an image
in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">.
It will also contain lots of other text and maybe another image or two
like this: <img alt='another image' src='http://www.example.com/randomfolder/randomimagename.jpg'>
HTML;
$srcs = [];
$dom = new DOMDocument;
$dom->loadHTML($string);
foreach ($dom->getElementsByTagName('img') as $img) {
    $srcs[] = $img->getAttribute('src');
}
var_export($srcs);
Output:
array (
0 => 'http://www.randomdomain.com/randomfolder/randomimagename.jpg',
1 => 'http://www.example.com/randomfolder/randomimagename.jpg',
)
I don't know RegExp well, and I did not succeed in splitting a string into an array.
I have a string like:
<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>
So what I am trying to do is split the text string into an array where the KEY is the text from the header and the CONTENT is all the remaining content up to the next header, like:
array("some text in header" => "some other content, that belongs to header...", ...)
I would suggest looking at the PHP DOM extension: http://php.net/manual/en/book.dom.php. You can read a DOM from a document or create one.
I've used this one and enjoyed it:
http://simplehtmldom.sourceforge.net/
You could do it with a regex as well.
Something like this:
/<h5>(.*)<\/h5>(.*)<h5>/s
But this just finds the first occurrence. You'll have to cut the string to get the next one.
Any way you cut it, I don't see a one-liner for you. Sorry.
Here's a crummy, quick-and-dirty version.
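$res = array();
$chunks = explode("<h5>", $html);
foreach ($chunks as $chunk) {
    if (strpos($chunk, "</h5>") === false) {
        continue; // Skip anything before the first header
    }
    list($key, $val) = explode("</h5>", $chunk, 2);
    $res[$key] = $val;
}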
Don't parse HTML with preg_match.
Instead, use a PHP class:
the DOMDocument class.
Example:
<?php
$html= "<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>";
// a new DOM object
$dom = new DOMDocument('1.0', 'utf-8');
// ask the parser to discard white space (only consulted at load time, so set it first)
$dom->preserveWhiteSpace = false;
// load the HTML into the object
$dom->loadHTML($html);
$hFive = $dom->getElementsByTagName('h5');
echo $hFive->item(0)->nodeValue; // you can get all h5 data by changing the index
?>
Reference
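To get from there to the header => content array the question asks for, one option is to walk the siblings that follow each h5. The sketch below is just one way to do it; the function name splitByH5() is made up, and it assumes the h5 elements and their content are siblings at the same level.

```php
<?php
// Hypothetical helper: map each <h5> heading text to the HTML that follows
// it, up to the next <h5> (or the end of the parent element).
function splitByH5(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = [];
    foreach ($dom->getElementsByTagName('h5') as $heading) {
        $content = '';
        // Collect sibling nodes until the next <h5>.
        for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeName === 'h5') {
                break;
            }
            $content .= $dom->saveHTML($node);
        }
        $result[trim($heading->textContent)] = trim($content);
    }
    return $result;
}
```

Two headers with the same text would overwrite each other in the array, so this only fits input where the header texts are unique.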
Using PHP to curl a web page (some URL entered by user, let's assume it's valid).
Example: http://www.youtube.com/watch?v=Hovbx6rvBaA
I need to parse the HTML and extract all de-duplicated URL's that seem like an image. Not just the ones in img src="" but any URL ending in jpe?g|bmp|gif|png, etc. on that page. (In other words, I don't wanna parse the DOM but wanna use RegEx).
I plan to then curl the URLs for their width and height information and ensure that they are indeed images, so don't worry about security related stuff.
What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.
<?php
$resultFromCurl = '
<html>
<body>
<img src="hello.jpg" />
Yep
<table background="yep.jpg">
</table>
<p>
Perhaps you should check out foo.jpg! I promise it
is safe for work.
</p>
</body>
</html>
';
// these are all the attributes i could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);
$dom = new DOMDocument();
// suppress warnings about malformed markup
@$dom->loadHTML($resultFromCurl);
$xpath = new DOMXPath($dom);
$urls = array();
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        if (preg_match('#\.(gif|jpe?g|png)$#', $link->textContent))
            $urls[$link->textContent] = true;
    }
}
if (preg_match_all('#\b[^\s]+\.(?:gif|jpe?g|png)\b#', $dom->documentElement->textContent, $matches)) {
    // $matches[0] holds every full match
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}
$urls = array_keys($urls);
var_dump($urls);
Collect all image urls into an array, then use array_unique() to remove duplicates.
$my_image_links = array_unique( $my_image_links );
// No more duplicates
If you really want to do this w/ a regex, then we can assume each image name will be surrounded by either ', ", or spaces, tabs, or line breaks or beginning of line, >, <, and whatever else you can think of. So, then we can do:
$pattern = '/(?:^|[\'" >\t])([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/im';
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);
The above will capture the image link in stuff like:
<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>
Live example
I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
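To see why the brackets matter: [img] and [src] are character classes, so each one matches a single character rather than the whole word. A small sketch comparing the two patterns on the tag from the question (the variable names are only for illustration):

```php
<?php
$html = '<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" />';

// Original pattern: [img] and [src] each match ONE character from the set.
$broken = '/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i';
// Corrected pattern: img and src are matched as literal words.
$fixed = '/< *img[^>]*src *= *["\']?([^"\']*)/i';

preg_match($broken, $html, $brokenMatches);
preg_match($fixed, $html, $fixedMatches);

// The greedy [^\>]* runs to the end of the tag and backtracks to the last
// place where one of s, r, c is followed by "=": the final "r" of border.
echo $brokenMatches[1] . PHP_EOL; // 0
echo $fixedMatches[1] . PHP_EOL;  // images/stories/otakuzoku1.jpg
```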
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
    if ($val['tag'] == 'IMG') {
        $first_src = $val['attributes']['SRC'];
        break;
    }
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while ($parser->parse()) {
    if ($parser->iNodeName == 'img') {
        echo $parser->iNodeAttributes['src'];
        break;
    }
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be executed by a dom parser because regex is dom-ignorant.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.
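One thing to watch: if the article has no img tag at all, item(0) returns null and the chained getAttribute() call fails. A guarded variant might look like the sketch below; the function name firstImageSrc() is made up for the example.

```php
<?php
// Hypothetical helper: return the src of the first <img>, or null if the
// HTML is empty, has no <img>, or the first <img> has no src attribute.
function firstImageSrc(string $html): ?string
{
    if (trim($html) === '') {
        return null; // Nothing to parse.
    }

    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $img = $dom->getElementsByTagName('img')->item(0);
    if ($img === null || !$img->hasAttribute('src')) {
        return null;
    }
    return $img->getAttribute('src');
}

$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
var_dump(firstImageSrc($row->introtext));
```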