Extract Image SRC from string using preg_match_all


I have a string of data that is set as $content, an example of this data is as follows
This is some sample data which is going to contain an image in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">. It will also contain lots of other text and maybe another image or two.
I am trying to grab just the <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and save it as another string for example $extracted_image
I have this so far....
if( preg_match_all( '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/', $content, $extracted_image ) ) {
$new_content .= 'NEW CONTENT IS '.$extracted_image.'';
All it is returning is...
NEW CONTENT IS Array
I realise my attempt is probably completely wrong, but can someone tell me where I am going wrong?

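Your first problem is that preg_match_all() (http://php.net/manual/en/function.preg-match-all.php) places an array into $matches, so you should be outputting the individual item(s) from the array. With the default flags, $extracted_image[0] is itself an array of the full <img> tags and $extracted_image[1] is an array of the captured src values, so try $extracted_image[0][0] to start.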

You need to use a different function, if you only want one result:
preg_match() returns the first and only the first match.
preg_match_all() returns an array with all the matches.
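As a rough illustration of the difference, here is a minimal sketch that runs the pattern from the question through both functions. The shortened sample text and the variable names $m, $all, $extracted_src, $first_tag and $all_srcs are only assumptions for the example.

```php
<?php
// Shortened sample text based on the question.
$content = 'Text with <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and more text.';
$pattern = '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/';

// preg_match(): $m[0] is the whole tag, $m[1] is the captured src.
$extracted_image = null;
$extracted_src = null;
if (preg_match($pattern, $content, $m)) {
    $extracted_image = $m[0];
    $extracted_src = $m[1];
}

// preg_match_all(): $all[0] lists every whole tag, $all[1] every captured src.
preg_match_all($pattern, $content, $all);
$first_tag = $all[0][0];
$all_srcs = $all[1];

echo $extracted_image, "\n", implode("\n", $all_srcs), "\n";
```

Either way, the value you concatenate into $new_content has to be a string such as $extracted_image or $all[0][0], not the whole array.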

Using regex to parse valid html is ill-advised. Because there can be unexpected attributes before the src attribute, because non-img tags can trick the regular expression into false-positive matching, and because attribute values can be quoted with single or double quotes, you should use a dom parser. It is clean, reliable, and easy to read.
Code:
$string = <<<HTML
This is some sample data which is going to contain an image
in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">.
It will also contain lots of other text and maybe another image or two
like this: <img alt='another image' src='http://www.example.com/randomfolder/randomimagename.jpg'>
HTML;
$srcs = [];
$dom=new DOMDocument;
$dom->loadHTML($string);
foreach ($dom->getElementsByTagName('img') as $img) {
$srcs[] = $img->getAttribute('src');
}
var_export($srcs);
Output:
array (
0 => 'http://www.randomdomain.com/randomfolder/randomimagename.jpg',
1 => 'http://www.example.com/randomfolder/randomimagename.jpg',
)

Related

PHP split string by <img> and break output

I have some HTML content that I need to parse to get all the images, then print out the whole content while calling a PHP class method at every occurrence of an image.
This is the content
<?php $content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">'; ?>
I need to be able to get the images and run a class method with the output.
So the result would be something like
<?php echo 'Some text
<p>A paragraph</p>';
$this->Image('image1.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);
echo 'More text';
$this->Image('image2.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);
But obviously I imagine it would have to be a loop or something that does it automatically.

To convert the entire HTML snippet to TcPDF as you mentioned in your comment, you'll need to parse the snippet with DOMDocument and loop through each child node deciding how to handle them appropriately.
The catch with the snippet you've provided above is that it isn't a complete HTML document, thus DOMDocument will wrap it in <html> and <body> tags when parsing it, loading the following structure internally:
<html>
<body>
Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">
</body>
</html>
This caveat is easily worked around, however, by building on @hakre's answer in the thread I linked to below. My recommendation would be something along the lines of the following:
// Load the snippet into a DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
// Use DOMXPath to retrieve the body content of the snippet
$xpath = new DOMXPath($doc);
$data = $xpath->evaluate('//html/body');
// <body> is the first node in $data, so for readability we do this
$body = $data->item(0);
// Now we loop through the elements in your original snippet
foreach ($body->childNodes as $node) {
switch ($node->nodeName) {
case 'img':
// Get the value of the src attribute from the img element
$src = $node->attributes->getNamedItem('src')->nodeValue;
$this->Image($src, PDF_MARGIN_LEFT, $y_offset, 116, 85);
break;
default:
// Pass the line to TcPDF as a normal paragraph
break;
}
}
This way, you can easily add additional case 'blah': blocks to handle other elements which may appear in your $content snippets and handle them appropriately, and the content will be processed in the correct order without breaking the original flow of the text. :)
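To make the ordering idea concrete without depending on TCPDF, here is a minimal sketch that collects the snippet into an ordered list of segments. The $segments structure and its 'html' and 'image' type labels are assumptions for illustration; in real code you would call your PDF methods where the segments are recorded. It also assumes the <img> tags are direct children of <body>, as in the snippet from the question.

```php
<?php
$content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$body = $doc->getElementsByTagName('body')->item(0);
$segments = [];
$buffer = '';
foreach ($body->childNodes as $node) {
    if ($node->nodeName === 'img') {
        // Flush the markup gathered so far, then record the image.
        if (trim($buffer) !== '') {
            $segments[] = ['type' => 'html', 'value' => trim($buffer)];
        }
        $buffer = '';
        $segments[] = ['type' => 'image', 'value' => $node->getAttribute('src')];
    } else {
        // Serialise any other node back to markup and keep collecting.
        $buffer .= $doc->saveHTML($node);
    }
}
if (trim($buffer) !== '') {
    $segments[] = ['type' => 'html', 'value' => trim($buffer)];
}
print_r($segments);
```

Looping over $segments afterwards gives you the text and the images in their original order, so each 'image' entry can be handed to $this->Image() and each 'html' entry written out as text.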
-- Original answer. Will work if you just want to extract the image sources and process them elsewhere independently of the rest of the content.
You can match all the <img> tags in your $content string by using a regular expression:
/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i
A live breakdown of the regex which you can play with to see how it works is here: http://regex101.com/r/tS5xY9
You can use this regex with preg_match_all() to retrieve all of the image tags from within your $content variable as follows:
$matches = array();
$num = preg_match_all('/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i', $content, $matches, PREG_SET_ORDER);
The PREG_SET_ORDER constant tells preg_match_all() to store its results in a manner which is more easily looped through when producing output: each entry of $matches ($matches[0], $matches[1], etc.) holds everything found for one match, i.e. the full tag followed by its captured groups. In the case of the regex above, $matches[0] will contain the following:
array(
0 => '<img src="image1.jpg" width="200" height="200">',
1 => 'image1.jpg',
)
You can now loop through $matches as $key => $match and pass $match[1] to your $this->Image() method.
Alternatively, if you don't want to loop through, you can just access each src attribute directly from $matches as $matches[0][1], $matches[1][1], etc.
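The loop described above could look roughly like this. It is only a sketch: the $srcs array is an assumption added to collect the results, and the call to $this->Image() is left out so the example runs on its own.

```php
<?php
$content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">';

$pattern = '/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i';
$srcs = [];
if (preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[0] is the whole tag, $match[1] the captured src value.
        $srcs[] = $match[1];
    }
}
print_r($srcs);
```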
If you need to be able to access the other attributes within the tags, then I recommend using the DOMDocument method provided by @hakre on Get img src with PHP. If you just need to access the src attribute, then using preg_match_all() is faster and more efficient as it does not need to load the entire DOM of the snippet into memory as objects to provide you with the data you need.

You could build a lexer or parser to find out where the images are.
You're looking for two tokens at the beginning: <img and the respective closing >. A starting point for this could be something like this:
$text = "hello <img src='//first.jpg'> there <img src='//second.jpg'>";
$pos = 0;
while (($opening = strpos($text, '<img', $pos)) !== FALSE) {
// Find the next closing bracket's location
$closing = strpos($text, '>', $opening);
if ($closing === FALSE) break; // Unterminated tag: stop instead of looping forever
$length = ($closing - $opening) + 1; // Add one for the closing '>'
$img_tag = substr($text, $opening, $length);
var_dump($img_tag);
// Update the loop position with our closing tag to advance the lexer
$pos = $closing;
}
You're going to have to then build methods to scan for the img tags. You can also add your PDF methods in the loop, too.
Another more manageable approach could be to build a class that walks through each character. It'd first look for an opening '<' character, then check if the next three are 'img', and if so proceed to scan for the src, height, width attributes respectively. This is more work but is way more flexible – you'll be able to scan for much more than just your image tags.

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot install anything (does using DOM methods require installing anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!

Since you are new to this, I'll explain that you can use PHP's HTML parser known as DOMDocument to extract what you need. You should not use a regular expression as they are inherently error prone when it comes to parsing HTML, and can easily result in many false positives.
To start, let's say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now that we have the HTML loaded, it's time to find the elements that we need. Let's assume that you can encounter other <a> tags within your document, so we want to find those <a> tags that have a direct <img> tag as a child. Then, once we are sure we have the correct nodes, we extract the information we need. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
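The parse_url() and parse_str() step is the part that pulls the album id out of the link, so here it is in isolation. This is a sketch only: the $href value below is made up for illustration, since the actual link format is not shown in the question.

```php
<?php
// Hypothetical href; the real link format is not shown in the question.
$href = 'http://www.example.com/gallery.php?album=774&page=2';

// parse_url() returns only the query string, or null if there is none.
$query = parse_url($href, PHP_URL_QUERY);

// parse_str() turns the query string into an associative array.
$params = [];
if (is_string($query)) {
    parse_str($query, $params);
}

$album = isset($params['album']) ? $params['album'] : null;
var_dump($album);
```

Checking is_string($query) first covers links that have no query string at all, in which case $album simply stays null.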

If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png

How to Ignore Whitespaces using preg_match()

I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?

Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
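A version with the error checking mentioned above might look something like this. It is a sketch under assumptions: the $data string is a small stand-in for the real page, and taking the first <span> is only correct if that is the one you are after.

```php
<?php
// Stand-in for the page; the real $data comes from elsewhere.
$data = '<div><span class="x">ANY CONTENT</span>   (<a id="show" href="#">more</a>)</div>';

libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$dom = new DOMDocument;
$content = null;
if ($dom->loadHTML($data)) {
    $spans = $dom->getElementsByTagName('span');
    // Only read the node if at least one <span> was found.
    if ($spans->length > 0) {
        $content = trim($spans->item(0)->textContent);
    }
}
libxml_clear_errors();
var_dump($content);
```

Note that the whitespace between </span> and the opening parenthesis no longer matters here, because the text is read from the element itself.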

I just had to define the
">
prefix more precisely. There were too many "> sequences in the page, so the regex didn't know which one to choose specifically; it returned everything from an earlier "> until it hit the (
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
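To see the effect, here is a small sketch comparing the two patterns. The $basicPage string is invented for the example (the real page is not shown), and $loose and $strict are just names for the two results.

```php
<?php
// Invented sample: an earlier "> appears before the one we want,
// and there is whitespace (including a newline) before the parenthesis.
$basicPage = '<a href="x">link</a> <span class="name.">ANY CONTENT</span>
   (<a id="show" href="#">';

// Original pattern: matching starts at the first "> in the page.
preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $m1);
$loose = $m1[1];

// More specific prefix: matching starts at .">
preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $m2);
$strict = $m2[1];

var_dump($loose, $strict);
```

In both cases \s* happily skips the whitespace; what changes is where the match is allowed to begin.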

PHP text to array and with key

I don't know regular expressions well, and I have not succeeded in splitting a string into an array.
I have a string like:
<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>
So what I am trying to do is split the text string into an array where the KEY is the text from a header and the CONTENT is all the remaining content up to the next header, like:
array("some text in header" => "some other content, that belongs to header...", ...)

I would suggest looking at the PHP DOM extension (http://php.net/manual/en/book.dom.php). You can read or create a DOM from a document.

i've used this one and enjoyed it.
http://simplehtmldom.sourceforge.net/
you could do it with a regex as well.
something like this.
/<h5>(.*)<\/h5>(.*)<h5>/s
but this just finds the first situation. you'll have to cut the string to get the next one.
any way you cut it, i don't see a one liner for you. sorry.
here's a crummy broken 4 liner.
$chunks = explode("<h5>", $html);
foreach($chunks as $chunk){
list($key, $val) = explode("</h5>", $chunk);
$res[$key] = $val;
}
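For what it's worth, here is a sketch of the same idea using preg_split() rather than explode(), which avoids the empty first chunk. It assumes the headers are plain <h5>...</h5> tags without attributes, and the last line of the sample ("content of the second header") is made up so that the second key has a value.

```php
<?php
$html = '<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside
<h5>Second text header</h5>
content of the second header';

// With PREG_SPLIT_DELIM_CAPTURE the captured header text is kept in the result,
// so the array alternates: leading text, header, content, header, content, ...
$parts = preg_split('#<h5>(.*?)</h5>#s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

$res = [];
for ($i = 1; $i + 1 < count($parts); $i += 2) {
    // Odd indexes are header texts, the following even index is their content.
    $res[trim($parts[$i])] = trim($parts[$i + 1]);
}
print_r($res);
```

Keep in mind that two headers with identical text would overwrite each other, since the header text is used as the array key.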

Don't parse HTML via preg_match; use PHP's DOMDocument class instead.
Example:
<?php
$html= "<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>";
// a new DOM object
$dom = new DOMDocument('1.0', 'utf-8');
// whitespace option; set it before loading, as it has no effect afterwards
$dom->preserveWhiteSpace = false;
// load the HTML into the object
$dom->loadHTML($html);
$hFive = $dom->getElementsByTagName('h5');
echo $hFive->item(0)->nodeValue; // you can get the other h5 elements by changing the index
?>

Extract Image Sources from text in PHP - preg_match_all required

I have a little issue: my preg_match_all is not working properly.
What I want to do is extract the src attribute of all the images in the post_content from WordPress, which is a string, not a complete HTML document/DOM (thus I cannot use a document parser function).
I am currently using the code below, which is unfortunately untidy and works for only one image src, whereas I want all image sources from that string:
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
can someone please help here...

Using regular expressions to parse an HTML document can be very error-prone. In your case, for example, IMG elements are not the only ones that can have a SRC attribute (in fact, the matched text doesn't even need to be an HTML attribute at all). Besides that, the attribute value might not be enclosed in double quotes.
It is better to use an HTML DOM parser like PHP's DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}

You can use a DOM parser with HTML strings; it is not necessary to have a complete HTML document. See http://simplehtmldom.sourceforge.net/
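The same holds for PHP's built-in DOMDocument, which wraps a fragment in <html> and <body> on its own. A minimal sketch, assuming a made-up $post_content string in place of $search->post_content:

```php
<?php
// Stand-in for $search->post_content: a fragment, not a full document.
$post_content = '<p>Intro <img src="a.jpg" alt="first"></p>'
    . '<img class="x" src=\'b.png\'>'
    . '<script src="app.js"></script>';

libxml_use_internal_errors(true); // keep warnings about the markup quiet
$doc = new DOMDocument();
$doc->loadHTML($post_content);
libxml_clear_errors();

$srcs = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('src')) {
        $srcs[] = $img->getAttribute('src');
    }
}
print_r($srcs);
```

Unlike the /src="([^"]*)"/ pattern, this picks up the single-quoted src and leaves the <script> tag's src alone.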
