Extract Image Sources from text in PHP - preg_match_all required - php

I have a little issue as my preg_match_all is not running properly.
what I want to do is extract the src parameter of all the images in the post_content from the wordpress which is a string - not a complete html document/DOM (thus cannot use a document parser function)
I am currently using the below code which is unfortunately too untidy and works for only 1 image src, where I want all image sources from that string
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
can someone please help here...

Using regular expressions to parse an HTML document can be very error prone. Like in your case where not only IMG elements have an SRC attribute (in fact, that doesn’t even need to be an HTML attribute at all). Besides that, it also might be possible that the attribute value is not enclosed in double quote.
Better use a HTML DOM parser like PHP’s DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}

You can use a DOM parser with HTML strings, it is not necessary to have a complete HTML document. http://simplehtmldom.sourceforge.net/

Related

php regex to add class to images without class

I'm looking for a php regex to check if a image don't have any class, then add "img-responsive" to that image's class.
thank you.
Instead of looking to implement a regular expression, make effective use of DOM instead.
$doc = new DOMDocument;
$doc->loadHTML($html); // load the HTML data
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
if (!$img->hasAttribute('class'))
$img->setAttribute('class', 'img-responsive');
}
I would be tempted to do this in JQuery. That offers all the functionality you need in a few lines.
$(document).ready(function(){
$('img').not('img[class]').each(function(e){
$(this).addClass('img-responsive');
});
});
If you have the output in PHP then a HTML parser is the way to do it. Regular expressions will always fail, in the end. If you don't want to use a parser, but you have the HTML code you can try to do it with plain and simple PHP code:
function addClassToImagesWithout($html)
// this function does what you want, given well-formed html
{
// cut into parts where the images are
$parts = explode('<img',$html);
foreach ($parts as $key => $part)
{
// spilt at the end of tags, the image args are in the first bit
$bits = explode('>',$part);
// does it not contain a class
if (strpos($bits[0],'class=') !== FALSE)
{
// insert the class
$bits[0] .= " class='img-responsive'";
}
// recombine the bits
$part[$key] = implode('>',$bits);
}
// recombine the parts and return the html
return implode('<img',$parts);
}
this code is untested and far from perfect, but it shows that regular expressions are not needed. You will have to add in some code to catch exceptions.
I must stress that this code, just like regular expressions will ultimately fail when, for instance, you have something like id='classroom', title='we are a class apart' or similar. To do a better job you should use a parser:
http://htmlparsing.com/php.html

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot instal anything (does using DOM methods require the instal of anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!
Since you are new to this, I'll explain that you can use PHP's HTML parser known as DOMDocument to extract what you need. You should not use a regular expression as they are inherently error prone when it comes to parsing HTML, and can easily result in many false positives.
To start, lets say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now, we have that HTML loaded, it's time to find the elements that we need. Let's assume that you can encounter other <a> tags within your document, so we want to find those <a> tags that have a direct <img> tag as a child. Then, check to make sure we have the correct nodes, we need to make sure we extract the correct information. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png

get wrapping element using preg_match php

I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So i need to create a function that will return the element wrapping the string like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
//code here that has preg_match and return the wrapper element
}
The returned value will be array since it has 2 possible returning values which are <p class='text'></p> and <span></span>
Anyone can give me a regex pattern on how to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.
It's bad idea use regex for this task. You can use DOMDocument
$oDom = new DOMDocument('1.0', 'UTF-8');
$oDom->loadXML("<div>" . $sHtml ."</div>");
get_wrapper($s, $oDom);
after recursively do
function get_wrapper($s, $oDom) {
foreach ($oDom->childNodes AS $oItem) {
if($oItem->nodeValue == $s) {
//needed tag - $oItem->nodeName
}
else {
get_wrapper($s, $oItem);
}
}
}
The simple pattern would be the following, but it assumes a lot of things. Regexes shouldn't be used with these. You should look at something like the Simple HTML DOM parser which is more intelligent.
Anyway, the regex that would match the wrapper tags and surrounding html elements is as follows.
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g
Even if regex is never the correct answer in the domain of dom parsing, I came out with another (quite simple) solution
<[^>/]+?>My String</.+?>
if the html is good (ie it has closing tags, < is replaced with < & so on). This way you have in the first regex group the opening tag and in the second the closing one.

Matching SRC attribute of IMG tag using preg_match

I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
if ($val['tag'] == 'IMG') {
$first_src = $val['attributes']['SRC'];
break;
}
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
if($parser->iNodeName == 'img') {
echo $parser->iNodeAttributes['src'];
break;
}
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be executed by a dom parser because regex is dom-ignorant.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.

Syntax regex to extract the source of an image

Ahoy there!
I can't "guess" witch syntax should I use to be able to extract the source of an image but simply the web address not the src= neither the quotes?
Here is my piece of code:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
foreach($matches as $path) {
echo $path[0];
}
}
When I use it I got this printed:
src="http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg"
And I wish to get only this:
http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg
Any idea?
Thanks for your help.
Not exactly an answer to your question, but when parsing html, consider using a proper html parser:
foreach($html->find('img') as $element) {
echo $element->src . '<br />';
}
See: http://simplehtmldom.sourceforge.net/
$path[1] instead of $path[0]
echo $path[1];
$path[0] is the full string matched. $path[1] is the first grouping.
You could explode the string using " as a delimeter and then the second item in the array you get would be the right string:
$array = explode('"',$full_src);
$bit_you_want = $array[1];
Reworking your original function, it would be:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
foreach($matches as $path) {
$src = explode('"', $path);
echo $src[1];
}
}
Thanks Ithcy for his right answer.
I guess I've been too long to respond because he deleted it, I just don't know where his answer's gone...
So here is the one I've received by mail:
'|src="(.*?)"|i' makes no sense as a
regex. try '|src="([^"]+)"|i' instead.
(Which still isn't the most robust
solution but is better than what
you've got.)
Also, what everyone else said. You
want $path1, NOT $path[0]. You're
already extracting all the src
attributes into $matches[]. That has
nothing to do with $path[0]. If you're
not getting all of the src attributes
in the text, there is a problem
somewhere else in your code.
One more thing - you should use a real
HTML parser for this, because img tags
are not the only tags with src
attributes. If you're using this code
on raw HTML source, it's going to
match not just but
tags, etc.
— ithcy
I did everything he told me to do including using a HTML parser from Bart (2nd answer).
It works like a charm ! Thank you mate...

Categories