PHP find string of substring using Regex - php

I have a web page source code that I want to use in my project. I want to use an image link in this code. So, I want to reach this link using regex in PHP.
That's it:
img src="http://imagelinkhere.com" class="image"
There is only one line like this.
My logic is to get the string between
="
and
" class="image"
characters.
How can I do that with REGEX? Thank you very much.

Don't use regex for HTML; try DOMDocument:
$html = '<html><img src="http://imagelinkhere.com" class="image" /></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$img = $dom->getElementsByTagName("img");
foreach ( $img as $v ) {
if ($v->getAttribute("class") == "image")
print($v->getAttribute("src"));
}
Output
http://imagelinkhere.com

preg_match('#(http://.*?)"#', $text, $matches);
var_dump($matches);
The link would be in $matches[1].
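For the exact case in the question (the text between =" and " class="image"), a minimal sketch could anchor the pattern on both sides. The variable names $text and $url are illustrative, and the sketch assumes the attributes appear in exactly this order with double quotes.

```php
<?php
// Sample input copied from the question; assumes src sits directly before class="image".
$text = 'img src="http://imagelinkhere.com" class="image"';

// '#' is the delimiter, so the slashes inside a URL need no escaping.
// [^"]* captures everything up to the next double quote.
$url = null;
if (preg_match('#src="([^"]*)"\s+class="image"#', $text, $matches)) {
    $url = $matches[1]; // group 1 holds the text between the quotes
}

echo $url, PHP_EOL;
```

If the attribute order or quoting can vary, this pattern stops matching, which is the usual argument for a DOM parser.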

Using
.*?="(.*?)" .*
with preg_replace gives you only the URL in the first regex group ($1).
So the complete code would look like
$str='img src="http://imagelinkhere.com" class="image"';
$str=preg_replace('/.*?="(.*?)" .*/','$1',$str);
echo $str;
-->
http://imagelinkhere.com
Edit:
Or just follow Baba's advice and use a DOM parser. I'll keep in mind that parsing HTML with regex gives you headaches.

There are several ways to do this:
1. You can use SimpleHTML Dom Parser, which I prefer for simple HTML.
2. You can also use preg_match:
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" class="image" />';
$array = array();
preg_match( '/src="([^"]*)"/i', $foo, $array ) ;
see this thread

I can hear the sound of hooves, so I have gone with DOM parsing instead of regex.
$dom = new DOMDocument();
$dom->loadHTMLFile('path/to/your/file.html');
foreach ($dom->getElementsByTagName('img') as $img)
{
if ($img->hasAttribute('class') && $img->getAttribute('class') == 'image')
{
echo $img->getAttribute('src');
}
}
This will echo only the src attribute of an img tag with a class="image"

Try using preg_match_all, like this:
preg_match_all('/img src="([^"]*)"/', $source, $images);
That should put all the URLs of the images in the $images variable (the captured values end up in $images[1]). The regex finds every img src occurrence in the code and captures the part between the quotes.
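As a rough illustration of what preg_match_all returns here, the sketch below runs the same pattern on a made-up two-image string. The $source content and the $urls name are assumptions for the example only.

```php
<?php
// Hypothetical page source with two images, for illustration only.
$source = '<p><img src="a.png" alt=""> text <img src="b/c.jpg" alt=""></p>';

preg_match_all('/img src="([^"]*)"/', $source, $images);

// $images[0] holds the full matches; $images[1] holds the captured URLs.
$urls = $images[1];
print_r($urls);
```

Note that the pattern only matches tags where src immediately follows img, so a tag such as <img alt="" src="..."> would be skipped.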

Related

php regular expression to remove unwanted code

The editor I am using is adding extraneous coding that I would like to remove via php before writing to the database.
The code looks like this:
<img style="width: 250px;" src="files/school-big.jpg" data-cke-saved-src="files/school-big.jpg" alt="">
<img style="width: 250px;" src="files/firemen.jpg" data-cke-saved-src="files/firemen.jpg" alt="">
What I need to get rid of is the data-cke-saved-src="files/image-name". My understanding of regex is somewhere below weak so how would I build a regex to grab the image name without grabbing the end of the line or the rest of the content?
Thank you kindly,
Try this:
$data = preg_replace('#\s(data-cke-saved-src)="[^"]+"#', '', $data);
Or do it in jQuery before going into PHP with this:
$('img').removeAttr('data-cke-saved-src')
Try adding and using this function:
/*
*I am assuming you get all the data in a single variable.
*/
function remove_data_cke($text) {
// Get all data-cke-saved-src="..." tags from the html.
$result = array();
preg_match_all('|data-cke-saved-src="[^"]*"|U', $text, $result);
// Replace all occurrences with an empty string.
foreach($result[0] as $data_cke) {
$text = str_replace($data_cke, '', $text);
}
return $text;
}
You can use DOM to easily remove the attribute:
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data; @ suppresses parser warnings
foreach ($doc->getElementsByTagName('img') as $img) {
$img->removeAttribute('data-cke-saved-src');
}
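To get the cleaned markup back out as a string before writing it to the database, one possible approach is sketched below. It assumes the editor output is a fragment rather than a full page, so it serializes only the children of the body element that loadHTML() adds; the $clean variable is a name chosen for this example.

```php
<?php
// One of the sample lines from the question.
$html = '<img style="width: 250px;" src="files/school-big.jpg" data-cke-saved-src="files/school-big.jpg" alt="">';

$doc = new DOMDocument;
@$doc->loadHTML($html); // @ suppresses warnings about imperfect markup

foreach ($doc->getElementsByTagName('img') as $img) {
    $img->removeAttribute('data-cke-saved-src');
}

// loadHTML() wraps fragments in <html><body>, so serialize only the body's children.
$clean = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $clean .= $doc->saveHTML($node);
}

echo $clean, PHP_EOL;
```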

Error when get src from first image using php?

I have a sample code:
$content = 'I have a image <img border="0" alt="581.jpg - 58.03 KB" src="581.jpg">';
And php
preg_match('/<img.+src=[\'"](?P<src>.+)[\'"].*>/i', $content, $image);
echo $image[0];
The result is: 581.jpg" border="0" alt="581.jpg - . How can I fix it?
Writing a regex for this is ... problematic to say the least. I would recommend using this:
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach ($dom->getElementsByTagName('img') as $node) {
echo $node->getAttribute('src') . PHP_EOL;
}
Explanation:
The reason you shouldn't use regex here is that HTML markup varies. The position of the src attribute can differ, it may use single quotes instead of double quotes (some HTML attributes don't need quotes at all; for example, <img class=logo /> is valid), the tag may be uppercase, and there are probably other issues I can't think of right now.
Extra info:
Grabbing the href attribute of an A element
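To make the variations listed in the explanation concrete, here is a small sketch that feeds three differently written img tags through DOMDocument. The sample markup and the $sources name are invented for this illustration.

```php
<?php
// Three hypothetical variants of the same tag that trip up naive patterns:
// uppercase with single quotes, src not in first position, and an unquoted value.
$content = '<IMG SRC=\'a.jpg\' alt="x"><img alt="y" src="b.jpg"><img src=c.jpg>';

$dom = new DOMDocument;
@$dom->loadHTML($content); // @ suppresses warnings about imperfect markup

$sources = [];
foreach ($dom->getElementsByTagName('img') as $node) {
    $sources[] = $node->getAttribute('src');
}

print_r($sources);
```

The parser normalizes tag and attribute names to lowercase, so the same lookup works for all three forms.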

Strip directory structure in HTML

I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg">
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark
As a suggestion, rather than trying to parse a whole document with regex, you may be better off using something like the SimpleXML class to traverse the HTML. That way you can find the img tags and their src attributes and change them easily. After that, you can explode the src string on the "/" delimiter and use the last element of the resulting array as the new src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php
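A minimal sketch of the same idea follows, with two substitutions that are not part of the suggestion above: it uses DOMDocument instead of SimpleXML (it is more forgiving of HTML that is not well-formed XML), and basename() instead of explode() to take the last path segment.

```php
<?php
$html = '<img src="dir1/dir2/dir3/image1.jpg">';

$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about imperfect markup

// Rewrite every src so that only the file name remains.
foreach ($dom->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', basename($img->getAttribute('src')));
}

$src = $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
echo $src, PHP_EOL;
```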
This is a tutorial on how to change all links in an HTML document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
$fd = FluentDOM($html, 'html')->find('//img[@src]')->each(
function ($node) {
$item = FluentDOM($node);
// Write the stripped file name back to src, the attribute it was read from.
$item->attr('src', basename($item->attr('src')));
}
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>
If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);

Extract Image Sources from text in PHP - preg_match_all required

I have a little issue: my preg_match_all is not working properly.
What I want to do is extract the src attribute of all the images in the WordPress post_content, which is a string and not a complete HTML document/DOM (so I cannot use a document parser function).
I am currently using the code below, which is unfortunately too untidy and works for only one image src, whereas I want all the image sources from that string:
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
can someone please help here...
Using regular expressions to parse an HTML document can be very error-prone. In your case, for example, IMG elements are not the only ones that can have an SRC attribute (in fact, the matched text doesn’t even need to be an HTML attribute at all). Besides that, the attribute value might not be enclosed in double quotes.
It is better to use an HTML DOM parser like PHP’s DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}
You can use a DOM parser with HTML strings; it is not necessary to have a complete HTML document. http://simplehtmldom.sourceforge.net/
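If regex has to stay, the loop in the question likely prints one value because, by default, preg_match_all groups results by capture group rather than by match. A sketch using the PREG_SET_ORDER flag, which gives one array per matched image, might look like this; $post_content is a made-up stand-in for $search->post_content, and $srcs is a name chosen for the example.

```php
<?php
// Hypothetical stand-in for $search->post_content.
$post_content = '<p><img src="one.jpg" alt=""></p><p><img src="two.jpg" alt=""></p>';

// PREG_SET_ORDER makes $matches a list with one entry per match:
// $match[0] is the full match, $match[1] is the captured src value.
preg_match_all('/src="([^"]*)"/', $post_content, $matches, PREG_SET_ORDER);

$srcs = [];
foreach ($matches as $match) {
    $srcs[] = $match[1];
    echo $match[1], PHP_EOL;
}
```

The caveats above still apply: any element with a src attribute matches, not just img.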

Matching SRC attribute of IMG tag using preg_match

I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
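The brackets matter because [img] and [src] are character classes that match a single character each, not the literal words. In the original pattern, [src] can be satisfied by the final r of border, which is why the captured value was 0. The sketch below shows the difference on the tag from the question; the variable names are illustrative.

```php
<?php
$tag = '<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" />';

// [img] is a character class: it matches ONE character that is i, m, or g.
$classMatches   = preg_match('/^[img]$/', 'm'); // a single "m" satisfies the class
$literalMatches = preg_match('/^img$/', 'm');   // the literal needs the whole word "img"

// Original pattern from the question: captures the border value.
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $tag, $broken);

// Corrected pattern: captures the src value.
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $tag, $fixed);

echo $broken[1], PHP_EOL;
echo $fixed[1], PHP_EOL;
```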
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
if ($val['tag'] == 'IMG') {
$first_src = $val['attributes']['SRC'];
break;
}
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
if($parser->iNodeName == 'img') {
echo $parser->iNodeAttributes['src'];
break;
}
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be executed by a dom parser because regex is dom-ignorant.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.
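One caveat: item(0) returns null when the text contains no img tag, and calling getAttribute() on null is an error. A possible guard, wrapped in a hypothetical helper named firstImageSrc() and assuming PHP 7.1 or later for the nullable return type, could look like this:

```php
<?php
// Hypothetical helper: returns the first img src, or null if there is no img tag.
// Assumes $html is a non-empty string (loadHTML() rejects empty input).
function firstImageSrc(string $html): ?string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings about imperfect markup

    $img = $dom->getElementsByTagName('img')->item(0);

    return $img !== null ? $img->getAttribute('src') : null;
}

$first = firstImageSrc('<div>test</div><img src="source1"><p>text</p><img src="source2"><br>');
$none  = firstImageSrc('<p>no images here</p>');

var_dump($first, $none);
```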
