Strip directory structure in HTML - php

I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg">
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark

As a suggestion: rather than trying to parse a whole document with regex, you may be better off using something like the SimpleXML class to traverse the HTML. That way you can find the img tags and their src attributes and change them easily. Once you have the src value, explode the string on the "/" delimiter and use the last element of the resulting array as the new src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php

This is a tutorial on how to change all links in an HTML document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
$fd = FluentDOM($html, 'html')->find('//img[@src]')->each(
function ($node) {
$item = FluentDOM($node);
// overwrite src with just the file name
$item->attr('src', basename($item->attr('src')));
}
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>

If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);
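The same idea can be sketched with PHP's built-in DOMDocument, so that no extra library is needed. This is only a sketch under assumptions: the helper name strip_img_dirs() is made up for illustration, and the input is treated as an HTML fragment that libxml wraps in html/body elements while parsing.

```php
<?php
// Hypothetical helper (not from the thread): reduce every <img> src to its file name.
function strip_img_dirs(string $html): string
{
    if (trim($html) === '') {
        return $html; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            // basename() keeps only the part after the last "/".
            $img->setAttribute('src', basename($img->getAttribute('src')));
        }
    }

    // Serialize only the children of <body>, dropping the wrapper libxml added.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo strip_img_dirs('<p>Hi <img src="dir1/dir2/dir3/image1.jpg"></p>');
```

Note that loadHTML() assumes ISO-8859-1 unless the markup declares a charset, so non-ASCII content may need extra care.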

Related

PHP find string of substring using Regex

I have a web page source code that I want to use in my project. I want to use an image link in this code. So, I want to reach this link using regex in PHP.
That's it:
img src="http://imagelinkhere.com" class="image"
There is only one line like this.
My logic is to get the string between
="
and
" class="image"
characters.
How can I do that with REGEX? Thank you very much.
Don't use regex for HTML; try DOMDocument:
$html = '<html><img src="http://imagelinkhere.com" class="image" /></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$img = $dom->getElementsByTagName("img");
foreach ( $img as $v ) {
if ($v->getAttribute("class") == "image")
print($v->getAttribute("src"));
}
Output
http://imagelinkhere.com
preg_match('#(http://[^"]+)"#', $text, $matches);
var_dump($matches);
The link would be in $matches[1].
Using
/.*="(.*?)" .*/
with preg_replace leaves only the URL, taken from the first regex group ($1).
So the complete code would look like
$str='img src="http://imagelinkhere.com" class="image"';
$str=preg_replace('/.*="(.*?)" .*/','$1',$str);
echo $str;
-->
http://imagelinkhere.com
Edit:
Or just follow Baba's advice and use a DOM parser. I'll remember that regex will give you headaches when you parse HTML with it.
There are several ways to do so:
1. You can use SimpleHTML Dom Parser, which I prefer for simple HTML.
2. You can also use preg_match:
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" class="image" />';
$array = array();
preg_match( '/src="([^"]*)"/i', $foo, $array ) ;
see this thread
I can hear the sound of hooves, so I have gone with DOM parsing instead of regex.
$dom = new DOMDocument();
$dom->loadHTMLFile('path/to/your/file.html');
foreach ($dom->getElementsByTagName('img') as $img)
{
if ($img->hasAttribute('class') && $img->getAttribute('class') == 'image')
{
echo $img->getAttribute('src');
}
}
This will echo only the src attribute of an img tag that has class="image".
Try using preg_match_all, like this:
preg_match_all('/img src="([^"]*)"/', $source, $images);
That should put all the URLs of the images in the $images variable ($images[1] holds the captured URLs). The regex finds every img src occurrence in the code and captures the part between the quotes.
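To make the shape of the result concrete, here is a small sketch using the pattern from this answer. The sample markup in $source is invented for illustration.

```php
<?php
// Made-up sample input; in practice $source would hold the page source.
$source = '<p><img src="a.jpg" alt=""> text <img src="img/b.png"></p>';

// With the default PREG_PATTERN_ORDER flag:
//   $images[0] holds the full matches, e.g. img src="a.jpg"
//   $images[1] holds the first capture group, i.e. the bare URLs
$count = preg_match_all('/img src="([^"]*)"/', $source, $images);

echo $count, " image(s) found\n";
foreach ($images[1] as $url) {
    echo $url, "\n";
}
```

Because the pattern expects exactly one space between img and src, a tag such as <img alt="" src="..."> would not match; that is one reason the DOM-based answers are more robust.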

Extract Image Sources from text in PHP - preg_match_all required

I have a little issue: my preg_match_all is not working properly.
What I want to do is extract the src attribute of every image in the WordPress post_content, which is a string, not a complete HTML document/DOM (so I cannot use a document parser function).
I am currently using the code below, which is unfortunately untidy and works for only one image src, whereas I want all the image sources from that string:
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
can someone please help here...
Using regular expressions to parse an HTML document can be very error-prone. In your case, for example, IMG elements are not the only ones with an SRC attribute (in fact, the text src= does not even need to be an HTML attribute at all). Besides that, the attribute value might not be enclosed in double quotes.
Better to use an HTML DOM parser such as PHP's DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}
You can use a DOM parser with HTML strings; it is not necessary to have a complete HTML document. See http://simplehtmldom.sourceforge.net/
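As a rough illustration that a fragment is enough, the sketch below feeds a partial HTML string to DOMDocument and collects every image source into an array. The function name extract_img_srcs() and the sample $post_content string are assumptions made for this example, not part of WordPress.

```php
<?php
// Hypothetical helper: return the src of every <img> in an HTML fragment.
function extract_img_srcs(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Fragments are rarely valid documents, so keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $srcs = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $srcs[] = $img->getAttribute('src');
        }
    }
    return $srcs;
}

// Made-up fragment: two images (one single-quoted) and a script with its own src.
$post_content = 'Intro <img src="a.jpg" alt=""> more text <img src=\'img/b.png\'> <script src="x.js"></script>';
print_r(extract_img_srcs($post_content));
```

Only the two img sources are returned; the script's src is ignored and the single-quoted attribute is handled, which are the two cases the regex approach tends to get wrong.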

Matching SRC attribute of IMG tag using preg_match

I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
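The effect of the brackets can be reproduced with a short sketch that runs both patterns against the tag from the question. In a regex, [img] and [src] are character classes that each match a single character, so the greedy [^\>]* runs as far as it can and the match ends up on the r of border.

```php
<?php
$html = '<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" />';

// Pattern from the question: [img] and [src] each match ONE character.
$broken = '/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i';
// Corrected pattern from the answer: literal "img" and "src".
$fixed = '/< *img[^>]*src *= *["\']?([^"\']*)/i';

preg_match($broken, $html, $m1);
preg_match($fixed, $html, $m2);

echo $m1[1], "\n"; // 0 (the value of border, because "r=" satisfies [src] *=)
echo $m2[1], "\n"; // images/stories/otakuzoku1.jpg
```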
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
if ($val['tag'] == 'IMG') {
$first_src = $val['attributes']['SRC'];
break;
}
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
if($parser->iNodeName == 'img') {
echo $parser->iNodeAttributes['src'];
break;
}
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be executed by a dom parser because regex is dom-ignorant.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.

Syntax regex to extract the source of an image

Ahoy there!
I can't "guess" which syntax I should use to extract the source of an image: just the web address, without the src= or the quotes.
Here is my piece of code:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
foreach($matches as $path) {
echo $path[0];
}
}
When I use it I got this printed:
src="http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg"
And I wish to get only this:
http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg
Any idea?
Thanks for your help.
Not exactly an answer to your question, but when parsing html, consider using a proper html parser:
// $html is assumed to be a simple_html_dom object, e.g. from str_get_html($content)
foreach($html->find('img') as $element) {
echo $element->src . '<br />';
}
See: http://simplehtmldom.sourceforge.net/
$path[1] instead of $path[0]
echo $path[1];
$path[0] is the full string matched. $path[1] is the first grouping.
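A small sketch of what PREG_SET_ORDER returns may help here. The sample $content string is invented, and $srcs is a name introduced only for this example.

```php
<?php
// Made-up content with two images.
$content = '<img src="http://example.com/a.jpg"> <img src="http://example.com/b.jpg">';

// PREG_SET_ORDER gives one array per match: [0] = whole match, [1] = first group.
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);

$srcs = [];
foreach ($matches as $path) {
    // $path[0] is e.g. src="http://example.com/a.jpg"
    // $path[1] is e.g. http://example.com/a.jpg
    $srcs[] = $path[1];
}

echo implode("\n", $srcs), "\n";
```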
You could explode the string using " as a delimiter; the second item in the resulting array would then be the right string:
$array = explode('"',$full_src);
$bit_you_want = $array[1];
Reworking your original function, it would be:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
foreach($matches as $path) {
$src = explode('"', $path[0]); // $path is an array; $path[0] is the full match
echo $src[1];
}
}
Thanks to ithcy for the correct answer.
I guess I took too long to respond, because he deleted it and I don't know where his answer went.
So here is the one I've received by mail:
'|src="(.*?)"|i' makes no sense as a
regex. try '|src="([^"]+)"|i' instead.
(Which still isn't the most robust
solution but is better than what
you've got.)
Also, what everyone else said. You
want $path[1], NOT $path[0]. You're
already extracting all the src
attributes into $matches[]. That has
nothing to do with $path[0]. If you're
not getting all of the src attributes
in the text, there is a problem
somewhere else in your code.
One more thing - you should use a real
HTML parser for this, because img tags
are not the only tags with src
attributes. If you're using this code
on raw HTML source, it's going to
match not just <img> tags but other
tags with src attributes, etc.
— ithcy
I did everything he told me to do including using a HTML parser from Bart (2nd answer).
It works like a charm ! Thank you mate...

Extract node from XML like data without extra PHP libs

I am returned the following:
<links>
<image_link>http://img357.imageshack.us/img357/9606/48444016.jpg</image_link>
<thumb_link>http://img357.imageshack.us/img357/9606/48444016.th.jpg</thumb_link>
<ad_link>http://img357.imageshack.us/my.php?image=48444016.jpg</ad_link>
<thumb_exists>yes</thumb_exists>
<total_raters>0</total_raters>
<ave_rating>0.0</ave_rating>
<image_location>img357/9606/48444016.jpg</image_location>
<thumb_location>img357/9606/48444016.th.jpg</thumb_location>
<server>img357</server>
<image_name>48444016.jpg</image_name>
<done_page>http://img357.imageshack.us/content.php?page=done&l=img357/9606/48444016.jpg</done_page>
<resolution>800x600</resolution>
<filesize>38477</filesize>
<image_class>r</image_class>
</links>
I wish to extract the image_link in PHP as simply and as easily as possible. How can I do this?
Assume, I can not make use of any extra libs/plugins for PHP. :)
Thanks all
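In Josh's answer, the problem was the unescaped "/" character. With that fixed (and with only the i modifier, since PHP's preg functions have no g modifier), the code Josh submitted would become:
$text = 'string_input';
preg_match('/<image_link>([^<]+)<\/image_link>/i', $text, $regs);
$result = $regs[1]; // $regs[0] is the whole match; $regs[1] is the captured link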
Taking usoban's answer, an example would be:
<?php
// Load the file into $content
$xml = new SimpleXMLElement($content) or die('Error creating a SimpleXML instance');
$imagelink = (string) $xml->image_link; // This is the image link
?>
I recommend using SimpleXML because it's very easy and, as usoban said, it's built in, which means it doesn't need any external libraries.
You can use SimpleXML, as it is built into PHP.
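For a self-contained variant, the sketch below uses simplexml_load_string(), which returns false on a parse failure instead of throwing, so the error check is a plain comparison. The XML here is a trimmed copy of the response from the question (only three elements kept), and the variable names are chosen for illustration.

```php
<?php
// Trimmed copy of the response from the question.
$content = <<<XML
<links>
<image_link>http://img357.imageshack.us/img357/9606/48444016.jpg</image_link>
<thumb_link>http://img357.imageshack.us/img357/9606/48444016.th.jpg</thumb_link>
<image_name>48444016.jpg</image_name>
</links>
XML;

// Keep libxml warnings out of the output; failures are detected via the return value.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($content);
if ($xml === false) {
    die('Could not parse the XML');
}

// Child elements are exposed as properties; cast to string to get the text.
$imageLink = (string) $xml->image_link;
echo $imageLink, "\n";
```

One caveat: the XML must be well formed. A bare & inside a URL, for example, has to arrive as &amp; or parsing will fail.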
use regular expressions
$text = 'string_input';
preg_match('/<image_link>([^<]+)</image_link>/gi', $text, $regs);
$result = $regs[0];
