Syntax regex to extract the source of an image - php

Ahoy there!
I can't figure out which syntax I should use to extract the source of an image: just the web address, without the src= or the quotes.
Here is my piece of code:
function get_all_images_src() {
    $content = get_the_content();
    preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
    foreach($matches as $path) {
        echo $path[0];
    }
}
When I use it, this is what gets printed:
src="http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg"
And I wish to get only this:
http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg
Any idea?
Thanks for your help.

Not exactly an answer to your question, but when parsing html, consider using a proper html parser:
foreach($html->find('img') as $element) {
echo $element->src . '<br />';
}
See: http://simplehtmldom.sourceforge.net/

$path[1] instead of $path[0]

echo $path[1];
$path[0] is the full string matched. $path[1] is the first grouping.
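To see the difference between the two indices, here is a minimal sketch. The sample $content string is made up for illustration; in the question it comes from WordPress's get_the_content().

```php
<?php
// Hypothetical sample content; the question reads it from get_the_content().
$content = '<p><img src="http://example.com/a.jpg" alt="a"> <img src="http://example.com/b.png"></p>';

// With PREG_SET_ORDER each element of $matches describes one match:
// index 0 is the whole matched text, index 1 is the first capture group.
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);

$sources = [];
foreach ($matches as $path) {
    $sources[] = $path[1]; // only the URL, without src= or the quotes
}

echo implode("\n", $sources), "\n";
```

Here $matches[0][0] would be src="http://example.com/a.jpg" while $matches[0][1] is just the URL.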

You could explode the string using " as a delimiter; the second item in the resulting array would then be the string you want:
$array = explode('"',$full_src);
$bit_you_want = $array[1];
Reworking your original function, it would be:
function get_all_images_src() {
    $content = get_the_content();
    preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
    foreach($matches as $path) {
        // $path is an array; $path[0] holds the full match, e.g. src="http://..."
        $src = explode('"', $path[0]);
        echo $src[1];
    }
}

Thanks to ithcy for the right answer.
I guess I took too long to respond, because he deleted it and I don't know where his answer has gone.
So here is the copy I received by mail:
'|src="(.*?)"|i' makes no sense as a
regex. try '|src="([^"]+)"|i' instead.
(Which still isn't the most robust
solution but is better than what
you've got.)
Also, what everyone else said. You
want $path[1], NOT $path[0]. You're
already extracting all the src
attributes into $matches[]. That has
nothing to do with $path[0]. If you're
not getting all of the src attributes
in the text, there is a problem
somewhere else in your code.
One more thing - you should use a real
HTML parser for this, because img tags
are not the only tags with src
attributes. If you're using this code
on raw HTML source, it's going to
match not just <img> but also other tags
that have a src attribute, such as <script>
tags, etc.
— ithcy
I did everything he told me to do, including using an HTML parser as suggested by Bart (2nd answer).
It works like a charm! Thank you, mate.

Related

How to use htmlspecialchars only on <code></code> tags.

I'm using WordPress and would like to create a function that applies the PHP function htmlspecialchars only to code contained between <code></code> tags. I appreciate this may be fairly simple but I'm new to PHP and can't find any references on how to do this.
So far I have the following:
function FilterCodeOnSave( $content, $post_id ) {
return htmlspecialchars($content, ENT_NOQUOTES);
}
Obviously the above is very simple and performs htmlspecialchars on the entire content of my page. I need to limit the function to only apply to the HTML between code tags (there may be multiple code tags on each page).
Any help would be appreciated.
Thanks,
James
EDIT: updated to avoid multiple CODE tags
Try this:
<?php
// test data
$textToScan = "Hi <code>test12</code><br>
Line 2 <code><br>
Test <b>Bold</b><br></code><br>
Test
";
// debug:
echo $textToScan . "<hr>";
// the regex pattern (i = case-insensitive, s = the dot also matches newlines)
$search = "~<code>(.*?)</code>~is";
// first look for all CODE tags and their content
preg_match_all($search, $textToScan, $matches);
//print_r($matches);
// now replace all the CODE tags and their content with a htmlspecialchars() content
foreach($matches[1] as $match){
$replace = htmlspecialchars($match);
// now replace the previously found CODE block
$textToScan = str_replace($match, $replace, $textToScan);
}
// output result
echo $textToScan;
?>
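One caveat with the str_replace() approach above: it replaces every occurrence of the matched text, including identical copies that sit outside the <code> tags. A possible alternative, sketched here with preg_replace_callback(), rewrites each block in place. The function name escape_code_blocks is made up for this example, and the pattern assumes plain <code> tags without attributes.

```php
<?php
// Hypothetical helper: escape only what sits between <code> and </code>.
function escape_code_blocks($content)
{
    return preg_replace_callback(
        '~(<code>)(.*?)(</code>)~is', // i: case-insensitive, s: "." also matches newlines
        function ($m) {
            // $m[1] and $m[3] are the tags (kept as written), $m[2] is the inner text.
            return $m[1] . htmlspecialchars($m[2], ENT_NOQUOTES) . $m[3];
        },
        $content
    );
}

$result = escape_code_blocks('<b>Bold</b> <code><b>Bold</b></code>');
echo $result, "\n"; // <b>Bold</b> <code>&lt;b&gt;Bold&lt;/b&gt;</code>
```

Because the callback only ever sees the text captured inside a <code> block, the <b> tag outside the block is left untouched.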
Use DOMDocument to get all <code> tags:
// in this example $dom is an instance of DOMDocument
$code_tags = $dom->getElementsByTagName('code');
if ($code_tags) {
foreach ($code_tags as $node) {
// [...]
}
// [...]
}
I know this is a little bit late, but you can call the htmlspecialchars function first and then, when outputting, call the htmlspecialchars_decode function.

Grab text between specific tags in PHP file

Sorry if this question has already been answered elsewhere. I looked through stack overflow and couldn't find exactly what I was looking for.
I need to know how to scan multiple php files in a single directory (test/ for example), and extract text between specific "tagged" areas on each php file.
Example of "tagged" areas:
<?
/*
{('test1')}
*/
?>
<div>text here</div>
<?
/*
{('test2')}
*/
?>
And the code would display test1, test2, etc. and ignore anything else. I tried looking into fopen(), file_get_contents and preg_match_all but each time they only find the first occurrence and not every occurrence of the "tagged" areas. Any help would be great!
EDIT - WHAT I CURRENTLY HAVE:
foreach (glob("templates/*.php") as $fn) {
    $file = file_get_contents($fn);
    preg_match_all("#\{\('(\w+)'\)}#", $file, $matches);
    $variable = join('', $matches[1]);
    echo $variable.'<br />';
}
How do I add array_chunk to this so that each iteration of test is echoed as its own variable instead of being grouped into an array? I tried this:
$variable = array_chunk($matches[1],1);
with no success; it just prints "Array". Any help would be great. I will post a new question if I don't get a response.
This is how you would escape the regex:
foreach (glob("template/*.php") as $fn) {
$file = file_get_contents($fn);
preg_match_all("#\{\('(\w+)'\)}#", $file, $matches);
print_r($matches);
}
Eugen has shown how to match the PHP/PI <? tags and /* comment sections as well. You may just need \s* in between those.
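As for echoing each tag on its own: $matches[1] is already a flat array of the captured names, so a plain foreach is probably all that is needed. array_chunk($matches[1], 1) only wraps each value in another array, which is why echo prints "Array". A minimal sketch, with an inline sample string standing in for file_get_contents():

```php
<?php
// Sample file contents (an assumption for illustration).
$file = "<?\n/*\n{('test1')}\n*/\n?>\n<div>text here</div>\n<?\n/*\n{('test2')}\n*/\n?>\n";

// Same pattern as in the question; group 1 captures the name between the quotes.
preg_match_all("#\{\('(\w+)'\)}#", $file, $matches);

$names = $matches[1]; // flat list of captured names
foreach ($names as $name) {
    echo $name, "<br />\n";
}
```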
$filepattern='test/*.php';
$tagpattern='/\<\?\n\/\*\n\{\(\'([^\']+)\'\)\}\n\*\/\n\?\>/';
$files=glob($filepattern);
foreach ($files as $file) {
$content=file_get_contents($file);
$count=preg_match_all($tagpattern,$content,$matches);
if ($count<1) continue;
//Whatever you want to do with the matches!
foreach ($matches[1] as $match) echo "$file: $match\n";
}
Regex is unsuitable and slow for this. Try PHP DOM or this nice library:
http://simplehtmldom.sourceforge.net/

Strip directory structure in HTML

I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg">
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark
As a suggestion, rather than trying to parse a whole document with regex, you may be better off using something like the SimpleXML class to traverse the HTML. That way you can find the img tags and their src attributes and change them easily. (Note that SimpleXML expects well-formed XML, so this assumes the HTML is valid as XML.) Once you have the src value, explode the string on the "/" delimiter and use the last element of the resulting array as the new src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php
This is a tutorial on how to change all links in an HTML document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
// find every img element that has a src attribute
$fd = FluentDOM($html, 'html')->find('//img[@src]')->each(
    function ($node) {
        $item = FluentDOM($node);
        // keep only the file name and write it back to src
        $item->attr('src', basename($item->attr('src')));
    }
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>
If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);
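Putting the two ideas together (a DOM parser to find the tags, then stripping the path), here is a rough sketch that uses PHP's built-in DOMDocument and basename() rather than SimpleXML or FluentDOM. The function name strip_img_dirs is an assumption made for this example, and the sketch assumes a non-empty HTML fragment.

```php
<?php
// Hypothetical helper: rewrite every <img src> to its bare file name.
function strip_img_dirs($html)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            // basename() drops everything up to the last slash
            $img->setAttribute('src', basename($img->getAttribute('src')));
        }
    }

    // loadHTML() wraps fragments in <html><body>; serialize only the body's children.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = strip_img_dirs('<p>Hi <img src="dir1/dir2/dir3/image1.jpg"></p>');
echo $result, "\n";
```

Unlike the regex, this only touches src attributes that belong to img elements, so other tags and plain text containing slashes are left alone.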

Extract Image Sources from text in PHP - preg_match_all required

I have a little issue: my preg_match_all is not working properly.
What I want to do is extract the src attribute of all the images in the post_content from WordPress, which is a string, not a complete HTML document/DOM (so I cannot use a document parser function).
I am currently using the code below, which is unfortunately too untidy and works for only one image src, whereas I want all image sources from that string:
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
Can someone please help here?
Using regular expressions to parse an HTML document can be very error-prone. In your case, for example, IMG elements are not the only ones that have an SRC attribute (in fact, the matched text doesn't even need to be an HTML attribute at all). Besides that, the attribute value might not be enclosed in double quotes.
Better to use an HTML DOM parser like PHP's DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}
You can use a DOM parser with HTML strings; it is not necessary to have a complete HTML document. http://simplehtmldom.sourceforge.net/
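If a regex is still preferred, the likely reason the original loop handles only one image is the default PREG_PATTERN_ORDER flag: $matches[0] holds all full matches and $matches[1] holds all captured values, so iterating over $matches itself visits those two lists rather than one entry per image. A minimal sketch, with a made-up $post_content standing in for $search->post_content and a pattern that still assumes double-quoted attributes:

```php
<?php
// Sample post content (an assumption; the question uses $search->post_content).
$post_content = '<p><img src="a.jpg" alt=""> text <img class="x" src="b.png"></p>';

// Default flag is PREG_PATTERN_ORDER:
//   $matches[0] = all full matches, $matches[1] = all first-group captures.
// The pattern is anchored to <img so src attributes of other tags are skipped.
preg_match_all('/<img\b[^>]*\ssrc="([^"]*)"/i', $post_content, $matches);

$sources = $matches[1];
foreach ($sources as $src) {
    echo $src, "\n";
}
```

This keeps the regex approach but inherits its limits: single-quoted or unquoted attribute values would be missed.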

Matching SRC attribute of IMG tag using preg_match

I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
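The reason the brackets matter: in a regex, [img] is a character class that matches a single character (i, m or g), not the literal word img. With [src] the greedy [^>]* then backtracks to the last s, r or c that is followed by =, which here is the final r of border. A quick sketch comparing the two patterns on the tag from the question:

```php
<?php
$html = '<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" />';

// Original pattern: [img] and [src] each match ONE character from the set.
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $html, $broken);

// Corrected pattern from the answer: img and src are literal words.
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $html, $fixed);

echo $broken[1], "\n"; // 0  (the value of border, as reported in the question)
echo $fixed[1], "\n";  // images/stories/otakuzoku1.jpg
```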
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
if ($val['tag'] == 'IMG') {
$first_src = $val['attributes']['SRC'];
break;
}
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
if($parser->iNodeName == 'img') {
echo $parser->iNodeAttributes['src'];
break;
}
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be handled by a DOM parser, because regex is DOM-unaware.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.
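One thing the one-liner does not cover is input without any img tag: item(0) then returns null and the getAttribute() call fails. A slightly more defensive sketch follows; the helper name first_img_src is an assumption for this example, not part of the answer.

```php
<?php
// Hypothetical helper: return the first <img> src in an HTML fragment, or null.
function first_img_src($html)
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $dom->getElementsByTagName('img')->item(0);
    if ($img === null || !$img->hasAttribute('src')) {
        return null;
    }
    return $img->getAttribute('src');
}

var_dump(first_img_src('<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'));
var_dump(first_img_src('<p>no images here</p>'));
```

The first call yields "source1" as in the answer above; the second yields NULL instead of a fatal error.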
