Get also non-matching strings from preg_match_all - php

I have some text with <img> tags in it that I need to divvy up. It's in the format
<img.../> Text text text <img.../>text text text<img.../> text text text
I have my regex working in preg_match_all so that I get
Array
(
[0] => Array
(
[0] => <img ... />
[1] => <img ... />
[2] => <img ... />
[3] => <img ... />
)
But it would be really nice if I could get
Array
(
[0] => Array
(
[0] => <img ... />
[1] => text text text
[2] => <img ... />
[3] => text text text
[4] => <img ... />
[5] => text text text
)
I've tried a few things, but I really don't have a good understanding of PCREs. I don't want to use preg_split if I can avoid it, because each of the image tags is different.
(I understand that a general HTML parser cannot be written with regular expressions, but in this case I think this will work, because the input data that I'm working with is in the form I described. There aren't going to be any nested image tags that I'll need to worry about.)
PS I've tried /!<img.+>/, /!(<img.+>)/, and /(!(<img.+>))/ to get the non-matches, but it returns an empty array. I don't know a good way to debug regexes to know what I'm doing wrong.

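I don't know what your issue (or actual code, for that matter) is, but:
// The capturing group is what makes PREG_SPLIT_DELIM_CAPTURE return the tags;
// PREG_SPLIT_NO_EMPTY drops the empty piece before the first tag.
$r = preg_split('#(<img[^>]+>)#', $source, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);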
results in:
Array
(
[0] => <img.../>
[1] => Text text text
[2] => <img.../>
[3] => text text text
[4] => <img.../>
[5] => text text text
)
Instead of a general pattern, you can of course keep using your fixed strings (I presume) with #(<img1>|<img2>|<img3>)#.
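A minimal, runnable sketch of the preg_split approach follows. The sample string and its file names are made up for illustration, and the trim step is an extra that is not part of the answer above.

```php
<?php
// Hypothetical sample input modelled on the format described in the question.
$source = '<img src="a.png"/> Text one <img src="b.png"/>text two<img src="c.png"/> text three';

// The parentheses around the whole pattern matter: PREG_SPLIT_DELIM_CAPTURE
// only returns delimiter text that was captured by a group.
$parts = preg_split(
    '#(<img[^>]+>)#',
    $source,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// Trim surrounding whitespace from each piece (the tags are unaffected).
$parts = array_map('trim', $parts);

print_r($parts);
```

Without PREG_SPLIT_NO_EMPTY the result would start with an empty string, because the input begins with a tag.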

You can get the information you want, just not quite in the right format, by doing this:
preg_match_all('~(<img[^>]*>)([^<]+)~', $str, $matches);
// If your "text text text" areas contain other HTML tags, use this instead:
preg_match_all('~(<img[^>]*>)(.+?)(?=<img|$)~', $str, $matches);
At this point, $matches[0] contains the entire matched strings, $matches[1] contains all of the matches from the first set of parentheses, and $matches[2] contains all of the matches from the second set of parentheses.
Array (
[0] => Array (
[0] => <img.../> Text text text
[1] => <img.../>text text text
[2] => <img.../> text text text
)
[1] => Array (
[0] => <img.../>
[1] => <img.../>
[2] => <img.../>
)
[2] => Array (
[0] => Text text text
[1] => text text text
[2] => text text text
)
)
Now if you really need it formatted the way you would like, just add these lines of code:
$answer = array();
foreach ($matches[0] as $i => $match) {
    $answer[] = $matches[1][$i];
    $answer[] = $matches[2][$i];
}
$answer now looks like this:
Array (
[0] => <img ... />
[1] => Text text text
[2] => <img ... />
[3] => text text text
[4] => <img ... />
[5] => text text text
)
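As a variation on the loop above, the PREG_SET_ORDER flag of preg_match_all groups the results per match, which makes the interleaving a little more direct. This is only a sketch: the sample string is hypothetical and the trim call is an addition.

```php
<?php
// Hypothetical sample input.
$str = '<img src="a.png"/> Text one <img src="b.png"/>text two<img src="c.png"/> text three';

// With PREG_SET_ORDER each element of $matches is one complete match:
// [0] => whole match, [1] => the tag, [2] => the text after it.
preg_match_all('~(<img[^>]*>)([^<]+)~', $str, $matches, PREG_SET_ORDER);

$answer = [];
foreach ($matches as $m) {
    $answer[] = $m[1];
    $answer[] = trim($m[2]);
}

print_r($answer);
```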

Related

Retrieving text outside square brackets in PHP

I need some way of capturing the text outside square brackets. For example, given the following string:
My [ground]name[test]Jhon[random]petor [shorts].
I'm using the preg_match_all expression below, but the result is not what I expected:
preg_match_all("/\[[^\]]*\]/", $text, $matches);
It gives me the text within the square brackets.
Result:
Array (
[0] => [ground]
[1] => [test]
[2] => [random]
[3] => [shorts]
)
Expected output:
Array (
[0] => [My]
[1] => [name]
[2] => [Jhon]
[3] => [petor]
)
Any help would be great.
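You can extend the pattern by adding \K, which resets the start of the reported match (discarding what has been matched so far), and then using an alternation to match one or more word characters. The bracket alternative therefore produces empty matches, which array_filter removes in the code below.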
\[[^][]+]\K|\w+
See a regex demo
$re = '/\[[^][]+]\K|\w+/';
$str = 'My [ground]name[test]Jhon[random]petor [shorts].';
preg_match_all($re, $str, $matches);
print_r(array_values(array_filter($matches[0])));
Output
Array
(
[0] => My
[1] => name
[2] => Jhon
[3] => petor
)
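The \K alternative reports an empty match for every bracket group, which is why array_filter is needed above. A possible alternative, sketched here with the PCRE backtracking verbs (*SKIP)(*F) rather than the \K technique from the answer, skips the bracket groups outright so that no filtering is required.

```php
<?php
$str = 'My [ground]name[test]Jhon[random]petor [shorts].';

// A bracketed group is matched and then deliberately failed; (*SKIP) makes the
// engine resume after the closing bracket, so only \w+ outside brackets matches.
preg_match_all('/\[[^][]*](*SKIP)(*F)|\w+/', $str, $matches);

print_r($matches[0]);
```

Both versions ignore the trailing period, since it is not a word character.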

regex for html attributes, need fix

I need to fix this regex, which extracts HTML attributes into an array for me via the preg_match_all function in PHP:
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
An example of the attributes:
style="width: 462px;" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=" data-filename="Screenshot from 2016-02-09 21:54:47.png"
Working example in a fiddle: https://regex101.com/r/QE9XGD/1
Because of the equals sign at the end of the src attribute, I get the wrong array:
Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=" data-filename="
)
[1] => Array
(
[0] => style
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......
)
[2] => Array
(
[0] => width: 462px;
[1] => data-filename=
)
)
The correct array should look like this:
Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......="
[2] => data-filename="Screenshot from 2016-02-09 21:54:47.png"
)
[1] => Array
(
[0] => style
[1] => src
[2] => data-filename
)
[2] => Array
(
[0] => width: 462px;
[1] => data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=
[2] => Screenshot from 2016-02-09 21:54:47.png
)
)
How can I fix this regex to get the correct result?
Keep in mind that I use this regex not just for extracting image attributes; it is a universal regex for all types of HTML tags.
(\S+?)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
The change is to make the attribute name evaluation lazy, so it only eats until it finds an =.
Working example on regex101
That being said, I'm fairly confident this regex can be reduced.
([^\s=]+)=('?)("?)([^>"']*)\2\3 is probably the best option:
It takes about 2% of the time of lazy evaluation and will do both singly and doubly quoted attributes. The big change here is the capture groups you want are the 1st and 4th. As far as I'm aware this will work on any html except: tag='"value'
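To illustrate the reduced pattern, here is a small sketch. The attribute string is a shortened, hypothetical stand-in for the one in the question (the base64 payload is cut down to a few characters but keeps the trailing equals sign), and array_combine is just one convenient way to pair groups 1 and 4.

```php
<?php
// Shortened, hypothetical attribute string; the base64 payload is a stand-in
// that keeps the trailing "=" which caused the original problem.
$attrs = 'style="width: 462px;" src="data:image/png;base64,AAAA=" data-filename="shot 1.png"';

preg_match_all('/([^\s=]+)=(\'?)("?)([^>"\']*)\2\3/', $attrs, $m);

// Group 1 holds the attribute names and group 4 the values.
$result = array_combine($m[1], $m[4]);

print_r($result);
```

Because the value class excludes only >, " and ', an equals sign inside a quoted value no longer confuses the match.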

php explode doesn't return second item

I'm trying to split this image string: $output = '<img typeof="foaf:Image" src="http://asite.dev/sites/default/files/video/MBI_Part%201_v9.jpg" width="1920" height="1080" alt="" />';
I'm doing it like this: $split = explode(' ', $output);
But when I print_r($split); it returns:
Array ( [0] => typeof="foaf:Image" [2] => src="http://makingitcount.dev/sites/default/files/video/MBI_Part%201_v9.jpg" [3] => width="1920" [4] => height="1080" [5] => alt="" [6] => /> )
No second value! Where'd it go? split[1] throws an error, of course. I also notice that the "<img" part of the string isn't in the array either.
The problem stems from how the browser parses the HTML tag in the print_r output, not from explode. The array does contain element 1, but when the dump is rendered as HTML, everything from <img up to the > of the following => is treated as a tag and hidden, so [1] disappears along with the "<img" part. If you remove the <img at the beginning of the string, you'll notice the rest of the attributes show up with a proper number sequence (including a '1' element). You can solve the problem by escaping the output (for example with htmlspecialchars) or by viewing the page source, so that the whole dump is treated strictly as text.
If you want to bypass this whole mess, you can also just use regular expression matching to collect the tag information into an array. $matches[0][*] will contain all of your tag attributes, and $matches[1] contains the tag itself (img).
$output = '<img typeof="Image" src="http://asite.dev/sites/default/files/video/MBI_Part%201_v9.jpg" width="1920" height="1080" alt="" />';
$pattern = '( \w+|".*?")';
preg_match_all($pattern, $output, $matches);
preg_match("[\w+]",$output,$matches[1]);
print_r($matches);
which gives you
Array ( [0] => Array ( [0] => typeof [1] => "Image" [2] => src [3] => "http://asite.dev/sites/default/files/video/MBI_Part%201_v9.jpg" [4] => width [5] => "1920" [6] => height [7] => "1080" [8] => alt [9] => "" )
[1] => Array ( [0] => img ) )
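One way to check whether element 1 is really missing is to escape the dump before printing it. The sketch below assumes the string is split on spaces, as the printed output in the question suggests.

```php
<?php
$output = '<img typeof="foaf:Image" src="http://asite.dev/sites/default/files/video/MBI_Part%201_v9.jpg" width="1920" height="1080" alt="" />';

$split = explode(' ', $output);

// Escape the dump so a browser shows "<img" literally instead of
// treating "<img [1] =>" as the start of a tag.
echo '<pre>' . htmlspecialchars(print_r($split, true)) . '</pre>';
```

On the command line, or in the browser's "view source", a plain print_r($split) shows the same complete array.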

Need to find all "src" elements in text

I need to get all "src" values from some text. The value can be quoted with " or '.
The values are found, but if the element has other attributes (id, style, ...), they get grabbed as well.
I need only the src value.
My code:
$html = 'text text <img src="img1.png"/> as as <img src=\'second.gif\' id ="test" /> as';
preg_match_all('/src=("|\')([^"]*)("|\')/', $html, $htmlSrc);
echo '<pre>';
print_r($htmlSrc);
Array
(
[0] => Array
(
[0] => src="img1.png"
[1] => src='second.gif' id ="
)
[1] => Array
(
[0] => "
[1] => '
)
[2] => Array
(
[0] => img1.png
[1] => second.gif' id =
)
[3] => Array
(
[0] => "
[1] => "
)
)
Regexp is a bad idea and you will probably end up with unmaintainable and unreliable code. It would be easy and reliable if you use an HTML parser. You can find an example here: http://simplehtmldom.sourceforge.net/
preg_match_all('/src=("|\')([^"\']*)\1/', $html, $htmlSrc);
print_r($htmlSrc[2]);
Seems to work better.
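For the parser route, here is a minimal sketch that uses PHP's built-in DOM extension (DOMDocument) rather than the library linked above. It assumes the DOM extension is enabled, which is the usual default.

```php
<?php
$html = 'text text <img src="img1.png"/> as as <img src=\'second.gif\' id ="test" /> as';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of emitting them for fragmentary markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$srcs = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // getAttribute returns the value without the surrounding quotes.
    $srcs[] = $img->getAttribute('src');
}

print_r($srcs);
```

The quoting style and any extra attributes such as id or style no longer matter, because the parser hands back the src value directly.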

Extract HTML Tags using preg_split

I have a string:
$string = 'this is test <b>bold</b> this is another test <img src="#"> image' ;
I want to split it so that HTML tags and normal text end up in separate elements.
I need output like the following:
[0] => this is test
[1] => <b>bold</b>
[2] => this is another test
[3] => <img src="#">
[4] => image
I'm using this code:
$strip = preg_split('/\s+(?![^<>]+>)/m', $string , -1, PREG_SPLIT_DELIM_CAPTURE) ;
Output:
[0] => this
[1] => is
[2] => test
[3] => <b>bold</b>
[4] => this
[5] => .....
I'm a newbie. Please help!
I find it easier to get that result using preg_match_all:
$string = 'this is test <b>bold</b> this is another test <img src="#"> image <hr/>';
preg_match_all('/<([^\s>]+)(.*?)>((.*?)<\/\1>)?|(?<=^|>)(.+?)(?=$|<)/i', $string, $result);
// keep only the full matches
$result = $result[0];
foreach ($result as &$group) {
    // eliminate leading and trailing whitespace
    $group = preg_replace('/^\s*(.*?)\s*$/', '$1', $group);
}
// break the reference left over from the loop
unset($group);
print_r($result);
EDIT:
I was assuming there should be at least 1 character in between the opening and the closing of a tag, but it's not necessary so I changed the second + into an * and I took into account the possibility of case insensitivity in tags.
Output:
Array
(
[0] => this is test
[1] => <b>bold</b>
[2] => this is another test
[3] => <img src="#">
[4] => image
[5] => <hr/>
)
EDIT 2:
This won't work with irregular situations such as those exemplified in the comments:
foo<b>bar<i>ital</b>ic</i> or foo<b>bar<b>baz</b>fail</b>
To make it work the RegEx should be tweaked to look inside the matches and process them accordingly.
