PHP preg_match_all does not match everything - php

Consider the following code snippet:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
preg_match_all('/DELIM1(.*?)DELIM2(.*?)/', $example, $matches);
$matches array becomes:
array:3 [
0 => array:2 [
0 => "DELIM1test1DELIM2"
1 => "DELIM1test3DELIM2"
]
1 => array:2 [
0 => "test1"
1 => "test3"
]
2 => array:2 [
0 => ""
1 => ""
]
]
As you can see, it fails to get test2 and test4. Any reason why that happens and what could be a possible solution? Thank you.

.*? is non-greedy; if you have no constraint after it, it will match the minimum necessary: zero characters. You need a constraint after it to force it to match more than trivially. For example:
/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/
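To make the difference visible, here is a minimal sketch that runs both patterns against the $example string from the question. The variable names $lazy and $anchored are only illustrative.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// Nothing follows the last lazy group, so it is satisfied with zero characters.
preg_match_all('/DELIM1(.*?)DELIM2(.*?)/', $example, $lazy);

// The lookahead forces the last group to expand up to the next DELIM1
// or the end of the string, without consuming that DELIM1.
preg_match_all('/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/', $example, $anchored);

print_r($lazy[2]);     // two empty strings
print_r($anchored[2]); // test2 and test4
```

Because the lookahead does not consume DELIM1, the next match can still start there.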

Lazy subpatterns at the end of the pattern match either 0 (*?) or 1 (+?) characters because they match as few as possible.
You can still use lazy matching and append a lookahead that will require a DELIM1 to appear after the value or the end of string:
/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/
Its performance is very close to that of a tempered greedy token: DELIM1(.*?)DELIM2((?:(?!DELIM1).)*)
However, the best approach is to unroll it:
DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)
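As a rough check that the three variants agree, the sketch below runs the lookahead, tempered greedy token, and unrolled patterns on a hypothetical input whose values contain a literal "D" (the $input string and the array keys are made up for illustration). It only compares results; it says nothing about speed.

```php
<?php
// Hypothetical input: values contain "D" to exercise the unrolled pattern.
$input = "DELIM1aDbDELIM2cDdDELIM1eDELIM2f";

$patterns = [
    'lookahead' => '/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/',
    'tempered'  => '/DELIM1(.*?)DELIM2((?:(?!DELIM1).)*)/',
    'unrolled'  => '/DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $input, $m);
    $results[$name] = $m[2]; // the value after DELIM2 in each pair
}

print_r($results); // every entry holds "cDd" and "f"
```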

preg_split would be better:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
$keywords = preg_split("/DELIM1|DELIM2/", $example,0,PREG_SPLIT_NO_EMPTY);
print_r($keywords);
output:
Array
(
[0] => test1
[1] => test2
[2] => test3
[3] => test4
)
demo: http://ideone.com/s5nC0k

Those values are OUTSIDE of your anchors, so they won't get matched. e.g. (with some extra spaces)
str: DELIM1 test1 DELIM2 test2 DELIM1 test3 DELIM2 test4
pat: DELIM1 (.*?) DELIM2 (.*?) DELIM1 (.*?) DELIM2 (.*?)
     match #1                  match #2
(.*?) is a non-greedy match, and can/will match a 0-length string. The position between DELIM2 and test2 is exactly such a 0-length string, so the group matches it and the pattern terminates there.
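One way to see where that zero-length match sits is to ask for offsets. This is a small sketch using PREG_OFFSET_CAPTURE on the question's string; the printf format is only for illustration.

```php
<?php
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4";

// PREG_OFFSET_CAPTURE returns [text, byte offset] pairs instead of plain strings.
preg_match_all('/DELIM1(.*?)DELIM2(.*?)/', $example, $m, PREG_OFFSET_CAPTURE);

foreach ($m[2] as [$text, $offset]) {
    // Group 2 is empty; its offset is the position right after DELIM2.
    printf(
        "group 2 = '%s' at offset %d, next chars: %s\n",
        $text,
        $offset,
        substr($example, $offset, 5)
    );
}
// offsets are 17 and 39, i.e. directly in front of test2 and test4
```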

You can use this negative lookahead regex:
preg_match_all('/DELIM1((?:(?!DELIM1|DELIM2).)*)DELIM2((?:(?!DELIM1|DELIM2).)*)/',
$example, $matches);
(?:(?!DELIM1|DELIM2).)* matches 0 or more characters, consuming each one only if neither DELIM1 nor DELIM2 starts at that position.
Output:
print_r($matches);
Array
(
[0] => Array
(
[0] => DELIM1test1DELIM2test2
[1] => DELIM1test3DELIM2test4
)
[1] => Array
(
[0] => test1
[1] => test3
)
[2] => Array
(
[0] => test2
[1] => test4
)
)

Related

PHP preg_split split by group 1

I have these inputs:
Rosemary Hess (2018) (Germany) (all media)
Jackie H Spriggs (catering)
I want to split them at the first parenthesis. The output I want:
array:2 [
0 => "Rosemary Hess"
1 => "(2018) (Germany) (all media)"
]
array:2 [
0 => "Jackie H Spriggs"
1 => "(catering)"
]
I tried this, but it does not work correctly:
preg_split("/(\s)\(/", 'Rosemary Hess (2018) (Germany) (all media)')
It splits at every space followed by an opening parenthesis and returns more items than the two I want.
You can use
$strs= ["Rosemary Hess (2018) (Germany) (all media)", "Jackie H Spriggs (catering)"];
foreach ($strs as $s){
print_r( preg_split('~\s*(?=\([^()]*\))~', $s, 2) );
}
// => Array ( [0] => Rosemary Hess [1] => (2018) (Germany) (all media) )
// => Array ( [0] => Jackie H Spriggs [1] => (catering) )
The third preg_split argument, $limit, is set to 2, so the string is split only at the first occurrence of the pattern, which matches:
\s* - 0+ whitespaces
(?=\([^()]*\)) - that are followed with (, 0 or more chars other than ( and ) and then a ).
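For comparison, a short sketch that puts the attempt from the question next to the lookahead version. It shows the two things that matter: a delimiter that contains "(" consumes it, and without a limit every match produces a split. The variable names $consumed and $kept are illustrative.

```php
<?php
$s = 'Rosemary Hess (2018) (Germany) (all media)';

// Attempt from the question: "(" is part of the delimiter, so it is
// removed, and every " (" in the string produces a split.
$consumed = preg_split('/(\s)\(/', $s);

// Lookahead keeps "(" in the result; limit 2 stops after the first split.
$kept = preg_split('~\s*(?=\([^()]*\))~', $s, 2);

print_r($consumed); // "Rosemary Hess", "2018)", "Germany)", "all media)"
print_r($kept);     // "Rosemary Hess", "(2018) (Germany) (all media)"
```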

How to capture part of a string, including the delimiter?

Given a string like this:
#foo1 foo2# foo3 foo4 #foo5# ##foo6# #foo7## #foo8 foo9#
The expected result is an array like this:
array (
[0] => #foo1 foo2#
[1] => foo3
[2] => foo4
[3] => #foo5#
[4] => ##foo6# #foo7##
[5] => #foo8 foo9#
);
Or, more simply: split on spaces, but capture everything inside a pair of delimiters, delimiters included, as a single array element.
NOTE: The delimiter can be repeated (for example ##).
You can use preg_match_all using this alternation regex:
/(#+).*?\1|\S+/
RegEx Breakup:
(#+) - Match 1 or more # in captured group #1
.*? - Match 0 or more of any characters (non-greedy)
\1 - Back-reference to captured group #1 to make sure we have same #s on RHS
| - OR
\S+ - one or more non-white-space characters
Code:
$str = '#foo1 foo2# foo3 foo4 #foo5# ##foo6# #foo7## #foo8 foo9#';
preg_match_all('/(#+).*?\1|\S+/', $str, $matches);
print_r($matches[0]);
Output:
Array
(
[0] => #foo1 foo2#
[1] => foo3
[2] => foo4
[3] => #foo5#
[4] => ##foo6# #foo7##
[5] => #foo8 foo9#
)

Regular expression doesn't work as expected: '/=(\w+\s*)+=/'

This is what I have:
<?php
preg_match_all('/=(\w+\s*)+=/', 'aaa =bbb ccc ddd eee= zzz', $match);
print_r($match);
It captures only eee:
Array
(
[0] => Array
(
[0] => =bbb ccc ddd eee=
)
[1] => Array
(
[0] => eee
)
)
I need it to match bbb, ccc, ddd, eee, e.g.:
...
[1] => Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
...
Where is the problem?
Try this regex:
(\w+)(?=[^=]*=[^=]*$)
Explanation:
(\w+) # capture each word
(?= # only if it is followed by:
[^=]* # any number of non-'=' characters
= # one '=' character
[^=]*$ # non-'=' characters up to the end of the string; this excludes the words before the first '='. Try removing it on regex101 to see what happens.
)
Hope it helps.
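A runnable sketch of the pattern above, applied to the string from the question. Note the assumption baked into the lookahead: it expects exactly one "=" between the word and the end of the string, so it fits input with a single =...= pair.

```php
<?php
$subject = 'aaa =bbb ccc ddd eee= zzz';

// Each word is a separate match, so no capture gets overwritten.
// The lookahead allows exactly one "=" between the word and the end
// of the string, which holds only for words inside the =...= pair.
preg_match_all('/(\w+)(?=[^=]*=[^=]*$)/', $subject, $m);

print_r($m[1]); // bbb, ccc, ddd, eee
```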
That is expected behaviour. Group captures are overwritten on repetition: 1 group, 1 capture.
Instead of trying to get them all in one match attempt, you should match one token per attempt. Use \G to anchor each attempt at the end of the previous match.
Something like this should work:
/(?(?=^)[^=]*+=(?=.*=)|\G\s+)([^\s=]+)/
Regex break-down
(?(?=^) ... | ... )    IF at the start of the string
    [^=]*+=            consume everything up to the first =
    (?=.*=)            and check there's a closing = as well
ELSE
    \G\s+              only match if the last match ended here, consuming preceding spaces
([^\s=]+)              match 1 token, captured in group 1
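Here is how that pattern behaves on the string from the question, as a minimal sketch. Printing $m[0] alongside $m[1] shows what each attempt consumed before the token.

```php
<?php
$subject = 'aaa =bbb ccc ddd eee= zzz';

// First attempt (start of string): skip to the opening "=" and take a token.
// Later attempts: \G only succeeds exactly where the previous match ended.
$pattern = '/(?(?=^)[^=]*+=(?=.*=)|\G\s+)([^\s=]+)/';
preg_match_all($pattern, $subject, $m);

print_r($m[0]); // "aaa =bbb", " ccc", " ddd", " eee"
print_r($m[1]); // bbb, ccc, ddd, eee
```

After eee the next character is "=", so \G\s+ fails there and zzz is never reached.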
If you're also interested in matching more than 1 set of tokens, you need to match the text in between sets as well:
/(?(?=^)[^=]*+=(?=.*=)|\G\s*+(?:=[^=]*+=(?=.*=))?)([^\s=]+)/
Your regex starts and ends with an =, so the only possible match is:
=bbb ccc ddd eee=
You can use preg_replace with preg_split, i.e.:
$string = "aaa =bbb ccc ddd eee= zzz";
$matches = preg_split('/ /', preg_replace('/^.*?=|=.*?$/', '', $string));
print_r($matches);
OUTPUT:
Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
DEMO:
http://ideone.com/pAmjbk

Regexp tip request

I have a string like
"first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth"
I want to explode it into an array:
Array (
0 => "first",
1 => "second[,b]",
2 => "third[a,b[1,2,3]]",
3 => "fourth[a[1,2]]",
4 => "sixth"
)
I tried to remove brackets:
preg_replace("/\[ ( (?>[^\[\]]+) | (?R) )* \]/xis",
"",
"first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth"
);
But I got stuck on the next step.
PHP's regex flavor supports recursive patterns, so something like this would work:
$text = "first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth";
preg_match_all('/[^,\[\]]+(\[([^\[\]]|(?1))*])?/', $text, $matches);
print_r($matches[0]);
which will print:
Array
(
[0] => first
[1] => second[,b]
[2] => third[a,b[1,2,3]]
[3] => fourth[a[1,2]]
[4] => sixth
)
The key here is not to split, but match.
Whether you want to add such a cryptic regex to your code base, is up to you :)
EDIT
I just realized that my suggestion above will not match entries starting with [. To do that, do it like this:
$text = "first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth,[s,[,e,[,v,],e,],n]";
preg_match_all("/
( # start match group 1
[^,\[\]] # any char other than a comma or square bracket
| # OR
\[ # an opening square bracket
( # start match group 2
[^\[\]] # any char other than a square bracket
| # OR
(?R) # recursively match the entire pattern
)* # end match group 2, and repeat it zero or more times
] # a closing square bracket
)+ # end match group 1, and repeat it once or more times
/x",
$text,
$matches
);
print_r($matches[0]);
which prints:
Array
(
[0] => first
[1] => second[,b]
[2] => third[a,b[1,2,3]]
[3] => fourth[a[1,2]]
[4] => sixth
[5] => [s,[,e,[,v,],e,],n]
)

regex match between 2 strings

For example I have the text
a1aabca2aa3adefa4a
I want to extract 2 and 3 with a regex between abc and def, so 1 and 4 should not be included in the result.
I tried this
if(preg_match_all('#abc(?:a(\d)a)+def#is', file_get_contents('test.txt'), $m, PREG_SET_ORDER))
print_r($m);
I get this
Array
(
[0] => Array
(
[0] => abca2aa3adef
[1] => 3
)
)
But I want this
Array
(
[0] => Array
(
[0] => abca2aa3adef
[1] => 2
[2] => 3
)
)
Is this possible with one preg_match_all call? How can I do it?
Thanks
preg_match_all(
'/\d # match a digit
(?=.*def) # only if followed by <anything> + def
(?!.*abc) # and not followed by <anything> + abc
/x',
$subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
This works on your example. It assumes that there is exactly one instance of abc and def per line in your string.
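For completeness, a self-contained version of that call with $subject set to the example text from the question (the original snippet leaves $subject undefined).

```php
<?php
$subject = 'a1aabca2aa3adefa4a';

preg_match_all(
    '/\d          # match a digit
      (?=.*def)   # only if "def" appears somewhere after it
      (?!.*abc)   # and "abc" does not appear after it
     /x',
    $subject,
    $result
);

print_r($result[0]); // 2 and 3
```

The digit 1 is rejected because abc still follows it, and 4 is rejected because no def follows it.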
The reason why your attempt didn't work is that your capturing group (\d) that matches the digit is within another, repeated group (?:a(\d)a)+. With every repetition, the result of the capture is overwritten. This is how regular expressions work.
In other words - see what's happening during the match:
Current position    Current part of regex      Capturing group 1
----------------------------------------------------------------
a1a                 no match, advancing...     undefined
abc                 abc                        undefined
a2a                 (?:a(\d)a)                 2
a3a                 (?:a(\d)a) (repeated)      3 (overwrites 2)
def                 def                        3
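The overwrite is easy to reproduce. The sketch below first shows the repeated group keeping only its last capture, then a two-step alternative that is an assumption of this example rather than something proposed in the answers: capture the whole repeated run and collect the digits from it.

```php
<?php
$subject = 'a1aabca2aa3adefa4a';

// One repeated group: only the last repetition survives in $m[1].
preg_match('/abc(?:a(\d)a)+def/', $subject, $m);
print_r($m); // full match "abca2aa3adef", group 1 is "3"

// Two-step alternative (illustrative): capture the whole run between
// abc and def, then pull every digit out of that run.
$digits = [];
if (preg_match('/abc((?:a\da)+)def/', $subject, $run)) {
    preg_match_all('/\d/', $run[1], $found);
    $digits = $found[0];
}
print_r($digits); // 2 and 3
```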
You ask if it is possible with a single preg_match_all.
Indeed it is.
This code outputs exactly what you want.
<?php
$subject='a1aabca2aa3adefa4a';
$pattern='/abc(?:a(\d)a+(\d)a)def/m';
preg_match_all($pattern, $subject, $all_matches,PREG_OFFSET_CAPTURE | PREG_PATTERN_ORDER);
$res[0]=$all_matches[0][0][0];
$res[1]=$all_matches[1][0][0];
$res[2]=$all_matches[2][0][0];
var_dump($res);
?>
Here is the output:
array
0 => string 'abca2aa3adef' (length=12)
1 => string '2' (length=1)
2 => string '3' (length=1)
