PHP - ungreedy regular expression still a little 'greedy'

For my CMS I need to replace multiline content between [?] and [/?] tags if it contains the string %empty%, and leave it untouched if the %empty% marker is not found.
$a='
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text';
$r = preg_replace(
    '/(\[\?\]).*?%empty%.*?(\[\/\?\])/s',
    "REPLACED",
    $a
);
echo $r;
Right result:
REPLACED
text
REPLACED
text
It works well in almost every combination, except when the first block does not contain %empty%. In that case everything from the first [?] up to the [/?] that follows the first %empty% is replaced:
$a='
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]
text
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text';
Wrong result:
REPLACED
text
Expected:
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]
text
REPLACED
text
I have tried both the ungreedy modifier and 'lazy' quantifiers, with the same result. I think I need to explicitly exclude a second [/?] in the regexp, but so far without success.

For your current example data, if you want to match from [?] to [/?] where no [?] may occur in between and %empty% must be present, you can use a tempered greedy token.
Use the /s modifier so that the dot also matches a newline:
\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]
Explanation
\[\?\] Match [?]
(?: Non capturing group
(?!\[/?\?\]). Assert what is directly on the right is not [?] or [/?]. Then match any char.
)* Close non capturing group and repeat 0+ times
%empty% Match literally
(?: Non capturing group
(?!\[\?\]). Assert what is directly on the right is not [?]. Then match any char.
)* Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
Regex demo
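As a minimal sketch of how the tempered greedy token pattern could be used from PHP, the snippet below runs it against the failing input from the question. The `~` delimiter is my own choice (so `/` needs no escaping); the variable names follow the question.

```php
<?php
// Input from the question: the first block holds %!empty%, the second %empty%.
$a = '
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]
text
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text';

// "s" makes the dot match newlines, as the answer requires.
$pattern = '~\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]~s';

$r = preg_replace($pattern, 'REPLACED', $a);
echo $r;
```

Only the second block should come back as REPLACED; the first block stays untouched because the tempered token cannot cross its closing [/?] to reach %empty%.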
Edit
@Casimir et Hippolyte suggests a more performant pattern using an unrolled star alternation approach:
\[\?\][^[%]*+(?:\[(?!\?])[^[%]*|%(?!empty%)[^[{%]*)*+%empty%[^[]*+(?:\[(?!/?\?])[^[]*)*+\[/\?]
Explanation
\[\?\] Match [?]
[^[%]*+ Negated character class, match 0+ times any char except [ or % (possessive)
(?: Non capturing group
\[(?!\?]) Match [, assert what is directly on the right is not ?]
[^[%]* If that is the case, match 0+ times any char except [ or %
| Or
%(?!empty%) Match %, assert what is directly on the right is not empty%
[^[{%]* If that is the case, match 0+ times any char except [, { or %
)*+ Close non capturing group and repeat 0+ times using a possessive quantifier
%empty%[^[]*+ Match %empty% and 0+ times any char except [
(?: Non capturing group
\[(?!/?\?]) Match [, assert what is directly on the right is not an optional / and ?]
[^[]* If that is the case, match 0+ times any char except [
)*+ Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
Regex demo
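A small sketch, assuming the same two inputs as in the question, to check that the unrolled pattern behaves like the tempered greedy token version. The pattern contains no dot, so the /s modifier is not needed; the `~` delimiter and the variable names are my own.

```php
<?php
// Both blocks contain %empty%.
$first  = "\n[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext\n[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";
// Only the second block contains %empty%.
$second = "\n[?]<h1>%!empty%</h1>\n<p>text</p>\n[/?]\ntext\n[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

// Unrolled pattern from the answer, copied unchanged.
$pattern = '~\[\?\][^[%]*+(?:\[(?!\?])[^[%]*|%(?!empty%)[^[{%]*)*+%empty%[^[]*+(?:\[(?!/?\?])[^[]*)*+\[/\?]~';

$r1 = preg_replace($pattern, 'REPLACED', $first);
$r2 = preg_replace($pattern, 'REPLACED', $second);

echo $r1, PHP_EOL, '---', PHP_EOL, $r2, PHP_EOL;
```

Note that the class `[^[{%]` also excludes `{`, so content containing a curly brace before %empty% would not match with the pattern as written.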

Related

Maximum character length for PHP multiline regular expressions?

I'm trying to evaluate a multiline RegExp with preg_match_all.
Unfortunately there seems to be a character limit around 24,000 characters (24,577 to be specific).
Does anyone know how to get this to work?
Pseudo-code:
<?php
$data = 'TRACE: aaaa(24,577 characters)';
preg_match_all('/([A-Z]+): ((?:(?![A-Z]+:).)*)\n/s', $data, $matches);
var_dump($matches);
?>
Working example (with < 24,577 characters): https://3v4l.org/8iRCc
Example that's NOT working (with > 24,577 characters): https://3v4l.org/ceKn6

You might rewrite the pattern using a negated character class instead of the tempered greedy token approach with the negative lookahead:
([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
([A-Z]+): Capture group 1, match 1+ uppercase chars : and a space
( Capture group 2
[^A-Z\r\n]* Match 0+ times any char except A-Z or a newline
(?> Atomic group
(?: Non capture group
\r?\n Match a newline
| Or
[A-Z] Match a single char A-Z
(?![A-Z]*:) Negative lookahead, assert not optional chars A-Z and :
) Close non capture group
[^A-Z\r\n]* Optionally match any char except A-Z or a newline
)* Close atomic group and optionally repeat
)\r?\n Close group 2 and match a newline
Regex demo | Php demo
If the TRACE: is at the start of the string, you can also add an anchor:
^([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
Regex demo
Edit
If the strings start with the same format, you can capture and match all lines that do not start with the opening format.
^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)
The pattern matches:
^ Start of the string (or of each line when the /m flag is set)
([A-Z]+): Capture group 1
( Capture group 2
.* Match the rest of the line
(?:\r?\n(?![A-Z]+: ).*)* Repeat matching all lines that do not start with the pattern [A-Z]+:
) Close group 2
Regex demo
In php you can use
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
Php demo
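A minimal sketch of the line-based pattern in use. The sample log and the 50,000-character record are made up for illustration; the `preg_last_error()` check is there on the assumption that a failing call on large input is caused by a PCRE limit, which the question does not confirm.

```php
<?php
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';

// Hypothetical sample: each record starts with "LABEL: " and may span lines.
$data = "TRACE: first line\n  continued\nERROR: second record\nINFO: third\n  more\n  lines";

preg_match_all($re, $data, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // json_encode makes the embedded newlines visible.
    echo $m[1], ' => ', json_encode($m[2]), PHP_EOL;
}

// A single long record, well past the size mentioned in the question.
$big = 'TRACE: ' . str_repeat('a', 50000);
$count = preg_match_all($re, $big, $bigMatches);

if ($count === false) {
    // Non-zero codes include PREG_BACKTRACK_LIMIT_ERROR and PREG_JIT_STACKLIMIT_ERROR.
    echo 'PCRE error code: ', preg_last_error(), PHP_EOL;
} else {
    echo 'records: ', $count, ', length: ', strlen($bigMatches[2][0]), PHP_EOL;
}
```

Because `.*` consumes a whole line at once instead of running a lookahead before every character, the long record should need far fewer matching steps than the original pattern.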

Try this
preg_match('/\A(?>[^\r\n]*(?>\r\n?|\n)){0,4}[^\r\n]*\z/',$data)

I'm trying to capture data in a web url with regex

I'm trying to build a regex that matches my URLs.
Here are 2 example urls
category/sorganiser/bouger/escalade/offre/78934/
category/sorganiser/savourer/offre/8040/
I would like to get the number just after offre (78934 and 8040)
as well as the word just before offre (escalade and savourer).
I tried several patterns, but none of them matched:
^category/(((\w)+/){1,3})(\d+)/?$
^category/(((\w)+/){1,3})/offre/(\d+)/?$
https://regex101.com/r/S4MTvK/1
Thank you

Instead of repeating a group that holds a single word char, (\w)+, you can put the quantifier inside the group to capture 1+ word chars at once: (\w+)
Do not match an extra / before offre, as the trailing / is already consumed by each iteration of ^category/(?:(\w+)/){1,3}
If you repeat the capture group inside a non capture group (?:...), the capture group holds the value of the last iteration.
^category/(?:(\w+)/){1,3}offre/(\d+)
The pattern matches
^ Start of string
category/ Match literally
(?: Non capture group
(\w+)/ Capture group 1, match 1+ word chars and match /
){1,3} Close the non capture group and repeat 1-3 times; capture group 1 holds the last occurrence of 1+ word chars, which is escalade or savourer
offre/ Match literally
(\d+) Capture group 2, match 1+ digits
Regex demo
To also match an optional / before the end of the string
^category/(?:(\w+)/){1,3}offre/(\d+)/?$
Regex demo
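A short sketch of the anchored pattern applied to the two URLs from the question. The `~` delimiter and the `$results` array are my own choices for illustration.

```php
<?php
$urls = [
    'category/sorganiser/bouger/escalade/offre/78934/',
    'category/sorganiser/savourer/offre/8040/',
];

// "~" as delimiter so the slashes in the path need no escaping.
$re = '~^category/(?:(\w+)/){1,3}offre/(\d+)/?$~';

$results = [];
foreach ($urls as $url) {
    if (preg_match($re, $url, $m)) {
        // $m[1] is the last repetition of the group, $m[2] the number.
        $results[] = [$m[1], $m[2]];
        echo $m[1], ' ', $m[2], PHP_EOL;
    }
}
```

This should print escalade 78934 and savourer 8040. In the second URL the repeated group first consumes offre/ as well, then backtracks one iteration, which restores group 1 to savourer.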

How to add additional capture group to lookahead, lookbehind regex

I am using this regex: (?<=\[).+?(?=\]) to match data in my test string below.
This regex matches everything between my brackets. I need to also include the '1234567890ABC...' portion of my string as well. How would I do that?
This is my test string:
[one] [two] [three] 1234567890ABC...

You could make use of the \G anchor and match any char except the square brackets, or match \w+ to match only word characters.
(?:\[|\G(?!^)]\h\[?)\K[^][\s]+
(?: Non capture group
\[ Match [
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start of the string
]\h\[? Match ], horizontal whitespace char and optional [
)\K Close group and reset the match buffer
[^][\s]+ Match 1+ times any char except square brackets or whitespace char
Regex demo
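A minimal sketch of the \G pattern with preg_match_all, using the test string from the question:

```php
<?php
$str = '[one] [two] [three] 1234567890ABC...';

// \K drops the bracket and whitespace from the reported match.
$re = '/(?:\[|\G(?!^)]\h\[?)\K[^][\s]+/';

preg_match_all($re, $str, $matches);
print_r($matches[0]);
```

The full matches in `$matches[0]` should be one, two, three and 1234567890ABC..., so no capture group is needed.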

You could try this pattern. It is the same as the one you are using, but it also matches the words and numbers after the brackets:
(?<=\[).+?(?=\])\d+|\w+

PHP preg_replace - Replace text part but not too soon

Using:
$text = preg_replace("/\[\[(.*?)SPACE(.*?)\]\]/im",'$2',$text);
to clean the text and keep only wordtwo:
$text = '..text.. [[wordoneSPACE**wordtwo**]] ..moretext..';
but it fails if the text contains an earlier [[
$text = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';
How can I limit the match to the [[...]] that actually contains the word SPACE?

If there can be no [ and ] inside the [[...]] you may use
$text = preg_replace("/\[\[([^][]*)SPACE([^][]*)]]/i",'$2',$text);
See the regex demo. [^][] negated character class will only match a char other than [ and ] and won't cross the [[...]] border.
Otherwise, use a tempered greedy token:
$text = preg_replace("/\[\[((?:(?!\[{2}).)*?)SPACE(.*?)]]/is",'$2',$text);
See this regex demo.
The (?:(?!\[{2}).)*? pattern matches any char, 0 or more times but as few as possible, as long as that char does not start a [[ sequence, so the match cannot cross the next [[ border.
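As a quick check, here is a sketch that runs both variants on the failing input from the question. Single-quoted pattern strings are my choice; the variable names `$r1` and `$r2` are only for illustration.

```php
<?php
$text = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';

// Variant 1: negated character class, no [ or ] allowed inside [[...]].
$r1 = preg_replace('/\[\[([^][]*)SPACE([^][]*)]]/i', '$2', $text);

// Variant 2: tempered greedy token, the first group may not cross another [[.
$r2 = preg_replace('/\[\[((?:(?!\[{2}).)*?)SPACE(.*?)]]/is', '$2', $text);

echo $r1, PHP_EOL; // .. [[ ..text(not to cut).. **wordtwo** ..moretext..
echo $r2, PHP_EOL; // same output
```

Both leave the stray [[ and the text after it in place and only collapse the [[wordoneSPACE**wordtwo**]] part.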

Another option might be using possessive quantifiers.
In the first group you could use a negated character class to match any characters except square brackets or an S if it is followed by SPACE.
\[\[([^][S]++(?:S(?!PACE)|[^][S]+)*+)SPACE([^][]++)\]\]
In parts
\[\[ Match [[
( Capture group 1
[^][S]++ Match 1+ times any char except S, ] or [
(?: Non capturing group
S(?!PACE) Match an S that is not followed by PACE
| Or
[^][S]+ Match 1+ times any char except S, ] or [
)*+ Close group and repeat 0+ times
) Close group 1
SPACE Match literally
( Capture group 2
[^][]++ Match 1+ times any char except ] or [
) Close group
\]\] Match ]]
Regex demo
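A small sketch of the possessive variant in PHP. The first input is from the question; the second one, with an extra S inside the first word, is a made-up case to show the S(?!PACE) branch at work.

```php
<?php
// Possessive quantifiers (++ and *+) keep what they matched and never backtrack.
$re = '/\[\[([^][S]++(?:S(?!PACE)|[^][S]+)*+)SPACE([^][]++)\]\]/';

$text = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';
$r1 = preg_replace($re, '$2', $text);

// Hypothetical input with an S that is not part of SPACE.
$r2 = preg_replace($re, '$2', '[[wordSoneSPACEwordtwo]]');

echo $r1, PHP_EOL; // .. [[ ..text(not to cut).. **wordtwo** ..moretext..
echo $r2, PHP_EOL; // wordtwo
```

Unlike the other answer's patterns, this one is case sensitive as written, so it only recognises an upper case SPACE.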

require whitespace in first group regex

In the process of writing a custom little template engine I want to match a block like
{foreach foo as bar}
{bar.name}
{endforeach}
//with regex
preg_match_all('/{(?!{)([\w\s]+)}(?!})(.*?){(?!{)(\w+)}(?!})/us', $string, $matches, PREG_SET_ORDER)
So the first group must have alnum and whitespace chars with [\w\s]+
the negative lookahead (?!{) is to not allow blocks that start with {{
//so a block like
{{foreach bla as bla}}
//would not be matched.
The problem is that this regex also matches {var}, which contains no whitespace.
I don't understand why, given that my first character class is [\w\s]+

To match at least 2 word char sequences separated with at least 1 whitespace, and allow leading and trailing whitespaces, you may use
\s*\w+(?:\s+\w+)+\s*
In details:
\s* - 0+ whitespaces
\w+ - 1 or more word chars
(?: - start of a non-capturing group that is used for grouping subpattern sequences:
\s+ - 1 or more whitespaces
\w+ - 1 or more word chars
)+ - 1 or more occurrences of the group
\s* - trailing 0+ whitespace chars.
The entire regex will look like
{(?!{)(\s*\w+(?:\s+\w+)+\s*)}(?!})(.*?){(?!{)(\w+)}(?!})
See the updated regex demo
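A minimal sketch of the updated regex with preg_match_all, keeping the /us modifiers from the question. The sample template, including the trailing {var}, is an assumption for illustration.

```php
<?php
// The foreach block from the question followed by a plain {var} tag.
$string = "{foreach foo as bar}\n  {bar.name}\n{endforeach}\n{var}";

$re = '/{(?!{)(\s*\w+(?:\s+\w+)+\s*)}(?!})(.*?){(?!{)(\w+)}(?!})/us';

preg_match_all($re, $string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo 'open: ', $m[1], ', close: ', $m[3], PHP_EOL;
}
```

Only the foreach block should match here: {var} has no whitespace-separated second word, so it no longer qualifies as an opening tag.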
