How to add additional capture group to lookahead, lookbehind regex - php

I am using this regex: (?<=\[).+?(?=\]) to match data in my test string below.
This regex matches everything between my brackets. I need to also include the '1234567890ABC...' portion of my string as well. How would I do that?
This is my test string:
[one] [two] [three] 1234567890ABC...

You could make use of the \G anchor and match any char except the square brackets, or match \w+ to match only word characters.
(?:\[|\G(?!^)]\h\[?)\K[^][\s]+
(?: Non capture group
\[ Match [
| Or
\G(?!^) Assert the position at the end of the previous match, but not at the start of the string
]\h\[? Match ], a horizontal whitespace char and an optional [
)\K Close group and forget what has been matched so far
[^][\s]+ Match 1+ times any char except square brackets or whitespace char
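As a minimal sketch (the variable names are illustrative, not from the question), the pattern above can be run with preg_match_all against the test string:

```php
<?php
// Test string from the question.
$subject = '[one] [two] [three] 1234567890ABC...';

// \G continues from the end of the previous match;
// \K drops the brackets and spaces from the reported match.
$pattern = '/(?:\[|\G(?!^)]\h\[?)\K[^][\s]+/';

preg_match_all($pattern, $subject, $matches);
print_r($matches[0]);
```

Because [^][\s]+ excludes only brackets and whitespace, the trailing dots stay part of the last match.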

You could try this pattern. It keeps the pattern you are using and adds an alternative that also matches the words and numbers after the brackets:
(?<=\[).+?(?=\])|\w+


Why is non-greedy match consuming entire pattern even when followed by another non-greedy match

Using PHP8, I'm struggling to figure out how to conditionally match some key that may or may not appear in a string.
I would like to match both
-----------key=xyz---------------
AND
--------------------------
The dashes ("-") stand for any non-space character; they are used here only to keep the example readable.
The regex matches "key=..." if its containing group is greedy, as below.
But this isn't adequate, because the full match fails if "key=xyz" is missing from the subject string.
/
(\S*)?
(key\=(?<foundkey>[[:alnum:]-]*))
\S*
/x
If that capture group is non-greedy, the regex simply skips it and never matches "key=xyz":
/
(\S*)?
(key\=(?<foundkey>[[:alnum:]-]*))?
\S*
/x
I tried debugging in this regex101 example but couldn't figure it out.
I sorted this out using multiple regexes, but I'm hoping someone can address my misunderstanding so I learn how to make this work as a single regex.
Thanks
You may use:
/
^
\S*?
(?:
key=(?<foundkey>\w+)
\S*
)?
$
/xm
RegEx Breakdown:
^: Start
\S*?: Match 0 or more non-whitespace characters, as few as possible
(?:: Start non-capture group
key=(?<foundkey>\w+): Match key= followed by 1+ word characters, captured in group foundkey
\S*: Match 0 or more non-whitespace characters
)?: End non-capture group. ? makes it an optional match
$: End
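A minimal sketch of how this pattern behaves on both inputs from the question. The helper name findKey is hypothetical, and PREG_UNMATCHED_AS_NULL is used so that a missing key comes back as null:

```php
<?php
// Hypothetical helper: returns the value after "key=", or null when the key is absent.
function findKey(string $subject): ?string
{
    $pattern = '/^\S*?(?:key=(?<foundkey>\w+)\S*)?$/m';
    if (preg_match($pattern, $subject, $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null; // no line consists solely of non-whitespace characters
    }
    return $m['foundkey'] ?? null;
}

var_dump(findKey('-----------key=xyz---------------')); // string(3) "xyz"
var_dump(findKey('--------------------------'));        // NULL
```

The lazy \S*? gives the optional group a chance to match at every position before the pattern falls back to consuming the whole line without it.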

Maximum character length for PHP multiline regular expressions?

I'm trying to evaluate a multiline RegExp with preg_match_all.
Unfortunately there seems to be a character limit around 24,000 characters (24,577 to be specific).
Does anyone know how to get this to work?
Minimal example:
<?php
// Build a subject of the size at which matching starts to fail.
$data = 'TRACE: ' . str_repeat('a', 24577) . "\n";
preg_match_all('/([A-Z]+): ((?:(?![A-Z]+:).)*)\n/s', $data, $matches);
var_dump($matches);
?>
Working example (with < 24,577 characters): https://3v4l.org/8iRCc
Example that's NOT working (with > 24,577 characters): https://3v4l.org/ceKn6
You might rewrite the pattern using a negated character class instead of the tempered greedy token approach with the negative lookahead:
([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
([A-Z]+):  Capture group 1 matching 1+ chars A-Z, then match : and a space
( Capture group 2
[^A-Z\r\n]* Match 0+ times any char except A-Z or a newline
(?> Atomic group
(?: Non capture group
\r?\n Match a newline
| Or
[A-Z] Match a char A-Z
(?![A-Z]*:) Negative lookahead, assert that what follows is not optional chars A-Z and then :
) Close non capture group
[^A-Z\r\n]* Match 0+ times any char except A-Z or a newline
)* Close atomic group and optionally repeat
)\r?\n Close group 2 and match a newline
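A minimal sketch for checking the rewritten pattern against a long subject. The 30,000-character payload is assumed test data, not the asker's input. A length-dependent failure like the one described is often a PCRE limit being hit rather than a hard size cap, although the question does not confirm that; preg_last_error_msg() (PHP 8+) reports the reason when a preg function returns false.

```php
<?php
// Assumed test data: one TRACE entry with a 30,000-character payload.
$data = 'TRACE: ' . str_repeat('a', 30000) . "\n";

$pattern = '/([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n/';

$count = preg_match_all($pattern, $data, $matches);
if ($count === false) {
    // Names the limit that was hit, if any.
    echo 'Matching failed: ', preg_last_error_msg(), PHP_EOL;
} else {
    echo $count, ' match(es), payload length ', strlen($matches[2][0]), PHP_EOL;
}
```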
If the TRACE: is at the start of the string, you can also add an anchor:
^([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
Edit
If every entry starts with the same format, you can capture the first line and then match all following lines that do not start with that format.
^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)
The pattern matches:
^ Start of string (or start of a line when the /m flag is set)
([A-Z]+):  Capture group 1, then match : and a space
( Capture group 2
.* Match the rest of the line
(?:\r?\n(?![A-Z]+: ).*)* Repeat matching all lines that do not start with the pattern [A-Z]+:
) Close group 2
In PHP you can use:
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
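A short sketch of that pattern in use. The sample log is made up for illustration: two entries, the first spanning two lines.

```php
<?php
// Assumed sample log, not taken from the question.
$data = "TRACE: first line\ncontinued line\nERROR: second";

$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';

preg_match_all($re, $data, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // json_encode makes the embedded newline visible in the output.
    echo $m[1], ' => ', json_encode($m[2]), PHP_EOL;
}
```

Each entry ends where the next line starts with the LABEL: format, so continuation lines stay in group 2 of the entry they belong to.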
Try this
preg_match('/\A(?>[^\r\n]*(?>\r\n?|\n)){0,4}[^\r\n]*\z/',$data)

I'm trying to capture data in a web url with regex

I'm trying to build a regex to match my URLs.
Here are two example URLs:
category/sorganiser/bouger/escalade/offre/78934/
category/sorganiser/savourer/offre/8040/
I would like to get the number just after offre (78934 and 8040),
as well as the word just before offre (escalade and savourer).
I tried several patterns, but none of them worked:
^category/(((\w)+/){1,3})(\d+)/?$
^category/(((\w)+/){1,3})/offre/(\d+)/?$
https://regex101.com/r/S4MTvK/1
Thank you
Instead of repeating a group that holds a single word char, (\w)+, you can repeat 1+ word chars inside the group: (\w+)
Do not match an extra / before offre, as the trailing / is already matched in each iteration of ^category/(?:(\w+)/){1,3}
Repeating the capture group inside a non capture group (?:...) makes group 1 hold the last occurrence of the iteration.
^category/(?:(\w+)/){1,3}offre/(\d+)
The pattern matches
^ Start of string
category/ Match literally
(?: Non capture group
(\w+)/ Capture group 1, match 1+ word chars and match /
){1,3} Close non capture, repeat 1-3 times and capture group 1 contains the last occurrence of 1+ word chars which is escalade or savourer
offre/ Match literally
(\d+) Capture group 2, match 1+ digits
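A minimal sketch applying the pattern to the two example URLs. The array keys word and id are illustrative names only.

```php
<?php
$urls = [
    'category/sorganiser/bouger/escalade/offre/78934/',
    'category/sorganiser/savourer/offre/8040/',
];

// "~" is used as the delimiter so the slashes need no escaping.
$pattern = '~^category/(?:(\w+)/){1,3}offre/(\d+)~';

$results = [];
foreach ($urls as $url) {
    if (preg_match($pattern, $url, $m)) {
        // Group 1 holds the last repetition, i.e. the word right before offre.
        $results[] = ['word' => $m[1], 'id' => $m[2]];
    }
}
print_r($results);
```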
To also match an optional / before the end of the string:
^category/(?:(\w+)/){1,3}offre/(\d+)/?$

PHP preg_replace - Replace text part but not too soon

Using:
$text = preg_replace("/\[\[(.*?)SPACE(.*?)\]\]/im",'$2',$text);
to clean the text and keep only wordtwo:
$text = '..text.. [[wordoneSPACE**wordtwo**]] ..moretext..';
but it fails if the text has an earlier [[:
$text = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';
How can I limit the match to only the [[...]] that contains the SPACE word?
If there can be no [ and ] inside the [[...]] you may use
$text = preg_replace("/\[\[([^][]*)SPACE([^][]*)]]/i",'$2',$text);
The [^][] negated character class matches only a char other than [ and ], so the match cannot cross a [[...]] border.
Otherwise, use a tempered greedy token:
$text = preg_replace("/\[\[((?:(?!\[{2}).)*?)SPACE(.*?)]]/is",'$2',$text);
The (?:(?!\[{2}).)*? pattern matches any char, 0 or more times but as few as possible, as long as that char does not start a [[ sequence, so the match cannot cross the next [[ border.
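A minimal sketch running both variants on the failing input from the question. The variable names $byClass and $byToken are illustrative.

```php
<?php
$text = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';

// Variant 1: no [ or ] allowed inside the [[...]] pair.
$byClass = preg_replace('/\[\[([^][]*)SPACE([^][]*)]]/i', '$2', $text);

// Variant 2: tempered greedy token, group 1 may not run into another [[.
$byToken = preg_replace('/\[\[((?:(?!\[{2}).)*?)SPACE(.*?)]]/is', '$2', $text);

echo $byClass, PHP_EOL;
echo $byToken, PHP_EOL;
```

Both calls leave the earlier, unclosed [[ untouched and replace only the pair that contains SPACE.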
Another option is to use possessive quantifiers.
In the first group you can use a negated character class to match any char except square brackets or S, and allow an S only when it is not followed by PACE.
\[\[([^][S]++(?:S(?!PACE)|[^][S]+)*+)SPACE([^][]++)\]\]
In parts
\[\[ Match [[
( Capture group 1
[^][S]++ Match 1+ times any char except S, ] or [
(?: Non capturing group
S(?!PACE) Match either an S not followed by PACE
| Or
[^][S]+ Match 1+ times any char except S, ] or [
)*+ Close group and repeat 0+ times
) Close group 1
SPACE Match literally
( Capture group 2
[^][]++ Match 1+ times any char except ] or [
) Close group
\]\] Match ]]
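A short sketch of the possessive variant. The second input, with an S inside the first word, is made up here to show the S(?!PACE) branch at work.

```php
<?php
$pattern = '/\[\[([^][S]++(?:S(?!PACE)|[^][S]+)*+)SPACE([^][]++)\]\]/';

// Failing input from the question.
$text = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';
$result = preg_replace($pattern, '$2', $text);

// Assumed extra input: an S that is not the start of SPACE.
$withS = preg_replace($pattern, '$2', '[[aSideSPACEtwo]]');

echo $result, PHP_EOL;
echo $withS, PHP_EOL;
```

Note that this pattern has no i flag, so SPACE is matched case-sensitively, and both groups must contain at least one character.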

PHP - ungreedy regular expression still a little 'greedy'

For my CMS I need to replace multiline content between [?][/?] tags if it contains the string %empty%, leaving it untouched if the %empty% mark is not found.
$a='
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text';
$r= preg_replace (
'/(\[\?\]).*?%empty%.*?(\[\/\?\])/s',
"REPLACED",
$a ) ;
echo $r;
Right result:
REPLACED
text
REPLACED
text
It works well in almost every combination, except when the first block does not match. In that case everything between the first [?] and the last [/?] is replaced:
$a='
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]
text
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text';
Wrong result:
REPLACED
text
Expected:
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]
text
REPLACED
text
I have tried both ungreedy and 'lazy' regular expressions with the same result. I think I need to explicitly define the second [/?] in the regexp, but I have had no success.
For your current example data, if you want to match from [?] till [/?] and in between there can not be [?] and there must be %empty%, you might make use of a tempered greedy token.
Using the /s modifier to make the dot match a newline:
\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]
Explanation
\[\?\] Match [?]
(?: Non capturing group
(?!\[/?\?\]). Assert what is directly on the right is not [?] or [/?]. Then match any char.
)* Close non capturing group and repeat 0+ times
%empty% Match literally
(?: Non capturing group
(?!\[\?\]). Assert what is directly on the right is not [?]. Then match any char.
)* Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
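A minimal sketch applying the tempered greedy token pattern to the failing input from the question. The leading newline of the original string is dropped here for brevity.

```php
<?php
$a = "[?]<h1>%!empty%</h1>\n<p>text</p>\n[/?]\ntext\n"
   . "[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

// "~" as delimiter avoids escaping "/"; the s flag lets the dot match newlines.
$pattern = '~\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]~s';

$r = preg_replace($pattern, 'REPLACED', $a);
echo $r, PHP_EOL;
```

The first block is left alone because the token before %empty% cannot step over its closing [/?], so the search for %empty% never reaches the second block from there.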
Edit
@Casimir et Hippolyte suggests a more performant pattern using an Unrolled Star Alternation Solution approach:
\[\?\][^[%]*+(?:\[(?!\?])[^[%]*|%(?!empty%)[^[{%]*)*+%empty%[^[]*+(?:\[(?!/?\?])[^[]*)*+\[/\?]
Explanation
\[\?\] Match [?]
[^[%]*+ Negated character class, match 0+ times any char except [ or %, using a possessive quantifier
(?: Non capturing group
\[(?!\?]) Match [, assert what is directly on the right is not ?]
[^[%]* If that is the case, match 0+ times any char except [ or %
| Or
%(?!empty%) Match %, assert what is directly on the right is not empty%
[^[{%]* If that is the case, match 0+ times any char except [ { or %
)*+ Close non capturing group and repeat 0+ times using a possessive quantifier
%empty%[^[]*+ Match %empty% and 0+ times any char except [
(?: Non capturing group
\[(?!/?\?]) Match [, assert what is directly on the right is not an optional / and ?]
[^[]* If that is the case, match 0+ times any char except [
)*+ Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
