PHP preg_replace - Replace text part but not too soon - php

I'm using:
$text = preg_replace("/\[\[(.*?)SPACE(.*?)\]\]/im",'$2',$text);
to clean up text like this and keep only wordtwo:
$text = '..text.. [[wordoneSPACE**wordtwo**]] ..moretext..';
but it fails if the text has an earlier [[:
$text = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';
How can I limit the match to the [[...]] that actually contains the SPACE word?

If there can be no [ and ] inside the [[...]] you may use
$text = preg_replace("/\[\[([^][]*)SPACE([^][]*)]]/i",'$2',$text);
See the regex demo. The [^][] negated character class only matches a char other than [ and ], so the match cannot cross a [[...]] border.
Otherwise, use a tempered greedy token:
$text = preg_replace("/\[\[((?:(?!\[{2}).)*?)SPACE(.*?)]]/is",'$2',$text);
See this regex demo.
The (?:(?!\[{2}).)*? pattern matches any char, 0 or more times but as few as possible, as long as that char does not start a [[ sequence, so group 1 cannot cross the next [[ border.
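To check both patterns against the two strings from the question, a minimal sketch could look like this. The variable names ($samples, $negated, $tempered, $results) are only illustrative and not part of the original answer.

```php
<?php
// The two inputs from the question: one clean, one with a stray "[[" earlier in the text.
$samples = [
    '..text.. [[wordoneSPACE**wordtwo**]] ..moretext..',
    '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..',
];

// Negated character class: neither group may contain "[" or "]".
$negated = '/\[\[([^][]*)SPACE([^][]*)]]/i';
// Tempered greedy token: group 1 may not run across another "[[".
$tempered = '/\[\[((?:(?!\[{2}).)*?)SPACE(.*?)]]/is';

$results = [];
foreach ($samples as $text) {
    $results[] = preg_replace($negated, '$2', $text);
    $results[] = preg_replace($tempered, '$2', $text);
}
print_r($results);
```

With both patterns the stray [[ and the text after it should stay in place, and only the [[wordoneSPACE**wordtwo**]] part is reduced to **wordtwo**.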

Another option might be using possessive quantifiers.
In the first group you could use a negated character class to match any character except square brackets or S, and allow an S only if it is not followed by PACE.
\[\[([^][S]++(?:S(?!PACE)|[^][S]+)*+)SPACE([^][]++)\]\]
In parts
\[\[ Match [[
( Capture group 1
[^][S]++ Match 1+ times any char except S, ] or [
(?: Non capturing group
S(?!PACE) Match either an S not followed by PACE
| Or
[^][S]+ Match 1+ times any char except S, ] or [
)*+ Close group and repeat 0+ times
) Close group 1
SPACE Match literally
( Capture group 2
[^][]++ Match 1+ times any char except ] or [
) Close group
\]\] Match ]]
Regex demo
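A small sketch of how the possessive pattern could be used with preg_replace. The second input string ($withInnerS) is a made-up example, added only to show an S inside group 1 that is not part of SPACE.

```php
<?php
// Possessive-quantifier pattern from the answer, wrapped in "/" delimiters.
$pattern = '/\[\[([^][S]++(?:S(?!PACE)|[^][S]+)*+)SPACE([^][]++)\]\]/';

// Input from the question with a stray "[[" before the real entity.
$withStrayBrackets = '.. [[ ..text(not to cut).. [[wordoneSPACE**wordtwo**]] ..moretext..';
// Hypothetical input: an "S" inside group 1 that does not start "SPACE".
$withInnerS = '[[preSetSPACEvalue]]';

$cleaned = preg_replace($pattern, '$2', $withStrayBrackets);
$cleanedInnerS = preg_replace($pattern, '$2', $withInnerS);

echo $cleaned, "\n", $cleanedInnerS, "\n";
```

Note that, as written, group 1 has to start with a char other than S, because [^][S]++ needs at least one char before the alternation is tried.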

Related

Maximum character length for PHP multiline regular expressions?

I'm trying to evaluate a multiline RegExp with preg_match_all.
Unfortunately there seems to be a character limit around 24,000 characters (24,577 to be specific).
Does anyone know how to get this to work?
Pseudo-code:
<?php
$data = 'TRACE: aaaa(24,577 characters)';
preg_match_all('/([A-Z]+): ((?:(?![A-Z]+:).)*)\n/s', $data, $matches);
var_dump($matches);
?>
Working example (with < 24,577 characters): https://3v4l.org/8iRCc
Example that's NOT working (with > 24,577 characters): https://3v4l.org/ceKn6
You might rewrite the pattern using a negated character class instead of the tempered greedy token approach with the negative lookahead:
([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
([A-Z]+): Capture group 1, match 1+ uppercase chars, then match : and a space
( Capture group 2
[^A-Z\r\n]* Match 0+ times any char except A-Z or a newline
(?> Atomic group
(?: Non capture group
\r?\n Match a newline
| Or
[A-Z] Match a single char A-Z
(?![A-Z]*:) Negative lookahead, assert that what follows is not optional chars A-Z and then :
) Close non capture group
[^A-Z\r\n]* Optionally match any char except A-Z or a newline
)* Close atomic group and optionally repeat
)\r?\n Close group 2 and match a newline
Regex demo | Php demo
If the TRACE: is at the start of the string, you can also add an anchor:
^([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
Regex demo
Edit
If the strings start with the same format, you can capture and match all lines that do not start with the opening format.
^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)
The pattern matches:
^ Start of string (or start of a line when the /m flag is used, as in the PHP version)
([A-Z]+): Capture group 1
( Capture group 2
.* Match the rest of the line
(?:\r?\n(?![A-Z]+: ).*)* Repeat matching all lines that do not start with the pattern [A-Z]+:
) Close group 2
Regex demo
In php you can use
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
Php demo
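To see whether a pattern survives a long input, it may help to check preg_last_error() after the call, since the preg functions do not raise an exception when the engine gives up. A minimal sketch, assuming made-up test data ($data below is not the data from the question):

```php
<?php
// Hypothetical test data: one very long entry with a continuation line, then a short entry.
$data = 'TRACE: ' . str_repeat('a', 30000) . "\ncontinued line\nDEBUG: short";

// Line-based pattern from above; /m lets ^ match at the start of every line.
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
$count = preg_match_all($re, $data, $matches, PREG_SET_ORDER);

if ($count === false || preg_last_error() !== PREG_NO_ERROR) {
    // The engine gave up, e.g. PREG_BACKTRACK_LIMIT_ERROR or PREG_JIT_STACKLIMIT_ERROR.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    foreach ($matches as $m) {
        // $m[1] is the label, $m[2] the text up to the next label.
        echo $m[1], ' => ', strlen($m[2]), " characters\n";
    }
}
```

Which of the PCRE limits was hit in the original question is not stated there, so the error constants in the comment are only the likely candidates.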
Try this
preg_match('/\A(?>[^\r\n]*(?>\r\n?|\n)){0,4}[^\r\n]*\z/',$data)

I'm trying to capture data in a web url with regex

I'm trying to build my regex to match my urls
Here are 2 example urls
category/sorganiser/bouger/escalade/offre/78934/
category/sorganiser/savourer/offre/8040/
I would like to get the number just after offre (78934 and 8040)
as well as the word just before the word offre (escalade and savourer)
I tried several patterns, but none of them matched:
^category/(((\w)+/){1,3})(\d+)/?$
^category/(((\w)+/){1,3})/offre/(\d+)/?$
https://regex101.com/r/S4MTvK/1
Thank you
Instead of repeating a single word char in a group, (\w)+, you can match 1+ word chars inside a single group, (\w+).
Note that you should not match another / before offre, as it is already matched by the iteration ^category/(?:(\w+)/){1,3}
You can repeat the capture group inside a non-capture group (?:...) so that the capture group holds the last occurrence of the iteration.
^category/(?:(\w+)/){1,3}offre/(\d+)
The pattern matches
^ Start of string
category/ Match literally
(?: Non capture group
(\w+)/ Capture group 1, match 1+ word chars and match /
){1,3} Close non capture, repeat 1-3 times and capture group 1 contains the last occurrence of 1+ word chars which is escalade or savourer
offre/ Match literally
(\d+) Capture group 2, match 1+ digits
Regex demo
To also match an optional / before the end of the string:
^category/(?:(\w+)/){1,3}offre/(\d+)/?$
Regex demo
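In PHP this could be used as follows. This is a sketch only; the ~ delimiter is chosen so that / needs no escaping, and the array keys word and id are illustrative names.

```php
<?php
// Anchored pattern from above; "~" as delimiter avoids escaping every "/".
$pattern = '~^category/(?:(\w+)/){1,3}offre/(\d+)/?$~';

$urls = [
    'category/sorganiser/bouger/escalade/offre/78934/',
    'category/sorganiser/savourer/offre/8040/',
];

$captures = [];
foreach ($urls as $url) {
    if (preg_match($pattern, $url, $m)) {
        // $m[1] holds the last repetition of (\w+), $m[2] the digits after offre/.
        $captures[] = ['word' => $m[1], 'id' => $m[2]];
    }
}
print_r($captures);
```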

How to add additional capture group to lookahead, lookbehind regex

I am using this regex: (?<=\[).+?(?=\]) to match data in my test string below.
This regex matches everything between my brackets. I need to also include the '1234567890ABC...' portion of my string as well. How would I do that?
This is my test string:
[one] [two] [three] 1234567890ABC...
You could make use of the \G anchor and match any char except the square brackets, or match \w+ to match only word characters.
(?:\[|\G(?!^)]\h\[?)\K[^][\s]+
(?: Non capture group
\[ Match [
| Or
\G(?!^) Assert the position at the end of the previous match, but not at the start of the string
]\h\[? Match ], horizontal whitespace char and optional [
)\K Close group and reset the match buffer
[^][\s]+ Match 1+ times any char except square brackets or whitespace char
Regex demo
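A short sketch of the \G pattern with preg_match_all on the test string from the question ($str and $matches are just illustrative names):

```php
<?php
$str = '[one] [two] [three] 1234567890ABC...';

// \K drops the bracket and whitespace part from the reported match,
// so only the text after it ends up in $matches[0].
$pattern = '/(?:\[|\G(?!^)]\h\[?)\K[^][\s]+/';

preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
```

Because each match has to continue right where the previous one ended, the trailing part is only picked up when it directly follows the last bracket and a single horizontal whitespace char.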
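You could try this pattern. It is the same as the one you are using, but it also matches the words and numbers after the brackets:
(?<=\[).+?(?=\])\d+|\w+
Note that the first alternative can never match, because \d+ cannot match the ] that the lookahead requires at that position, so in practice all matches come from \w+.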

PHP - ungreedy regular expression still a little 'greedy'

For my CMS I need to replace multiline content between [?][/?] tags if it contains the string %empty%, leaving it untouched if the %empty% mark is not found.
$a='
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text';
$r= preg_replace (
'/(\[\?\]).*?%empty%.*?(\[\/\?\])/s',
"REPLACED",
$a ) ;
echo $r;
Right result:
REPLACED
text
REPLACED
text
It works well in almost every combination, except when the first block does not contain %empty%. In that case, all content between the first [?] and the last [/?] is replaced.
$a='
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]
text
[?]<h1>%empty%</h1>
<p>text</p>
[/?]
text';
Wrong result:
REPLACED
text
Expected:
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]
text
REPLACED
text
I have tried both the ungreedy modifier and lazy quantifiers, with the same result. I think I need to explicitly define the second [/?] in the regexp, but I have had no success.
For your current example data, if you want to match from [?] till [/?] and in between there can not be [?] and there must be %empty%, you might make use of a tempered greedy token.
Using the /s modifier to make the dot match a newline:
\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]
Explanation
\[\?\] Match [?]
(?: Non capturing group
(?!\[/?\?\]). Assert what is directly on the right is not [?] or [/?]. Then match any char.
)* Close non capturing group and repeat 0+ times
%empty% Match literally
(?: Non capturing group
(?!\[\?\]). Assert what is directly on the right is not [?]. Then match any char.
)* Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
Regex demo
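To compare the original lazy pattern with the tempered greedy token on the failing input from the question, a minimal sketch (the ~ delimiter and the variable names are my own choice):

```php
<?php
// Input from the question where the first block contains %!empty% instead of %empty%.
$a = "\n[?]<h1>%!empty%</h1>\n<p>text</p>\n[/?]\ntext\n[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

// Original lazy pattern: starts at the first [?] and runs on to the first %empty% it finds.
$lazy = '~(\[\?\]).*?%empty%.*?(\[/\?\])~s';
// Tempered greedy token: no [?] or [/?] may appear between the opening tag and %empty%.
$tempered = '~\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]~s';

$lazyResult = preg_replace($lazy, 'REPLACED', $a);
$temperedResult = preg_replace($tempered, 'REPLACED', $a);

echo $lazyResult, "\n---\n", $temperedResult, "\n";
```

The lazy version swallows the first block as well, while the tempered version leaves it alone and replaces only the block that really contains %empty%.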
Edit
@Casimir et Hippolyte suggests a more performant pattern using an unrolled star alternation approach:
\[\?\][^[%]*+(?:\[(?!\?])[^[%]*|%(?!empty%)[^[{%]*)*+%empty%[^[]*+(?:\[(?!/?\?])[^[]*)*+\[/\?]
Explanation
\[\?\] Match [?]
[^[%]*+ Negated character class, match 0+ times any char except [ or %, without giving back
(?: Non capturing group
\[(?!\?]) Match [, assert what is directly on the right is not ?]
[^[%]* If that is the case, match 0+ times any char except [ or %
| Or
%(?!empty%) Match %, assert what is directly on the right is not empty%
[^[{%]* If that is the case, match 0+ times any char except [ { or %
)*+ Close non capturing group and repeat 0+ times using a possessive quantifier
%empty%[^[]*+ Match %empty% and 0+ times any char except [
(?: Non capturing group
\[(?!/?\?]) Match [, assert what is directly on the right is not an optional / and ?]
[^[]* If that is the case, match 0+ times any char except [
)*+ Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
Regex demo
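The unrolled pattern can be tried on the same failing input. This is only a sketch: the pattern is copied as written above, and ~ is used as the delimiter so that / needs no escaping.

```php
<?php
// Same input as in the question: only the second block contains %empty%.
$a = "\n[?]<h1>%!empty%</h1>\n<p>text</p>\n[/?]\ntext\n[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

// Unrolled pattern; no /s flag is needed because negated classes also match newlines.
$unrolled = '~\[\?\][^[%]*+(?:\[(?!\?])[^[%]*|%(?!empty%)[^[{%]*)*+%empty%[^[]*+(?:\[(?!/?\?])[^[]*)*+\[/\?]~';

$result = preg_replace($unrolled, 'REPLACED', $a);
echo $result, "\n";
```

One thing to keep in mind: as written, the class [^[{%] also excludes {, so a { that follows a % other than %empty% inside a block would stop that block from matching.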

Another regex: square brackets

I have something like: word[val1|val2|val3] . Need a regex to capture both: word and val1|val2|val3
Another sample: leader[77]
Result: Need a regex to capture both: leader and 77
This is what I have so far: ^(.*\[)(.*) and it gives me: array[0]=word[val1|val2|val3];
array[1]=word[
array[2]=val1|val2|val3]
array[1] is needed but without [
array[2] is needed but without ]
Any ideas? - Thank you
For either sample you can use \w*(\[.*\])
\w* match any word character [a-zA-Z0-9_]
Quantifier: * Between zero and unlimited times
\[ matches the character [ literally
.* matches any character (except newline)
\] matches the character ] literally
EDIT: I kept hammering away to get rid of the brackets and came up with (\w*)\[([^][]*)]
EDIT: Which I now see Wiktor suggested in comments before I got back with mine.
You can use
([^][]+)\[([^][]*)]
Here is the regex demo
Explanation:
([^][]+) - Group 1 matching one or more chars other than ] and [
\[ - a literal [
([^][]*) - Group 2 capturing 0+ chars other than [ and ]
] - a literal ].
See IDEONE demo:
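$re = '~([^][]+)\[([^][]*)]~';   // group 1: text before "[", group 2: text inside the brackets
$str = "word[val1|val2|val3]";
preg_match($re, $str, $matches);
echo $matches[1]. "\n" . $matches[2];   // prints "word", then "val1|val2|val3"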
