Multi line negative lookahead - php

I'm not really good with regex (I've been on this one for hours) and I'm struggling to remove all the empty lines between two identifiers ("{|" and "|}").
My regex looks like this (sorry for your eyes): (\{\|)((?:(?!\|\}).)+)(?:\n\n)((?:(?!\|\}).)+)(\|\})
(\{\|) : the characters "{|"
((?:(?!\|\}).)+) : any characters, as long as none of them starts "|}" (negative lookahead)
(?:\n\n) : the empty line I want to delete
((?:(?!\|\}).)+) : any characters, as long as none of them starts "|}" (negative lookahead)
(\|\}) : the characters "|}"
Demo
It works, but it deletes only the last empty line. Can you help me make it work for all the empty lines?
I tried to add a negative lookahead on \n\n with a repeating group around everything, but it did not work.

Several ways:
The \G based pattern: (only one pattern is needed)
$txt = preg_replace('~ (?: \G (?!\A) | \Q{|\E ) [^|\n]*+ (?s: (?! \Q|}\E | \n\n) . [^|\n]*)*+ \n \K \n+ ~x', '', $txt);
The \G matches the start of the string or the position in the string after the last successful match. This ensures that several matches are contiguous.
What I call a \G based pattern can be schematized like this:
(?: \G position after a successful match | first match beginning ) reach the target \K target
The "reach the target" part is designed to never match the closing sequence |}. So once the last target is found, the \G part will fail until the first match part succeeds again.
~
### The beginning
(?:
    \G (?!\A) # contiguous to a successful match
  |
    \Q{|\E    # opening sequence
    #; note that you can add `[^{]* (*SKIP)` before to quickly avoid
    #; all failing positions
    #; note that if you want to check that the opening sequence is followed by
    #; a closing sequence (without another opening sequence), you can do it
    #; here using a lookahead
)
### lets reach the target
#; note that all this part can also be written like this: `(?s:(?!\|}|\n\n).)*`
#; or `(?s:[^|\n]|(?!\|}|\n\n).)*`, but I chose the unrolled pattern, which is
#; more efficient.
[^|\n]*+ # all that isn't a pipe or a newline
# possibly a character that isn't the start of |} or \n\n
(?s:
    (?! \Q|}\E | \n\n ) # negative lookahead
    .                   # the character
    [^|\n]*
)*+
#; adding a `(*SKIP)` here can also be useful if there are no more empty lines
#; until the closing sequence
### The target
\n \K \n+ # the \K is a convenient way to define the start of the returned match
          # result; this way, only \n+ is replaced (with nothing)
~x
Or with preg_replace_callback (simpler):
$txt = preg_replace_callback('~\Q{|\E .*? \Q|}\E~sx', function ($m) {
    return preg_replace('~\n+~', "\n", $m[0]);
}, $txt);
demos
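To compare the two approaches above side by side, here is a minimal sketch that runs both on the same input. The sample string and the variable names ($viaG, $viaCallback) are made up for illustration and are not taken from the linked demos.

```php
<?php
// Hypothetical sample: blank lines both outside and inside a {| ... |} block.
$txt = "before\n\n{|\na\n\nb\n\n\nc\n|}\n\nafter";

// \G-based pattern from above: strips the extra newlines inside the block only.
$viaG = preg_replace(
    '~ (?: \G (?!\A) | \Q{|\E ) [^|\n]*+ (?s: (?! \Q|}\E | \n\n) . [^|\n]*)*+ \n \K \n+ ~x',
    '',
    $txt
);

// Callback version: match each block, then collapse newline runs inside it.
$viaCallback = preg_replace_callback('~\Q{|\E .*? \Q|}\E~sx', function ($m) {
    return preg_replace('~\n+~', "\n", $m[0]);
}, $txt);

// Both variants should leave the blank lines outside the block untouched.
var_dump($viaG === $viaCallback);
echo $viaG, "\n";
```

On this input both calls should yield the same string, with the blank lines before "{|" and after "|}" preserved.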

You can use a positive lookahead pattern to ensure that a matching blank line is followed by |}, but also use a negative lookahead pattern to ensure that none of the characters between the blank line and the |} is the starting position of a {|:
\n{2,}(?=(?:(?!\{\|).)*?\|\})
Demo: https://regex101.com/r/oWfkg1/8
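A possible way to apply this pattern from PHP is sketched below. It assumes the s flag (so the dot can cross newlines) and a single "\n" as the replacement; neither is spelled out in the answer, and the sample text is invented.

```php
<?php
// Hypothetical sample: blank lines before, inside and after a {| ... |} block.
$txt = "before\n\n{|\na\n\nb\n\n\nc\n|}\n\nafter";

// Collapse each run of 2+ newlines to one newline, but only when a |} follows
// without any {| in between (i.e. when the run sits inside a block).
$result = preg_replace('~\n{2,}(?=(?:(?!\{\|).)*?\|\})~s', "\n", $txt);

echo $result, "\n";
```

Note that the lookahead rescans the text up to the next |} for every candidate run, so on long blocks this can be noticeably slower than the \G-based pattern.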

If you use:
(?<={\|)(\n{2,}|(\r\n){2,}|\s+)(?=\|})
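Then it will match newlines and whitespace between {| and |}, but only when nothing other than whitespace sits between the two delimiters, because the lookbehind and the lookahead both have to touch the matched run.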

Related

Maximum character length for PHP multiline regular expressions?

I'm trying to evaluate a multiline RegExp with preg_match_all.
Unfortunately there seems to be a character limit around 24,000 characters (24,577 to be specific).
Does anyone know how to get this to work?
Pseudo-code:
<?php
$data = 'TRACE: aaaa(24,577 characters)';
preg_match_all('/([A-Z]+): ((?:(?![A-Z]+:).)*)\n/s', $data, $matches);
var_dump($matches);
?>
Working example (with < 24,577 characters): https://3v4l.org/8iRCc
Example that's NOT working (with > 24,577 characters): https://3v4l.org/ceKn6
You might rewrite the pattern using a negated character class instead of the tempered greedy token approach with the negative lookahead:
([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
([A-Z]+): Capture group 1, match 1+ uppercase chars, then match : and a space
( Capture group 2
[^A-Z\r\n]* Match 0+ times any char except A-Z or a newline
(?> Atomic group
(?: Non capture group
\r?\n Match a newline
| Or
[A-Z] Match a char A-Z
(?![A-Z]*:) Negative lookahead, assert that what follows is not optional chars A-Z and then :
) Close non capture group
[^A-Z\r\n]* Optionally match any char except A-Z or a newline
)* Close atomic group and optionally repeat
)\r?\n Close group 2 and match a newline
Regex demo | Php demo
If the TRACE: is at the start of the string, you can also add an anchor:
^([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
Regex demo
Edit
If the strings start with the same format, you can capture and match all lines that do not start with the opening format.
^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)
The pattern matches:
^ Start of string (start of each line when the /m flag is used, as in the PHP code below)
([A-Z]+): Capture group 1
( Capture group 2
.* Match the rest of the line
(?:\r?\n(?![A-Z]+: ).*)* Repeat matching all lines that do not start with the pattern [A-Z]+:
) Close group 2
Regex demo
In php you can use
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
Php demo
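As a rough check that the line-based pattern copes with long records, the sketch below feeds it a record larger than the size mentioned in the question. The data is generated and hypothetical. A failing preg_match_all on long input is most likely a PCRE limit (backtracking or JIT stack) rather than a cap on subject length, so the sketch also reads preg_last_error() when the call fails.

```php
<?php
// Hypothetical data: a TRACE record well past the ~24,577-character mark.
$data = 'TRACE: ' . str_repeat('a', 30000) . "\nDEBUG: short line";

$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
$count = preg_match_all($re, $data, $matches, PREG_SET_ORDER);

if ($count === false) {
    // preg_last_error() returns a PREG_* error code explaining why PCRE gave up.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    foreach ($matches as $m) {
        echo $m[1], ' => ', strlen($m[2]), " characters\n";
    }
}
```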
Try this
preg_match('/\A(?>[^\r\n]*(?>\r\n?|\n)){0,4}[^\r\n]*\z/',$data)

Mask email regexp

I'm trying to write a PHP regexp to mask an email so that
example#gmail.com turns into e*****e#gmail.com.
$maskedEmail=preg_replace('/^*#/', '*', $email);
You may use
preg_replace('~((?!^)\G|^[^#])[^#](?=[^#]+#)~', '$1*', $s)
See the regex demo
Details
((?!^)\G|^[^#]) - Group 1: the end of the previous match or start of string and any char other than a #
[^#] - a char other than #
(?=[^#]+#) - a positive lookahead that requires 1+ chars (that + is important here, you can't use *) other than # followed with a # immediately to the right of the current location.
The replacement is the value captured in Group 1 followed by * (so that the first char is kept in the string, and then all but the last char before a # are replaced with *).
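A minimal sketch of the \G-based replacement in use; the variable names are illustrative only.

```php
<?php
$s = 'example#gmail.com';

// Group 1 holds the first character on the first match and is empty on the
// following (contiguous) matches, so '$1*' keeps the first char and masks the rest.
$masked = preg_replace('~((?!^)\G|^[^#])[^#](?=[^#]+#)~', '$1*', $s);

echo $masked, "\n";
```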
To not mask the first character, you could assert that what is directly to the left is not the start of the string.
To not mask the character directly before the #, you could assert that what is on the right is at least one character other than #, followed by the #.
(?<!^).(?=[^#]+#)
In the replacement use:
*
Explanation
(?<!^) Negative lookbehind, assert what is directly on the left is not the start of the string
. Match any character except a newline
(?= Positive lookahead, assert what is directly on the right is
[^#]+# Match 1+ times any char except # using a negated character class, and match #
) Close positive lookahead
Regex demo | Php demo
For example
$email = "example#gmail.com";
$maskedEmail=preg_replace('/(?<!^).(?=[^#]+#)/', '*', $email);
echo $maskedEmail;
Result
e*****e#gmail.com

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks
One option could be using lookaheads to assert 3 digits, not 2 backslashes and not 2 times a hyphen.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert what is on the left is not a non whitespace char
(?=(?:[^\d\s]*\d){3}) Assert what is on the right is 3 times optional chars other than a digit or whitespace, followed by a digit
(?!(?:[^\s-]*-){2}) Assert what is on the right is not 2 times optional chars other than whitespace or a hyphen, followed by a hyphen
(?!(?:[^\s\\]*\\){2}) Assert what is on the right is not 2 times optional chars other than whitespace or a backslash, followed by a backslash
[A-Z0-9/\\-]+ Match any of the listed 1+ times
(?!\S) Assert what is on the right is not a non whitespace char
Regex demo
Your patterns can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Check out this Regex101 demo.
Remark
This solution assumes that each string tested resides on its own line, i.e. not just being separated by whitespace.
In case the strings are separated by whitespace, choose the solution of user @TheFourthBird (which is essentially the same as this one but caters for the whitespace separation).
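The line-based pattern can be tried from PHP roughly as follows. This sketch assumes the m flag (so ^ and $ work per line) and the ~ delimiter; the input is a small subset of the question's data plus its two negative cases.

```php
<?php
// One candidate per line, as this solution assumes.
$lines = "02ABU-D9435\n01/01/2018\nAA-AA\n03SDFA9433/0";

preg_match_all(
    '~^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$~m',
    $lines,
    $matches
);

// $matches[0] holds the full lines that satisfied all three lookaheads.
print_r($matches[0]);
```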
You can test the condition for both the hyphen and the slash into a same lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
demo
Detailed:
~ # using the tilde as pattern delimiter avoids having to escape all slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that there's no more than one hyphen and one slash
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # possible other characters until the end of the string
\z # end of the string.
~x
To understand this better (if you are not familiar with lookarounds): these three subpatterns all start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the first two are zero-width assertions that are simple tests and don't consume any character.
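To check the backreference pattern against the data from the question, a small harness like the following can be used. The $samples array simply copies the strings listed in the question plus its two negative cases; $matched is an illustrative name.

```php
<?php
$pattern = '~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~';

$samples = [
    '02ABU-D9435', '013DFC', '1123451', '03323456782', 'ADS7124536768',
    '03SDFA9433/0', '03SDFA9433/', '03SDFA9433/1', 'A41B03423523', 'O4AGFC4430',
    '01/01/2018', // two slashes: should be rejected
    'AA-AA',      // no digits: should be rejected
];

// Keep only the strings accepted by the pattern.
$matched = array_values(array_filter($samples, function ($s) use ($pattern) {
    return preg_match($pattern, $s) === 1;
}));

print_r($matched);
```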

PHP: RegEx Syntax

I am a serious newbie with regular expression so please disregard my mistakes. I need to be sure that several criteria in a string are met.
Requirements:
Have at most 5 words
Max of 256 characters
Word is considered 1 or more characters - no spaces
Shouldn't contain two consecutive spaces
Example:
Tree blows in the wind
1-Tree falls over
Failure Example:
Tree blows in the night sky
Tree breaks 2 limbs during night
Can this be done in one single expression or should it be broken up?
Validating for 2 spaces:
- /^\s\s$/
Max of 256 characters:
- /^[a-zA-Z0-9]{,256}$/
I am not sure how to test case for the 5 words and combine the other criteria that I impose. Can anyone help?
Test for word:
- /^\w{1,5}$
You can try this:
(?s)\A(?!.{257}|.*\s\s)\W*\w*(?:\W+\w+){0,4}\W*\z
pattern details:
(?s) # turn on the singleline mode: allow the dot to match newlines
\A # start of the string anchor
(?! # open a negative lookahead assertion: means not followed by
.{257} # 257 characters
| # OR
.*\s\s # two consecutive whitespaces
) # close the negative lookahead
\W* # optional non-word characters
\w* # optional word characters (nothing in your requirements forbids a string without words or an empty string)
(?: # open a non-capturing group
\W+ # non-word characters: words are obviously separated with non-word characters
\w+ # another word
){0,4} # repeat the non-capturing group between zero and 4 times
\W* # optional non-word characters
\z # anchor for the end of the string
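A short sketch to exercise the pattern on the examples from the question. The helper name isValidPhrase is hypothetical, and the last three inputs are extra cases added here to cover the two-spaces and length rules.

```php
<?php
// Hypothetical helper wrapping the pattern above.
function isValidPhrase(string $s): bool
{
    return preg_match('~(?s)\A(?!.{257}|.*\s\s)\W*\w*(?:\W+\w+){0,4}\W*\z~', $s) === 1;
}

$cases = [
    'Tree blows in the wind',
    '1-Tree falls over',
    'Tree blows in the night sky',
    'Tree breaks 2 limbs during night',
    'Tree  blows',          // two consecutive spaces
    str_repeat('a', 256),   // exactly at the length limit
    str_repeat('a', 257),   // one character too long
];

foreach ($cases as $case) {
    echo strlen($case), ' chars: ', isValidPhrase($case) ? 'valid' : 'invalid', "\n";
}
```

Keep in mind that \w splits "1-Tree" into two words, so hyphenated tokens count as several words under this pattern.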

regex matches numbers, but not letters

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (i.e. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This will match a single character that is none of [, i or f, regardless of the combination. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (the negative lookahead will cause
# the entire pattern to fail if its contents match)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
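Applied to the string from the question, the lookahead-based pattern can be tried like this (a minimal sketch; $subject and $m are illustrative names):

```php
<?php
$subject = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

// The tempered dot refuses to step onto "[if" or "[/if", so only the
// innermost tag pair can satisfy the whole pattern.
preg_match('~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s', $subject, $m);

print_r($m);
```

The outer [if-abc] cannot match because its content would have to run across the nested [if-def] tag, so the first match is the inner pair.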
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
Use a period to match any character.
