PCRE: DEFINE statement for lookarounds - php

Stepping deeper into the world of regular expressions, I came across the DEFINE Statement in PCRE.
I have the following code, which defines a lowercase, an uppercase and an anA group (I know it's rather useless at this point, thanks):
(?(DEFINE)
(?<lowercase>(?=[^a-z]*[a-z])) # lowercase
(?<uppercase>(?=[^A-Z]*[A-Z])) # uppercase
(?<anA>A(?=B))
)
^(?&anA)
Now, I wonder how I can combine the lookahead (lowercase in this example) with the anA part? Admittedly, I struggled to find appropriate documentation on the DEFINE syntax. Here's a regex101.com fiddle.
To make it somewhat clearer, I'd like to be able to combine subroutines. For instance, with the above example (e.g. to validate a password which needs to have an A followed by B and some lowercase letters), I could do the following:
^(?=[^a-z]*[a-z]).*?A(?=B).*
How can this be done with the above subroutines?
EDIT: For reference, I ended up using the following construct:
(?(DEFINE)
(?<lc>(?=[^a-z\n]*[a-z])) # lowercase
(?<uc>(?=[^A-Z\n]*[A-Z])) # uppercase
(?<digit>(?=[^\d\n]*\d)) # digit
(?<special>(?=.*[!#]+)) # special character
)
^(?&lc)(?&uc)(?&digit)(?&special).{6,}$
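A minimal PHP sketch of how the construct above might be used with preg_match. The helper name is_valid_password and the sample passwords are made up for illustration, and the x modifier is assumed so that the line breaks and # comments inside the pattern are ignored.

```php
<?php
// Hypothetical helper wrapping the DEFINE-based pattern from the question.
function is_valid_password(string $candidate): bool
{
    // A nowdoc keeps backslashes literal; the x modifier allows layout and comments.
    $pattern = <<<'REGEX'
/(?(DEFINE)
  (?<lc>(?=[^a-z\n]*[a-z]))      # lowercase
  (?<uc>(?=[^A-Z\n]*[A-Z]))      # uppercase
  (?<digit>(?=[^\d\n]*\d))       # digit
  (?<special>(?=.*[!#]+))        # special character
)
^(?&lc)(?&uc)(?&digit)(?&special).{6,}$
/x
REGEX;

    return preg_match($pattern, $candidate) === 1;
}

var_dump(is_valid_password('Abcdef1!')); // contains all four character kinds
var_dump(is_valid_password('abcdef1!')); // no uppercase letter
```

Each subroutine holds nothing but a lookahead, so the calls can be chained in any order without consuming input.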

How I can combine the lookahead (lowercase in this example) with the anA part
You can call the subpattern the same way as you did with anA, by using the (?&lowercase) named subroutine call:
/(?(DEFINE)
(?<lowercase>(?=[^a-z]*[a-z])) # lowercase
(?<uppercase>(?=[^A-Z]*[A-Z])) # uppercase
(?<anA>A(?=B))
)
^(?&lowercase)(.*?)((?&anA)).*
/mgx
See the regex demo. Note that you need to specify the VERBOSE/IgnorePatternWhitespace/Freespace mode with /x modifier at regex101.com for this pattern to work.
Beware of a caveat though in case you want to also DEFINE the .* and .*? subpatterns (see PCRE Man Pages):
All subroutine calls, whether recursive or not, are always treated as atomic groups. That is, once a subroutine has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure. Any capturing parentheses that are set during the subroutine call revert to their previous values afterwards.
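As a rough PHP sketch, the pattern above can be passed to preg_match like this. The g flag is left out because PHP has no such modifier (preg_match_all plays that role), and the sample subjects are made up for illustration.

```php
<?php
// The answer's pattern, with the x modifier so whitespace and comments are ignored.
$pattern = <<<'REGEX'
/(?(DEFINE)
  (?<lowercase>(?=[^a-z]*[a-z]))  # lowercase
  (?<uppercase>(?=[^A-Z]*[A-Z]))  # uppercase
  (?<anA>A(?=B))
)
^(?&lowercase)(.*?)((?&anA)).*
/mx
REGEX;

// Made-up subjects: only the first has a lowercase letter and an A followed by B.
foreach (['xAB', 'AB', 'xAC'] as $subject) {
    $matched = preg_match($pattern, $subject) === 1;
    echo $subject, ' => ', $matched ? 'match' : 'no match', PHP_EOL;
}
```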

Related

PHP regex subroutine reference not working

I have a simple regex like this:
#123(?:(?:(?P<test>[\s\S]*)456(?P<test1>(?P>test))789))#
It should match the following string fine:
123aaaa456bbbb789
But it doesn't.
But if I replace the subroutine reference with a direct copy of the regex:
#123(?:(?:(?P<test>[\s\S]*)456(?P<test1>[\s\S]*)789))#
Then it works perfectly fine.
I can't figure out why referencing the pattern by the group name isn't working.
The point here is that [\s\S]* is a * quantified subpattern that allows a regex engine to backtrack once the subsequent subpatterns fail to match, but the recursion calls in PCRE are atomic, i.e. there is no way for the engine to backtrack when it grabs any 0+ chars with (?P>test), and that is why the pattern fails to match.
In short, #123(?:(?:(?P<test>[\s\S]*)456(?P<test1>(?P>test))789))# pattern can be re-written as
#123(?:(?:(?P<test>[\s\S]*)456(?P<test1>[\s\S]*+)789))#
                                              ^^
and since [\s\S]*+ has already consumed 789, the engine cannot backtrack to match the 789 part of the pattern.
See PCRE docs:
In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.
No idea why they mention Python here since re does not support recursion (unless they meant the PyPi regex module).
If you are looking for a solution, you might use a (?:(?!789)[\s\S])* tempered greedy token instead of [\s\S]*, it will only match any char if it does not start a 789 char sequence (so, no need to backtrack to accommodate for 789):
123(?:(?:(?P<test>(?:(?!789)[\s\S])*)456(?P<test1>(?P>test))789))
                  ^^^^^^^^^^^^^^^^^^
See this regex demo.
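A small PHP sketch of the tempered greedy token version, run against the subject from the question. As far as I know, PCRE2 10.30 and later no longer treat subroutine calls as atomic, so the original pattern may match on recent PHP versions; the tempered pattern should behave the same either way.

```php
<?php
// Pattern from the answer: the tempered greedy token stops before "789",
// so the subroutine call never needs to give characters back.
$pattern = '#123(?:(?:(?P<test>(?:(?!789)[\s\S])*)456(?P<test1>(?P>test))789))#';

$matched = preg_match($pattern, '123aaaa456bbbb789', $m);

var_dump($matched);     // 1 when the pattern matches
var_dump($m['test']);   // text captured before 456
var_dump($m['test1']);  // text captured around the subroutine call
```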

Optional Group Expression

Today I was working with regular expressions at work and during some experimentation I noticed that a regex such as (\w|) compiled. This seems to be an optional group but looking online didn't yield any results.
Is there any practical use of having a group that matches something, but otherwise can match anything? What's the difference between that and (\w|.*)? Thanks.
(\w|) is a verbose way of writing \w?, which checks for \w first, then empty string.
I removed the capturing group, since it seems that () is used for grouping only. If you actually need the capturing group, use (\w?).
In the same vein, (|\w) is a verbose way of writing \w??, which tries for empty string first, before trying for \w.
(\w|.*) is a different regex altogether. It tries to match (in that order) one word character \w, or 0 or more of any character (except line terminators) .*.
I can't imagine how this regex fragment would be useful, though.
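To see the ordering of the alternatives in practice, here is a short PHP sketch that runs all four variants against the same made-up one-character subject.

```php
<?php
// Each pattern is tried against the same one-character subject.
$subject = 'a';

preg_match('/(\w|)/', $subject, $m1);  // \w is tried first
preg_match('/(|\w)/', $subject, $m2);  // the empty alternative is tried first
preg_match('/\w?/', $subject, $m3);    // greedy optional
preg_match('/\w??/', $subject, $m4);   // lazy optional

var_dump($m1[0], $m2[0], $m3[0], $m4[0]);
```

The first and third patterns match the letter, while the second and fourth are satisfied with the empty string at the start of the subject.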

Understanding Regular Expressions

I am tired of being frightened of regular expressions. The topic of this post is limited to PHP implementation of regular expressions, however, any generic regular expression advice would obviously be appreciated (i.e. don't confuse me with scope that is not applicable to PHP).
The following (I believe) will remove any whitespace between numbers. Maybe there is a better way to do so, but I still want to understand what is going on.
$pat="/\b(\d+)\s+(?=\d+\b)/";
$sub="123 345";
$string=preg_replace($pat, "$1", $sub);
Going through the pattern, my interpretation is:
\b A word boundary
\d+ A subpattern of 1 or more digits
\s+ One or more whitespaces
(?=\d+\b) Lookahead assertion of one or more digits followed by a word boundary?
Putting it all together, search for any word boundary followed by one or more digits and then some whitespace, and then do some sort of lookahead assertion on it, and save the results in $1 so it can replace the pattern?
Questions:
Is my above interpretation correct?
What is that lookahead assertion all about?
What is the purpose of the leading / and trailing /?
Is my above interpretation correct?
Yes, your interpretation is correct.
What is that lookahead assertion all about?
That lookahead assertion is a way to match characters only when they are followed by a certain pattern, without including that pattern in the match.
So basically, using the regex abcd(?=e) to match the string abcde will give you the match: abcd.
The reason that this matches is that the string abcde does in fact contain:
An a
Followed by a b
Followed by a c
Followed by a d that has an e after it (this is a single character!)
It is important to note that after the 4th item it also contains an actual "e" character, which we didn't match.
On the other hand, trying to match the string against the regex abcd(?=f) will fail, since the sequence:
"a", followed by "b", followed by "c", followed by "d that has an f after it"
is not found.
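The two lookahead examples, plus the pattern from the question, can be checked with a few lines of PHP. This is only a sketch using the strings already mentioned above.

```php
<?php
// The lookahead checks for "e" but leaves it out of the match.
preg_match('/abcd(?=e)/', 'abcde', $m);
var_dump($m[0]);

// No "f" follows "abcd", so there is no match at all.
var_dump(preg_match('/abcd(?=f)/', 'abcde'));

// The pattern from the question: only the digits and the whitespace are
// consumed, and the replacement keeps just the captured digits.
var_dump(preg_replace('/\b(\d+)\s+(?=\d+\b)/', '$1', '123 345'));
```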
What is the purpose of the leading / and trailing /
Those are delimiters, and are used in PHP to distinguish the pattern part of your string from the modifier part of your string. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character, although I prefer # signs myself. Remember that the character you are using as a delimiter needs to be escaped if it is used in your pattern.
It would be a good idea to watch this video, and the 4 that follow this:
http://blog.themeforest.net/screencasts/regular-expressions-for-dummies/
The rest of the series is found here:
http://blog.themeforest.net/?s=regex+for+dummies
A colleague sent me the series and after watching them all I was much more comfortable using Regular Expressions.
Another good idea would be installing RegexBuddy or Regexr. RegexBuddy in particular is very useful for understanding the workings of a regular expression.

What are those characters in a regular expression?

I found this regex that works correctly but I didn't understand what is # (at the start) and at the end of the expression. Are not ^ and $ the start/end characters?
preg_match_all('#^/([^/]+)/([^/]+)/$#', $s, $matches);
Thanks
The matched pattern contains many /, thus the # is used as the regex delimiter. These are identical:
/^something$/
and
#^something$#
If you have multiple / in your pattern, the second example is better suited, since it avoids ugly escaping with \/. This is how the regex would look using the standard // syntax:
/^\/([^\/]+)\/([^\/]+)\/$/
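A short PHP sketch comparing the two spellings; the input string is a made-up example.

```php
<?php
$s = '/foo/bar/'; // made-up sample input

// Same regex, two delimiters: "#" needs no escaping of "/", while "/" does.
preg_match_all('#^/([^/]+)/([^/]+)/$#', $s, $withHash);
preg_match_all('/^\/([^\/]+)\/([^\/]+)\/$/', $s, $withSlash);

var_dump($withHash === $withSlash);          // both calls fill the same array
var_dump($withHash[1][0], $withHash[2][0]);  // the two captured path segments
```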
About #:
That's a delimiter of the regular expression itself. Its only meaning is to tell which delimiter is used for the expression. Commonly / is used, but others are possible. PCRE expressions need a delimiter with preg_match or preg_match_all.
About ^:
Inside character classes ([...]), the ^ has the meaning of not if it's the first character.
[abc] : matching a, b or c
[^abc] : NOT matching a, b or c, match every other character instead
Also # at the start and the end here are custom regex delimiters. Instead of the usual /.../ you have #...#. Just like Perl.
These are delimiters. You can use any delimiter you want, but they must appear at the start and end of the regular expression.
Please see this documentation for detailed insight into regular expressions:
http://www.php.net/manual/en/pcre.pattern.php
You can use pretty much anything as delimiters. The most common one is /.../, but if the pattern itself contains / and you don't want to escape any and all occurrences, you can use a different delimiter. My personal preference is (...) because it reminds me that $0 of the result is the entire match. But you can do anything, <...>, #...#, %...%, {...}... well, almost anything. As far as I know, the requirement is "any non-alphanumeric, non-backslash, non-whitespace character".
Let me break it down:
# is the first character, so this is the character used as the delimiter of the regular expression - we know we've got to the end when we reach the next (unescaped) one of these
^ outside of a character class, this means the beginning of the string
/ is just a normal 'slash' character
([^/]+) This is a bracketed expression containing at least one (+) instance of any character that isn't a / (^ at the beginning of a character class inverts the character class - meaning it will only match characters that are not in this list)
/ again
([^/]+) again
/ again
$ this matches the end of the string
# this is the final delimiter, so we know that the regex is now finished.

What is the difference between [0-9]+ and [0-9]++?

Can someone explain to me what the difference between [0-9]+ and [0-9]++ is?
The PCRE engine, which PHP uses for regular expressions, supports "possessive quantifiers":
Quantifiers followed by + are "possessive". They eat as many characters as possible and don't return to match the rest of the pattern. Thus .*abc matches "aabc" but .*+abc doesn't because .*+ eats the whole string. Possessive quantifiers can be used to speed up processing.
And:
If the PCRE_UNGREEDY option is set (an option which is not available in Perl) then the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour.
The difference is thus:
/[0-9]+/ - one or more digits; greediness defined by the PCRE_UNGREEDY option
/[0-9]+?/ - one or more digits, but as few as possible (non-greedy)
/[0-9]++/ - one or more digits, as many as possible, with no backtracking into the match afterwards (possessive)
In greedy-by-default mode, the first and the last pattern match the same text when used on their own, because nothing follows the quantifier that could force it to give characters back.
With PCRE_UNGREEDY applied (ungreedy-by-default mode), the default is reversed: [0-9]+ becomes lazy and [0-9]+? becomes greedy.
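A minimal PHP sketch of the three quantifier forms on their own, first in the default mode and then with the U modifier, which is how PHP exposes PCRE_UNGREEDY. The subject string is made up for illustration.

```php
<?php
$subject = '12345'; // made-up sample input

// Default mode: + is greedy, +? is lazy, ++ is possessive.
preg_match('/[0-9]+/', $subject, $greedy);
preg_match('/[0-9]+?/', $subject, $lazy);
preg_match('/[0-9]++/', $subject, $possessive);
var_dump($greedy[0], $lazy[0], $possessive[0]);

// With the U modifier (PCRE_UNGREEDY) the meaning of + and +? is swapped.
preg_match('/[0-9]+/U', $subject, $lazyByDefault);
preg_match('/[0-9]+?/U', $subject, $greedyByRequest);
var_dump($lazyByDefault[0], $greedyByRequest[0]);
```

With nothing after the quantifier, the greedy and possessive forms return the same text; they only differ once the rest of the pattern needs characters back.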
++ (and ?+, *+ and {n,m}+) are called possessive quantifiers.
Both [0-9]+ and [0-9]++ match one or more ASCII digits, but the second one will not allow the regex engine to backtrack into the match if that should become necessary for the overall regex to succeed.
Example:
[0-9]+0
matches the string 00, whereas [0-9]++0 doesn't.
In the first case, [0-9]+ first matches 00, but then backtracks one character to allow the following 0 to match. In the second case, the ++ prevents this, therefore the entire match fails.
Not all regex flavors support this syntax; some others implement atomic groups instead (or even both).
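The backtracking difference can be reproduced in PHP with the examples from this thread. This is a sketch; preg_match returns 1 for a match and 0 for no match.

```php
<?php
// Greedy + gives back one digit so that the trailing 0 can match.
var_dump(preg_match('/[0-9]+0/', '00'));

// Possessive ++ keeps both digits, so the trailing 0 has nothing left to match.
var_dump(preg_match('/[0-9]++0/', '00'));

// The same effect with the example quoted from the PCRE documentation.
var_dump(preg_match('/.*abc/', 'aabc'));
var_dump(preg_match('/.*+abc/', 'aabc'));
```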
