I have a simple regex like this:
#123(?:(?:(?P<test>[\s\S]*)456(?P<test1>(?P>test))789))#
It should match the following string fine:
123aaaa456bbbb789
But it doesn't.
But if I replace the subroutine reference with a direct copy of the regex:
#123(?:(?:(?P<test>[\s\S]*)456(?P<test1>[\s\S]*)789))#
Then it works perfectly fine.
I can't figure out why referencing the pattern by the group name isn't working.
The key point is that [\s\S]* is a *-quantified subpattern, so the regex engine can normally backtrack into it once the subsequent subpatterns fail to match. Subroutine (recursion) calls in PCRE are atomic, however: once (?P>test) has grabbed its 0+ chars, the engine has no way to backtrack into it, and that is why the pattern fails to match. (This holds for PCRE1 and for PCRE2 before release 10.30, which lifted the restriction.)
In short, the #123(?:(?:(?P<test>[\s\S]*)456(?P<test1>(?P>test))789))# pattern behaves as if it were written as
#123(?:(?:(?P<test>[\s\S]*)456(?P<test1>[\s\S]*+)789))#
                                              ^^
Since the possessive [\s\S]*+ consumes the rest of the string, including the trailing 789, and never gives anything back, nothing is left for the 789 part of the pattern to match.
See PCRE docs:
In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.
No idea why they mention Python here, since re does not support recursion (unless they meant the PyPI regex module).
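The atomic behaviour described above can be reproduced outside PCRE. This is a minimal sketch in Python, assuming only the standard re module. Since re has no subroutine calls, the atomic call is emulated with the lookahead-plus-backreference idiom (?=(?P<tail>...))(?P=tail). That idiom is a stand-in for (?P>test), not the original pattern, and the group name tail is introduced here purely for illustration.

```python
import re

subject = "123aaaa456bbbb789"

# Ordinary greedy group: the engine may backtrack into it to free up "789".
backtracking = re.compile(r"123(?P<test>[\s\S]*)456(?P<test1>[\s\S]*)789")

# Emulated atomic group (assumption: stand-in for the PCRE subroutine call).
# The lookahead captures greedily, the backreference consumes exactly that
# text, and the engine cannot backtrack into the lookahead afterwards.
atomic = re.compile(r"123(?P<test>[\s\S]*)456(?=(?P<tail>[\s\S]*))(?P=tail)789")

m = backtracking.search(subject)
print(m.group("test"), m.group("test1"))
print(atomic.search(subject))
```

The first pattern matches because test1 can give back the final three characters. The second cannot, which mirrors what the atomic subroutine call does.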
If you are looking for a solution, you can use a tempered greedy token, (?:(?!789)[\s\S])*, instead of [\s\S]*. It matches a char only if that char does not start a 789 sequence, so there is no need to backtrack to accommodate 789:
123(?:(?:(?P<test>(?:(?!789)[\s\S])*)456(?P<test1>(?P>test))789))
                  ^^^^^^^^^^^^^^^^^^
See this regex demo.
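As a rough check of the tempered greedy token, here is a short Python sketch. Python's re module has no (?P>test) call, so the token is repeated inline, which is an assumption of this example rather than the pattern above. The second pattern wraps the token in an emulated atomic group (lookahead plus backreference, with a hypothetical group name tail) to show that the token needs no backtracking.

```python
import re

# Tempered greedy token: consume a char only if "789" does not start there.
TOKEN = r"(?:(?!789)[\s\S])*"

# The token is repeated inline because re has no subroutine calls.
pattern = re.compile(rf"123(?P<test>{TOKEN})456(?P<test1>{TOKEN})789")

# Same idea with the second token made atomic via (?=(...)) + backreference.
atomic = re.compile(rf"123(?P<test>{TOKEN})456(?=(?P<tail>{TOKEN}))(?P=tail)789")

subject = "123aaaa456bbbb789"
m = pattern.search(subject)
a = atomic.search(subject)
print(m.groupdict())
print(a.group("tail"))
```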
I have a simple string:
$string = '--#--%--%2B--';
I want to percent-encode all characters (including the "lonely" %), except the - character and the triplets of the form %xy. So I wrote the following pattern alternatives:
$pattern1 = '/(?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';
$pattern2 = '/(?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';
Please notice the use of (multiple) (*SKIP)(*FAIL) and of (?:).
The result of matching and replacing is the same - and the correct one too:
--%23--%25--%2B--
I would like to ask:
Are the two patterns equivalent? If not, which one would be the proper one to use for url-encoding? Could you please explain in a few words why?
Would you suggest other alternatives (implying backtracking control verbs), or are my patterns a good choice?
Can I apply only one (?:) around the whole (chosen) pattern, even if the (multiple) (*SKIP)(*FAIL) will be inside it?
I know that I request a little too much from you by asking more questions at once. Please accept my apology! Thank you very much.
P.S: I've tested with the following PHP code:
$result = preg_replace_callback($patternX, function($matches) {
return rawurlencode($matches[0]);
}, $string);
echo $result;
First of all, both patterns use the (*SKIP)(*FAIL) PCRE verb sequence, a well-known "trick" to match some text and then skip it. See How do (*SKIP) or (*F) work on regex? for more details.
The two patterns yield the same results. (?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) matches either [\-]+ or %[A-Fa-f0-9]{2} and then skips the match, while (?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) first tries to match [\-]+ and skips it if found, and otherwise tries to match %[A-Fa-f0-9]{2} and skips that match if found. The (?:...) non-capturing groups in the second pattern are redundant, as there is no alternation inside them and the groups are not quantified. You may use any number of (*SKIP)(*FAIL) in your pattern; just make sure each one comes before the | so that it skips the relevant match.
The SKIP-FAIL technique is used when you want to match some text only in a specific context: when a char should be skipped ("avoided") if it is preceded and followed by certain chars, or when you need to avoid matching a whole sequence of chars, as in this scenario. So SKIP-FAIL is a good choice here.
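For comparison, the same result can be obtained without backtracking control verbs. The sketch below is in Python, whose re module does not support (*SKIP)(*FAIL), so it swaps in a different technique: the text to keep is matched by the first alternatives, and only what lands in capture group 1 is encoded. The function names are hypothetical, and urllib.parse.quote stands in for PHP's rawurlencode.

```python
import re
from urllib.parse import quote

# Alternatives 1 and 2 match the text to keep; group 1 captures any other char.
PATTERN = re.compile(r"-+|%[A-Fa-f0-9]{2}|(.)", re.DOTALL)

def encode_char(match):
    # Group 1 is set only when neither "keep" alternative matched.
    if match.group(1) is None:
        return match.group(0)
    return quote(match.group(1), safe="")

def percent_encode(text):
    return PATTERN.sub(encode_char, text)

print(percent_encode("--#--%--%2B--"))
```

The design trade-off is that the callback has to tell kept text apart from text to encode, whereas with SKIP-FAIL the kept text never reaches the callback at all.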
Stepping deeper into the world of regular expressions, I came across the DEFINE Statement in PCRE.
I have the following code, which defines a lowercase, an uppercase and an anA group (I know it's rather useless at this point, thanks :)):
(?(DEFINE)
(?<lowercase>(?=[^a-z]*[a-z])) # lowercase
(?<uppercase>(?=[^A-Z]*[A-Z])) # uppercase
(?<anA>A(?=B))
)
^(?&anA)
Now, I wonder how I can combine the lookahead (lowercase in this example) with the anA part? Admittedly, I struggled to find appropriate documentation on the DEFINE syntax. Here's a regex101.com fiddle.
To make it somewhat clearer, I'd like to be able to combine subroutines. For instance, with the above example (e.g. to validate a password which needs to have an A followed by B and some lowercase letters), I could do the following:
^(?=[^a-z]*[a-z]).*?A(?=B).*
How can this be done with the above subroutines?
EDIT: For reference, I ended up using the following construct:
(?(DEFINE)
(?<lc>(?=[^a-z\n]*[a-z])) # lowercase
(?<uc>(?=[^A-Z\n]*[A-Z])) # uppercase
(?<digit>(?=[^\d\n]*\d)) # digit
(?<special>(?=.*[!#]+)) # special character
)
^(?&lc)(?&uc)(?&digit)(?&special).{6,}$
How I can combine the lookahead (lowercase in this example) with the anA part
You can call the subpattern the same way as you did with anA, by using the (?&lowercase) named subroutine call:
/(?(DEFINE)
(?<lowercase>(?=[^a-z]*[a-z])) # lowercase
(?<uppercase>(?=[^A-Z]*[A-Z])) # uppercase
(?<anA>A(?=B))
)
^(?&lowercase)(.*?)((?&anA)).*
/mgx
See the regex demo. Note that you need to specify the VERBOSE/IgnorePatternWhitespace/Freespace mode with /x modifier at regex101.com for this pattern to work.
Beware of a caveat though in case you want to also DEFINE the .* and .*? subpatterns (see PCRE Man Pages):
All subroutine calls, whether recursive or not, are always treated as atomic groups. That is, once a subroutine has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure. Any capturing parentheses that are set during the subroutine call revert to their previous values afterwards.
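In engines without (?(DEFINE)...) the same "named building blocks" idea can be approximated by composing pattern strings. The sketch below does this in Python, whose re module has neither DEFINE nor (?&name) calls. The fragments are taken from the question's final construct, while the CHECKS dictionary and the is_valid helper are hypothetical names used only for illustration. Because the fragments are pasted in as text rather than called, the atomic-call caveat does not apply to them.

```python
import re

# Named lookahead fragments, mirroring the DEFINE block from the question.
CHECKS = {
    "lc": r"(?=[^a-z\n]*[a-z])",       # lowercase
    "uc": r"(?=[^A-Z\n]*[A-Z])",       # uppercase
    "digit": r"(?=[^\d\n]*\d)",        # digit
    "special": r"(?=.*[!#]+)",         # special character
}

# Equivalent in spirit to ^(?&lc)(?&uc)(?&digit)(?&special).{6,}$
PASSWORD = re.compile("^" + "".join(CHECKS.values()) + r".{6,}$", re.MULTILINE)

def is_valid(candidate):
    return PASSWORD.search(candidate) is not None

print(is_valid("Abcde1!"), is_valid("abcde1!"), is_valid("Ab1!"))
```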
Today I was working with regular expressions at work and during some experimentation I noticed that a regex such as (\w|) compiled. This seems to be an optional group but looking online didn't yield any results.
Is there any practical use of having a group that matches something, but otherwise can match anything? What's the difference between that and (\w|.*)? Thanks.
(\w|) is a verbose way of writing \w?, which checks for \w first, then empty string.
I removed the capturing group, since the () seems to be used only for grouping. If you actually need the capturing group, use (\w?).
In the same vein, (|\w) is a verbose way of writing \w??, which tries for empty string first, before trying for \w.
(\w|.*) is a different regex altogether. It tries to match (in that order) one word character \w, or 0 or more of any character (except line terminators) .*.
I can't imagine how this regex fragment would be useful, though.
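The order of preference is easy to observe with a quick experiment. The following Python sketch (the first_group helper is a hypothetical name) prints what the group captures at the start of a string for each variant discussed above.

```python
import re

def first_group(pattern, text):
    # Return what group 1 captured when matching at the start of the text.
    return re.match(pattern, text).group(1)

print(repr(first_group(r"(\w|)", "a")))     # word char is tried first
print(repr(first_group(r"(\w?)", "a")))     # same preference, greedy
print(repr(first_group(r"(|\w)", "a")))     # empty string is tried first
print(repr(first_group(r"(\w??)", "a")))    # same preference, lazy
print(repr(first_group(r"(\w|.*)", "ab cd")))    # \w wins when it can match
print(repr(first_group(r"(\w|.*)", " ab cd")))   # otherwise .* takes over
```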
I am tired of being frightened of regular expressions. The topic of this post is limited to PHP implementation of regular expressions, however, any generic regular expression advice would obviously be appreciated (i.e. don't confuse me with scope that is not applicable to PHP).
The following (I believe) will remove any whitespace between numbers. Maybe there is a better way to do so, but I still want to understand what is going on.
$pat="/\b(\d+)\s+(?=\d+\b)/";
$sub="123 345";
$string=preg_replace($pat, "$1", $sub);
Going through the pattern, my interpretation is:
\b A word boundary
\d+ A subpattern of 1 or more digits
\s+ One or more whitespaces
(?=\d+\b) Lookahead assertion of one or more digit followed by a word boundary?
Putting it all together, search for any word boundary followed by one or more digits and then some whitespace, and then do some sort of lookahead assertion on it, and save the results in $1 so it can replace the pattern?
Questions:
Is my above interpretation correct?
What is that lookahead assertion all about?
What is the purpose of the leading / and trailing /?
Is my above interpretation correct?
Yes, your interpretation is correct.
What is that lookahead assertion all about?
A lookahead assertion lets you match characters only when they are followed by a certain pattern, without making that pattern part of the match.
So basically, using the regex abcd(?=e) to match the string abcde will give you the match: abcd.
The reason that this matches is that the string abcde does in fact contain:
An a
Followed by a b
Followed by a c
Followed by a d that has an e after it (this is a single character!)
It is important to note that after the 4th item it also contains an actual "e" character, which we didn't match.
On the other hand, trying to match the string against the regex abcd(?=f) will fail, since the sequence:
"a", followed by "b", followed by "c", followed by "d that has an f after it"
is not found.
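The two cases above, plus the pattern from the question, can be tried out directly. This sketch uses Python's re module rather than PHP's preg_* functions. The lookahead syntax is the same in both, but the port itself is an assumption of this example.

```python
import re

# The "e" is checked by the lookahead but is not part of the match.
m = re.search(r"abcd(?=e)", "abcde")
print(m.group(), m.span())

# No "f" follows the "d", so there is no match at all.
no_match = re.search(r"abcd(?=f)", "abcde")
print(no_match)

# The pattern from the question: the digits after the whitespace are only
# looked at, so just the whitespace (re-emitting group 1) gets replaced.
collapsed = re.sub(r"\b(\d+)\s+(?=\d+\b)", r"\1", "123 345")
print(collapsed)
```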
What is the purpose of the leading / and trailing /
Those are delimiters; PHP uses them to separate the pattern from the modifiers that follow it. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character, although I prefer # signs myself. Remember that the delimiter character needs to be escaped if it is used inside your pattern.
It would be a good idea to watch this video, and the 4 that follow this:
http://blog.themeforest.net/screencasts/regular-expressions-for-dummies/
The rest of the series is found here:
http://blog.themeforest.net/?s=regex+for+dummies
A colleague sent me the series and after watching them all I was much more comfortable using Regular Expressions.
Another good idea would be installing RegexBuddy or Regexr. RegexBuddy in particular is very useful for understanding the workings of a regular expression.
I'm trying to understand a piece of code and came across this regular expression used in PHP's preg_replace function.
'/(?<!-)color[^{:]*:[^{#]*$/i'
This bit... (?<!-)
doesn't appear in any of my regexp manuals. Does anyone know what this means, please? (Google doesn't return anything; I don't think symbols work in Google.)
The ?<! at the start of a parenthetical group is a negative lookbehind. It asserts that the word color (strictly, the c in the engine) was not preceded by a - character.
So, for a more concrete example, it would match color in the strings:
color
+color
someTextColor
But it will fail on something like -color or background-color. Also note that the engine will not technically "match" whatever precedes the c, it simply asserts that it is not a hyphen. This can be an important distinction depending on the context (illustrated on Rubular with a trivial example; note that only the b in the last string is matched, not the preceding letter).
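The examples above can be checked with a few lines of Python. This is only a sketch: it uses the (?<!-)color prefix of the original pattern with the case-insensitive flag, and the sample list is taken from the strings mentioned in this answer.

```python
import re

# Only the lookbehind part of the original pattern, case-insensitive like /i.
PATTERN = re.compile(r"(?<!-)color", re.IGNORECASE)

samples = ["color", "+color", "someTextColor", "-color", "background-color"]
results = {s: PATTERN.search(s) is not None for s in samples}
for sample, matched in results.items():
    print(sample, matched)

# The lookbehind is zero-width: the "+" is inspected but not matched.
m = PATTERN.search("+color")
print(m.group(), m.start())
```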
PHP uses Perl-compatible regular expressions (PCRE) for the preg_* functions. From perldoc perlre:
"(?<!pattern)"
A zero-width negative look-behind assertion. For example "/(?<!bar)foo/" matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.
I'm learning regular expressions using Python's re module!
http://docs.python.org/library/re.html
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.