What is the difference between [0-9]+ and [0-9]++? - php

Can someone explain the difference between [0-9]+ and [0-9]++?

The PCRE engine, which PHP uses for regular expressions, supports "possessive quantifiers":
Quantifiers followed by + are "possessive". They eat as many characters as possible and don't return to match the rest of the pattern. Thus .*abc matches "aabc" but .*+abc doesn't because .*+ eats the whole string. Possessive quantifiers can be used to speed up processing.
And:
If the PCRE_UNGREEDY option is set (an option which is not available in Perl) then the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour.
The difference is thus:
/[0-9]+/ - one or more digits; greediness defined by the PCRE_UNGREEDY option
/[0-9]+?/ - one or more digits, but as few as possible (non-greedy)
/[0-9]++/ - one or more digits, as many as possible, without giving any back to the rest of the pattern (possessive)
In greedy-by-default mode, the first and last patterns match the same text when they stand alone. They differ only when something follows the quantifier: plain + can give digits back so that the rest of the pattern can match, while ++ cannot.
With PCRE_UNGREEDY (ungreedy-by-default mode) the default is reversed: + becomes lazy and +? becomes greedy. Possessive quantifiers are always greedy, so ++ is not affected by the option.
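A minimal sketch of the three quantifier forms, with and without the U modifier (PHP's spelling of PCRE_UNGREEDY). The subject string and variable names are made up for illustration.

```php
<?php
// Illustrative subject; any run of digits would do.
$subject = '12345';

preg_match('/[0-9]+/', $subject, $greedy);       // greedy by default
preg_match('/[0-9]+?/', $subject, $lazy);        // lazy: as few as possible
preg_match('/[0-9]++/', $subject, $possessive);  // possessive: greedy, no giving back

// With the U modifier the meanings of + and +? are swapped.
preg_match('/[0-9]+/U', $subject, $ungreedyPlain);
preg_match('/[0-9]+?/U', $subject, $ungreedyQuestion);

echo $greedy[0], "\n";            // 12345
echo $lazy[0], "\n";              // 1
echo $possessive[0], "\n";        // 12345
echo $ungreedyPlain[0], "\n";     // 1
echo $ungreedyQuestion[0], "\n";  // 12345
```

When they stand alone like this, + and ++ return the same text; the difference only shows once more pattern follows the quantifier.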

++ (and ?+, *+ and {n,m}+) are called possessive quantifiers.
Both [0-9]+ and [0-9]++ match one or more ASCII digits, but the second one will not allow the regex engine to backtrack into the match if that should become necessary for the overall regex to succeed.
Example:
[0-9]+0
matches the string 00, whereas [0-9]++0 doesn't.
In the first case, [0-9]+ first matches 00, but then backtracks one character to allow the following 0 to match. In the second case, the ++ prevents this, therefore the entire match fails.
Not all regex flavors support this syntax; some others implement atomic groups instead (or even both).
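The 00 example can be checked directly in PHP. This is only a sketch; the atomic-group pattern (?>...) is added here for comparison and the variable names are illustrative.

```php
<?php
$subject = '00';

// Plain +: [0-9]+ first takes both zeros, then gives one back for the final 0.
$backtracking = preg_match('/[0-9]+0/', $subject);

// Possessive ++: nothing is given back, so the final 0 can never match.
$possessive = preg_match('/[0-9]++0/', $subject);

// An atomic group behaves the same way as the possessive quantifier here.
$atomic = preg_match('/(?>[0-9]+)0/', $subject);

var_dump($backtracking); // int(1)
var_dump($possessive);   // int(0)
var_dump($atomic);       // int(0)
```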

Related

What does the plus symbol in regex mean?
+ can actually have two meanings, depending on context.
Like the other answers mentioned, + usually is a repetition operator, and causes the preceding token to repeat one or more times. a+ would be expressed as aa* in formal language theory, and could also be expressed as a{1,} (match a minimum of 1 times and a maximum of infinite times).
However, + can also make other quantifiers possessive if it follows a repetition operator (i.e. ?+, *+, ++ or {m,n}+). A possessive quantifier is an advanced feature of some regex flavours (PCRE, Java and the JGsoft engine) which tells the engine not to backtrack once a match has been made.
To understand how this works, we need to understand two concepts of regex engines: greediness and backtracking. Greediness means that in general regexes will try to consume as many characters as they can. Let's say our pattern is .* (the dot is a special construct in regexes which means any character [1]; the star means match zero or more times), and the target is aaaaaaaab. The entire string will be consumed, because the entire string is the longest match that satisfies the pattern.
However, let's say we change the pattern to .*b. Now, when the regex engine tries to match against aaaaaaaab, the .* will again consume the entire string. However, since the engine will have reached the end of the string and the pattern is not yet satisfied (the .* consumed everything but the pattern still has to match b afterwards), it will backtrack, one character at a time, and try to match b. The first backtrack will make the .* consume aaaaaaaa, and then b can consume b, and the pattern succeeds.
Possessive quantifiers are also greedy, but as mentioned, once they return a match, the engine can no longer backtrack past that point. So if we change our pattern to .*+b (match any character zero or more times, possessively, followed by a b), and try to match aaaaaaaab, again the .*+ will consume the whole string, but since it is possessive, backtracking information is discarded, the b cannot be matched, and the pattern fails.
[1] In most engines, the dot will not match a newline character, unless the /s ("singleline" or "dotall") modifier is specified.
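The walkthrough above can be reproduced with preg_match. This is a small sketch; the third pattern, [^b]*+b, is an extra illustration (not from the answer) of a possessive quantifier that still succeeds because it never needs to give anything back.

```php
<?php
$target = 'aaaaaaaab';

// Greedy: .* takes everything, then backtracks one character so b can match.
$greedyResult = preg_match('/.*b/', $target, $greedyMatch);

// Possessive: .*+ takes everything and refuses to backtrack, so b never matches.
$possessiveResult = preg_match('/.*+b/', $target);

// Possessive but safe: [^b]*+ stops before the b on its own.
$safeResult = preg_match('/[^b]*+b/', $target, $safeMatch);

var_dump($greedyResult, $greedyMatch[0]); // int(1), string "aaaaaaaab"
var_dump($possessiveResult);              // int(0)
var_dump($safeResult, $safeMatch[0]);     // int(1), string "aaaaaaaab"
```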
In most implementations + means "one or more".
In some theoretical writings + is used to mean "or" (most implementations use the | symbol for that).
1 or more of previous expression.
[0-9]+
Would match:
1234567890
In:
I have 1234567890 dollars
One or more occurrences of the preceding symbol.
E.g. a+ means the letter a one or more times. Thus, a+ matches a, aa, aaaaaa but not an empty string.
If you know what the asterisk (*) means, then you can express (exp)+ as (exp)(exp)*, where (exp) is any regular expression.
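A short sketch of the points made in these answers, assuming PHP's preg_match; the sample strings other than "I have 1234567890 dollars" are made up. It shows [0-9]+ picking out the digits and a+ agreeing with aa* on a few inputs.

```php
<?php
// [0-9]+ finds the run of digits inside the sentence.
preg_match('/[0-9]+/', 'I have 1234567890 dollars', $digits);
echo $digits[0], "\n"; // 1234567890

// a+ and aa* accept exactly the same strings.
$plusResults = [];
$starResults = [];
foreach (['', 'a', 'aaaaaa'] as $sample) {
    $plusResults[] = preg_match('/^a+$/', $sample);
    $starResults[] = preg_match('/^aa*$/', $sample);
}

print_r($plusResults); // 0, 1, 1: the empty string is rejected
print_r($starResults); // 0, 1, 1
```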
A lot depends on where + symbol appears and what the regex flavor is.
In posix-bre and vim (in a non-very magic mode) flavor, + matches a literal + char. E.g. sed 's/+//g' file > newfile removes all + chars in file. If you want to use + as a quantifier here, use \+ (supported in GNU tools), or replace with \{1,\} or double the quantified pattern and remove the quantifier from the first part and add * (zero or more occurrences quantifier) after the other (e.g. sed 's/c++*//' removes c followed with one or more + chars).
In posix-ere and other regex flavors, outside a character class ([...]), + acts as a quantifier meaning "one or more, but as many as possible, occurrences of the quantified pattern". E.g. in javascript, s.replace(/\++/g, '-') will replace a string like ++++ with a single -. Note that in NFA regex flavors + has a lazy counterpart, +?, that matches "one or more, but as few as possible, occurrences of the quantified pattern".
Inside a character class, the + char is treated as a literal char, in every regex flavor. [+] always matches a single + literal char. E.g. in c#, Regex.Replace("1+2=3", @"[+]", "-") will result in 1-2=3. Note it is not a good idea to use a single char inside a character class, only use a character class for two or more chars, or for charsets. E.g. [+0-9] matches a + or any ASCII digit char. In php, preg_replace('~[\s+]+~', '-', '1 2+++3') will result in 1-2-3 since the regex matches one or more (due to the last +, which is a quantifier) whitespaces (\s) or plus chars (the + inside the character class).
The + symbol can also be part of a possessive quantifier in some PCRE-like regex flavors (php, ruby, java, boost, icu, etc., but not in .net, javascript, or python re before version 3.11). E.g. C\+++(?!\d) in php PCRE would match C and then one or more + symbols (\+ is a literal + and ++ means one or more occurrences without allowing backtracking into the quantified pattern) not followed by a digit. If there is a digit after the plus chars, the whole match fails. Other examples: a?+ (one or zero a chars), a{1,3}+ (one to three a chars, as many as possible), a{3}+ (=a{3}, three a chars), a*+ (zero or more a chars).
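To see what the possessive quantifier changes in the C\+++(?!\d) example, here is a sketch comparing it with the plain greedy C\++(?!\d). The subject "C++1" is a made-up input chosen so that a digit follows the plus signs.

```php
<?php
$subject = 'C++1';

// Greedy \++ first takes "++", sees the digit, then gives one + back:
// after "C+" the next char is "+", not a digit, so the lookahead passes.
$greedyResult = preg_match('/C\++(?!\d)/', $subject, $greedyMatch);

// Possessive \+++ keeps both plus signs, the lookahead fails, no match.
$possessiveResult = preg_match('/C\+++(?!\d)/', $subject);

// The character-class example from the text: + is literal inside [...].
$replaced = preg_replace('~[\s+]+~', '-', '1 2+++3');

var_dump($greedyResult, $greedyMatch[0]); // int(1), string "C+"
var_dump($possessiveResult);              // int(0)
var_dump($replaced);                      // string "1-2-3"
```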

regular expression match ending III or II or I (php) [duplicate]

This question already has answers here:
Greedy vs. Reluctant vs. Possessive Qualifiers
I tried to use a regular expression (php) to match trailing Roman numerals. For simplicity, consider the example below:
$str="Olympic III";
preg_match("#^(.*)(III|II|I)$#",$str,$rep);
print_r($rep);
That only matches a single "I". The correct answer is for me to use the ungreedy "U" modifier. But why? Doesn't the regular expression use the order I provided (try "III" first before trying "II" or "I")?
Let us first understand what the U modifier is doing. It makes the quantifiers (in your case, the * in the first capturing group) lazy by default.
With the U modifier, your regex is equivalent to (.*?)(III|II|I) without it, which matches as you would expect it to.
With (.*)(III|II|I) you are asking the regex engine to use quantifiers greedily, i.e. to match as much as they can for as long as they can. Since your alternation accepts III, II, or I, the first capturing group, acting greedily, consumes as much as possible and leaves the smallest acceptable part, a single I, for the second group, which contains the alternation.
.* matches as many characters as possible before (III|II|I), so (III|II|I) can only match one character. You can use this regex instead: ^(.*)\s(I+)$
Try this:
$str="Olympic III";
preg_match("#^(.*)\s(I+)$#",$str,$rep);
print_r($rep);
The \s before (I+) or (III|II|I) matches a single whitespace character. It solves your problem because it forces (.*) to stop right before the part you are interested in.
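The three variants discussed in this thread can be compared side by side. This is a sketch using the "Olympic III" string from the question; the variable names are illustrative.

```php
<?php
$str = 'Olympic III';

// Greedy: (.*) takes as much as it can and leaves a single I.
preg_match('#^(.*)(III|II|I)$#', $str, $greedy);

// U modifier: (.*) becomes lazy, so the alternation gets to try III first.
preg_match('#^(.*)(III|II|I)$#U', $str, $ungreedy);

// Explicit separator: \s pins down where the numeral starts.
preg_match('#^(.*)\s(I+)$#', $str, $separated);

var_dump($greedy[1], $greedy[2]);       // "Olympic II", "I"
var_dump($ungreedy[1], $ungreedy[2]);   // "Olympic ", "III"
var_dump($separated[1], $separated[2]); // "Olympic", "III"
```

The alternation order only decides which branch is tried first at a given position; it is the greedy (.*) that decides which position the alternation gets.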

PCRE: DEFINE statement for lookarounds

Stepping deeper into the world of regular expressions, I came across the DEFINE Statement in PCRE.
I have the following code, which defines a lowercase, an uppercase and an anA group (I know it's rather useless at this point, thanks :)):
(?(DEFINE)
(?<lowercase>(?=[^a-z]*[a-z])) # lowercase
(?<uppercase>(?=[^A-Z]*[A-Z])) # uppercase
(?<anA>A(?=B))
)
^(?&anA)
Now, I wonder how I can combine the lookahead (lowercase in this example) with the anA part. Admittedly, I struggled to find appropriate documentation on the DEFINE syntax. Here's a regex101.com fiddle.
To make it somewhat clearer, I'd like to be able to combine subroutines. For instance, with the above example (e.g. to validate a password which needs to have an A followed by B and some lowercase letters), I could do the following:
^(?=[^a-z]*[a-z]).*?A(?=B).*
How can this be done with the above subroutines?
EDIT: For reference, I ended up using the following construct:
(?(DEFINE)
(?<lc>(?=[^a-z\n]*[a-z])) # lowercase
(?<uc>(?=[^A-Z\n]*[A-Z])) # uppercase
(?<digit>(?=[^\d\n]*\d)) # digit
(?<special>(?=.*[!#]+)) # special character
)
^(?&lc)(?&uc)(?&digit)(?&special).{6,}$
How I can combine the lookahead (lowercase in this example) with the anA part
You can call the subpattern the same way as you have done with anA, by using the (?&lowercase) named subroutine call:
/(?(DEFINE)
(?<lowercase>(?=[^a-z]*[a-z])) # lowercase
(?<uppercase>(?=[^A-Z]*[A-Z])) # uppercase
(?<anA>A(?=B))
)
^(?&lowercase)(.*?)((?&anA)).*
/mgx
See the regex demo. Note that you need to specify the VERBOSE/IgnorePatternWhitespace/Freespace mode with /x modifier at regex101.com for this pattern to work.
Beware of a caveat though in case you want to also DEFINE the .* and .*? subpatterns (see PCRE Man Pages):
All subroutine calls, whether recursive or not, are always treated as atomic groups. That is, once a subroutine has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure. Any capturing parentheses that are set during the subroutine call revert to their previous values afterwards.
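For use from PHP, the construct in the question's EDIT can be passed to preg_match with the x modifier. This is a sketch only: the sample passwords and the $results variable are invented for illustration, and the pattern is copied from the question rather than vetted as a password policy.

```php
<?php
// The x modifier allows the whitespace and # comments inside the pattern.
$pattern = '/
    (?(DEFINE)
        (?<lc>(?=[^a-z\n]*[a-z]))       # lowercase
        (?<uc>(?=[^A-Z\n]*[A-Z]))       # uppercase
        (?<digit>(?=[^\d\n]*\d))        # digit
        (?<special>(?=.*[!#]+))         # special character
    )
    ^(?&lc)(?&uc)(?&digit)(?&special).{6,}$
/x';

// Hypothetical inputs: one valid, one without uppercase, one too short.
$results = [];
foreach (['abcDEF1!', 'abcdef1!', 'aB1!'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
}

print_r($results); // abcDEF1! => 1, abcdef1! => 0, aB1! => 0
```

Each (?&name) call runs the lookahead defined under that name at the current position, so the calls can be chained in any order before the consuming part of the pattern.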

Optional Group Expression

Today I was working with regular expressions at work and during some experimentation I noticed that a regex such as (\w|) compiled. This seems to be an optional group but looking online didn't yield any results.
Is there any practical use of having a group that matches something, but otherwise can match anything? What's the difference between that and (\w|.*)? Thanks.
(\w|) is a verbose way of writing \w?, which checks for \w first, then empty string.
I removed the capturing group, since it seems that () is used for grouping only. If you actually need the capturing group, then use (\w?).
In the same vein, (|\w) is a verbose way of writing \w??, which tries for the empty string first, before trying for \w.
(\w|.*) is a different regex altogether. It tries to match (in that order) one word character \w, or 0 or more of any character (except line terminators) .*.
I can't imagine how this regex fragment would be useful, though.
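The equivalences above can be observed by matching against a single word character. This is a sketch; the subject "a" and the variable names are illustrative.

```php
<?php
$subject = 'a';

preg_match('/(\w|)/', $subject, $wordFirst);   // tries \w first
preg_match('/\w?/', $subject, $greedyOpt);     // same preference, no group
preg_match('/(|\w)/', $subject, $emptyFirst);  // tries the empty string first
preg_match('/\w??/', $subject, $lazyOpt);      // same preference, no group

var_dump($wordFirst[0]);  // string "a"
var_dump($greedyOpt[0]);  // string "a"
var_dump($emptyFirst[0]); // string ""
var_dump($lazyOpt[0]);    // string ""
```

All four calls succeed; they differ only in whether the empty alternative or the \w alternative is preferred.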

Regex question mark

To match a string with pattern like:
-TEXT-someMore-String
To get -TEXT-, I came to know that this works:
/-(.+?)-/ // -TEXT-
As far as I know, ? makes the preceding token optional, as in:
colou?r matches both colour and color
I initially put in regex to get -TEXT- part like this:
/-(.+)-/
But it gave -TEXT-someMore-.
How does adding ? make the regex match only the -TEXT- part? I thought it made the preceding token optional, not that it made the match stop at a certain point as in the example above.
As you say, ? sometimes means "zero or one", but in your regex +? is a single unit meaning "one or more — and preferably as few as possible". (This is in contrast to bare +, which means "one or more — and preferably as many as possible".)
As the documentation puts it:
However, if a quantifier is followed by a question mark,
then it becomes lazy, and instead matches the minimum
number of times possible, so the pattern /\*.*?\*/
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of
question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as
in \d??\d which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.
Alternatively, you can use the Ungreedy modifier to make every quantifier in the regular expression prefer the shortest possible match:
/-(.+)-/U
? after a token is shorthand for {0,1}, which means zero or one occurrence of that token.
But + is not a token; it is a quantifier, shorthand for {1,}: one up to unlimited occurrences.
A ? after a quantifier sets it to non-greedy mode. In greedy mode, the quantifier matches as much of the string as possible. In non-greedy mode, it matches as little as possible.
Another, perhaps the underlying, problem in your regex is that you try to match a number of arbitrary characters via .+?. However, what you really want is probably "any character except -". You can get that via [^-]+. In this case, it doesn't matter whether you do a greedy match or not: the repeated match will terminate as soon as it encounters the second "-" in your string.
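The four approaches mentioned in this thread can be checked against the sample string. This is a sketch with illustrative variable names.

```php
<?php
$subject = '-TEXT-someMore-String';

preg_match('/-(.+?)-/', $subject, $lazy);       // lazy quantifier
preg_match('/-(.+)-/', $subject, $greedy);      // greedy quantifier
preg_match('/-(.+)-/U', $subject, $ungreedy);   // greedy syntax, U modifier
preg_match('/-([^-]+)-/', $subject, $negated);  // negated character class

var_dump($lazy[0]);     // string "-TEXT-"
var_dump($greedy[0]);   // string "-TEXT-someMore-"
var_dump($ungreedy[0]); // string "-TEXT-"
var_dump($negated[0]);  // string "-TEXT-"
```

The negated class gives the same result whether it is greedy or lazy, because [^-] cannot run past the next hyphen.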
