Regular expression .*? vs .* - php

I came across a PHP article about regular expressions which used (.*?) in its syntax. As far as I can see it behaves just like (.*).
Is there any advantage of using (.*?) ? I can't really see why someone would use that.

In most flavours of regex, the *? production is a non-greedy repeat. This means that the .*? production first matches the empty string, and then if that fails, one character, and so on until the match succeeds. In contrast, the greedy production .* first attempts to match as much of the remaining input as possible, and then if that fails, tries one character less.
This concept only applies to regular expression engines that use recursive backtracking to match ambiguous expressions. In theory, they match exactly the same sentences, but since they try different things first, it's likely that one will be much quicker than the other.
This can also be useful when capture groups (in recursive and NFA style engines equally) are used to extract information from the matching action. For instance, an expression like
"(.*?)"
can be used to capture a quoted string. Since the subgroup is non-greedy, you can be sure that no quotes will be captured, and the subgroup contains only the desired content.
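A minimal sketch of that difference in PHP, assuming a made-up input string with two quoted parts (the variable names are illustrative only):

```php
<?php
$input = 'say "hello" and "world"';

// Lazy: the group stops at the first closing quote.
preg_match('/"(.*?)"/', $input, $lazy);

// Greedy: the group runs on to the last closing quote in the input.
preg_match('/"(.*)"/', $input, $greedy);

echo $lazy[1], "\n";   // hello
echo $greedy[1], "\n"; // hello" and "world
```

With only one quoted string in the input, both patterns capture the same text, which is probably why the two looked identical at first.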

.* is greedy, .*? is not. It only makes sense in context though. Given the patterns
<br/>(.*?)<br/> and <br/>(.*)<br/>, and the input <br/>test<br/>test2<br/>,
.* will match <br/>test<br/>test2<br/>,
.*? will only match <br/>test<br/>.
Note: don't ever use regex to parse complex HTML.
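The same comparison can be run with preg_match. This is only a sketch; the # delimiter is my choice, so that the slash in <br/> needs no escaping:

```php
<?php
$input = '<br/>test<br/>test2<br/>';

// Index 0 of the result array holds the whole match.
preg_match('#<br/>(.*)<br/>#', $input, $greedy);
preg_match('#<br/>(.*?)<br/>#', $input, $lazy);

echo $greedy[0], "\n"; // <br/>test<br/>test2<br/>
echo $lazy[0], "\n";   // <br/>test<br/>
```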

Related

PHP Regex detect repeated character in a word

(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.
If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
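A quick check of both patterns, assuming / as the delimiter and a few made-up test words:

```php
<?php
$anywhere   = '/\b\w*?(\w)\w*?\1\w*?\1\w*/'; // same character three times, anywhere in the word
$contiguous = '/\b\w*?(\w)\1{2}\w*/';        // same character three times in a row

$r1 = preg_match($anywhere, 'banana');   // "a" occurs three times
$r2 = preg_match($contiguous, 'banana'); // but never three in a row
$r3 = preg_match($contiguous, 'baaad');  // "aaa" is contiguous

var_dump($r1, $r2, $r3); // int(1) int(0) int(1)
```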
Try this regex instead
(preg_match('/(.)\1{2,}/', $repeater))
This should match a character repeated 3 or more times in a row; see the example at http://regexr.com/3fk80
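The reason the original attempt misses is that (.)\1{3} needs the captured character plus three repeats, so four in a row. A small sketch with made-up inputs:

```php
<?php
// Captured character + 3 repeats = 4 identical characters required.
$original = preg_match('/(.)\1{3}/', 'aaa');

// Captured character + 2 or more repeats = 3 or more identical characters.
$fixed = preg_match('/(.)\1{2,}/', 'aaa');

preg_match('/(.)\1{2,}/', 'heeeello', $m);

var_dump($original, $fixed); // int(0) int(1)
echo $m[0], "\n";            // eeee
```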
Strictly speaking, regular expressions that include backreferences such as \1, \2, ... are not regular expressions in the mathematical sense. The matcher that runs them is also less efficient: it has to remember what the group captured in order to match that same text again, and on failure it has to backtrack over the length of the matched group.
The canonical way to express a true regular expression that accepts word characters repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
and the {3,} operator does not distribute over the alternation, so you cannot factor it out the way you showed in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
Again, you can use the fact that AAAA* matches as soon as three As are found, so this regex would also be valid:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version allows you to capture, as group 1, the whole run of repeated characters that actually matched.
This approach takes longer to write, but on an engine that compiles it to a DFA it is far more efficient when scanning the data string, as it never backtracks and visits each character only once.
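The long alternation does not have to be typed by hand. A sketch that builds it in PHP follows; note that preg_match still uses PCRE's backtracking matcher, so this shows the pattern only and is not evidence for the efficiency claim:

```php
<?php
// Build (A{3,}|B{3,}|...|Z{3,}|a{3,}|...|z{3,}) from the letter ranges.
$letters = array_merge(range('A', 'Z'), range('a', 'z'));
$alternatives = array_map(function ($c) {
    return $c . '{3,}';
}, $letters);
$pattern = '/(' . implode('|', $alternatives) . ')/';

preg_match($pattern, 'heeeello', $m);
echo $m[1], "\n"; // eeee
```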

What does "?>" mean in a PCRE regex?

I can't seem to figure out what ?> is used for in a regular expression. For example, the following:
(?>[^()]+)
I know that ?: means that it shouldn't store the match if you're not going to back reference the match. Is this somehow related?
Is this also related to (?P>name) or (?&name)?
Source: http://php.net/manual/en/regexp.reference.recursive.php
(?>pattern) prevents backtracking on pattern. It has at least 2 names: non-backtracking group, atomic group. I will refer to it as non-backtracking group, since it is the most descriptive name.
The expression (?>[^()]+) alone doesn't need to be made non-backtracking, though. There is nothing that can induce a backtrack to show non-backtracking behavior.
A more interesting example would be the regex ^\((?>[^()]+)\), matched against the string (a + b(, compared to the normal version without the non-backtracking group, ^\([^()]+\).
The normal version, after trying (a + b for ^\([^()]+ and failing to match the literal ), will backtrack by one character and retry with (a +, and so on down to (a, where it fails after exhausting all possibilities.
The non-backtracking version will fail the match right after its first try with (a + b.
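Both versions give the same result on that input, so the difference there is only in how much work is done. To see the non-backtracking behaviour change a result, here is a sketch with an extra pattern of my own (^a+ab, not from the example above):

```php
<?php
// Illustrative patterns, anchored with ^ so the engine cannot retry
// from a later starting position.

// Greedy a+ first takes "aaa", then gives one "a" back so "ab" can match.
$plain = preg_match('/^a+ab/', 'aaab');

// Inside (?>...), a+ keeps all three "a"s, so the literal "a" meets "b".
$atomic = preg_match('/^(?>a+)ab/', 'aaab');

// The patterns from the example: both fail on "(a + b(".
$textPlain  = preg_match('/^\([^()]+\)/', '(a + b(');
$textAtomic = preg_match('/^\((?>[^()]+)\)/', '(a + b(');

var_dump($plain, $atomic, $textPlain, $textAtomic); // int(1) int(0) int(0) int(0)
```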
Non-backtracking groups are mostly useful for reducing backtracking induced by quantifiers (?, *, +, {n,}, {n,m}). The trick to optimizing with a non-backtracking group is to know the first try made by the regex engine. You may need to shift the regex around to make sure the first try made by the engine is what you want to match; then it can be made non-backtracking.
As an example of optimization with non-backtracking group:
How can I improve the performance of a .NET regular expression?
The question I quote is from .NET, but it uses the same syntax for non-backtracking group.
In the question above, the original regex has many uses of the * and + quantifiers. They induce unnecessary backtracking when the match fails, which affects performance on large input.
StackOverflowError when matching large input using RegEx
Another example. Note that a possessive quantifier (adding + after a normal quantifier, e.g. ?+, ++, *+) and a non-backtracking group have the same non-backtracking behavior; the syntax of the non-backtracking group just allows it to be generalized.
You won't get stack overflow in PHP like in Java, but the performance should be better when you validate a long string.

Regex atomic grouping does not seem to work in preg_match_all()

I've recently been playing with regular expressions, and one thing doesn't work as expected for me when I use preg_match_all in php.
I'm using an online regex tool at http://www.solmetra.com/scripts/regex/index.php.
The regex I'm using is /(?>x|y|z)w/. I'm making it match abyxw. I am expecting it to fail, yet it succeeds, and matches xw.
I am expecting it to fail, due to the use of atomic grouping, which, from what I have read from multiple sources, prevents backtracking. What I am expecting precisely is that the engine attempts to match the y with alternation and succeeds. Later it attempts to match w with the regex literal w and fails, because it encounters x. Then it would normally backtrack, but it shouldn't in this case, due to the atomic grouping. So from what I know it should keep trying to match y with this atomic group. Yet it does not.
I would appreciate any light shed on this situation. :)
This is a little bit tricky, but there are two things that the regex can try to do when it cannot find a match:
Advance the starting position - If the match cannot succeed at an index i, it will be attempted again starting at index i+1, and this will continue until it reaches the end of the string.
Backtracking - If repetition or alternation is used in the regex, then the regex engine can discard part of an unsuccessful match and try again by using less or more of the repetition, or a different element in the alternation.
Atomic groups prevent backtracking but they do not affect advancing the starting position.
In this case, the match will fail when the engine is trying to match with y as the first character, but then it will move on and see xw as the remainder of the string, which will match.
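A sketch that shows both effects on the string from the question. The second call uses the A (anchored) modifier together with the offset argument, which is my own addition to stop the engine from advancing the starting position:

```php
<?php
$subject = 'abyxw';

// Unanchored: the attempt at "y" fails, then the engine moves on to "x".
preg_match('/(?>x|y|z)w/', $subject, $m, PREG_OFFSET_CAPTURE);
echo $m[0][0], ' at offset ', $m[0][1], "\n"; // xw at offset 3

// Anchored at offset 2 (the "y"): no other starting position is tried.
$anchored = preg_match('/(?>x|y|z)w/A', $subject, $m2, 0, 2);
var_dump($anchored); // int(0)
```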

Explain Regular Expression

1. (.*?)
2. (*)
3. #regex#
4. /regex/
A. What do the above symbols mean?
B. What is the difference between # and /?
I have the cheat-sheet, but didn't fully get it yet. What I know is that * gets all characters, so what is .*? for?
The above patterns are used in PHP preg_match and preg_replace.
. matches any character (roughly).
*? is a so-called quantifier, matching the previous token at least zero times (and only as often as needed to complete a match – it's lazy, hence the ?).
(...) create a capturing group you can refer to in either the regex or the match. They also are used for limiting the reach of the | alternation to only parts of the regex (just like parentheses in math make precedence clear).
/.../ and #...# are delimiters for the entire regex, in PHP at least. Technically they're not part of the regex syntax. What delimiter you use is up to you (but I think you can't use \), and mostly changes what characters you need to escape in the regex. So / is a bad choice when you're matching URIs that might contain a lot of slashes. Compare the following two variants for finding end-of-line comments in C++ style:
preg_match('/\/\/.*$/', $text);
preg_match('#//.*$#', $text);
The latter is easier to read as you don't have to escape slashes within the regex itself. # is commonly used as a delimiter because it stands out and isn't that frequent in text, but you can use whatever you like.
Technically you don't need this delimiter at all. This is probably mostly a remnant of PHP's Perl heritage (in Perl regexes are delimited, but are not contained in a string). Other languages that use strings (because they have no native regex literals), such as Java, C# or PowerShell do well without the delimiter. In PHP you can add options after the closing delimiter, such as /a/i which matches a or A (case-insensitively), but the regex (?i)a does exactly the same and doesn't need delimiters.
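A short sketch of both points, using a made-up line of C++ as input. In PHP itself the delimiters are still required, so the inline-option form is written as /(?i)a/ here:

```php
<?php
$text = 'int x = 1; // counter';

// Same regex, two delimiters: only the escaping differs.
preg_match('/\/\/.*$/', $text, $slash);
preg_match('#//.*$#', $text, $hash);
echo $hash[0], "\n";              // // counter
var_dump($slash[0] === $hash[0]); // bool(true)

// Trailing option versus inline option: both match case-insensitively.
$trailing = preg_match('/a/i', 'A');
$inline   = preg_match('/(?i)a/', 'A');
var_dump($trailing, $inline); // int(1) int(1)
```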
And next time, take the time to read through Regular-Expressions.info. It's an awesome reference on regex basics and advanced topics, explaining many things very well and thoroughly. And please also take a look at the PHP documentation in this regard.
Well, please stick to one actual question per ... question.
This is an answer to questions 3 and 4, as the other questions have already been answered.
Regexes are generally delimited by /, e.g. /abc123/ or /foo|bar/i. In PHP, you can use almost any character you want for this (any non-alphanumeric, non-backslash, non-whitespace character). You are not limited to /; you can use e.g. # or %, as in #/usr/local/bin#.

Double plusses in regex

I have seen several regular expressions that have two plusses in a row. What exactly does this mean? One or more of one or more of the pattern. If the pattern matches in the first place, why would the second match be necessary?
Examples:
[a-zA-Z0-9_]++
[^/.,;?]++
They're called possessive quantifiers.
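A possessive quantifier matches like the greedy one but never gives back what it has taken. A minimal sketch using the first character class from the question, with a trailing \d and an input that I added for illustration:

```php
<?php
// Greedy +: takes "abc_9", then gives back the "9" so \d can match it.
$greedy = preg_match('/^[a-zA-Z0-9_]+\d$/', 'abc_9');

// Possessive ++: keeps "abc_9", so nothing is left for \d and the match fails.
$possessive = preg_match('/^[a-zA-Z0-9_]++\d$/', 'abc_9');

var_dump($greedy, $possessive); // int(1) int(0)
```

So the second + is not "one or more of one or more"; it changes how the first quantifier behaves when the rest of the pattern fails.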
