Double plusses in regex - php

I have seen several regular expressions that have two plusses in a row. What exactly does this mean? One or more of one or more of the pattern. If the pattern matches in the first place, why would the second match be necessary?
Examples:
[a-zA-Z0-9_]++
[^/.,;?]++

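They're called possessive quantifiers. A possessive quantifier such as ++ matches like the greedy +, but once it has matched it never gives characters back, so the engine cannot backtrack into it when the rest of the pattern fails.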
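A minimal sketch of the difference, assuming PHP's preg_match (PCRE supports possessive quantifiers). The input string and variable names are made up for illustration:

```php
<?php
// Greedy +: \d+ first takes "12345", then gives back "5"
// so that the final \d can match.
$greedy = preg_match('/^\d+\d$/', '12345');

// Possessive ++: \d++ takes "12345" and never gives anything back,
// so the final \d has nothing left to match and the pattern fails.
$possessive = preg_match('/^\d++\d$/', '12345');

var_dump($greedy);      // int(1)
var_dump($possessive);  // int(0)
```

In patterns like [a-zA-Z0-9_]++ the result is usually the same as with a single +; the possessive form mainly saves the engine from pointless backtracking when a later part of the pattern fails.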

Related

Regex lookahead why

For PCRE, what's the difference between the following two regexes?
(?=<!--)([\s\S]*?-->) and
(<!--[\s\S]*?-->)
The first one is meant to match HTML comments.
These two patterns will match the same thing. Here is an explanation of the first pattern:
(?=<!--) assert that what immediately follows is <!--
([\s\S]*?-->) then capture everything, across lines if necessary,
until reaching the first -->
The second pattern does not use lookaheads, but rather just matches a single HTML comment:
(<!--[\s\S]*?-->)
Again, this pattern will match across lines.
I would expect both patterns to have similar performance. Which one you choose depends on which performs better for your data, the tool you use (not all regex engines support lookarounds), and which pattern you find easier to read.
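A quick way to check that both patterns capture the same text, sketched with preg_match. The sample HTML and the variable names are assumptions for illustration:

```php
<?php
// Hypothetical sample input containing one multi-line comment.
$html = "<p>text</p>\n<!-- a\nmulti-line comment -->\n<p>more</p>";

// Both patterns are taken unchanged from the question.
preg_match('/(?=<!--)([\s\S]*?-->)/', $html, $withLookahead);
preg_match('/(<!--[\s\S]*?-->)/', $html, $withoutLookahead);

var_dump($withLookahead[1] === $withoutLookahead[1]); // bool(true)
echo $withLookahead[1], "\n";
```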

PHP Regex detect repeated character in a word

(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.
If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
Try this regex instead
(preg_match('/(.)\1{2,}/', $repeater))
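This matches a character repeated 3 or more times in a row (the original \1{3} required four: the captured character plus three repeats). See the example here: http://regexr.com/3fk80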
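The contiguous and non-contiguous patterns from the answers above can be compared side by side. This is only a sketch; the test words and variable names are hypothetical:

```php
<?php
// Contiguous: the same character three or more times in a row.
$contiguous = '/(.)\1{2,}/';
// Non-contiguous: the same word character three times anywhere in one word.
$anywhere = '/\b\w*?(\w)\w*?\1\w*?\1\w*/';

// Hypothetical test words, not taken from the original question.
$results = [];
foreach (['hello', 'banana', 'zzz', 'bookkeeper'] as $word) {
    $results[$word] = [preg_match($contiguous, $word), preg_match($anywhere, $word)];
    printf("%-10s contiguous=%d anywhere=%d\n", $word, ...$results[$word]);
}
```

Here "banana" and "bookkeeper" only match the non-contiguous pattern (three a's and three e's that are not adjacent), "zzz" matches both, and "hello" matches neither.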
Strictly speaking, regular expressions that include backreferences (\1, \2, ...) are not regular expressions in the mathematical sense, and the scanner that parses them is not efficient: it has to adapt itself to the captured group in order to match the same string again, and on failure it has to backtrack over the length of the matched group.
The canonical way to express a true regular expression that accepts word characters repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
and the {3,} operator does not distribute over the alternation, so you cannot group it as you showed in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
Again, since AAAA* already matches as soon as three As are found, the following regex would also be valid:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version allows you to capture, as group 1, the whole run that actually matched.
This approach is longer to write but can be far more efficient when scanning the input string: a matcher built from a true regular expression (a DFA) needs no backtracking at all and visits each character only once.
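As a rough sketch, the backreference-free alternation can be generated instead of typed by hand. The sample words and variable names are made up for illustration. Note that PHP's preg_* functions use PCRE, which is itself a backtracking engine, so this shows the shape of the pattern rather than the efficiency argument:

```php
<?php
// Build (A{3,}|B{3,}|...|Z{3,}|a{3,}|...|z{3,}) from the two letter ranges.
$alternatives = [];
foreach (array_merge(range('A', 'Z'), range('a', 'z')) as $letter) {
    $alternatives[] = $letter . '{3,}';
}
$pattern = '/(' . implode('|', $alternatives) . ')/';

// Hypothetical inputs: one with a run of three e's, one without any run.
$found    = preg_match($pattern, 'bookkeeeper', $m);
$notFound = preg_match($pattern, 'banana');

var_dump($found, $m[1]); // int(1) and string(3) "eee"
var_dump($notFound);     // int(0)
```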

Regular expression to match any combination of repeated values

I need to test strings for repeated chars. Is there a single regular expression I could use for this, or should I compile a list of multiple different regular expressions?
111333555777
aaaabbbbccccdddd
aabbcc
11111
abcabcabc
There are a couple of different types of repetition.
Not sure if I get you right, but maybe this regex is what you want:
^(?:(.*)\1+)*$
matches
111333555777
aaaabbbbccccdddd
aabbcc
11111
abcabcabc
Use a capturing group and a backreference to check whether the string consists only of repeated values:
^(?:(\w+)\1+)+$
See demo at regex101
This is like the others, except the inner capture expression is non-greedy.
Not really sure if it matters, though it ensures the finest granularity.
(?:(.+?)\1+)+
It is probably impossible, though, to get the repeating boundaries via capture group info.
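A small sketch that runs the anchored \w+ pattern from the answers above against the samples in the question, using preg_grep to keep only the matching strings. The two counter-examples ('abc' and 'aab') and the variable names are hypothetical additions:

```php
<?php
// The whole string must consist of one or more blocks,
// each block being some text immediately repeated at least once.
$pattern = '/^(?:(\w+)\1+)+$/';

// The five samples from the question, plus two hypothetical counter-examples.
$samples = ['111333555777', 'aaaabbbbccccdddd', 'aabbcc', '11111', 'abcabcabc', 'abc', 'aab'];

// preg_grep returns the array entries that match the pattern.
$matching = array_values(preg_grep($pattern, $samples));

print_r($matching);
```

All five samples from the question match; 'abc' and 'aab' do not, because neither can be split entirely into repeated blocks.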

Explain Regular Expression

1. (.*?)
2. (*)
3. #regex#
4. /regex/
A. What do the above symbols mean?
B. What is the difference between # and /?
I have the cheat sheet, but I didn't fully get it yet. As far as I know, * gets
all characters, so what is .*? for?
The above patterns are used in PHP preg_match and preg_replace.
. matches any character (roughly).
*? is a so-called quantifier, matching the previous token at least zero times (and only as often as needed to complete a match – it's lazy, hence the ?).
(...) creates a capturing group you can refer to in either the regex or the match. Parentheses are also used for limiting the reach of the | alternation to only parts of the regex (just like parentheses in math make precedence clear).
/.../ and #...# are delimiters for the entire regex, in PHP at least. Technically they're not part of the regex syntax. Which delimiter you use is up to you (though it cannot be a backslash, an alphanumeric character, or whitespace), and it mostly changes which characters you need to escape in the regex. So / is a bad choice when you're matching URIs that might contain a lot of slashes. Compare the following two variants for finding end-of-line comments in C++ style:
preg_match('/\/\/.*$/', $text);
preg_match('#//.*$#', $text);
The latter is easier to read as you don't have to escape slashes within the regex itself. # is commonly used as a delimiter because it stands out and isn't that frequent in text, but you can use whatever you like.
Technically a regex doesn't need this delimiter at all. It is probably mostly a remnant of PHP's Perl heritage (in Perl, regexes are delimited but are not contained in a string). Other languages that use strings (because they have no native regex literals), such as Java, C# or PowerShell, do well without the delimiter. In PHP you can add options after the closing delimiter, such as /a/i, which matches a or A (case-insensitively); the inline option (?i)a does exactly the same without relying on the delimiter syntax, although PHP's preg_* functions still require delimiters around the pattern.
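Both points can be checked with a short sketch. The sample text and variable names are assumptions for illustration:

```php
<?php
$text = 'int x = 0; // counter';   // hypothetical sample input

// Same pattern with two delimiters: "/" forces escaping, "#" does not.
preg_match('/\/\/.*$/', $text, $slashDelimited);
preg_match('#//.*$#', $text, $hashDelimited);
var_dump($slashDelimited[0] === $hashDelimited[0]); // bool(true)

// Case-insensitivity as a trailing flag and as an inline option.
$flag   = preg_match('/a/i', 'A');
$inline = preg_match('/(?i)a/', 'A');
var_dump($flag, $inline); // int(1) int(1)
```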
And next time, take the time to read through Regular-Expressions.info; it's an awesome reference on regex basics and advanced topics, explaining many things very well and thoroughly. And please also take a look at the PHP documentation in this regard.
Well, please stick to one actual question per ... question.
This is an answer to questions 3 and 4, as the other questions have already been answered.
Regexes are generally delimited by /, e.g. /abc123/ or /foo|bar/i. In PHP, you can use almost any non-alphanumeric character for this. You are not limited to /; you can use e.g. # or %, as in #/usr/local/bin#.

Regular expression .*? vs .*

I came across a PHP article about regular expressions which used (.*?) in its syntax. As far as I can see it behaves just like (.*)
Is there any advantage of using (.*?) ? I can't really see why someone would use that.
In most flavours of regex, the *? production is a non-greedy repeat. This means that the .*? production matches first the empty string, and then if that fails, one character, and so on until the match succeeds. In contrast, the greedy production .* first attempts to match the entire input, and then if that fails, tries one character less.
This concept only applies to regular expression engines that use recursive backtracking to match ambiguous expressions. In theory, they match exactly the same sentences, but since they try different things first, it's likely that one will be much quicker than the other.
This can also be useful when capture groups (in recursive and NFA style engines equally) are used to extract information from the matching action. For instance, an expression like
"(.*?)"
can be used to capture a quoted string. Since the subgroup is non-greedy, you can be sure that no quotes will be captured, and the subgroup contains only the desired content.
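A minimal sketch of that extraction with preg_match_all, next to the greedy variant for contrast. The input string and variable names are hypothetical:

```php
<?php
$input = 'say "hello" and then "goodbye"';   // hypothetical sample input

// Lazy: each capture stops at the nearest closing quote.
preg_match_all('/"(.*?)"/', $input, $lazy);
// Greedy: the capture runs from the first quote to the last one.
preg_match_all('/"(.*)"/', $input, $greedy);

print_r($lazy[1]);   // two captures: hello, goodbye
print_r($greedy[1]); // one capture spanning both quoted strings
```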
.* is greedy, .*? is not. The difference only shows in context, though. Given the patterns
<br/>(.*?)<br/> and <br/>(.*)<br/>, and the input <br/>test<br/>test2<br/>,
.* will match <br/>test<br/>test2<br/>,
.*? will only match <br/>test<br/>.
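The same input can be run through both patterns with preg_match. This is just a sketch; # is used as the delimiter so the slashes in <br/> need no escaping, and the variable names are made up:

```php
<?php
$input = '<br/>test<br/>test2<br/>';

preg_match('#<br/>(.*)<br/>#', $input, $greedy);
preg_match('#<br/>(.*?)<br/>#', $input, $lazy);

echo $greedy[0], "\n"; // <br/>test<br/>test2<br/>
echo $lazy[0], "\n";   // <br/>test<br/>
echo $greedy[1], "\n"; // test<br/>test2
echo $lazy[1], "\n";   // test
```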
Note: don't ever use regex to parse complex HTML.
