PHP PCRE - match "nothing"

I'm trying to match my application URI to a set of routes. For the default route I thought about allowing bb.com/home or bb.com/ (empty) as the options for the first URI segment, and the same for the second. I'm not sure the way I am checking for empty values is the best:
#^/?(?P<controller>([.*]{0}|home))(?:/(?P<action>([.*]{0}|test)))?/?$#uD
Notice the [.*]{0}
Is there a better way to do it?

You could make it lazy: .*?, that should match nothing every time.
Also, you don't have to have so many capture groups. You even have a numbered capture group within a named capture group. This is how I would write the expression:
^/?(?P<controller>.*?|home)/(?P<action>.*?|test)/?$
This retains the two named capture groups, but gets rid of the nested numbered capture group and also the non-capturing group which was not necessary.

By placing .* inside a character class [] you're asking to match a literal dot . and literal * instead of the dot being able to match any character (except newline) and * being able to act as a quantifier.
The {0} range quantifier matches exactly 0 times (the token is effectively ignored). You're not going to get the results you expect, and there is no need to do this either.
You could simply add the ? for a non-greedy match and remove the excess capturing groups here.
~^/?(?P<controller>.*?|home)/(?P<action>.*?|test)/?$~i
However, think about how this will behave. You said you wanted to allow bb.com/home, but this pattern will also match URIs that you possibly do not want.
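To make that last warning concrete, here is a minimal sketch that runs the suggested pattern against a few URIs. The helper matchRoute() and the $strict pattern are illustrations only and do not come from either answer; $strict assumes the only valid segments are home, test, or nothing, and uses an empty alternative (home|) to stand for "nothing".

```php
<?php
// Hypothetical helper (not from the thread): returns the named groups,
// or null when the URI does not match the pattern.
function matchRoute(string $pattern, string $uri): ?array
{
    if (preg_match($pattern, $uri, $m) !== 1) {
        return null;
    }
    // Trailing groups that did not take part in the match are missing
    // from $m, so fall back to an empty string.
    return [
        'controller' => $m['controller'] ?? '',
        'action'     => $m['action'] ?? '',
    ];
}

// Pattern suggested above: the lazy .*? accepts any segment.
$lazy = '~^/?(?P<controller>.*?|home)/(?P<action>.*?|test)/?$~i';

// Assumed stricter variant: an empty alternative stands for "nothing".
$strict = '#^/?(?P<controller>home|)(?:/(?P<action>test|))?/?$#D';

var_dump(matchRoute($lazy, '/foo/bar'));     // matches: controller "foo", action "bar"
var_dump(matchRoute($lazy, '/home'));        // matches, but "home" lands in action
var_dump(matchRoute($strict, '/foo/bar'));   // NULL
var_dump(matchRoute($strict, '/home/test')); // controller "home", action "test"
var_dump(matchRoute($strict, ''));           // both groups empty
```

Note that the lazy pattern requires the middle slash, so for /home the engine gives up the optional leading slash and reports "home" as the action rather than the controller.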

Related

PHP Regex detect repeated character in a word

(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.
If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
Try this regex instead
(preg_match('/(.)\1{2,}/', $repeater))
This should match 3 or more times, see example here http://regexr.com/3fk80
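A quick way to compare the patterns from the question and the two answers above is to run them side by side. This is only a sketch; the function names hasTripleRun() and hasTripleAnywhere() are made up for the example.

```php
<?php
// True when the same character appears three or more times in a row.
function hasTripleRun(string $word): bool
{
    return preg_match('/(.)\1{2,}/', $word) === 1;
}

// True when some word character occurs at least three times in one word,
// not necessarily next to each other.
function hasTripleAnywhere(string $word): bool
{
    return preg_match('/\b\w*?(\w)\w*?\1\w*?\1\w*/', $word) === 1;
}

var_dump(hasTripleRun('heeello'));             // bool(true)
var_dump(hasTripleRun('hello'));               // bool(false)
var_dump(preg_match('/(.)\1{3}/', 'heeello')); // int(0)
var_dump(hasTripleAnywhere('banana'));         // bool(true)
var_dump(hasTripleAnywhere('hello'));          // bool(false)
```

The pattern from the question fails on 'heeello' because (.)\1{3} asks for the captured character plus three repeats, that is four in a row.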
Strictly speaking, expressions that use backreferences such as \1 or \2 are not regular expressions in the mathematical sense. A scanner cannot handle them efficiently: it has to remember the text captured by the group in order to match it again later, and on failure it has to backtrack over the length of the matched group.
The canonical way to express a true regular expression that accepts word characters repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
The {3,} operator cannot be factored out of the alternation and applied to a group, as you have shown in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
Here you can also use the fact that AAAA* matches as soon as three As are found, so this regex is valid as well:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version lets you capture, as group 1, the exact sequence that matched.
This approach is longer to write but much more efficient when scanning the data string, as it needs no backtracking and visits each character only once.
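Nobody wants to type that alternation by hand, so here is a small sketch that builds the AAA|BBB|...|zzz form in PHP. The variable names are arbitrary, the arrow function assumes PHP 7.4 or later, and only ASCII letters are covered, as in the listing above. Whether it is actually faster than the backreference version under PCRE is something you would have to measure.

```php
<?php
// Build AAA|BBB|...|ZZZ|aaa|...|zzz instead of typing it out.
$letters = array_merge(range('A', 'Z'), range('a', 'z'));
$runs = array_map(fn(string $c): string => str_repeat($c, 3), $letters);
$pattern = '/' . implode('|', $runs) . '/';

var_dump(preg_match($pattern, 'heeello')); // int(1)
var_dump(preg_match($pattern, 'hello'));   // int(0)
```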

PHP RegEx: a Pattern to Validate the Second Level Domain

Note: this is a theoretical question about PHP flavor of regex, not a practical question about validation in PHP. I am merely using Domain Names for lack of a better example.
"Second Level Domain" refers to the combination of letters, numbers, period signs, and/or dashes that are placed between http:// or http://www. and .com (.co, .info, .etc) .
I am only interested in second level domains that use English version of Latin alphabet.
This pattern:
[A-Za-z0-9.-]+
matches valid domain names, such as stackoverflow, StackOverflow, stackoverflow.co (as in stackoverflow.co.uk), stack-overflow, or stackoverflow123.
However, the same pattern would also match something like stack...overflow, stack---over--flow, ........ , -------- , or even . and -.
How can that pattern be rewritten, to indicate that period signs and dashes, even though they can be used multiple times in a node,
cannot be used without other symbols,
cannot be placed twice or more side by side with each other,
and cannot be placed in the beginning or end of the node?
Thank you in advance!
I think something like this should do the trick:
^([a-zA-Z0-9]+[.-])*[a-zA-Z0-9]+$
What this tries to do is
start at the beginning of string, end at the end
one or more letter or digit
followed by either dot or hyphen
the group above repeated 0 or more times
followed by one or more letter or digit
Assuming that you are looking for a regex that does not allow two consecutive . or - you can use:
^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$
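As a sanity check, the pattern can be run against the examples from the question. This is a sketch only: isValidNode() is a made-up name, and the D modifier is an addition so that $ does not match before a trailing newline.

```php
<?php
// Pattern from the answer above, plus the D modifier (an addition).
function isValidNode(string $name): bool
{
    return preg_match('/^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$/D', $name) === 1;
}

$samples = ['stackoverflow', 'stack-overflow', 'stackoverflow.co',
            'stack...overflow', 'stack.-overflow', '-', 'stack-'];

foreach ($samples as $name) {
    // The first three print "valid", the remaining four "invalid".
    printf("%-18s %s\n", $name, isValidNode($name) ? 'valid' : 'invalid');
}
```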

Discard character in matching group

I have a couple of matching groups one after another in a long Regex pattern. Around the middle I have
...(?<number>(?:/(?:digit|num))?\d+|)...
which should match something like /num9, /digit9 or 9 or blank (because I need the named group to appear in the resulting associative array even if it's empty).
The pattern works, but is it possible to discard the / character if one of the first two cases is matched? I tried a positive lookahead, but it seems that you can't use those if you have expressions before the lookahead.
Is what I'm trying to accomplish possible using Regex?
Based on your input, I think that you need to match the / at some point anyway, otherwise your whole regex fails. At the same time you want to ignore it, so it cannot be part of your named group. Therefore, by putting it outside the group and making it optional, while ensuring that a digit is not directly preceded by a /, you get the desired results:
^/?(?<number>(?:(?:digit|num))?(?<!/)\d+|)$
However given your lack of a more complete input and regex, I am not 100% sure this will work for all your cases.
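Here is a small sketch of how that pattern behaves on the inputs from the question, taken in isolation. extractNumber() is a hypothetical helper, and the pattern is anchored exactly as in the answer; in the real, longer regex the surrounding parts would change what gets matched.

```php
<?php
// Hypothetical helper: returns the contents of the "number" group,
// or null when the input does not match at all.
function extractNumber(string $input): ?string
{
    $pattern = '~^/?(?<number>(?:(?:digit|num))?(?<!/)\d+|)$~';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }
    return $m['number'];
}

var_dump(extractNumber('/num9'));   // string(4) "num9"
var_dump(extractNumber('/digit9')); // string(6) "digit9"
var_dump(extractNumber('9'));       // string(1) "9"
var_dump(extractNumber(''));        // string(0) ""
var_dump(extractNumber('/9'));      // NULL
```

The last case is the effect of the (?<!/) lookbehind: a digit directly after the slash is rejected.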

Regular expression .*? vs .*

I came across a PHP article about regular expressions which used (.*?) in its syntax. As far as I can see it behaves just like (.*)
Is there any advantage of using (.*?) ? I can't really see why someone would use that.
In most flavours of regex, the *? production is a non-greedy repeat. This means that the .*? production matches first the empty string, and then if that fails, one character, and so on until the match succeeds. In contrast, the greedy production .* first attempts to match the entire input, and then if that fails, tries one character less.
This concept only applies to regular expression engines that use recursive backtracking to match ambiguous expressions. In theory, they match exactly the same sentences, but since they try different things first, it's likely that one will be much quicker than the other.
This can also be useful when capture groups (in recursive and NFA style engines equally) are used to extract information from the matching action. For instance, an expression like
"(.*?)"
can be used to capture a quoted string. Since the subgroup is non-greedy, you can be sure that no quotes will be captured, and the subgroup contains only the desired content.
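A short sketch of the quoted-string case, with a made-up input string, shows the difference in what the subgroup captures:

```php
<?php
$text = 'say "a" and "b"';

preg_match_all('/"(.*?)"/', $text, $lazy);   // non-greedy
preg_match_all('/"(.*)"/', $text, $greedy);  // greedy

var_dump($lazy[1]);   // two captures: "a" and "b"
var_dump($greedy[1]); // one capture: a" and "b
```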
.* is greedy, .*? is not. It only makes sense in context though. Given the pattern:
<br/>(.*?)<br/> and <br/>(.*)<br/>, and the input <br/>test<br/>test2<br/>,
.* will match <br/>test<br/>test2<br/>,
.*? will only match <br/>test<br/>.
Note: don't ever use regex to parse complex html.
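The same comparison as runnable PHP, using the input from this answer (the variable names are arbitrary):

```php
<?php
$input = '<br/>test<br/>test2<br/>';

preg_match('~<br/>(.*?)<br/>~', $input, $lazy);
preg_match('~<br/>(.*)<br/>~', $input, $greedy);

echo $lazy[0], "\n";   // <br/>test<br/>
echo $lazy[1], "\n";   // test
echo $greedy[0], "\n"; // <br/>test<br/>test2<br/>
echo $greedy[1], "\n"; // test<br/>test2
```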

In RegEx, how do you find a line that contains no more than 3 unique characters?

I am looping through a large text file and I'm looking for lines that contain no more than 3 different characters (those characters, however, can be repeated indefinitely). I am assuming the best way to do this would be some sort of regular expression.
All help is appreciated.
(I am writing the script in PHP, if that helps)
Regex optimisation fun time exercise for kids! Taking gnarf's regex as a starting point:
^(.)\1*(.)?(?:\1*\2*)*(.)?(?:\1*\2*\3*)*$
I noticed that there were nested and sequential *s here, which can cause a lot of backtracking. For example in 'abcaaax' it will try to match that last string of ‘a’s as a single \1* of length 3, a \1* of length two followed by a single \1, a \1 followed by a 2-length \1*, or three single-match \1s. That problem gets much worse when you have longer strings, especially when due to the regex there is nothing stopping \1 from being the same character as \2.
^(.)\1*(.)?(?:\1|\2)*(.)?(?:\1|\2|\3)*$
This was over twice as fast as the original, testing with Python's regex engine. (It's quicker than setting it up in PHP, sorry.)
This still has a problem in that (.)? can match nothing, and then carry on with the rest of the match. \1|\2 will still match \1 even if there is no \2 to match, resulting in potential backtracking trying to introduce the \1|\2 and \1|\2|\3 clauses earlier when they can't result in a match. This can be solved by moving the ? optionalness around the whole of the trailing clauses:
^(.)\1*(?:(.)(?:\1|\2)*(?:(.)(?:\1|\2|\3)*)?)?$
This was twice as fast again.
There is still a potential problem in that any of \1, \2 and \3 can be the same character, potentially causing more backtracking when the expression does not match. This would stop it by using a negative lookahead to not match a previous character:
^(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?$
However in Python with my random test data I did not notice a significant speedup from this. Your mileage may vary in PHP dependent on test data, but it might be good enough already. Possessive-matching (*+) might have helped if this were available here.
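For anyone who wants to try the last pattern in PHP after all, here is a minimal sketch. The function name is made up, the sample strings are the ones used in gnarf's answer, and no timing claims are implied.

```php
<?php
// Last pattern from above: the negative lookaheads keep the three
// captured characters distinct.
function matchesAtMostThree(string $line): bool
{
    $pattern = '/^(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?$/';
    return preg_match($pattern, $line) === 1;
}

var_dump(matchesAtMostThree('aaaaa'));          // bool(true)
var_dump(matchesAtMostThree('abababcaaabac'));  // bool(true)
var_dump(matchesAtMostThree('aaadsdsdads'));    // bool(true)
var_dump(matchesAtMostThree('aasdasdsadfasf')); // bool(false)
var_dump(matchesAtMostThree(''));               // bool(false)
```

Note the last case: the pattern requires at least one character, so an empty line does not match even though it has fewer than three distinct characters.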
No regex performed better than the easier-to-read Python alternative:
len(set(s))<=3
The analogous method in PHP would probably be with count_chars:
strlen(count_chars($s, 3))<=3
I haven't tested the speed but I would very much expect this to be faster than regex, in addition to being much, much nicer to read.
So basically I just totally wasted my time fiddling with regexes. Don't waste your time, look for simple string methods first before resorting to regex!
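Wrapped in a function, the count_chars approach might look like the sketch below. The function name is made up, and keep in mind that count_chars works on bytes, so a multibyte UTF-8 character counts as several "characters" here.

```php
<?php
function hasAtMostThreeDistinctBytes(string $s): bool
{
    // Mode 3 returns a string holding each byte value used in $s, once.
    return strlen(count_chars($s, 3)) <= 3;
}

var_dump(hasAtMostThreeDistinctBytes('abababcaaabac'));  // bool(true)
var_dump(hasAtMostThreeDistinctBytes('aasdasdsadfasf')); // bool(false)
var_dump(hasAtMostThreeDistinctBytes(''));               // bool(true)
```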
At the risk of getting downvoted, I will suggest regular expressions are not meant to handle this situation.
You can match a character or a set of characters, but you can't have it remember what characters of a set have already been found to exclude those from further match.
I suggest you maintain a character set, you reset it before you begin with a new line, and you add there elements while going over the line. As soon as the count of elements in the set exceeds 3, you drop the current line and proceed to the next.
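A rough sketch of that idea in PHP follows. The function name and the sample lines are invented for the example, and like count_chars it works on bytes rather than multibyte characters.

```php
<?php
function withinDistinctLimit(string $line, int $limit = 3): bool
{
    $seen = []; // the character set, reset for every line
    $length = strlen($line);
    for ($i = 0; $i < $length; $i++) {
        $seen[$line[$i]] = true;
        if (count($seen) > $limit) {
            return false; // drop the line as soon as a fourth character shows up
        }
    }
    return true;
}

$lines = ['aaaaa', 'abcd', 'abcabc'];
$kept = array_values(array_filter($lines, 'withinDistinctLimit'));
var_dump($kept); // "aaaaa" and "abcabc"
```

The early return is the point of doing it by hand: a long line with many different characters is abandoned after the first few bytes.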
Perhaps this will work:
preg_match("/^(.)\\1*(.)?(?:\\1*\\2*)*(.)?(?:\\1*\\2*\\3*)*$/", $string, $matches);
// aaaaa:Pass
// abababcaaabac:Pass
// aaadsdsdads:Pass
// aasasasassa:Pass
// aasdasdsadfasf:Fail
Explanation:
/
^ #start of string
(.) #match any character in group 1
\\1* #match whatever group 1 was 0 or more times
(.)? #match any character in group 2 (optional)
(?:\\1*\\2*)* #match group 1 or 2, 0 or more times, 0 or more times
#(non-capture group)
(.)? #match any character in group 3 (optional)
(?:\\1*\\2*\\3*)* #match group 1, 2 or 3, 0 or more times, 0 or more times
#(non-capture group)
$ #end of string
/
An added benefit: $matches[1], [2], [3] will contain the three characters you want. The regular expression looks for the first character, stores it, and matches it until something other than that character is found. It catches that as a second character and matches either of those characters as many times as it can. Then it catches the third character and matches all three until the match fails, or the string ends and the test passes.
EDIT
This regexp will be much faster because of the way the parsing engine and backtracking works, read bobince's answer for the explanation:
/^(.)\\1*(?:(.)(?:\\1|\\2)*(?:(.)(?:\\1|\\2|\\3)*)?)?$/
For me, as a programmer with fair-enough regular expression knowledge, this does not sound like a problem that you can solve using regexp only.
More likely you will need to build a hash map/array data structure (key: character, value: count) and iterate over the large text file, rebuilding the map for each line. At each new character, check whether the number of already-encountered characters is 3; if so, skip the current line.
But I'm keen to be surprised if some mad regexp hacker comes up with a solution.
