PHP Regex detect repeated character in a word - php

(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.

If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
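To try both patterns from PHP, here is a minimal sketch. The helper names findWordWithTripleChar and findWordWithTripleRun are made up for illustration and are not part of the answer above; each returns the first matching word, or null.

```php
<?php
// Hypothetical helper: first word in $text in which some character occurs
// at least three times (not necessarily next to each other), or null.
function findWordWithTripleChar(string $text): ?string
{
    $pattern = '/\b\w*?(\w)\w*?\1\w*?\1\w*/';
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// Hypothetical helper: same idea, but the three occurrences must be contiguous.
function findWordWithTripleRun(string $text): ?string
{
    $pattern = '/\b\w*?(\w)\1{2}\w*/';
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}
```

Both patterns are case-sensitive as written; adding the i modifier would treat "A" and "a" as the same character.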

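Try this regex instead:
(preg_match('/(.)\1{2,}/', $repeater))
Your pattern requires four identical characters in a row (the captured one plus three repeats). \1{2,} matches the captured character followed by two or more repeats, i.e. three or more in total. See the example here: http://regexr.com/3fk80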
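A small sketch of the difference between the two quantifiers. The function name hasRun and its $min parameter are assumptions added for illustration only.

```php
<?php
// Hypothetical helper: true if $s contains the same character at least
// $min times in a row. $min = 3 corresponds to the pattern /(.)\1{2,}/.
function hasRun(string $s, int $min = 3): bool
{
    // (.) captures one character; \1{n,} then requires n more copies of it.
    $pattern = '/(.)\1{' . ($min - 1) . ',}/';
    return preg_match($pattern, $s) === 1;
}

// The pattern from the question needs four in a row:
$questionOnThree = preg_match('/(.)\1{3}/', 'aaa');   // 0
$questionOnFour  = preg_match('/(.)\1{3}/', 'aaaa');  // 1
```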

Strictly speaking, expressions that use backreferences (\1, \2, ...) are not regular expressions in the mathematical sense, and the engine that evaluates them is not efficient: it has to remember the text captured by the group in order to match it again later, and on failure it has to backtrack over the length of the matched group.
The canonical way to write a true regular expression that accepts a letter repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
and the {3,} operator does not distribute over the alternation, so it cannot be factored out into a single group as you tried in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
Since AAAA* is matched as soon as three As are found, this regex is also valid:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version lets you capture, as group 1, the whole run that makes up the actual matching sequence.
This approach is longer to write, but it can be much more efficient when scanning the data string: an automaton-based (DFA) matcher needs no backtracking at all for it and visits each character only once.
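Nobody wants to type the 52 branches by hand, so here is a sketch that builds the alternation programmatically. The function name buildTripleLetterPattern is an assumption for illustration. Keep in mind that PHP's PCRE is itself a backtracking engine, so whether this form is actually faster there is untested here.

```php
<?php
// Hypothetical helper: builds the backreference-free pattern
// (A{3,}|B{3,}|...|Z{3,}|a{3,}|...|z{3,}) with PCRE delimiters.
function buildTripleLetterPattern(): string
{
    $letters = array_merge(range('A', 'Z'), range('a', 'z'));
    $branches = array_map(fn(string $c): string => $c . '{3,}', $letters);
    return '/(' . implode('|', $branches) . ')/';
}

$pattern = buildTripleLetterPattern();
preg_match($pattern, 'hellooo world', $m);
// $m[1] now holds the whole repeated run: "ooo"
```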

Related

Regular expression to match any combination of repeated values

I need to test strings for repeated chars. Is there a single regular expression I could use for this, or should I compile a list of multiple different regular expressions?
111333555777
aaaabbbbccccdddd
aabbcc
11111
abcabcabc
There are a couple of different types of repetition.
Not sure if I get you right, but maybe this regex would be what you want
^(?:(.*)\1+)*$
matches
111333555777
aaaabbbbccccdddd
aabbcc
11111
abcabcabc
Use a capturing group and a backreference to check whether the string consists only of repeated values:
^(?:(\w+)\1+)+$
See demo at regex101
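For a quick check in PHP, the anchored pattern can be wrapped like this. The function name isOnlyRepeats is an assumption for illustration; the sample strings are the ones from the question.

```php
<?php
// Hypothetical helper: true if $s is made up entirely of units that are
// each immediately repeated at least once (e.g. "aa", "abcabc").
function isOnlyRepeats(string $s): bool
{
    return preg_match('/^(?:(\w+)\1+)+$/', $s) === 1;
}

$samples = ['111333555777', 'aaaabbbbccccdddd', 'aabbcc', '11111', 'abcabcabc'];
$results = array_map('isOnlyRepeats', $samples);
```

Nested quantifiers like these can backtrack heavily on long non-matching input, so this is best kept to short strings.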
This is like the others, except the inner capture expression is non-greedy.
I'm not really sure it matters, though it ensures the finest granularity.
(?:(.+?)\1+)+
It is probably impossible, though, to get the repetition boundaries from the capture group info.

preg_match to match a list of words but not some

I am trying to create a fairly simple regular expression to use with preg_match() used to check user agent strings for possible web crawlers/spiders.
For example, right now I am using something similar to this:
preg_match("/(bot|search|web|slurp|crawl)/i")
which seems to be successfully matching user agents that contain something like "googlebot" or "webcrawler".
However, the problem I am having is that this also matches when the user agent contains something as common as "webkit".
What modifications would be necessary to prevent specific words such as "webkit" from being matched? I have very little understanding of regular expressions and have spent hours trying various combinations based off answers to other questions and have had no success so far.
Many thanks in advance :)
In order to exclude a certain list of words, you can combine two lookaheads:
(?!webkit|robot)(?=bot|search|web|slurp|crawl)
The first part is your exclusion list. This matches "web" but not "webkit".
A small note on the syntax: (?!regex) is a negative lookahead and (?=regex) is a positive lookahead; both are non-consuming. You can read more about it here.
In short, a lookahead means "match regex expr but after that continue matching at the original match-point."
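Here is how the combined lookaheads might be used from preg_match(). The function name looksLikeCrawler and the sample user agent strings are assumptions for illustration.

```php
<?php
// Hypothetical helper: true if $userAgent contains one of the crawler
// keywords at a position where an excluded word does not start.
function looksLikeCrawler(string $userAgent): bool
{
    $pattern = '/(?!webkit|robot)(?=bot|search|web|slurp|crawl)/i';
    return preg_match($pattern, $userAgent) === 1;
}

$isBot     = looksLikeCrawler('Googlebot/2.1');                  // true
$isBrowser = looksLikeCrawler('Mozilla/5.0 AppleWebKit/537.36'); // false
```

Note that the lookaheads are tested at each position independently, so an excluded word only helps when the keyword sits at its start: "robot" is still reported as a match because of the "bot" inside it.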

Regex atomic grouping does not seem to work in preg_match_all()

I've recently been playing with regular expressions, and one thing doesn't work as expected for me when I use preg_match_all in php.
I'm using an online regex tool at http://www.solmetra.com/scripts/regex/index.php.
The regex I'm using is /(?>x|y|z)w/. I'm making it match abyxw. I am expecting it to fail, yet it succeeds, and matches xw.
I am expecting it to fail, due to the use of atomic grouping, which, from what I have read from multiple sources, prevents backtracking. What I am expecting precisely is that the engine attempts to match the y with alternation and succeeds. Later it attempts to match w with the regex literal w and fails, because it encounters x. Then it would normally backtrack, but it shouldn't in this case, due to the atomic grouping. So from what I know it should keep trying to match y with this atomic group. Yet it does not.
I would appreciate any light shed on this situation. :)
This is a little bit tricky, but there are two things that the regex can try to do when it cannot find a match:
Advance the starting position - If the match cannot succeed at an index i, it will be attempted again starting at index i+1, and this will continue until it reaches the end of the string.
Backtracking - If repetition or alternation is used in the regex, then the regex engine can discard part of an unsuccessful match and try again by using less or more of the repetition, or a different element in the alternation.
Atomic groups prevent backtracking but they do not affect advancing the starting position.
In this case, the match will fail when the engine is trying to match with y as the first character, but then it will move on and see xw as the remainder of the string, which will match.
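A short sketch showing both effects. The first call uses the pattern and subject from the question; the (?>a|ab)c pair is an extra example, not from the question, where the atomic group does change the result.

```php
<?php
// Pattern from the question: the attempt starting at "y" fails, then the
// engine advances the starting position and matches "xw".
preg_match('/(?>x|y|z)w/', 'abyxw', $m, PREG_OFFSET_CAPTURE);
// $m[0] is ['xw', 3]: the match text and its starting offset

// Backtracking into the group is what an atomic group forbids. After "a"
// matches, the engine may not go back and try the "ab" alternative.
$atomic    = preg_match('/(?>a|ab)c/', 'abc');  // 0
$nonAtomic = preg_match('/(?:a|ab)c/', 'abc');  // 1
```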

Discard character in matching group

I have a couple of matching groups one after another in a long Regex pattern. Around the middle I have
...(?<number>(?:/(?:digit|num))?\d+|)...
which should match something like /num9, /digit9 or 9 or blank (because I need the named group to appear in the resulting associative array even if it's empty).
The pattern works, but is it possible to discard the / character if one of the first two cases is matched? I tried a positive lookahead, but it seems that you can't use those if you have expressions before the lookahead.
Is what I'm trying to accomplish possible using Regex?
Based on your input, I think you need to match the / at some point anyway, otherwise your whole regex fails. At the same time you want to ignore it, so it cannot be part of your named group. Therefore, by putting it outside the group and making it optional, while ensuring that a digit is not directly preceded by a /, you get the desired result:
^/?(?<number>(?:(?:digit|num))?(?<!/)\d+|)$
However given your lack of a more complete input and regex, I am not 100% sure this will work for all your cases.
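To check the pattern against the inputs mentioned in the question, a sketch like the following could be used. The function name extractNumber is an assumption, and ~ is used as the delimiter so the / in the pattern needs no escaping.

```php
<?php
// Hypothetical helper: returns the content of the "number" group,
// or null when the input does not match at all.
function extractNumber(string $input): ?string
{
    $pattern = '~^/?(?<number>(?:(?:digit|num))?(?<!/)\d+|)$~';
    return preg_match($pattern, $input, $m) === 1 ? $m['number'] : null;
}

$fromNum   = extractNumber('/num9');  // "num9", without the leading slash
$fromBlank = extractNumber('');       // "" (the group is present but empty)
```

Since the pattern here is anchored with ^ and $, it would need adjusting before being dropped into the middle of a longer expression.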

In RegEx, how do you find a line that contains no more than 3 unique characters?

I am looping through a large text file and I'm looking for lines that contain no more than 3 different characters (those characters, however, can be repeated indefinitely). I am assuming the best way to do this would be some sort of regular expression.
All help is appreciated.
(I am writing the script in PHP, if that helps)
Regex optimisation fun time exercise for kids! Taking gnarf's regex as a starting point:
^(.)\1*(.)?(?:\1*\2*)*(.)?(?:\1*\2*\3*)*$
I noticed that there were nested and sequential *s here, which can cause a lot of backtracking. For example in 'abcaaax' it will try to match that last string of ‘a’s as a single \1* of length 3, a \1* of length two followed by a single \1, a \1 followed by a 2-length \1*, or three single-match \1s. That problem gets much worse when you have longer strings, especially when due to the regex there is nothing stopping \1 from being the same character as \2.
^(.)\1*(.)?(?:\1|\2)*(.)?(?:\1|\2|\3)*$
This was over twice as fast as the original, testing on Python's regex engine (the re module, a backtracking matcher similar to PCRE). (It's quicker than setting it up in PHP, sorry.)
This still has a problem in that (.)? can match nothing, and then carry on with the rest of the match. \1|\2 will still match \1 even if there is no \2 to match, resulting in potential backtracking trying to introduce the \1|\2 and \1|\2|\3 clauses earlier when they can't result in a match. This can be solved by moving the ? optionalness around the whole of the trailing clauses:
^(.)\1*(?:(.)(?:\1|\2)*(?:(.)(?:\1|\2|\3)*)?)?$
This was twice as fast again.
There is still a potential problem in that any of \1, \2 and \3 can be the same character, potentially causing more backtracking when the expression does not match. This would stop it by using a negative lookahead to not match a previous character:
^(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?$
However in Python with my random test data I did not notice a significant speedup from this. Your mileage may vary in PHP dependent on test data, but it might be good enough already. Possessive-matching (*+) might have helped if this were available here.
No regex performed better than the easier-to-read Python alternative:
len(set(s))<=3
The analogous method in PHP would probably be with count_chars:
strlen(count_chars($s, 3))<=3
I haven't tested the speed but I would very much expect this to be faster than regex, in addition to being much, much nicer to read.
So basically I just totally wasted my time fiddling with regexes. Don't waste your time, look for simple string methods first before resorting to regex!
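The count_chars idea can be sketched as a small function. The name hasAtMostThreeDistinctBytes is an assumption; note that count_chars works on bytes, so a multibyte UTF-8 character would count as several.

```php
<?php
// Hypothetical helper: true if $line uses at most three distinct bytes.
function hasAtMostThreeDistinctBytes(string $line): bool
{
    // In mode 3, count_chars returns a string containing each distinct
    // byte of the input exactly once.
    return strlen(count_chars($line, 3)) <= 3;
}

$lines = ['aaaaa', 'abababcaaabac', 'aasdasdsadfasf'];
$kept = array_values(array_filter($lines, 'hasAtMostThreeDistinctBytes'));
// $kept is ['aaaaa', 'abababcaaabac']
```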
At the risk of getting downvoted, I will suggest regular expressions are not meant to handle this situation.
You can match a character or a set of characters, but you can't have it remember what characters of a set have already been found to exclude those from further match.
I suggest you maintain a character set, you reset it before you begin with a new line, and you add there elements while going over the line. As soon as the count of elements in the set exceeds 3, you drop the current line and proceed to the next.
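A minimal sketch of that set-based loop, with an early exit as soon as the limit is exceeded. The function name hasAtMostNDistinctBytes and the $limit parameter are assumptions for illustration; like most PHP string functions it works on bytes, not multibyte characters.

```php
<?php
// Hypothetical helper: true if $line contains at most $limit distinct bytes.
function hasAtMostNDistinctBytes(string $line, int $limit = 3): bool
{
    $seen = [];                      // the character set, reset for every line
    $length = strlen($line);
    for ($i = 0; $i < $length; $i++) {
        $seen[$line[$i]] = true;     // array keys act as the set's elements
        if (count($seen) > $limit) {
            return false;            // drop this line without reading the rest
        }
    }
    return true;
}
```

When reading the file, the caller would apply this to each line in turn and move on to the next line whenever it returns false.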
Perhaps this will work:
preg_match("/^(.)\\1*(.)?(?:\\1*\\2*)*(.)?(?:\\1*\\2*\\3*)*$/", $string, $matches);
// aaaaa:Pass
// abababcaaabac:Pass
// aaadsdsdads:Pass
// aasasasassa:Pass
// aasdasdsadfasf:Fail
Explanation:
/
^ #start of string
(.) #match any character in group 1
\\1* #match whatever group 1 was 0 or more times
(.)? #match any character in group 2 (optional)
(?:\\1*\\2*)* #match group 1 or 2, 0 or more times, 0 or more times
#(non-capture group)
(.)? #match any character in group 3 (optional)
(?:\\1*\\2*\\3*)* #match group 1, 2 or 3, 0 or more times, 0 or more times
#(non-capture group)
$ #end of string
/
An added benefit: $matches[1], $matches[2] and $matches[3] will contain the three characters you want. The regular expression captures the first character and matches it until a different character is found, captures that as the second character, matches either of those two as many times as it can, captures a third character, and then matches all three until the match fails or the string ends and the test passes.
EDIT
This regexp will be much faster because of the way the parsing engine and backtracking works, read bobince's answer for the explanation:
/^(.)\\1*(?:(.)(?:\\1|\\2)*(?:(.)(?:\\1|\\2|\\3)*)?)?$/
For me, as a programmer with fair-enough regular expression knowledge, this does not sound like a problem you can solve using regex only.
More likely you will need to build a hash map/array data structure (key: character, value: count) and iterate over the large text file, rebuilding the map for each line. At each new character, check whether the map already holds 3 distinct characters; if so, skip the current line.
But I'm keen to be surprised if some mad regex hacker comes up with a solution.
