I need to test strings for repeated chars. Is there a single regular expression I could use for this, or should I compile a list of several different regular expressions?
111333555777
aaaabbbbccccdddd
aabbcc
11111
abcabcabc
There are a couple of different types of repetition.
Not sure if I get you right, but maybe this regex would be what you want
^(?:(.*)\1+)*$
matches
111333555777
aaaabbbbccccdddd
aabbcc
11111
abcabcabc
Use a capturing group and a backreference to check whether the string consists only of repeated values.
^(?:(\w+)\1+)+$
See demo at regex101
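As a quick check of the pattern above, here is a minimal Python sketch that runs it against the sample strings from the question. The helper name is_all_repeats and the extra non-matching input "abcd" are assumptions added for illustration.

```python
import re

# Pattern from the answer above: the whole string must consist of
# one or more chunks, each being a unit repeated at least twice.
REPEATED_ONLY = re.compile(r'^(?:(\w+)\1+)+$')

def is_all_repeats(text):
    """Return True if text is made up only of back-to-back repeated units."""
    return REPEATED_ONLY.match(text) is not None

# The five samples from the question plus one string with no repetition.
samples = ["111333555777", "aaaabbbbccccdddd", "aabbcc", "11111", "abcabcabc", "abcd"]
for s in samples:
    print(s, is_all_repeats(s))
```

The five strings from the question should report True and "abcd" should report False.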
This is like the others, except the inner capture expression is non-greedy.
Not really sure if it matters, though it ensures the finest granularity.
(?:(.+?)\1+)+
It is probably impossible, though, to get the boundaries of each repetition from the capture group info.
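One possible workaround for the boundary problem, sketched in Python: scan with only the inner part of the pattern, (.+?)\1+, and read each run's position from the match object. The function name repeated_runs is hypothetical, and this is a swapped-in approach rather than something the pattern above does on its own.

```python
import re

# Inner part of the pattern above: a lazily matched unit followed by
# one or more copies of itself.
UNIT_RUN = re.compile(r'(.+?)\1+')

def repeated_runs(text):
    """Return (start, end, unit) for each run of a repeated unit in text."""
    return [(m.start(), m.end(), m.group(1)) for m in UNIT_RUN.finditer(text)]

print(repeated_runs("111333555777"))
print(repeated_runs("abcabcabc"))
```

Because the unit is matched lazily, each run is reported with the shortest unit that repeats.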
(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.
If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
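To see how the two patterns differ, here is a small Python sketch. The sample sentence and the helper words_matching are made up for illustration; the patterns are copied from above unchanged.

```python
import re

# A character occurring three times anywhere in the word.
ANY_THREE = re.compile(r'\b\w*?(\w)\w*?\1\w*?\1\w*')
# A character occurring three times in a row.
CONTIGUOUS_THREE = re.compile(r'\b\w*?(\w)\1{2}\w*')

def words_matching(pattern, text):
    """Return the full words in text that the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]

text = "the baaad banana hello"
print(words_matching(ANY_THREE, text))
print(words_matching(CONTIGUOUS_THREE, text))
```

Here "banana" only satisfies the first pattern, since its three a's are not adjacent.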
Try this regex instead
(preg_match('/(.)\1{2,}/', $repeater))
This should match 3 or more times, see example here http://regexr.com/3fk80
Strictly speaking, regular expressions that use backreferences (\1, \2, ...) are not regular expressions in the mathematical sense, and the engines that evaluate them are less efficient: the matcher has to incorporate the captured group into the pattern in order to match it again, and on failure it has to backtrack over the length of that group.
The canonical way to express a true regular expression that accepts word characters repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
and the {3,} operator has no associativity that would let you group it the way you showed in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
Here you can also use the fact that AAAA* matches as soon as three As are found, so the following regex is valid as well:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version lets you capture group \1, which delimits the actual matching sequence.
This approach takes longer to write but is far more efficient when parsing the data string, as it needs no backtracking at all and visits each character only once.
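The long alternation does not have to be typed by hand. A minimal Python sketch, assuming ASCII letters only, that generates the A{3,}|B{3,}|... form and compares it with the backreference version (the helper name runs is hypothetical):

```python
import re
import string

# Backreference version: any letter followed by two or more copies of itself.
BACKREF = re.compile(r'([A-Za-z])\1{2,}')

# Backreference-free version, generated as a{3,}|b{3,}|...|Z{3,}.
PURE = re.compile("|".join(f"{c}{{3,}}" for c in string.ascii_letters))

def runs(pattern, text):
    """Return every run of three or more identical letters found in text."""
    return [m.group(0) for m in pattern.finditer(text)]

sample = "xxAAAyyzzzzq"
print(runs(BACKREF, sample))
print(runs(PURE, sample))
```

Python's re module is itself a backtracking engine, so this only shows that the two forms find the same runs. The efficiency argument above applies to engines that compile the pure form into an automaton.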
I am trying to create a fairly simple regular expression to use with preg_match() used to check user agent strings for possible web crawlers/spiders.
For example, right now I am using something similar to this:
preg_match("/(bot|search|web|slurp|crawl)/i")
which seems to be successfully matching user agents that contain something like "googlebot" or "webcrawler".
However, the problem I am having is that this also matches when the user agent contains something as common as "webkit".
What modifications would be necessary to prevent specific words such as "webkit" from being matched? I have very little understanding of regular expressions and have spent hours trying various combinations based off answers to other questions and have had no success so far.
Many thanks in advance :)
In order to exclude a certain list of words, you can combine two lookaheads:
(?!webkit|robot)(?=bot|search|web|slurp|crawl)
The first part is your exclusion list. This would match "web" but not "webkit".
A small note on the syntax: (?!regex) is a negative lookahead and (?=regex) is a positive lookahead (a non-consuming regular expression).
In short, a lookahead means "match regex expr but after that continue matching at the original match-point."
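A rough Python equivalent of the check, using the two lookaheads above with the case-insensitive flag from the question. The function name looks_like_crawler and the user agent strings are illustrative assumptions.

```python
import re

# Exclusion list first, then the keyword list, both as zero-width lookaheads.
CRAWLER = re.compile(r'(?!webkit|robot)(?=bot|search|web|slurp|crawl)', re.IGNORECASE)

def looks_like_crawler(user_agent):
    """Return True if some position passes both lookaheads."""
    return CRAWLER.search(user_agent) is not None

print(looks_like_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)"))
print(looks_like_crawler("Mozilla/5.0 AppleWebKit/537.36 Safari/537.36"))
print(looks_like_crawler("webcrawler"))
```

One thing to watch: the exclusion is only tested at the position where a keyword starts. "robot" is therefore still reported as a crawler, because the "bot" inside it starts two characters later, where the text no longer begins with "robot".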
Just a simple regex I don't know how to write.
The regex has to make sure a string matches all 3 words. I see how to make it match any of the 3:
/advancedbrain|com_ixxocart|p\=completed/
but I need to make sure that all 3 words are present in the string.
Here are the words
advancebrain
com_ixxocart
p=completed
Use lookahead assertions:
^(?=.*advancebrain)(?=.*com_ixxocart)(?=.*p=completed)
will match if all three terms are present.
You might want to add \b word boundaries around your search terms to ensure that they are matched as complete words and not as substrings of other words (like advancebraindeath), if you need to avoid this:
^(?=.*\badvancebrain\b)(?=.*\bcom_ixxocart\b)(?=.*\bp=completed\b)
^(?=.*?p=completed)(?=.*?advancebrain)(?=.*?com_ixxocart).*$
Spent too long testing and refining =/ Oh well.. Will still post my answer
Use lookahead:
(?=.*\badvancebrain)(?=.*\bcom_ixxocart)(?=.*\bp=completed)
Order won't matter. All three are required.
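The "all terms, any order" behaviour can be checked with a short Python sketch. The test strings and the name has_all_terms are invented for illustration, and single-line input is assumed, since . does not match a newline by default.

```python
import re

# One lookahead per required term, all anchored at the start of the string.
ALL_THREE = re.compile(r'^(?=.*advancebrain)(?=.*com_ixxocart)(?=.*p=completed)')

def has_all_terms(text):
    """Return True only if every one of the three terms occurs in text."""
    return ALL_THREE.search(text) is not None

print(has_all_terms("option=com_ixxocart&p=completed&ref=advancebrain"))
print(has_all_terms("p=completed advancebrain com_ixxocart"))
print(has_all_terms("option=com_ixxocart&p=completed"))
```

The first two calls differ only in the order of the terms, and the third is missing one of them.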
I came across a PHP article about regular expressions which used (.*?) in its syntax. As far as I can see, it behaves just like (.*)
Is there any advantage of using (.*?) ? I can't really see why someone would use that.
In most flavours of regex, the *? production is a non-greedy repeat. This means that the .*? production matches first the empty string, and then if that fails, one character, and so on until the match succeeds. In contrast, the greedy production .* first attempts to match the entire input, and then if that fails, tries one character less.
This concept only applies to regular expression engines that use recursive backtracking to match ambiguous expressions. In theory, they match exactly the same sentences, but since they try different things first, it's likely that one will be much quicker than the other.
This can also be useful when capture groups (in recursive and NFA style engines equally) are used to extract information from the matching action. For instance, an expression like
"(.*?)"
can be used to capture a quoted string. Since the subgroup is non-greedy, you can be sure that no quotes will be captured, and the subgroup contains only the desired content.
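A short Python illustration of the quoted-string case, with a made-up input string. It compares the non-greedy group with its greedy counterpart.

```python
import re

text = 'say "hi" and "bye"'

# Non-greedy: each group stops at the nearest closing quote.
lazy = re.findall(r'"(.*?)"', text)

# Greedy: the group runs on to the last quote in the string.
greedy = re.findall(r'"(.*)"', text)

print(lazy)
print(greedy)
```

With the greedy version the inner quotes end up inside the captured text, which is usually not what is wanted.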
.* is greedy, .*? is not. It only makes sense in context though. Given the patterns
<br/>(.*?)<br/> and <br/>(.*)<br/>, and the input <br/>test<br/>test2<br/>,
.* will match <br/>test<br/>test2<br/>,
.*? will only match <br/>test<br/>.
Note: don't ever use regex to parse complex html.
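The same comparison can be reproduced in Python with the input from this answer (the variable names are arbitrary):

```python
import re

html = "<br/>test<br/>test2<br/>"

# Greedy: .* runs to the last <br/> in the input.
greedy = re.search(r'<br/>(.*)<br/>', html)

# Non-greedy: .*? stops at the first <br/> that lets the match succeed.
lazy = re.search(r'<br/>(.*?)<br/>', html)

print(greedy.group(0), greedy.group(1))
print(lazy.group(0), lazy.group(1))
```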
I have seen several regular expressions that have two plusses in a row. What exactly does this mean? One or more of one or more of the pattern. If the pattern matches in the first place, why would the second match be necessary?
Examples:
[a-zA-Z0-9_]++
[^/.,;?]++
They're called possessive quantifiers.
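A possessive quantifier matches as much as it can and then refuses to give anything back when the rest of the pattern fails. Python's re module accepts the ++ syntax directly only from version 3.11 on, so the sketch below uses a common emulation instead, a lookahead plus a backreference, which is a swapped-in technique and not the ++ syntax itself. The patterns and inputs are invented for illustration.

```python
import re

# Ordinary greedy quantifier: \w+ takes everything, then backs off one
# character so that \d can match the final digit.
GREEDY = re.compile(r'\w+\d')

# Emulation of \w++\d: the lookahead captures the longest run of word
# characters, \1 consumes exactly that run, and nothing is given back.
POSSESSIVE_LIKE = re.compile(r'(?=(\w+))\1\d')

print(GREEDY.search("abc123"))
print(POSSESSIVE_LIKE.search("abc123"))
```

The greedy pattern matches "abc123", while the possessive-style one finds no match, because the word characters have already swallowed every digit. The usual reason to write ++ is to fail fast instead of retrying shorter matches.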