I need a little help with a regex for the following:
Alphanumeric, with only lowercase letters allowed
Starts with a number or a letter
Allows periods (.)
Doesn't allow consecutive periods (no ..)
Doesn't allow any other special characters
Thanks,
-GM
^(?!.*\.\.)[a-z0-9][a-z0-9.]*$
The negative lookahead at the beginning covers your 4th requirement: (?!.*\.\.) fails the match if two consecutive periods appear anywhere in the string. (Writing it as (?![^.]*\.\.) would only catch a double period at the first period in the string, so a.b..c would slip through.) Everything else should be pretty straightforward. ^ and $ are the beginning- and end-of-string anchors, and the character classes enforce the requirement that only lowercase letters, numbers, and . are allowed.
To add the length constraint (between 6 and 16 characters) just change the * to {5,15}. * means "repeat the previous element zero or more times", {n,m} means "repeat the previous element between n and m times (inclusive)". The reason {5,15} is used instead of {6,16} is that one character is already consumed by the first character class. Here is the end result:
^(?!.*\.\.)[a-z0-9][a-z0-9.]{5,15}$
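A quick way to sanity-check the length-constrained pattern is to run it against a few candidates. This is only a sketch in Python; the helper name is_valid_username and the sample strings are made up for illustration, and it assumes the lookahead is written as (?!.*\.\.) so that a double period is rejected wherever it occurs.

```python
import re

# Assumption: the lookahead is (?!.*\.\.), which scans the whole string
# for two consecutive periods rather than stopping at the first period.
USERNAME = re.compile(r"^(?!.*\.\.)[a-z0-9][a-z0-9.]{5,15}$")


def is_valid_username(name: str) -> bool:
    """Return True when name passes the pattern (6 to 16 characters)."""
    return USERNAME.match(name) is not None


# Hypothetical sample inputs, one per rule.
for candidate in ["john.smith", "a.b..cdef", ".johnsmith", "John.Smith", "abc"]:
    print(candidate, is_valid_username(candidate))
```

Only the first candidate should pass; the others break the consecutive-period, first-character, lowercase, and length rules respectively.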
Here's some assistance without giving away the answer, as you'll learn the most that way.
To match from a certain combination of characters, e.g. alphanumeric, use character classes, e.g. [a-z0-9]. Note that this expression matches exactly one character. You must use quantifiers to match more than one, e.g. +.
To "start" or "end" with something, you must use anchors, ^ and $, before the first or after the last character, respectively. (Watch out, though. In a character class, the ^ inverts the character class.)
In regex, . has a special meaning as a wildcard (matching any character besides newline characters). Therefore you have to escape them, \., to select the literal dot. Another way to escape the dot is to put it in a character class: [.].
Non-consecutiveness is trickier. You will need to look up more information about negative lookahead assertions (or lookaround assertions in general).
The terms to Google to learn more are: character classes, quantifiers, anchors, escaping, and negative lookahead assertions.
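To see the difference between the wildcard dot and the two ways of writing a literal dot, a small Python sketch may help. The three pattern names are made up for this illustration.

```python
import re

WILDCARD = re.compile(r"a.c")    # . matches any character except a newline
ESCAPED = re.compile(r"a\.c")    # \. matches only a literal period
IN_CLASS = re.compile(r"a[.]c")  # inside [...] the period is already literal

for text in ["a.c", "abc"]:
    print(
        text,
        bool(WILDCARD.fullmatch(text)),
        bool(ESCAPED.fullmatch(text)),
        bool(IN_CLASS.fullmatch(text)),
    )
```

The wildcard version accepts both strings, while the escaped and character-class versions accept only the one containing a real period.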
I'd say something along these lines: /^[a-z0-9]+(\.[a-z0-9]+)*\.?$/ (assuming that the line may end with a period)
Use this if the string may not end with a period:
/^[a-z0-9]+(\.[a-z0-9]+)*$/
or this if it may:
/^[a-z0-9]+(\.[a-z0-9]+)*\.?$/
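The two variants can be compared side by side. The sketch below is in Python, drops the / delimiters (Python does not use them), and the constant names are invented for the example.

```python
import re

# Same patterns as above, without the /.../ delimiters.
NO_TRAILING_DOT = re.compile(r"^[a-z0-9]+(\.[a-z0-9]+)*$")
TRAILING_DOT_OK = re.compile(r"^[a-z0-9]+(\.[a-z0-9]+)*\.?$")

for text in ["a.b.c", "a.b.", "a..b", ".ab"]:
    print(
        text,
        bool(NO_TRAILING_DOT.match(text)),
        bool(TRAILING_DOT_OK.match(text)),
    )
```

Because every period must be followed by at least one alphanumeric character (except the optional final one), consecutive periods are ruled out without any lookahead.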
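This is the most compact version (it also allows a trailing period), though the nested quantifiers can backtrack heavily on long inputs that fail to match: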
^([a-z0-9]+\.?)+$
I've run preg_quote('<>') to check if these characters need to be escaped in a regular expression, and to my surprise, they came back escaped: \<\>.
Why do these characters need to be escaped? What is their meaning in a regular expression?
< has significance when used to define lookbehinds
((?<!foo)bar matches bar that is not preceded by foo)
Both < and > are used to name subpatterns, like so:
preg_match("/(?<area>\d{3})-(?<sub>\d{3})-(?<num>\d{4})/",$number,$m);
// now elements of the US phone number are in $m['area'], $m['sub'] and $m['num']
So, because they can have significance when used in conjunction with other symbols, they are escaped.
It should be noted, however, that they have no meaning outside of a specific place in a subpattern, so if you're escaping manually you most likely won't need to escape them.
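For comparison, here is a sketch of the same two uses of < and > in Python. One difference to be aware of: Python's re module spells named groups (?P<name>...), whereas the PHP example above uses (?<name>...). The helper split_phone is a made-up name.

```python
import re

# Python spelling of the named groups from the PHP example above.
PHONE = re.compile(r"(?P<area>\d{3})-(?P<sub>\d{3})-(?P<num>\d{4})")

# Negative lookbehind: "bar" only when it is not preceded by "foo".
NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")


def split_phone(number: str) -> dict:
    """Return the named parts of a US-style phone number, or {} if none found."""
    m = PHONE.search(number)
    return m.groupdict() if m else {}


print(split_phone("555-123-4567"))
print(NOT_AFTER_FOO.search("foobar"), NOT_AFTER_FOO.search("bazbar"))

# Outside of those constructs, < and > are ordinary characters.
print(re.search(r"<b>", "some <b>bold</b> text") is not None)
```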
To expand further:
The documentation has a full list of characters that are escaped. Here I will list them, along with their meanings.
. Match any single character, other than newlines (unless the s modifier is set)
\ Escape the following character, or begin an escape sequence
+ Match one or more of the preceding character, class, or subpattern
* Match zero or more of the preceding character, class, or subpattern
? Makes the previous item optional, also used in subpatterns to define special behaviours such as "don't capture" ((?:foo)), "lookahead" ((?=foo) and (?!foo)), "lookbehind" ((?<=foo) and (?<!foo)), and many other uses besides.
[ and ] Define a character class, ie. a set of characters that may be matched. Most other symbols don't have meaning inside character classes.
^ and $ Match the start and end of the string respectively. When the m modifier is present, they also match the start and end of individual lines.
( and ) Define a subpattern, used alone for capturing or with ? for special behaviour. Also useful for applying quantifiers, such as in \d{1,3}(?:,\d{3})* to match thousand-separated numbers.
{ and } Manually quantify the previous item. Takes one or two numbers, separated by a comma. Examples include {3} to match exactly three times, {0,3} to match zero to three times, {3,} to match three or more times, and {3,8} to match three to eight times. (Older PCRE versions treat {,3}, with the lower bound left out, as literal text rather than a quantifier.)
= Used in lookahead assertions: foo(?=bar) matches foo, but only if it is followed by bar.
! Used in negative lookaround assertions: foo(?!bar) matches foo, but not if it is followed by bar.
< and > The subject of this question, see the start of the answer for info.
| Alternation, specifying a list of possibilities. It's kind of like a character class but for entire patterns instead of single characters. foo|bar matches "foo" or "bar". May also be seen as a special behaviour in subpatterns: (?|foo(bar)|bar(foo)) ensures that whatever bit falls in the parentheses will be in subpattern 1 (otherwise, bar would be in 1 if matched, foo would be in 2 if matched, and the unmatched one would be empty)
: Used in subpatterns to make them non-capturing. Essentially, the subpattern just becomes a "group of characters", which will typically be quantified. (?:foo) matches, but does not capture, "foo".
- Defines a range of characters in a character class. Has no meaning outside of one.
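A few of the entries above can be tried out directly. The following is a Python sketch rather than PHP; the constructs used here (quantifiers, non-capturing groups, lookahead, alternation) behave the same way in both, but the constant name THOUSANDS is invented for the example.

```python
import re

# Quantifiers plus a non-capturing group: thousand-separated numbers.
THOUSANDS = re.compile(r"\d{1,3}(?:,\d{3})*")

print(bool(THOUSANDS.fullmatch("1,234,567")))
print(bool(THOUSANDS.fullmatch("12,34")))

# (?:foo) groups without capturing, so only "bar" lands in a group.
print(re.match(r"(?:foo)(bar)", "foobar").groups())

# Lookahead: foo followed by bar, and foo NOT followed by bar.
print(re.search(r"foo(?=bar)", "foobar").group())
print(re.search(r"foo(?!bar)", "foobar"))

# Alternation between whole patterns.
print(bool(re.fullmatch(r"foo|bar", "bar")))
```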
When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when I try it in an online regexp matcher.
The site you reference, myregexp.com, is focussed on Java.
Java has a specific method, matches(), which only succeeds when the whole string matches the pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
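The effect of the anchors is easy to reproduce outside PHP. Here is a sketch in Python, where re.search looks for a match anywhere in the string and re.fullmatch plays roughly the role of Java's whole-string matching. The function names are invented for the example.

```python
import re


def unanchored(text: str) -> bool:
    """Succeeds for any string: some run of 0 to 5 characters always exists."""
    return re.search(r".{0,5}", text) is not None


def anchored(text: str) -> bool:
    """Succeeds only when the string has at most five characters."""
    return re.search(r"^.{0,5}$", text) is not None


def whole_string(text: str) -> bool:
    """Same check without anchors, using a whole-string match."""
    return re.fullmatch(r".{0,5}", text) is not None


for text in ["abcde", "abcdefgh"]:
    print(text, unanchored(text), anchored(text), whole_string(text))
```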
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
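Both options can be tried on a five-character string that contains a newline. This sketch uses Python, where the s modifier is spelled re.DOTALL; the variable names are made up.

```python
import re

text = "ab\ncd"  # five characters, one of them a newline

# Plain dot: the newline is not matched, so the length check fails.
plain = re.fullmatch(r".{0,5}", text)

# Option 1: the "s" modifier (re.DOTALL in Python) lets . match newlines.
dotall = re.fullmatch(r".{0,5}", text, re.DOTALL)

# Option 2: a class and its negation together match any character at all.
either = re.fullmatch(r"[\s\S]{0,5}", text)

print(plain is not None, dotall is not None, either is not None)

# The non-regex alternative suggested above: just measure the length.
print(len(text) <= 5)
```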
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/
Begins with an alphanumeric character: ^[a-z0-9]
Followed by an optional dot: \.?
If there is a dot, it MUST be followed by 2 to 4 letters: [a-z]{2,4}
It must end with a letter: [a-z]$
The only separator allowed is a dot, with two dots at most.
It's like domain names:
yahoo.co.uk or yahoo.com are fine, but yahoo.co.u and yahoo.co. are not.
You can group the optional dot with the 2-4 characters that must follow it: (\.[a-z]{2,4}). That said, you will have either none, or up to two of these groups of dot + alphabetic characters (\.[a-z]{2,4}){0,2}.
The must end with [a-z] part, you can check with a positive lookbehind (?<=[a-z]) giving this as the full regex:
^[a-z0-9]+(\.[a-z]{2,4}){0,2}(?<=[a-z])$
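As a rough check, the pattern can be run against the examples from the question. This sketch uses Python, whose re module supports this kind of fixed-width lookbehind; the helper name is_domain_like is made up.

```python
import re

# Pattern copied from above; (?<=[a-z]) asserts that the last character is a letter.
DOMAIN_LIKE = re.compile(r"^[a-z0-9]+(\.[a-z]{2,4}){0,2}(?<=[a-z])$")


def is_domain_like(text: str) -> bool:
    """Return True when text fits the domain-like rules in the question."""
    return DOMAIN_LIKE.match(text) is not None


for text in ["yahoo.co.uk", "yahoo.com", "yahoo", "yahoo2", "yahoo.co.u", "yahoo.co."]:
    print(text, is_domain_like(text))
```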
This will work in Perl and PHP regexes (PCRE), but not in older JavaScript engines, because lookbehind was only added to JavaScript in ES2018. In this specific case, you can work around that limitation.
If there is at least one dot, there's already a guarantee that it will end in [a-z], because that test is in the group that the dot is a part of. If there is no dot, you need to force a [a-z] at the end. To do this you can turn the one-or-more quantifier (+) into a zero-or-more (*) and force the end to be an [a-z] when there are no "dot groups". When there are dot groups, you can keep the same pattern, but now with at least one mandatory dot.
^([a-z0-9]*[a-z]|[a-z0-9]+(\.[a-z]{2,4}){1,2})$
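The lookbehind-free version can be checked the same way. One assumption in this Python sketch: the second alternative is written with the + directly after the first character class, that is [a-z0-9]+(\.[a-z]{2,4}){1,2}, since a quantifier cannot open a group. The helper name is invented.

```python
import re

# Assumed spelling: either "no dots, ends in a letter" or
# "alphanumerics followed by one or two dot groups".
NO_LOOKBEHIND = re.compile(r"^([a-z0-9]*[a-z]|[a-z0-9]+(\.[a-z]{2,4}){1,2})$")


def is_domain_like_plain(text: str) -> bool:
    """Same rules as the lookbehind version, using alternation only."""
    return NO_LOOKBEHIND.match(text) is not None


for text in ["yahoo.co.uk", "yahoo.com", "yahoo", "yahoo2", "yahoo.co.u", "yahoo.co."]:
    print(text, is_domain_like_plain(text))
```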
This checks for a string that begins with one or more characters from [a-z0-9] and then contains one or two dots, each followed by 2 to 4 letters. It works (in Python, at least) for the examples you provided (true for yahoo.co.uk and yahoo.com, false for yahoo.co.u and yahoo.co.)
^[a-z0-9]+(\.[a-z]{2,4}){1,2}$
Edit - upon re-reading, I think you may want this instead:
^[a-z0-9]*([a-z0-9](\.[a-z]{2,4}){1,2}$|[a-z]$)
This will match strings (in addition to the above) that do not include dots but end with a letter, such as yahoo, but not yahoo2.
Try this:
^[a-z0-9](\.[a-z]{2,4}|.*[a-z]$)
^[a-z0-9](?=[^.]*\.[^.]+$|[^.]*\.[^.]\.[^.]+$)(\.(?=[a-z][a-z]){1,2}).*[a-z]$
Regex is blowing my mind. How can I change this to validate emails with a plus sign, so I can sign up with test+spam#gmail.com?
if(!preg_match("/^[_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*$/i", $_GET['em'])) {
It seems like you aren't really familiar with what your regex is currently doing, which would be a good first step before modifying it. Let's walk through your regex using the email address john.robert.smith#mail.com (each item below notes the part of the address that it matches):
1. ^ is the start-of-string anchor. It specifies that any match must begin at the beginning of the string. If the pattern is not anchored, the regex engine can match a substring, which is often undesired. Anchors are zero-width, meaning that they do not consume any characters.
2. [_a-z0-9-]+ is made up of two elements, a character class and a repetition modifier:
   [...] defines a character class, which tells the regex engine that any of these characters is a valid match. In this case the class contains the letters a-z, the digits 0-9, the dash and the underscore (in general, a dash in a character class defines a range, so you can use a-z instead of abcdefghijklmnopqrstuvwxyz; when given as the last character in the class, it acts as a literal dash).
   + is a repetition modifier that specifies that the preceding token (in this case, the character class) can be repeated one or more times. There are two other repetition operators: * matches zero or more times; ? matches zero or one time (i.e. makes something optional).
   (matches "john" in john.robert.smith#mail.com)
3. (\.[_a-z0-9-]+)* again contains a repeated character class. It also contains a group and an escaped character:
   (...) defines a group, which allows you to group multiple tokens together (in this case, the group will be repeated as a whole). Let's say we wanted to match 'abc' zero or more times (i.e. abcabcabc matches, abcccc doesn't). If we tried to use the pattern abc*, the repetition modifier would only apply to the c, because c is the last token before the modifier. To get around this, we can group abc ((abc)*), in which case the modifier applies to the entire group, as if it were a single token.
   \. specifies a literal dot character. This is needed because . is a special character in regex, meaning any character. Since we want to match an actual dot character, we need to escape it.
   (matches ".robert.smith" in john.robert.smith#mail.com)
4. # is not a special character in regex, so, like all other non-special characters, it matches literally.
   (matches "#" in john.robert.smith#mail.com)
5. [a-z0-9-]+ again defines a repeated character class, like item #2 above.
   (matches "mail" in john.robert.smith#mail.com)
6. (\.[a-z0-9-]+)* is almost exactly the same pattern as #3 above.
   (matches ".com" in john.robert.smith#mail.com)
7. $ is the end-of-string anchor. It works the same as ^ above, except that it matches the end of the string.
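One way to see which part of the address each section of the pattern picks up is to give every section a named group. This is a sketch in Python, not the original PHP; the group names (first, dotted, domain, domain_rest) are made up, and the repeated groups are wrapped in an outer group so that all repetitions are reported rather than just the last one.

```python
import re

# Same pattern as in the question, with each section wrapped in a named group.
# re.IGNORECASE stands in for the /i modifier.
EMAIL_PARTS = re.compile(
    r"^(?P<first>[_a-z0-9-]+)"
    r"(?P<dotted>(?:\.[_a-z0-9-]+)*)"
    r"#"
    r"(?P<domain>[a-z0-9-]+)"
    r"(?P<domain_rest>(?:\.[a-z0-9-]+)*)$",
    re.IGNORECASE,
)

parts = EMAIL_PARTS.match("john.robert.smith#mail.com").groupdict()
for name, value in parts.items():
    print(name, repr(value))
```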
With that in mind, it should be a bit clearer how to add a section that captures a plus segment. As we saw above, + is a special character, so it has to be escaped. Then, since the + has to be followed by some characters, we can define a character class with the characters we want to match and define its repetition. Finally, we should make the whole group optional because email addresses don't need to have a + segment:
(\+[a-z0-9-]+)?
When inserted into your regex, it'd look like this:
/^[_a-z0-9-]+(\.[_a-z0-9-]+)*(\+[a-z0-9-]+)?#[a-z0-9-]+(\.[a-z0-9-]+)*$/i
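To confirm that the extended pattern accepts a plus segment, here is a small Python sketch. It keeps the pattern exactly as written above (including the # separator), uses re.IGNORECASE in place of the /i modifier, and the helper name accepts is invented.

```python
import re

# Pattern copied from above, minus the /.../i delimiters.
WITH_PLUS = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*(\+[a-z0-9-]+)?#[a-z0-9-]+(\.[a-z0-9-]+)*$",
    re.IGNORECASE,
)


def accepts(address: str) -> bool:
    """Return True when the address passes the extended pattern."""
    return WITH_PLUS.match(address) is not None


for address in ["test+spam#gmail.com", "john.robert.smith#mail.com", "test+#gmail.com"]:
    print(address, accepts(address))
```

The last address is rejected because the + must be followed by at least one character from the class.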
Save your sanity. Get a pre-made PHP RFC 822 Email address parser
I've used this regex to validate emails, and it works just fine with emails that contain a +:
/^(([^<>()[\]\\.,;:\s#\"]+(\.[^<>()[\]\\.,;:\s#\"]+)*)|(\".+\"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/
\+ will match a literal + sign, but be aware: You still won't be close to matching all possible email addresses according to the RFC spec, because the actual regex for that is madness. It's almost certainly not worth it; you should use a real email parser for this.
This is another solution (similar to the one found by David):
//Escaped for .Net
^[_a-zA-Z0-9-]+((\\.[_a-zA-Z0-9-]+)*|(\\+[_a-zA-Z0-9-]+)*)*#[a-zA-Z0-9-]+(\\.[a-zA-Z0-9-]+)*(\\.[a-zA-Z]{2,4})$
//Native
^[_a-zA-Z0-9-]+((\.[_a-zA-Z0-9-]+)*|(\+[_a-zA-Z0-9-]+)*)*#[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$
This is another solution:
/^[_a-z0-9-+]+(\.[_a-z0-9-+]+)*(\+[a-z0-9-]+)?#[a-z0-9-.]+(\.[a-z0-9]+)$/
or, for a Razor page, with the separator written as \u0040:
/^[_a-z0-9-+]+(\.[_a-z0-9-+]+)*(\+[a-z0-9-]+)?\u0040[a-z0-9-.]+(\.[a-z0-9]+)$/
I'm trying to match all occurrences of "string" in something like the following sequence, except those inside ##
as87dio u8u u7o #string# ou os8 string os u
i.e. the second occurrence should be matched but not the first
Can anyone give me a solution?
You can use negative lookahead and lookbehind:
(?<!#)string(?!#)
EDIT
NOTE: As per Mark's comments below, this would not match #string or string#.
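The lookaround version can be tried on the sample sequence from the question. This is a sketch in Python, whose re module supports both lookbehind and lookahead; the names text, OUTSIDE_HASHES and starts are made up.

```python
import re

text = "as87dio u8u u7o #string# ou os8 string os u"

# "string" that is neither preceded nor followed by #.
OUTSIDE_HASHES = re.compile(r"(?<!#)string(?!#)")

# Start offsets of every match; only the second occurrence should be found.
starts = [m.start() for m in OUTSIDE_HASHES.finditer(text)]
print(starts == [text.rfind("string")])

# As the note says, a # on just one side is enough to block the match.
print(OUTSIDE_HASHES.search("#string"), OUTSIDE_HASHES.search("string#"))
```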
You can try:
(?:[^#])string(?:[^#])
OK,
If you want to NOT match a character, you put it in a character class (square brackets) and start the class with the ^ character, which negates it. For example, [^a] means any character but a lowercase 'a'.
So if you want a character that is not the delimiter, followed by string, followed by another character that is not the delimiter, you want
[^#]string[^#]
Now, the problem is that the character classes will each match a character, so in your example we'd get " string ", which includes the leading and trailing whitespace. You can wrap each class in a non-capturing group, parens with a ?: at the beginning, (?: ), so that it can be combined with alternatives without creating a capture group. Be aware that the group still consumes its character; to get just the word, put string in a capturing group of its own.
(?:[^#])string(?:[^#])
OK, but now it doesn't match at the start of the string (which, confusingly, is the ^ character doing double duty outside a character class) or at the end of the string, $. So we have to use the OR character | to say "give me a non-delimiter character OR start of string" and, at the end, "give me a non-delimiter character OR end of string", like this:
(?:[^#]|^)string(?:[^#]|$)
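Here is a sketch of how this pattern behaves in Python. One addition that is not in the pattern above: string is wrapped in a capturing group, because the non-capturing groups still consume the neighbouring characters and the overall match would otherwise include them. The names are invented for the example.

```python
import re

text = "as87dio u8u u7o #string# ou os8 string os u"

# Same pattern as above, plus a capturing group around the word itself.
ALTERNATION = re.compile(r"(?:[^#]|^)(string)(?:[^#]|$)")

# start(1) is the offset of the captured word, not of the consumed neighbour.
word_starts = [m.start(1) for m in ALTERNATION.finditer(text)]
print(word_starts == [text.rfind("string")])

# The whole match still includes the surrounding characters.
print(repr(ALTERNATION.search(text).group(0)))

# Start and end of string are handled by the ^ and $ alternatives.
print(ALTERNATION.search("string") is not None)
```

One design consequence worth knowing: because each match consumes the character after the word, two occurrences separated by a single character (as in "string string") yield only one match with this pattern, whereas the lookaround version finds both.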
EDIT: The negative lookbehind and lookahead approach is a simpler (and clever) solution, but it is not available in all regular expression engines.
Now a follow-up question. If you had the word "astringent" would you still want to match the "string" inside? In other words, does "string" have to be a word by itself? (Despite my initial reaction, this can get pretty complicated :) )