I'm trying to create a PHP PCRE regex that is (almost) fully compatible with RFC5321 and 5322 to test email addresses. The only thing I don't require is the (comment) part. I've seen some other attempts at this posted on here, but when I run tests vs. them they don't all work.
I have been working on one that is very close:
^(([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})|("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}"))#(([\w\-]*\.?[\w\-]*)|(\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])|(\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\]))$
To break it down:
Local part:
(
Match at most 64 of the allowed characters
([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})
|
OR match the same set of characters in a quoted string:
("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}")
)
end local part.
match # sign
#
match domain part:
(
match domain part using allowed characters:
([\w\-]*\.?[\w\-]*)
or IPv4 (it doesn't check that each octet is at most 255 - that would be handled elsewhere)
(\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])
or ipv6
(\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\])
)
The only thing it's missing is the ability to check for multiple consecutive .'s (periods) that are outside a quoted local-part. I ran tests on regex101.com against all the addresses below, using some of my own tests and the tests in the Wikipedia article about email addresses:
bob#smith.com
bob.smith#smith.com
bob-smith#smith.com
bob-smith#bob-smith.com
b0b!-...smith#smith.com <-DOES NOT VALIDATE CORRECTLY - MULTIPLE .'s
bob&smith#smith.com
"bob..smith"#smith.com
simple#example.com
very.common#example.com
disposable.style.email.with+symbol#example.com
other.email-with-hyphen#example.com
fully-qualified-domain#example.com
user.name+tag+sorting#example.com
x#example.com
example-indeed#strange-example.com
admin#mailserver1
example#s.example
" "#example.org
"john..doe"#example.org
Abc.example.com
A#b#c#example.com
a"b(c)d,e:f;g<h>i[j\k]l#example.com
just"not"right#example.com
this is"not\allowed#example.com
this\ still\"not\\allowed#example.com
1234567890123456789012345678901234567890123456789012345678901234+x#example.com
john..doe#example.com <-DOES NOT VALIDATE CORRECTLY - MULTIPLE .'s
john.doe#example..com
I attempted to use lookahead and lookbehind assertions to test for the consecutive periods, but I couldn't figure it out. I think that's the only thing it's missing (other than the comments, which for my purposes aren't required).
Is there a way to check for the periods that wouldn't alter what I currently have too much, or would it require a different approach?
Please let me know if I missed anything else.
Thank you.
You may add (?!("[^"]*"|[^"])*\.{2}) after ^.
See the regex demo.
The (?!("[^"]*"|[^"])*\.{2}) negative lookahead fails the match if, immediately to the right of the current location, there is:
("[^"]*"|[^"])* - zero or more occurrences of either a quoted string (a ", then zero or more chars other than ", then a closing ") or any single char other than "
\.{2} - followed by two consecutive dots.
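To see the lookahead in isolation, here is a minimal PHP sketch. The helper name hasNoConsecutiveDots is made up for illustration, the pattern contains only the lookahead (not the full address regex), and the sample addresses are written with # exactly as in the question's test list.

```php
<?php
// Hypothetical helper: true when $address contains no ".." outside a quoted string.
// Only the lookahead is tested here, not the rest of the address pattern.
function hasNoConsecutiveDots(string $address): bool
{
    return preg_match('/^(?!(?:"[^"]*"|[^"])*\.{2})/', $address) === 1;
}

var_dump(hasNoConsecutiveDots('bob.smith#smith.com'));     // bool(true)
var_dump(hasNoConsecutiveDots('john..doe#example.com'));   // bool(false)
var_dump(hasNoConsecutiveDots('"john..doe"#example.org')); // bool(true), dots are quoted
var_dump(hasNoConsecutiveDots('john.doe#example..com'));   // bool(false)
```

The quoted-string alternative is tried first, so a ".." inside quotes is consumed as a unit and can never satisfy the \.{2} part.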
I would recommend you read this. Suffice it to say that writing a regex that will work 100% is impossible.
I've written a non-Regex implementation here. If you port this to php and file an issue on my github page or send me an email (listed on my github page), I will happily link to it.
As you can tell from the unit tests, it's comprehensive enough to work with EAI addresses as well.
I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks, and question marks, but I also want to limit the input to 4000 characters. I have come up with the following regular expression to achieve this: /^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i.
However, when I attempt to use this regular expression to test a subject in PHP with preg_match(), I am given a warning: PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 and the subject fails to be tested.
I find this strange because if I use an infinite quantifier, the test passes just fine (I demonstrate this situation below).
Why is limiting the repetition to 4000 a problem, but infinite repetition not?
regex-test.php:
<?php
$infinite = "/^([a-z]|[0-9]| |,|'|\.|!|\?)*$/i"; // Allows infinite repetition
$fourk = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i"; // Limits repetition to 4000
$string = "I like apples.";
if ( preg_match($infinite, $string) ){
echo "Passed infinite repetition. \n";
}
if ( preg_match($fourk, $string) ){
echo "Passed maximum repetition of 4000. \n";
}
?>
echos:
Passed infinite repetition
PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 in regex-test.php on line 16
The error is due to its LINK_SIZE, with offset values limiting the compiled pattern size to 64K. This is an expected behavior, explained below, and it's not because of a limit in repetition nor how the pattern is interpreted when compiled.
In this case
As Alan Moore pointed out in his answer, all characters should be in the same character class. I'm more drastic, so allow me to say that pattern is so wrong it makes me cringe.
-No offense, most of us tried that once too. It's just an attempt to underline that in no way such constructs should be used.
There are 3 common pitfalls here in (x|y|z){1,4000}:
Capturing subpatterns should only be used when needed (to store a specific part of the matched text, in order to extract that value or to use it in a backreference). For all other use cases, stick to non-capturing groups or atomic groups. They perform better and save memory.
Capturing subpatterns should not be repeated because the last repetition overwrites the captured text.
-OK, it could be used only in very particular cases.
Alternation (with the |s) adds backtracking states. It's a good practice to try to reduce them as much as you can. In this case, the regex /^[ !',.0-9?A-Z]{1,4000}$/i would match exactly the same, not only avoiding the error, but also providing better performance.
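As a quick check of the single-character-class version, a minimal sketch (the sample subjects are my own, not from the question):

```php
<?php
// One character class instead of a repeated capturing alternation.
$pattern = "/^[ !',.0-9?A-Z]{1,4000}$/i";

var_dump(preg_match($pattern, "I like apples."));       // int(1)
var_dump(preg_match($pattern, str_repeat("a", 4001)));  // int(0), one character too long
var_dump(preg_match($pattern, "semi;colon"));           // int(0), ';' is not allowed
```

Because the repeated item is a single class rather than a group, the compiled pattern stays small regardless of the upper bound.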
LINK_SIZE
From "Handling Very Large Patterns" in pcrebuild man page:
Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an
alternation metacharacter). By default, in the 8-bit and 16-bit
libraries, two-byte values are used for these offsets, leading to a
maximum size for a compiled pattern of around 64K.
That means the compiled pattern stores an offset value for every subpattern in the alternation, for every repetition of the group. In this case the offsets leave no memory for the rest of the compiled pattern.
This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:
PCRE keeps offsets in its compiled code as 2-byte quantities (always
stored in big-endian order) by default. These are used, for example,
to link from the start of a subpattern to its alternatives and its
end. The use of 2 bytes per offset limits the size of the compiled
regex to around 64K, which is big enough for almost everybody.
Using pcretest, I get the following information:
PCRE version 8.37 2015-04-28
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
There's a Windows version you can download from RexEgg.com.
Regarding other size limitations in PCRE, you can check this post of mine.
Overriding the default LINK_SIZE in PHP
If we had a true reason to use a huge pattern, and this pattern could not be simplified any further by any means, the link size could be increased. However, you can only achieve this by recompiling PHP yourself (therefore, your code won't be portable from then on). It should be the last resort, when there's no other choice.
Also commented in pcre_internal.h:
The macros are controlled by the value of LINK_SIZE.
This defaults to 2 in the config.h file,
but can be overridden by using -D on the command line.
This is automated on Unix systems via the "configure" command.
PCRE link size can be configured to 3 or 4:
./configure -DLINK_SIZE=4
But keep in mind that longer offsets require additional data, and it will slow down all calls to preg_* functions.
In case of compiling PHP on your own, see Installation on Unix systems or Build your own PHP on Windows.
Looking at the 'regex' engine php uses, pcre here: http://pcre.sourceforge.net/pcre.txt at the limitations section it states:
The maximum length of a compiled pattern is 65539 (sic)
bytes
My guess is that some regex like this:
(123){1,3}
is compiled to something like this
(123)(123)?(123)?
Which makes it bigger than the maximum length
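A small sketch of that guess: the compact and the expanded form accept exactly the same subjects. It only shows that the two are equivalent in what they match; it does not prove how PCRE stores the pattern internally. The variable names and sample subjects are my own.

```php
<?php
// Compact form versus the hand-expanded form guessed at above.
$compact  = '/^(?:123){1,3}$/';
$expanded = '/^(?:123)(?:123)?(?:123)?$/';

foreach (['', '123', '123123', '123123123', '123123123123', '12'] as $s) {
    $same = preg_match($compact, $s) === preg_match($expanded, $s);
    echo var_export($s, true), ' => ', $same ? 'same result' : 'different result', "\n";
}
```

If the expansion does happen at compile time, the size of the compiled pattern grows with the upper bound of the quantifier multiplied by the size of the group, which would fit the behaviour described in the question.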
While I agree that the regex compiler shouldn't behave that way, you really shouldn't have encountered this problem. Inside the parentheses, your regex matches exactly one character from a specific set--the definition of a character class. The correct way to write your regex is to list all the characters inside one set of square brackets and forgo the parentheses:
/^[a-z0-9 ,'.!?]{1,4000}$/i
That works fine, as this demo shows. However, it was the parentheses that were causing the error (even non-capturing parens cause it), and that doesn't seem right to me, even if they were unnecessary.
For me the problem was an unescaped ? character.
You need to escape it with not one, but two backslashes: \\
My regexp went from (?340202) to (\\?340202)
I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if later the code is modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.
Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
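A short sketch that makes the difference visible (the sample subjects are my own):

```php
<?php
// Inside a character class, | is just another literal character.
var_dump(preg_match('/^[A|B]$/', '|'));  // int(1)
var_dump(preg_match('/^[AB]$/', '|'));   // int(0)

// The original pattern: the outer | splits it into four alternatives, and the
// two middle ones are not anchored, so a lone C anywhere is enough to match.
var_dump(preg_match('/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m', 'XXCXX')); // int(1)

// The character-class version only accepts well-formed lines.
var_dump(preg_match('/^[AB][CD][EF][GH]$/m', 'XXCXX')); // int(0)
var_dump(preg_match('/^[AB][CD][EF][GH]$/m', 'ACFG'));  // int(1)
```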
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC|A.F|A..G|.CF|.C.G|..FG)/ (with no $, since the shortened alternatives do not consume the whole line and every line is known to be exactly 4 characters long), but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C|.F|..G)|.C(?:F|.G)|..FG)/ - but that's still the same solution, just in a more efficiently-processed form.)
You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
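function findMatchingLines($line, $minMatches = 2) {
    static $file = null;
    // Load the file once; drop the trailing newlines and any empty lines.
    if( $file === null) $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $search = str_split($line);
    $result = array();
    foreach($file as $l) {
        // array_intersect_assoc compares index => value pairs,
        // so only characters in the same position count as common.
        $matches = count(array_intersect_assoc($search, str_split($l)));
        if( $matches >= $minMatches) // number of matches required
            $result[] = $l;
    }
    // empty array if nothing matched
    return $result;
}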
There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
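Generating that alternation does not have to be done by hand. A minimal sketch, assuming the line contains only letters (so nothing needs escaping); buildPairRegex is a made-up name, not something from the question:

```php
<?php
// Hypothetical helper: one alternative per pair of positions, e.g. "A.F."
// Assumes $line holds only letters, so no regex escaping is required.
function buildPairRegex(string $line): string
{
    $len = strlen($line);
    $alternatives = [];
    for ($i = 0; $i < $len; $i++) {
        for ($j = $i + 1; $j < $len; $j++) {
            $alt = str_repeat('.', $len); // start with "...."
            $alt[$i] = $line[$i];         // pin the two chosen positions
            $alt[$j] = $line[$j];
            $alternatives[] = $alt;
        }
    }
    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

$regex = buildPairRegex('ACFG');
echo $regex, "\n"; // /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m

preg_match_all($regex, "BCEG\nADFG\nBCFG\nBDFG\nBDEH", $m);
print_r($m[0]); // BCEG, ADFG, BCFG, BDFG
```

The number of alternatives grows as "length choose 2", which is 6 for 4 characters but already 120 for 16, so this is the scaling problem the question mentions.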
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
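function findMatchingLines($line, $subject) {
    // PCRE lookbehinds must be fixed-length, so the reference line is appended
    // (not prepended) and captured by a lookahead from the start of each line.
    $regex = '/^(?=(?s:.*)\n([AB])([CD])([EF])([GH])\z)'
           . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    $matchingLines = array();
    preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matchingLines);
    return $matchingLines[0];
}
What this function does is append the line you want to match against to your input string, then use a pattern that, at the start of every line, looks ahead to the final line (that's the (?s:.*)\n before the four groups and the \z after them working), captures its 4 characters, and compares the current line back to them. The appended line itself is never returned, because no newline follows it. Note that the lookahead rescans to the end of the subject for every line, so this is slow on a big file.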
If you also want to validate those matching lines against the "rules", just replace the . in each pattern with the appropriate character class (\1\2[EF][GH], etc.).
People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D in the first position, OR E or | or F in the first position, OR G or | or H in the first position.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer |s alternate between everything around them. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from #Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C|.E|..G)|.C(?:E|.G)|..EG)/
as compared to:
/^(?:AC|A.E|A..G|.CE|.C.G|..EG)/
(Neither pattern ends with $: the alternatives stop before the end of the line, and every line is known to be exactly 4 characters long.)
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for EG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);
I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
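/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m
It looks for at least two occurrences of one of the ACFG characters, surrounded by any other characters on the same line (the \n inside the negated classes keeps a match from running across lines). The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
    // The quantifier is concatenated: inside double quotes, "{$num}" is PHP's
    // complex interpolation syntax and would silently drop the braces.
    return "/^[^$line\\n]*+(?:[$line][^$line\\n]*+){" . $num . ",}$/m";
}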
It is a well-known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example, you can parse the classical example of a context-free grammar {a^n b^n; n>0} (e.g. aaabbb) using this regex (demo):
~^(a(?1)?b)$~
My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {a^n b^n c^n; n>0} (e.g. aaabbbccc) using PCRE?
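For reference, a minimal sketch of the context-free case in PHP, using the regex quoted above (the sample subjects are my own):

```php
<?php
// (?1) recurses into group 1, so every extra "a" demands one extra "b".
$regex = '~^(a(?1)?b)$~';

var_dump(preg_match($regex, 'ab'));     // int(1)
var_dump(preg_match($regex, 'aaabbb')); // int(1)
var_dump(preg_match($regex, 'aab'));    // int(0)
var_dump(preg_match($regex, 'abb'));    // int(0)
var_dump(preg_match($regex, ''));       // int(0)
```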
Inspired by NullUserException's answer (which he already deleted as it failed for one case) I think I have found a solution myself:
$regex = '~^
(?=(a(?-1)?b)c)
a+(b(?-1)?c)
$~x';
var_dump(preg_match($regex, 'aabbcc')); // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc')); // 0
var_dump(preg_match($regex, 'aaaccc')); // 0
var_dump(preg_match($regex, 'aabcc')); // 0
var_dump(preg_match($regex, 'abbcc')); // 0
Try it yourself: http://codepad.viper-7.com/1erq9v
Explanation
If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:
~^a+(b(?-1)?c)$~
This does nothing more than check that there's an arbitrary number of as, followed by an equal number of bs and cs.
This doesn't yet satisfy our grammar, because the number of as must be the same, too. We can ensure that by checking that the number of as equals the number of bs. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so we don't only match a part of the bs.
Conclusion
I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular"
Here is an alternative solution using balancing groups with .NET regex:
^(?'a'a)+(?'b-a'b)+(?(a)(?!))(?'c-b'c)+(?(b)(?!))$
Not PCRE, but may be of interest.
Example at: http://ideone.com/szhuE
Edit: Added the missing balancing check for the group a, and an online example.
My question is: How far can you go?
In the interests of not creating code that is an unreadable morass of punctuation, I'm going to risk the downvotes and answer a different, though very much related, question: how far should you go?
Regular expression parsers are a brilliant thing to have in your toolkit but they are not the be all and end all of programming. The ability to write parsers in a readable manner is also a brilliant thing to have in your toolkit.
Regular expressions should be used right up to the point where they start making your code hard to understand. Beyond that, their value is dubious at best, damaging at worst. For this specific case, rather than using something like the hideous:
~^(?=(a(?-1)?b)c)a+(b(?-1)?c)$~x
(with apologies to NikiC), which the vast majority of people trying to maintain it are either going to have to replace totally or spend substantial time reading up on and understanding, you may want to consider something like a non-RE, "proper-parser" solution (pseudo-code):
# Match "aa...abb...bcc...c" where:
# - same character count for each letter; and
# - character count is one or more.
def matchABC (string str):
# Init string index and character counts.
index = 0
dim count['a'..'c'] = 0
# Process each character in turn.
for ch in 'a'..'c':
# Count each character in the subsequence.
while index < len(str) and str[index] == ch:
count[ch]++
index++
# Failure conditions.
if index != len(str): return false # did not finish string.
if count['a'] < 1: return false # too few a characters.
if count['a'] != count['b']: return false # inequality a and b count.
if count['a'] != count['c']: return false # inequality a and c count.
# Otherwise, it was okay.
return true
This will be far easier to maintain in the future. I always like to suggest to people that they should assume those coming after them (who have to maintain the code they write) are psychopaths who know where you live - in my case, that may be half right, I have no idea where you live :-)
Unless you have a real need for regular expressions of this kind (and sometimes there are good reasons, such as performance in interpreted languages), you should optimise for readability first.
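Since the question is about PHP, here is one way the pseudo-code above could look as runnable PHP. It is a sketch that follows the same steps and keeps the name matchABC; it is not meant as the only way to write it.

```php
<?php
// Match "aa...abb...bcc...c" with the same, non-zero count for each letter.
function matchABC(string $str): bool
{
    $index = 0;
    $len   = strlen($str);
    $count = ['a' => 0, 'b' => 0, 'c' => 0];

    // Consume the run of each letter in turn, counting as we go.
    foreach (['a', 'b', 'c'] as $ch) {
        while ($index < $len && $str[$index] === $ch) {
            $count[$ch]++;
            $index++;
        }
    }

    return $index === $len                 // the whole string was consumed
        && $count['a'] >= 1                // at least one of each letter
        && $count['a'] === $count['b']
        && $count['a'] === $count['c'];
}

var_dump(matchABC('aaabbbccc')); // bool(true)
var_dump(matchABC('aaabbbcc'));  // bool(false)
var_dump(matchABC('abcabc'));    // bool(false)
```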
Qtax Trick
A solution that wasn't mentioned:
^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$
See what matches and fails in the regex demo.
This uses self-referencing groups (an idea #Qtax used on his vertical regex).
I have just started learning to code both PHP as well as HTML and had a look at a few tutorials on regular expressions however have a hard time understanding what these mean. I appreciate any help.
For example, I would like to validate the email address peanuts#monkey.com. I start off with the code and I get the message invalid email address.
What am I doing wrong?
I know that the metacharacters such as ^ denote the start of a string and $ denote the end of a string however what does this mean? What is the start of a string and what is the end of a string?
When do I group regular expressions?
$emailaddress = 'peanuts#monkey.com';
if(preg_match('/^[a-zA-z0-9]+#[a-zA-z0-9]+\.[a-zA-z0-9]$/', $emailaddress)) {
echo 'Great, you have a valid email address';
} else {
echo 'boo hoo, you have an invalid email address';
}
What you have written works with some small modifications, if that is what you want to use; however, you are missing a '+' at the end:
^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[a-zA-Z0-9]+$
The caret and dollar characters match positions rather than characters: ^ matches the beginning of the line and $ matches the end of the line; they are used to anchor your regex. If you write your regex without those two, you will match email addresses anywhere in your text, not only an email address that is alone on a line. If you had written only the ^ (caret), you would have found only the email addresses at the start of a line, and if you had written only the $ (dollar), you would have found only the email addresses at the end of a line.
Blah blah blah someEmail#email.com
blah blah
would not give you a match, because you do NOT have an email address at the beginning of the line and the line does not end with it either, so in order to match it in this context you would have to drop ^ and $.
Grouping is used for two reasons as far I know: Back referencing and... grouping. Grouping is used for the same reasons as in math, 1 + 3 * 4 is not the same as (1 + 3) * 4. You use parentheses to constrain quantifiers such as '+', '*' and '?' as well as alternation '|' etc.
You also use parentheses for back referencing, but since I can't explain it better, I would link you to: http://www.regular-expressions.info/brackets.html
I would encourage you to take a look at this book; even if you only read the first 2-3 chapters you will learn a lot, and it is a great book! http://oreilly.com/catalog/9781565922570
And as the commentators say, this regex is not perfect but it works and show you what you had forgotten. You were not far away!
UPDATED as requested:
The '+', '*' and '?' are quantifiers, and they are also a good example of where you group.
'+' means match the preceding character or group 1 to n times.
'*' means match the preceding character or group 0 to n times.
'?' means match the preceding character or group 0 or 1 time.
Here n times means indefinitely (no upper limit).
The reason you use [a-zA-Z0-9]+ is that without the '+' it will match only one character. With + it will match many, but it must match at least one. With * it matches many but also zero, and ? will match at most 1 character but also zero.
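A few lines of PHP show what the quantifiers apply to, with and without a group (the sample subjects are my own):

```php
<?php
// Without a group, + applies only to the character right before it.
var_dump(preg_match('/^ab+$/', 'abbb'));   // int(1): a, then one or more b
var_dump(preg_match('/^ab+$/', 'abab'));   // int(0)

// With a group, + applies to the whole "ab".
var_dump(preg_match('/^(ab)+$/', 'abab')); // int(1)

// * and ? also accept zero occurrences.
var_dump(preg_match('/^(ab)*$/', ''));     // int(1)
var_dump(preg_match('/^(ab)?$/', 'abab')); // int(0): at most one "ab"
```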
Your regex doesn't match email addresses. Try this one:
/\b[\w\.-]+#[\w\.-]+\.\w{2,4}\b/
I recommend you read through this tutorial to learn about Regular Expressions.
Also, RegExr is great for testing them out.
As for your second question; the ^ character means that the regular expression must start matching from the first character in the string you input. The $ means that the regular expression must end at the final character in the string you input. In essence, this means that your regular expression will match the following string:
peanuts#monkey.com
but NOT the following string:
My email address is peanuts#monkey.com, and I love it!
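The effect of the anchors can be checked with a short sketch. The pattern is the corrected one from this thread, and the separator is written as # exactly as in the examples above; the variable names are my own.

```php
<?php
$anchored   = '/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[a-zA-Z0-9]+$/';
$unanchored = '/[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[a-zA-Z0-9]+/';

$alone    = 'peanuts#monkey.com';
$sentence = 'My email address is peanuts#monkey.com, and I love it!';

var_dump(preg_match($anchored, $alone));    // int(1)
var_dump(preg_match($anchored, $sentence)); // int(0): text before and after

// Without ^ and $ the address is found inside the sentence.
preg_match($unanchored, $sentence, $m);
var_dump($m[0]); // string(18) "peanuts#monkey.com"
```

So the anchored form is the one to use for validating a whole input field, and the unanchored form is for finding addresses inside longer text.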
Grouping regular expressions has lots of use cases. Using matching groups will also make your expression cleaner and more readable. It's all explained quite well in the tutorial I linked earlier.
As CanSpice points out, matching all possible email addresses isn't all that easy. Using the RFC2822 Email Validation expression will do a better job:
/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/
There are many alternatives, but even the simplest ones will do a fair job as most email addresses end in .com (or other 2-4 character length top domains).
The only reason your original expression doesn't work is that you're limiting the number of characters behind the period (.) in your expressions to 1. Changing your expression to:
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[a-zA-Z0-9]+$/
Will allow for an unlimited number of characters after the last period.
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[a-zA-Z0-9]{2,4}$/
Will allow 2 to 4 characters behind the last period. That would match:
name#email.com
name#email.info
but not:
fake#address.suckers
The top level domain (".com," ".net," ".museum") can be from 2 to 6 characters. So you should be saying 2,6 instead of 2,4.
I wrote an extremely good email address regular expression a few years ago:
^\w+([-+._]\w+)#(\w+((-+)|.))\w{1,63}.[a-zA-Z]{2,6}$
A lot of research went into that. But I have some basic tips:
DON'T JUST COPY-PASTE! If someone says "here's a great regex for that," don't just copy paste it! Understand what's going on! Regular expressions are not that hard. And once you learn them well, it'll pay dividends forever. I got good at them by taking a class in Perl back in college. Since then, I've barely gotten any better and am WAY better than the vast majority of programmers I know. It's sad. Anyways, learn it!
Start small. Instead of building a giant regex and testing it when you're done, test just a few characters. For example, when writing an email validator, why not try \w+#\w+.\w+ and see how good that is? Add in a few more things and re-test. Like ^\w+#\w+.[A-Za-z]{2,6}$
The start and end anchors of a regex string mean that nothing can come before or after the characters you specify. Your regex string needs to account for underscores, needs capital Zs in your capital ranges, and other adjustments.
/^[a-zA-Z_0-9]+#[a-zA-Z0-9]+\.[a-zA-Z0-9]{2,4}$/
{2,4} says the top level domain is between 2 and 4 characters.
This will validate ANY email address (at least, I've tried a lot):
preg_match("/^[a-z0-9._-]{2,}+\#[a-z0-9_-]{2,}+\.([a-z0-9-]{2,4}|[a-z0-9-]{2,}+\.[a-z0-9-]{2,4})$/i", $emailaddress);
Hope it works!
Make sure you ALWAYS escape metacharacters (like dot):
if(preg_match('/^[a-zA-z0-9]+#[a-zA-z0-9]+\.[a-zA-z0-9]$/', $emailaddress)) {