fully RFC5321- and 5322-compatible PHP PCRE regex


I'm trying to create a PHP PCRE regex that is (almost) fully compatible with RFC5321 and 5322 to test email addresses. The only thing I don't require is the (comment) part. I've seen some other attempts at this posted on here, but when I run tests vs. them they don't all work.
I have been working on one that is very close:
^(([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})|("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}"))@(([\w\-]*\.?[\w\-]*)|(\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])|(\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\]))$
To break it down:
Local part:
(
Match at most 64 of the allowed characters
([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})
|
OR match the same set of characters in a quoted string:
("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}")
)
end local part.
match @ sign
@
match domain part:
(
match domain part using allowed characters:
([\w\-]*\.?[\w\-]*)
or IPv4 (it doesn't check that each octet is <= 255 - that would be handled elsewhere)
(\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])
or ipv6
(\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\])
)
The only thing it's missing is the ability to check for multiple consecutive .'s (periods) that are outside a quoted local-part. I ran tests on regex101.com vs. all the addresses below using some of my own tests and the tests on the wikipedia article about email addresses:
bob@smith.com
bob.smith@smith.com
bob-smith@smith.com
bob-smith@bob-smith.com
b0b!-...smith@smith.com <-DOES NOT VALIDATE CORRECTLY - MULTIPLE .'s
bob&smith@smith.com
"bob..smith"@smith.com
simple@example.com
very.common@example.com
disposable.style.email.with+symbol@example.com
other.email-with-hyphen@example.com
fully-qualified-domain@example.com
user.name+tag+sorting@example.com
x@example.com
example-indeed@strange-example.com
admin@mailserver1
example@s.example
" "@example.org
"john..doe"@example.org
Abc.example.com
A@b@c@example.com
a"b(c)d,e:f;g<h>i[j\k]l@example.com
just"not"right@example.com
this is"not\allowed@example.com
this\ still\"not\\allowed@example.com
1234567890123456789012345678901234567890123456789012345678901234+x@example.com
john..doe@example.com <-DOES NOT VALIDATE CORRECTLY - MULTIPLE .'s
john.doe@example..com
I attempted to use lookahead and lookbehind assertions to test for the consecutive periods, but I couldn't figure it out. I think that's the only thing it's missing (other than the comments, which for my purposes aren't required).
Is there a way to check for the periods that wouldn't alter what I currently have too much, or would it require a different approach?
Please let me know if I missed anything else.
Thank you.

You may add (?!("[^"]*"|[^"])*\.{2}) after ^.
See the regex demo.
The (?!("[^"]*"|[^"])*\.{2}) negative lookahead fails the match if, immediately to the right of the current location, there is
("[^"]*"|[^"])* - 0 or more occurrences of a " followed with 0+ chars other than " and then " or any char other than "
\.{2} - two consecutive dots.
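As a rough sketch, the lookahead can be tried on its own in PHP before merging it into the full pattern. The function name hasNoConsecutiveDotsOutsideQuotes is hypothetical, and the sample addresses assume @ as the separator between local part and domain.

```php
<?php
// Hypothetical helper: true when the address has no ".." outside a quoted part.
function hasNoConsecutiveDotsOutsideQuotes(string $address): bool
{
    // The negative lookahead from the answer, anchored at the start of the string.
    $lookahead = '/^(?!("[^"]*"|[^"])*\.{2})/';
    return preg_match($lookahead, $address) === 1;
}

$samples = ['john..doe@example.com', '"john..doe"@example.org', 'john.doe@example..com'];
foreach ($samples as $address) {
    echo $address, ' => ', var_export(hasNoConsecutiveDotsOutsideQuotes($address), true), PHP_EOL;
}
```

Only the quoted address passes here; the other two contain consecutive dots outside quotes, in the local part and in the domain respectively.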

I would recommend you read this. Suffice it to say that writing a regex that will work 100% is impossible.
I've written a non-Regex implementation here. If you port this to php and file an issue on my github page or send me an email (listed on my github page), I will happily link to it.
As you can tell from the unit tests, it's comprehensive enough to work with EAI addresses as well.

Related

RegEx to extract city and state from string AND know when someone leaves out the state part

i have the following code:
preg_match("/^(.+)[,\\s]+(.+?)\s*(\d{5})?$/", trim($searchbox), $matches);
list($arr['add'], $arr['city'], $arr['state']) = $matches;
$citystr = trim(str_replace(',', '', $arr['city']));
$statestr = trim($arr['state']);
This works great when someone types in "Granite Bay, CA"; however, I would like to modify it to catch the case where someone leaves out the ", CA" part. If someone types only "granite Bay", the code above takes "Bay" as the state, which is no good. It also fails if someone adds a ZIP code at the end, like "Granite Bay, CA 00000".
Are there any modifications I can make to this regex to avoid both of these scenarios?
TIA

Yes, you can build a less permissive/more detailed pattern:
^\h*([^,\s]+(?:\h+[^,\s]+)*+)\h*(?:,\h*([A-Z]+))?\h*(\d{5})?\h*$
([^,\s]+(?:\h+[^,\s]+)*+) captures the city name: something that neither starts nor ends with whitespace and may consist of several parts.
(?:,\h*([A-Z]+))? makes the whole state part optional. Note that I have allowed only uppercase letters for the state, but you can also make it case-insensitive; it doesn't matter, since the important point is the comma.
As an aside, if you want to be sure of what a user enters, use one field per piece of information (one for the city, one for the state, one for the ZIP code).
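A minimal sketch of how this pattern might be used from PHP. The helper name parseLocation and the array keys are hypothetical; groups that did not take part in the match are normalised to empty strings.

```php
<?php
// Pattern from the answer above; group 2 (state) and group 3 (ZIP) are optional.
$pattern = '/^\h*([^,\s]+(?:\h+[^,\s]+)*+)\h*(?:,\h*([A-Z]+))?\h*(\d{5})?\h*$/';

// Hypothetical helper: returns city, state and ZIP, or null when nothing matches.
function parseLocation(string $input, string $pattern): ?array
{
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }
    // Trailing groups that did not participate are absent from $m.
    return ['city' => $m[1], 'state' => $m[2] ?? '', 'zip' => $m[3] ?? ''];
}

print_r(parseLocation('Granite Bay, CA', $pattern));
print_r(parseLocation('granite Bay', $pattern));
print_r(parseLocation('Granite Bay, CA 00000', $pattern));
```

Because the city group is possessive, a ZIP code typed directly after the city without a comma and state would end up inside the city; that case would need separate handling.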

You could go for:
^ # start of the string
(?P<town>[A-Z][^,]+) # uppercase, followed by not a comma
(?> # an atomic group (non-capturing, no backtracking into it)
,\h*\K # a comma, horizontal whitespace, \K
(?P<state>[A-Z]{2}) # two UPPERCASE letters
)? # make the whole group optional
See a demo on regex101.com.
To be sure, you'll likely need some database of towns and states to check against, though (the above expression allows XY for a state as well), or, as @Casimir points out, use a separate field for each piece of information.

Regular Expression to find Email in a Text

I am using PHP to find e-mail addresses in a given text.
My current regex is:
'/([\w+\.]*\w+@[\w+\.]*\w+[\w+\-\w+]*\.\w+)/is'
It consumes a lot of CPU. Is there an optimized regex with lower resource usage (i.e. CPU) for finding valid e-mail addresses in a given text?

This
/^[^@]+@[a-z]+(\.[a-z]+)+$/
is better than yours.
Why?
Let's say we want to test this email: foo@bar.co.uk
On success, my regex takes 14 steps to find the match.
Yours takes 22 steps.
BUT THE BIGGEST DIFFERENCE IS IN THE NON-MATCHING CASE
Let's say we want to test this email: foo@bar.co.uk.foo.
My regex takes 31 steps and fails.
Yours (which should be modified with the ^ and $ anchors, otherwise it will match this as a good one) takes 292 steps and fails!

Sometimes trading off some false positives for better performance is desirable:
/[^ @]*@[^ ]*/
This should be quite fast. It will also match stuff like __imp__MessageBoxW@16, but such constructs aren't that common in normal text.
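A short sketch of how the loose pattern might be applied with preg_match_all. The sample text is made up for illustration, and the pattern is written with @ as the address separator.

```php
<?php
// Loose pattern from the answer: cheap to run, but it accepts false positives.
$loose = '/[^ @]*@[^ ]*/';
$text  = 'Contact foo@bar.co.uk or baz@example.com today';

// Collect every whitespace-delimited token that contains an @.
preg_match_all($loose, $text, $matches);
print_r($matches[0]);
```

Since the pattern stops only at spaces, punctuation that directly follows an address would be captured along with it, so a stricter check on each candidate may still be worthwhile.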

Try this
/[-\d\w\W]+@[-\d\w.+_]+.\w{2,4}/
Matches:
hello.world@example.com
my_guru@localhost
guy.31@site.co, etc...
Tested at http://regexr.com/

PHP - Why am I being warned that my regular expression is too large?

I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks, and question marks, but I also want to limit the input to 4000 characters. I have come up with the following regular expression to achieve this: /^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i.
However, when I attempt to use this regular expression to test a subject in PHP with preg_match(), I am given a warning: PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 and the subject fails to be tested.
I find this strange because if I use an infinite quantifier, the test passes just fine (I demonstrate this situation below).
Why is limiting the repetition to 4000 a problem, but infinite repetition not?
regex-test.php:
<?php
$infinite = "/^([a-z]|[0-9]| |,|'|\.|!|\?)*$/i"; // Allows infinite repetition
$fourk = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i"; // Limits repetition to 4000
$string = "I like apples.";
if ( preg_match($infinite, $string) ){
echo "Passed infinite repetition. \n";
}
if ( preg_match($fourk, $string) ){
echo "Passed maximum repetition of 4000. \n";
}
?>
Output:
Passed infinite repetition
PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 in regex-test.php on line 16

The error is caused by PCRE's LINK_SIZE: internal offset values are stored in two bytes by default, which limits the size of the compiled pattern to about 64K. This is expected behavior, explained below; it is caused neither by a limit on repetition nor by how the pattern is interpreted when compiled.
In this case
As Alan Moore pointed out in his answer, all characters should be in the same character class. I'm more drastic, so allow me to say that pattern is so wrong it makes me cringe.
-No offense, most of us tried that once too. It's just an attempt to underline that in no way such constructs should be used.
There are 3 common pitfalls here in (x|y|z){1,4000}:
Capturing subpatterns should only be used when needed (to store a specific part of the matched text, in order to extract that value or to use it in a backreference). For all other use cases, stick to non-capturing groups or atomic groups. They perform better and save memory.
Capturing subpatterns should not be repeated because the last repetition overwrites the captured text.
-OK, it could be used only in very particular cases.
Alternation (with the |s) adds backtracking states. It's good practice to reduce them as much as you can. In this case, the regex /^[ !',.0-9?A-Z]{1,4000}$/i would match exactly the same input, not only avoiding the error but also performing better.
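As an illustration of the three points above, here is a minimal sketch using the single character class with the 4000-character limit. The sample subjects are made up.

```php
<?php
// One character class instead of a repeated capturing alternation.
$pattern = '/^[ !\',.0-9?A-Z]{1,4000}$/i';

var_dump(preg_match($pattern, 'I like apples.'));       // only allowed characters
var_dump(preg_match($pattern, 'No <tags> allowed'));    // contains < and >
var_dump(preg_match($pattern, str_repeat('a', 4001)));  // one character too long
```

The quantifier now applies to a single class item rather than to a group with eight alternatives, so the compiled pattern stays small.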
LINK_SIZE
From "Handling Very Large Patterns" in pcrebuild man page:
Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an
alternation metacharacter). By default, in the 8-bit and 16-bit
libraries, two-byte values are used for these offsets, leading to a
maximum size for a compiled pattern of around 64K.
That means the compiled pattern stores an offset value for every subpattern in the alternation, for every repetition of the group. In this case the offsets leave no memory for the rest of the compiled pattern.
This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:
PCRE keeps offsets in its compiled code as 2-byte quantities (always
stored in big-endian order) by default. These are used, for example,
to link from the start of a subpattern to its alternatives and its
end. The use of 2 bytes per offset limits the size of the compiled
regex to around 64K, which is big enough for almost everybody.
Using pcretest, I get the following information:
PCRE version 8.37 2015-04-28
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
There's a Windows version you can download from RexEgg.com.
Regarding other size limitations in PCRE, you can check this post of mine.
Overriding the default LINK_SIZE in PHP
If we had a true reason to use a huge pattern, and this pattern could not be simplified any further by all means, the link size could be increased. However, you can only achieve this by recompiling PHP yourself (therefore, your code won't be portable from now on). It should be the last resort, provided there's no other choice.
Also commented in pcre_internal.h:
The macros are controlled by the value of LINK_SIZE.
This defaults to 2 in the config.h file,
but can be overridden by using -D on the command line.
This is automated on Unix systems via the "configure" command.
PCRE link size can be configured to 3 or 4:
./configure -DLINK_SIZE=4
But keep in mind that longer offsets require additional data, and it will slow down all calls to preg_* functions.
In case of compiling PHP on your own, see Installation on Unix systems or Build your own PHP on Windows.

Looking at the 'regex' engine php uses, pcre here: http://pcre.sourceforge.net/pcre.txt at the limitations section it states:
The maximum length of a compiled pattern is 65539 (sic) bytes
My guess is that some regex like this:
(123){1,3}
is compiled to something like this
(123)(123)?(123)?
Which makes it bigger than the maximum length

While I agree that the regex compiler shouldn't behave that way, you really shouldn't have encountered this problem. Inside the parentheses, your regex matches exactly one character from a specific set--the definition of a character class. The correct way to write your regex is to list all the characters inside one set of square brackets and forego the parentheses:
/^[a-z0-9 ,'.!?]{1,4000}$/i
That works fine, as this demo shows. However, it was the parentheses that were causing the error (even non-capturing parens cause it), and that doesn't seem right to me, even if they were unnecessary.

For me the problem was an unescaped ? character.
You need to escape it with not one but two backslashes (\\).
My regexp went from (?340202) to (\\?340202)

Using a regular expression to validate email addresses

I have just started learning to code both PHP as well as HTML and had a look at a few tutorials on regular expressions however have a hard time understanding what these mean. I appreciate any help.
For example, I would like to validate the email address peanuts@monkey.com. I start off with the code and I get the message invalid email address.
What am I doing wrong?
I know that the metacharacters such as ^ denote the start of a string and $ denote the end of a string however what does this mean? What is the start of a string and what is the end of a string?
When do I group regular expressions?
$emailaddress = 'peanuts@monkey.com';
if(preg_match('/^[a-zA-z0-9]+@[a-zA-z0-9]+\.[a-zA-z0-9]$/', $emailaddress)) {
echo 'Great, you have a valid email address';
} else {
echo 'boo hoo, you have an invalid email address';
}

What you have written works with some small modifications, if that is what you want to use; however, you are missing a '+' at the end.
1)
^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+$
The caret and dollar characters match positions rather than characters: ^ matches the beginning of a line and $ matches the end of a line; they are used to anchor your regex. If you write your regex without those two, you will match email addresses anywhere in your text, not only an email address that sits on a line by itself. If you had written only the ^ (caret), you would have found only the email addresses at the start of a line, and if you had written only the $ (dollar), you would have found only the email addresses at the end of a line.
Blah blah blah someEmail@email.com
blah blah
would not give you a match because the email address is not at the beginning of the line and the line does not terminate with it either, so in order to match it in this context you would have to drop ^ and $.
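The effect of the anchors can be sketched in PHP as follows. The sample text from the answer is written on one line here, and the variable names are only illustrative. Without the m modifier, PCRE treats ^ and $ as the start and end of the whole subject string.

```php
<?php
$anchored   = '/^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+$/';
$unanchored = '/[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+/';
$text       = 'Blah blah blah someEmail@email.com blah blah';

// The anchored pattern only accepts a subject that is an address and nothing else.
var_dump(preg_match($anchored, 'peanuts@monkey.com'));
var_dump(preg_match($anchored, $text));

// The unanchored pattern finds the address inside the surrounding text.
var_dump(preg_match($unanchored, $text, $found));
echo $found[0], PHP_EOL;
```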
Grouping is used for two reasons as far as I know: back-referencing and... grouping. Grouping is used for the same reasons as in math: 1 + 3 * 4 is not the same as (1 + 3) * 4. You use parentheses to constrain quantifiers such as '+', '*' and '?' as well as alternation '|', etc.
You also use parentheses for back-referencing, but since I can't explain it better I will link you to: http://www.regular-expressions.info/brackets.html
I encourage you to take a look at this book; even if you only read the first 2-3 chapters you will learn a lot, and it is a great book! http://oreilly.com/catalog/9781565922570
And as the commenters say, this regex is not perfect, but it works and shows you what you had forgotten. You were not far off!
UPDATED as requested:
The '+', '*' and '?' are quantifiers, and they are also a good example of where you group.
'+' means match the preceding character or group 1 or n times.
'*' means match the preceding character or group 0 or n times.
'?' means match the preceding character or group 0 or 1 time.
n times meaning (indefinitely)
The reason you use [a-zA-Z0-9]+ is that without the '+' it will match only one character. With + it will match many, but it must match at least one. With * it matches many but also 0, and with ? it matches 1 character at most but also 0.
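To make the difference between the three quantifiers concrete, here is a small sketch that runs each one against the same subjects. The patterns and subjects are made up for illustration.

```php
<?php
// Each pattern applies a different quantifier to the letter b.
$cases = [
    '/^ab+$/' => ['a', 'ab', 'abbb'],  // + : one or more b
    '/^ab*$/' => ['a', 'ab', 'abbb'],  // * : zero or more b
    '/^ab?$/' => ['a', 'ab', 'abbb'],  // ? : zero or one b
];

foreach ($cases as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        $result = preg_match($pattern, $subject) === 1 ? 'match' : 'no match';
        printf("%-8s %-5s %s\n", $pattern, $subject, $result);
    }
}
```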

Your regex doesn't match email addresses. Try this one:
/\b[\w\.-]+@[\w\.-]+\.\w{2,4}\b/
I recommend you read through this tutorial to learn about Regular Expressions.
Also, RegExr is great for testing them out.
As for your second question; the ^ character means that the regular expression must start matching from the first character in the string you input. The $ means that the regular expression must end at the final character in the string you input. In essence, this means that your regular expression will match the following string:
peanuts@monkey.com
but NOT the following string:
My email address is peanuts@monkey.com, and I love it!
Grouping regular expressions has lots of use cases. Using matching groups will also make your expression cleaner and more readable. It's all explained quite well in the tutorial I linked earlier.
As CanSpice points out, matching all possible email addresses isn't all that easy. Using the RFC2822 Email Validation expression will do a better job:
/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/
There are many alternatives, but even the simplest ones will do a fair job as most email addresses end in .com (or other 2-4 character length top domains).
The only reason your original expression doesn't work is that you're limiting the number of characters after the period (.) in your expression to 1. Changing your expression to:
/^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+$/
will allow any number of characters (at least one) after the last period.
/^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]{2,4}$/
will allow 2 to 4 characters after the last period. That would match:
name@email.com
name@email.info
but not:
fake@address.suckers

The top level domain (".com," ".net," ".museum") can be from 2 to 6 characters. So you should be saying 2,6 instead of 2,4.
I wrote an extremely good email address regular expression a few years ago:
^\w+([-+._]\w+)@(\w+((-+)|.))\w{1,63}.[a-zA-Z]{2,6}$
A lot of research went into that. But I have some basic tips:
DON'T JUST COPY-PASTE! If someone says "here's a great regex for that," don't just copy paste it! Understand what's going on! Regular expressions are not that hard. And once you learn them well, it'll pay dividends forever. I got good at them by taking a class in Perl back in college. Since then, I've barely gotten any better and am WAY better than the vast majority of programmers I know. It's sad. Anyways, learn it!
Start small. Instead of building a giant regex and testing it when you're done, test just a few characters. For example, when writing an email validator, why not try \w+@\w+.\w+ and see how good that is? Add in a few more things and re-test. Like ^\w+@\w+.[A-Za-z]{2,6}$

The start and end anchors of a regex mean that nothing can come before or after the characters you specify. Your regex needs to account for underscores, needs a capital Z in the uppercase ranges (A-Z, not A-z), and needs other adjustments.
/^[a-zA-Z_0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]{2,4}$/
{2,4} says the top level domain is between 2 and 4 characters.

This will validate ANY email address (at least, I've tried a lot):
preg_match("/^[a-z0-9._-]{2,}+\@[a-z0-9_-]{2,}+\.([a-z0-9-]{2,4}|[a-z0-9-]{2,}+\.[a-z0-9-]{2,4})$/i", $emailaddress);
Hope it works!

Make sure you ALWAYS escape metacharacters (like dot):
if(preg_match('/^[a-zA-z0-9]+@[a-zA-z0-9]+\.[a-zA-z0-9]$/', $emailaddress)) {

how to validate an email address to this form: 234903284@student.uws.edu.au

I'd like to know a regex that would validate:
an email address to this form: 234903284@student.uws.edu.au
couple issues:
"student." is optional and could be any word eg "teacher.".
"324234234" can be any alpha numeric characters (number, word, _ etc.)
the email must end in "uws.edu.au"
This is what I have so far:
/(\d*)@\w*\.uws\.edu\.au/
valid addresses:
me@uws.edu.au
234234324@student.uws.edu.au
theking@teacher.uws.edu.au
etc.
Thanks Guys

Three thoughts:
Change the initial \d to \w to match "word" characters [a-zA-Z0-9_] instead of just digits.
Make the subdomain optional using ?
Use + instead of * when matching the username and subdomain. Otherwise @.uws.edu.au will validate.
Suggested:
/\w+@(\w+\.)?uws\.edu\.au/

You said:
Just tried /(\w*)@(\w*.)?uws.edu.au/ and that seemed to work. Any further suggestions are welcome – Jason 4 secs ago
Your regex will match "@teacher.uws.edu.au" (i.e. "name portion" omitted).
To fix this, you could use:
/(\w+)@(\w+\.)?uws\.edu\.au/
which will require at least one character in the name portion, and at least one character before the dot (if there is a dot) in the subdomain spot.
Also, (I think) \w will not match . (and probably other characters that you care about in the name portion too), so bob.jones@student.uws.edu.au would fail to match. The following would add the characters ., _, and - to the "name" portion:
/([\w\._-]+)@(\w*\.)?uws\.edu\.au/
You could add any other characters you need in the same way.
NOTE: Matching email addresses in general is a more complex thing than you might think (lots of strange things are technically allowed in email addresses). Here is an article on the subject (there are many other sources of similar information available).
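Putting the suggestions from both answers together, a check in PHP might look like the sketch below. The ^ and $ anchors are an assumption added here so that the whole string has to be the address; the answers above leave the pattern unanchored. The sample addresses are illustrative.

```php
<?php
// Assumption: anchors added so nothing may precede or follow the address.
$pattern = '/^[\w.-]+@(\w+\.)?uws\.edu\.au$/';

$samples = [
    'me@uws.edu.au',
    '234234324@student.uws.edu.au',
    'bob.jones@student.uws.edu.au',
    '@teacher.uws.edu.au',        // name portion missing
    'me@uws.edu.au.example.com',  // does not end in uws.edu.au
];

foreach ($samples as $address) {
    $verdict = preg_match($pattern, $address) === 1 ? 'valid' : 'invalid';
    echo $address, ' => ', $verdict, PHP_EOL;
}
```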
