PHP RegEx: Find Vulnerability Within Email Validation Pattern - php

The following regex pattern (for PHP) is meant to validate any email address:
^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$
It says: "match one or more word characters (letters, digits, underscores), periods and/or dashes, followed by exactly one @, followed by one or more letters, digits, periods and/or dashes, followed by exactly one period, followed by two to six upper- and/or lower-case letters."
This seems to match any email address I can think of. Still, this feeling of getting it right is probably deceptive. Can someone knowledgeable please point out an obvious or not-so-obvious vulnerability in this pattern that I'm not aware of, which would make it not perform the email validation the way it's meant to?
(To anticipate a possible response: I'm aware that the filter_var() function offers a more robust solution, but I'm specifically interested in the regular expression in this case.)
NOTE: this is a theoretical question about the PHP flavor of regex, NOT a practical question about validating emails. I merely want to determine the limits of what is reasonably possible with regex in this case.
Thank you in advance!

Using a regular expression to validate emails is tricky.
Try the following email as an input to your regex, i.e. ^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$:
abc@b...com
You can read more about email regex validation at http://www.regular-expressions.info/email.html
If you are doing this for an app, validate the address by sending an email to it rather than by using a very complex regex.
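The counterexample above can be checked directly. This is a minimal sketch, not a definitive test: the helper name matchesQuestionPattern is made up for illustration, and filter_var() is only used as a point of comparison.

```php
<?php
// Hypothetical helper: true when $address matches the pattern from the question.
function matchesQuestionPattern(string $address): bool
{
    return preg_match('/^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/', $address) === 1;
}

$address = 'abc@b...com';

// The pattern accepts the address even though the domain contains empty labels.
var_dump(matchesQuestionPattern($address));             // bool(true)

// filter_var() rejects the same input.
var_dump(filter_var($address, FILTER_VALIDATE_EMAIL));  // bool(false)
```

The domain class [A-Za-z0-9.-]+ happily consumes "b..", which is why consecutive dots slip through.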

The email address specification is pretty nuts. There are regexen out there that can do a full validation for it, but they are thousands of characters long. It may be better to parse it on your own, but PHP has a built-in validator for email addresses:
filter_var($email, FILTER_VALIDATE_EMAIL);
EDIT:
In answer to your specific question about an email address that will fail: any address with the local part in quotes will, because your pattern doesn't account for quotes at all:
"explosion-pills"@aysites.com

Related

preg_match verification of non English email addresses (international domain names)

We all know email address verification is a touchy subject; there are so many opinions on the best way to deal with it without coding for the entire RFC. But since 2009 it's become even more difficult, and I haven't really seen anyone address the issue of IDNs yet.
Here is what I've been using:
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}\z/i', $email)
This will work for most email addresses, but what if I need to match a non-Latin email address, e.g. bob@china.中國 or bob@russia.рф?
Look here for the complete list. (Notice all the non Latin domain extensions at the bottom of the list.)
Information on this subject can be found here and I think what they are saying is these new characters will simply be read as '.xn--fiqz9s' and '.xn--p1ai' on the machine level but I'm not 100% sure.
If so, does that mean the only change I need to consider making in my code is the following? (For domain extensions like .travelersinsurance and .sandvikcoromant)
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,20}\z/i', $email)
NOTICE: This is not related to the discussion found on this page: Using a regular expression to validate an email address
Consider: Every time you make up your own new regex without validating addresses according to the complete RFC spec, you're just making the situation for using "exotic" email addresses on the web worse. You're inventing some new ad-hoc subset or superset of the official RFC spec; that means you will have false positives or false negatives or both: you will prevent people from using their actual addresses because your regex doesn't account for them correctly, or you will accept addresses which are actually invalid.
Add to that that even if the address is syntactically valid, that still doesn't mean a) the address actually (still) exists, b) belongs to that user or c) can actually receive email. In the grand scheme of things, validating the syntax is an extremely minor concern.
If you're going to validate the syntax at all, either do a very rough general check which is sure to not reject any valid addresses (e.g. /.+@.+/), or validate according to all RFC rules; don't do some in-between half-assed sort-of-strict-but-not-really validation you just came up with.
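The "very rough general check" could look something like this. It is only a sketch of the idea; the function name looksRoughlyLikeEmail is an assumption, not something from the answer.

```php
<?php
// Hypothetical helper: deliberately loose, it only requires
// at least one character on each side of an @.
function looksRoughlyLikeEmail(string $input): bool
{
    return preg_match('/.+@.+/', $input) === 1;
}

var_dump(looksRoughlyLikeEmail('bob@china.中國'));  // bool(true)
var_dump(looksRoughlyLikeEmail('no-at-sign'));      // bool(false)
var_dump(looksRoughlyLikeEmail('@missing-local'));  // bool(false)
```

Anything that passes would then be confirmed by actually sending a verification email, as the following answers suggest.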
I'm gonna stick with the tried and true suggestion that you should send them a verification email. No need for a fancy regex that will need to be updated time and time again. Just assume they know their email address and let them enter it.
That's what I've always done when this situation comes up. If anything I would make them enter their email twice. It'll free you up to spend more time on the important parts of your site/project.
Here is what I eventually came up with.
preg_match('/^[\pL\pM\pN._%+-]+@[\pL\pM\pN.-]+\.[\pL\pM]{2,20}\z/u', $email)
This uses Unicode character properties like \pL, \pM and \pN to help me deal with characters and numbers from any language.
\pL Any kind of letter from any language, upper or lower case.
\pM Any combining mark: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.). Inside a character class a quantifier such as *+ would be read as the literal characters * and +, so it is left out here.
\pN Any number.
The expression above will work perfectly for normal email addresses like me@mydomain.com and cacophonous email addresses like a.s中3_yÄhমহাজোটেরoo文%网+d-fελληνικά@πyÄhooαράδειγμα.δοκιμή.
It's not that I don't trust people to be able to type in their own email addresses but people do make mistakes and I may use this code in other situations. For example: I need to double check the integrity of an existing list of 10,000 email addresses. Besides, I was always taught to NOT trust user input and to ALWAYS filter.
UPDATE
I just discovered that although this works perfectly when tested on sites like phpliveregex.com, and locally when parsing a normal string with UTF-8 content, it doesn't work properly with email fields, because browsers convert the content of fields of that type to plain Latin characters. So an email address like bob@china.中國 or bob@russia.рф is converted to bob@china.xn--fiqz9s or bob@russia.xn--p1ai before the server receives it. The only thing really missing from my original filter was support for digits and hyphens in the domain extension.
Here is the final version (with each hyphen placed last in its character class so that it is not read as a range):
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i', $email);
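As a rough check of the behaviour described in the update, the final pattern can be run against the Punycode forms and against the raw Unicode form. This is a sketch only; the constant and function names are invented for the example, and the character classes are written with the hyphen last so it is taken literally.

```php
<?php
// ASCII-only pattern, hyphen last in each class so it is literal.
const ASCII_EMAIL_PATTERN = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i';

// Hypothetical helper wrapping the pattern.
function asciiPatternAccepts(string $address): bool
{
    return preg_match(ASCII_EMAIL_PATTERN, $address) === 1;
}

// Punycode forms, as a browser would submit them from an email field.
var_dump(asciiPatternAccepts('bob@china.xn--fiqz9s'));  // bool(true)
var_dump(asciiPatternAccepts('bob@russia.xn--p1ai'));   // bool(true)

// The raw Unicode form is not covered by the ASCII-only classes.
var_dump(asciiPatternAccepts('bob@china.中國'));         // bool(false)
```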

PHP - preg_match_all - Loose email match pattern that allows spaces and double @

I am going through our old site files and data that has our members emails and correspondence for 10 years.
I am extracting all of the email addresses (and botched email entries) and adding them to our new sites db.
It was a beginner attempt cms and had no error checking and validation.
So, I am having trouble matching emails with spaces and double @.
jam @ spa ces1.com
jam@spac es2.com
jam@@doubleats.org
I have constructed this loose regex that intentionally allows for a whole bunch of incorrect email formats but, the above three are examples of ones I can't figure out.
Here is my current "working" code:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b';
$pattern="/$pattern1|$pattern2/i";
$isago = preg_match_all($pattern,$text,$matches);
if ($isago) {.......
I need another pattern that would allow the three email examples above to be recognized as email addresses (actual validation comes later).
Also, are there any other patterns I could use that would allow me to recognize possible emails in the files?
Thanks for any help.
For the third case you can change your @ to @{1,2}.
For the first and second you can add a space to the domain character class in pattern1:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@{1,2}([ ]+|)([ a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+@{1,2}[A-Z0-9.-]+\.[A-Z]{2,4}\b';
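To see whether the loosened patterns pick up the three problem entries, they can be tried on each sample in turn. A minimal sketch; the variable names are illustrative and the samples are the three entries from the question.

```php
<?php
// Loosened patterns: @{1,2} allows a doubled @, and the extra space
// in the domain class allows "spa ces1.com".
$loosePattern1 = '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@{1,2}([ ]+|)([ a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$loosePattern2 = '\b[A-Z0-9._%+-]+@{1,2}[A-Z0-9.-]+\.[A-Z]{2,4}\b';
$loosePattern  = "/$loosePattern1|$loosePattern2/i";

$samples = ['jam @ spa ces1.com', 'jam@spac es2.com', 'jam@@doubleats.org'];

// Record, per sample, whether the combined pattern finds a match.
$looseResults = [];
foreach ($samples as $sample) {
    $looseResults[$sample] = preg_match($loosePattern, $sample) === 1;
}

var_dump($looseResults); // each entry is expected to be true
```

Because the domain class now contains a space, the pattern may also swallow ordinary words that precede a dot in running text, so the matches still need the later validation pass.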
This answer is a bit of a joke, I know... but how about this regex:
/[\S ]+@[\S ]+\.[\S ]+/i
Does that work for you? I tested it on a document and it matches the three emails.
For general purposes you should use something like this:
/[A-Za-z0-9\._]+@[A-Za-z0-9\._]+\.[A-Za-z0-9\._]+/i
With that you would match all the emails, even separated by newline or commas.

Does PHP have a validation class or validation functions?

I'm looking for PHP validation functions. In my opinion regex is hard.
Does PHP have a ready-made function?
I want to validate these values:
Telephone number
E-mail address
Maybe more...
I also want to disable HTML tags in the data coming from a textarea.
Take a look at Validate filters, filter_var.
For the HTML tags you can either remove them or escape them.
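A short sketch of both suggestions, using filter_var() for the email and the two usual options for HTML (escaping with htmlspecialchars() or removing tags with strip_tags()). The sample values and variable names are made up for illustration.

```php
<?php
// Sample input, standing in for submitted form data.
$email   = 'user@example.com';
$comment = '<b>Hello</b> <script>alert(1)</script>';

// filter_var() returns the address on success and false on failure.
$emailIsValid = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;

// Option 1: escape the markup so it is displayed as text.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

// Option 2: remove the tags (the text between them is kept).
$stripped = strip_tags($comment);

var_dump($emailIsValid); // bool(true)
echo $escaped, "\n";     // &lt;b&gt;Hello&lt;/b&gt; &lt;script&gt;alert(1)&lt;/script&gt;
echo $stripped, "\n";    // Hello alert(1)
```

Escaping at output time is usually the safer default, since strip_tags() leaves the text inside removed tags in place.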
Regular Expressions are going to be the most robust way to validate things like phone numbers and email addresses.
They're not too hard to learn if you have a good resource. And it's an excellent tool to have in your developer toolbox. Check out http://www.regular-expressions.info/tutorial.html. Otherwise, I'm sure you can find some existing validation functions by doing a search on Google or SO.
Take a look at strip_tags() for removing HTML tags from strings.
Take a look at filter_var
Validating telephone numbers and email addresses is very hard, and is complicated by conflicting opinions as to what constitutes a valid phone number (UK only? Some other country only? Multiple countries? What about extensions? Should the number be prefixed with a +? Are hyphens or other separating characters allowed?) or email address (see one take on this).
PHP has no built-in for phone numbers; for email addresses, filter_var is the closest thing.

Regex problem Email test

I have a problem with the pattern below:
/([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}@(([A-Z0-9]+([-][A-Z0-9])*){2,}\.)+([A-Z0-9]+([-][A-Z0-9])*){2,}/i
It matches email addresses, and I have a problem with this rule:
[A-Z0-9\.\_\+\-]*
If I remove the star it works, but I want these characters to occur 0 or more times. I tested it on http://regexpal.com/ and it works, but with preg_match_all (PHP) it didn't work.
Thanks
Why not use PHP's filter_var()?
filter_var('test@email.com', FILTER_VALIDATE_EMAIL)
There is no good regex to validate email addresses. If you absolutely MUST use regex, then maybe have a look at Validate an E-Mail Address with PHP, the Right Way. Although, this is by no means a perfect measure either.
Edit: After some digging, I came across Mailparse.
Mailparse is an extension for parsing and working with email messages. It can deal with RFC 822 and RFC 2045 (MIME) compliant messages.
Mailparse is stream based, which means that it does not keep in-memory copies of the files it processes - so it is very resource efficient when dealing with large messages.
First of all, there are plenty of resources for this available. A quick search for "email validation regex" yields tons of results... Including This One...
Secondly, the problem is not in the * character. The problem is in the whole block.
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
Look at what that's doing. It's basically saying match as many alpha-numerics as possible, then match as many alpha-numerics with other characters as possible, then repeat at least 3 and at most 64 times. That could be a LOT of characters...
Instead, you could do:
([A-Z0-9][A-Z0-9\.\_\+\-]{2,63})
This matches at most 64 characters for the part before the @.
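A sketch of the bounded version applied to the local part only. The ^ and $ anchors and the function name are additions for the example; the error check is there because preg_match() returns false (not 0) when PCRE gives up, for instance on hitting the backtrack limit.

```php
<?php
// Hypothetical helper: checks the part before the @ against the bounded pattern.
function localPartIsBounded(string $localPart): bool
{
    $result = preg_match('/^[A-Z0-9][A-Z0-9._+-]{2,63}$/i', $localPart);
    if ($result === false) {
        // A PCRE error occurred; surface it instead of treating it as "no match".
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result === 1;
}

var_dump(localPartIsBounded('a5________'));         // bool(true)
var_dump(localPartIsBounded('ab'));                 // bool(false), shorter than 3
var_dump(localPartIsBounded(str_repeat('a', 65)));  // bool(false), longer than 64
```

With a single quantified class there is only one way to split the input, so the engine has nothing to backtrack over.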
Oh, and this is the pain of parsing emails with regex.
There are plenty of other resources for validating email addresses (Including filter_var). Do some searching and see how the popular frameworks do it...
Try this regex:
/^[A-Z0-9][A-Z0-9\.\_\+\-]{3,64}@([A-Z0-9][-A-Z0-9]*\.)+[A-Z0-9]{2,}$/i
But like @Russell Dias said, you shouldn't use regex for emails.
While I agree with Russell Dias, I believe your issue is with this entire block:
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
Basically you are saying you want:
Letters or numbers, 1 or more times
Letters, numbers or any of . _ + -, 0 or more times
Repeat the above between 3 and 64 times
You have the quantifier after the whole group:
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
So this will require a minimum of 3 alphanumeric characters, and something like this:
a5________@gmail.com
will not work, but this:
a____a___a___@gmail.com
will do the work. Better to find a ready-made, well tested regex.
Also, you don't have start and end anchors (^ and $), so something like this will pass:
&^$##&$^##&aaa5a55a55a@gmail.comADA;'DROP TABLE :)
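These claims can be tried against the pattern from the question. A sketch only: the constant and function names are invented, and the last input is a simplified stand-in for the injection-style string above.

```php
<?php
// Pattern from the question, unchanged (note: no ^ or $ anchors).
const QUESTION_EMAIL_PATTERN = '/([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}@(([A-Z0-9]+([-][A-Z0-9])*){2,}\.)+([A-Z0-9]+([-][A-Z0-9])*){2,}/i';

// Hypothetical helper: true when the pattern matches anywhere in $input.
function questionPatternFinds(string $input): bool
{
    return preg_match(QUESTION_EMAIL_PATTERN, $input) === 1;
}

// Only two alphanumerics before the @, so the group cannot repeat 3 times.
var_dump(questionPatternFinds('a5________@gmail.com'));      // bool(false)

// Three alphanumerics, each one starting a new repetition of the group.
var_dump(questionPatternFinds('a____a___a___@gmail.com'));   // bool(true)

// Without anchors, junk before and after the address is ignored.
var_dump(questionPatternFinds("!!!aaa5a55a55a@gmail.comADA;'DROP TABLE")); // bool(true)
```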

PHP and Regular Expressions question?

I was wondering if the code below is the correct way to check a street address, email address, password, city and URL using preg_match and regular expressions?
And if not how should I fix the preg_match code?
preg_match ('/^[A-Z0-9 \'.-]{1,255}$/i', $trimmed['address']) //street address
preg_match ('/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$/', $trimmed['email']) //email address
preg_match ('/^\w{4,20}$/', $trimmed['password']) //password
preg_match ('/^[A-Z \'.-]{1,255}$/i', $trimmed['city']) //city
preg_match("/^[a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i", $trimmed['url']) //url
Your street address regex: ^[A-Z0-9 \'.-]{1,255}$
The single quote needs no escaping as far as the regex is concerned; the backslash is only there because the pattern sits in a single-quoted PHP string.
The dot inside the character class is a literal period, not a wildcard, so it needs no escaping either. The class still rejects characters such as / and #.
You are allowing a minimum length of 1 and a maximum length of 255. I would suggest increasing the minimum length to something more than 1.
Your email regex: ^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$
Again, the . in the character classes is a literal period. Note that \w also matches underscores, which are not valid in the domain part.
Your password regex: ^\w{4,20}$
This allows a password of length 4 to 20 that can contain only letters (upper and lower case), digits and underscores. I would suggest allowing special characters too, to make your passwords stronger.
Your city regex: ^[A-Z \'.-]{1,255}$
The . in the character class is again a literal period.
It allows a minimum length of 1 (if you want to allow cities with 1-character names, this is fine).
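How the dot behaves inside a character class is easy to check by running the street address pattern on a couple of inputs. A minimal sketch; the constant and function names, and the sample addresses, are made up for the example.

```php
<?php
// Street address pattern from the question. The backslash before the quote
// is consumed by the single-quoted PHP string, not by the regex engine.
const STREET_PATTERN = '/^[A-Z0-9 \'.-]{1,255}$/i';

// Hypothetical helper wrapping the pattern.
function streetPatternAccepts(string $address): bool
{
    return preg_match(STREET_PATTERN, $address) === 1;
}

// Letters, digits, spaces, apostrophes, periods and dashes are accepted.
var_dump(streetPatternAccepts("12 O'Neil St."));  // bool(true)

// The "." in the class is a literal period, so "/" is not matched by it.
var_dump(streetPatternAccepts('12/345 Foo St'));  // bool(false)
```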
EDIT:
Since you are very new to regex, spend some time on Regular-Expressions.info
This seems overly complicated to me. In particular I can see a few things that won't work:
Your regex will fail for cities with non-ASCII letters in their names, such as "Malmö" or 서울, etc.
Your password validator doesn't allow for spaces in the password (which is useful for entering pass-phrases), and it doesn't allow punctuation either, which many people like to put in their passwords for added security.
Your address validator won't allow for people who live in apartments (12/345 Foo St)
(inside a character class "." is just a literal period, so characters like "/" are rejected)
And so on. In general, I think over-reliance on regular expressions for validation is not a good thing. You're probably better off allowing anything for those fields and just validating them some other way.
For example, with email addresses: just because an address is valid according to the RFC standard doesn't mean you'll actually be able to send email to it (or that it's the correct email address for the person). The only reliable way to validate an email address is to actually send an email to it and get the person to click on a link or something.
Same thing with URLs: just because it's valid according to the standard doesn't actually mean there's a web page there. You can validate the URL by trying to do an actual request to fetch the page.
But my personal preference would be to just do the absolute minimum verification possible, and leave it at that. Let people edit their profile (or whatever it is you're verifying) in case they make a mistake.
There's not really a 'correct' way to check for any of those things. It depends on what exactly your requirements are.
For e-mail addresses and URLs, I'd recommend using filter_var instead of regexps - just pass it FILTER_VALIDATE_EMAIL or FILTER_VALIDATE_URL.
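A brief sketch of that recommendation. The $trimmed array mirrors the one in the question, but its contents here are invented sample data.

```php
<?php
// Sample data standing in for the question's $trimmed array.
$trimmed = [
    'email' => 'someone@example.com',
    'url'   => 'http://example.com/page?id=1',
];

// filter_var() returns the filtered value on success and false on failure.
$emailOk = filter_var($trimmed['email'], FILTER_VALIDATE_EMAIL) !== false;
$urlOk   = filter_var($trimmed['url'], FILTER_VALIDATE_URL) !== false;

var_dump($emailOk); // bool(true)
var_dump($urlOk);   // bool(true)

// A value without a scheme is not accepted as a URL.
var_dump(filter_var('example', FILTER_VALIDATE_URL)); // bool(false)
```

Comparing against false with !== matters, because a successful call returns the string itself rather than true.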
With the other regexps, note that . inside a character class is a literal period and needs no escaping, and you might want to consider that the City/Street ones would allow rubbish such as ''''', or just whitespace.
Please don't assume that you know how an address is made up. There are thousands of cities, towns and villages with characters like & and those from other alphabets.
Just DON'T try to validate an address unless you do it through an API specific to a country (USPS for the US, for example).
And why would you want to limit the characters in a user's password? Don't have ANY requirements on the password except for it existing.
Your site will be unusable if you use those regexes.
