Regex problem Email test - php

I have a problem with the pattern below:
/([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}#(([A-Z0-9]+([-][A-Z0-9])*){2,}\.)+([A-Z0-9]+([-][A-Z0-9])*){2,}/i
It matches email addresses, and I have a problem with this rule:
[A-Z0-9\.\_\+\-]*
If I remove the star it works, but I want these characters to occur 0 or more times. I tested it on http://regexpal.com/ and it works, but with preg_match_all (PHP) it didn't work.
Thanks

Why not use PHP's filter_var()?
filter_var('test#email.com', FILTER_VALIDATE_EMAIL)
There is no good regex to validate email addresses. If you absolutely MUST use regex, then maybe have a look at Validate an E-Mail Address with PHP, the Right Way. Although, this is by no means a perfect measure either.
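A minimal sketch of the filter_var() approach. The sample addresses are made up and written with a literal at sign, since FILTER_VALIDATE_EMAIL rejects a string that has no at sign at all; the variable names are only for illustration.

```php
<?php
// Hypothetical sample input; none of these addresses come from the question.
$candidates = ['user@example.com', 'user@@example.com', 'not-an-address'];

$results = [];
foreach ($candidates as $candidate) {
    // filter_var() returns the filtered string on success and false on failure.
    $results[$candidate] = filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
}

foreach ($results as $candidate => $isValid) {
    echo $candidate, ' => ', $isValid ? 'valid' : 'invalid', PHP_EOL;
}
```

Comparing against false with !== avoids treating a valid but "falsy-looking" return value as a failure.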
Edit: After some digging, I came across Mailparse.
Mailparse is an extension for parsing and working with email messages. It can deal with RFC 822 and RFC 2045 (MIME) compliant messages.
Mailparse is stream based, which means that it does not keep in-memory copies of the files it processes - so it is very resource efficient when dealing with large messages.

First of all, there are plenty of resources for this available. A quick search for "email validation regex" yields tons of results... Including This One...
Secondly, the problem is not in the * character. The problem is in the whole block.
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
Look at what that's doing. It's basically saying match as many alpha-numerics as possible, then match as many alpha-numerics with other characters as possible, then repeat at least 3 and at most 64 times. That could be a LOT of characters...
Instead, you could do:
([A-Z0-9][A-Z0-9\.\_\+\-]{2,63})
Which will match at most 64 characters for the part before the #.
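A quick way to check the length bounds of that rewritten block. The ^ and $ anchors are an assumption added here so the lengths can be tested in isolation; they are not part of the suggestion above.

```php
<?php
// Rewritten local-part rule; the ^ and $ anchors are an added assumption.
$localPart = '/^[A-Z0-9][A-Z0-9\.\_\+\-]{2,63}$/i';

$len64       = preg_match($localPart, str_repeat('a', 64)); // 1 leading char + 63 more
$len65       = preg_match($localPart, str_repeat('a', 65)); // one char too long
$len2        = preg_match($localPart, 'ab');                // shorter than the minimum of 3
$underscores = preg_match($localPart, 'a5________');        // few alphanumerics is fine now

var_dump($len64, $len65, $len2, $underscores);
```

Note that a local part such as a5________ is accepted here, because only the first character has to be alphanumeric.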
Oh, and this is the pain of parsing emails with regex
There are plenty of other resources for validating email addresses (Including filter_var). Do some searching and see how the popular frameworks do it...

Try this regex :
/^[A-Z0-9][A-Z0-9\.\_\+\-]{3,64}#([A-Z0-9][-A-Z0-9]*\.)+[A-Z0-9]{2,}$/i
But like #Russell Dias said, you shouldn't use regex for emails.

While I agree with Russell Dias, I believe your issue is with this entire block:
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
Basically you are saying you want:
Letters or numbers, 1 or more times
Letters, numbers, or the characters . _ + -, 0 or more times
Repeat the above between 3 and 64 times

You have a quantifier after the whole group:
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
So this will require a minimum of 3 alphanumeric characters, and something like this:
a5________#gmail.com
will not work, but this:
a____a___a___#gmail.com
will work. Better to find a ready-made, well-tested regex.
Also, you don't have start and end anchors (^ and $), so something like this will pass:
&^$##&$^##&aaa5a55a55a#gmail.comADA;'DROP TABLE :)
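The three cases above can be checked with a short script. This is a sketch that uses the pattern from the question with `#` kept as the separator, exactly as written in this thread; the anchored variant is an assumption about what whole-string validation would look like.

```php
<?php
// Pattern from the question, separator kept as `#` as written in this thread.
$core = '([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}#(([A-Z0-9]+([-][A-Z0-9])*){2,}\.)+([A-Z0-9]+([-][A-Z0-9])*){2,}';

$unanchored = '/' . $core . '/i';
$anchored   = '/^(?:' . $core . ')$/i'; // assumption: the whole string must be an address

// Only two alphanumeric characters before the `#`, so three repetitions cannot be satisfied.
$tooFewAlnum = preg_match($unanchored, 'a5________#gmail.com');

// Three alphanumeric "starts" before the `#`, so the group can repeat three times.
$threeGroups = preg_match($unanchored, 'a____a___a___#gmail.com');

// Without anchors a valid-looking substring is enough for a match.
$junk = '&^$##&$^##&aaa5a55a55a#gmail.comADA;\'DROP TABLE :)';
$junkUnanchored = preg_match($unanchored, $junk);
$junkAnchored   = preg_match($anchored, $junk);

var_dump($tooFewAlnum, $threeGroups, $junkUnanchored, $junkAnchored);
```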

Related

PHP - preg_match_all - Loose email match pattern that allows spaces and double #

I am going through our old site files and data that has our members emails and correspondence for 10 years.
I am extracting all of the email addresses (and botched email entries) and adding them to our new site's db.
It was a beginner's attempt at a CMS and had no error checking or validation.
So, I am having trouble matching emails with spaces and double #.
jam # spa ces1.com
jam#spac es2.com
jam##doubleats.org
I have constructed this loose regex that intentionally allows for a whole bunch of incorrect email formats but, the above three are examples of ones I can't figure out.
Here is my current "working" code:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)#([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b';
$pattern="/$pattern1|$pattern2/i";
$isago = preg_match_all($pattern,$text,$matches);
if ($isago) {.......
I need another pattern that would allow the three email examples above to be recognized as email addresses. (actual validation comes later)
Also, are there any other patterns I could use that would allow me to recognize possible emails in the files?
Thanks for any help.
For the third case you can change your # to #{1,2}.
For the first and second you can add a space in your regex pattern1:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)#{1,2}([ ]+|)([ a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+#{1,2}[A-Z0-9.-]+\.[A-Z]{2,4}\b';
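To see the adjusted patterns against the three problem entries, here is a small sketch. The patterns are copied from the lines above (with `#` as the separator, as written in this thread) and combined the same way the question does; the $samples and $matched names are only for illustration.

```php
<?php
// Adjusted patterns copied from the answer, combined as in the question.
$pattern1 = '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)#{1,2}([ ]+|)([ a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2 = '\b[A-Z0-9._%+-]+#{1,2}[A-Z0-9.-]+\.[A-Z]{2,4}\b';
$pattern  = "/$pattern1|$pattern2/i";

// The three botched entries from the question.
$samples = ['jam # spa ces1.com', 'jam#spac es2.com', 'jam##doubleats.org'];

$matched = [];
foreach ($samples as $sample) {
    // Keep the full matched text, or null when nothing matched.
    $matched[$sample] = preg_match($pattern, $sample, $m) === 1 ? $m[0] : null;
}

print_r($matched);
```

Because the domain class now contains a space, the pattern can also swallow neighbouring words in running text, so the matches still need the later validation pass the question mentions.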
This answer is a bit of a joke, I know... but how about this RegEx:
/[\S ]+#[\S ]+\.[\S ]+/i
Does that work for you? I tested it on a document and it matched the three emails.
For general purpose you should use something like this:
/[A-Za-z0-9\._]+#[A-Za-z0-9\._]+\.[A-Za-z0-9\._]+/i
With that you would match all the emails, even separated by newline or commas.

Why does this regex take so long to find email addresses in certain files?

I have a regular expression that looks for email addresses ( this was taken from another SO post that I can't find and has been tested on all kinds of email configurations ... changing this is not exactly my question ... but understand if that is the root cause ):
/[a-z0-9_\-\+]+#[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
I'm using preg_match_all() in PHP.
This works great for 99.99...% of files I'm looking in and takes around 5ms, but occasionally takes a couple minutes. These files are larger than the average webpage at around 300k, but much larger files generally process fine. The only thing I can find in the file contents that stands out is strings of thousands of consecutive "random" alphanumeric characters like this:
wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...
Here are two pages causing the problem. View source to see the long strings.
http://www.ashrae.org/members/page/607
http://www.ashrae.org/publications/page/2010ajindex
Any thoughts on what is causing this?
--FINAL SOLUTION--
I tested various regexes suggested in the answers. #FailedDev's answer helped and dropped processing time from a few minutes to a few seconds. #hakre's answer solved the problem and reduced processing time to a few hundred milliseconds. Below is the final regex I used. It's #hakre's second suggestion.
/[a-z0-9_\-\+]{1,256}+#[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?
For example, you're using + to match one or more chars. Let's say you have a string of 10 000 chars. The regex must look at 10 000 combinations to find the largest match. Then you combine it with similar ones. Let's say you have a string with 20 000 chars and two + groups. How many ways could they match in the file? Probably 10 000 x 10 000 possibilities. And so on and so forth.
If you can limit the number of characters (this looks a bit like you're looking for email patterns), probably limit the email address domain name to 256 and the address itself to 256 characters. Then this would be 256 x 256 possibilities to test "only":
/[a-z0-9_\-\+]{1,256}#[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i
That's probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:
/[a-z0-9_\-\+]{1,256}+#[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
Which should speed it up again.
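As a sanity check that bounding and possessifying the quantifiers does not change what is found, the two patterns can be run side by side. This is only a sketch: the input text is made up (a long alphanumeric run like the one in the question, followed by a single address with `#` as the separator, as written in this thread), and it compares results rather than timings.

```php
<?php
// Bounded and bounded-plus-possessive patterns from the answer above.
$bounded    = '/[a-z0-9_\-\+]{1,256}#[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i';
$possessive = '/[a-z0-9_\-\+]{1,256}+#[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i';

// Hypothetical input: a long run with no separator, then one address.
$text = str_repeat('wEPDwUKMTk0ODI3Nzk5MQ9k', 20) . ' contact: info#example.com';

preg_match_all($bounded, $text, $boundedMatches);
preg_match_all($possessive, $text, $possessiveMatches);

// Index 0 holds the full matches.
$boundedHits    = $boundedMatches[0];
$possessiveHits = $possessiveMatches[0];

print_r($boundedHits);
print_r($possessiveHits);
```

To measure the speed difference on real files, wrapping each preg_match_all() call in microtime(true) readings is one simple option.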
My best guess would be to try using possessive quantifiers:
[a-z0-9_\-\+]+
to
[a-z0-9_\-\+]++
This should fail the regex faster so it may improve performance in these situations.
Edit:
Maybe atomic grouping could also help :
/(?>[a-z0-9_\-+]++)#(?>[a-z0-9\-]++\.)(?>[a-z]{2,3})(?:\.[a-z]{2})?/
You should first go with option one. It would be interesting to see if there is any difference by also using option two.
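The effect of a possessive quantifier or an atomic group is easiest to see on a toy pattern. The sketch below is not the email regex; it uses a deliberately tiny example where the ordinary greedy quantifier has to give a character back in order to match.

```php
<?php
// Greedy: a+ first takes "aaa", then backtracks one "a" so that "ab" can match.
$greedy = preg_match('/a+ab/', 'aaab');

// Possessive: a++ keeps all three "a"s, so the following "a" can never match.
$possessive = preg_match('/a++ab/', 'aaab');

// Atomic group: same behaviour as the possessive form.
$atomic = preg_match('/(?>a+)ab/', 'aaab');

var_dump($greedy, $possessive, $atomic);
```

In the email pattern nothing is lost by refusing to backtrack, because the character that must follow each class (# or .) is not itself a member of that class.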

PHP Regular Expressions - Cannot get my head around

I am trying to create 3 PHP regular expressions which do three things..
Gets emails e.g mr.jones#apple-land.com
Gets dates e.g 31/05/90 or 31-Jun-90
Gets nameservers e.g ns1.apple.co.uk
I have a big chunk of text and want to extract these things from it.
What I have so far is:
$regexp = '/[A-Za-z0-9\.]+[#]{1}[A-Za-z0-9\.]+[A-Za-z]{2,4}/i';
preg_match_all($regexp, $output, $email);
$regexp = '/[A-Za-z0-9\.]+[^#]{1}/i';
preg_match_all($regexp, $output, $nameservers);
$regexp = '/[0-9]{2,4}[-\/]{1}([A-Za-z]{3}|[0-9]{2})[-\/]{1}[0-9]{2,4}/i';
preg_match_all($regexp, $output, $dates);
Dates and emails work, but I don't know if that is an efficient way to do it..
Nameservers just don't work at all.. essentially I want to find any combinations of letters and numbers which have dots in between but not # symbols..
Any help would be greatly appreciated.
Thanks
RegEx's for emails are fairly complex. This is one place where frameworks shine. Most of the popular ones have validation components which you can use to solve these problems. I'm most familiar with ZendFramework validation, and Symfony2 and CakePHP also provide good solutions. Often these solutions are written to the appropriate RFC specification and include support for things that programmers often overlook, like the fact that + is valid in an email address. They also protect against common mistakes that programmers make. Currently, your email regex will allow an email address that looks like this: .#.qt, which is not valid.
Some may argue that using a framework to validate an email or hostname (which can have a - in it as well) is overkill. I feel it is worth it.
essentially I want to find any combinations of letters and numbers
which have dots in between but not # symbols..
regexp for finding all letters and numbers which have dots in between:
$regexp = '/[A-Za-z0-9]{1,}(\.[A-Za-z0-9]{1,}){1,}/i';
Please note that you don't have to make it explicit you don't want '#' if what you are matching on doesn't include the #.
I would recommend using different patterns for your examples:
[\w\.-]+#[\w.-]+\.[a-zA-Z]{2,4} for emails.
\d{1,2}[/-][\da-zA-Z]{1,3}[/-]\d{2,4} for dates.
([a-zA-Z\d]+\.){2,3}[a-zA-Z\d]+ for nameservers.
Good luck ;)
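Put together, the three patterns can be run over one chunk of text like this. It is a sketch under a few assumptions: the sample text is invented from the examples in the question, `#` is kept as the separator as written in this thread, `~` is used as the delimiter so the slashes in the date pattern need no escaping, and the email pattern allows `-` and `.` in the domain so that mr.jones#apple-land.com is matched.

```php
<?php
// Hypothetical sample text built from the examples in the question.
$output = "Registrant: mr.jones#apple-land.com\n"
        . "Created: 31/05/90, updated: 31-Jun-90\n"
        . "Name server: ns1.apple.co.uk\n";

// One pattern per kind of value; `~` is the delimiter.
$patterns = [
    'emails'      => '~[\w.-]+#[\w.-]+\.[a-zA-Z]{2,4}~',
    'dates'       => '~\d{1,2}[/-][\da-zA-Z]{1,3}[/-]\d{2,4}~',
    'nameservers' => '~([a-zA-Z\d]+\.){2,3}[a-zA-Z\d]+~',
];

$found = [];
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $output, $matches);
    $found[$name] = $matches[0]; // index 0 holds the full matches
}

print_r($found);
```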
For the nameservers I would suggest using: /[^.](\.[a-z_\d]+){3,}/i

how to detect telephone numbers in a text (and replace them)?

I know it can be done for bad words (checking an array of preset words) but how to detect telephone numbers in a long text?
I'm building a website in PHP for a client who needs to avoid people using the description field to put their mobile phone numbers..(see craigslist etc..)
Besides, he's going to need some moderation anyway, but I was wondering if there is a way to block at least the obvious formats like nnn-nnn-nnnn. I'm not asking to block other weird ways of writing numbers, like HeiGHT*/four*/nine etc...
Welcome to the world of regular expressions. You're basically going to want to use preg_replace to look for (some pattern) and replace with a string.
Here's something to start you off:
$text = preg_replace('/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/', '[blocked]', $text);
this looks for:
a plus symbol (optional), followed by a number, followed by between 4-20 numbers, brackets, dashes or spaces, followed by a number
and replaces with the string [blocked].
This catches all the obvious combinations I can think of:
012345 123123
+44 1234 123123
+44(0)123 123123
0123456789
Placename 123456 (although this one will leave 'Placename')
however it will also strip out any succession of 6+ numbers, which might not be desirable!
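Here is that replacement wrapped in a small function and run over the examples, plus one case that shows the false positive. The function name blockPhoneNumbers and the extra sample sentences are made up for illustration.

```php
<?php
// Hypothetical helper around the preg_replace() call from the answer above.
function blockPhoneNumbers(string $text): string
{
    // preg_replace() returns null on error; fall back to the untouched text then.
    return preg_replace('/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/', '[blocked]', $text) ?? $text;
}

$samples = [
    '012345 123123',
    '+44 1234 123123',
    '+44(0)123 123123',
    '0123456789',
    'Placename 123456',
    'Order 20231105 shipped', // not a phone number, but still replaced
    'Call me after 5pm',      // too few digits, left alone
];

foreach ($samples as $sample) {
    echo $sample, ' => ', blockPhoneNumbers($sample), PHP_EOL;
}
```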
To do so you must use regular expressions as you may know.
I found this pattern that could be useful for your project:
<?php
preg_match("/(^(([\+]\d{1,3})?[ \.-]?[\(]?\d{3}[\)]?)?[ \.-]?\d{3}[ \.-]?\d{4}$)/", $yourText, $matches);
//matches variable will contain the array of matched strings
?>
More information about this pattern can be found here http://gskinner.com/RegExr/?2rirv where you can even test it online. It's a great tool to test regular expressions.
preg_match($pattern, $subject) will return 1 (true) if pattern is found in subject, and 0 (false) otherwise.
A pattern to match the example you give might be '/\d{3}-\d{3}-\d{4}/'
However whatever you choose for your pattern will suffer from both false positives and false negatives.
You might also consider looking for words like mob, cell or tel next to the number.
The full details of the PHP pattern matching can be found at http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
Ian
p.s. It can't be done for bad words, as the people in Scunthorpe will tell you.
I think that using too tight a regular expression would lose a great number of detections.
You should check for runs of 10 consecutive characters containing more than 5 digits.
So you will likely have an analysis routine queued to run after each message insertion, because of the computational weight.
After the 6 or more digits have been isolated, replace them as you prefer, including any sibling digits around them.
In any case it is better to preserve the original data, so you can try and train your detection algorithm until it works as well as possible.
Then you can also study your user data to create more complex heuristics, such as case-insensitive numbers written as letters, mixed forms, dot-separated digits, etc...
It's not about writing the most perfect regex; it's about approaching the problem statistically and dynamically.
And remember, after you take action, users will change their insertion habits as a consequence, so the stats will change and you will need to learn and update your heuristics.
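The "10 characters containing more than 5 digits" check can be sketched as a sliding window. The function name and the default thresholds are assumptions chosen to mirror the numbers above, and the window counts bytes, which is fine for ASCII digits but only approximate for multi-byte text.

```php
<?php
// Hypothetical helper: true if any $window consecutive bytes hold at least $minDigits digits.
function hasDigitDenseWindow(string $text, int $window = 10, int $minDigits = 6): bool
{
    $length = strlen($text);
    $digitsInWindow = 0;

    for ($i = 0; $i < $length; $i++) {
        // The character entering the window on the right.
        if (ctype_digit($text[$i])) {
            $digitsInWindow++;
        }
        // The character that just slid out of the window on the left.
        if ($i >= $window && ctype_digit($text[$i - $window])) {
            $digitsInWindow--;
        }
        if ($digitsInWindow >= $minDigits) {
            return true;
        }
    }

    return false;
}

var_dump(hasDigitDenseWindow('call 555 123 4567 now'));
var_dump(hasDigitDenseWindow('I have 2 cats and 3 dogs'));
```

Flagged messages could then go to the moderation queue with the original text preserved, as suggested above.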

Regex, encoding, and characters that look a like

First, a brief example, let's say I have this /[0-9]{2}°/ RegEx and this text "24º". The text won't match, obviously ... (?) really, it depends on the font.
Here is my problem: I do not have control over which chars the user types, so I need to cover all possibilities in the regex, /[0-9]{2}[°º]/, or, even better, ensure that the text contains only the char I'm expecting, °. But I can't just remove the unknown chars, otherwise the regex won't work; I need to change each one to the char that it looks like and that I'm expecting. I have done this with a little function that maps "looks like" to "what I expect" and replaces it. The problem is that I have not covered all possibilities. For example, today I found a new -, so now we have three of them, just like LaTeX =D - -- --- , cool, but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has unicode character classes that cover things like this, but as you can see here, :So covers too much. Also as it's not PHP doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degree Fahrenheit and celsius. There are tons of minus signs unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
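One way to combine the /u modifier with the "map look-alikes to what I expect" idea from the question is sketched below. The map is an assumption and certainly incomplete, the function name is made up, and the input is assumed to be UTF-8 already. Code points are written as escapes so the script does not depend on the file's own encoding.

```php
<?php
// Assumed look-alike map; extend it as new variants turn up in user input.
$lookalikes = [
    "\u{00BA}" => "\u{00B0}", // masculine ordinal indicator -> degree sign
    "\u{02DA}" => "\u{00B0}", // ring above -> degree sign
    "\u{2013}" => '-',        // en dash -> hyphen-minus
    "\u{2014}" => '-',        // em dash -> hyphen-minus
    "\u{2212}" => '-',        // minus sign -> hyphen-minus
];

// Hypothetical helper: replace every mapped look-alike in a UTF-8 string.
function normalizeLookalikes(string $utf8Text, array $map): string
{
    return strtr($utf8Text, $map);
}

$input      = "24\u{00BA}"; // "24" followed by the masculine ordinal indicator
$normalized = normalizeLookalikes($input, $lookalikes);

// Two digits followed by a real degree sign, matched in UTF-8 mode.
$degreePattern = '/[0-9]{2}\x{00B0}/u';
$matchesBefore = preg_match($degreePattern, $input);
$matchesAfter  = preg_match($degreePattern, $normalized);

var_dump($matchesBefore, $matchesAfter);
```

Normalizing first keeps the regex itself small; the alternative is to list every accepted variant inside the character class.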
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
Ok, if you're looking to pull temp you'll probably need to start with changing a few things first.
Temperatures can come in 1 to 3 digits, so [0-9]{1,3} (and if someone is actually still alive to put in a four digit temperature then we are all doomed!) may be more accurate for you.
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).
