PHP Regular Expressions - Cannot get my head around - php

I am trying to create three PHP regular expressions, each of which does one thing:
Gets emails, e.g. mr.jones#apple-land.com
Gets dates, e.g. 31/05/90 or 31-Jun-90
Gets nameservers, e.g. ns1.apple.co.uk
I have a big chunk of text and want to extract these things from it.
What I have so far is:
$regexp = '/[A-Za-z0-9\.]+[#]{1}[A-Za-z0-9\.]+[A-Za-z]{2,4}/i';
preg_match_all($regexp, $output, $email);
$regexp = '/[A-Za-z0-9\.]+[^#]{1}/i';
preg_match_all($regexp, $output, $nameservers);
$regexp = '/[0-9]{2,4}[-\/]{1}([A-Za-z]{3}|[0-9]{2})[-\/]{1}[0-9]{2,4}/i';
preg_match_all($regexp, $output, $dates);
Dates and emails work, but I don't know if that is an efficient way to do it.
Nameservers just don't work at all. Essentially I want to find any combinations of letters and numbers which have dots in between but not # symbols.
Any help would be greatly appreciated.
Thanks

Regexes for emails are fairly complex. This is one place where frameworks shine. Most of the popular ones have validation components which you can use to solve these problems. I'm most familiar with Zend Framework validation, and Symfony2 and CakePHP also provide good solutions. Often these solutions are written to the appropriate RFC specification and include support for things that programmers often overlook, like the fact that + is valid in an email address. They also protect against common mistakes that programmers make. Currently, your email regex will allow an email address that looks like this: .#.qt, which is not valid.
Some may argue that using a framework to validate an email or hostname (which can have a - in it as well) is overkill. I feel it is worth it.
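The .#.qt case mentioned above can be checked directly. This is a minimal sketch, assuming the asker's email pattern exactly as posted (with # as the separator, as written throughout this question); the variable names are only illustrative.

```php
<?php
// The email pattern from the question, unchanged.
$regexp = '/[A-Za-z0-9\.]+[#]{1}[A-Za-z0-9\.]+[A-Za-z]{2,4}/i';

// Both character classes also allow a dot, so a "local part" and a
// "domain" made only of dots still satisfy the pattern.
$accepted = preg_match($regexp, '.#.qt') === 1;

var_dump($accepted); // bool(true)
```

Nothing in the pattern requires a letter or digit on either side of the #, which is the kind of gap a tested validator is meant to close.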

essentially I want to find any combinations of letters and numbers
which have dots in between but not # symbols..
regexp for finding all letters and numbers which have dots in between:
$regexp = '/[A-Za-z0-9]{1,}(\.[A-Za-z0-9]{1,}){1,}/i';
Please note that you don't have to make it explicit you don't want '#' if what you are matching on doesn't include the #.
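One thing worth checking when this pattern is run over text that also contains emails: it never consumes a #, but it can still match the dotted pieces on either side of one. The sketch below shows that on a made-up sample string, then tries a stricter variant with lookarounds. The stricter pattern is an assumption of mine, not part of the answer above.

```php
<?php
// Illustrative sample text: one nameserver and one email from the question.
$output = 'ns1.apple.co.uk mr.jones#apple-land.com';

// Pattern from the answer above: dot-separated runs of letters and digits.
$loose = '/[A-Za-z0-9]{1,}(\.[A-Za-z0-9]{1,}){1,}/i';
preg_match_all($loose, $output, $m);
$looseMatches = $m[0];
// ['ns1.apple.co.uk', 'mr.jones', 'land.com']

// Hypothetical stricter variant: allow hyphens inside labels, refuse to
// start in the middle of a longer token, and refuse a token that runs
// into a '#'.
$strict = '/(?<![#\w.-])[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+(?![\w.-]*#)/';
preg_match_all($strict, $output, $m);
$strictMatches = $m[0];
// ['ns1.apple.co.uk']
```

Whether the extra fragments matter depends on the input; if the text never mixes emails and nameservers, the simpler pattern may be enough.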

I would recommend using different patterns for your examples:
[\w\.-]+#\w+\.[a-zA-Z]{2,4} for emails.
\d{1,2}[/-][\da-zA-Z]{1,3}[/-]\d{2,4} for dates.
([a-zA-Z\d]+\.){2,3}[a-zA-Z\d]+ for nameservers.
Good luck ;)
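A quick way to try the three patterns above is to run them over a small sample. This sketch assumes ~ as the delimiter (the date pattern contains /) and uses a made-up sample string. Note that the email pattern uses \w+ for the domain, so it would not match the hyphenated domain from the question (apple-land.com).

```php
<?php
// Illustrative sample text; the email uses a domain without a hyphen.
$output = 'mr.jones#apple.com 31/05/90 31-Jun-90 ns1.apple.co.uk';

// "~" is used as the delimiter because the date pattern contains "/".
$emailPattern      = '~[\w\.-]+#\w+\.[a-zA-Z]{2,4}~';
$datePattern       = '~\d{1,2}[/-][\da-zA-Z]{1,3}[/-]\d{2,4}~';
$nameserverPattern = '~([a-zA-Z\d]+\.){2,3}[a-zA-Z\d]+~';

preg_match_all($emailPattern, $output, $email);
preg_match_all($datePattern, $output, $dates);
preg_match_all($nameserverPattern, $output, $nameservers);

// Index 0 of each result holds the full matches:
// $email[0]       is ['mr.jones#apple.com']
// $dates[0]       is ['31/05/90', '31-Jun-90']
// $nameservers[0] is ['ns1.apple.co.uk']

// \w does not include "-", so the question's own example is not matched.
$hyphenated = preg_match($emailPattern, 'mr.jones#apple-land.com'); // 0
```

If hyphenated domains are expected, replacing \w+ with something like [\w-]+ in the domain part would be one option to try.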

For the nameservers I would suggest using: /[^.](\.[a-z_\d]+){3,}/i

Related

Basic Regular Expression for

For some reason I always get stuck making anything past extremely basic regular expressions.
I'm trying to make a regular expression that kind of looks like a URL. I only want basic checking.
I would like it to match the following patterns where X is "something".
X://X.X
X://X.X... etc.
X.X
X.X... etc
If the string contains one of these patterns, it is sufficient checking for me. This way a url like www.example.com:8888 will still match. I have tried many different REGEX combinations with preg_match and cannot seem to get any to behave the way I want it to. I have consulted many other related REGEX questions on SO but my readings have not helped me.
Any help? I will be happy to provide more information if you would like but I don't know what else you would need.
It takes practice but here is one that I made using a regex tester (http://www.regextester.com/) to check my pattern:
^.+(:\/\/|\.)([a-zA-Z0-9]+\.)+.+
My approach is to slowly build my pattern from the beginning and add on one piece at a time. This cheatsheet is extremely helpful for remembering what everything is: http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
Basically the pattern starts at the beginning of the string and checks for any characters followed by either :// or . then checks for groupings of letters and numbers followed by a . ending with any number of characters.
The pattern could probably be improved with groupings to not pass on invalid characters. But this one was quick and dirty. You could replace the first and last . with the characters that would be valid.
UPDATE
Per the comments here is an updated pattern:
^.+?(:\/\/|\.)?([a-zA-Z0-9]+?\.)+.+
/^(.+:\/\/)?[^.]+\.[^.\/]+([.\/][^.\/]+)*$/
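To see how the last pattern behaves on the shapes from the question, it can be run with preg_match over a few sample strings. This is only a sketch; the sample strings are my own and stand in for the X placeholders.

```php
<?php
// The last pattern above: optional "scheme://", then at least two
// dot-separated parts, then any further "." or "/" separated parts.
$pattern = '/^(.+:\/\/)?[^.]+\.[^.\/]+([.\/][^.\/]+)*$/';

$samples = [
    'http://example.com',    // X://X.X
    'www.example.com:8888',  // X.X.X with a port
    'example.com',           // X.X
    'example',               // no dot at all
    'http://example',        // scheme but no dot
];

$results = [];
foreach ($samples as $sample) {
    $results[$sample] = preg_match($pattern, $sample) === 1;
}
// true, true, true, false, false (in the order listed above)
```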

PHP RegEx: Find Vulnerability Within Email Validation Pattern

The following regex pattern (for PHP) is meant to validate any email address:
^[\w.-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$
It says: "match one or more letters, digits, underscores, periods and/or dashes, followed by one and only one #, followed by one or more letters, digits, periods and/or dashes, followed by one and only one period, followed by two to six upper- and/or lower-case letters."
This seems to match any email address I can think of. Still, this feeling of getting it right is probably deceptive. Can someone knowledgeable please point out an obvious or not-so-obvious vulnerability in this pattern that I'm not aware of, which would make it not perform the email validation the way it's meant to?
(To foresee a possible response, I'm aware that filter_var() function offers a more robust solution, but I'm specifically interested in the regular expression in this case.)
NOTE: this is a theoretical question about PHP flavor of regex, NOT a practical question about validating emails. I merely want to determine the limitations of what is reasonably possible with regex in this case.
Thank you in advance!
Using regular expressions to validate emails is tricky.
Try the following email as an input to your regex, i.e. ^[\w.-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$:
abc#b...com
You can read more about email regex validation at http://www.regular-expressions.info/email.html
If you are doing this for an app then use email validation by sending an email to the address provided rather than using very complex regex.
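The consecutive-dots case can be reproduced in a few lines. The sketch below keeps the page's # separator as written, and the tightened pattern is a hypothetical variation of mine, not something proposed in the question or the answer.

```php
<?php
// Pattern from the question, with "/" delimiters added.
$original = '/^[\w.-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/';

// Hypothetical tightening: the domain becomes one or more dot-terminated
// labels, so two dots can never sit next to each other.
$tightened = '/^[\w.-]+#(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,6}$/';

$input = 'abc#b...com';

$originalAccepts  = preg_match($original, $input) === 1;        // true
$tightenedAccepts = preg_match($tightened, $input) === 1;       // false
$stillAcceptsGood = preg_match($tightened, 'abc#b.com') === 1;  // true
```

This closes one hole only; the local part still allows leading, trailing and repeated dots.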
The email address specification is pretty nuts. There are regexen out there that can do a full validation for it, but they are thousands of characters long. It may be better to parse it on your own, but PHP has a built in validator for email addresses:
filter_var($email, FILTER_VALIDATE_EMAIL);
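For reference, filter_var() with FILTER_VALIDATE_EMAIL returns the address itself when it passes and false when it does not, so the result is best compared with === false. The sketch below assumes that the # shown in the examples on this page stands in for @; the addresses are made up.

```php
<?php
// Assumption: real addresses use "@"; the "#" on this page is a stand-in.
$good = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);
$bad  = filter_var('abc@b...com', FILTER_VALIDATE_EMAIL);

var_dump($good); // string(16) "user@example.com"
var_dump($bad);  // bool(false)

// Typical use: compare against false rather than relying on truthiness.
$isValid = $good !== false;
```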
EDIT:
In answer to your specific question about an email address that will fail: any address whose local part is in quotes will fail, because your pattern doesn't account for quotes at all:
"explosion-pills"#aysites.com

PHP - preg_match_all - Loose email match pattern that allows spaces and double #

I am going through our old site files and data, which hold our members' emails and correspondence for 10 years.
I am extracting all of the email addresses (and botched email entries) and adding them to our new site's db.
It was a beginner's attempt at a CMS and had no error checking or validation.
So, I am having trouble matching emails with spaces and double #.
jam # spa ces1.com
jam#spac es2.com
jam##doubleats.org
I have constructed this loose regex that intentionally allows for a whole bunch of incorrect email formats but, the above three are examples of ones I can't figure out.
Here is my current "working" code:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)#([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b';
$pattern="/$pattern1|$pattern2/i";
$isago = preg_match_all($pattern,$text,$matches);
if ($isago) {.......
I need another pattern that would allow the three email examples above to be recognized as email addresses. (actual validation comes later)
Also, are there any other patterns I could use that would allow me to recognize possible emails in the files?
Thanks for any help.
For the third case you can change your # to #{1,2}.
For the first and second you can add a space in your regex pattern1:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)#{1,2}([ ]+|)([ a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+#{1,2}[A-Z0-9.-]+\.[A-Z]{2,4}\b';
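The two modified patterns can be checked against the three examples from the question. This is a sketch that tests each example on its own line; on running text the added space in the domain class may also swallow neighbouring words, which seems acceptable here since real validation comes later.

```php
<?php
// Modified patterns from the answer above, combined as in the question.
$pattern1 = '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)#{1,2}([ ]+|)([ a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2 = '\b[A-Z0-9._%+-]+#{1,2}[A-Z0-9.-]+\.[A-Z]{2,4}\b';
$pattern  = "/$pattern1|$pattern2/i";

$examples = [
    'jam # spa ces1.com',
    'jam#spac es2.com',
    'jam##doubleats.org',
];

$found = [];
foreach ($examples as $example) {
    if (preg_match($pattern, $example, $m)) {
        // pattern1 may capture surrounding whitespace, so trim the match.
        $found[] = trim($m[0]);
    }
}
// $found now holds all three examples unchanged.
```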
This answer is a bit of a joke, I know, but how about this regex:
/[\S ]+#[\S ]+\.[\S ]+/i
Does that work for you? I tested it on a document and it matched the three emails.
For general purpose you should use something like this:
/[A-Za-z0-9\._]+#[A-Za-z0-9\._]+\.[A-Za-z0-9\._]+/i
With that you would match all the emails, even separated by newline or commas.
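A short sketch of the general-purpose pattern with preg_match_all, on a made-up list separated by commas and a newline. Because the last character class includes a dot, an address at the end of a sentence picks up the trailing period, so the sketch trims it afterwards; that cleanup step is my addition.

```php
<?php
// Illustrative input: addresses separated by a comma and a newline,
// the last one followed by a sentence-ending period.
$text = "Write to a.b#x.com, c_d#y.org\nor e#z.net.";

$pattern = '/[A-Za-z0-9\._]+#[A-Za-z0-9\._]+\.[A-Za-z0-9\._]+/i';
preg_match_all($pattern, $text, $matches);

// Raw matches: ['a.b#x.com', 'c_d#y.org', 'e#z.net.']
$raw = $matches[0];

// Strip a trailing period picked up from the surrounding sentence.
$emails = array_map(fn ($match) => rtrim($match, '.'), $raw);
// ['a.b#x.com', 'c_d#y.org', 'e#z.net']
```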

Does PHP have a validation class or functions?

I'm searching for PHP validation functions. In my opinion regex is hard.
Does PHP have ready-made functions?
I want to validate these variables:
Telephone number
E-mail address
Maybe more...
And I want to disable HTML tags in incoming data from a textarea.
Take a look at Validate filters, filter_var.
For the HTML tags you can either remove them or escape them
Regular Expressions are going to be the most robust way to validate things like phone numbers and email addresses.
They're not too hard to learn if you have a good resource. And it's an excellent tool to have in your developer toolbox. Check out http://www.regular-expressions.info/tutorial.html. Otherwise, I'm sure you can find some existing validation functions by doing a search on Google or SO.
Take a look at strip_tags() for removing HTML tags from strings.
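The two options mentioned in the answers, removing tags or escaping them, look roughly like this. A minimal sketch with a made-up textarea value; note that strip_tags() removes the tags but keeps the text between them.

```php
<?php
// Illustrative textarea input.
$input = '<b>Hello</b> <script>alert(1)</script>';

// Option 1: remove the tags entirely (the inner text stays).
$stripped = strip_tags($input);
// 'Hello alert(1)'

// Option 2: keep the text as typed, but escape it so the browser shows
// the tags instead of interpreting them.
$escaped = htmlspecialchars('<b>Hello</b>', ENT_QUOTES, 'UTF-8');
// '&lt;b&gt;Hello&lt;/b&gt;'
```

Escaping at output time keeps the original data intact, which may be preferable if the text is ever shown in a non-HTML context.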
Take a look at filter_var
Validating telephone numbers and email addresses is very hard, and complicated by conflicting opinions as to what constitutes a valid phone number (UK only? Some other country only? Multiple countries? What about extensions? Should the number be prefixed with a +? Are hyphens or other separating characters allowed?) or email address (see one take on this).
PHP has no built-in phone number validator; for email addresses it offers only filter_var() with FILTER_VALIDATE_EMAIL.

Regex problem Email test

I have a problem with the pattern below:
/([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}#(([A-Z0-9]+([-][A-Z0-9])*){2,}\.)+([A-Z0-9]+([-][A-Z0-9])*){2,}/i
It matches email addresses, and I have a problem with this rule:
[A-Z0-9\.\_\+\-]*
If I remove the star it works, but I want these characters to appear 0 or more times. I tested it on http://regexpal.com/ and it works, but with preg_match_all (PHP) it doesn't.
Thanks
Why not use PHP's filter_var()?
filter_var('test#email.com', FILTER_VALIDATE_EMAIL)
There is no good regex to validate email addresses. If you absolutely MUST use regex, then maybe have a look at Validate an E-Mail Address with PHP, the Right Way. Although, this is by no means a perfect measure either.
Edit: After some digging, I came across Mailparse.
Mailparse is an extension for parsing and working with email messages. It can deal with RFC 822 and RFC 2045 (MIME) compliant messages.
Mailparse is stream based, which means that it does not keep in-memory copies of the files it processes - so it is very resource efficient when dealing with large messages.
First of all, there are plenty of resources for this available. A quick search for "email validation regex" yields tons of results... Including This One...
Secondly, the problem is not in the * character. The problem is in the whole block.
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
Look at what that's doing. It's basically saying match as many alpha-numerics as possible, then match as many alpha-numerics with other characters as possible, then repeat at least 3 and at most 64 times. That could be a LOT of characters...
Instead, you could do:
([A-Z0-9][A-Z0-9\.\_\+\-]{2,63})
Which will match a local part of at most 64 characters.
Oh, and this is the pain of parsing emails with regex.
There are plenty of other resources for validating email addresses (Including filter_var). Do some searching and see how the popular frameworks do it...
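The bounded local-part block suggested above can be tested on its own. In this sketch the ^ and $ anchors and the sample strings are my additions, used only so the length limits are visible.

```php
<?php
// Local-part block from the answer, anchored so it can be tested alone:
// one leading letter or digit, then 2 to 63 further allowed characters.
$localPart = '/^([A-Z0-9][A-Z0-9\.\_\+\-]{2,63})$/i';

$underscores = preg_match($localPart, 'a5________');          // 1
$tooShort    = preg_match($localPart, 'ab');                  // 0 (only 2 characters)
$maxLength   = preg_match($localPart, str_repeat('a', 64));   // 1
$tooLong     = preg_match($localPart, str_repeat('a', 65));   // 0
```

Unlike the original repeated group, this accepts a5________ because only the first character has to be a letter or digit.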
Try this regex :
/^[A-Z0-9][A-Z0-9\.\_\+\-]{3,64}#([A-Z0-9][-A-Z0-9]*\.)+[A-Z0-9]{2,}$/i
But as @Russell Dias said, you shouldn't use regex for emails.
While I agree with Russell Dias, I believe your issue is with this entire block:
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
Basically you are saying you want:
Letters or numbers, 1 or more times
Letters, numbers, or any of . _ + -, 0 or more times
Repeat the above between 3 and 64 times
You have the quantifier after the whole group:
([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}
So this requires a minimum of 3 alphanumeric characters, and something like this:
a5________#gmail.com
will not work, but this:
a____a___a___#gmail.com
will work. It is better to find a ready-made, well-tested regex.
Also, you don't have start and end anchors (^ and $), so something like this will pass:
&^$##&$^##&aaa5a55a55a#gmail.comADA;'DROP TABLE :)
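The three observations above can be reproduced with the pattern from the question. A sketch only: the anchored variant is built here by wrapping the same pattern body in ^ and $, which is my illustration of the missing anchors rather than a recommended final regex.

```php
<?php
// Pattern from the question, unchanged.
$original = '/([A-Z0-9]+[A-Z0-9\.\_\+\-]*){3,64}#(([A-Z0-9]+([-][A-Z0-9])*){2,}\.)+([A-Z0-9]+([-][A-Z0-9])*){2,}/i';

// Same pattern body wrapped in ^...$ (strip the leading "/" and trailing "/i").
$anchored = '/^' . substr($original, 1, -2) . '$/i';

// Only two alphanumerics in the local part: the group cannot repeat 3 times.
$twoAlnum = preg_match($original, 'a5________#gmail.com');       // 0

// Three alphanumerics, each starting a repetition of the group.
$threeAlnum = preg_match($original, 'a____a___a___#gmail.com');  // 1

// Without anchors, an address-like substring inside junk is enough.
$junk = '&^$##&$^##&aaa5a55a55a#gmail.comADA;\'DROP TABLE :)';
$junkUnanchored = preg_match($original, $junk, $m);              // 1
$matchedPart = $m[0];                                            // 'aaa5a55a55a#gmail.comADA'
$junkAnchored = preg_match($anchored, $junk);                    // 0
```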
