How to make the regular expression correct? - php

I am not that familiar with regex or PHP. This line constantly returns a parse error when I try to detect email patterns
with preg_match, using the following pattern, which I changed from ereg:
if(!preg_match("/^(([A-Za-z0-9!#$%&'*+/=?^_{|}~-][A-Za-z0-9!#$%&'*+\/=?^_{|}~\.-]{0,63})|(\"[^(\|\")]{0,62}\"))$\", $local_array[$i]))
and:
if(!preg_match('/^(([A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9])\|([A-Za-z0-9]+))$/', $domain_array[$i]) )
I tried adding / before and after the following pattern, and that seems OK:
^(([A-Za-z0-9!#$%&'*+/=?^_`{|}~-][A-Za-z0-9!#$%&'*+/=?^_`{|}~\.-]{0,63})|(\"[^(\\|\")]{0,62}\"))$
The rest says:
Parse error: syntax error, unexpected '","' (T_CONSTANT_ENCAPSED_STRING), expecting ',' or ')'
How do I make it correct? I get parse errors after switching from ereg to preg_match.
Thanks,
J.

Checking the validity of an e-mail according to the actual standard rather than just "[0-9A-z]@[0-9A-z]\\.(?i:[A-Z])"?
Fantastic. As someone who uses a hyphen in their e-mail address, I wish there were more web-developers like you!
Here's the regex to match according to the RFC standard:
"/^([0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~.\E]+|(?:[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~.\E]+\.\"(?:[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~. (),:;<>@[]\E]+|\\\\\\\\|\\\\\")+\"\.)+[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~.\E]+|\"(?:[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~. (),:;<>@[]\E]+|\\\\\\\\|\\\\\")+\")@(?:[0-9A-Za-z\-\.]+|\[[0-9A-Za-z\-\.]+\])$/"
Yikes. As you can see, there are multiple parts to that pattern. Splitting the check into if-statement logic is generally faster, and it reduces the eyesore that a pattern like this is.
So, if you care about that sort of thing, I would recommend writing a function to check the e-mail address like so:
1) Check that neither the local nor the domain part of the e-mail address has leading, trailing, or consecutive dots, and that the address is in the correct format, e.g.
if (!preg_match("/^\.|\.\.|\.@|@\.|\.$/",$email) && preg_match("/^[^@]+?@[^\\.]+?\..+$/",$email)) {
This ensures there is an '@' symbol for the next part, and if it fails here, it saves what would have been a lot of unnecessary computing.
2) Tokenize the e-mail address by '@':
$part = explode("@",$email);
3) Of course, there could be more than one '@', so if the array has more than 2 elements, loop through each and re-concatenate all but the final element, so that you get two strings: the local part (before the mandatory '@') and the domain part.
4) If the first element/local part of the address does not contain any quotation marks ("), then use this pattern:
$pattern = "/^[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~.\E]+$/";
5) Else if the local part begins AND ends with quotation marks, use this pattern:
$pattern = "/^\"(?:[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~. (),:;<>@[]\E]+|\\\\\\\\|\\\\\")+\"$/";
6) Else if the local part contains TWO quotation marks (one only would invalidate), use this pattern:
$pattern = "/^(?:[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~.\E]+\.\"(?:[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~. (),:;<>@[]\E]+|\\\\\\\\|\\\\\")+\"\.)+[0-9A-Za-z\/\Q!#$%&'*+-=?^_`{}|~.\E]+$/";
7) Else the first part is invalid.
8) If the local part was valid: If the second element/domain part is either encapsulated within square brackets ([]) or contains NO square brackets (you can just use substr and substr_count for this, since it will be much faster than regex), and it matches the pattern:
preg_match("/^\[?[0-9A-Za-z\-\.]+\]?$/",$domainPart);
Then it is valid.
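The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions of my own: the function name isPlausibleEmail is made up, the character classes are written out explicitly instead of using \Q...\E, and the split is done on the last '@' with strrpos rather than by exploding and re-concatenating.

```php
<?php
// Hypothetical helper; a sketch of steps 1-8, not a complete RFC validator.
function isPlausibleEmail(string $email): bool
{
    // Step 1: no leading/trailing/consecutive dots, no dot next to '@',
    // and the overall shape must be something@something.something.
    if (preg_match('/^\.|\.\.|\.@|@\.|\.$/', $email)) {
        return false;
    }
    if (!preg_match('/^[^@]+?@[^.]+?\..+$/', $email)) {
        return false;
    }

    // Steps 2-3: everything before the last '@' is the local part.
    $at     = strrpos($email, '@');
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Building blocks: an unquoted run and a quoted run of the local part.
    $atom   = '[0-9A-Za-z\/!#$%&\'*+=?^_`{}|~.-]+';
    $quoted = '"(?:[0-9A-Za-z\/!#$%&\'*+=?^_`{}|~. (),:;<>@\[\]-]+|\\\\\\\\|\\\\")+"';

    // Steps 4-7: choose the pattern by how quotation marks are used.
    if (strpos($local, '"') === false) {
        $pattern = '/^' . $atom . '$/';
    } elseif (substr($local, 0, 1) === '"' && substr($local, -1) === '"') {
        $pattern = '/^' . $quoted . '$/';
    } elseif (substr_count($local, '"') >= 2) {
        $pattern = '/^(?:' . $atom . '\.' . $quoted . '\.)+' . $atom . '$/';
    } else {
        return false; // a single stray quotation mark
    }
    if (!preg_match($pattern, $local)) {
        return false;
    }

    // Step 8: the domain is either fully wrapped in [] or has no brackets at all.
    $brackets = substr_count($domain, '[') + substr_count($domain, ']');
    $wrapped  = substr($domain, 0, 1) === '[' && substr($domain, -1) === ']';
    if ($brackets !== 0 && !($wrapped && $brackets === 2)) {
        return false;
    }
    return preg_match('/^\[?[0-9A-Za-z\-.]+\]?$/', $domain) === 1;
}

var_dump(isPlausibleEmail('john.doe@example.com'));
var_dump(isPlausibleEmail('john..doe@example.com'));
```

The cheap string checks run first, so obviously malformed input never reaches the larger patterns.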
Note: According to the standard, e-mail addresses can actually contain comments (why, I have no idea). The comments are not actually part of the e-mail address, and get removed when it is used. For that reason, I didn't bother matching them.
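As for the parse error itself: in the first snippet of the question the pattern ends with $\", and inside a double-quoted PHP string \" is an escaped quote, so the string never closes and PHP trips over the tokens that follow. The unescaped / inside the first character class would also end the pattern early, because / is the delimiter. A minimal sketch, assuming a nowdoc is acceptable in your code base (the name $localPattern and the sample inputs are mine):

```php
<?php
// A nowdoc hands the text to PCRE untouched, so no quote needs a PHP-level escape.
// Every / inside the pattern is escaped because / is the delimiter.
$localPattern = <<<'REGEX'
/^(([A-Za-z0-9!#$%&'*+\/=?^_`{|}~-][A-Za-z0-9!#$%&'*+\/=?^_`{|}~.-]{0,63})|("[^(\\|")]{0,62}"))$/
REGEX;

// Hypothetical inputs, only to show that the pattern compiles and matches.
foreach (['john.doe', '"quoted"', 'a b'] as $local) {
    echo $local, ' => ', preg_match($localPattern, $local), PHP_EOL;
}
```

A single-quoted string works too, as long as each ' inside the pattern is written as \'.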

Related

Match multiple characters without repetition on a regular expression

I'm using PHP's PCRE, and there is one bit of the regex I can't seem to do. I have a character class with 5 characters [adjxz] which can appear or not, in any order, after a token (|) on the string. They all can appear, but they can only each appear once. So for example:
*|ad - is valid
*|dxa - is valid
*|da - is valid
*|a - is valid
*|aaj - is *not* valid
*|adjxz - is valid
*|addjxz - is *not* valid
Any idea how I can do it? A simple [adjxz]+, or even [adjxz]{1,5}, does not work, as both allow repetition. Since the order does not matter either, I can't use /a?d?j?x?z?/, so I'm at a loss.
Perhaps using a lookahead combined with a backreference like this:
\|(?![adjxz]*([adjxz])[adjxz]*\1)[adjxz]{1,5}
If you know these characters are followed by something else, e.g. whitespace you can simplify this to:
\|(?!\S*(\S)\S*\1)[adjxz]{1,5}
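To try the lookahead version from PHP, something like the following should do. It is a sketch under two assumptions of mine: the helper name hasUniqueFlags is made up, and the flags are taken to run to the end of the string, hence the added $ anchor.

```php
<?php
// Hypothetical helper: true when the characters after | are drawn from [adjxz]
// and none of them occurs twice.
function hasUniqueFlags(string $s): bool
{
    // The negative lookahead rejects any run in which a captured character (\1) reappears.
    return preg_match('/\|(?![adjxz]*([adjxz])[adjxz]*\1)[adjxz]{1,5}$/', $s) === 1;
}

foreach (['*|ad', '*|dxa', '*|da', '*|a', '*|aaj', '*|adjxz', '*|addjxz'] as $s) {
    echo $s, ' => ', hasUniqueFlags($s) ? 'valid' : 'not valid', PHP_EOL;
}
```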
I think you should break this in 2 steps:
A regex to check for unexpected characters
A simple PHP check for duplicated characters
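function strIsValid($str) {
    // The literal | must be escaped; unescaped, it splits the pattern into two alternatives.
    if (!preg_match('/^\*\|([adjxz]+)$/', $str, $matches)) {
        return false;
    }
    // Valid only if no character occurs more than once.
    return strlen($matches[1]) === count(array_unique(str_split($matches[1])));
}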
I suggest using reverse logic where you match the unwanted case using this pattern
\|.*?([adjxz])(?=.*\1)

Regular expression to replace broken email links

Problem: authors have added email addresses wrongly in a CMS - missing out the 'mailto:' text.
I need a regular expression, if possible, to do a search and replace on the stored MySQL content table.
Cases I need to cope with are:
No 'mailto:'
'mailto:' is already included (correct)
web address not email - no replace
multiple mailto: required (more than one in string)
Sample string would be: (line breaks added for readability)
add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com
Required output would be:
add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com
What I tried (in PHP) and issues:
pattern: /href="(.+?)(@)(.+?)(<\/a> )/iU
replacement: href="mailto:$1$2$3$4
This is adding mailto: to the correctly formatted mailto: and acting greedily over the last two links.
Thanks for any help. I have looked about, but am running out of time on this as it was an unexpected content issue.
If you are able to save me time and give the SQL expression, that would be even better.
Try replacing
/href="(?!(mailto:|http:\/\/|www\.))/iU
with
href="mailto:
(?!...) is a negative lookahead; it loosely means "the next characters aren't these".
Alternative:
Replace
/(href=")(?!mailto:)([^"]+@)/iU
with
$1mailto:$2
[^"]+ means 1 or more characters that aren't ".
You'd probably need a more complex matching pattern for guaranteed correctness.
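A quick way to try the alternative is sketched below. The anchor markup in $html is hypothetical, since the question's original HTML is not shown, and the U modifier is left out because the pattern does not need it.

```php
<?php
// Hypothetical sample markup covering the four cases from the question.
$html = '<a href="add1@test.com">add1@test.com</a> and '
      . '<a href="mailto:add2@test.com">add2@test.com</a> and '
      . '<a href="http://www.example.com/">real web link</a> '
      . 'second one to replace <a href="add3@test.com">add3@test.com</a>';

// [^"]+ cannot run past the closing quote, so the web link (no @ inside the quotes)
// is left alone, and the lookahead skips links that already start with mailto:.
$fixed = preg_replace('/(href=")(?!mailto:)([^"]+@)/i', '$1mailto:$2', $html);

echo $fixed, PHP_EOL;
```

Whatever pattern you settle on, it is worth running it over a copy of the table before touching the live content.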
MySQL REGEX matching:
See this or this.
You need to apply a proper mail pattern first (e.g. Using a regular expression to validate an email address), then search for an optional mailto: before the address (e.g. (mailto:|)); preg_replace_callback suits this well.
This appears to work as you wish (it searches only for email addresses in double quotes):
$s = 'add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com';
echo preg_replace_callback(
    '~"(mailto:|)([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))"~i',
    function($m) {
        // print_r($m); // debug
        return '"mailto:'. $m[2] .'"';
    },
    $s
);
Output, as desired:
add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com
Use the following as the pattern:
/(href=")(?!mailto:)([^"]+?@[^"]+?")/i
and replace it with
$1mailto:$2
(?!mailto:) is a negative lookahead checking whether a mailto: follows. If there is none, the remaining part is checked for a match. ([^"]+?@[^"]+?") matches one or more non-quote characters followed by an @, followed by one or more non-quote characters, followed by a ". Both + are non-greedy. Note that the U modifier is left out because it inverts greediness, which would make +? greedy, and that [^"] is used rather than . so the match cannot run past the closing quote into a later link.
The matched text is replaced with the first capture group (href="), followed by mailto:, followed by the second capture group (up to the closing ").

regex for email validation: where is the error?

This sounds strange, but I've been using this function for quite a while now and "suddenly, from one day to the other" it does not filter some addresses in the right way anymore. However, I cannot see why...
function validate_email($email)
{
    /*
    (Name) Letters, Numbers, Dots, Hyphens and Underscores
    (@ sign)
    (Domain) (with possible subdomain(s) ).
    Contains only letters, numbers, dots and hyphens (up to 255 characters)
    (. sign)
    (Extension) Letters only (up to 10 (can be increased in the future) characters)
    */
    $regex = '/([a-z0-9_.-]+)'. # name
        '@'. # at
        '([a-z0-9.-]+){2,255}'. # domain & possibly subdomains
        '.'. # period
        '([a-z]+){2,10}/i'; # domain extension
    if($email == '') {
        return false;
    }
    else {
        $eregi = preg_replace($regex, '', $email);
    }
    return empty($eregi) ? true : false;
}
For example, "some@gmail" is reported as valid, so it seems something is wrong with the TLD part. Could anybody tell me why?
Thank you very much in advance!
. means any character. You should escape it if you actually mean 'dot': \.
Your regex also has some other problems:
The character classes only list lowercase letters; uppercase is matched solely because of the i modifier. Without it you would need [a-zA-Z0-9]
No unicode characters are allowed in your regex (for example email addresses with é, ç, ... etc)
Some special characters such as + are in fact allowed in an email address
...
I would keep the email validation very simple: check that an @ is present and pretty much leave it at that. If you really want to validate an email address, the regex becomes gruesome.
Check this SO answer for a more detailed explanation.
What you commented with "period":
'.'. # period
is in fact a placeholder for any character. It should be \. instead.
However, you're overcomplicating things. Such validation should exist to reject either empty fields or obviously wrong stuff (e.g. a name put in the email field). So in my experience the best check is just to look whether it contains an @ and not worry too much about getting the structure right. You can in fact write a regex which will faithfully validate any valid email address and reject any invalid one. It's a monster spanning about a screen of text. Don't do that. KISS.
I think the error is in this line:
'.'. # period
You mean a literal period here. But periods have a special meaning in regular expressions (they mean "any character").
You need to escape it with a backslash.
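For illustration, here is a sketch of the question's function with the period escaped. The ^ and $ anchors, the flattened quantifiers, and the switch from preg_replace to preg_match are my own assumptions rather than something the answers above prescribe.

```php
<?php
// Sketch only: same character sets as the question, with a literal dot before the extension.
function validate_email(string $email): bool
{
    $regex = '/^[a-z0-9_.-]+'      // name
           . '@'                   // at
           . '[a-z0-9.-]{2,255}'   // domain and possibly subdomains
           . '\.'                  // literal period (escaped)
           . '[a-z]{2,10}$/i';     // domain extension
    return preg_match($regex, $email) === 1;
}

var_dump(validate_email('some@gmail'));
var_dump(validate_email('some@gmail.com'));
```

With the dot unescaped, "some@gmail" slips through because the . simply consumes one of the letters of "gmail".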
What about FILTER_VALIDATE_EMAIL?
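A minimal sketch of that approach, using PHP's built-in filter_var (the sample addresses are just placeholders):

```php
<?php
// filter_var returns the address itself on success and false on failure.
foreach (['user@example.com', 'some@gmail', 'not an email'] as $candidate) {
    $ok = filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
    echo $candidate, ' => ', $ok ? 'accepted' : 'rejected', PHP_EOL;
}
```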

Regular expression fun with emails; top level domain not required when it should be

I'm trying to create a regular expression that will filter valid emails using PHP and have run into an issue that conflicts with what I understand of regular expressions. Here is the code that I am using.
if (!preg_match('/^[-a-zA-Z0-9_.]+@[-a-zA-Z0-9]+.[a-zA-Z]{2,4}$/', $string)) {
    return false;
}
Now from the materials that I've researched, this should allow content before the @ to be multiple letters, numbers, underscores and periods, then afterwards allow multiple letters and numbers, then require a period, then two to four letters for the top level domain.
However, right now it ignores the requirement for having the top level domain section. For example a@b.c obviously is valid (and should be), but a@b is also returning as valid, which I want to be flagged as invalid.
I'm sure I'm missing something, but after browsing Google for an hour I'm at a loss as to what it could be. Anyone have an answer for this conundrum?
EDIT: The speed at which answers arrive here makes this site superior to its competitors. Well done!
You should escape . when it's not part of a character class: '/^[-a-zA-Z0-9_.]+@[-a-zA-Z0-9]+\.[a-zA-Z]{2,4}$/'
Otherwise it will match any character:
. - any character (except the newline \n, unless the s modifier is used)
\. - a literal dot
[.] - a literal dot (inside a character class)
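The difference is easy to see side by side. In this small sketch the variable names and the sample address are mine; the two patterns are the question's original and the escaped version.

```php
<?php
$unescaped = '/^[-a-zA-Z0-9_.]+@[-a-zA-Z0-9]+.[a-zA-Z]{2,4}$/';
$escaped   = '/^[-a-zA-Z0-9_.]+@[-a-zA-Z0-9]+\.[a-zA-Z]{2,4}$/';

// With the bare dot, one letter of "example" is consumed as "any character",
// so an address without a top level domain still matches.
var_dump(preg_match($unescaped, 'user@example'));
var_dump(preg_match($escaped, 'user@example'));
var_dump(preg_match($escaped, 'user@example.com'));
```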
Rather than rolling your own, perhaps you should read the article How to Find or Validate an Email Address on Regular-Expressions.info. The article also discusses reasons why you might not want to validate an email address using a regular expression and provides 3 regular expressions that you might consider using instead of your own.
From the page Comparing E-mail Address Validating Regular Expressions: Geert De Deckere from the Kohana project has developed a near perfect one:
/^[-_a-z0-9\'+*$^&%=~!?{}]++(?:\.[-_a-z0-9\'+*$^&%=~!?{}]+)*+@(?:(?![-.])[-a-z0-9.]+(?<![-.])\.[a-z]{2,6}|\d{1,3}(?:\.\d{1,3}){3})(?::\d++)?$/iD
There is also a built-in function in PHP, filter_var($email, FILTER_VALIDATE_EMAIL), but it seems to be under development. Another serious solution is PEAR:Validate. I think the PEAR solution is the best one.
An RFC822-compliant e-mail regex is available.
This is the most reasonable trade off of the spec versus real life that I have seen:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b
Of course, you have to remove the line breaks, and you have to update it if more top-level domains become available.
A single dot in a regular expression means "match any character". And that's exactly what it does when a top level domain is missing (also when it's present, of course).
Thus you should change your code like that:
if (!preg_match('/^[-a-zA-Z0-9_.]+@[-a-zA-Z0-9]+\.[a-zA-Z]{2,4}$/', $string)) {
    return false;
}
And by the way: a lot more characters are allowed in the local part than what your regular expression currently allows for.

Regular Expression to Detect a Specific Query

I wonder if anyone can construct a regular expression that can detect if a person searches for something like "site:cnn.com" or "site:www.globe.com.ph/". I've been having the most difficult time figuring it out. Thanks a lot in advance!
Edit: Sorry forgot to mention my script is in PHP.
OK, for input into an arbitrary text field, something as simple as the following will work:
\bsite:(\S+)
where the parentheses will capture whatever site/domain they're trying to search. It won't verify it as valid, but validating urls/domains is complex and there are many easily googlable regexes for doing that, for instance, there's one here.
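In PHP that could look like the sketch below. The helper name siteOperand is hypothetical, and returning null for an ordinary query is just one possible convention.

```php
<?php
// Hypothetical helper: returns the captured site, or null when the query has no site: operand.
function siteOperand(string $query): ?string
{
    return preg_match('/\bsite:(\S+)/', $query, $m) ? $m[1] : null;
}

var_dump(siteOperand('site:cnn.com'));
var_dump(siteOperand('news site:www.globe.com.ph/'));
var_dump(siteOperand('plain web search'));
```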
What are you matching against? A referer url?
Assuming you're matching against a referer url that looks like this:
http://www.google.com/search?client=safari&rls=en-us&q=whatever+site:foo.com&ie=UTF-8&oe=UTF-8
A regex like this should do the trick:
\bsite(?:\:|%3[aA])(?:(?!(?:%20|\+|&|$)).)+
Notes:
The colon after 'site' can either be unencoded or it can be percent encoded. Most user agents will leave it unencoded (which I believe is actually contrary to the standard), but this will handle both
I assumed the site:... url would be right-bounded by the equivalent of a space character, end of field (&) or end of string ($)
I didn't assume x-www-form-urlencoded encoding (spaces == '+') or spaces encoded with percent encoding (space == %20). This will handle both
The (?:...) is a non-capturing group. (?!...) is a negative lookahead.
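Applied to the sample referer URL above, a sketch might look like this. The function name is made up, and the capturing group around the site value is my addition so the domain can be read from $m[1].

```php
<?php
// Hypothetical helper: pulls the site: operand out of a search referer URL.
function siteFromReferer(string $url): ?string
{
    // Same pattern as above, wrapped in ~ delimiters, with the site value captured.
    $pattern = '~\bsite(?:\:|%3[aA])((?:(?!%20|\+|&|$).)+)~';
    return preg_match($pattern, $url, $m) ? $m[1] : null;
}

$referer = 'http://www.google.com/search?client=safari&rls=en-us&q=whatever+site:foo.com&ie=UTF-8&oe=UTF-8';
var_dump(siteFromReferer($referer));
```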
No, it's not for a referrer URL. My PHP script basically spits out information about a domain (e.g. backlinks, pagerank etc.) and I need that regex so it will know what the user is searching for. If the user enters something that doesn't match the regex, it does a regular web search instead.
If this is all you are trying to do, I guess I'd take the more simple approach and just do:
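$entry = $_REQUEST['q'];
// split() was removed in PHP 7; explode() with a limit of 2 keeps any later colons in the site part.
$tokens = explode(':', trim($entry), 2);
if (1 < count($tokens) && strtolower($tokens[0]) == 'site') {
    $site = $tokens[1];
}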
