I'm trying to detect and remove phone numbers from messages between users of my marketplace website (I think eBay does something similar).
This is the code I'm using:
$string = preg_replace('/([0-9]+[\- ]?[0-9]+)/', '', $string);
But it's too aggressive: it strips any number with 2 or more digits. How can I set a limit of, say, 7 digits instead?
To be more precise, the phone numbers can be in any format, like:
3747657654
374-7657654
374-765-7654
(374)765-7654
etc. (I cannot predict what the users will write; it depends on their habits.)
Try this regular expression :
/([0-9]+[\- ]?[0-9]{6,})/
changed to match your samples:
Regex101
That would depend on the exact requirements as now you have 1 or more numbers followed by an optional - or space followed by 1 or more numbers again.
If you wanted for example at least 2 numbers before the space or - followed by at least 5 numbers, you could use something like:
$string = preg_replace('/([0-9]{2,}[\- ]?[0-9]{5,})/', '', $string);
Here {2,} sets the minimum / maximum number of digits before the separator, and {5,} does the same for the digits after it (for example, {2,4} would mean 2 to 4 digits).
You can try something like this:
$string = preg_replace('/(?<![0-9]|[0-9]-)[0-9](?:[- ]?[0-9]){6}(?!-?[0-9])/', '', $string);
The lookarounds are here to avoid numbers with more than 7 digits, but if you want something more specific, you should provide an example string.
It is impossible to determine whether a number of X digits (where X is a valid phone number length) is a phone number or something else without some sort of context intelligence happening. A simple regex can't determine the difference between "call me at 3453456" and "call me when you've flown 3453456 miles".
Therefore trying to catch phone numbers without any formatting (just straight digits) with a regex is hopeless, pure and simple. Attempting to do so is only holding you back from finding a regex that can find formatted/semi-formatted numbers. What you should be going for here is "get the obvious and as many others as possible with minimal false positives...but recognize I can't get them all."
For that I'd recommend this:
/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/g
It should not get the first three, but get all the rest in this list:
no-match: 3747657654
no-match: 444444444444444
no-match: 7657654
match: 374-765-7654
match: 1-374-765-7654
match: (374)765-7654
match: (374) 765 7654
match: 765-7654
match: 1 (374) 765 7654
match: 1(374)765 7654
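To check the list above in PHP, a small sketch like the following could be used. PCRE in PHP has no /g flag (preg_replace and preg_match_all are already global), so it is dropped here. The function name looksFormattedPhone is made up for this illustration.

```php
<?php
// Hypothetical helper: true when $text contains a formatted phone number
// according to the pattern above (without the /g flag, which PHP rejects).
function looksFormattedPhone(string $text): bool
{
    $pattern = '/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/';
    return preg_match($pattern, $text) === 1;
}

$samples = array(
    '3747657654', '444444444444444', '7657654',
    '374-765-7654', '1-374-765-7654', '(374)765-7654',
    '(374) 765 7654', '765-7654', '1 (374) 765 7654', '1(374)765 7654',
);

// Print one "match:" / "no-match:" line per sample.
foreach ($samples as $sample) {
    echo (looksFormattedPhone($sample) ? 'match: ' : 'no-match: ') . $sample . "\n";
}
```

The pattern is not anchored, so it also finds a number in the middle of a longer message, which is what a preg_replace over user messages needs.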
I was hoping for a little help on this, as it's confusing me a little. I run a website that allows users to send messages back and forth, but in the inbox I need to hide both emails and phone numbers.
Example: this is what a sample message would look like.
Hi, my phone is +44 5555555 and email is jack#jack.com
I need it to be like this:
Hi, my phone is (phone hidden) and email is (email hidden)
Do you have any ideas? I'd really appreciate it!
$x = 'Hi, my phone is +44 5555555 and email is jack#jack.com';
$x = preg_replace('/[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}/i','(email hidden)',$x); // hide email
$x = preg_replace('/(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?/','(phone hidden)',$x); // hide phone number
echo $x; // Hi, my phone is +44 (phone hidden) and email is (email hidden)
// (the pattern targets North American numbers, so the "+44" country code is left in place)
Kudos to fatcat for the phone number regex.
Trying to do this with 100% accuracy when users can type all sorts of things in is impossible - you can't really definitively say if a substring is a phone number or just another number, or an email address or just something that could be a valid one.
However, if you want to try, you should probably use a regular expression. See http://php.net/manual/en/function.preg-replace.php
<pre>
/*
 * First parameter: the string to clean.
 * Second parameter: the replacement text, e.g. *******.
 * Returns the string with emails and phone numbers replaced.
 */
function email_phone_validation_replace_php($str='',$rep='*******') {
$str = preg_replace('/[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}/i',$rep,$str); // extract email
$str = preg_replace('/(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?/',$rep,$str); // extract phonenumber
return $str; // the message with emails and phone numbers replaced by $rep
}
$str = 'Hi, my name is amin khan banarsi <br> phone is +91 9770534045 and email is banarsiamin#gmail.com';
echo email_phone_validation_replace_php($str);
</pre>
If I understand correctly, your users can send messages to each other and you're worried that if they send a message with personal information in it that information might be too visible.
I guess that you're therefore trying to remove this information from the message's preview (but still have it available if you open the message?).
If this is the case then you can have a really sloppy regular expression removing anything that looks even a little bit like a number or email. It doesn't matter if you hide non-personal information because the non-censored version of the message is always available.
I would go with something like this (untested):
# Take any string that contains an # symbol and replace it with ...
# The # symbol must be surrounded by at least one character on both sides
$message = preg_replace('/[^ ]+#[^ ]+/','...',$message); # for emails
# Take any run of digits, spaces and dashes that starts and ends with a digit, replace with ...
# (anchoring on digits keeps lone spaces and dashes from being replaced)
# Can optionally have a + before it.
$message = preg_replace('/\+?[0-9][0-9\- ]*[0-9]/','...',$message); # for phone numbers
This is going to match lots of things, more than just emails and phone numbers. It may also miss emails and phone numbers in formats that I didn't think of; this is one of the problems with writing regular expressions for these kinds of things.
If you want to hide emails and phone numbers in your messages or chat, in PHP or any other language, you need to use regular expressions; read about regex on w3school.
I have an easy and complete solution for you.
<?php
$regex_email = '/[_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})/';
$regex_phone = "/[0-9]{5,}|\d[ 0-9 ]{1,}\d|\sone|\stwo|\sthree|\sfour|\sfive|\ssix|\sseven|\seight|\snine|\sten/i";
$str = " Hello My Email soroutlove1996#gmail.com AND Phone No. is +919992799999 and +91 9992799999, or 9 9 9 2 7 9 9 9 9 9 and Nine Nine";
$str = preg_replace($regex_email,'(email hidden)',$str); // extract email
$str = preg_replace($regex_phone,'(phone hidden)',$str); // extract phone
echo $str;
Output: Hello My Email (email hidden) AND Phone No. is +(phone hidden) and +(phone hidden), or (phone hidden) and(phone hidden)(phone hidden)
I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know will work is to have a regex like the following (this regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, and performance is acceptable. What bothers me about it, though, is that I have to generate this based on $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if later the code is modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.
Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
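The difference is easy to confirm with preg_match. A quick sketch, with sample strings made up for the purpose:

```php
<?php
// Inside a character class, | is a literal pipe, so [A|B] accepts "|".
$pipeInClass = preg_match('/^[A|B]$/', '|');

// The pattern quoted above: the outer | splits it into four independent
// alternatives, so a G at the end of the line is enough to match.
$original = '/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m';
$looseMatch = preg_match($original, 'XXXG');

// The corrected pattern checks all four positions.
$strict = '/^[AB][CD][EF][GH]$/';
$strictOnBad = preg_match($strict, 'XXXG');
$strictOnGood = preg_match($strict, 'ACFG');

var_dump($pipeInClass, $looseMatch, $strictOnBad, $strictOnGood); // int(1) int(1) int(0) int(1)
```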
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/m - but that's still the same solution, just in a more efficiently-processed form.)
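For comparison, the character-by-character approach could look something like the sketch below. The function names are made up for illustration, and it assumes the lines have already been read into an array without trailing newlines.

```php
<?php
// Count the positions at which two strings hold the same character.
function countMatchingPositions(string $a, string $b): int
{
    $length = min(strlen($a), strlen($b));
    $count = 0;
    for ($i = 0; $i < $length; $i++) {
        if ($a[$i] === $b[$i]) {
            $count++;
        }
    }
    return $count;
}

// Keep the lines that agree with $reference in at least $minimum positions.
function linesSharingPositions(array $lines, string $reference, int $minimum = 2): array
{
    $result = array();
    foreach ($lines as $line) {
        if (countMatchingPositions($line, $reference) >= $minimum) {
            $result[] = $line;
        }
    }
    return $result;
}

$lines = array('BCEG', 'ADFG', 'BDEH', 'BCFG', 'BDFG');
print_r(linesSharingPositions($lines, 'ACFG')); // BCEG, ADFG, BCFG, BDFG
```

Changing the threshold to 3, or the line length to 16, needs no change to the code, which is the scaling problem the generated regex has.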
You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
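function findMatchingLines($line, $required = 2) {
    static $file = null;
    if ($file === null) {
        // read once; FILE_IGNORE_NEW_LINES drops the trailing "\n" of each line
        $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES);
    }
    $search = str_split($line);
    $matching = array();
    foreach ($file as $l) {
        // Each position uses its own pair of letters, so a plain set
        // intersection counts the positions that agree.
        $matches = count(array_intersect($search, str_split($l)));
        if ($matches >= $required) { // number of common characters required
            $matching[] = $l;
        }
    }
    return $matching; // empty array when nothing matches
}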
There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
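Generating that alternation from the input can be sketched as below. buildPairRegex is a hypothetical name, and the sketch assumes the line contains only plain letters (anything else would need preg_quote).

```php
<?php
// Build a pattern that matches lines sharing at least two positions
// with $line, by listing every pair of positions (i, j).
function buildPairRegex(string $line): string
{
    $n = strlen($line);
    $alternatives = array();
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            // Start from all wildcards, then pin the two chosen positions.
            $alt = str_repeat('.', $n);
            $alt[$i] = $line[$i];
            $alt[$j] = $line[$j];
            $alternatives[] = $alt;
        }
    }
    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

echo buildPairRegex('ACFG'), "\n"; // /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m

preg_match_all(buildPairRegex('ACFG'), "BCEG\nADFG\nBDEH\nBCFG\nBDFG", $found);
print_r($found[0]); // BCEG, ADFG, BCFG, BDFG
```

For a line of length n this produces n(n-1)/2 alternatives, so a 16-character line already needs 120 of them.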
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
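function findMatchingLines($line, $subject) {
    // PCRE (and so PHP) does not support the variable-length lookbehind needed
    // to look back at a reference line placed first, so the reference line is
    // appended instead and captured by a lookahead that runs to the end (\z).
    $regex = '/^(?=[\s\S]*\n([AB])([CD])([EF])([GH])\z)'
           . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    $matchingLines = array();
    preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matchingLines);
    return $matchingLines[0]; // index 0 holds the full-line matches
}
What this function does is append the line you want to match against to your input string, then use a pattern whose lookahead captures that last line's 4 characters (the [\s\S]* skips over everything in between) and compares each earlier line back to them. The lookahead rescans the rest of the input for every line, so on a big file this will be slower than the generated pattern.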
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).
People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here, so it splits the whole pattern into separate alternatives. What that regex really says, in English, is: Either A or | or B at the start of a line, OR C or | or D anywhere, OR E or | or F anywhere, OR G or | or H at the end of a line.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternates between everything around it. In my English expression above each capitalized OR stands for one of your alternating |'s.
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C..|.E.|..G)|.C(?:E.|.G)|..EG)$/
as compared to:
/^(?:AC..|A.E.|A..G|.CE.|.C.G|..EG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for EG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go on to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);
I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
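/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m
It looks for two or more occurrences of one of the ACFG characters, surrounded by any other characters on the same line (the \n in the negated classes keeps a match from running into the next line). The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
    $chars = preg_quote($line, '/');
    // concatenate: in double quotes "{$num}" would be interpolated and lose its braces
    return '/^[^' . $chars . '\n]*+(?:[' . $chars . '][^' . $chars . '\n]*+){' . (int) $num . ',}$/m';
}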
I have a PHP script with which I need to validate several inputs.
Is there any reliable and very good regular expression to check against when it comes to telephone number, name and email address validation?
Could somebody please supply these, as I am very much a novice in regexp?
What I want is for example:
Telephone number: all digits allowed, must be at least 6 digits, max 12 digits, '+' sign allowed, space allowed, '-' sign allowed, as well as other things I haven't thought about yet.
Name: no digits allowed, only letters in both lower and uppercase. Also the three Swedish characters 'Å, Ä, Ö' in both lower and uppercase, also space, '-' sign allowed, and all others I haven't thought about.
Email: Email addresses are pretty standard all over the world, so I don't know exactly what to ask for here, but you probably know what I want.
Thanks for all help
As Andrew White said, emails shouldn't be validated [only] by regex, but you can check out this one:
'/^([\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+\.)*[\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+#((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(\:\d{1,5})?)$/i'
it's closest to the email address spec I've ever found (no tests have been found which it fails)... can't remember where it's from, will edit my answer as soon as I find it again
[EDIT]
Found it, definitely worth a read: http://fightingforalostcause.net/misc/2006/compare-email-regex.php
[EDIT]
This should do for the phone numbers:
<?php
function is_valid_phonenumber( $subject ) {
// strip all valid chars
$stripped = preg_replace( '{[0-9 +-]}', '', $subject );
// check if there are remains, if yes: fail
if( !empty( $stripped ) )
return false;
// get digit count by replacing everything except digits with nothing
$digits = strlen( preg_replace( '{[^0-9]}', '', $subject ) );
// invalid if less than 6 or more than 12 in length
if( $digits < 6 || $digits > 12 )
return false;
// if nothing fails before this, we're good to go
return true;
}
?>
Something similar can be done for the names, but don't forget the case-insensitive flag (i.e. '{pattern}i'). There are also some good regex cheat sheets out there, for example this one from addedbytes.com: http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/
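For the name rule from the question (letters only, including Å, Ä, Ö, plus space and '-'), a sketch along the same lines might look like this. The function name is_valid_name is an assumption, and the u modifier assumes the input is UTF-8.

```php
<?php
function is_valid_name( $subject ) {
    // a-z plus å, ä, ö (written as code points), space and hyphen;
    // i makes the match case-insensitive, u treats pattern and subject as UTF-8
    return preg_match( '/^[a-z\x{E5}\x{E4}\x{F6} -]+$/iu', $subject ) === 1;
}

var_dump( is_valid_name( "\u{C5}sa \u{D6}berg-Lind" ) ); // bool(true)
var_dump( is_valid_name( 'John2' ) );                    // bool(false)
```

This is deliberately narrow. Real names use many more characters than this, so treat it as a starting point rather than a rule.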
Validating telephone numbers with a regex is one thing but e-mails should not be validated by regex. I guess if you just wanted a very basic this-is-an-email regex you could use...
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
For telephone I think the following is good for US/Canada...
^[01]?[- .]?\(?[2-9]\d{2}\)?[- .]?\d{3}[- .]?\d{4}$
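In PHP that pattern could be wrapped roughly like this (looksLikeNanpNumber is a made-up name; the regex itself is unchanged apart from the added delimiters):

```php
<?php
// Hypothetical wrapper: true when the whole string is a US/Canada style
// number according to the pattern above.
function looksLikeNanpNumber(string $input): bool
{
    $pattern = '/^[01]?[- .]?\(?[2-9]\d{2}\)?[- .]?\d{3}[- .]?\d{4}$/';
    return preg_match($pattern, $input) === 1;
}

var_dump(looksLikeNanpNumber('(374) 765-7654')); // bool(true)
var_dump(looksLikeNanpNumber('1-374-765-7654')); // bool(true)
var_dump(looksLikeNanpNumber('074-765-7654'));   // bool(false): area code cannot start with 0
var_dump(looksLikeNanpNumber('765-7654'));       // bool(false): area code is required
```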
For names, good luck, since names can be just about anything, including numbers in some odd cases (Sr. Francis John 2nd vs II). That all said, I recommend you look into library-specific validators for each type if it really matters, but my PHP is a bit rusty so I don't have a recommendation there.
Have a read of: http://www.regular-expressions.info/email.html which discusses validating email addresses with Regex
I have a 10 digit string being passed to me, and I want to verify that it is a valid ASIN before doing more processing and/or redirection.
I know that a non ISBN ASIN will always be non-numeric and 10 characters in length
I just want to be able to tell if the item being passed is a valid ASIN or is it just a search string after I have already eliminated that it could be a ISBN.
For example "SOUNDBOARD" is a search term while "B000J5XS3C" is an ASIN and "1412775884" is an ISBN.
Is there a lightweight way to check ASIN?
Update, 2017
@Leonid commented that he’s found the ASIN BT00LLINKI.
Although ASINs don’t seem to be strictly incremental, the oldest non-ISBN ASINs do tend to have more zeros than newer ASINs. Perhaps it was inevitable that we’d start seeing ASINs with no zero padding (and then what, I wonder...). So we’re now looking for "B" followed by nine alphanumeric characters (or an ISBN); unfortunately, the "loss" of that zero makes it a lot easier to get a false positive.
/^(B[\dA-Z]{9}|\d{9}(X|\d))$/
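Since the question is about PHP, here is a rough sketch of the same check there. looksLikeAsin is a hypothetical name; the pattern is the one above, unchanged.

```php
<?php
// Hypothetical helper wrapping the updated pattern.
function looksLikeAsin(string $candidate): bool
{
    return preg_match('/^(B[\dA-Z]{9}|\d{9}(X|\d))$/', $candidate) === 1;
}

var_dump(looksLikeAsin('B000J5XS3C')); // bool(true)
var_dump(looksLikeAsin('BT00LLINKI')); // bool(true)
var_dump(looksLikeAsin('1412775884')); // bool(true), ISBN-10 shape
var_dump(looksLikeAsin('SOUNDBOARD')); // bool(false)
var_dump(looksLikeAsin('BASKETBALL')); // bool(true), a false positive
```

The last line shows the false-positive problem in practice: any ten-letter uppercase search term starting with "B" now passes.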
Original answer
In Javascript, I use the following regexp to determine whether a string is or includes what’s plausibly an ASIN:
/^\s*(B\d{2}[A-Z\d]{7}|\d{9}[X\d])\s*$/
or, without worrying about extra whitespace or capturing:
/^(B\d{2}[A-Z\d]{7}|\d{9}[X\d])$/
As others have mentioned, Amazon hasn't really revealed the spec. In practice I've only seen two possible formats for ASINs, though:
10-digit ISBNs, which are 9 digits + a final character which may be a digit or "X".
The letter B followed by two digits followed by seven ASCII-range alphanumeric characters (with alpha chars being uppercase).
If anyone has encountered an ASIN that doesn't fit that pattern, chime in. It may actually be possible to get more restrictive than this, but I'm not certain. Non-ISBN ASINs might only use a subset of alphabetic characters, but even if so, they do use most of them. Some seem to appear more frequently than others, at least (K, Z, Q, W...)
For PHP, there is a valid regular expression for ASINs here.
function isAsin($string){
$ptn = "/^(?:B[0-9]{2}[0-9A-Z]{7}|[0-9]{9}[X0-9])$/"; // anchored; last ISBN character is X or a digit
return preg_match($ptn, $string, $matches) === 1;
}
Maybe you could check on the Amazon site whether the ASIN exists:
http://www.amazon.com/dp/YOUR10DIGITASIN
This URL returns an HTTP status code of 200 when the product exists and a 404 if that was not a valid ASIN.
After trying a couple of solutions (including the top-voted answer), I found they did not work well in PHP (e.g. 8619203011 is reported as an ASIN).
Here is the solution that works very well:
function isAsin($string){
    $ptn = "/^(?i)(B0|BT)[0-9A-Z]{8}$/";
    // preg_match returns 1 on a match; always return a boolean
    return preg_match($ptn, $string) === 1;
}
$testAsins = array('k023l5bix8', 'bb03l5bix8', 'b143l5bix8', 'bt00plinki', ' ', '');
foreach ($testAsins as $testAsin) {
if(isAsin($testAsin)){
echo $testAsin." is ASIN"."<br>";
} else {
echo $testAsin." is NOT ASIN"."<br>";
}
}
Explanation:
/^(?i)(B0|BT)[0-9A-Z]{8}$/
/^ = Beginning
(?i) = Case in-sensitive
(B0|BT)= Starting with B0 or BT
[0-9A-Z]= any numbers or letters
{8} = 8 numbers or letters allowed (on top of +2 from B0 or BT)