I'm trying to remove / detect phone numbers in messages between users of my marketplace website (I think eBay does something similar).
This is the code I'm using:
$string = preg_replace('/([0-9]+[\- ]?[0-9]+)/', '', $string);
BUT... it's too aggressive: it strips any number with 2 or more digits. How can I set a limit of, say, 7 digits instead?
To be more precise, the phone numbers can be in any format, like:
3747657654
374-7657654
374-765-7654
(374)765-7654
etc. (I cannot predict what the users will write; it depends on their habits)
Try this regular expression :
/([0-9]+[\- ]?[0-9]{6,})/
changed to match your samples:
Regex101
That would depend on the exact requirements as now you have 1 or more numbers followed by an optional - or space followed by 1 or more numbers again.
If you wanted for example at least 2 numbers before the space or - followed by at least 5 numbers, you could use something like:
$string = preg_replace('/([0-9]{2,}[\- ]?[0-9]{5,})/', '', $string);
The two quantifiers, {2,} and {5,}, are where you can specify the minimum / maximum number of digits.
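To see how those bounds behave, here is a small sketch that runs the pattern above over a few inputs. The function name stripLongNumbers and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: removes whatever the pattern above matches.
function stripLongNumbers(string $text): string
{
    // {2,} = at least 2 digits before the optional separator,
    // {5,} = at least 5 digits after it.
    return preg_replace('/([0-9]{2,}[\- ]?[0-9]{5,})/', '', $text);
}

foreach (['I have 12 apples', 'call 374-7657654', 'call 374-765-7654'] as $sample) {
    var_dump(stripLongNumbers($sample));
}
```

The short number survives, and so does 374-765-7654, because no digit group after a separator is 5 digits long. Formats with two separators need a different pattern.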
You can try something like this:
$string = preg_replace('/(?<![0-9]|[0-9]-)[0-9](?:[- ]?[0-9]){6}(?!-?[0-9])/', '', $string);
The lookarounds are here to avoid numbers with more than 7 digits, but if you want something more specific, you should provide an example string.
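A quick sketch of what the lookarounds do in practice. The wrapper name stripSevenDigitNumbers and the sample sentences are assumptions for illustration, not part of the question.

```php
<?php
// Hypothetical wrapper around the lookaround pattern above.
function stripSevenDigitNumbers(string $text): string
{
    // Lookbehind: not preceded by a digit or by "digit-".
    // Lookahead: not followed by another digit (optionally after a dash).
    $pattern = '/(?<![0-9]|[0-9]-)[0-9](?:[- ]?[0-9]){6}(?!-?[0-9])/';
    return preg_replace($pattern, '', $text);
}

var_dump(stripSevenDigitNumbers('Call 765-7654 today'));      // 7 digits: removed
var_dump(stripSevenDigitNumbers('Order 3747657654 shipped')); // 10 digits: kept
```

Note that this pattern deliberately leaves longer numbers such as 374-765-7654 untouched, so it only covers the 7-digit case.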
It is impossible to determine whether a number of X digits (where X is a valid phone number length) is a phone number or something else without some sort of context intelligence happening. A simple regex can't determine the difference between "call me at 3453456" and "call me when you've flown 3453456 miles".
Therefore trying to catch phone numbers without any formatting (just straight digits) with a regex is hopeless, pure and simple. Attempting to do so is only holding you back from finding a regex that can find formatted/semi-formatted numbers. What you should be going for here is "get the obvious and as many others as possible with minimal false positives...but recognize I can't get them all."
For that I'd recommend this:
/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/g
It should not get the first three, but get all the rest in this list:
no-match: 3747657654
no-match: 444444444444444
no-match: 7657654
match: 374-765-7654
match: 1-374-765-7654
match: (374)765-7654
match: (374) 765 7654
match: 765-7654
match: 1 (374) 765 7654
match: 1(374)765 7654
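The list above can be checked with a few lines of PHP. This is only a sketch: the function name looksLikeFormattedPhone is made up, and the /g flag is dropped because PHP has no such modifier (preg_match_all plays that role).

```php
<?php
// Hypothetical helper: true if the text contains a formatted phone number.
function looksLikeFormattedPhone(string $text): bool
{
    $pattern = '/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/';
    return preg_match($pattern, $text) === 1;
}

$samples = [
    '3747657654', '444444444444444', '7657654',
    '374-765-7654', '1-374-765-7654', '(374)765-7654', '(374) 765 7654',
    '765-7654', '1 (374) 765 7654', '1(374)765 7654',
];

foreach ($samples as $sample) {
    echo (looksLikeFormattedPhone($sample) ? 'match:    ' : 'no-match: '), $sample, PHP_EOL;
}
```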
I am trying to write a small script to find out whether a given string contains a phone number and/or email address.
Here is what i have so far:
function findContactInfo($str) {
// Find possible email
$pattern = '/[a-z0-9_\-\+]+@[a-z0-9\-]+\.[a-z]{2,3}?/i';
$emailresult = preg_match($pattern, $str);
// Find possible phone number
preg_match_all('/[0-9]{3}[\-][0-9]{6}|[0-9]{3}[\s][0-9]{6}|[0-9]{3}[\s][0-9]{3}[\s][0-9]{4}|[0-9]{9}|[0-9]{3}[\-][0-9]{3}[\-][0-9]{4}/', $str,
$matches);
$matches = $matches[0];
}
The part with the emails works fine, but I am open to improvements.
With the phone number I have some problems. First of all, the strings that will be given to the function will most likely contain German phone numbers. The problem with that is all the different formats. It could be something like
030 - 1234567 or 030/1234567 or 02964-723689 or 01718290918
and so on. So basically it is almost impossible for me to find out what combination will be used. So what I was thinking was, maybe it would be a good idea to just find a combination of a minimum of three digits following each other. Example:
$stringOne = "My name is John and my phone number is 040-3627781";
// would be found
$stringTwo = "My name is Becky and my phone number is 0 4 0 3 2 0 5 4 3";
// would not be found
The problem I have with that is that I don't know how to find such combinations. Even after almost an hour of searching the web I can't find a solution.
Does anyone have a suggestion on how to approach this?
Thanks!
You could use
\b\d[- /\d]*\d\b
See a demo on regex101.com.
Long version:
\b\d # this requires a "word boundary" and a digit
[- /\d]* # one of the characters in the class
\d\b # a digit and another boundary.
In PHP:
<?php
$regex = '~\b\d[- /\d]*\d\b~';
preg_match_all($regex, $your_string_here, $numbers);
print_r($numbers);
?>
The problem with this is that you might get a lot of false positives, so it will certainly improve your accuracy if these matches are cleaned, normalized and then tested against a database.
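A rough sketch of the "clean and normalize" step, using the same regex. The function name extractCandidateNumbers and the 7-digit minimum are assumptions; pick whatever threshold suits your data, and the database lookup is left out.

```php
<?php
// Hypothetical helper: find candidates, reduce them to digits, drop short ones.
function extractCandidateNumbers(string $text, int $minDigits = 7): array
{
    preg_match_all('~\b\d[- /\d]*\d\b~', $text, $found);

    // Normalize: keep only the digits of each full match.
    $digitsOnly = array_map(static function (string $match): string {
        return preg_replace('/\D/', '', $match);
    }, $found[0]);

    // Clean: discard anything too short to be a phone number (assumed threshold).
    $longEnough = array_filter($digitsOnly, static function (string $digits) use ($minDigits): bool {
        return strlen($digits) >= $minDigits;
    });

    return array_values($longEnough);
}

print_r(extractCandidateNumbers('Call 030 - 1234567 or room 12'));
```

Because spaces are allowed inside a match, the spaced-out "0 4 0 3 2 0 5 4 3" example from the question is found as well.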
As for your email question, I often use:
\S+@\S+
# not a whitespace, at least once
# @
# same as above
There are dozens of different valid emails, the only way to prove if there's an actual human being behind one is to send an email with a link (even this could be automated).
I am trying to use a regular expression to pick a phone number from a string, where the format of the phone number could be just about anything, or there may not be a phone number at all. For example:
$string = 'My phone number is +34 961 123456.';
$string = 'My phone number is +34 (961) 123456.';
$string = 'My phone number is 961-123456.';
$string = 'My phone number is +34.961.12.34.56.';
$string = 'Product A costs €100.00 and Product B costs €134.15.';
So far, I have got to
$number = preg_replace("/[^0-9\/\+\.\-\s]+/", "", $string);
$number = preg_replace("/[^0-9]+/", "", $number);
if (strlen($number)>8) {
/* It's a phone number, so do something with it */
}
This works for picking out all the different phone number formats that I have tried, but it also puts the prices together and assumes that they are a phone number too.
It seems that my problem is that a human can readily distinguish between a space between words and a space in the middle of a phone number, but how do I make the computer do that? Is there a way that I can replace spaces that are both preceded and followed by a number but leave other spaces intact? Is there some other way of sorting this out?
I'm afraid you aren't gonna like it. The regex I get is this:
(\+?[0-9]?[0-9]?[[:blank:],\.]?[0-9][0-9][0-9][[:blank:],\.]?[0-9][0-9][[:blank:],\.]?[0-9][0-9][[:blank:],\.]?[0-9][0-9])
Explanation:
( <-- is for "grouping" and get the regular expression, probably not needed here
\+? <-- optional plus sign
[0-9]?[0-9]? <-- optional prefix code
[[:blank:],\.]? <-- optional space (or comma or dot) between the prefix code and the rest of the number
[0-9][0-9][0-9][[:blank:],\.]? <-- optional province code
[0-9][0-9][[:blank:],\.]?[0-9][0-9][[:blank:],\.]?[0-9][0-9] <-- number, composed by six numbers
Because these examples are Spanish telephone numbers, aren't they?
In that case, you've forgotten to give us examples of other formats, like "91 123 45 67", that might complicate the solution even more.
For these cases, I humbly think the best solution is to write a little function. The regular expression is too complex to be a maintainable solution.
Looks like you want sequences of nine to twelve digits, with nothing between them except spaces, parentheses, periods or dashes; and possibly preceded by +. Try this:
preg_match_all("/\+?(?:\d[-. ()]*){9,12}/", $string, $results);
This isn't quite perfect, since trailing punctuation (like the period that follows all your examples) will be included in the matched string. Post-process the list of results to trim it:
$numbers = preg_replace("/[-. ]+$/", "", $results[0]);
Or you could standardize the collected phone numbers by removing all non-digits from the results, keeping just the digits and possibly an initial "+":
$numbers = preg_replace("/[-. ()]/", "", $results[0]);
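Putting the match and the normalization together, as a sketch run against the example strings from the question. The function name extractPhoneNumbers is made up; note that preg_match_all puts the full matches in $results[0].

```php
<?php
// Hypothetical helper: returns normalized phone numbers found in $text.
function extractPhoneNumbers(string $text): array
{
    preg_match_all('/\+?(?:\d[-. ()]*){9,12}/', $text, $results);

    // $results[0] holds the full matches; strip separators, keep digits and "+".
    return preg_replace('/[-. ()]/', '', $results[0]);
}

$samples = [
    'My phone number is +34 961 123456.',
    'My phone number is +34 (961) 123456.',
    'My phone number is 961-123456.',
    'My phone number is +34.961.12.34.56.',
    'Product A costs €100.00 and Product B costs €134.15.',
];

foreach ($samples as $sample) {
    print_r(extractPhoneNumbers($sample));
}
```

The two prices are not joined into one number, because each has only five digits and the text between them breaks the run.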
Is there a function or an easy way to strip down phone numbers to a specific format?
Input can be a number (mobile, different country codes)
maybe
+4917112345678
+49171/12345678
0049171 12345678
or maybe from another country
004312345678
+44...
I'm doing a
$mobile_new = preg_replace("/[^0-9]/","",$mobile);
to strip everything except digits, because I need it in the format 49171 (without + or 00 at the beginning). But I also need to handle the cases where 00 is entered first, where someone uses +49(0)171, or where someone enters 0171 (which needs to become 49171).
So the first digits ALWAYS need to be the country code, without +/00 and without any (0) in between.
Can someone give me advice on how to solve this?
You can use
(?:^(?:00|\+|\+\d{2}))|\/|\s|\(\d\)
to match most of your cases and simply replace them with nothing. For example:
$mobile = "+4917112345678";
$mobile_new = preg_replace("/(?:^(?:00|\+|\+\d{2}))|\/|\s|\(\d\)/","",$mobile);
echo $mobile_new;
//output: 4917112345678
regex101 Demo
Explanation:
I'm making use of OR here, matching each of your cases one by one:
(?:^(?:00|\+|\+\d{2})) matches 00, + or + followed by two numbers at the beginning of your string
\/ matches a / anywhere in the string
\s matches a whitespace character anywhere in the string (it matches the newline in the regex101 demo, but I suppose you match each number on its own)
\(\d\) matches a number enclosed in brackets anywhere in the string
The only case not covered by this regex is the input format 01712345678, as you can only take a guess what the country specific prefix can be. If you want it to be 49 by default, then simply replace each input starting with a single 0 with the 49:
$mobile = "01712345678";
$mobile_new = preg_replace("/^0/","49",$mobile);
echo $mobile_new;
//output: 491712345678
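If you prefer a single function over chained replacements, something along these lines might do. It is a sketch, not a drop-in solution: the name normalizeMobile and the default country code parameter are my own, and it assumes a leading single 0 always means a national number.

```php
<?php
// Hypothetical helper: returns digits only, starting with the country code.
function normalizeMobile(string $input, string $defaultCountry = '49'): string
{
    $number = preg_replace('/\(\d\)/', '', $input);     // drop "(0)" style trunk digit
    $number = preg_replace('/[^0-9+]/', '', $number);   // keep digits and "+"

    if (strpos($number, '+') === 0) {
        $number = substr($number, 1);                   // "+49..." -> "49..."
    } elseif (strpos($number, '00') === 0) {
        $number = substr($number, 2);                   // "0049..." -> "49..."
    } elseif (strpos($number, '0') === 0) {
        $number = $defaultCountry . substr($number, 1); // "0171..." -> "49171..."
    }

    return preg_replace('/\D/', '', $number);           // remove any stray "+"
}

echo normalizeMobile('+49(0)171 12345678'), PHP_EOL;
echo normalizeMobile('01712345678'), PHP_EOL;
```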
This pattern (49)\(?([0-9]{3})[\)\s\/]?([0-9]{8}) will split number in three groups:
49 - country code
3 digits - area code
8 digits - number
After a match you can construct the clean number by simply concatenating the groups: \1\2\3.
Demo: https://regex101.com/r/tE5iY3/1
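In PHP the concatenation could look roughly like this. The function name cleanGermanNumber is hypothetical, and returning null for inputs without a 49 prefix is my own choice.

```php
<?php
// Hypothetical helper: rebuilds the number from the three captured groups.
function cleanGermanNumber(string $input): ?string
{
    $pattern = '/(49)\(?([0-9]{3})[\)\s\/]?([0-9]{8})/';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null; // no "49" + area code + 8 digits found
    }
    // $m[1] = country code, $m[2] = area code, $m[3] = number
    return $m[1] . $m[2] . $m[3];
}

var_dump(cleanGermanNumber('+49171/12345678'));
var_dump(cleanGermanNumber('0049171 12345678'));
```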
If this does not suit you, then please explain more precisely what you want, with test input and expected output.
I recommend taking a look at LibPhoneNumber by Google and its port for PHP.
It has support for many formats and countries and is well-maintained. Better not to figure this out yourself.
https://github.com/giggsey/libphonenumber-for-php
$phoneUtil = \libphonenumber\PhoneNumberUtil::getInstance();
$usNumberProto = $phoneUtil->parse("+1 650 253 0000", "US");
I'm working on a project right now where we have a large number of text strings in which we must locate phone numbers and make them clickable for Android phones.
The phone numbers can be in different formats, and have different text before and after them. Is there an easy way of detecting every kind of phone number format? Is there a library that can be used?
Phone numbers can look like the ones below, and it has to work with all these combinations, so that the outcome is like
number
+61 8 9000 7911
+2783 207 5008
+82-2-806-0001
+56 (2) 509 69 00
+44 (0)1625 500125
+1 (305)409 0703
031-704 98 00
+46 31 708 50 60
Perhaps something like this:
/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/
(\+\d+)? = A "+" followed by one or more digits (optional)
\s* = Any number of space characters (optional)
(\(\d+\))? = A "(" followed by one or more digits followed by ")" (optional)
([\s-]?\d+)+ = One or more set of digits, optionally preceded by a space or dash
To be honest, though, I doubt that you'll find a one-expression-to-rule-them-all. Telephone numbers can be in so many different formats that it's probably impractical to match any possible format with no false positives or negatives.
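For the "clickable" part, here is one way the expression above could be wired up. Treat it as a sketch under assumptions: the function name linkPhoneNumbers, the tel: link markup and the 7-digit minimum are all mine, and the input is assumed to be plain text that needs no HTML escaping.

```php
<?php
// Hypothetical helper: wraps phone-like matches in a tel: link.
function linkPhoneNumbers(string $text, int $minDigits = 7): string
{
    $pattern = '/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/';

    return preg_replace_callback($pattern, function (array $m) use ($minDigits): string {
        $raw = $m[0];
        // The pattern can swallow leading whitespace; keep it outside the link.
        $number = ltrim($raw);
        $leading = substr($raw, 0, strlen($raw) - strlen($number));

        $digits = preg_replace('/\D/', '', $number);
        if (strlen($digits) < $minDigits) {
            return $raw; // too short, probably not a phone number (assumed threshold)
        }

        $prefix = ($number[0] === '+') ? '+' : '';
        return $leading . '<a href="tel:' . $prefix . $digits . '">' . $number . '</a>';
    }, $text);
}

echo linkPhoneNumbers('Call +61 8 9000 7911 now'), PHP_EOL;
```

The digit-count check is there because the expression on its own also matches short numbers such as "12".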
Not sure that there is a library for that. Hmmmm. Like any international amenity, telephone numbers are standardised, so there should be a format defining them as well. E.164 gives the recommended format for telephone numbers: http://en.wikipedia.org/wiki/E.164 . All open-source decoding libraries are built from reading these standard formats, so it should be of some help if you really can't find any existing libs.
I guess this might do it for these cases?
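preg_replace("/(\+?[\d()\s-]{7,}?\d)/", 'number', $str);
Basically, I check whether it starts with + (it doesn't have to). Then I check that it consists of digits, -, (, ) and spaces, at least 8 characters in total, so it doesn't pick up short non-phone numbers. The hyphen sits at the end of the character class because newer PCRE versions reject a hyphen placed right after \d.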
Try the following:
preg_match_all('/\+?[0-9][\d()\s+-]{5,12}[1-9]/', $text, $matches);
or:
preg_match_all('/(\+?[\d()\s-]{8,20}[0-9]?\d)/', $text, $matches);
I have a 10-character string being passed to me, and I want to verify that it is a valid ASIN before doing more processing and/or redirection.
I know that a non-ISBN ASIN will always be non-numeric and 10 characters in length.
I just want to be able to tell whether the item being passed is a valid ASIN or just a search string, after I have already eliminated the possibility that it is an ISBN.
For example "SOUNDBOARD" is a search term while "B000J5XS3C" is an ASIN and "1412775884" is an ISBN.
Is there a lightweight way to check ASIN?
Update, 2017
@Leonid commented that he’s found the ASIN BT00LLINKI.
Although ASINs don’t seem to be strictly incremental, the oldest non-ISBN ASINs do tend to have more zeros than newer ASINs. Perhaps it was inevitable that we’d start seeing ASINs with no zero padding (and then what, I wonder...). So we’re now looking for "B" followed by nine alphanumeric characters (or an ISBN). Unfortunately, the "loss" of that zero makes it a lot easier to get a false positive.
/^(B[\dA-Z]{9}|\d{9}(X|\d))$/
Original answer
In JavaScript, I use the following regexp to determine whether a string is or includes what’s plausibly an ASIN:
/^\s*(B\d{2}[A-Z\d]{7}|\d{9}[X\d])\s*$/
or, without worrying about extra whitespace or capturing:
/^(B\d{2}[A-Z\d]{7}|\d{9}[X\d])$/
As others have mentioned, Amazon hasn't really revealed the spec. In practice I've only seen two possible formats for ASINs, though:
10-digit ISBNs, which are 9 digits + a final character which may be a digit or "X".
The letter B followed by two digits followed by seven ASCII-range alphanumeric characters (with alpha chars being uppercase).
If anyone has encountered an ASIN that doesn't fit that pattern, chime in. It may actually be possible to get more restrictive than this, but I'm not certain. Non-ISBN ASINs might only use a subset of alphabetic characters, but even if so, they do use most of them. Some seem to appear more frequently than others, at least (K, Z, Q, W...)
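For the PHP side of the question, the two formats described above could be told apart roughly like this. It is only a sketch based on the observed patterns, not an official spec; the function name classifyProductCode and the return labels are made up.

```php
<?php
// Hypothetical helper: returns 'isbn', 'asin' or 'search' for a 10-character string.
function classifyProductCode(string $code): string
{
    // 9 digits plus a final digit or "X": looks like an ISBN-10.
    if (preg_match('/^\d{9}[\dX]$/', $code) === 1) {
        return 'isbn';
    }
    // Looser 2017 rule: "B" followed by nine uppercase alphanumerics.
    if (preg_match('/^B[\dA-Z]{9}$/', $code) === 1) {
        return 'asin';
    }
    return 'search';
}

foreach (['SOUNDBOARD', 'B000J5XS3C', '1412775884', 'BT00LLINKI', 'BASKETBALL'] as $code) {
    echo $code, ' => ', classifyProductCode($code), PHP_EOL;
}
```

BASKETBALL shows the false-positive risk mentioned in the update: it passes the looser rule, whereas the older B + two digits rule would have rejected it.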
For PHP, there is a valid regular expression for ASINs here.
function isAsin($string){
$ptn = "/B[0-9]{2}[0-9A-Z]{7}|[0-9]{9}(X|[0-9])/";
return preg_match($ptn, $string, $matches) === 1;
}
Maybe you could check on the Amazon site whether the ASIN exists.
http://www.amazon.com/dp/YOUR10DIGITASIN
This URL returns an HTTP status code of 200 when the product exists and a 404 if it is not a valid ASIN.
I tried a couple of solutions (including the top-voted answer), and they did not work well in PHP (e.g. 8619203011 is reported as an ASIN).
Here is the solution that works very well:
function isAsin($string){
$ptn = "/^(?i)(B0|BT)[0-9A-Z]{8}$/";
// Always return a boolean (the original returned null when nothing matched)
return preg_match($ptn, $string) === 1;
}
$testAsins = array('k023l5bix8', 'bb03l5bix8', 'b143l5bix8', 'bt00plinki', ' ', '');
foreach ($testAsins as $testAsin) {
if(isAsin($testAsin)){
echo $testAsin." is ASIN"."<br>";
} else {
echo $testAsin." is NOT ASIN"."<br>";
}
}
Explanation:
/^(?i)(B0|BT)[0-9A-Z]{8}$/
/^ = Beginning
(?i) = Case in-sensitive
(B0|BT)= Starting with B0 or BT
[0-9A-Z]= any numbers or letters
{8} = 8 numbers or letters allowed (on top of +2 from B0 or BT)