Ultimate way to find phone numbers in PHP string with preg_replace - php

I'm working on a project where we have a large number of text strings in which we must locate phone numbers and make them clickable on Android phones.
The phone numbers can be in different formats, and have different text before and after them. Is there an easy way to detect every kind of phone number format? Is there a library that can be used?
It has to work with all of the combinations below, so that each number in the output becomes a clickable link. Phone numbers can look like this:
+61 8 9000 7911
+2783 207 5008
+82-2-806-0001
+56 (2) 509 69 00
+44 (0)1625 500125
+1 (305)409 0703
031-704 98 00
+46 31 708 50 60

Perhaps something like this:
/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/
(\+\d+)? = A "+" followed by one or more digits (optional)
\s* = Any number of space characters (optional)
(\(\d+\))? = A "(" followed by one or more digits followed by ")" (optional)
([\s-]?\d+)+ = One or more set of digits, optionally preceded by a space or dash
To be honest, though, I doubt that you'll find a one-expression-to-rule-them-all. Telephone numbers can be in so many different formats that it's probably impractical to match any possible format with no false positives or negatives.
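As a rough sketch of how the expression above could be used to make numbers clickable, the following wraps each match in a tel: link. The helper name linkifyPhones and the 7-digit minimum are assumptions made here to cut down on false positives, not part of the answer. A trunk digit written as (0) is kept in the link as-is, so this is a starting point rather than a complete solution.

```php
<?php
// Hypothetical helper (not from the answer): wraps phone-like matches in tel: links.
// The 7-digit minimum is an assumption to skip short numbers such as "5 apples".
function linkifyPhones(string $text, int $minDigits = 7): string
{
    $pattern = '/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/';

    return preg_replace_callback($pattern, function (array $m) use ($minDigits): string {
        $raw = $m[0];
        // The pattern can swallow leading whitespace; keep it outside the link.
        $body = ltrim($raw);
        $lead = substr($raw, 0, strlen($raw) - strlen($body));

        $digits = preg_replace('/\D/', '', $body);
        if (strlen($digits) < $minDigits) {
            return $raw; // too short to be a phone number, leave untouched
        }

        $href = ($body[0] === '+' ? '+' : '') . $digits;
        return $lead . '<a href="tel:' . $href . '">' . $body . '</a>';
    }, $text);
}

echo linkifyPhones('Call +46 31 708 50 60 or 031-704 98 00, I have 5 apples.'), "\n";
```

Counting digits after the match keeps the regular expression itself unchanged and puts the "is this long enough" decision in plain PHP, where it is easier to tune.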

I'm not sure there is a library for that. Like most international services, telephone numbers are standardised, and there is a format that defines them: E.164, the ITU-T recommendation for the international numbering plan: http://en.wikipedia.org/wiki/E.164 . Open-source parsing libraries are built from these standard formats, so it should be of some help if you really can't find an existing library.

I guess this might do it for these cases?
preg_replace("/(\+?[\d\-()\s]{7,}\d)/", 'number', $str);
Basically, I check whether it starts with a +, which is optional. Then I check for digits, -, (, ) and spaces, at least 8 characters in total and ending in a digit, so that it doesn't pick up short non-phone numbers.

Try the following:
preg_match_all('/\+?[0-9][\d\s()+\-]{5,12}[1-9]/', $text, $matches);
or:
preg_match_all('/(\+?[\d\-()\s]{8,20}[0-9]?\d)/', $text, $matches);

Related

Php: how to remove phone numbers from string?

I'm trying to detect and remove phone numbers from messages between users of my marketplace website (I think eBay does something similar).
This is the code I'm using:
$string = preg_replace('/([0-9]+[\- ]?[0-9]+)/', '', $string);
But it's too aggressive: it strips any number with 2 or more digits. How can I set a limit of, say, 7 digits instead?
To be more precise, the phone numbers can be in any format, like
3747657654
374-7657654
374-765-7654
(374)765-7654
etc. (I cannot predict what users will write; it depends on their habits)
Try this regular expression :
/([0-9]+[\- ]?[0-9]{6,})/
changed to match your samples:
Regex101
That would depend on the exact requirements as now you have 1 or more numbers followed by an optional - or space followed by 1 or more numbers again.
If you wanted for example at least 2 numbers before the space or - followed by at least 5 numbers, you could use something like:
$string = preg_replace('/([0-9]{2,}[\- ]?[0-9]{5,})/', '', $string);
The {2,} and {5,} quantifiers are where you can specify the minimum / maximum number of digits.
You can try something like this:
$string = preg_replace('/(?<![0-9]|[0-9]-)[0-9](?:[- ]?[0-9]){6}(?!-?[0-9])/', '', $string);
The lookarounds are here to avoid numbers with more than 7 digits, but if you want something more specific, you should provide an example string.
It is impossible to determine whether a number of X digits (where X is a valid phone number length) is a phone number or something else without some sort of context intelligence happening. A simple regex can't determine the difference between "call me at 3453456" and "call me when you've flown 3453456 miles".
Therefore trying to catch phone numbers without any formatting (just straight digits) with a regex is hopeless, pure and simple. Attempting to do so is only holding you back from finding a regex that can find formatted/semi-formatted numbers. What you should be going for here is "get the obvious and as many others as possible with minimal false positives...but recognize I can't get them all."
For that I'd recommend this:
/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/
It should not match the first three entries in this list, but it should match all the rest:
no-match: 3747657654
no-match: 444444444444444
no-match: 7657654
match: 374-765-7654
match: 1-374-765-7654
match: (374)765-7654
match: (374) 765 7654
match: 765-7654
match: 1 (374) 765 7654
match: 1(374)765 7654
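A minimal PHP sketch of checking that pattern against a few of the samples above. PHP's preg functions do not accept a g modifier; preg_match_all() and preg_replace() already process every occurrence. The constant and helper names are assumptions for illustration only.

```php
<?php
// Pattern from the answer above, written without a "g" flag for PHP.
const PHONE_PATTERN = '/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/';

// Hypothetical helper: true if the text contains a formatted phone number.
function hasFormattedPhone(string $text): bool
{
    return preg_match(PHONE_PATTERN, $text) === 1;
}

$samples = ['3747657654', '7657654', '374-765-7654', '(374) 765 7654', '765-7654'];
foreach ($samples as $sample) {
    echo $sample, ': ', hasFormattedPhone($sample) ? 'match' : 'no-match', "\n";
}
```

Because the pattern is not anchored, the same check works on whole messages, which is why unformatted digit runs such as "flown 3453456 miles" are left alone.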

Stripping down Phonenumber (mobile)

Is there a function or an easy way to strip phone numbers down to a specific format?
The input can be a number (mobile, with different country codes),
maybe
+4917112345678
+49171/12345678
0049171 12345678
or maybe from another country
004312345678
+44...
I'm doing a
$mobile_new = preg_replace("/[^0-9]/","",$mobile);
to strip everything except digits, because I need the number in the format 49171 (without + or 00 at the beginning). But I also need to handle a leading 00, input such as +49(0)171, and input such as 0171 (which needs to become 49171).
So the number must ALWAYS start with the country code, without +/00 and without any (0) in between.
Can someone give me advice on how to solve this?
You can use
(?:^(?:00|\+|\+\d{2}))|\/|\s|\(\d\)
to match most of your cases and simply replace them with nothing. For example:
$mobile = "+4917112345678";
$mobile_new = preg_replace("/(?:^(?:00|\+|\+\d{2}))|\/|\s|\(\d\)/","",$mobile);
echo $mobile_new;
//output: 4917112345678
regex101 Demo
Explanation:
I'm making use of OR here, matching each of your cases one by one:
(?:^(?:00|\+|\+\d{2})) matches 00, + or + followed by two numbers at the beginning of your string
\/ matches a / anywhere in the string
\s matches a whitespace character anywhere in the string (it matches the newline in the regex101 demo, but I suppose you match each number on its own)
\(\d\) matches a number enclosed in brackets anywhere in the string
The only case not covered by this regex is the input format 01712345678, as you can only take a guess what the country specific prefix can be. If you want it to be 49 by default, then simply replace each input starting with a single 0 with the 49:
$mobile = "01712345678";
$mobile_new = preg_replace("/^0/","49",$mobile);
echo $mobile_new;
//output: 491712345678
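The two replacements above can be combined into one function. This is only a sketch: the name normalizeMobile and the default country code 49 are assumptions taken from the example, and the default is applied only when the input has no 00 or + prefix.

```php
<?php
// Hypothetical helper; "49" as the default country code is an assumption.
function normalizeMobile(string $mobile, string $defaultCountryCode = '49'): string
{
    // Drop slashes, whitespace and a bracketed single digit such as "(0)".
    $mobile = preg_replace('/\/|\s|\(\d\)/', '', $mobile);

    // Drop a leading "00" or "+" international prefix; $count records whether one was found.
    $mobile = preg_replace('/^(?:00|\+)/', '', $mobile, 1, $count);

    if ($count === 0) {
        // National format: swap the leading trunk 0 for the default country code.
        $mobile = preg_replace('/^0/', $defaultCountryCode, $mobile);
    }

    // Remove anything that is still not a digit.
    return preg_replace('/[^0-9]/', '', $mobile);
}

foreach (['+4917112345678', '+49171/12345678', '0049171 12345678', '01712345678'] as $in) {
    echo $in, ' => ', normalizeMobile($in), "\n";
}
```

Handling the international prefix before the single leading 0 matters, because an input starting with 00 also starts with 0.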
The pattern (49)\(?([0-9]{3})[\)\s\/]?([0-9]{8}) will split the number into three groups:
49 - country code
3 digits - area code
8 digits - number
After a match you can construct the clean number by concatenating them as \1\2\3.
Demo: https://regex101.com/r/tE5iY3/1
If this doesn't suit you, please explain more precisely what you want, with test input and expected output.
I recommend taking a look at LibPhoneNumber by Google and its port for PHP.
It has support for many formats and countries and is well-maintained. Better not to figure this out yourself.
https://github.com/giggsey/libphonenumber-for-php
$phoneUtil = \libphonenumber\PhoneNumberUtil::getInstance();
$usNumberProto = $phoneUtil->parse("+1 650 253 0000", "US");

PHP phone number parser

I'm building an application for the UK & Ireland only, but it might potentially extend to other countries. We have built an API and I'm trying to decide A) how to store phone numbers and B) how to write a parser that understands all formats for entry and comparison.
e.g.
Say a user is in Ireland they add a phone number in these formats
0871231234
087 123 1234
087-1231234
+353871231234
Or any other valid way of writing the number. We want to allow for this, so that a new number can be added to our database in a consistent way.
So all the numbers above would potentially be stored as 00353871231234
The problem is that I will need to do parsing for all UK numbers as well. Are there any classes out there that can help with this process?
Use regular expressions. An info page can be found here. It should not be too hard to learn, and will be extremely useful to you.
Here is the regular expression for validating phone numbers in the United Kingdom:
^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$
It allows 3, 4 or 5 digit regional prefix, with 8, 7 or 6 digit phone number respectively, plus optional 3 or 4 digit extension number prefixed with a # symbol. Also allows optional brackets surrounding the regional prefix and optional spaces between appropriate groups of numbers. More can be found here.
This Stackoverflow link should help you see how regular expressions can be used with phone numbers internationally.
<?php
$array = array(
    '0871231234',
    '087 123 1234',
    '087-1231234',
    '+353871231234'
);
foreach ($array as $a) {
    // accept only the four input formats listed above
    if (preg_match("/(^[0-9]{10}$)|(^[0-9]{3}\ [0-9]{3}\ [0-9]{4}$)|(^[0-9]{3}\-[0-9]{7}$)|(^\+353[0-9]{9}$)/", $a)) {
        // removes +35 (the remaining 3 of 353 is removed by the next step)
        $a = preg_replace("/^\+[0-9]{2}/", '', $a);
        // removes first number
        $a = preg_replace("/^[0-9]{1}/", '', $a);
        // removes spaces and -
        $a = preg_replace("/(\s+)|(\-)/", '', $a);
        $a = "00353".$a;
        echo $a."\n";
    }
}
?>
Try http://www.braemoor.co.uk/software/telnumbers.shtml
Design the basic functionality for the UK first, and add to it later if needed. You can separate the logic for each country at a later stage. I would err on the side of cautious optimism: you want to accept as many numbers as possible.
Strip out spaces, opening and closing brackets and -
If number starts with a single 0 replace with 00
If number starts with a + replace with a 00
If it is numeric and has a total length of between 9 and 11 characters we are 'good'
As for storage you could store it as a string... or as an integer, with a second field that contains the Qty of prefix '0's
Use this for reference
http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom

Matching Roman Numbers

I have regular expression
(IX|IV|V?I{0,3}|M{1,4}|CM|CD|D?C{1,3}|XC|XL|L?X{1,3})
I use it to detect if there is any roman number in text.
eregi("( IX|IV|V?I{0,3}[\.]| M{1,4}[\.]| CM|CD|D?C{1,3}[\.]| XC|XL|L?X{1,3}[\.])", $title, $regs)
But format of roman number is always like this: " IV."... I have added in eregi example white space before number and "." after number but I still get the same result. If text is something like "somethinvianyyhing" the result will be vi (between both)...
What am I doing wrong?
No space is required before VI: the space belongs only to the alternative it is written in front of, not to all of them. The same goes for the \. : it belongs only to the alternative where it was written.
Try this
" (IX|IV|V?I{0,3}|M{1,4}|CM|CD|D?C{1,3}|XC|XL|L?X{1,3})\."
See it here on Regexr
This will match
I.
II.
III.
IV.
V.
VI.
VII.
VIII.
IX.
X.
But not
XI.
MMI.
MMXI.
somethinvianyyhing
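Since eregi() was removed in PHP 7, here is a minimal sketch of the same grouped pattern with preg_match(). The helper name findRomanOrdinal is an assumption for illustration; the pattern is matched case-sensitively here, unlike eregi().

```php
<?php
// Hypothetical helper: returns the numeral found in the " IV." form, or null.
function findRomanOrdinal(string $title): ?string
{
    // The space and the dot now sit outside the group, so they apply to every alternative.
    $pattern = '/ (IX|IV|V?I{0,3}|M{1,4}|CM|CD|D?C{1,3}|XC|XL|L?X{1,3})\./';

    return preg_match($pattern, $title, $regs) === 1 ? $regs[1] : null;
}

foreach (['Chapter IV. The Storm', 'somethinvianyyhing', 'Part XI. Later'] as $title) {
    $found = findRomanOrdinal($title);
    echo $title, ' => ', $found === null ? 'no match' : $found, "\n";
}
```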
Your approach to matching Roman numerals is far from correct. A more correct approach, for numbers up to 50 (L), is this:
^(?:XL|L|L?(?:IX|X{1,3}|X{0,3}(?:IX|IV|V|V?I{1,3})))$
See it here on Regexr
I tested this only superficially, but you can see that this gets really complex, and C, D and M are still missing from this expression.
Not to mention special cases, for example 4 = IV = IIII, and there are more of them.
Wikipedia about Roman numbers

Determine if 10 digit string is valid Amazon ASIN

I have a 10 digit string being passed to me, and I want to verify that it is a valid ASIN before doing more processing and/or redirection.
I know that a non ISBN ASIN will always be non-numeric and 10 characters in length
I just want to be able to tell whether the item being passed is a valid ASIN or just a search string, after I have already eliminated the possibility that it is an ISBN.
For example "SOUNDBOARD" is a search term while "B000J5XS3C" is an ASIN and "1412775884" is an ISBN.
Is there a lightweight way to check ASIN?
Update, 2017
@Leonid commented that he’s found the ASIN BT00LLINKI.
Although ASINs don’t seem to be strictly incremental, the oldest non-ISBN ASINs do tend to have more zeros than newer ASINs. Perhaps it was inevitable that we’d start seeing ASINs with no zero padding (and then what, I wonder...). So we’re now looking for "B" followed by nine alphanumeric characters (or an ISBN). Unfortunately, the "loss" of that zero makes it a lot easier to get a false positive.
/^(B[\dA-Z]{9}|\d{9}(X|\d))$/
Original answer
In Javascript, I use the following regexp to determine whether a string is or includes what’s plausibly an ASIN:
/^\s*(B\d{2}[A-Z\d]{7}|\d{9}[X\d])\s*$/
or, without worrying about extra whitespace or capturing:
/^(B\d{2}[A-Z\d]{7}|\d{9}[X\d])$/
As others have mentioned, Amazon hasn't really revealed the spec. In practice I've only seen two possible formats for ASINs, though:
10-digit ISBNs, which are 9 digits + a final character which may be a digit or "X".
The letter B followed by two digits followed by seven ASCII-range alphanumeric characters (with alpha chars being uppercase).
If anyone has encountered an ASIN that doesn't fit that pattern, chime in. It may actually be possible to get more restrictive than this, but I'm not certain. Non-ISBN ASINs might only use a subset of alphabetic characters, but even if so, they do use most of them. Some seem to appear more frequently than others, at least (K, Z, Q, W...)
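For PHP, the two formats described above could be checked separately, roughly as follows. The function name classifyTenChars and its return values are assumptions for illustration; the ASIN branch uses the looser 2017 pattern, so a 10-letter search term that starts with "B" is still reported as an ASIN.

```php
<?php
// Hypothetical helper: classifies a 10-character string using the patterns above.
function classifyTenChars(string $s): string
{
    // 9 digits plus a digit or "X": ISBN-10 shape. The D modifier makes $ match
    // only at the very end, so a trailing newline is rejected.
    if (preg_match('/^\d{9}[X\d]$/D', $s) === 1) {
        return 'isbn';
    }
    // "B" followed by nine uppercase alphanumerics: plausible non-ISBN ASIN.
    if (preg_match('/^B[\dA-Z]{9}$/D', $s) === 1) {
        return 'asin';
    }
    return 'search';
}

foreach (['SOUNDBOARD', 'B000J5XS3C', '1412775884', 'BT00LLINKI', 'BASKETBALL'] as $s) {
    echo $s, ' => ', classifyTenChars($s), "\n";
}
```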
For PHP, there is a valid regular expression for ASINs here.
function isAsin($string){
    // anchored so that the whole string must be an ASIN or a 10-digit ISBN
    $ptn = "/^(?:B[0-9]{2}[0-9A-Z]{7}|[0-9]{9}[X0-9])$/";
    return preg_match($ptn, $string) === 1;
}
Maybe you could check on the Amazon site whether the ASIN exists.
http://www.amazon.com/dp/YOUR10DIGITASIN
This URL returns HTTP status code 200 when the product exists and 404 if it is not a valid ASIN.
After trying a couple of solutions (including the top-voted answer), I found they did not work well in PHP (e.g. 8619203011 is reported as an ASIN).
Here is the solution that works very well:
function isAsin($string){
    $ptn = "/^(?i)(B0|BT)[0-9A-Z]{8}$/";
    // always return a boolean, including false when there is no match
    return preg_match($ptn, $string) === 1;
}
$testAsins = array('k023l5bix8', 'bb03l5bix8', 'b143l5bix8', 'bt00plinki', ' ', '');
foreach ($testAsins as $testAsin) {
    if (isAsin($testAsin)) {
        echo $testAsin." is ASIN"."<br>";
    } else {
        echo $testAsin." is NOT ASIN"."<br>";
    }
}
Explanation:
/^(?i)(B0|BT)[0-9A-Z]{8}$/
^ = beginning of the string
(?i) = case-insensitive
(B0|BT) = starts with B0 or BT
[0-9A-Z] = any digit or letter
{8} = exactly 8 digits or letters (in addition to the 2 characters from B0 or BT)
$ = end of the string
