Identifying international phone numbers in string in PHP

I am trying to write a function that will pull phone numbers out of a string, accepting any number that is valid somewhere on the planet. This is for a truly international site for an organization that has locations all over the globe and users in each location accessing it.
I mainly need this for a database migration. The previous sites that I am migrating from only used a simple text field with no instructions and no filtering. As a result, the phone fields have been used in all sorts of creative ways.
What I am looking for is just to identify the first phone number in the string, then possibly remove any extraneous characters before setting the result as user profile information.

There's a PHP port available of Google's phone number library (libphonenumber).

You could strip the unwanted characters with a pattern like this:
$pattern = '/([\+_\-\(\)a-z ]+)/i';
or
$pattern = '/([^0-9]+)/';
$phone = preg_replace($pattern, '', $phone);
Or use a PHP filter like:
$phone = filter_var($phone, FILTER_SANITIZE_NUMBER_INT);
Note that this filter keeps "+" and "-" as well as digits. Be careful about casting the result to (int) if you allow the value to start with "0": the cast drops leading zeros and can overflow on long numbers.
Then, either way, check the result against a range of allowed phone number lengths, roughly 6-12 digits or whatever your range covers.
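As a rough sketch of the "strip, then check the length" idea applied to the original question (first phone number in a free-text field): the function name, the candidate regex and the 6-15 digit bounds below are my own assumptions, not something from the answer above (15 is the E.164 maximum). It keeps the result as a string so leading zeros survive.

```php
<?php
// Hypothetical helper: returns the digits (with an optional leading "+")
// of the first phone-like run in $text, or null if none qualifies.
function firstPhoneCandidate(string $text, int $minDigits = 6, int $maxDigits = 15): ?string
{
    // A phone-like run: optional "+", a digit, then digits or common
    // separators (space, parentheses, dot, hyphen), ending in a digit.
    if (!preg_match_all('/\+?\d[\d\s().\-]*\d/', $text, $matches)) {
        return null;
    }
    foreach ($matches[0] as $candidate) {
        // Drop everything that is not a digit, then apply the length check.
        $digits = preg_replace('/\D+/', '', $candidate);
        $length = strlen($digits);
        if ($length >= $minDigits && $length <= $maxDigits) {
            // Keep a leading "+" so an international prefix stays recognisable.
            return ($candidate[0] === '+' ? '+' : '') . $digits;
        }
    }
    return null;
}
```

For example, firstPhoneCandidate('room 12, tel 030 1234567') skips the too-short "12" and returns '0301234567'. This only finds plausible candidates; it does not prove the number exists or is valid in any particular country.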

First, you will need to compile a list of valid phone number formats.
Second, you will need to create regular expressions to identify each format.
Third, you will run the regexes against your text to locate the numbers.
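The three steps above can be sketched roughly as follows. The two patterns are only illustrative placeholders that I made up (a loose E.164-style pattern and a loose North American one); a real list would be much longer. The function name and the returned array shape are also assumptions.

```php
<?php
// Hypothetical helper: runs every format regex against $text and returns
// the match that starts earliest, or null when no format matches.
function findFirstByFormats(string $text, array $formats): ?array
{
    $best = null;
    foreach ($formats as $name => $regex) {
        // PREG_OFFSET_CAPTURE makes $m[0] a [matched text, byte offset] pair.
        if (preg_match($regex, $text, $m, PREG_OFFSET_CAPTURE)) {
            [$match, $offset] = $m[0];
            if ($best === null || $offset < $best['offset']) {
                $best = ['format' => $name, 'match' => $match, 'offset' => $offset];
            }
        }
    }
    return $best;
}

// Illustrative formats only; extend with the formats you actually need.
$exampleFormats = [
    'e164' => '/\+[1-9]\d{6,14}/',
    'nanp' => '/\(?[2-9]\d{2}\)?[-. ]?\d{3}[-. ]?\d{4}/',
];
$hit = findFirstByFormats('US: (212) 555-0142, intl: +442079460958', $exampleFormats);
```

Here $hit describes the 'nanp' match '(212) 555-0142', because it starts earlier in the string than the international number.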

Related

managing the phone number validation exploits using regex

I have written a regex to validate US and UK phone numbers. It works, but not in all cases.
For instance, it should not treat numbers like 12345678, 123456789 or 1989 as legitimate. Probably I need to validate the first three digits against each US and UK area code. Am I right?
Here is the list of all UK area codes: http://www.area-codes.org.uk/ (a big list). Do I need to include all of them in the regex?
Issue: it should also filter out exploits like this: 203453seven67
How could that be done?
Here is the example: http://ideone.com/zwzmKU
Regex:
$pattern = '((^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{5}\)?[\s-]?\d{4,5}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$)|(\(?[2-9][0-8][0-9]\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}))';

For making sure the phone number is correct, avoid using regex and use some standard library which can help you with the phone number validations.
I suggest https://code.google.com/p/libphonenumber/

Formatted amount to database format

Hi, I have a problem with formatted amounts.
On my input form, users can add and edit a formatted amount. Since this is a multi-language program, users can specify their own format, so there isn't a fixed pattern.
Examples:
250.000
250,000
250.000,00
250,000.00
Sadly, I have to "un-format" them before storing them in the database, or MySQL will interpret my whole numbers as floats and vice versa.
How can I overcome this? Any ideas?

You can either allow users to enter what they want, in which case you'll have to sanitize the input (and translate the decimal symbol), as you guess here. Or you can restrict what they enter and force them to leave out thousands separators.

You will have to check the client's locale and parse the data accordingly, because 2,500.00 in one locale is 2.500,00 in another. It depends on the location.
Do not let the users enter ",".
Have a separate box to let users enter things after the decimal.
Let the box be pre-filled with 0.00. Telling the user this is the format.

Asking the user for input in a proper format is a good approach; display a message describing the format near the input area.
You should validate the input data when the user submits the form.
You can validate using regular expressions both in JavaScript and in your server-side programming language before storing the user input in the database.

OK, I finally made it, so I'll share my solution.
First of all, I used MooTools to format the input field while the user is typing.
Then I moved to the server side.
Format options are saved inside my application, so this is the code:
// get the decimal separator from the saved params
$dec = $params->get('decimalSep', '.');
// remove any char that's not a digit, a minus sign or the decimal separator
// (I don't care about the thousands separator)
$regex = '#[^0-9\-' . preg_quote($dec, '#') . ']#';
$normalized = preg_replace($regex, '', $price);
// switch to the standard decimal point
$normalized = str_replace($dec, '.', $normalized);
Basically I get rid of everything except digits, the minus sign and the decimal separator, then I replace the separator with the standard one (.).
It works like a charm.
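For anyone wanting to try this against the four sample amounts from the question, here is the same idea wrapped in a function. The function name and the way the separator is passed in are my assumptions; the original reads it from the application's saved params.

```php
<?php
// Hypothetical wrapper around the approach above: keep digits, the minus
// sign and the configured decimal separator, then switch to a decimal point.
function normalizeAmount(string $amount, string $decimalSep = '.'): string
{
    // Quote the separator for the "#" delimiter so "." or "#" are safe.
    $keep = preg_quote($decimalSep, '#');
    $clean = preg_replace('#[^0-9\-' . $keep . ']#', '', $amount);
    return str_replace($decimalSep, '.', $clean);
}

echo normalizeAmount('250.000,00', ','), "\n"; // 250000.00
echo normalizeAmount('250,000.00', '.'), "\n"; // 250000.00
```

Note that the separator setting is what resolves the ambiguity: '250.000' becomes '250000' when the separator is ',' but stays '250.000' (i.e. 250) when it is '.'.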

Better to store phone in three fields or as one?

I am struggling with the decision to separate phone numbers stored in a MySQL database.
One school of thought is to break out the phone as:
area code (123)
prefix (123)
suffix (1234)
Another is to simply place the phone number in a single field with whatever formatting is deemed appropriate:
123456789
(123) 123-4567
123-456-7890
My initial reason for thinking the first would be better is in terms of being able to quickly and easily gather statistical data based on the phone numbers collected from our members (X number of members have a 123 area code for example).
Is there really a 'right' way to do it? I do realize that paired with PHP I can retrieve and reformat any way I want but I'd like to use best practice.
Thanks for your advice
EDIT
I will only be storing North American phone numbers for the time being

I vote for one field, processing the data as you put it in so that it's in a known format. I've tried both ways, and the one-field approach seems to generate less code overall.

You want to store it in the most efficient way in the DB, precisely because it's so easy to reformat in PHP. Go for the all-numeric field, with no separators (1231231234) since that would be the best way. If you have international phone numbers, add the country code as well. Then in your code you can format it using regular expressions to look however you want it.

I would store phone numbers as strings, not numbers.
Phone numbers are identifiers that happen to use digits.
Phone numbers starting with zero are valid, but may be interpreted as octal by a programming language.
Strip the phone number to only digits and store the extension in a separate field.
This will allow for uniform formatting later.
For US numbers, strip the leading '1' digit and determine the formatting based on the length of the string (10 digits for the US).
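A minimal sketch of that normalization for US numbers, assuming the extension has already been split off into its own field (the function name and the "return null when it isn't 10 digits" behaviour are my choices, not part of the answer):

```php
<?php
// Hypothetical helper: reduces a US number to its 10 digits, kept as a
// string, or returns null when the input does not look like a US number.
function normalizeUsNumber(string $input): ?string
{
    // Phone numbers are identifiers, so work on strings, never integers.
    $digits = preg_replace('/\D+/', '', $input);
    // Drop the leading country/trunk digit "1" when present.
    if (strlen($digits) === 11 && $digits[0] === '1') {
        $digits = substr($digits, 1);
    }
    return strlen($digits) === 10 ? $digits : null;
}
```

With this, both '+1 (212) 555-0142' and '212.555.0142' end up stored as '2125550142', while a 7-digit local number comes back as null so it can be handled separately.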

I'm in the process of building a callcenter application (it manages queues of contact information for a group of distributed callers to contact) and the architecture specified one field, no spaces, dashes, etc. After quite a bit of analysis, I agree it seems the best.
Based on the variability of entry for phone numbers (apostrophes, dots, dashes, and combinations of each) I built a simple function that deals with user entry, stripping down all but the numbers themselves, and also a "rebuilder" that reformats the raw number into something that's more visually appealing to the user.
Since they've been helpful to me, here's what I've written so far:
public static function cleanPhoneNumbers($input) {
    return preg_replace("/[^0-9]/", "", $input);
}
public static function formatPhoneNumbers($phone_number) {
    if(strlen($phone_number) == 7) {
        return preg_replace("/([0-9]{3})([0-9]{4})/", "$1-$2", $phone_number);
    } elseif(strlen($phone_number) == 10) {
        return preg_replace("/([0-9]{3})([0-9]{3})([0-9]{4})/", "$1-$2-$3", $phone_number);
    } else {
        return $phone_number;
    }
}
Some caveats: My app is not available for international customers right now (there's a VoIP application built into it that we don't want to allow to call outside of the US right now), so I've not taken the time to set it up for international possibilities. Also, as this is in progress, I will likely return to refactor and bolster these functions later.
I've found one weakness so far that has been a bit of a pain for me. In my app, I have to disallow calls by timezone based on the time of day (for instance, don't allow someone on the West Coast to be called at 6:00am when it's 9:00am in the East). To do that, I have to join a separate area code table to my table with the phone numbers by comparing 3-digit area codes to get the timezone. But I can't simply compare the area code to my phone number field, because they'd never match. So I have to write additional SQL to get just the first three digits of the number. Not a game-changer, but more work and confusion nonetheless.

Definitely store them in one field as a text string, and only store the numbers. Think of it this way; no matter what the numbers are, its all one telephone number. However, the segmenting of the numbers is dependent on a number of things (locality, how many numbers provided, even personal preference). Easier to store the one and change it later with text manipulation.

I think splitting the number into 3 fields is the best option if you want to use area codes as filters; otherwise, you should use only 1 field.
Remember to use ZEROFILL if you plan on storing them as numbers ;)

it really depends on a couple factors:
is it possible you will have international numbers?
how much area code/city code searching/manipulation will you be doing?
No matter what, I would only store numbers, it's easy enough to format either in MySQL or PHP and add parentheses and dashes.
Unless I was going to do a lot of searching by area code, I would just put the entire phone number into a single field, since I assume most of the time you would be retrieving the entire phone number anyway.
If it's possible that you will take international numbers in the future:
You might want to add a country field though, that way you won't have to guess what country they are from when dealing with the number.

What you use depends on how you plan to use the data, and where the program will be used.
If you want to efficiently search records by area code, then split out the area code; queries will perform much faster when they're doing simple string comparisons versus string manipulation of the full phone number to get the area code.
HOWEVER, be advised that phone numbers formatted XXX-XXX-XXXX are only found in the US, Canada, and other smaller Caribbean territories that are part of the North American Numbering Plan (NANP). Various other world regions (EU, Africa, ASEAN) have very different numbering standards. In such cases, splitting out the equivalent of the "area code" may not make sense. Also, if all you want to do is display a phone number to the user, then just store it as a string.
Whether to store a number with a format or not is mostly personal preference. Storing the raw number allows the formatting to be changed easily; you could go from XXX-XXX-XXXX to (XXX) XXX-XXXX by changing a couple lines of code instead of reformatting the 10 million numbers you already have. Removing special characters from a phone number is also a relatively simple Regex. Storing without formatting will also save you a few bytes per number and allow you to use a fixed-length field (saving further data overhead inherent in varchars). This may be of use in a mobile app where storage is at a premium. However, that 5-terabyte distributed SQL cluster in your server room is probably not gonna notice much difference between a char(10) and a varchar(15). Storing them formatted also speeds up loading the data; you don't have to format it first, just yank it out of the DB and plaster it on the page.

Format telephone number

I have to format a telephone number list, and I'd like to extract and separate the prefix from the number for better viewing.
I have a list of all possible prefixes, but there is no regular pattern.
I mean, I could have these numbers:
00 - 12345 (short prefix)
0000 - 12345 (long prefix)
How can I manage that? Numbers are plain, without any special characters (i.e. without / \ - . , etc.).
Prefixes are like that:
030
031
0321
0322
...
...
Most of the time I have the customer's town (it's not required), so I can usually get the prefix from there, but that's not a sure thing, since the town and the telephone number may not be linked.
== EDIT ==
The prefix list is 231 entries long. Maybe I'll add some more, so take 300 as a safe value.
Moreover, prefixes come from a single country only (Italy).
I have to save plain numbers without any separator so users can search for them. In fact, if they enter separators they will never be able to find the number again.
More info
A prefix ALWAYS starts with a leading 0, and its length ranges from 2 to 4 digits.
But the more I study this thing, the more I think I can't work it out :(

Because of the extremely varied telephone number formats used around the world, it's probably going to be tough to correctly parse any phone number that is put into your system.
I'm not certain if it would make your task any easier, but I had the idea that parsing from right to left might be easier for you, since it's the prefix length that's unknown.

What a pain. I would use a logic funnel to narrow possible choices and finally take a best guess.
First, test if the first few numbers can match anything on your prefix list. For some, hopefully only one prefix can possibly be correct.
Then, perhaps you could use the city to eliminate prefixes from entirely different countries.
Finally, you could default to the most popular format for prefixes.
Without any other information, you can't do better than a good guess unless you want to default to no format at all.

I'm really confused. What do you mean, "extract and separate"? My guess is these phone numbers are in a MySQL database, and you get to use PHP. Are you trying to get the prefix from the numbers, and then insert them into a different field in the same row? Are you pulling these numbers from the database, and you would just like to print the prefixes to the screen?
Regardless of what you're trying to do, and taking for granted that you're using PHP and regexs, isn't this essentially what you're looking for?:
$telephone_number = '333-12345';
$matched = array();
preg_match('~^(\d+)-~', $telephone_number, $matched);
echo $matched[1]; // prints 333

OK, I worked it out.
I saw that no short prefix shares its leading digits with a longer one.
I mean:
02 -> there will never be a longer prefix such as 021, 022 and so on
So things are pretty easy now:
I take the first 4 digits -> are they in my prefix table?
YES: stop here
NO: take the first 3
and so on.
Thanks for your help
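The lookup described above (try the first 4 digits, then 3, then 2) could be sketched like this. The function name is hypothetical and the prefix list in the usage example is just the handful of prefixes quoted in the question, not the full table of 231 entries.

```php
<?php
// Hypothetical helper: splits $number into [prefix, rest] by trying the
// longest possible prefix first; returns null when no prefix matches.
function splitPrefix(string $number, array $prefixes, int $maxLen = 4, int $minLen = 2): ?array
{
    // Flip the list so each membership test is a cheap key lookup.
    $lookup = array_flip($prefixes);
    for ($len = $maxLen; $len >= $minLen; $len--) {
        $candidate = substr($number, 0, $len);
        if (isset($lookup[$candidate])) {
            return [$candidate, substr($number, $len)];
        }
    }
    return null;
}

$prefixes = ['02', '030', '031', '0321', '0322'];
[$prefix, $rest] = splitPrefix('032112345', $prefixes);
echo $prefix, ' - ', $rest, "\n"; // 0321 - 12345
```

Trying the longest length first gives an unambiguous result here because, as noted, no short prefix is the start of a longer one. The stored value stays the plain number; the split is only for display.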

PHP and Regular Expressions question?

I was wondering if the codes below are the correct way to check for a street address, email address, password, city and url using preg_match using regular expressions?
And if not how should I fix the preg_match code?
preg_match ('/^[A-Z0-9 \'.-]{1,255}$/i', $trimmed['address']) //street address
preg_match ('/^[\w.-]+#[\w.-]+\.[A-Za-z]{2,6}$/', $trimmed['email']) //email address
preg_match ('/^\w{4,20}$/', $trimmed['password']) //password
preg_match ('/^[A-Z \'.-]{1,255}$/i', $trimmed['city']) //city
preg_match("/^[a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i", $trimmed['url']) //url

Your street address regex: ^[A-Z0-9 \'.-]{1,255}$
You need not escape the single quote as far as the regex is concerned; the backslash is only needed because the pattern sits in a single-quoted PHP string.
The dot inside the character class is a literal dot, not a wildcard, so it does not need escaping. With the /i flag the class allows only letters, digits, space, apostrophe, dot and hyphen.
You are allowing a minimum length of 1 and a maximum length of 255. I would suggest increasing the minimum length to something more than 1.
Your email regex: ^[\w.-]+#[\w.-]+\.[A-Za-z]{2,6}$
Again, the dots inside the character classes are literal, so they are fine as written.
Your password regex: ^\w{4,20}$
This allows a password of length 4 to 20 that can contain only letters (upper and lower case), digits and underscores. I would suggest allowing special characters too, to make your passwords stronger.
Your city regex: ^[A-Z \'.-]{1,255}$
The dot in the character class is literal here as well.
It allows a minimum length of 1 (if you want to allow cities with 1-character names, this is fine).
EDIT:
Since you are very new to regex, spend some time on Regular-Expressions.info
EDIT:
Since you are very new to regex, spend some time on Regular-Expressions.info

This seems overly complicated to me. In particular I can see a few things that won't work:
Your regex will fail for cities with non-ASCII letters in their names, such as "Malmö" or 서울, etc.
Your password validator doesn't allow spaces in the password (which are useful for entering pass-phrases), and it doesn't allow punctuation either, which many people like to put in their passwords for added security.
Your address validator won't allow for people who live in apartments (12/345 Foo St)
(the "." inside the character class is not the problem: there it is a literal dot, not "match anything")
And so on. In general, I think over-reliance on regular expressions for validation is not a good thing. You're probably better off allowing anything for those fields and just validating them some other way.
For example, with email addresses: just because an address is valid according to the RFC standard doesn't mean you'll actually be able to send email to it (or that it's the correct email address for the person). The only reliable way to validate an email address is to actually send an email to it and get the person to click on a link or something.
Same thing with URLs: just because it's valid according to the standard doesn't actually mean there's a web page there. You can validate the URL by trying to do an actual request to fetch the page.
But my personal preference would be to just do the absolute minimum verification possible, and leave it at that. Let people edit their profile (or whatever it is you're verifying) in case they make a mistake.

There's not really a 'correct' way to check for any of those things. It depends on what exactly your requirements are.
For e-mail addresses and URLs, I'd recommend using filter_var instead of regexps - just pass it FILTER_VALIDATE_EMAIL or FILTER_VALIDATE_URL.
With the other regexps, note that . inside a character class is a literal dot, so it doesn't need escaping (it won't allow everything), but you might want to consider that the City/Street ones would allow rubbish such as ''''', or just whitespace.
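A minimal sketch of the filter_var suggestion; the two wrapper function names are mine, while FILTER_VALIDATE_EMAIL and FILTER_VALIDATE_URL are the built-in filter constants. filter_var returns the value on success and false on failure, hence the strict comparison.

```php
<?php
// Hypothetical wrappers around PHP's built-in validation filters.
function isValidEmail(string $email): bool
{
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

function isValidUrl(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(isValidEmail('user@example.com')); // bool(true)
var_dump(isValidUrl('example'));            // bool(false)
```

As other answers point out, passing these checks only means the value is well-formed, not that the mailbox or page actually exists.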

Please don't assume that you know how an address is made up. There are thousands of cities, towns and villages with characters like & and those from other alphabets.
Just DON'T try to validate an address unless you do it thru an API specific to a country (USPS for the US, for example).
And why would you want to limit the characters in a users password? Don't have ANY requirements on the password except for it existing.
Your site will be unusable if you use those regex.
