I have to format a telephone number list, and I'd like to extract and separate the prefix from the number for better viewing.
I have a list of all possible prefixes, but there is no regular pattern.
I mean, I could have these numbers:
00 - 12345 (short prefix)
0000 - 12345 (long prefix)
How can I manage that? Numbers are plain, without any special characters (i.e. without / \ - . , etc.).
Prefixes are like that:
030
031
0321
0322
...
...
Most of the time I have the customer's town (it's not a required field), so usually I can get the prefix from there, but that's not a sure thing, since the town and the telephone number might not be linked.
== EDIT ==
The prefix list is 231 entries long. I may add a few more, so take 300 as a safe value.
Moreover, prefixes come from a single country only (Italy).
I have to save plain numbers without any separator so users can search for them. In fact, if they enter separators they will never be able to find the number again.
More info
The prefix ALWAYS starts with a leading 0, and its length ranges from 2 to 4 digits.
But the more I study this, the more I think I can't work it out :(
Because of the extremely varied telephone number formats used around the world, it's probably going to be tough to correctly parse any phone number that is put into your system.
I'm not certain if it would make your task any easier, but I had the idea that parsing from right to left might be easier for you, since it's the prefix length that's unknown.
What a pain. I would use a logic funnel to narrow possible choices and finally take a best guess.
First, test if the first few numbers can match anything on your prefix list. For some, hopefully only one prefix can possibly be correct.
Then, perhaps you could use the city to eliminate prefixes from entirely different countries.
Finally, you could default to the most popular format for prefixes.
Without any other information, you can't do better than a good guess unless you want to default to no format at all.
I'm really confused. What do you mean, "extract and separate"? My guess is these phone numbers are in a MySQL database, and you get to use PHP. Are you trying to get the prefix from the numbers, and then insert them into a different field in the same row? Are you pulling these numbers from the database, and you would just like to print the prefixes to the screen?
Regardless of what you're trying to do, and taking for granted that you're using PHP and regexes, isn't this essentially what you're looking for?
$telephone_number = '333-12345';
$matched = array();
preg_match('~^(\d+)-~', $telephone_number, $matched);
echo $matched[1]; // Should be '333'
ok, I worked it out.
I saw that no short prefix is the beginning of a longer one.
I mean:
02 -> there will never be a longer prefix such as 021, 022 and so on
so things are pretty easy now:
I take the first 4 digits -> are they in my prefix table?
YES: stop here
NO: take the first 3
and so on...
thanks for your help
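For reference, here is a minimal PHP sketch of the lookup described above. The prefix table is a tiny hypothetical sample (the real list has about 231 entries), and the function name splitPrefix is made up for illustration. It tries the longest candidate first and falls back to shorter ones.

```php
<?php
// Hypothetical sample of the prefix table, flipped so lookups are by key.
$prefixes = array_flip(['02', '06', '030', '031', '0321', '0322']);

// Returns [prefix, rest]; the prefix is '' when nothing in the table matches.
function splitPrefix(string $number, array $prefixes): array
{
    // Prefixes are 2 to 4 digits long, so try 4, then 3, then 2.
    for ($len = 4; $len >= 2; $len--) {
        $candidate = substr($number, 0, $len);
        if (isset($prefixes[$candidate])) {
            return [$candidate, substr($number, $len)];
        }
    }
    return ['', $number];
}
```

Because no short prefix is the start of a longer one, at most one candidate can match, so the order of the attempts only affects speed, not the result.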
I'm extremely new to Solr so go easy on me :)
I have a field that, for argument's sake, stores a product SKU. If the SKU in a document was 'SKU12345', how would I return the document if the query '1234' was entered?
I have previously tried using solr.EdgeNGramFilterFactory in the field type specific for the SKU but unfortunately this only works as a string prefix!
I want to try and avoid wild cards to keep performance optimal!
Thankssss :)
If you are new to Solr and you are beginning to implement features like this, I would recommend reading the chapter Understanding Analyzers, Tokenizers, and Filters of the reference guide thoroughly. There are several ways to make your query match, and the best choice depends on what you need.
Arun's suggestion is not bad, but N-grams alone are geared more towards matching arbitrary fragments of words. You would need this if you want some sort of type-ahead or auto-completion, e.g. a user starts to type in an input field and you want to suggest previously entered values that match partially. If you try to make this match with N-grams alone, your index may become quite large, since you may have to index every fragment of each word so as not to miss the place where numbers or words start or end.
For your requirement I would suggest the WordDelimiterFilter with splitOnNumerics="1" (and, as far as I know, preserveOriginal="1" if the unsplit token should be kept too). The input SKU12345 would then be indexed as follows:
SKU12345
12345
SKU
So if a user searches for 12345 this would make a match.
If you also want to match fragments of that, like the 1234 you mentioned, I would place an N-GramFilter afterwards. Then you will need to play around with minGramSize and maxGramSize. You will want to keep the gap between the two values low, since the bigger the gap, the bigger your index will become.
e.g.
* minGramSize=4 and maxGramSize=5, gap of 1, few permutations
* minGramSize=1 and maxGramSize=5, gap of 4, more permutations
This depends on how short the user input is allowed to be and still produce a match.
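To make the size trade-off concrete, here is a small PHP illustration that generates the fragments for a single token. It is only a sketch of the idea, not Solr's implementation, and the function name ngrams is made up.

```php
<?php
// Illustration only: enumerate all substrings of $token whose length is
// between $min and $max, the way an n-gram filter conceptually would.
function ngrams(string $token, int $min, int $max): array
{
    $grams = [];
    $length = strlen($token);
    for ($size = $min; $size <= $max; $size++) {
        for ($start = 0; $start + $size <= $length; $start++) {
            $grams[] = substr($token, $start, $size);
        }
    }
    return $grams;
}

$narrow = ngrams('12345', 4, 5); // 1234, 2345, 12345
$wide = ngrams('12345', 1, 5);   // every fragment from 1 to 5 characters
```

For the token 12345 the narrow setting yields 3 fragments and the wide one 15, and that difference is multiplied across every token in the index.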
If the input only needs to match from the start of a word and should not hit fragments in the middle, I would suggest the EdgeN-GramFilter as an even better choice than the N-GramFilter. It only generates fragments from the start of a word, not from the middle, which further reduces the index size and improves performance.
So if you want 2345 to match SKU12345 you need NGram; if only input such as 1234 has to match SKU12345, EdgeNGram will do.
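The difference can be sketched in a few lines of PHP. Again this only mimics the idea and is not how Solr is implemented; edgeNgrams is a hypothetical helper applied to the numeric part 12345 that the word delimiter step would produce.

```php
<?php
// Illustration only: fragments anchored at the start of the token.
function edgeNgrams(string $token, int $min, int $max): array
{
    $grams = [];
    $limit = min($max, strlen($token));
    for ($size = $min; $size <= $limit; $size++) {
        $grams[] = substr($token, 0, $size);
    }
    return $grams;
}

$indexed = edgeNgrams('12345', 1, 5); // 1, 12, 123, 1234, 12345

$matchesFromStart = in_array('1234', $indexed, true);  // query starts the token
$matchesFromMiddle = in_array('2345', $indexed, true); // query from the middle
```

Here 1234 is found among the indexed fragments while 2345 is not, which is exactly the case where the full N-gram filter would be needed.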
You can also set side to "back" to generate the ngrams from right to left.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in the set A and a string in a set B.
Can you please help me in finding such an algorithm?
I normally program in PHP, a SQL-only solution would be even better, but any other language would go.
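To make the question concrete, this is roughly the brute-force version of what is being asked for, as a PHP sketch: strip the non-digits, then compare every number in set A with every number in set B and keep the longest shared run of digits. The function names are made up for illustration, and the pairwise loop is only reasonable for the handful of numbers attached to two profiles.

```php
<?php
// Keep only the digits of a free-form phone entry.
function digitsOnly(string $raw): string
{
    return preg_replace('/\D/', '', $raw);
}

// Classic dynamic-programming longest common substring of two strings.
function longestCommonSubstring(string $a, string $b): string
{
    $best = '';
    $prev = array_fill(0, strlen($b) + 1, 0);
    for ($i = 1; $i <= strlen($a); $i++) {
        $curr = array_fill(0, strlen($b) + 1, 0);
        for ($j = 1; $j <= strlen($b); $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the common run that ended at the previous characters.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > strlen($best)) {
                    $best = substr($a, $i - $curr[$j], $curr[$j]);
                }
            }
        }
        $prev = $curr;
    }
    return $best;
}

// Longest run of digits shared by any number in $setA and any number in $setB.
function bestPhoneOverlap(array $setA, array $setB): string
{
    $best = '';
    foreach ($setA as $a) {
        foreach ($setB as $b) {
            $common = longestCommonSubstring(digitsOnly($a), digitsOnly($b));
            if (strlen($common) > strlen($best)) {
                $best = $common;
            }
        }
    }
    return $best;
}
```

The length of the result could then feed into the similarity score, for example by treating only overlaps of some minimum length as a hint that two profiles belong together.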
As Voitcus said before, you have to clean your data first, before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers that do not match the pattern, try to adjust them to it. Then you are able to look for duplicates.
Moreover, you should clean the data before persisting it, maybe into a separate column. Then you don't have to take care of that while looking for duplicates, which avoids performance peaks.
Algorithms like levenshtein() or similar_text() in PHP don't fit this use case very well.
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways, some regular expression would be the best, but see below.
Then, if possible, you can determine the country dialing code from the user's country, if it is known. If there is none, assume a default and prepend it to the string. The same probably goes for city codes. You can also try looking at where the person lives, their zip code, etc.
At the end of this you should have uniform phone numbers which can be easily compared.
The other way is to compare strings with the country (and city) code removed.
About searching for "the longest common substring": numbers filtered this way should already be identical, but you might still need it, e.g. if someone typed "call me after 6 p.m.". If you're sure that the phone number is always at the beginning, and that nobody typed something like 555-SUPERMAN (which translates to 555-78737626), you could also remove everything from the first letter onward.
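A minimal PHP sketch of that truncation idea, reading it as "cut the entry at the first letter, then keep only digits" (that reading is my assumption, and leadingPhoneDigits is a made-up name):

```php
<?php
// Keep the part of the entry before the first letter, then strip
// everything that is not a digit.
function leadingPhoneDigits(string $raw): string
{
    $head = preg_split('/[A-Za-z]/', $raw, 2)[0];
    return preg_replace('/\D/', '', $head);
}
```

With this, "555-123456 call me in the evening" becomes 555123456, while a vanity number such as 555-SUPERMAN is cut down to 555, which is why the approach only works if nobody enters numbers that way.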
There is also the possibility of filtering such data in the SQL statement. Consider something like SELECT ..., [your trimming function(phone_number)] AS trimmed_phone WHERE (trimmed_phone is not numeric characters only) GROUP BY trimmed_phone. If the trimming function removes only whitespace and common dividers like -, +, . (commonly used in Germany) and perhaps commas, this query would leave you with all the trimmed phone numbers that still contain non-numeric characters. Take a look at the results, which will probably be mostly digits and letters. How many of them are there? Maybe they have something in common? Maybe there are some typical phrases you can filter out too?
If the result from such query would not be very much, maybe it's easier just to do it by hand?
I am struggling with the decision to separate phone numbers stored in a MySQL database.
One school of thought is to break out the phone as:
area code (123)
prefix (123)
suffix (1234)
Another is to simply place the number in a single field with whatever formatting is deemed appropriate:
123456789
(123) 123-4567
123-456-7890
My initial reason for thinking the first would be better is in terms of being able to quickly and easily gather statistical data based on the phone numbers collected from our members (X number of members have a 123 area code for example).
Is there really a 'right' way to do it? I do realize that paired with PHP I can retrieve and reformat any way I want but I'd like to use best practice.
Thanks for your advice
EDIT
I will only be storing North American phone numbers for the time being
I vote for one field, processing the data as you put it in so that it's in a known format. I've tried both ways, and the one-field approach seems to generate less code overall.
You want to store it in the most efficient way in the DB, precisely because it's so easy to reformat in PHP. Go for the all-numeric field, with no separators (1231231234) since that would be the best way. If you have international phone numbers, add the country code as well. Then in your code you can format it using regular expressions to look however you want it.
I would store phone numbers as strings, not numbers.
Phone numbers are identifiers that happen to use digits.
Phone numbers starting with zero are valid, but may be interpreted as octal by a programming language.
Strip the phone number to only digits and store the extension in a separate field.
This will allow for uniform formatting later.
For US numbers, strip the leading '1' digit (and determine formatting based on the length of the string: 10 digits for the US).
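As a rough PHP sketch of those steps: pull a trailing extension off first, strip the rest to digits, and drop a leading 1 from 11-digit US numbers. The function name and the accepted extension markers ("x" and "ext.") are assumptions for illustration; real input will have more variations.

```php
<?php
// Split free-form input into a digits-only number and an optional extension.
function splitNumberAndExtension(string $input): array
{
    $extension = '';
    // Look for a trailing "x123" or "ext. 123" and cut it off the input.
    if (preg_match('/(?:x|ext\.?)\s*(\d+)\s*$/i', $input, $m, PREG_OFFSET_CAPTURE)) {
        $extension = $m[1][0];
        $input = substr($input, 0, $m[0][1]);
    }
    $digits = preg_replace('/\D/', '', $input);
    // US numbers: drop the leading 1 so that 10 digits remain.
    if (strlen($digits) === 11 && $digits[0] === '1') {
        $digits = substr($digits, 1);
    }
    return ['number' => $digits, 'extension' => $extension];
}
```

Both parts come back as strings, so leading zeros survive and the two values can go straight into separate text columns.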
I'm in the process of building a callcenter application (it manages queues of contact information for a group of distributed callers to contact) and the architecture specified one field, no spaces, dashes, etc. After quite a bit of analysis, I agree it seems the best.
Based on the variability of entry for phone numbers (apostrophes, dots, dashes, and combinations of each) I built a simple function that deals with user entry, stripping down all but the numbers themselves, and also a "rebuilder" that reformats the raw number into something that's more visually appealing to the user.
Since they've been helpful to me, here's what I've written so far:
// Strip everything except digits from the user's input.
public static function cleanPhoneNumbers($input) {
    return preg_replace("/[^0-9]/", "", $input);
}

// Rebuild a raw digit string into a more readable form (7 or 10 digits only).
public static function formatPhoneNumbers($phone_number) {
    if (strlen($phone_number) == 7) {
        return preg_replace("/([0-9]{3})([0-9]{4})/", "$1-$2", $phone_number);
    } elseif (strlen($phone_number) == 10) {
        return preg_replace("/([0-9]{3})([0-9]{3})([0-9]{4})/", "$1-$2-$3", $phone_number);
    } else {
        // Unknown length: return the number unchanged.
        return $phone_number;
    }
}
Some caveats: My app is not available for international customers right now (there's a voip application built into it that we don't want to allow to call outside of the US right now) so I've not taken the time to setup for international possibilities. Also, as this is in progress, I will likely return to refactor and bolster these functions later.
I've found one weakness so far that has been a bit of a pain for me. In my app, I have to disallow calls by timezone based on the time of day (for instance, don't allow someone on the West Coast to be called at 6:00am when it's 9:00am in the East). To do that, I have to join a separate area code table to my table with the phone numbers, comparing 3-digit area codes to get the timezone. But I can't simply compare the area code to my phone number field, because they'd never match. So I have to use additional SQL to get just the first three digits of the number. Not a game-changer, but more work and confusion nonetheless.
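The same idea, sketched in PHP rather than SQL for illustration: take the first three digits of the stored number, look up the timezone, and work out the local hour there. The lookup table below is a two-entry hypothetical stand-in for the real area code table, and localHourFor is a made-up name.

```php
<?php
// Hypothetical stand-in for the area code table; a real one covers every code.
$areaCodeTimezones = [
    '212' => 'America/New_York',
    '415' => 'America/Los_Angeles',
];

// Local hour (0-23) at the number's area code, or null if the code is unknown.
function localHourFor(string $number, array $areaCodeTimezones, DateTimeImmutable $nowUtc): ?int
{
    $areaCode = substr($number, 0, 3);
    if (!isset($areaCodeTimezones[$areaCode])) {
        return null;
    }
    $local = $nowUtc->setTimezone(new DateTimeZone($areaCodeTimezones[$areaCode]));
    return (int) $local->format('G');
}

// 14:00 UTC on a winter day is 9am in New York and 6am in Los Angeles.
$now = new DateTimeImmutable('2024-01-15 14:00:00', new DateTimeZone('UTC'));
$eastHour = localHourFor('2125550123', $areaCodeTimezones, $now);
$westHour = localHourFor('4155550123', $areaCodeTimezones, $now);
```

In SQL the equivalent would be a join on the first three characters of the phone column, which is the extra string manipulation mentioned above.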
Definitely store them in one field as a text string, and only store the digits. Think of it this way: no matter what the digits are, it's all one telephone number. However, the segmenting of the number depends on several things (locality, how many digits were provided, even personal preference). It's easier to store the whole thing and change it later with text manipulation.
I think splitting the number into 3 fields is the best option if you want to use area codes as filters; otherwise, you should only use 1 field.
Remember to use ZEROFILL if you plan on storing them as numbers ;)
It really depends on a couple of factors:
is it possible you will have international numbers?
how much area code/city code searching/manipulation will you be doing?
No matter what, I would only store numbers, it's easy enough to format either in MySQL or PHP and add parentheses and dashes.
Unless I was going to do a lot of searching by area code, I would just put the entire phone number into a single field, since I assume most of the time you would be retrieving the entire phone number anyway.
If it's possible that you will take international numbers in the future:
You might want to add a country field though, that way you won't have to guess what country they are from when dealing with the number.
What you use depends on how you plan to use the data, and where the program will be used.
If you want to efficiently search records by area code, then split out the area code; queries will perform much faster when they're doing simple string comparisons versus string manipulation of the full phone number to get the area code.
HOWEVER, be advised that phone numbers formatted XXX-XXX-XXXX are only found in the US, Canada, and other smaller Caribbean territories that are subject to the NANPA system. Various other world regions (EU, Africa, ASEAN) have very different numbering standards. In such cases, splitting out the equivalent of the "area code" may not make sense. Also, if all you want to do is display a phone number to the user, then just store it as a string.
Whether to store a number with a format or not is mostly personal preference. Storing the raw number allows the formatting to be changed easily; you could go from XXX-XXX-XXXX to (XXX) XXX-XXXX by changing a couple lines of code instead of reformatting the 10 million numbers you already have. Removing special characters from a phone number is also a relatively simple Regex. Storing without formatting will also save you a few bytes per number and allow you to use a fixed-length field (saving further data overhead inherent in varchars). This may be of use in a mobile app where storage is at a premium. However, that 5-terabyte distributed SQL cluster in your server room is probably not gonna notice much difference between a char(10) and a varchar(15). Storing them formatted also speeds up loading the data; you don't have to format it first, just yank it out of the DB and plaster it on the page.
I'm trying to extract phone numbers from a set of data. It has to be able to extract international and local numbers from each country.
The rules I've laid out for it are:
1. Look for the international symbol (+), indicating it's an international number with a valid country code (from +1 to +999).
2. If the plus symbol is present, make sure the next character is a digit.
3. If there is none, check that the length is between 7 and 10 digits.
4. If the number is divided (correctly, per international standards) by either a hyphen (-) or a space, make sure the number of digits in each group is either 3 or 4.
What I've got so far is:
\+(?=[1-999])(\d{4}[0-9][-\s]\d{3}[0-9][-\s]\d{4}[0-9])|(\d{7,11}[0-9])
That's for international, and the local search is \d{7,10}
The thing is, that it doesn't actually pick up numbers with spaces or hyphens in it.
Can anybody give me some advice on it?
\d already means "digit", so you shouldn't put another [0-9] after it (which means the same).
In the same vein, [1-999] doesn't mean what you think it does. It in fact matches one (1) digit between 1 and 9. You probably want \d{1,3} although that would also match 0.
Then, you're only allowing one variation of dividing blocks (4-3-4) - why? This is not going to match many, many valid phone numbers.
I would suggest the following:
Search your string using the regex \+?(?=\d)[\d\s-]{7,13}\b to grab anything that remotely looks like a phone number. Perhaps you also want to include parentheses and slashes in the allowed character list: \+?(?=\d)[\d\s/()-]{7,14}\b
Then process and validate those strings separately, best after removing all punctuation/whitespace (except the +).
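A rough PHP sketch of that two-step approach, using the first regex from above to collect candidates and then normalizing each one. The function name and the minimum of 7 digits in the second step are assumptions for illustration; real validation rules would go where the length check is.

```php
<?php
// Step 1: grab anything that remotely looks like a phone number.
// Step 2: strip punctuation and whitespace (keeping a leading +) and
//         apply a simple sanity check on the number of digits.
function extractPhoneCandidates(string $text): array
{
    preg_match_all('/\+?(?=\d)[\d\s-]{7,13}\b/', $text, $matches);
    $numbers = [];
    foreach ($matches[0] as $candidate) {
        $plus = (strpos($candidate, '+') === 0) ? '+' : '';
        $digits = preg_replace('/\D/', '', $candidate);
        if (strlen($digits) >= 7) {
            $numbers[] = $plus . $digits;
        }
    }
    return $numbers;
}
```

Note that the {7,13} limit counts separators too, so a long number written with many spaces or hyphens can be cut short by the first step. Widen the range if your data contains such numbers.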
I'm not sure it will be possible to create a regex to match every country - some countries have conflicting rules.
It's entirely possible to have, e.g., two valid local numbers contained within one valid international number.
You might want to start by looking at some of the answers to this question:
A comprehensive regex for phone number validation
If you're looking to create something definitive for every country, good luck, and you'll probably need to spend a while with some technical standards...
i.e. both 177 and 186-0039-011-81-90-1177-1177 are valid phone numbers in the same country
Using PHP, how can I verify if a phone # is well formed?
It seems easiest to simply strip all non-numeric data, leaving only the numbers. Then to check if 10 digits exist.
Is this the best and easiest way?
The best? No. Issues I see with this approach:
Some area codes - like 000-###-#### - are not valid. See http://en.wikipedia.org/wiki/List_of_NANP_area_codes
Some exchanges - like ###-555-#### - are not valid. See http://en.wikipedia.org/wiki/555_%28telephone_number%29
Some people will enter a 1 before their number, i.e. 1-###-###-####.
Some people are only reachable at an extension, like ###-###-#### x####.
Some companies tack on extra digits, like 1-800-GO-FLOWERS. The additional digits are simply ignored by the phone system, but a user might expect to be able to enter the whole thing.
International phone numbers are not necessarily 10 digits, even if you discount the country codes.
Good enough? Quite possibly, but that's up to you and your app.
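If you want to go a step beyond "10 digits", a sketch in PHP could look like the following. It handles the optional leading 1 and rejects area codes and exchanges that start with 0 or 1, which the North American numbering plan does not allow. It deliberately ignores extensions, vanity numbers and international numbers, and the function name is made up.

```php
<?php
// Plausibility check for North American numbers: optional leading 1,
// then 10 digits whose area code and exchange both start with 2-9.
function isPlausibleNanpNumber(string $input): bool
{
    $digits = preg_replace('/\D/', '', $input);
    if (strlen($digits) === 11 && $digits[0] === '1') {
        $digits = substr($digits, 1);
    }
    return preg_match('/^[2-9]\d{2}[2-9]\d{6}$/', $digits) === 1;
}
```

This still only checks the shape of the number. Whether a given area code is actually assigned would need a lookup against a list such as the one linked above.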
You can use a regex for it:
$pattern_phone = "|^[0-9\+][0-9\s+\-]*$|i";
if (!preg_match($pattern_phone, $phone)) {
    // Something's wrong
}
Haven't tested the regex, so it may not be 100% correct.
Checking for 10 digits after stripping will check the syntax but won't check the validity. For that you'd need to determine what valid numbers are available in the region/country and probably write a regex to match the patterns.
The problem with validating/filtering data like this usually comes down to the answer to this question: "How strict do I want to be?", which then devolves into a series of "feature" questions:
Are you going to accept international numbers?
Are you going to accept extensions?
Are you going to allow various formats i.e., (111) 222-3333 vs 111.222.3333
Depending on your business rules, the answers to these questions can vary. But to be the most flexible, I recommend 3 fields to take a phone number
Country Code (optional)
Phone Number
Extension (optional)
All 3 fields can be programmatically limited/filtered to numeric values only. You can then combine them before storing into some parse-able format, or store each value individually.
Answering if something is "the best" thing to do, is nearly impossible (unless you're the one answering your own question).
The way you propose it, stripping all non-digits and then check if there are 10 digits, might result in unwanted behaviour for a string like:
George Washington (February 22, 1732 –
December 14, '99) was the commander
of the Continental Army in the
American Revolutionary War and served
as the first President of the United
States of America.
since stripping all non-digits will result in the string 2217321499, which is 10 digits long, but I highly doubt that the entire string should be considered a valid phone number.
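One way to guard against that, sketched in PHP: before stripping, require that the input consists only of digits and the usual separators. The allowed separator set and the function name are assumptions, so adjust them to whatever formats you decide to accept.

```php
<?php
// Accept the input only if it contains nothing but digits and common
// phone separators, and exactly 10 digits remain after stripping.
function looksLikePhoneNumber(string $input): bool
{
    if (preg_match('/^[\d\s().+-]+$/', $input) !== 1) {
        return false;
    }
    return strlen(preg_replace('/\D/', '', $input)) === 10;
}

$text = "George Washington (February 22, 1732 - December 14, '99) was the commander";
$strippedOnly = preg_replace('/\D/', '', $text); // the naive approach
```

The naive strip turns the sentence into 2217321499, while the stricter check rejects it because of the letters.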
What format do you need? You can use regular expressions for this.