This may end up being a trivial question - I know I'm going to need to do this soon for an app I'm working on, but haven't really worked on it myself yet - I'm really just floating it to see if there's an obvious method I'm missing.
Basically, what I need is to generate a sequence of numbers using a-z, A-Z, 0-9, except without vowels. There is a small chance I will need to make it unpredictable, so being able to generate out of order is a bonus.
I'm initially thinking for each new one to just work forward from the last no-vowel match until I find the next one (or generate random numbers until I get one I don't have already in the case of unpredictable values), but is there a better way? Perhaps a baseX number system obj that allows you to specify the allowable characters?
Using PHP/MySQL if it matters.
There's a function in an answer of mine here that can convert from any base to any other and which lets you customize the digit pool; it also works on arbitrary-sized input. You can generate a sequence in base 10 and convert to whatever you need.
Related
Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in the set A and a string in a set B.
Can you please help me in finding such an algorithm?
I normally program in PHP, a SQL-only solution would be even better, but any other language would go.
As Voitcus said before, you have to clean your data first before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers which do not match the pattern try to adjust them to it. Then you have the ability to look for duplicates.
Morevover you should do data-cleaning before persisting it, maybe in a seperate column. You then dont have to care for that when looking for duplicates ... just to avoid performance peaks.
Algorithms like levenshtein or similar_text() in php, doesnt fit to that use-case quite well.
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways, some regular expression would be the best, but see below.
Then, if it is possible, you can find the country direction code, if the user has its location country. If there is none, assume default and add to the string. The same would be probably with the cities. You can try to take a look also in place one lives, their zip code etc.
At the end of this you should have uniform phone numbers which can be easily compared.
The other way is to compare strings with the country (and city) code removed.
About searching "the longest common substring": The numbers thus filtered are the same, however you might need it eg. if someone typed "call me after 6 p.m.". If you're sure that the phone number is always at the beginning, so nobody typed something like 555-SUPERMAN which translates to 555-78737626, there is also possibility to remove everything after the last alphanumeric character (and this character, as well).
There is also a possibility to filter such data in the SQL statement. Consider something like a SELECT ..., [your trimming function(phone_number)] AS trimmed_phone WHERE (trimmed_phone is not numerical characters only) GROUP BY trimmed_phone. If trimming function would remove only whitespaces and special dividers like -, +, . (commonly in use in Germany), , perhaps etc., this query would leave you all phone numbers that are trimmed but contain characters not numeric -- take a look at the results, probably mostly digits and letters. How many of them are they? Maybe they have something common? Maybe some typical phrases you can filter out too?
If the result from such query would not be very much, maybe it's easier just to do it by hand?
I have to format a telephone number list, and I'd wish to extract and separate the prefix from the number for better viewing.
I have a list of all possible prefixes, but there is no regular pattern.
I mean, I could have these numbers:
00 - 12345 (short prefix)
0000 - 12345 (long prefix)
How can I manage that? Numbers are plain, without any special char (ie without / \ - . , ecc ecc).
Prefixes are like that:
030
031
0321
0322
...
...
Most of the time I have the town of the customer (it's not required) so, usually i can get the prefix from there, but that's not a sure thing, since town and telephone couldn't be linked.
== EDIT ==
Prefix list is 231 entries long. Maybe I'll add somthing more, so take 300 as safe value
Moreover, prefixes come from a single country only (Italy)
I have to save plain numbers without any separator so users can search for it. Infact if they put separators they will never able to find again that.
More info
Prefix ALWAYS starts with a leading 0, its lenght ranges from 2-4
But the more i study this thing, the more i think i can't work it out :(
Because of the extremely varied telephone number formats used around the world, it's probably going to be tough to correctly parse any phone number that is put into your system.
I'm not certain if it would make your ask any easier, but I had the idea that parsing from Right-to-Left might be easier for you, since it's the Prefix length that's unknown
What a pain. I would use a logic funnel to narrow possible choices and finally take a best guess.
First, test if the first few numbers can match anything on your prefix list. For some, hopefully only one prefix can possibly be correct.
Then, perhaps you could use the city to eliminate prefixes from entirely different countries.
Finally, you could default to the most popular format for prefixes.
Without any other information, you can't do better than a good guess unless you want to default to no format at all.
I'm really confused. What do you mean, "extract and separate"? My guess is these phone numbers are in a MySQL database, and you get to use PHP. Are you trying to get the prefix from the numbers, and then insert them into a different field in the same row? Are you pulling these numbers from the database, and you would just like to print the prefixes to the screen?
Regardless of what you're trying to do, and taking for granted that you're using PHP and regexs, isn't this essentially what you're looking for?:
$telephone_number = '333-12345';
$matched = array();
preg_match('~^(\d+)-~', $telephone_number, $matched);
$matched[1] // Should be '333'
ok, I worked it out.
I saw that there aren't shor prefixes that share chars with longer one.
I mean:
02 -> there will never be a longer prefix as 021, 022 and so on
so things are pretty easy now:
I get first 4 numbers -> is that in my prefix table?
YES: stop here
NO: get first 3
and so on..
thanks for your help
I'm trying to exract phone numbers from a set of data. It has to be able to extract international and local numbers from each country.
The rules I've laid out for it are:
1. Look for the international symbol, indicating it's an international dialing number with a valid extension(from +1 to +999).
2. If the plus symbol is present, make sure the next following character is a number.
3. If there is none, look at the length to validate it is between 7 and 10 digits long.
4. In the event that the number is divided (correctly via international standers) by either a hyphen(-) or space make sure the amount of digits in between them are either 3 or 4
What I've got so var is:
\+(?=[1-999])(\d{4}[0-9][-\s]\d{3}[0-9][-\s]\d{4}[0-9])|(\d{7,11}[0-9])
That's for international, and the local search is\d{7,10}
The thing is, that it doesn't actually pick up numbers with spaces or hyphens in it.
Can anybody give me some advice on it?
\d already means "digit", so you shouldn't put another [0-9] after it (which means the same).
In the same vein, [1-999] doesn't mean what you think it does. It in fact matches one (1) digit between 1 and 9. You probably want \d{1,3} although that would also match 0.
Then, you're only allowing one variation of dividing blocks (4-3-4) - why? This is not going to match many, many valid phone numbers.
I would suggest the following:
Search your string using the regex \+?(?=\d)[\d\s-]{7,13}\b to grab anything that remotely looks like a phone number. Perhaps you also want to include parentheses and slashes in the allowed character list: \+?(?=\d)[\d\s/()-]{7,14}\b
Then process and validate those strings separately, best after removing all punctuation/whitespace (except the +).
I'm not sure it will be possible to create a regex to match every country - some countries have conflicting rules.
it's entirely possible to have e.g. two valid local numbers contained within 1 valid international number.
You might want to start by looking at some of the answers to this question:
A comprehensive regex for phone number validation
If you're looking to create something definitive for every country, good luck, and you'll probably need to spend a while with some technical standards...
i.e. both 177 and 186-0039-011-81-90-1177-1177 are valid phone numbers in the same country
What's the most efficient way to quickly check for telephone number format from a form input text?
easiest way is to strip out everything that is not a number and then count how many digits there are. That way you don't have to force a certain format on the user. As far as how many numbers you should check for...depends on if you are checking for just local numbers, US numbers, international numbers, etc...but for instance:
$number="(123) 123-1234";
if (strlen(preg_replace('~[^0-9]~','',$number)) == 10) {
// 10 digit number (US area code plus phone number)
}
That really depends on the location. There are different conventions from place-to-place as to the length of the area code and the length of the entire number. Being more permissive would be better.
A regular expression to ensure that it is an optional '+', followed by any number of digits would be the bare minimum. Allowing optional dashes ("-") or spaces separating groups of numbers could be a nice addition. Also, what about extension numbers?
In truth, it's probably best to just check that it includes some numbers.
If you are only dealing with U.S. phone numbers, you might follow the common UI design of using three textboxes = one for area code + second one for exchange + third one for number. That way, you can test that each one contains only digits, and that the correct number of digits is entered.
You can also use the common techniques of testing each keypress for numbers (ignoring keypresses that are not digits), and advancing to the next textbox after the required number of characters have been entered.
That second feature makes it easier for users to do the data entry.
Using separate textboxes also makes it a little easier for users to read their input, easier than, say, reading that they have entered ten digits in a row correctly.
It also avoids you having to deal with people who use different punctuation in their entries -- surrounding the area code with parentheses, using hyphens or dots between sections.
Edit: If you decide to stick with just one textbox, there are some excellent approaches to using regex in this SO question.
If you are only dealing with U.S. phone numbers, you might follow the common UI design of using three textboxes = one for area code + second one for exchange + third one for number. That way, you can test that each one contains only digits, and that the correct number of digits is entered.
If you are dealing with numbers worldwide, this breaks down because some countries don't use area codes at all and number lengths vary from country to country.
Subscriber numbers may be as short as 3 digits or as long as 11 or 12 digits. Area codes can range from 1 to 6 digits where used. The data will also need to be stored with the country code in order to have any chance of correctly formatting it for display.
I'm not sure what this is called, which is why I'm having trouble searching for it.
What I'm looking to do is to take numbers and convert them to some alphanumeric base so that the number, say 5000, wouldn't read as '5000' but as 'G4u', or something like that. The idea is to save space and also not make it obvious how many records there are in a given system. I'm using php, so if there is something like this built into php even better, but even a name for this method would be helpful at this point.
Again, sorry for not being able to be more clear, I'm just not sure what this is called.
You want to change the base of the number to something other than base 10 (I think you want base 36 as it uses the entire alphabet and numbers 0 - 9).
The inbuilt base_convert function may help, although it does have the limitation it can only convert between bases 2 and 36
$number = '5000';
echo base_convert($number, 10, 36); //3uw
Funnily enough, I asked the exact opposite question yesterday.
The first thing that comes to mind is converting your decimal number into hexadecimal. 5000 would turn into 1388, 10000 into 2710. Will save a few bytes here and there.
You could also use a higher base that utilizes the full alphabet (0-Z instead of 0-F) or even the full 256 ASCII characters. As #Yacoby points out, you can use base_convert() for that.
As I said in the comment, keep in mind that this is not an efficient way to mask IDs. If you have a security problem when people can guess the next or previous ID to a record, this is very poor protection.
dechex will convert a number to hex for you. It won't obfuscate how many records are in a given system, however. I don't think it will make it any more efficient to store or save space, either.
You'd probably want to use a 2 way crypt function if obfuscation is needed. That won't save space, either.
Please state your goals more clearly and give more background, because this seems a bit pointless as it is.
This might confuse more people than simply converting the base of the numbers ...
Try using signed digits to represent your numbers. For example, instead of using digits 0..9 for decimal numbers, use digits -5..5. This Wikipedia article gives an example for the binary representation of numbers, but the approach can be used for any numeric base.
Using this together with, say, base-36 arithmetic might satisfy you.
EDIT: This answer is not really a solution to the question, so ignore it unless you are trying to hash a number.
My first thought we be to hash it using eg. md5 or sha1. (You'd probably not save any space though...)
To prevent people from using rainbow-tables or brute force to guess which number you hashed, you can always add a salt. It can be as simple as a string prepended to your number before hashing it.
md5 would return an alphanumeric string of exactly 32 chars and sha1 would return one of exaclty 40 chars.