This whole problem has come up because our data input people are useless. We have a form for adding items to a database, and one of the fields is a price. The format is something like lowest - highest (lowest without 10% fee - highest without 10% fee), e.g. 11 - 22 (10 - 20)
The problem is that the people adding this data are REALLY inconsistent about the pound sign, so some entries look like 11-£22(£10-20). My idea is that when I bring the data back, I remove any £ signs and re-add them all, so every price looks the same.
I'm guessing to do this I will need some sort of RegEx to match something, but I'm not sure what the pattern would be.
Can anyone help me figure out what RegEx I'd need to use?
If your regex flavour supports lookarounds you could use the expression:
£?(?<!\d)(\d+)
and use the following as the replacement:
£\1
This should work fine in PHP.
You could also use this expression if you expect the price to contain commas and full stops:
£?(?<![0-9,.])(\d+)
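As a rough illustration of how the second expression behaves, here is a minimal sketch in Python rather than PHP. The function name normalise_prices is made up for this example, and it assumes prices are plain digits with optional commas and full stops.

```python
import re

# Optional existing pound sign, then a run of digits that is not
# preceded by a digit, comma or full stop (so "1,000.50" stays one price).
PRICE = re.compile(r"£?(?<![0-9,.])(\d+)")

def normalise_prices(text):
    """Put exactly one pound sign in front of every price in text."""
    return PRICE.sub(r"£\1", text)

print(normalise_prices("11-£22(£10-20)"))
print(normalise_prices("£1,000.50 - 2000"))
```

The lookbehind is what stops the digits after a comma or full stop from being treated as a new price.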
A simpler solution would be to provide a drop down with a list of currency symbols. That way the addition of the symbol is obvious to the users.
You can still add an expression that strips all non-numeric characters while allowing a single dot and any number of commas.
You could also use JavaScript to restrict the characters entered, but provide server-side validation/modification anyway.
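You can simply do this (the u modifier assumes your script and data are UTF-8; without it the ? applies only to the last byte of the £ character):
$result = preg_replace('~£?(\d+)~u', '£$1', '11-£22(£10-20)');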
Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in the set A and a string in a set B.
Can you please help me in finding such an algorithm?
I normally program in PHP, a SQL-only solution would be even better, but any other language would go.
As Voitcus said before, you have to clean your data first before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers which do not match the pattern try to adjust them to it. Then you have the ability to look for duplicates.
Moreover, you should clean the data before persisting it, maybe into a separate column. You then don't have to worry about it when looking for duplicates, which avoids performance peaks.
Algorithms like levenshtein() or similar_text() in PHP don't fit this use case very well.
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways, some regular expression would be the best, but see below.
Then, if possible, work out the country dialing code from the user's country, if it is known. If there is none, assume a default and add it to the string. The same probably applies to area codes: you can also look at where the person lives, their zip code, etc.
At the end of this you should have uniform phone numbers which can be easily compared.
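A minimal sketch of the "strip non-digits, then compare" idea from the question, written in Python rather than PHP or SQL. The helper names are invented for this example, and it assumes that simply dropping every non-digit is acceptable, so country codes are not normalised here.

```python
import re
from difflib import SequenceMatcher

def digits_only(raw):
    """Drop everything that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", raw)

def longest_common_digits(set_a, set_b):
    """Longest digit run shared by any number in set_a and any in set_b."""
    best = ""
    for a in map(digits_only, set_a):
        for b in map(digits_only, set_b):
            matcher = SequenceMatcher(None, a, b, autojunk=False)
            m = matcher.find_longest_match(0, len(a), 0, len(b))
            if m.size > len(best):
                best = a[m.a:m.a + m.size]
    return best

profile_a = ["001-555-123456"]
profile_b = ["555-123456 call me in the evening", "444-000"]
print(longest_common_digits(profile_a, profile_b))
```

The length of the returned run could then feed the existing matching score, for example by treating anything shorter than six or seven digits as no match. That threshold is a guess and would need tuning on real data.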
The other way is to compare strings with the country (and city) code removed.
About searching for "the longest common substring": once filtered this way, matching numbers are identical, but you might still need it, e.g. if someone typed "call me after 6 p.m.". If you're sure that the phone number always comes first, and that nobody typed something like 555-SUPERMAN (which translates to 555-78737626), you could also remove everything from the first letter onwards.
There is also a possibility to filter such data in the SQL statement. Consider something like SELECT ..., [your trimming function(phone_number)] AS trimmed_phone WHERE (trimmed_phone is not numeric characters only) GROUP BY trimmed_phone. If the trimming function removes only whitespace and common dividers such as -, +, . (commonly in use in Germany) and perhaps a few others, this query leaves you with all the trimmed phone numbers that still contain non-numeric characters. Take a look at the results, which will probably be mostly digits and letters. How many of them are there? Do they have something in common? Are there typical phrases you can filter out too?
If the result from such query would not be very much, maybe it's easier just to do it by hand?
I want to allow alphanumeric characters and periods; however, the phrase cannot contain two or more periods in a row, it cannot start or end with a period, and spaces are not allowed.
I am using both PHP and Javascript.
So far, I have /^(?!.*\.{2})[a-zA-Z0-9.]+$/
This works for allowing alphanumeric characters and periods while denying spaces and consecutive periods, but I am still not sure how to check for a starting and/or ending period. How might I do this? And is there an even better way to do what I already have?
It nearly always helps to draw a finite state machine to conceptualize what your regular expression should look like.
^(?:\w\.?)*\w$
here's a possible way
/^(?!\.)((?:[a-z\d]|(?<!\.)\.)+)(?<!\.)$/i
for more explanations and tests see here: http://www.regex101.com/r/rZ6yH4
edit: building on tyler's solution, here's his approach, shortened and restricted to letters and digits
/^(?:[a-z\d]+(?:\.(?!$))?)+$/i
( http://www.regex101.com/r/dL5aG0 )
A start would be:
/^[^. ](?!.*\.{2})[a-zA-Z0-9.]+[^. ]$/
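but it should be tested carefully: [^. ] accepts any character other than a period or a space in the first and last positions (so "!ab!" passes), and the pattern needs at least three characters.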
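For comparison, the three rules (alphanumerics and periods only, no leading or trailing period, no consecutive periods) can also be written without lookarounds, as runs of alphanumerics joined by single periods. This is a sketch in Python with a made-up function name. The same pattern text should work in PHP and JavaScript when wrapped in ^...$, though that has not been tested here.

```python
import re

# One or more alphanumerics, then zero or more ".alphanumerics" groups.
# Each period must be followed by at least one alphanumeric, which rules
# out "..", a trailing period and (because of the first group) a leading one.
VALID = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")

def is_valid_phrase(s):
    return VALID.fullmatch(s) is not None

for candidate in ["a.b", "abc", "a..b", ".ab", "ab.", "a b", "!ab!"]:
    print(candidate, is_valid_phrase(candidate))
```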
I’m trying to validate a string which contains numbers where each four numbers are separated by a hyphen, for example 1111-2222-3333-4444
I’m trying to do some kind of validating so I can guarantee that this format is being used (with 16 digits, three hyphens and nothing else). I’ve this preg_match where it checks for digits only but I need to accept hyphens and this format.
preg_match('/^[0-9]{1,}$/', $validatenumbers)
I’ve tried to do it with regex but unfortunately it isn’t my strongest side so I haven’t been able to correctly validate the numbers.
It is important that it is in PHP and not Javascript because of the ability to “turn off” javascript in a browser.
preg_match("/^([0-9]{4}-){3}[0-9]{4}$/", $input);
([0-9]{4}-){3} Matches exactly 3 groups of 4 digits followed by a hyphen. That is terminated by another group [0-9]{4} (4 digits without a hyphen).
preg_match('/^[0-9]{4}\-[0-9]{4}\-[0-9]{4}\-[0-9]{4}$/',$numbers);
I think that should work.
This looks like a credit card number. If that's the case, you should use a Luhn checksum instead of a simple regex.
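If it is a card number, the Luhn check can be run alongside the format check rather than replacing it. Below is a minimal sketch in Python (the thread itself is about PHP). The function names are invented for this example, and passing the checksum only means the digits are self-consistent, not that the card exists.

```python
import re

FORMAT = re.compile(r"(?:[0-9]{4}-){3}[0-9]{4}")

def luhn_ok(digits):
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def is_valid_card_string(s):
    """Require the 4-4-4-4 format first, then the checksum."""
    if FORMAT.fullmatch(s) is None:
        return False
    return luhn_ok(s.replace("-", ""))

print(is_valid_card_string("4111-1111-1111-1111"))
print(is_valid_card_string("4111-1111-1111-1112"))
```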
try:
if (preg_match('#^\d{4}-\d{4}-\d{4}-\d{4}$#', $string)) {}
If you need to match that exact format, the pattern would be '~^\d{4}-\d{4}-\d{4}-\d{4}$~', or you can write it more generally like this: '/^(\d+-)*\d+$/' (this would match 11, 11-11111 and so on).
I'm looking to convert Pinyin where the tone marks are written with accents (e.g.: Nín hǎo) to Pinyin written in numerical/ASCII form (e.g.: Nin2 hao1).
Does anyone know of any libraries for this, preferably PHP? Or know Chinese/Pinyin well enough to comment?
I started writing one myself that was rather simple, but I don't speak Chinese and don't fully understand the rules of when words should be split up with a space.
I was able to write a translator that converts:
Nín hǎo. Wǒ shì zhōng guó rén ==> Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2
But how do you handle words like the following - do they get split up with a space into multiple words, or do you interject the tone numbers within the word (if so, where?) :
huā shíjiān, wèishénme, yuèláiyuè, shēngbìng, etc.
The problem with parsing pinyin without a space separating each word is that there will be ambiguity. Take, for instance, the name of an ancient Chinese capital 长安: Cháng'ān (notice the disambiguating apostrophe). If we strip out the apostrophe, however, this can be interpreted in two ways: Chán gān or Cháng ān. A Chinese speaker would tell you that the second is far more likely, depending on the context of course, but there's no way your computer can do that.
Assuming no ambiguity, and that all input are valid, the way I would do it would look something like this:
Create accent folding function
Create an array of valid pinyin (You should take it from the Wikipedia page for pinyin)
Match each word to the list of valid pinyin
Check ahead to the next word when there is ambiguity about the possibility of the last character belonging to the next word, such as:
shēngbìng
    ^ Does this 'g' belong to the next word?
Anyway, the correct positioning of the numerical representation of the tones, and the correct numerals to represent each accent are covered fairly well in this section of the Wikipedia article on pinyin: http://en.wikipedia.org/wiki/Pinyin#Numerals_in_place_of_tone_marks. You might also want to have a look at how IMEs do their job.
Spacing should stay the same, but you got the tone numbering wrong.
Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2.
wèishénme becomes wei4shen2me.
Remove diacritical marks by mapping "āáǎà" to "a", etc.
Using simple maximum matching algorithm, split compounds into syllables (there are only 418 or so Mandarin syllables).
Append numbers (you have to remember what kind of mark you removed) and join syllables back into compounds.
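The three steps above can be sketched roughly as follows, in Python rather than PHP. This rests on several assumptions: SYLLABLES is a tiny hand-picked subset standing in for the full syllable table, ü is not handled, neutral-tone syllables get no digit, and greedy longest-match will pick the wrong split for ambiguous input such as Cháng'ān without its apostrophe.

```python
import unicodedata

# Combining marks left behind by NFD decomposition, mapped to tone numbers.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}

# Illustrative subset only; a real table would list every Mandarin syllable.
SYLLABLES = {"nin", "hao", "wei", "shen", "me", "sheng", "bing"}
MAX_LEN = max(len(s) for s in SYLLABLES)

def fold_tones(text):
    """Return (text without tone marks, {letter index: tone digit})."""
    plain, tones = [], {}
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONE_MARKS and plain:
            tones[len(plain) - 1] = TONE_MARKS[ch]
        else:
            plain.append(ch)
    return "".join(plain), tones

def to_numbered(text):
    """Greedy longest-match split into syllables, tone digit after each."""
    plain, tones = fold_tones(text)
    lower = plain.lower()
    out, i = [], 0
    while i < len(plain):
        size = next((n for n in range(min(MAX_LEN, len(plain) - i), 0, -1)
                     if lower[i:i + n] in SYLLABLES), None)
        if size is None:
            out.append(plain[i])  # not a known syllable: copy through
            i += 1
            continue
        tone = next((tones[k] for k in range(i, i + size) if k in tones), "")
        out.append(plain[i:i + size] + tone)
        i += size
    return "".join(out)

print(to_numbered("wèishénme"))
print(to_numbered("Nín hǎo"))
```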
First, a brief example, let's say I have this /[0-9]{2}°/ RegEx and this text "24º". The text won't match, obviously ... (?) really, it depends on the font.
Here is my problem: I have no control over which characters the user types, so I need either to cover all possibilities in the regex, /[0-9]{2}[°º]/, or, even better, to make sure the text contains only the characters I'm expecting (°). I can't just remove the unknown characters, otherwise the regex won't match; I need to replace each one with the expected character it looks like. I have done this with a little function that maps "looks like" to "what I expect", but I have not covered all the possibilities. For example, today I found a new dash, so now we have three of them, just like LaTeX =D (-, --, ---). Cool, but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has unicode character classes that cover things like this, but as you can see here, :So covers too much. Also as it's not PHP doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degrees Fahrenheit and Celsius. There are tons of minus signs, unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
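The other route discussed in this thread is to map look-alikes to the expected character first and keep the regex simple. Here is a minimal sketch in Python (the thread is about PHP). The table holds only the characters named in this thread plus a few common dashes, so it is an assumption about what users type and certainly incomplete.

```python
import re

# Look-alike -> expected character. Extend as new variants turn up.
LOOKALIKES = str.maketrans({
    "\u00ba": "\u00b0",  # masculine ordinal indicator -> degree sign
    "\u02da": "\u00b0",  # ring above
    "\u2070": "\u00b0",  # superscript zero
    "\u2218": "\u00b0",  # ring operator
    "\u2010": "-",       # hyphen
    "\u2013": "-",       # en dash
    "\u2014": "-",       # em dash
    "\u2212": "-",       # minus sign
})

def find_temperatures(text):
    """Normalise look-alikes, then match 1-3 digits followed by a degree sign."""
    return re.findall("[0-9]{1,3}\u00b0", text.translate(LOOKALIKES))

print(find_temperatures("24\u00ba outside, 7\u02da in the fridge"))
```

Keeping the mapping in one table means a newly discovered variant is a one-line change, and the regex itself never has to grow.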
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
OK, if you're looking to pull temperatures you'll probably need to start by changing a few things first.
Temperatures can come in 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to put in a four-digit temperature then we are all doomed!).
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up any single non-letter character after the digits, which covers all the punctuation marks (only one character though).