Fetching similar sounding names from a table - php

I have a student table named stu_table, and the student name field is stu_name.
The table contains many students, such as Mrinmoy, Minmoy, Minmay, Mrinmay, Tanmay, Rajesh, Susanta, Bireshwar, etc.
I would like to fetch the students whose names sound like Mrinmoy.

You could use MySQL SOUNDEX:
SELECT * FROM `stu_table` WHERE STRCMP(SOUNDEX(`stu_name`), SOUNDEX('Mrinmoy')) = 0
But I don't think it is very accurate and it's very limited.
SQLFIDDLE
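The same comparison can be tried in PHP, whose built-in soundex() works on the same principle (MySQL's SOUNDEX can return longer keys, so results may differ for long names). This is a minimal sketch: the name list is the one from the question, and the helper name soundsLike is chosen only for illustration.

```php
<?php
// Hypothetical helper: keep the names whose soundex key equals that of $target.
function soundsLike(array $names, string $target): array
{
    $key = soundex($target);

    return array_values(array_filter(
        $names,
        fn (string $name): bool => soundex($name) === $key
    ));
}

// Names taken from the question.
$students = ['Mrinmoy', 'Minmoy', 'Minmay', 'Mrinmay', 'Tanmay', 'Rajesh', 'Susanta', 'Bireshwar'];

print_r(soundsLike($students, 'Mrinmoy'));
```

With this list only Mrinmoy and Mrinmay share a key. Minmoy and Minmay lack the "r" and get a different key, which illustrates the limitation mentioned above.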

Double Metaphone is a SOUNDEX-like hash algorithm for imprecise matching of Roman-alphabet, English-pronunciation proper-name text. It works tolerably well for other single words besides names.
The Double Metaphone hash algorithm generates either one or two hash values for a word. That's what makes it "double." For example, there's a village in Massachusetts USA called "Gill". It has the two metaphone hashes with values KL and JL, corresponding to two different pronunciations.
Now suppose somebody hears that village's name as "Jill" and searches for it. The metaphone hashes of "Jill" are JL and AL. To find the match, the double metaphone search must compare four possible pairs:
Gill  Jill
KL    JL    mismatch
KL    AL    mismatch
JL    JL    match!
JL    AL    mismatch
Therefore, "Gill" and "Jill" are considered matching by double metaphone.
Many words only have one metaphone hash. Those are easier to match.
A MySQL stored function to generate the metaphone hashes can be found here.
http://www.atomodo.com/code/double-metaphone/
But beware: given a word with two metaphone hashes it returns them in one string separated by a semicolon.
Like the ancient and honorable SOUNDEX, Double Metaphone favors false positive matches rather than false negative. But it has better rates on both, mostly due to its double-hash capability.
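The pairwise comparison described above can be sketched in PHP. The sketch assumes the hashes arrive as a single string with a semicolon separator, as the stored function is said to return them. It does not compute the hashes itself: the values KL, JL and AL are copied from the example, and the function names are made up for illustration.

```php
<?php
// Split a "KL;JL"-style string into its individual hash codes.
function metaphoneCodes(string $hashes): array
{
    return array_filter(array_map('trim', explode(';', $hashes)), 'strlen');
}

// Two words match when any code of one equals any code of the other.
function metaphoneMatch(string $hashesA, string $hashesB): bool
{
    $shared = array_intersect(metaphoneCodes($hashesA), metaphoneCodes($hashesB));

    return count($shared) > 0;
}

// "Gill" (KL;JL) against "Jill" (JL;AL): the JL/JL pair matches.
var_dump(metaphoneMatch('KL;JL', 'JL;AL'));
```

A word with a single hash is handled by the same code, since explode() then returns a one-element array.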

MySQL has the SOUNDS LIKE operator: expr1 SOUNDS LIKE expr2 is the same as SOUNDEX(expr1) = SOUNDEX(expr2).
Have a look at it:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#operator_sounds-like

Related

Finding out sequence similarity in arrays

I have a task where I have three arrays A, B and C. All of them contain the same data. For the sake of simplicity, let's assume the data is the numbers 1 to 5. The data is in different jumbled sequences. I want to find out which of B and C is most similar to A.
Eg:
A = 1,2,3,4,5
B = 1,2,3,5,4
C = 4,1,2,3,5
In this case, it is easy to visually comprehend that B is more similar to A. But it gets more complicated for really jumbled sequences.
Eg:
A = 1,2,3,4,5
B = 5,3,1,4,2
C = 4,1,2,3,5
In this case, I would assume C to be closer to A. I am thinking that this assumption can be quantified as: how many elements have the same sequence in both arrays? In the above example the subsequence [1,2,3] is the same in both arrays. The second question would be: what is the offset difference between the similar subsequences? In this case it is 1, because the subsequence begins at index 0 in A and index 1 in C.
So the number of elements in a matching sequence and their offsets are what I am thinking of using. I plan on adding a weight to these two quantities (number of elements in the matching sequence, and offset difference in their occurrence).
Does this make sense? I only need a rough approximation of similarity and the results do not need to be exact. Are there any formal mathematical or data-structure models that solve this problem?
BTW, the project where I need this implemented is in PHP. Does it have any built-in functions, like levenshtein() for string difference?
Any suggestions are very welcome!
Well, I suppose you can come up with your own algorithm (for instance, generate all suffixes, search for them, and then define a scoring procedure), or you could use a well-known algorithm like Smith-Waterman for local alignment or Needleman-Wunsch for global alignment.
The advantage of these algorithms is that they are well understood and give you all the possible alignments (and you can choose the best for your case).
NW in PHP
SW in PHP
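As a rough illustration of the "own algorithm" route, here is a sketch of the measure proposed in the question: the length of the longest run of elements that appears in both arrays, minus a penalty for the difference in where that run starts. It is not Smith-Waterman or Needleman-Wunsch. The function names and the penalty weight of 0.5 are assumptions made for the example.

```php
<?php
// Longest run of consecutive elements shared by $a and $b (dynamic programming).
// Returns the run length and the index where the run starts in each array.
function longestCommonRun(array $a, array $b): array
{
    $best = ['length' => 0, 'offsetA' => 0, 'offsetB' => 0];
    $prev = array_fill(0, count($b) + 1, 0);

    for ($i = 1; $i <= count($a); $i++) {
        $curr = array_fill(0, count($b) + 1, 0);
        for ($j = 1; $j <= count($b); $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the run that ended at the previous pair of positions.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > $best['length']) {
                    $best = [
                        'length'  => $curr[$j],
                        'offsetA' => $i - $curr[$j],
                        'offsetB' => $j - $curr[$j],
                    ];
                }
            }
        }
        $prev = $curr;
    }

    return $best;
}

// Hypothetical score: run length minus a weighted offset difference.
function runSimilarity(array $a, array $b, float $offsetWeight = 0.5): float
{
    $run = longestCommonRun($a, $b);

    return $run['length'] - $offsetWeight * abs($run['offsetA'] - $run['offsetB']);
}

$A = [1, 2, 3, 4, 5];
echo runSimilarity($A, [5, 3, 1, 4, 2]), PHP_EOL; // B from the second example
echo runSimilarity($A, [4, 1, 2, 3, 5]), PHP_EOL; // C from the second example
```

On both examples from the question this ranks the arrays the way the asker expects (B first in the first example, C first in the second), but the weight would need tuning on real data.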

Generate words (car brands/models) with mistakes

I am developing a fuzzy search mechanism. I have car brands/models and cities in a MySQL database (English and Russian names), about 1000 items. Users can enter these words with mistakes or in translit. Currently I retrieve all the words from the database and compare each one, in a loop, with the word the user entered (using Levenshtein distance and other functions).
Is there any way to generate many forms of each word (car brands/models), including misspelled forms? I would like to retrieve the words from the database directly (using the SQL LIKE operator). For example, for the car brand Toyota I want to generate Tokota, Tobota, Toyoba, Tayota, Тойота, Токота, Тобота (Russian), and many more forms of each word, so that the user can enter any of them and I can find out that he means Toyota.
Well, there is a function called SOUNDEX in MySQL. I don't know if it is what you need.
For example:
SELECT SOUNDEX('Toyyota') = SOUNDEX('Toyota')
Here is what the MySQL documentation says:
Returns a soundex string from str. Two strings that sound almost the
same should have identical soundex strings. A standard soundex string
is four characters long, but the SOUNDEX() function returns an
arbitrarily long string. You can use SUBSTRING() on the result to get
a standard soundex string. All nonalphabetic characters in str are
ignored. All international alphabetic characters outside the A-Z range
are treated as vowels.
This function, as currently implemented, is intended to work well with
strings that are in the English language only. Strings in other
languages may not produce reliable results.
Reference: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
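For comparison, the loop-and-compare approach the question already uses can be sketched in a few lines of PHP. The brand list and the cut-off of two edits are assumptions made for the example. Note that PHP's levenshtein() compares bytes, so Cyrillic input would have to be transliterated before a comparison like this gives meaningful distances.

```php
<?php
// Return the known word closest to $input, or null if none is within $maxDistance edits.
function closestWord(string $input, array $words, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;

    foreach ($words as $word) {
        $distance = levenshtein(strtolower($input), strtolower($word));
        if ($distance < $bestDistance) {
            $best = $word;
            $bestDistance = $distance;
        }
    }

    return $best;
}

// Illustrative list, not the question's real database contents.
$brands = ['Toyota', 'Honda', 'Nissan'];

foreach (['Tokota', 'Tobota', 'Toyoba', 'Tayota'] as $typo) {
    echo $typo, ' -> ', closestWord($typo, $brands) ?? 'no match', PHP_EOL;
}
```

With about 1000 items this loop is cheap, so generating every misspelled form up front may not be necessary.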

Levenshtein - grouping hotel names

I have to group some hotels into the same category based on their names. I'm using Levenshtein distance for the grouping, but however much I try, some hotels are left out of the category they are supposed to be in, or end up in another category.
For example, all these hotels should be in the same category:
=============================
Best Western Bercy Rive Gauche
Best Western Colisee
Best Western Ducs De Bourgogne
Best Western Folkestone Opera
Best Western France Europe
Best Western Hotel Sydney Opera
Best Western Paris Louvre Opera
Best Western Hotel De Neuville
=============================
I have a list of all hotel names (about 1000 rows). I also know how they should be grouped.
Any idea how to tune Levenshtein to make it more flexible for my situation?
$inserted = false;
foreach ($hotelList as $key => $value) {
    if (levenshtein($key, $hotelName, 2, 5, 1) <= abs(strlen($key) - strlen($hotelName))) {
        array_push($hotelList[$key], trim($line));
        $inserted = true;
    }
}
// if no match was found add another entry
if (!$inserted) {
    $hotelList[$hotelName] = array(
        trim($line)
    );
}
I'll wade in with my thoughts. Firstly, grouping or "clustering" data like this is a pretty big topic. I won't go into it in depth, but I can perhaps point things in a useful direction.
You did a brilliant thing by normalizing Levenshtein on the length of the strings compared. That's exactly right, because you avoid the problem that the length of the string would overdetermine the similarity in many cases.
But the algorithm didn't solve the problem. For a start, we want to compare words. "Bent Eastern French Hotels" is obviously very different from "Best Western French Hotels", yet it would score better than, say, "Best Western Paris Bed and Breakfasts". The intuition to grasp here is that your tokens shouldn't be characters but words.
I like @saury's answer, but I'm not sure about the assumption at the beginning. Instead, let's start with something nice and easy, often called "bag of words". We then implement a hashing trick, which would allow you to identify the key phrases based on the intuition that the least used words contain the most information.
If you subscribe to the idea that hotel brand names are near the beginning you could always skew on their proximity to the start of the string too. Thing is, your groups will as likely end up being "France" as "Best" / "Western" (but not "hotel"- why?).
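The first step of a bag-of-words approach, counting how many names each word occurs in, might look like the sketch below. It uses the hotel names from the question. The function name is made up for illustration, and the hashing and weighting steps are left out.

```php
<?php
// Count in how many hotel names each lower-cased word appears.
function wordFrequencies(array $names): array
{
    $counts = [];
    foreach ($names as $name) {
        $words = preg_split('/\s+/', strtolower(trim($name)), -1, PREG_SPLIT_NO_EMPTY);
        // array_unique: a word repeated inside one name still counts once.
        foreach (array_unique($words) as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent words first

    return $counts;
}

// Hotel names from the question.
$hotels = [
    'Best Western Bercy Rive Gauche',
    'Best Western Colisee',
    'Best Western Ducs De Bourgogne',
    'Best Western Folkestone Opera',
    'Best Western France Europe',
    'Best Western Hotel Sydney Opera',
    'Best Western Paris Louvre Opera',
    'Best Western Hotel De Neuville',
];

print_r(wordFrequencies($hotels));
```

In this list "best" and "western" occur in all eight names, "opera" in three, and "hotel" and "de" in two each, so a real run over the full 1000 names would be needed to see which words actually separate the groups.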
You want your results to be more accurate?
From here on in, we're going to have to step up to some serious algorithms; enjoy surfing the many Stack Overflow topics. My instinct is that many hotel names aren't branded at all, so you'll need different categories for them too. My instinct is also that the number of repeated words in hotel names is going to be relatively slim, while some words will be frequent members of hotel names. These facts would be problems for the above. In this case, there's a really popular (if cliched for SO) technique called k-means, a fun introduction to which would be to extend an algorithm like this (very bravely written in PHP) to take your chosen n keyphrases as the n dimensions of the cluster, then take the majority components of the cluster center-points as your categorization tags. (That would eliminate "France", say, because hits for "France" would be spread across the n-dimensional space pretty evenly.)
This is probably all a bit much to take on for something that would seem like a small problem- but I want to emphasize that if your data isn't structured, there really aren't any short-cuts to doing things properly.
What Levenshtein distance value do you take as the delta between words to be treated as part of the same group? It seems that you tend to group hotels based on the initial few words, and that will require a different approach altogether (such as a dictionary sort, then comparing the current string with the next strings, etc.). However, if your use case still requires calculating the Levenshtein distance, then I would suggest sorting the strings by length and then comparing each string with other strings of similar length (apply your own heuristic for what you consider "similar"; for example, isSimilar = Math.abs(str1.length - str2.length) < SOME_LOWEST_DELTA_VALUE or something like that).
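If grouping by the first few words really is what is wanted, a sketch like the following may be enough as a starting point. It assumes the brand is always the first two words, which will not hold for every name. The function name and the "Ibis Paris Bercy" entry are hypothetical and added only to show a second group.

```php
<?php
// Group names by their first $wordCount words (lower-cased), e.g. "best western".
function groupByLeadingWords(array $names, int $wordCount = 2): array
{
    $groups = [];
    foreach ($names as $name) {
        $words = preg_split('/\s+/', strtolower(trim($name)), -1, PREG_SPLIT_NO_EMPTY);
        $key = implode(' ', array_slice($words, 0, $wordCount));
        $groups[$key][] = trim($name);
    }
    ksort($groups); // dictionary order of the group keys

    return $groups;
}

$hotels = [
    'Best Western Colisee',
    'Best Western France Europe',
    'Ibis Paris Bercy', // hypothetical entry, not from the question
    'Best Western Hotel De Neuville',
];

print_r(groupByLeadingWords($hotels));
```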
You might want to read about http://en.wikipedia.org/wiki/K-means_clustering and http://en.wikipedia.org/wiki/Cluster_analysis in general.

Compute a percent with Sphinxsearch

With Sphinxsearch, how could I display a percent of the keywords matching the results?
For example, I have these two lines in my users table :
Paul Smith, Belgium
Maher AbouAbbas, Russian Federation
If the query is "Maher Russian Belgium", I want to display :
[33%] Paul Smith, Belgium (Belgium matches)
[66%] Maher AbouAbbas, Russian Federation (Maher and Russian matches)
A rudimentary example that first comes to mind is to simply return the results, then explode them and check each word for words in the query string (also exploded) to compute the percent of words found in the item that are also in the query string.
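A minimal sketch of that post-processing step in PHP is shown below. It works on rows already returned by the search, does not use any Sphinx API, and the function names are made up for illustration. The percentage is the share of query keywords found in the row, rounded down, which reproduces the 33% and 66% from the question.

```php
<?php
// Split text into lower-cased words, dropping punctuation.
function splitWords(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Percentage of distinct query keywords that occur in $item (rounded down).
function matchPercent(string $query, string $item): int
{
    $queryWords = array_unique(splitWords($query));
    if (count($queryWords) === 0) {
        return 0;
    }
    $matched = count(array_intersect($queryWords, splitWords($item)));

    return intdiv(100 * $matched, count($queryWords));
}

$query = 'Maher Russian Belgium';
foreach (['Paul Smith, Belgium', 'Maher AbouAbbas, Russian Federation'] as $row) {
    printf("[%d%%] %s\n", matchPercent($query, $row), $row);
}
```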
Have a look at this. Maybe that's what you are searching for.
You may want to look into the Levenshtein string distance algorithm, which is implemented natively in PHP: http://php.net/levenshtein

Name comparison algorithm

I need to check whether a name is on an anti-terrorism list.
In addition to the given name, I also want to search for similar names (possible aliases).
Example:
given name => Bin Laden alert!
given name => Ben Larden mhm.. suspicious name, matches at xx% with Bin Laden
How can I do this?
using PHP
names are 100% correct, since they are from official sources
I'm Italian, but I think this won't be a problem, since names are international
names can be composed of several words: Najmiddin Kamolitdinovich JALOLOV
looking for companies and people
I looked at different algorithms: do you think that Levenshtein can do the job?
thank you in advance!
ps i got some problems to format this text, sorry :-)
I'd say your best bets for getting this working with PHP's native functions are:
soundex() - Calculate the soundex key of a string
levenshtein() - Calculate Levenshtein distance between two strings
metaphone() - Calculate the metaphone key of a string
similar_text() - Calculate the similarity between two strings
Since you are likely matching the names against a database (?), you might also want to check whether your database provides any Name Matching Functions.
Google also provided a PDF with a nice overview on Name Matching Algorithms:
http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
The Levenshtein function (http://php.net/manual/en/function.levenshtein.php) can do this:
$string1 = 'Bin Laden';
$string2 = 'Ben Larden';
levenshtein($string1, $string2); // result: 2
Set a threshold on this result and determine if the name looks similar.
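One way to turn the distance into the "xx%" figure from the question is to normalize it by the length of the longer name. The sketch below does that. The 75% threshold and both function names are assumptions for illustration, and a real screening list would need a threshold tuned on real data.

```php
<?php
// Similarity in percent: 100 means identical, lower means more edits were needed.
function nameSimilarity(string $a, string $b): float
{
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 100.0;
    }

    return round(100 * ($longest - levenshtein($a, $b)) / $longest, 1);
}

// Return the listed names that reach the threshold, with their similarity.
function findSimilarNames(string $name, array $watchList, float $threshold = 75.0): array
{
    $hits = [];
    foreach ($watchList as $listed) {
        $similarity = nameSimilarity($name, $listed);
        if ($similarity >= $threshold) {
            $hits[$listed] = $similarity;
        }
    }

    return $hits;
}

// Names taken from the question.
$watchList = ['Bin Laden', 'Najmiddin Kamolitdinovich JALOLOV'];
print_r(findSimilarNames('Ben Larden', $watchList));
```

For "Ben Larden" against "Bin Laden" the distance is 2 and the longer name has 10 characters, so the similarity comes out at 80%.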
