I am developing a fuzzy search mechanism. I have car brands/models and cities in a database (MySQL), with English and Russian names - about 1000 items. Users can enter these words with mistakes or in translit. Right now I retrieve all of these words from the DB and compare each one in a loop with the word the user entered (using Levenshtein distance and other functions).
Is there any way to generate many forms of each word (car brands/models), including misspelled forms, so that I can retrieve the words directly from the DB (using the SQL LIKE operator)? For example: I have the car brand Toyota and I want to generate Tokota, Tobota, Toyoba, Tayota, Тойота, Токота, Тобота (Russian) - many, many forms of each word. Then the user can enter any of these words and I can tell that he means Toyota.
Well, there is a function called SOUNDEX in MySQL. I don't know if it is what you need.
For example:
SELECT SOUNDEX('Toyyota') = SOUNDEX('Toyota')
Here is what the MySQL documentation says:
Returns a soundex string from str. Two strings that sound almost the
same should have identical soundex strings. A standard soundex string
is four characters long, but the SOUNDEX() function returns an
arbitrarily long string. You can use SUBSTRING() on the result to get
a standard soundex string. All nonalphabetic characters in str are
ignored. All international alphabetic characters outside the A-Z range
are treated as vowels.
This function, as currently implemented, is intended to work well with
strings that are in the English language only. Strings in other
languages may not produce reliable results.
Reference: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
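To make the quoted description a bit more concrete, here is a minimal Python sketch of the classic four-character American Soundex. It is not MySQL's implementation (which, per the quote, returns arbitrarily long strings and treats letters outside A-Z as vowels); the function name `soundex` and the decision to simply skip non-ASCII letters are assumptions made for illustration.

```python
def soundex(word: str) -> str:
    """Classic 4-character Soundex; returns "" if no A-Z letters are present."""
    # Consonant groups and their digits; vowels, h, w and y have no code.
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit

    # Assumption: anything that is not an ASCII letter is dropped.
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != previous:
            result += code
        # h and w do not separate two consonants with the same code.
        if ch not in "hw":
            previous = code
    # Pad with zeros and cut to the standard length of four.
    return (result + "000")[:4]


print(soundex("Toyota"), soundex("Toyyota"), soundex("Tokota"))
```

In this sketch Toyota and Toyyota share a code while Tokota does not, and a Cyrillic spelling such as Тойота yields an empty string, which illustrates the English-only limitation mentioned in the quote.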
Related
I have a large number of Unicode strings that I want to store in a MySQL database. I also want to add an extra field that represents the character identity of each string. For example:
String                  key
--------------------    -----------
this is 1st string      113547858
this is first string    113547865
I go to school          524872354
As you may have noticed above, the first two keys are very close to each other, reflecting the similarity of the strings, whereas the third one is far from both.
I don't want to use PHP's similar_text or levenshtein, as they need two strings to check similarity. Instead I want to compute one value per string and store it in the DB, so that I can put an index on it for future use.
Could a simple sum of the character codes of all characters in the string be a solution?
Update:
A sum of hash values computed for each word of the string could also be a solution.
I have a student table named stu_table, and the student name field is stu_name.
The table contains many students, such as Mrinmoy, Minmoy, Minmay, Mrinmay, Tanmay, Rajesh, Susanta, Bireshwar, etc.
I would like to fetch the students whose names sound like Mrinmoy.
You could use MySQL SOUNDEX:
SELECT * FROM `stu_table` WHERE STRCMP(SOUNDEX(`stu_name`), SOUNDEX('Mrinmoy')) = 0
But I don't think it is very accurate and it's very limited.
Double Metaphone is a SOUNDEX-like hash algorithm for imprecise matching of Roman-alphabet, English-pronunciation proper-name text. It works tolerably well for other single words besides names.
The Double Metaphone hash algorithm generates either one or two hash values for a word. That's what makes it "double." For example, there's a village in Massachusetts USA called "Gill". It has the two metaphone hashes with values KL and JL, corresponding to two different pronunciations.
Now, if somebody hears the word "Jill" for that village's name, they'll ask for its metaphone hashes. They are JL and AL. To find this match, the double metaphone search must look at four possible matches:
Gill    Jill
KL      JL      mismatch
KL      AL      mismatch
JL      JL      match!
JL      AL      mismatch
Therefore, "Gill" and "Jill" are considered matching by double metaphone.
Many words only have one metaphone hash. Those are easier to match.
A MySQL stored function to generate the metaphone hashes can be found here.
http://www.atomodo.com/code/double-metaphone/
But beware: given a word with two metaphone hashes it returns them in one string separated by a semicolon.
Like the ancient and honorable SOUNDEX, Double Metaphone favors false positive matches rather than false negative. But it has better rates on both, mostly due to its double-hash capability.
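The four-way comparison described above can be sketched in Python as follows. This does not compute Double Metaphone itself; it only shows how two stored hash strings might be compared, assuming the "primary;secondary" semicolon format mentioned above. The helper names `split_codes` and `codes_match` are made up for this illustration.

```python
def split_codes(stored: str) -> set:
    """Split a 'primary;secondary' string into a set of non-empty codes."""
    return {code for code in stored.split(";") if code}


def codes_match(a: str, b: str) -> bool:
    """True if any code of `a` equals any code of `b`."""
    return bool(split_codes(a) & split_codes(b))


# Gill (KL, JL) against Jill (JL, AL), using the hashes quoted above.
print(codes_match("KL;JL", "JL;AL"))  # True
```

Using a set intersection covers the one-hash and two-hash cases with the same code, so words with a single hash need no special handling.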
MySQL has a SOUNDS LIKE operator; expr1 SOUNDS LIKE expr2 is the same as SOUNDEX(expr1) = SOUNDEX(expr2).
Take a look at it:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#operator_sounds-like
I am in the process of learning MySQL and querying, and right now working with PHP to begin with.
For learning purposes I chose a small anagram solver kind of project to begin with.
I found a very old English language word list on the internet freely available to use as the DB.
I tried plain queries, FIND_IN_SET and full-text search matching, but failed.
How can I:
Match a result letter by letter?
For example, let's say that I have the letters S-L-A-O-G to match against the database entry.
Since I have a large database which surely contains many words, I want to have in return of the query:
lag
goal
goals
slag
log
... and so on.
Without having any other results which might have a letter used twice.
How would I solve this with SQL?
Thank you very much for your time.
$str_search = 'SLAOG';
// SQL: keep only words made up entirely of the search letters
SELECT word
FROM table_name
WHERE word REGEXP '^[{$str_search}]+$' # '^[SLAOG]+$'
// Filter the results in PHP afterwards
// Loop START (runs once per result $row)
$keep = TRUE;
for ($i = 0; $i < strlen($row->word); $i++) {
    $h = substr($row->word, $i, 1); // current letter of the found word
    preg_match_all("/{$h}/i", $row->word, $arr_matches);
    preg_match_all("/{$h}/i", $str_search, $arr_matches2);
    if (count($arr_matches[0]) > count($arr_matches2[0])) {
        $keep = FALSE; // letter occurs more often than in the search string
        break;
    }
}
// Loop END
Basically, run a REGEXP with the given letters and then filter the result by how often each letter occurs in the word compared with the search string.
The REGEXP checks each word, from beginning to end, against the given letters. This may return more rows than you need, but it is a good first filter nonetheless.
The loop filters out words in which a letter is used more times than in the search string. For each letter of the found word I run preg_match_all() on both the found word and the search string to count the occurrences, and compare the counts with count().
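For comparison, the same per-letter count check can be sketched in Python with `collections.Counter`. The function name `can_build` and the small candidate list are assumptions for illustration; in practice the candidates would be the rows returned by the REGEXP query.

```python
from collections import Counter


def can_build(word: str, letters: str) -> bool:
    """True if `word` uses no letter more often than `letters` provides."""
    available = Counter(letters.lower())
    needed = Counter(word.lower())
    # A missing letter has count 0 in a Counter, so it fails the check too.
    return all(available[ch] >= count for ch, count in needed.items())


# Hypothetical rows that already passed (or would fail) the REGEXP filter.
candidates = ["lag", "goal", "goals", "slag", "log", "gaga", "solo", "dog"]
matches = [w for w in candidates if can_build(w, "SLAOG")]
print(matches)  # ['lag', 'goal', 'goals', 'slag', 'log']
```

Counting each string once keeps the check linear in the word length, instead of running one regular expression per letter.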
If you want a quick and dirty solution....
Split the word you're trying to get anagrams for into individual letters. Assign each letter an individual prime number value, and multiply them all together; eg:
C - 2
A - 3
T - 5
For a total of 30
Then step through your dictionary list, and do the same operation on each word in that. If your target word's value is divisible exactly by the dictionary word's value, then you know that the dictionary word has only letters that occur in your target word.
You can speed it up by pre-calculating the dictionary values, and then querying for just the right values:
SELECT * FROM dictionary WHERE ($searchWordTotal % wordTotal) = 0
(searchWordTotal is the total for the word you're looking for, and wordTotal is the one from the database)
I should get around to writing this properly one of these days....
Since you only want words made of the given letters, and no others, but you don't need to use all the letters, I suggest logic like this:
* take your candidate word,
* for each letter in your match set, do a string replace of the first occurrence of that letter,
* replacing it with an empty string,
* then finally wrap all that in a string-length check to see whether any characters are left.
You can do all that in SQL, but a little procedure will probably look more familiar to most coders.
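The steps in the list above can be sketched as a small Python procedure. The names `leftover_after_removal` and `uses_only` are hypothetical; the idea is only to remove one occurrence of each available letter and see whether anything remains.

```python
def leftover_after_removal(candidate: str, letters: str) -> str:
    """Remove the first occurrence of each available letter from the candidate."""
    remaining = candidate.lower()
    for ch in letters.lower():
        # Replace at most one occurrence, so repeated letters are not all consumed.
        remaining = remaining.replace(ch, "", 1)
    return remaining


def uses_only(candidate: str, letters: str) -> bool:
    """True if nothing is left over, i.e. the candidate fits in the letter set."""
    return len(leftover_after_removal(candidate, letters)) == 0


print(uses_only("goals", "SLAOG"))  # True
print(uses_only("gaga", "SLAOG"))   # False, "ga" is left over
```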
So I have a database of words between 3 and 20 characters long. I want to code something in PHP that finds all of the smaller words that are contained within a larger word. For example, in the word "inward" there are the words "rain", "win", "rid", etc.
At first I thought about adding a field to the Words tables (Words3 through Words20, denoting the number of letters in the words), something like "LetterCount". For example, "rally" would be represented as 10000000000200000100000010: 1 instance of the letter A, 0 instances of the letter B, ... 2 instances of the letter L, etc. I would then go through all the words in each table (or one table, if the target length of found words was specified) and compare the LetterCount of each word to the LetterCount of the source word ("inward" in the example above).
But then I started thinking that that would place too much of a load on the MySQL database as well as the PHP script, calling each and every word's LetterCount, comparing each and every digit to that of the source word, etc.
Is there an easier, perhaps more intuitive way of doing this? I'm open to using stored procedures if it will help with overhead in any way. Just some suggestions would be greatly appreciated. Thanks!
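For reference, the LetterCount idea from the question can be sketched in Python as below. The names `letter_count` and `contains` are made up for this illustration, and the sketch assumes that no letter occurs ten or more times in a word, so that each count fits in a single digit.

```python
import string


def letter_count(word: str) -> str:
    """26-digit string: how often each letter a..z occurs in the word."""
    lower = word.lower()
    # Assumption: every count is below 10, so one digit per letter is enough.
    return "".join(str(lower.count(ch)) for ch in string.ascii_lowercase)


def contains(source: str, candidate: str) -> bool:
    """True if the candidate needs no letter more often than the source has it."""
    pairs = zip(letter_count(candidate), letter_count(source))
    return all(int(need) <= int(have) for need, have in pairs)


print(letter_count("rally"))       # 10000000000200000100000010
print(contains("inward", "rain"))  # True
```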
Here is a simple solution that should be pretty efficient, but it will only work up to a certain word length. It will probably break down at about 15-20 characters, depending on whether the word is made up of high-frequency letters, which get small primes, or low-frequency letters, which get large ones:
Assign each letter a prime number according to its frequency, so e = 2, t = 3, a = 5, etc., using frequency values from here or some similar source.
Precalculate the value of each word in your word list by multiplying the prime values for the letters in the word, and store in the table in a bigint data type column. For instance, tea would have a value of 3*2*5=30. If a word has repeated letters, repeat the factor, so that teat should have a value of 3*2*5*3=90.
When checking if a word, such as rain, is contained inside of another word, such as inward, it's sufficient to check if the value for rain divides the value for inward. In this case, inward = 14213045, rain = 7315, and 14213045 is divisible by 7315, so the word rain is inside the word inward.
A bigint column maxes out at 9223372036854775807, which should be fine up to about 15-20 characters (depending on the frequencies of the letters in the word). For instance, I picked up the first 20-letter word from here, which is antiinstitutionalism, and it has a value of 6901041299724096525, which just barely fits inside the bigint column. However, the 14-letter word xylopyrography has a value of 635285791503081662905, which is too big. You might have to handle the really large ones as special cases using an alternate method, but hopefully there are few enough of them that it would still be relatively efficient.
The query would work something like the demo I've prepared here: http://www.sqlfiddle.com/#!2/9bd27/8
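The divisibility check can also be sketched outside the database, here in Python. The letter order in `LETTER_ORDER` is an assumption: the frequency table used by the answer is not reproduced in the text, so this order was picked so that the values quoted above for tea, rain, inward and antiinstitutionalism come out the same. Other words may get different values than in the answer's table. All function names are hypothetical.

```python
# Assumed frequency order, most frequent letter first (not the answer's source table).
LETTER_ORDER = "etainosrldhcumfpgywbvkxqjz"
BIGINT_MAX = 2**63 - 1  # 9223372036854775807, the signed BIGINT limit


def first_primes(n: int) -> list:
    """Return the first n primes by trial division against earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


PRIME_OF = dict(zip(LETTER_ORDER, first_primes(len(LETTER_ORDER))))


def word_value(word: str) -> int:
    """Product of the primes of all letters; repeated letters repeat the factor."""
    value = 1
    for ch in word.lower():
        value *= PRIME_OF[ch]
    return value


def is_contained(small: str, large: str) -> bool:
    """True if every letter of `small` (with multiplicity) occurs in `large`."""
    return word_value(large) % word_value(small) == 0


def fits_bigint(word: str) -> bool:
    """True if the word's value could be stored in a signed BIGINT column."""
    return word_value(word) <= BIGINT_MAX


print(word_value("rain"), word_value("inward"), is_contained("rain", "inward"))
```

Python integers do not overflow, so `fits_bigint` makes the column limit explicit; words that fail it are the ones that would need the special-case handling mentioned above.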
I need to check whether a name is on an anti-terrorism list.
In addition to the given name, I also want to search for similar names (possible aliases).
Example:
given name => Bin Laden: alert!
given name => Ben Larden: hmm... suspicious name, matches at xx% with Bin Laden
How can I do this?
I'm using PHP
the names on the list are 100% correct, since they come from official sources
I'm Italian, but I think this won't be a problem, since the names are international
names can be composed of several words: Najmiddin Kamolitdinovich JALOLOV
I'm looking for both companies and people
I looked at different algorithms: do you think that Levenshtein can do the job?
thank you in advance!
ps i got some problems to format this text, sorry :-)
I'd say your best bets to get this working with PHP's native functions are
soundex() - Calculate the soundex key of a string
levenshtein() - Calculate Levenshtein distance between two strings
metaphone() - Calculate the metaphone key of a string
similar_text() - Calculate the similarity between two strings
Since you are likely matching the names against a database (?), you might also want to check whether your database provides any Name Matching Functions.
A Google search also turned up a PDF with a nice overview of name matching algorithms:
http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
The Levenshtein function (http://php.net/manual/en/function.levenshtein.php) can do this:
$string1 = 'Bin Laden';
$string2 = 'Ben Larden';
levenshtein($string1, $string2); // result: 2
Set a threshold on this result and determine if the name looks similar.
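As a rough sketch of the thresholding step, here is the same idea in Python, with the edit distance written out as a plain dynamic-programming function. The helper names `similarity` and `is_suspicious` and the default `max_distance=2` are assumptions for illustration; a real threshold would have to be tuned on actual names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Share of matching characters, relative to the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_suspicious(name: str, listed: str, max_distance: int = 2) -> bool:
    """True if the name is within max_distance edits of the listed name."""
    return levenshtein(name.lower(), listed.lower()) <= max_distance


print(levenshtein("Bin Laden", "Ben Larden"))  # 2
print(is_suspicious("Ben Larden", "Bin Laden"))
```

The `similarity` ratio gives the "matches at xx%" figure from the question, while the integer distance avoids floating-point comparisons at the threshold.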