Name comparison algorithm - php

I need to check whether a name appears on an anti-terrorism list.
In addition to the given name, I also want to search for similar names (possible aliases).
Example:
given name => Bin Laden alert!
given name => Ben Larden mhm.. suspicious name, matches at xx% with Bin Laden
How can I do this?
I'm using PHP
the names on the list are 100% correct, since they come from official sources
I'm Italian, but I think this won't be a problem, since the names are international
names can be composed of several words: Najmiddin Kamolitdinovich JALOLOV
I'm looking for both companies and people
I looked at different algorithms: do you think that Levenshtein can do the job?
thank you in advance!
ps i got some problems to format this text, sorry :-)

I'd say your best bet to get this working with PHP's native functions is one of these:
soundex() - Calculate the soundex key of a string
levenshtein() - Calculate Levenshtein distance between two strings
metaphone() - Calculate the metaphone key of a string
similar_text() - Calculate the similarity between two strings
Since you are likely matching the names against a database (?), you might also want to check whether your database provides any Name Matching Functions.
A Google search also turned up a PDF with a nice overview of name matching algorithms:
http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
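To get a feel for how the four native functions listed above behave on a pair of names, a small side-by-side comparison can help. This is only a sketch: compareNames() is a hypothetical helper name, and lowercasing the input is an assumption about how you want to normalize names.

```php
<?php
// Compare two names with PHP's four built-in string-similarity helpers.
// compareNames() is a hypothetical helper, not part of PHP itself.
function compareNames(string $a, string $b): array
{
    // Assumption: case and surrounding whitespace should not matter.
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));

    // similar_text() returns the number of matching characters and
    // fills $percent (passed by reference) with a 0..100 score.
    similar_text($a, $b, $percent);

    return [
        'levenshtein'     => levenshtein($a, $b),           // edit distance
        'similar_percent' => $percent,                       // 0..100
        'soundex_match'   => soundex($a) === soundex($b),    // phonetic key
        'metaphone_match' => metaphone($a) === metaphone($b) // phonetic key
    ];
}

print_r(compareNames('Bin Laden', 'Ben Larden'));
```

Keep in mind that soundex() and metaphone() were designed around English pronunciation, so they may be less reliable for international names.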

The Levenshtein function (http://php.net/manual/en/function.levenshtein.php) can do this:
$string1 = 'Bin Laden';
$string2 = 'Ben Larden';
levenshtein($string1, $string2); // result: 2
Set a threshold on this result and determine if the name looks similar.
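One possible way to turn the raw distance into the "matches at xx%" figure from the question is to scale it by the length of the longer name. The sketch below assumes this scaling; the function names and the 75% default threshold are hypothetical, so tune them against your real list.

```php
<?php
// Convert a Levenshtein distance into a 0..100 similarity score.
// Note: strlen() counts bytes, so multibyte (non-ASCII) names would
// need extra handling that this sketch does not attempt.
function nameSimilarityPercent(string $a, string $b): float
{
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are identical
    }
    return 100.0 * ($maxLen - levenshtein($a, $b)) / $maxLen;
}

// Return every listed name whose score reaches the threshold,
// best match first. The threshold value is an assumption.
function findSuspiciousMatches(string $name, array $watchlist, float $threshold = 75.0): array
{
    $hits = [];
    foreach ($watchlist as $listed) {
        $score = nameSimilarityPercent($name, $listed);
        if ($score >= $threshold) {
            $hits[$listed] = $score;
        }
    }
    arsort($hits); // highest score first, keys preserved
    return $hits;
}

print_r(findSuspiciousMatches('Ben Larden', ['Bin Laden', 'Najmiddin Kamolitdinovich JALOLOV']));
```

For multi-word names you may get better results by comparing word by word, since a single missing middle name can dominate the distance.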

Related

Full text document similarity search

I have a big database of articles. Before adding new items to the DB, I'd like to check whether similar items already exist and, if so, group them together, so that later I can easily display them as a group of similar items.
Currently we use PHP's similar_text() function, which is very simple but surprisingly precise and fully satisfies our needs. The problem is that before we add an item to the DB, we first need to pull X items from the DB and loop through every single one to check whether the new item is at least 75% similar to any of them, so that we can group them together. This uses a lot of resources and time that we don't really have.
We use MySQL and Solr for all our queries. I've tried MySQL Full-Text Search and Solr's More Like This. Compared to PHP's implementation they are super fast and efficient, but I just can't get a robust percentage score like the one PHP's similar_text() provides. It is crucial for our grouping to be accurate.
For example using this MySQL query:
SELECT id, body, ROUND(((MATCH(body) AGAINST ('ARTICLE TEXT')) / scores.max_score) * 100) as relevance
FROM natural_text_test,
(SELECT MAX(MATCH(body) AGAINST('ARTICLE TEXT')) as max_score FROM natural_text_test LIMIT 1) scores
HAVING relevance > 75
ORDER BY relevance DESC
I get that an article with 130 words is 85% similar to another article with 4700 words. In comparison, PHP's similar_text() returns a similarity score of only 3%, which is well below our threshold and is correct in our case.
I've also looked into the Levenshtein distance algorithm, but it seems to have the same problem as MySQL and Solr.
There has to be a better way to handle similarity checks. Maybe I'm using the algorithms incorrectly?
Based on some of the Comments, I might propose this...
It seems that 75%-similar documents would have a lot of the same sentences in the same order.
Break the doc into sentences
Take a crude hash of each sentence and map it to a visible ASCII character. This gives you a string that is, perhaps, 1/100th the size of the original doc.
Store that with the doc.
When searching, use levenshtein() on this string to find 'similar' documents.
Sure, hashing is imperfect, etc. But this is fast. And you could apply some other technique to double-check the few docs that are close.
For a hash, I might do
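$md5 = md5($sentence);
// take the first two hex digits (8 bits) and keep the low 6 bits: 0..63
$x = hexdec(substr($md5, 0, 2)) & 0x3F;
// map 0..63 onto the printable ASCII range '0'..'o'
$hash = chr(ord('0') + $x);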
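Putting the steps above together, a minimal sketch might look like the following. The sentence splitter (a regex on ., ! and ?) and both function names are assumptions for illustration; real articles will likely need a more careful splitter.

```php
<?php
// Build a short "signature" for a document: one printable character
// per sentence, derived from a crude 6-bit hash of that sentence.
function sentenceSignature(string $doc): string
{
    // Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($doc), -1, PREG_SPLIT_NO_EMPTY);
    $signature = '';
    foreach ($sentences as $sentence) {
        $normalized = strtolower(trim($sentence));
        $x = hexdec(substr(md5($normalized), 0, 2)) & 0x3F; // 0..63
        $signature .= chr(ord('0') + $x);                    // '0'..'o'
    }
    return $signature;
}

// Similarity of two signatures as a 0..100 score based on edit distance.
function signatureSimilarity(string $sigA, string $sigB): float
{
    $maxLen = max(strlen($sigA), strlen($sigB));
    if ($maxLen === 0) {
        return 100.0;
    }
    return 100.0 * ($maxLen - levenshtein($sigA, $sigB)) / $maxLen;
}

$old = 'The cat sat. It was warm. The dog barked. Nobody cared.';
$new = 'The cat sat. It was warm. The dog barked.';
echo signatureSimilarity(sentenceSignature($old), sentenceSignature($new)), "\n";
```

Store the signature in its own column when the article is saved, so only the short strings need to be compared at insert time. Note that PHP versions before 8.0 reject levenshtein() arguments longer than 255 characters, which would cap this approach at 255 sentences per document.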

Generate words (car brands/models) with mistakes

I am developing a fuzzy search mechanism. I have car brands/models and cities in a database (MySQL), with English and Russian names, about 1000 items in total. Users can enter these words with mistakes or in translit. Right now I retrieve all these words from the DB and compare each one in a loop with the word the user entered (using Levenshtein distance and other functions).
Is there a way to generate many forms of each word (car brands/models), including misspelled ones, so that I can retrieve matches from the DB with the SQL LIKE operator? For example, for the car brand Toyota I want to generate Tokota, Tobota, Toyoba, Tayota, Тойота, Токота, Тобота (Russian), and many more forms of each word. Then the user can enter any of these words and I can tell that he means Toyota.
Well, there is a function called SOUNDEX in MySQL. I don't know if it is what you need.
For example:
SELECT SOUNDEX('Toyyota') = SOUNDEX('Toyota');
From the MySQL documentation:
Returns a soundex string from str. Two strings that sound almost the
same should have identical soundex strings. A standard soundex string
is four characters long, but the SOUNDEX() function returns an
arbitrarily long string. You can use SUBSTRING() on the result to get
a standard soundex string. All nonalphabetic characters in str are
ignored. All international alphabetic characters outside the A-Z range
are treated as vowels.
This function, as currently implemented, is intended to work well with
strings that are in the English language only. Strings in other
languages may not produce reliable results.
Reference: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
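The same idea can be sketched in PHP with its own soundex() function: compute the key once per brand, then look up the key of whatever the user typed. The helper names are hypothetical, and, as the quoted documentation warns for MySQL, this is only expected to work for Latin-alphabet spellings; Cyrillic input would need a separate step (for example transliteration) that is not shown here.

```php
<?php
// Build a lookup table: soundex key => list of known words with that key.
function buildSoundexIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

// Return all known words that sound like the user's input.
function lookupBySound(array $index, string $input): array
{
    return $index[soundex($input)] ?? [];
}

$index = buildSoundexIndex(['Toyota', 'Honda', 'Ford']);
print_r(lookupBySound($index, 'Toyyota')); // misspelled input
```

Soundex only tolerates mistakes that keep the consonant pattern, so a form like Tokota would not be found this way; a Levenshtein pass over the roughly 1000 items could serve as a fallback.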

MySql Php Find Similar Values

I run a website where users have a username. They can change their usernames whenever they want. When they change their name, we check that the new name isn't currently in use and then allow or reject the change. On our site people often like to change their username to mimic other people's (making their name very similar in order to confuse others about their identity). This isn't uncommon for the type of site we run.
Is there a way to easily check for usernames that are somewhat similar using a simple query?
Here are some examples of usernames that we would like to have a query match up.
testingman1 = testingman11
lionhead = Iionhead (one has an l and the other has a capital i)
sleepybears = sleeepybears
Is there any way to count, character by character, the letters that are the same in the same position, and then decide based on the percentage whether a name is a copy of another user's?
I know I'll most likely have to write a custom function, but I'm looking for advice on how to make it as painless as possible and not too taxing on the system.
You can use
levenshtein(str1, str2), which returns an integer that is the distance between the two strings.
In PHP versions before 8.0, if one string is longer than 255 characters the function returns -1.
More info: http://php.net/manual/en/function.levenshtein.php
Or, if you want a percentage, you can use similar_text(string $first, string $second[, float &$percent]),
which passes back the percentage of similarity in the third parameter.
More info: http://www.php.net/manual/en/function.similar-text.php
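As a rough sketch of how this could be wired into the rename check: compare the requested name against existing names and reject it if the edit distance is very small. The function names and the default limit of 1 edit are assumptions, and on a large user table you would probably narrow the candidate list in SQL first (for example by length or prefix).

```php
<?php
// True when two usernames differ by at most $maxDistance single-character
// edits (insert, delete or substitute), ignoring case.
function isLookalike(string $candidate, string $existing, int $maxDistance = 1): bool
{
    return levenshtein(strtolower($candidate), strtolower($existing)) <= $maxDistance;
}

// Return the existing usernames that the candidate is too close to.
function findLookalikes(string $candidate, array $existingNames, int $maxDistance = 1): array
{
    return array_values(array_filter(
        $existingNames,
        fn(string $name): bool => isLookalike($candidate, $name, $maxDistance)
    ));
}

$taken = ['testingman1', 'lionhead', 'sleepybears', 'someoneelse'];
print_r(findLookalikes('Iionhead', $taken));     // capital I instead of l
print_r(findLookalikes('sleeepybears', $taken)); // one extra e
```

All three example pairs from the question are one edit apart, so a limit of 1 catches them; a larger limit catches more imitations but also blocks more legitimate short names.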

Convert numbers to another one in the same range, and back

I'm looking for a solution to convert all numbers in a given range to another number in the same range, and later convert that number back.
More concretely, let's say I have the numbers 1..100.
The easiest way to convert every number to another one in the same range is to use b = 101 - a; and later get the original back with a = 101 - b;.
My problem is that I want to simulate some randomness.
I want to implement this in PHP, but the coding language doesn't matter.
WHY?
You may ask why. Good question :)
I am generating easy-to-read short code strings based on IDs, and because the IDs are incremented one by one, my consecutive short codes are too similar.
Later I need to "decode" the short codes, to get the id.
What my algorithm is doing now is:
0000001 -> ababac, 0000002 -> ababad, 0000003 -> ababaf, etc.
later
ababac -> 0000001, ababad -> 0000002, ababaf -> 0000003, etc.
So before I actually generate the short code I want to "randomize" the number as much as possible.
Option 1:
Why don't you just have a database of conversions? I.e. each record has a "real" id and a "random md5" string or something.
Option 2:
Use a rainbow table, maybe even an MD5 lookup table for the range 0 - 10,000 or whatever. Then just do a hashtable lookup.
Finally I found a solution based on the modulo operator on the math forum.
The solution can be found here:
https://math.stackexchange.com/questions/259891/function-to-convert-each-number-in-a-m-n-to-another-number-in-the-same-range
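For illustration, one common modulo-based technique is an affine map: multiply by a constant that shares no factor with the range size, add an offset, and reduce modulo the range size. This is a sketch of that general idea and not necessarily the exact formula from the linked answer; the function names, the multiplier 37 and the offset 11 are arbitrary assumptions, and the range is taken as 0..n-1.

```php
<?php
// Map a in 0..n-1 to another number in 0..n-1. This is reversible as long
// as $multiplier and $n have no common factor (gcd = 1).
function scramble(int $a, int $multiplier, int $offset, int $n): int
{
    return ($a * $multiplier + $offset) % $n;
}

// Find x with (k * x) % n == 1 by brute force; fine for small ranges.
function modInverse(int $k, int $n): int
{
    for ($x = 1; $x < $n; $x++) {
        if (($k * $x) % $n === 1) {
            return $x;
        }
    }
    throw new InvalidArgumentException("$k has no inverse modulo $n");
}

// Undo scramble(): subtract the offset, then multiply by the inverse.
function unscramble(int $b, int $multiplier, int $offset, int $n): int
{
    $shifted = (($b - $offset) % $n + $n) % $n; // keep the value non-negative
    return ($shifted * modInverse($multiplier, $n)) % $n;
}

$n = 100;
foreach ([1, 2, 3] as $id) {
    $code = scramble($id, 37, 11, $n);
    echo $id, ' -> ', $code, ' -> ', unscramble($code, 37, 11, $n), "\n";
}
```

Consecutive ids land far apart, which is enough to make the short codes look unrelated, but the mapping is easy to reverse for anyone who guesses the scheme, so it should not be treated as a security measure.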

Select alphabetically "nearest" option from a dropdown

I have a list of words in a dropdown, and I have a single word that needs a suitable partner from that list (the user chooses it).
To make this easier for the user (because the list can be very long and the process has to be fast), I want to preselect a likely option.
I have already looked at how I can change the selected word.
I want to find the alphabetically "nearest" option, but I have no idea how to find out which word is the nearest neighbour.
I have already googled with every term I could think of, but I couldn't find anything.
Does someone have an idea how I can do it?
The Levenshtein function will compute the 'closeness' of two strings. You could rank the words you have relative to the user's string and return the string with the lowest value.
Have a look at this library. It contains fuzzy string matching functions for JavaScript, including stemming, Levenshtein distance and metaphones: http://code.google.com/p/yeti-witch/
If by alphabetically you mean matching letters read from the left, the answer is easy. Simply go through every letter of the word and compare it with the ones in the select dropdown. The word that shares the longest starting substring is your "nearest".
The simplest (and probably fastest) thing in JavaScript is finding (by binary search) where the word would go in a sorted array of your option words, using the < and > string operators.
For more advanced and precise results, use Levenshtein distance.
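The binary-search and longest-shared-prefix suggestions above can be combined. The sketch below is written in PHP to match the other examples on this page, but the logic ports directly to JavaScript; the function names are hypothetical, and it assumes the word and the options use the same letter case.

```php
<?php
// Number of leading characters two strings have in common.
function commonPrefixLength(string $a, string $b): int
{
    $limit = min(strlen($a), strlen($b));
    $i = 0;
    while ($i < $limit && $a[$i] === $b[$i]) {
        $i++;
    }
    return $i;
}

// Find the option alphabetically nearest to $word: binary-search the
// insertion point in the sorted list, then pick whichever neighbour
// shares the longer prefix with $word.
function nearestOption(string $word, array $options): ?string
{
    if ($options === []) {
        return null;
    }
    sort($options, SORT_STRING); // sorts the local copy only

    $lo = 0;
    $hi = count($options);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if (strcmp($options[$mid], $word) < 0) {
            $lo = $mid + 1;
        } else {
            $hi = $mid;
        }
    }

    $before = $lo > 0 ? $options[$lo - 1] : null;
    $after  = $lo < count($options) ? $options[$lo] : null;
    if ($before === null) {
        return $after;
    }
    if ($after === null) {
        return $before;
    }
    return commonPrefixLength($word, $after) >= commonPrefixLength($word, $before)
        ? $after
        : $before;
}

echo nearestOption('cat', ['banana', 'apple', 'cherry']), "\n";
```

If the dropdown is already sorted, the sort() call can be dropped and the lookup becomes a plain binary search.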
