Compute a percent with Sphinxsearch - php

With Sphinxsearch, how could I display a percent of the keywords matching the results?
For example, I have these two lines in my users table :
Paul Smith, Belgium
Maher AbouAbbas, Russian Federation
If the query is "Maher Russian Belgium", I want to display :
[33%] Paul Smith, Belgium (Belgium matches)
[66%] Maher AbouAbbas, Russian Federation (Maher and Russian matches)

A rudimentary approach that first comes to mind is to simply return the results, then explode each result into words and check the words of the query string (also exploded) against them, computing the percentage of query words that are found in the item.
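A minimal sketch of that idea, done in PHP after the search results come back. The function names `tokenize` and `matchPercent` are made up for illustration, the tokenizer only handles ASCII letters and digits, and the percentage is rounded down so that two of three keywords gives 66 as in the question.

```php
<?php
// Split a string into lowercase words, dropping punctuation and empty pieces.
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Percentage (rounded down) of distinct query keywords that occur in $item.
function matchPercent(string $query, string $item): int
{
    $keywords = array_unique(tokenize($query));
    if (count($keywords) === 0) {
        return 0;
    }
    $matched = array_intersect($keywords, tokenize($item));
    return intdiv(100 * count($matched), count($keywords));
}

$query = 'Maher Russian Belgium';
foreach (['Paul Smith, Belgium', 'Maher AbouAbbas, Russian Federation'] as $row) {
    printf("[%d%%] %s\n", matchPercent($query, $row), $row);
}
```

This only counts exact whole-word matches; stemming or partial matches that the search engine itself applies are not reflected in the number.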

Have a look at this. Maybe that's what you are searching for.

You may want to look into the Levenshtein string distance algorithm, which is implemented natively in PHP: http://php.net/levenshtein

Related

Determine Country from Telephone Numbers

I have seen a few questions on SO similar to what I require, but nothing seems to fit the bill.
I am in the position where I need to deal with a call record and determine the country using the phone number. The number dialed can be any country for example:
44 7899455120 - UK
34 965791845 - Spain
355 788415235 - Albania
Obviously the world would be great if all calling codes were two digits, but this is not the case. Currently I have a database full of countries and their relevant codes, and in order to match I effectively need to take the first digit of the number, i.e. 4 for the UK example, and query the database, e.g.:
SELECT * from countries WHERE code LIKE '4%'
This may give me for example 20 results. So I loop again and do say
SELECT * from countries WHERE code LIKE '44%'
This may give me say one result, now I can determine it is UK. Some codes however like Albania are three digits and require more loops and database calls. This seems quite rudimentary and inefficient but as is I cannot think of another way to achieve this. I realise three calls to a database may not seem like much but if you have 1000 calls to deal with they soon add up.
Looking at the following question:
What regular expression will match valid international phone numbers?
There seems to be some great information on validating a number against country codes, but not so much on determining the country code from a number. Any advice or suggestions on a cleaner method would be much appreciated.
Spaces in the phone numbers are shown for clarity only.
A library exists that will parse a string of digits and reformat it to international standards (a number like 4402081231234 to '+44 20 8123 1234'). It will also return the Phone Number region, 'GB' or 'US' from a number, if there is the country code embedded in the number.
https://github.com/googlei18n/libphonenumber The original library is in Java, but there are also versions in Javascript, Python, Ruby and PHP, among others.
There is no overlap ambiguity in the country codes. Meaning: the country code 11 is illegal because 1 is assigned to North America. Similarly, 20 is Egypt and there are no other country codes that start with 20. And the country codes that start with 21 are all 3 digits.
Since there is no overlap ambiguity, you can directly search for the country code in one query for the phone number 12125551212 like this:
select country
, code
from countrycodes
where code in ('121', '12', '1')
Again, there are no country codes 121 or 12, so the only value that will match is 1.
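The list of candidates for the IN clause can be built from the number itself. This is a rough sketch, not part of the original answer: `countryCodeCandidates` is a hypothetical helper, and the limit of three digits assumes that no calling code is longer than that.

```php
<?php
// Return every leading-digit prefix of $number that could be a country code,
// longest first. Calling codes are assumed to be at most three digits long.
function countryCodeCandidates(string $number, int $maxLength = 3): array
{
    $digits = preg_replace('/\D+/', '', $number);
    $candidates = [];
    for ($len = min($maxLength, strlen($digits)); $len >= 1; $len--) {
        $candidates[] = substr($digits, 0, $len);
    }
    return $candidates;
}

$candidates = countryCodeCandidates('12125551212');
// One "?" placeholder per candidate, for use in a prepared statement.
$placeholders = implode(', ', array_fill(0, count($candidates), '?'));
$sql = "select country, code from countrycodes where code in ($placeholders)";
echo $sql, "\n";
```

The candidates would then be bound to the placeholders when the statement is executed, so each phone number costs a single query.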
Assuming the phone will always look like that:
$phone = "355 788415235"; // Albania
$parts = explode(" ", $phone);
$code = $parts[0]; // First part separated by space, 355.
Then query by that directly. No regular expression needed.
If that's not the case, consider separating the country code from the number on the input level.
On your system, every phone number has white space after the country code, so you can use it to determine the country.
Create a table that holds all country codes, like:
id | country | code
1 | Turkey | 90
2 | Spain | 34
(There is a table for you: http://erikastokes.com/mysql-help/country.sql.txt )
Then explode your phone number. The delimiter is a white space " ".
$phoneNumber = "355 788415235";
$parts = explode(" ", $phoneNumber); // splits the phone number into two parts
$countryCode = $parts[0]; // 355; index 0 because the country code is the first part
// Now you can look up the country by its country code.
// If the number comes from user input, bind $countryCode as a query parameter instead of interpolating it.
$sqlQuery = "SELECT country FROM yourTableName WHERE code = '$countryCode'";
...
// This works like a charm; I am currently using it myself.

Fetching similar sounding names from a table

I have a student table named stu_table, and the student name field is stu_name.
This table contains many students, such as Mrinmoy, Minmoy, Minmay, Mrinmay, Tanmay, Rajesh, Susanta, Bireshwar, etc.
I would like to fetch those students whose names sound like Mrinmoy.
You could use MySQL SOUNDEX:
SELECT * FROM `stu_table` WHERE STRCMP(SOUNDEX(`stu_name`), SOUNDEX('Mrinmoy')) = 0
But I don't think it is very accurate, and it is quite limited.
SQLFIDDLE
Double Metaphone is a SOUNDEX-like hash algorithm for imprecise matching of Roman-alphabet, English-pronunciation proper-name text. It works tolerably well for other single words besides names.
The Double Metaphone hash algorithm generates either one or two hash values for a word. That's what makes it "double." For example, there's a village in Massachusetts USA called "Gill". It has the two metaphone hashes with values KL and JL, corresponding to two different pronunciations.
Now, if somebody hears the word "Jill" for that village's name, they'll ask for its metaphone hashes. They are JL and AL. To find this match, the double metaphone search must look at four possible matches:
Gill   Jill
KL     JL     mismatch
KL     AL     mismatch
JL     JL     match!
JL     AL     mismatch
Therefore, "Gill" and "Jill" are considered matching by double metaphone.
Many words only have one metaphone hash. Those are easier to match.
A MySQL stored function to generate the metaphone hashes can be found here.
http://www.atomodo.com/code/double-metaphone/
But beware: given a word with two metaphone hashes it returns them in one string separated by a semicolon.
Like the ancient and honorable SOUNDEX, Double Metaphone favors false positive matches rather than false negative. But it has better rates on both, mostly due to its double-hash capability.
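The comparison of every hash of one word against every hash of the other can be sketched in PHP as below. This assumes the hashes have already been computed and are stored as semicolon-separated strings, as the stored function above is said to return them; `splitHashes` and `hashesMatch` are hypothetical helper names, and the hash values are the ones from the Gill/Jill example.

```php
<?php
// Split a stored hash string such as "KL;JL" into its one or two hashes.
function splitHashes(string $stored): array
{
    return array_values(array_filter(explode(';', $stored), 'strlen'));
}

// Two words match if any hash of one equals any hash of the other.
function hashesMatch(string $storedA, string $storedB): bool
{
    return count(array_intersect(splitHashes($storedA), splitHashes($storedB))) > 0;
}

var_dump(hashesMatch('KL;JL', 'JL;AL')); // Gill vs Jill
```

A word with a single hash is simply the one-element case of the same comparison.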
MySQL has a SOUNDS LIKE operator.
Take a look at it:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#operator_sounds-like

Generate words (car brands/models) with mistakes

I am developing a fuzzy search mechanism. I have car brands/models and cities in a MySQL database (English and Russian names), about 1000 items. Users can enter these words with mistakes or in translit. Right now I retrieve all these words from the db and compare each word in a loop with the word the user entered (using Levenshtein distance and other functions).
Is there any way to generate many forms of each word (car brands/models), including words with mistakes? I want to retrieve these words from the db using the SQL LIKE operator. For example, I have the car brand Toyota and I want to generate Tokota, Tobota, Toyoba, Tayota, Тойота, Токота, Тобота (Russian): many, many forms of each word. The user can then enter any of these words and I can find that Toyota is what he means.
Well, there is a function called SOUNDEX in MySQL. I don't know if it is what you need.
For example:
SELECT SOUNDEX('Toyyota') = SOUNDEX('Toyota')
Here is what the MySQL documentation says:
Returns a soundex string from str. Two strings that sound almost the
same should have identical soundex strings. A standard soundex string
is four characters long, but the SOUNDEX() function returns an
arbitrarily long string. You can use SUBSTRING() on the result to get
a standard soundex string. All nonalphabetic characters in str are
ignored. All international alphabetic characters outside the A-Z range
are treated as vowels.
This function, as currently implemented, is intended to work well with
strings that are in the English language only. Strings in other
languages may not produce reliable results.
Reference: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
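PHP has a native `soundex()` as well, so the same comparison can be sketched in the application loop. Note that PHP's version returns a fixed four-character key, unlike the MySQL function described above. `findBySoundex` and the brand list are made up for illustration.

```php
<?php
// Return the known names whose soundex key equals the key of the user's input.
function findBySoundex(string $input, array $names): array
{
    $key = soundex($input);
    return array_values(array_filter(
        $names,
        function (string $name) use ($key): bool {
            return soundex($name) === $key;
        }
    ));
}

$brands = ['Toyota', 'Honda', 'Tesla']; // hypothetical sample data
print_r(findBySoundex('Toyyota', $brands));
print_r(findBySoundex('Tayota', $brands));
```

This only catches typos that keep the same consonant pattern. A variant like Tokota gets a different key, and the Russian spellings are outside what soundex is designed for, so it would likely need to be combined with an edit-distance check.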

Searching for phonenumber in database with regex

I need to search in a database for a phonenumber. However, I don't know how the phone number is stored in the database, this can be in different ways, like:
0123456789
012 3456789
012 34 56 78 9
012-3456789
The string that I have to look for is always formatted like
0123456789
My query now looks like:
SELECT * FROM account WHERE phonenumber = '0123456789'
But this of course only works when the phonenumber is formatted like the search string. How do I use a regex or another function to search for all kinds of formatted phonenumbers?
Use MySQL REGEXP. This is a basic example of how you can achieve that. It works with every number format in your example.
You may want to think of a better regexp to be more precise.
SELECT * FROM account WHERE phonenumber REGEXP '012( |-|)34( |)56( |)78( |)9';
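Such a pattern can be generated from the search string rather than written by hand. The sketch below is one possible way, with a hypothetical helper `phoneSearchPattern`; it allows an optional space or hyphen between every pair of digits, which is slightly more permissive than the hand-written expression above.

```php
<?php
// Build a pattern that allows an optional space or hyphen between digits.
function phoneSearchPattern(string $digits): string
{
    $digits = preg_replace('/\D+/', '', $digits);
    return '^' . implode('[ -]?', str_split($digits)) . '$';
}

$pattern = phoneSearchPattern('0123456789');
echo $pattern, "\n";

// Quick local check against the formats from the question.
foreach (['0123456789', '012 3456789', '012 34 56 78 9', '012-3456789'] as $stored) {
    var_dump(preg_match('/' . $pattern . '/', $stored) === 1);
}
```

The resulting string would be passed to the query as the REGEXP operand, preferably as a bound parameter.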
While it is possible to use REGEXP in MySQL, I don't think it's a very solid solution, and the expression will be hard to tune.
SELECT * FROM account WHERE phonenumber REGEXP '^([0-9]{1}( |-|)){9}[0-9]{1}$'
A good tool to test expressions is this site: http://www.spaweditor.com/scripts/regex/index.php
The trick is to normalize your data before you enter it in your database, which means stripping the telephone number of all non-numeric characters.
You'll need to fix the numbers already in the table too.
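A minimal sketch of that normalization step in PHP, to be applied both before inserting and before searching. The function name `normalizePhone` is an assumption for illustration.

```php
<?php
// Strip everything except digits so all formats collapse to one form.
function normalizePhone(string $raw): string
{
    return preg_replace('/\D+/', '', $raw);
}

foreach (['0123456789', '012 3456789', '012 34 56 78 9', '012-3456789'] as $raw) {
    echo normalizePhone($raw), "\n";
}
```

Once every stored value has been passed through the same function, the original equality query on phonenumber is enough.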
Best way is to store the data normalized, i.e.: Country + area + number separately .

Name comparison algorithm

I need to check whether a name is on an anti-terrorism list.
In addition to the given name, I also want to search for similar names (possible aliases).
Example:
given name => Bin Laden alert!
given name => Ben Larden mhm.. suspicious name, matches at xx% with Bin Laden
How can I do this?
using PHP
names are 100% correct, since they are from official sources
I'm Italian, but I think this won't be a problem, since the names are international
names can be composed of several words: Najmiddin Kamolitdinovich JALOLOV
looking for companies and people
I looked at different algorithms: do you think that Levenshtein can do the job?
Thank you in advance!
P.S. I had some problems formatting this text, sorry :-)
I'd say your best bet to get this working with PHP's native functions are
soundex() - Calculate the soundex key of a string
levenshtein() - Calculate Levenshtein distance between two strings
metaphone() - Calculate the metaphone key of a string
similar_text() - Calculate the similarity between two strings
Since you are likely matching the names against a database (?), you might also want to check whether your database provides any Name Matching Functions.
A Google search also turned up a PDF with a nice overview of name matching algorithms:
http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
The Levenshtein function (http://php.net/manual/en/function.levenshtein.php) can do this:
$string1 = 'Bin Laden';
$string2 = 'Ben Larden';
levenshtein($string1, $string2); // result: 2
Set a threshold on this result and determine if the name looks similar.
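One way to turn the distance into the "matches at xx%" figure from the question is to scale it by the length of the longer string. This is a sketch under that assumption; `nameSimilarityPercent` and the threshold of 75 are hypothetical, and the comparison is byte-based and lowercased, so it is only reliable for plain ASCII names.

```php
<?php
// Similarity as a percentage: 100 means identical, 0 means nothing in common.
function nameSimilarityPercent(string $a, string $b): int
{
    $a = strtolower($a);
    $b = strtolower($b);
    $maxLength = max(strlen($a), strlen($b));
    if ($maxLength === 0) {
        return 100;
    }
    $distance = levenshtein($a, $b);
    return (int) round(100 * ($maxLength - $distance) / $maxLength);
}

$threshold = 75; // hypothetical cut-off, tune against real data
$percent = nameSimilarityPercent('Ben Larden', 'Bin Laden');
echo $percent >= $threshold ? "suspicious ($percent%)\n" : "ok ($percent%)\n";
```

For multi-word names it may work better to compare word by word, since a reordered name produces a large distance even when every word matches.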
