mysql or php comparing names

I have two tables: one with the correct names of people, and one with all the correct names plus a bunch of names that are misspelled or use other characters.
For example, we have a name like Henry Muller, and then Henry Müller or Henry Mueller, and many other variations like that.
Is there a MySQL function that can compare those names and match on, say, 90% of the characters or something similar? I know I can't match all the names to the correct ones, but I am hoping to get some of the way.
It's in a MySQL database, but I wouldn't mind getting the job done in PHP.
Thanks a whole bunch :)

I think this will accomplish what you're after, but you'll have to deal with the possibility of false matches:
SELECT A.name, B.name
FROM TABLE_A A
INNER JOIN TABLE_B B
ON Soundex(A.name) = Soundex(B.name)
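On the PHP side, a minimal sketch of the "90% of the characters" idea from the question could fold umlauts to ASCII and then let similar_text() compute a percentage. The helper names normalizeName and nameSimilarity and the small transliteration map are assumptions for illustration; real data would likely need a larger map.
```php
<?php
// Hypothetical helper: fold a few common umlauts to ASCII, then lowercase.
// Assumes the script and the names are UTF-8 encoded.
function normalizeName(string $name): string
{
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'Ä' => 'ae', 'Ö' => 'oe', 'Ü' => 'ue',
    ];
    return strtolower(trim(strtr($name, $map)));
}

// Returns the similar_text() percentage (0 to 100) of the normalized names.
function nameSimilarity(string $a, string $b): float
{
    similar_text(normalizeName($a), normalizeName($b), $percent);
    return $percent;
}

$correct = 'Henry Muller';
foreach (['Henry Müller', 'Henry Mueller', 'Harry Miller'] as $candidate) {
    // 90.0 is the threshold suggested in the question; tune it on real data.
    if (nameSimilarity($correct, $candidate) >= 90.0) {
        echo "$candidate looks like $correct\n";
    }
}
```
As with Soundex, a threshold like this can still produce false matches, so a manual review of borderline pairs is probably worthwhile.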

Related

Got my first project idea, am I on the right track?

I'm 14 and I am teaching myself PHP. I am going to build a parts database for my dad's lawn mower repair store.
Each lawnmower has 'OEM' numbers, but the OEM parts are expensive. My dad has suppliers who make parts which are compatible with these OEM parts, and in the back of the catalogs there are charts mapping each OEM number to the catalog's part number.
I want to be able to enter the OEM number into a search field and then bring up every compatible catalog part number along with which catalog it's from... I hope this is clear.
Right now I have extracted the data from the catalogs and have 6 database tables like this:
id | oem | partnumber
---+-----+-----------
   |     |
The table name is the catalog name.
I have a little coding experience (played around with Arduinos and Raspberry Pis for a while), so I have a little knowledge but have never done anything with databases. I also know HTML and CSS, and a tiny bit of JavaScript.
I see these as the steps:
1. Get the OEM number from the search field.
2. Search every table for the OEM number.
3. If there is a match, return the part number and catalog.
4. If there is no match, display 'nothing found'.
I kind of get it in my head, but I am totally lost in searching many tables with 1 query.... I can't do it with 'one' query because I need the catalog name (the table name) to be returned too... the easy option would be to make 1 table but I want to update this in the future and that would make it very hard.
I think I have to set the table names as an array and loop through the array running the query, but I can't get my head around how I would do this???
How exactly would you go about this?
Am I doing it wrong?
I could do many queries I guess but I was always taught that repeating code usually means you are doing it wrong, so I don't want to do that. I want to learn properly...
many thanks

You are on the right track. You might want to look at what the database world calls normalisation. It basically means that if something is referenced a lot (OEM, catalog, etc.), you pull it out into a separate table and reference its id rather than its value, which avoids duplication.
SQL is built for this kind of query (if you are interested, you should look at set-based logic and Venn diagrams, which will help in understanding SQL).
So you might have a query that looks like this:
SELECT p.PartNumber, p.Price, c.Name
FROM Parts p
INNER JOIN OEM oem ON p.OEMId = oem.id
INNER JOIN Catalog c ON c.Id = oem.CatalogId
WHERE oem.slug = 'OEM123ABC'

You're doing it wrong. Put the catalog name in the table, if that's all there is to it. Then you can run a SQL query like:
SELECT part_number FROM table_name WHERE catalog='catalog_name' AND oem='search_term';
Make sure to set the appropriate indices.
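If the per-catalog tables have to stay for now, the loop over table names described in the question can build one UNION ALL query that also returns the catalog name. This is only a sketch: the table names catalog_a and catalog_b and the function buildOemSearchSql are hypothetical, and the columns oem and partnumber are taken from the question.
```php
<?php
// Builds one query that searches every catalog table for the same OEM number.
// Table names cannot be bound as parameters, so they are checked against a
// strict pattern before being placed into the SQL text.
function buildOemSearchSql(array $catalogTables): string
{
    $parts = [];
    foreach ($catalogTables as $table) {
        if (!preg_match('/^[A-Za-z0-9_]+$/', $table)) {
            throw new InvalidArgumentException("Unexpected table name: $table");
        }
        $parts[] = "SELECT '$table' AS catalog, partnumber FROM `$table` WHERE oem = ?";
    }
    return implode("\nUNION ALL\n", $parts);
}

$tables = ['catalog_a', 'catalog_b']; // hypothetical catalog table names
$sql = buildOemSearchSql($tables);

// With an existing PDO connection, the OEM number is bound once per table:
// $stmt = $pdo->prepare($sql);
// $stmt->execute(array_fill(0, count($tables), $oemNumber));
// $rows = $stmt->fetchAll(PDO::FETCH_ASSOC); // empty array means 'nothing found'
```
The single-table design with a catalog column is still simpler to query and to index; the sketch only shows that the array-and-loop idea does not require one query per table.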

mysql distinct query using joins

I have a complex database relationship (to me it's complex). In theory, I think it's a good design, but my roadblock now is getting data out of it in as few queries as possible. Here is the database structure I have:
student table:
    some fields like name, phone, email, etc.
students_requirements table (mapping table):
    student_id, requirement_id, date
requirements table (belongs to a requirement type):
    id, requirement_type_id, name
requirement_type table (has many requirements):
    id, type, name
OK, so here is an example of how it is used. I can build requirement types; an example would be something like an assignment. Each assignment has multiple requirements. A student can pass off requirements for a specific assignment, but doesn't necessarily have requirements passed off for all assignments. So I want to query all assignments by student. Say there are 50 assignments entered in the system, and Jon Smith has entered requirements for 4 of those assignments. I would like to query by Jon Smith's id to find all assignments that he has entered any requirements for.
I hope that makes sense. My only guess is to use a join, but to be honest, I really don't understand them very well.
Any help would be awesome!

Try this:
SELECT * FROM student_table, students_requirements_table,
requirements_table, requirement_type_table
WHERE student_table.name = 'Jon Smith'
AND students_requirements_table.student_id = student_table.id
AND requirements_table.id = students_requirements_table.requirement_id
AND requirement_type_table.id = requirements_table.requirement_type_id;
Check that the table names are accurate, as I've had to assume a couple of things (such as there being underscores in some of your table names). The query is split across several lines for readability; MySQL accepts it either way.
I don't have a LAMP rig set up at the moment, so I can't mock this up to test it, and it's been a while since I had to write MySQL joins, but I think this is on the right track.
If you need to use LEFT JOIN, then take a look at this page: Left joins to link three or more tables.
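Since the question asks for each assignment only once, a sketch with explicit JOINs and DISTINCT might look like the following. The table and column names follow the schema in the question; the constant and function names are made up for illustration, and the PDO connection is assumed to exist already.
```php
<?php
// One row per requirement type (assignment) for which the student has
// entered at least one requirement.
const ASSIGNMENTS_FOR_STUDENT_SQL = <<<'SQL'
SELECT DISTINCT rt.id, rt.name
FROM students_requirements sr
JOIN requirements r ON r.id = sr.requirement_id
JOIN requirement_type rt ON rt.id = r.requirement_type_id
WHERE sr.student_id = :student_id
SQL;

// Hypothetical helper: runs the query for one student id.
function assignmentsForStudent(PDO $pdo, int $studentId): array
{
    $stmt = $pdo->prepare(ASSIGNMENTS_FOR_STUDENT_SQL);
    $stmt->execute([':student_id' => $studentId]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```
Querying by student id avoids joining the student table at all; without DISTINCT, an assignment would appear once per passed-off requirement.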

How to find similarity between mySQL rows?

I am trying to create a script that finds a matching percentage between my table rows. For example, the products table in my MySQL database contains the field name (indexed, FULLTEXT) with values like
LG 50PK350 PLASMA TV 50" Plasma TV Full HD 600Hz
LG TV 50PK350 PLASMA 50"
LG S24AW 24000 BTU
Aircondition LG S24AW 24000 BTU Inverter
As you can see, all of them share some keywords, but the 1st and 2nd names are more similar to each other. Likewise, the 3rd and 4th have more keywords in common with each other than with the 1st and 2nd.
My MySQL DB has thousands of product names. What I want is to find those names whose similarity exceeds a given percentage (let's say 60%).
For example, the 1st, the 2nd and any other names that match each other by more than 60% will be echoed as a group, to let me know that those products are similar. The 3rd, the 4th and any others with more than 60% matching will then be echoed as another group, telling me that those products match.
If possible, it would be great to echo the keywords shared by all the names in a group. For example, LG S24AW 24000 BTU is the keyword contained in both the 3rd and 4th names.
At the end I will create a list of all those keywords.
What I have now is the following query (as Jitamaro suggested):
Select t1.name, t2.name From products t1, products t2
It creates a new name field next to all the other names. Excuse me that I don't know how to explain it right, but this is what it does (the real values are product names like the ones above):
Before the query
-name-
A
B
C
D
E
After the query
-name- -name-
A A
B A
C A
D A
E A
A B
B B
C B
D B
E B
.
.
.
Is there a way, with either MySQL or PHP, to find the matching names and extract the keywords as I described above? Please share code examples.
Thank you community.

Query the DB with LIKE or REGEXP:
SELECT * FROM product WHERE product_name LIKE '%LG%';
SELECT * FROM product WHERE product_name REGEXP "LG";
Loop the results and use similar_text():
$a = "LG 50PK350 PLASMA TV 50\" Plasma TV Full HD 600Hz"; // DB value
$b = "LG TV 50PK350 PLASMA 50\"" ; // USER QUERY
$i = similar_text($a, $b, $p);
echo("Matched: $i Percentage: $p%");
//outputs: Matched: 21 Percentage: 58.3333333333%
Your second example matches 62.0689655172%:
$a = "LG S24AW 24000 BTU"; // DB value
$b = "Aircondition LG S24AW 24000 BTU Inverter" ; // USER QUERY
$i = similar_text($a, $b, $p);
echo("Matched: $i Percentage: $p%");
You can define a threshold, let's say 40%, above which products count as a match.
Please note that similar_text() is case SensItivE, so you should lowercase both strings first.

As for your second question, the levenshtein() function would be a good candidate. Note that it is a PHP function; MySQL has no built-in equivalent.

When I look at your examples, I consider how I would try to find similar products based on the title. From your two examples, I can see one thing in each line that stands out above anything else: the model numbers. 50PK350 probably doesn't show up anywhere other than as related to this one model.
Now, MySQL itself isn't designed to deal with questions like this, but some bolt-on tools above it are. Part of the problem is that querying across all those fields in all positions is expensive. You really want to split it up a certain way and index that. The similarity class of Lucene will grant a high score to words that rarely appear across all data, but do appear as a high percentage of your data. See High level explanation of Similarity Class for Lucene?
You should also look at Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Scoring each word against the Lucene similarity class ought to be faster and more reliable. The sum of your scores should give you the most related products. For the TV, I'd expect to see exact matches first, then some others of the same size, then brand, then TVs in general, etc.
Whatever you do, realize that unless you alter the data structures by using another tool on top of the SQL system to create better data structures, your queries will be too slow and expensive. I think Lucene is probably the way to go. Sphinx or other options not mentioned may also be up for consideration.

This is trickier than it seems and there is information missing in your post:
How are people going to use this auto-complete function?
Is it relevant that you can find all names for a product? Because apparently not all stores name their products similarly so a clerk might not be able to find the product (s)he found.
Do you have information about which product names are for the same product?
Is it relevant from which store you're searching? where is this auto-complete used?
Should the auto-complete really only suggest products that match all the words you typed? (it's not so hard, technically, to correct typos)
I think you need a more clear picture of what you (or better yet: the users) want this auto-complete function to do.
An auto-complete function is very much a user-friendly type feature. It aids the user, possibly in a fuzzy way so there is no single right answer. You have to figure out what works best, not what is easiest to do technically.
First figure out what you want, then worry about technology.

One possible solution is to use Damerau-Levenshtein distance. It could be used like this:
select *
from products p
where DamerauLevenshtein(p.name, '*user input here*') <= *X*
You'll have to figure out the X that suits your needs best. It should be an integer greater than zero. You could have it hard-coded, parameterized or calculated as needed.
The trickiest thing here is DamerauLevenshtein. It has to be a stored function that implements the Damerau-Levenshtein algorithm. I don't have MySQL here, so I might write it for you later today.
Update: MySQL does not support arrays in stored procedures, so there is no way to implement Damerau-Levenshtein in MySQL except by using a temporary table for each function call, and that will result in terrible performance. So you have two options: loop through the results in PHP with levenshtein() like Alix Axel suggests, or migrate your database to PostgreSQL, where arrays are supported.
There is also the option of creating a user-defined function, but this requires writing the function in C, linking it to MySQL and possibly rebuilding MySQL, so this way you'll just add more headache.
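The "loop through the results in PHP" option could be sketched as below. Note that PHP's built-in levenshtein() computes plain Levenshtein distance (no transpositions), so this is a stand-in for Damerau-Levenshtein rather than the same algorithm. The function name withinDistance and the sample names are assumptions; in practice $names would come from a query on the products table.
```php
<?php
// Keeps only the names whose edit distance to the user input is at most
// $maxDistance. Comparison is case-insensitive and byte-based.
function withinDistance(array $names, string $input, int $maxDistance): array
{
    $needle = strtolower($input);
    $matches = array_filter(
        $names,
        function (string $name) use ($needle, $maxDistance): bool {
            return levenshtein(strtolower($name), $needle) <= $maxDistance;
        }
    );
    return array_values($matches);
}

// Illustrative data; the second entry contains a one-character typo.
$names = [
    'LG S24AW 24000 BTU',
    'LG S24AV 24000 BTU',
    'Aircondition LG S24AW 24000 BTU Inverter',
];
print_r(withinDistance($names, 'lg s24aw 24000 btu', 2));
```
Because a fixed X penalizes long names more than short ones, dividing the distance by the longer string's length may be a more useful threshold.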

Your approach seems sound. For matching similar products, I would suggest a trigram search. There's a pretty decent explanation of how this works along with the String::Trigram Perl module.
I would suggest using trigram search to get a list of matches, perhaps coupled with some manual review depending on how much data you have to deal with and how frequent you need to add new products. I've found this approach to work quite well in practice.
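To make the idea concrete, here is a small PHP sketch of trigram similarity: each string is broken into overlapping three-character pieces, and two strings are scored by the share of pieces they have in common (Jaccard index). It is not the String::Trigram module's exact scoring, the function names are made up, and it works on bytes rather than multibyte characters.
```php
<?php
// Returns the distinct 3-character substrings of the lowercased string.
function trigrams(string $text): array
{
    $text = strtolower($text);
    $grams = [];
    for ($i = 0; $i + 3 <= strlen($text); $i++) {
        $grams[substr($text, $i, 3)] = true;
    }
    return array_keys($grams);
}

// Jaccard similarity of the two trigram sets: 0.0 (nothing shared) to 1.0.
function trigramSimilarity(string $a, string $b): float
{
    $ta = trigrams($a);
    $tb = trigrams($b);
    $union = count(array_unique(array_merge($ta, $tb)));
    if ($union === 0) {
        return 0.0; // both strings shorter than three characters
    }
    return count(array_intersect($ta, $tb)) / $union;
}

echo trigramSimilarity('LG S24AW 24000 BTU', 'Aircondition LG S24AW 24000 BTU Inverter'), "\n";
```
For thousands of rows, the trigrams would normally be stored in an indexed table so that candidates can be found without comparing every pair in PHP.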

Maybe you want to find the longest common substring of the two strings? Then you need to compute a suffix tree for your strings; see http://en.wikipedia.org/wiki/Longest_common_substring_problem.
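For short product names, a dynamic-programming version is simpler to write than a suffix tree, so the sketch below uses that instead (quadratic time, byte-based). The function name is an assumption.
```php
<?php
// Longest common substring via dynamic programming.
// $prev[$j] holds the length of the common run ending at $a[$i-2] and $b[$j-1].
function longestCommonSubstring(string $a, string $b): string
{
    $best = 0;    // length of the longest run found so far
    $endInA = 0;  // position in $a just after that run
    $lenA = strlen($a);
    $lenB = strlen($b);
    $prev = array_fill(0, $lenB + 1, 0);
    for ($i = 1; $i <= $lenA; $i++) {
        $curr = array_fill(0, $lenB + 1, 0);
        for ($j = 1; $j <= $lenB; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > $best) {
                    $best = $curr[$j];
                    $endInA = $i;
                }
            }
        }
        $prev = $curr;
    }
    return substr($a, $endInA - $best, $best);
}

echo longestCommonSubstring(
    'LG S24AW 24000 BTU',
    'Aircondition LG S24AW 24000 BTU Inverter'
), "\n"; // prints "LG S24AW 24000 BTU"
```
This returns the shared keyword for the 3rd and 4th names from the question; names whose shared words appear in a different order would need a token-based comparison instead.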

If you want to check all names against each other you need a cross join in mysql. There are many ways to achieve this:
1. Select a, b From t1, t2
2. Select a, b From t1 Join t2
3. Select a, b From t1 Cross Join t2
Then you can loop through the result. It is the same as building a 2D array with n^2 elements in which every name is paired with every other name (and with itself).
P.S.: Select t1.name, t2.name From products t1, products t2

It sounds like you've gone through all this trouble to explain a complex scenario, then said that you want to ignore the optimal answers and just get us to give you the "handshake" protocol (everything is compared to everything that hasn't been compared to it yet). So... pseudocode:
select * from table order by id
while (result) {
select * from table where id > result_id
}
That will do it.
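In PHP, the same "handshake" comparison can be sketched with two nested loops over rows already fetched from the table, so each pair is compared exactly once. The function name similarPairs is made up, and the 60% threshold and sample names come from the question.
```php
<?php
// Compares every unordered pair of names once (j always starts after i) and
// returns [indexA, indexB, percent] for pairs at or above the threshold.
function similarPairs(array $names, float $minPercent): array
{
    $pairs = [];
    $n = count($names);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            similar_text(strtolower($names[$i]), strtolower($names[$j]), $percent);
            if ($percent >= $minPercent) {
                $pairs[] = [$i, $j, round($percent, 1)];
            }
        }
    }
    return $pairs;
}

$names = [
    'LG 50PK350 PLASMA TV 50" Plasma TV Full HD 600Hz',
    'LG TV 50PK350 PLASMA 50"',
    'LG S24AW 24000 BTU',
    'Aircondition LG S24AW 24000 BTU Inverter',
];
foreach (similarPairs($names, 60.0) as [$a, $b, $percent]) {
    echo "{$names[$a]} <-> {$names[$b]} ($percent%)\n";
}
```
For n names this performs n(n-1)/2 comparisons, which is fine for a few thousand rows as a one-off job but grows quickly beyond that.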

If your database simply had a UPC code as one of its fields, and this field was well-maintained, i.e., you could trust that it was entered correctly by the database maintainer and correctly reflected what the item was -- then you wouldn't need to do all of the work you suggest.
An even better idea might be to have a UPC field in your next database -- and constrain it as unique.
If database users attempt to put an already-existing UPC into the database, they get an error.
The database maintains its integrity.
And if such a database maintained its integrity -- the necessity of doing what you suggest never arises.
This probably doesn't help much with your current task (apologies) -- but for a future similar database -- you might wish to think about it...

I'd advise you to use a full-text search engine, like Sphinx. It lets you implement any algorithm you want. For example, you may use "quorum" or "any" searches.

It seems that you might always want to return the shortest string? That's more of a question than anything. But then you might have something like...
SELECT * FROM products
WHERE product_name LIKE '%LG%'
ORDER BY LENGTH(product_name) ASC
LIMIT 1

This is a clustering problem, which can be solved with data mining methods (http://en.wikipedia.org/wiki/Cluster_analysis). It requires a lot of memory and computation-intensive operations, which makes it unsuitable for a database engine. Otherwise, separate data mining, text mining, or business analytics software wouldn't exist.

This question is similar :) to this one:
What is the best way to implement a substring search in SQL?
Trigrams can easily find similar rows, and in that question I posted a PHP + MySQL + trigram solution.

You can use LIKE to find similar product names within the table. For example:
SELECT * FROM product WHERE product_name LIKE 'LG%';

Here is another idea (but I'm voting for levenshtein()):
Create a temporary table of all words used in the names and their frequencies.
Choose a range of results (the most popular words are probably words like LCD or LED; the rarest words could be good candidates, as they might be actual product names).
For each of the resulting words, suggest either:
results with those words, or
results containing the longest substring (like this: http://forums.mysql.com/read.php?10,277997,278020#msg-278020 ) of those words.
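The word-frequency step can also be sketched in PHP instead of a temporary table. The function name wordFrequencies is an assumption; each word is counted once per name, so the result says in how many product names a word occurs.
```php
<?php
// Counts in how many names each lowercased word appears, most frequent first.
function wordFrequencies(array $names): array
{
    $words = [];
    foreach ($names as $name) {
        $tokens = preg_split('/\s+/', strtolower(trim($name)), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($tokens) as $token) {
            $words[] = $token;
        }
    }
    $frequencies = array_count_values($words);
    arsort($frequencies);
    return $frequencies;
}

$names = [
    'LG 50PK350 PLASMA TV 50" Plasma TV Full HD 600Hz',
    'LG TV 50PK350 PLASMA 50"',
    'LG S24AW 24000 BTU',
    'Aircondition LG S24AW 24000 BTU Inverter',
];
print_r(wordFrequencies($names));
```
Words that occur in nearly every name (here "lg") carry little information, while words shared by only a few names (such as "s24aw" or "50pk350") are the likelier model identifiers.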

OK, I think I was trying to implement a very similar thing. It can work the same way as the Google Chrome address box: when you type an address, it gives you suggestions. As far as I can tell, this is what you are trying to achieve.
I cannot give you an exact solution, but here is some advice:
1. Implement the dropdown box where someone starts to enter the product they are looking for.
2. Get the current value of the dropdown and run a query like the one posted above, e.g. "SELECT * FROM product WHERE product_name LIKE 'LG%';"
3. Save the results of the query.
4. Refresh the page.
5. Add the results of the query to the dropdown.
Note:
You need to save the query results somewhere, such as a text file containing the HTML, e.g. <option>LG TS 600</option>. These values will be used to populate your option box after the page refresh. You need to set up a session per user so that each user gets their own results; otherwise, if several users search at the same time, the results could clash. With the search id and the session id you can then match them. You can save the results in a file or a table; a table would be more convenient. To my mind, what you are looking for is really a whole subsystem.
I hope it helps.

MySQL MATCH-AGAINST plural words?

I'd like to add a search to my site. I have a database of challenges from a video game. Each challenge has a title and description, I'd like to be able to search at least the description, but both if possible. Now, I've set the table up so that I can use MATCH() AGAINST(), but I'm having a problem with words that can be either singular or plural.
For example, the word "assists" appears in multiple challenges, but if the user types "assist", he won't get anything. Is there any way that I can add that functionality? I've tried everything I can think of, but nothing has worked so far.
Update: I only just recently learned about MATCH AGAINST, so I'm not sure of the "right" way to use it in my case. Like I said, I've got a table with a column called description. Using the example word from above, assists, which appears multiple times in the table, I would use this query:
SELECT * FROM challenges WHERE MATCH(description) AGAINST('assists')
I just executed that and it returned 10 rows. If I change it to assist, I get nothing.

Maybe you can use the SOUNDEX() function from MySQL.
SELECT
id,
title
FROM products AS p
WHERE p.title SOUNDS LIKE 'Shaw'
-- This will match 'Saw' etc.
I think this may also be applicable to your case and might cope with plurals.

I might be missing the point of what you're trying to do, but why not just do a LIKE match like this:
SELECT * FROM challenges WHERE description LIKE "%assist%";
That will match anything that contains "assist" as well as "assists". Of course it will also match "assistant" and "assistance", which may not be what you want...
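Another option that keeps the FULLTEXT index in play is MySQL's boolean mode, where a trailing * on a word matches any word starting with it. The PHP sketch below turns user input into such a search string; the function name toPrefixSearch is made up, and the challenges table and description column are taken from the question.
```php
<?php
// Turns "assist" into "assist*" (and "double kill" into "double* kill*") so
// that boolean-mode full-text search also matches longer forms of each word.
// Anything that is not a letter or digit is treated as a separator, which
// also strips boolean-mode operators typed by the user.
function toPrefixSearch(string $userInput): string
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', $userInput, -1, PREG_SPLIT_NO_EMPTY);
    $terms = array_map(function (string $word): string {
        return $word . '*';
    }, $words);
    return implode(' ', $terms);
}

$sql = 'SELECT * FROM challenges WHERE MATCH(description) AGAINST(? IN BOOLEAN MODE)';

// With an existing PDO connection:
// $stmt = $pdo->prepare($sql);
// $stmt->execute([toPrefixSearch('assist')]);
```
Like the LIKE approach, the prefix form also matches "assistant" and "assistance", so it trades precision for recall.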

MySQL Query Where Column Like Column

I'm working on a small project that involves grabbing a list of contacts which are stored for each group. Essentially, the database is set up so that each group has a primary and secondary contact stored as, unsurprisingly, Group.Primary and Group.Secondary. The objective is to pull every Primary and Secondary contact for each Group and display them in a sortable table.
I have the sortable table all worked out, but I have come across a small problem. Each primary and secondary field can have more than one contact separated by a comma. For instance, if Primary contained 123,256 , it would need to pull both Contacts with IDs 123 and 256. I had intended to use a query formatted like this:
SELECT *
FROM Group G,
Contacts C
WHERE G.Primary LIKE %C.ID%
OR G.Secondary LIKE %C.ID%
so that I could just skip the comma part, but I can't seem to find a working query for this.
My question to you is, am I just overlooking something here? Is there a simple query that would let me do this? Or am I better off getting the groups and contacts separately, and combine the two later. I think the former is a little easier to understand when read, which is a plus as this is a shared project, but if that is not possible I will do the latter.
This code is simplified, but it gets the point across.

If I understand correctly, you want to use the MySQL FIND_IN_SET function (Group is a reserved word in MySQL, hence the backticks):
SELECT *
FROM `Group` G
JOIN Contacts C ON FIND_IN_SET(C.ID, G.Primary)
OR FIND_IN_SET(C.ID, G.Secondary)
But I highly recommend you normalize the table -- do not store comma-delimited lists if at all possible.

I think you're definitely better off separating those two data values into different tables and then using JOINs to do your linking. If you were to, say, cast your id fields to strings so you could use the LIKE comparison, you'd end up with a bunch of junk matches. For example, if your primary id is 1, and your secondary is 35, then you'd match on the following (and this list is not exhaustive):
1: 1, 2: 35
1: 35, 2: 1
1: 10, 2: 135
1: 431, 2: 3541
etc.
What I'd do instead is something like this:
SELECT *
FROM `Group` G
LEFT JOIN Contacts C1 ON G.Primary = C1.ID
LEFT JOIN Contacts C2 ON G.Secondary = C2.ID
WHERE
C1.ID IS NOT NULL
OR
C2.ID IS NOT NULL
I think that'll get you the data you're really looking for, if I understand the question correctly.

Satan has been in your database, denormalizing it and dooming you to a life of complex and slow queries.
Do you have the ability to alter the structure of the database? If it's in production, I assume not. Failing that, you might want to consider creating a normalized table of primary and secondary contacts immediately prior to running this report.
If you can't do that, you need to work out a string matching algorithm that will always work. The problem with the one that you proposed is that you need to consider a contact id of 23 (or even 3), which will match 23, 123, 223, 231, and so on. To make that work, you need to add commas to the beginning and ending of both strings you're comparing and then do the LIKE.
Oops. Or you can use the I-never-knew FIND_IN_SET function described by Ponies, above.
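The comma-wrapping trick described above can be sketched as follows, first as a plain PHP check and then as the equivalent MySQL join condition held in a string. The function name idInList is made up; the table and column names follow the question, and spaces are stripped because the example list in the question contains one.
```php
<?php
// True when $id appears as a whole entry in a comma-separated list such as
// "123,256". Wrapping both sides in commas stops 23 from matching 123 or 231.
function idInList(string $csv, int $id): bool
{
    $wrapped = ',' . str_replace(' ', '', $csv) . ',';
    return strpos($wrapped, ',' . $id . ',') !== false;
}

// The same idea as a join condition (Group is a reserved word in MySQL,
// hence the backticks).
$sql = "SELECT G.*, C.*
FROM `Group` G
JOIN Contacts C
  ON CONCAT(',', REPLACE(G.Primary, ' ', ''), ',') LIKE CONCAT('%,', C.ID, ',%')
  OR CONCAT(',', REPLACE(G.Secondary, ' ', ''), ',') LIKE CONCAT('%,', C.ID, ',%')";

var_dump(idInList('123,256 ', 256)); // bool(true)
var_dump(idInList('123,256 ', 23));  // bool(false)
```
Neither form can use an index on the list columns, which is one more argument for the normalized contact table suggested above.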
