MySQL Remove/Combine Similar Rows - php

I've got a problem that I just can't seem to find the answer to. I've developed a very small CRM-like application in PHP that's driven by MySQL. Users of this application can import new data to the database via an uploaded CSV file. One of the issues we're working to solve right now is duplicate, or more importantly, near-duplicate records. For example, if I have the following:
Record A: [1, Bob, Jones, Atlanta, GA, 30327, (404) 555-1234]
and
Record B: [2, Bobby, Jones, Atlanta, GA, 30327, Bob's Shoe Store, (404) 555-1234]
I need a way to see that these two are similar, keep the record with more information (in this case record B), and remove record A.
But here's where it gets even more complicated. This must happen both when importing new data and on demand, as a function I can execute to remove duplicates from the database at any time. I have been able to put something together in PHP that gets all duplicate rows from the MySQL table and matches them up by phone number, or by using implode() on all columns in the row and then using strlen() to decide which record is the longest.
There has got to be a better way of doing this, and one that is more accurate.
Do any of you have any brilliant suggestions that I may be able to implement or build on? It's obvious that when importing new data I'll need to open their CSV file into an array or temporary MySQL table, do the duplicate/similar search, then recompile the CSV file or add everything from the temporary table to the main table. I think. :)
I'm hoping that some of you can point out something that I may be missing that can scale somewhat decently and that's somewhat accurate. I'd rather present a list of duplicates we're 'unsure' about to a user that's 5 records long, not 5,000.
Thanks in advance!
Alex

If I were you I'd put a UNIQUE key on name, surname and phone number, since in theory if all three are equal the record is a duplicate. My reasoning is that a phone number can have only one owner. In any case, you should find a combination of two, three or maybe four columns and assign them a unique key. Once you have such a structure, run something like this:
-- assuming that you have defined something like the following in your CREATE TABLE:
UNIQUE(phone, name, surname)
-- then you should perform something like:
INSERT INTO your_table (phone, name, surname)
VALUES ($val1, $val2, $val3)
ON DUPLICATE KEY UPDATE phone   = IFNULL($val1, phone),
                        name    = IFNULL($val2, name),
                        surname = IFNULL($val3, surname);
So basically, if the inserted value is a duplicate, this code will update the row rather than inserting a new one. The IFNULL function checks whether its first expression is NULL; if it is, it returns the second expression, which in this case is the column value that already exists in your table. Hence, the row is updated with as much information as possible.
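To run this from PHP, a prepared statement is the usual route. A minimal sketch, assuming a PDO connection $pdo and a hypothetical contacts table carrying the UNIQUE key above; VALUES(col) refers to the value that would have been inserted:
$sql = "INSERT INTO contacts (phone, name, surname)
        VALUES (:phone, :name, :surname)
        ON DUPLICATE KEY UPDATE
            phone   = IFNULL(VALUES(phone), phone),
            name    = IFNULL(VALUES(name), name),
            surname = IFNULL(VALUES(surname), surname)";
$stmt = $pdo->prepare($sql);
// NULLs among the bound values leave the existing columns untouched
$stmt->execute(array(':phone' => $phone, ':name' => $name, ':surname' => $surname));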

I don't think there are brilliant solutions. You need to decide which of your data fields you can rely on for detecting similarity and prioritize them: for example the phone number, some kind of ID, or a normalized address or official name.
You can store cleaned-up values (reduced to the same format, e.g. phone numbers as digits only, or the full address concatenated) alongside each row, and use them for similarity searches when adding records.
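For example, a couple of small normalization helpers might look like this (a sketch; the exact rules depend on your data):
// Sketch: reduce values to a canonical form before storing them.
function normalize_phone($phone) {
    // "(404) 555-1234" -> "4045551234"
    return preg_replace('/\D+/', '', $phone);
}
function normalize_address($street, $city, $state, $zip) {
    // lowercase, trim, collapse whitespace, join into one comparable string
    $parts = array_map(function ($p) {
        return preg_replace('/\s+/', ' ', strtolower(trim($p)));
    }, array($street, $city, $state, $zip));
    return implode(', ', array_filter($parts));
}
// Store the results in extra columns (e.g. phone_norm, address_norm)
// and search those columns when importing new records.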
Then you need to decide, based on data completeness in each case, whether to update the existing row with the more complete fields or to delete the old row and add the new one.
I don't know of any ready-made solutions for such a variable task, and I doubt they exist.

Related

Processing feedback about duplicate rows on bulk insert

I have a service that allows users to import multiple items at once: besides filling in the form, they can upload a CSV file where each row represents an item. Each entity uses an id that is stored under a UNIQUE field in my MySQL database (only one item with a specific id can exist).
When the user finishes the upload and the CSV processing, I would like to provide feedback about which items in their file already existed in the database. I decided to go with INSERT IGNORE, parsing the ids out of the warnings (regex) and retrieving item information (SELECT) based on the collected ids. Browsing the internet, I did not find a common solution for this, so I would like to know whether this approach is correct, especially when dealing with a larger number of rows (500+).
Base idea:
INSERT IGNORE INTO items (id, name, address, phone) VALUES (x,xx,xxx,xxxx), (y,yy,yyy,yyyy), etc;
SHOW WARNINGS;
$warning_example = [
    0 => ['Message' => "Duplicate entry '123456' for key 'id'"],
    1 => ['Message' => "Duplicate entry '234567' for key 'id'"],
];
$duplicates_count = 0;
foreach ($warning_example as $duplicated_item) {
    preg_match('/regex_to_extract_id/', $duplicated_item['Message'], $result);
    $id[$duplicates_count] = $result;
    $duplicates_count++;
}
$duplicates_string = implode(',', $id);
SELECT name FROM items WHERE id IN ($duplicates_string);
Also, what would be the simplest and most efficient regex for this task, given that the message structure is the same every time?
Duplicate entry '12345678' for key 'id'
Duplicate entry '23456789' for key 'id'
etc.
With preg_match:
preg_match(
"/Duplicate entry '(\d+)' for key 'id'/",
$duplicated_item['Message'],
$result
);
$id[$duplicates_count] = $result[1];
(\d+) matches a sequence of digits (\d); the surrounding parentheses capture it, which is why the id ends up in $result[1].
However, there are better ways to proceed if you have control over the way the data is imported. To start with, I would recommend first running a SELECT statement to check whether a record already exists, and running the INSERT only when needed. This avoids generating errors on the database side. It is also much more accurate than using INSERT IGNORE, which ignores all errors that occur during insertion (wrong data type or length, non-nullable value, ...): for this reason, it is usually not a good tool to check for uniqueness.
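For illustration, the check-then-insert flow might look like this: a sketch assuming a PDO connection $pdo and the items table from the question, with $rows holding the parsed CSV as [id, name, address, phone] arrays:
// Sketch: look up which ids already exist, insert only the new rows.
$ids = array_column($rows, 0);
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT id, name FROM items WHERE id IN ($placeholders)");
$stmt->execute($ids);
$existing = $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // id => name
$insert = $pdo->prepare("INSERT INTO items (id, name, address, phone)
                         VALUES (?, ?, ?, ?)");
foreach ($rows as $row) {
    if (!isset($existing[$row[0]])) {
        $insert->execute($row);
    }
}
// $existing doubles as the "already in the database" report for the user.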

Generate String based on multiple values just inserted

This one might be a bit complicated. I searched for similar questions and found nothing that seemed relevant.
Let me start by establishing my database structure.
I have several tables, but the relevant ones are as follows:
Right now I have the decklist stored as a string of cardids delimited by commas. I realize this is inefficient, and when I get around to improving my code I will make a new table tcg_card_in_deck that has relationid, cardid, and deckid. For now my code assumes a decklist string.
I'm building a function to allow purchases of a deck. In order to give the buyer the actual cards, I have the following query (generated with PHP; in practice it will have about 50 entries):
$db->query_write("INSERT INTO
`tcg_card`
(masterid, userid, foil)
VALUES
('159', '15', '0'),
('209', '15', '0'),
('209', '15', '0'),
('318', '15', '0')");
This part is easy. My issue now is making sure the cards that have just been added can have their ids grabbed and put together in an array (to enter as a string for now, and as entries into the separate table once the code is updated). If it were a single entry I could use LAST_INSERT_ID(). If I did 50 separate INSERT queries I could grab the id on each iteration and add it to the array. But because it's all done with one query, I don't know how to reliably find the correct cards to put in the decklist. I suppose I could add a dateline field to the cards table to record the date acquired, but that seems sloppy, and it may produce flawed results if a user gets cards from a trade or a booster pack in a similar timeframe.
Any advice would be appreciated!
Change tcg_card by removing cardid, and make masterid and userid a compound key. Then add a column called quantity. Since you cannot distinguish between two copies of a card in any meaningful way (except for foils, which this schema can still handle), there is no need for every card to get its own ID.
Presumably you aren't entering new tcg_master rows dynamically, so you don't have to worry about pulling their IDs back out.
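A sketch of what granting a card then looks like, reusing the question's $db wrapper and assuming the compound key covers (masterid, userid, foil):
// Sketch: with UNIQUE(masterid, userid, foil) plus a quantity column,
// granting a card becomes an upsert instead of a new row.
$db->query_write("INSERT INTO tcg_card (masterid, userid, foil, quantity)
                  VALUES (209, 15, 0, 1)
                  ON DUPLICATE KEY UPDATE quantity = quantity + 1");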
Reading the comments on this question, I thought of a very simple and easy solution:
Get all inserted IDs when inserting multiple rows using a single query
I already track booster pack purchases with a table tcg_history. This table can also track other types of purchases, such as a starter deck.
I simply need to add a field on the tcg_card table that references tcg_history.recordid; then I will be able to select all cards that came from that purchase.
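A sketch of that idea, reusing the question's $db wrapper (the historyid column name is hypothetical, and query_read is assumed to be the read counterpart of query_write):
// Sketch: link each granted card to the tcg_history purchase record.
$db->query_write("ALTER TABLE tcg_card ADD historyid INT NULL");
// After inserting the purchase into tcg_history and fetching its recordid:
$db->query_write("INSERT INTO tcg_card (masterid, userid, foil, historyid)
                  VALUES ('159', '15', '0', $recordid),
                         ('209', '15', '0', $recordid)");
// All cards that came from that purchase:
$cards = $db->query_read("SELECT cardid FROM tcg_card WHERE historyid = $recordid");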

Organise/change field values in MySQL

I need to sort through a column in my database. The column is my category structure, and the data in it is city names, but not all the names are the same for each city. What I need to do is go through the values in the column: I may have 20-40 values that are the same city but written differently, and I need a script that can interpret them and change them to a single value.
So I may have two values in the city column, say ( england > london ) and ( westlondon ), but I need to change both to just london. Is there a script out there that is capable of interpreting the values that are already there and changing them to the value I want? I know the difficult way of doing this one by one, but I wondered if there was a script in any language that could do the job.
I've done this sort of data clean-up plenty of times and I'm afraid I don't know of anything easier than just writing your own fixes.
One thing I can recommend is making the process repeatable. Have a replacement table with something like (rule_num, pattern, new_value). Then work on a copy of the relevant bits of your table, so you can just re-run the whole script.
Then you can start with the obvious matches (just see what looks plausible) and move to the more obscure ones. Eventually you'll be left with, say, 50 rows without matches, and you can just patch those entries manually.
Making it repeatable is important because you'll be bound to find mis-matches in your first few attempts.
So, something like (syntax untested):
CREATE TABLE matches (rule_num int PRIMARY KEY, pattern text, new_value text);
CREATE TABLE cityfix AS
SELECT id, city AS old_city, '' AS new_city, 0 AS match_num FROM locations;
UPDATE cityfix AS c
JOIN matches AS m ON c.old_city LIKE m.pattern
SET c.new_city = m.new_value, c.match_num = m.rule_num
WHERE c.match_num = 0;
-- Review the results, add new rows to matches, repeat the UPDATE.
-- If you need to, you can drop table cityfix and rebuild it.
Just an idea: 16K rows is not that many. First use Perl's DBI (I'm assuming you are going to use Perl) to fetch that city column and store it in a hash (city name as the key). Then find an algorithm that suits your needs (performance-wise) to iterate over the hash keys, and use String::Diff to find matching intersections (read about it, it can definitely help you out here), storing each match as the value. You can then use that hash to update the database, with the key as the old value and the value as the new one.

Could MySQL query help find similar customer records?

I have a directory of companies, provided to me, that they want stored and updated in a MySQL database. There is no unique identifier, such as company #1234, for each company record.
The fields are typical for a mailing list: contact name, company name, street address, city, state, zip code, phone number, and type of company. Updates will be sent to me as a CSV file, again with no unique company identifier.
How do I go about matching up the stored record in the db to the new one so it can be updated? In this industry the contact name can change, and even the company name, because they add and subtract partners. Their street address can change when they move the business, and they can even change their phone number. The majority of the companies have a website URL, so hopefully that won't change often, but it easily could as well.
I've seen that MySQL has a similarity match with % (LIKE); would this be the answer to matching records against the new information?
I work in PHP, if there is a PHP solution. Thanks in advance to the kind soul who helps me out with this!
Without a primary key, it is always tricky.
The one-line answer: decide on rules that best suit your requirements.
If I were you, I would first go to the client and agree on some rules for identifying similar records. This step is necessary because, without a primary key, there is always a chance of a duplicate entry or of updating the wrong record.
Rules could be simple like:
1. Available fields:
contact name,
company name,
street address,
city,
state,
zip code,
phone number and
type of company (I hope this is the industry)
2. We will first match the company name for similarity, e.g.
select * from table_name where company_name like '%$company_name%'
3. For all found records, match the zip code and phone number. If they match, break: the record needs to be updated.
4. If no match is found in step 3, match the street address. If it matches, break: the record needs to be updated.
5. And so on (see the sketch after this list).
Your client is the best person to decide these rules, as he is the owner of the product.
On the other hand, asking the client for rules also keeps you covered: in the absence of a primary key, even with all this care, there is always a chance of duplicating records and/or updating the wrong record. Good rules only minimize that chance.
As you have said that all the fields of the table can change, I think there is no simple way to correctly update the table every time, whatever algorithm you choose.
One way to achieve this could be to ask the people/system that sends you the updated records to also include the old values of the updated fields in the CSV file. If you have the old values, you can easily match them against the present records and update them with the new values.
This is a rather general question, and the solution is somewhat unique from project to project.
I would iterate over all records ordered by the time of their change (creation date, update timestamp, or the like). Next I'd match all entries whose major fields are similar: company name, address (though that might be risky), telephone, or URL (comparing domains only). Then I would recursively iterate over all found entries until no more results are found.
This algorithm will find the same entries as long as they do not have all major columns changed at once. If they do, there is no way of telling programmatically that it's the same firm.
It will also link rows with seemingly no connection (rows 1 and 3 in the example):
Example:
2001/01/01 Awesome firm, awesome.com
2002/02/02 Awesome firm, newaddress.com // linked with the first row over company name
2010/12/05 Ohsome inc, newaddress.com // linked over url
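A minimal sketch of that linking pass (field names hypothetical; a production version would also merge two groups when a record bridges them):
// Sketch: link records that share any "major" key (name, phone, domain).
// $records must already be ordered by change date.
$groups = array();    // normalized key => group id
$group_of = array();  // record id => group id
$next_group = 0;
foreach ($records as $r) {
    $keys = array();
    if (!empty($r['company_name'])) {
        $keys[] = 'name:' . strtolower(trim($r['company_name']));
    }
    if (!empty($r['phone'])) {
        $keys[] = 'phone:' . preg_replace('/\D+/', '', $r['phone']);
    }
    if (!empty($r['url'])) {
        $host = strtolower(preg_replace('#^[a-z]+://#i', '', trim($r['url'])));
        $keys[] = 'url:' . explode('/', $host)[0]; // domain only
    }
    // Reuse the first group any key already belongs to, else open a new one.
    $gid = null;
    foreach ($keys as $k) {
        if (isset($groups[$k])) { $gid = $groups[$k]; break; }
    }
    if ($gid === null) {
        $gid = $next_group++;
    }
    foreach ($keys as $k) {
        $groups[$k] = $gid;
    }
    $group_of[$r['id']] = $gid;
}
// Records sharing a group id are presumed to be the same company.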
I have come across a somewhat similar scenario in one of my earlier projects, in SQL Server. I handled it as follows:
1. Usually there will be two types of files:
a) Full feed (weekly frequency): this has all the companies from the provider's database.
b) Incremental feed (daily frequency): this has only the new records that are not in the full feed, plus updates (flagged in the feed as I for inserts and U for updates).
2. Once I receive the full feed, I refresh my database table with it once a week. Here I also assign my own internal ids to each company record (these ids are for internal purposes).
3. On a daily basis I process the incremental feeds based on the flags (I = insert, U = update).
4. One very important thing here is to manage the mapping table: as soon as a record comes in from the feed, assign it a new internal id.
5. To compare the data and avoid duplicates, I used a fuzzy matching algorithm to get all the potential matches, then used wildcard characters to filter and identify which records are new and which are duplicates.
Have a look at the Damerau-Levenshtein distance algorithm. It calculates the "distance" between two strings, i.e. how many steps it takes to transform one string into the other; the fewer the steps, the closer the two strings are.
This article shows the algorithm implemented as a MySQL stored function. Here's the PHP version.
The algorithm is much better than LIKE or SOUNDEX.
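For a quick first pass you can also lean on PHP's built-in levenshtein() (plain Levenshtein, without the transposition handling of Damerau-Levenshtein); a sketch:
// Sketch: rank stored company names by edit distance to an incoming one.
function closest_matches($needle, array $candidates, $max = 3) {
    $scored = array();
    foreach ($candidates as $name) {
        $scored[$name] = levenshtein(strtolower($needle), strtolower($name));
    }
    asort($scored); // smallest distance first
    return array_slice($scored, 0, $max, true);
}
// e.g. closest_matches('Awsome Frim', $names) ranks 'Awesome Firm'
// (distance 3) ahead of unrelated names.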

Is there a PHP Class That I Can Use to Make Sure My Users Can't Enter the Same Data Twice (Preventing Duplicate Data)?

Problem Overview:
My application has an enrollment form.
Users have a habit of entering the same person into the system twice.
I need to find a way to rapidly and accurately check the data they've entered against the other clients in the database to see if that client is already in the database.
Criteria Currently Being Used:
Duplicate SSN
Duplicate Last Name and Date of Birth
Duplicate First Name, Date of Birth and partial SSN match (another client has an SSN where 5 of the 9 digits are the same and in the same position)
Duplicate First Name and partial SSN match (same partial-match rule as above)
Duplicate Last Name and partial SSN match (same partial-match rule as above)
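For reference, the positional check in the partial-SSN rules can be sketched like this:
// Sketch: count digits matching in the same position in two 9-digit SSNs.
function partial_ssn_match($a, $b, $threshold = 5) {
    $a = preg_replace('/\D+/', '', $a);
    $b = preg_replace('/\D+/', '', $b);
    if (strlen($a) !== 9 || strlen($b) !== 9) {
        return false;
    }
    $same = 0;
    for ($i = 0; $i < 9; $i++) {
        if ($a[$i] === $b[$i]) {
            $same++;
        }
    }
    return $same >= $threshold;
}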
In addition to these checks, there's been discussion of using soundex to detect matches based on similar first names / last names.
Is there a PHP class already designed to handle something like this? Can something like this be done at the (MySQL) database level?
Clarifications:
The problem exists not because of a lack of data integrity at the database level, but because of typos made during the entry process. The application is a data-entry application: users are taking physical paper copies of forms and entering the data into the application.
If I understand your problem correctly, the point is that the duplicates you want to filter out are not necessarily equal as strings. I have encountered situations like this a couple of times in the past, and I could never find a perfect criterion for finding logical duplicates. In my opinion, the best way to deal with such cases is to provide a very smart autocomplete-like functionality to the user, so that when entering the data he sees all the similar entries and hopefully won't create a new entry for something he sees in the list. Such a solution can be a good "buddy" for your not-yet-perfect criteria.
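MySQL's SOUNDEX() can back such a lookup directly; a sketch, with hypothetical table and column names:
// Sketch: fetch phonetically similar clients to show next to the form.
$stmt = $pdo->prepare("SELECT client_id, first_name, last_name, dob
                         FROM clients
                        WHERE SOUNDEX(last_name) = SOUNDEX(?)
                           OR SOUNDEX(first_name) = SOUNDEX(?)
                        LIMIT 20");
$stmt->execute(array($lastNameInput, $firstNameInput));
$similar = $stmt->fetchAll(PDO::FETCH_ASSOC);
// If $similar is non-empty, warn the user before saving a new client.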
Not a PHP solution, but you can declare those fields as unique in your database:
ALTER TABLE `users` ADD UNIQUE (
`username`
)
