I have a service that allows users to import multiple items at once, in addition to filling in the form one item at a time: they upload a CSV file where each row represents an item. Each item uses an id stored in a unique column in my MySQL database (only one item with a given id can exist).
When the user finishes the upload and CSV processing, I would like to provide feedback about which items in their file already existed in the database. I decided to go with INSERT IGNORE, parsing the ids out of the warnings (regex) and retrieving item information (SELECT) based on the collected ids. Browsing the internet, I did not find a common solution for this, so I would like to know if this approach is correct, especially when dealing with a larger number of rows (500+).
Base idea:
INSERT IGNORE INTO items (id, name, address, phone) VALUES (x,xx,xxx,xxxx), (y,yy,yyy,yyyy), etc;
SHOW WARNINGS;
$warning_example = [0 => ['Message' => "Duplicate entry '123456' for key 'id'"], 1 => ['Message' => "Duplicate entry '234567' for key 'id'"]];
$duplicates_count = 0;
foreach($warning_example as $duplicated_item) {
preg_match('/regex_to_extract_id/', $duplicated_item['Message'], $result);
$id[$duplicates_count] = $result;
$duplicates_count++;
}
$duplicates_string = implode(',',$id);
SELECT name FROM items WHERE id IN ($duplicates_string);
Also, what would be the simplest and most efficient regex for this task, since the message structure is the same every time?
Duplicate entry '12345678' for key 'id'
Duplicate entry '23456789' for key 'id'
etc.
With preg_match:
preg_match(
"/Duplicate entry '(\d+)' for key 'id'/",
$duplicated_item['Message'],
$result
);
$id[$duplicates_count] = $result[1];
(\d+) matches a sequence of digits (\d), which is captured by the surrounding parentheses.
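Put together, the whole flow might look roughly like the sketch below. It assumes a PDO connection in $pdo and the items table from the question; this is illustrative, not a drop-in implementation.
// Sketch: INSERT IGNORE, then read the warnings from the same connection
$pdo->exec("INSERT IGNORE INTO items (id, name, address, phone)
            VALUES (12345678, 'a', 'b', 'c'), (23456789, 'd', 'e', 'f')");
$duplicate_ids = [];
foreach ($pdo->query("SHOW WARNINGS")->fetchAll(PDO::FETCH_ASSOC) as $warning) {
    if (preg_match("/Duplicate entry '(\d+)' for key 'id'/", $warning['Message'], $m)) {
        $duplicate_ids[] = $m[1];
    }
}
if ($duplicate_ids !== []) {
    $placeholders = implode(',', array_fill(0, count($duplicate_ids), '?'));
    $stmt = $pdo->prepare("SELECT name FROM items WHERE id IN ($placeholders)");
    $stmt->execute($duplicate_ids);
    $existing_names = $stmt->fetchAll(PDO::FETCH_COLUMN);
}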
However, there are better ways to proceed if you have control over the way the data is imported. To start with, I would recommend first running a SELECT statement to check whether a record already exists, and running the INSERT only when needed. This avoids generating errors on the database side. It is also much more accurate than using INSERT IGNORE, which essentially ignores every error that occurs during insertion (wrong data type or length, non-nullable value, ...): for this reason, it is usually not a good tool for checking uniqueness.
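As a rough sketch of that SELECT-first approach, assuming a PDO connection in $pdo and that the CSV rows are already parsed into $rows as [id, name, address, phone] arrays (both names are assumptions):
// Which of the uploaded ids already exist?
$csv_ids = array_column($rows, 0);
$placeholders = implode(',', array_fill(0, count($csv_ids), '?'));
$stmt = $pdo->prepare("SELECT id, name FROM items WHERE id IN ($placeholders)");
$stmt->execute($csv_ids);
$existing = $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // [id => name]
// Insert only the rows whose id is not already present
$insert = $pdo->prepare("INSERT INTO items (id, name, address, phone) VALUES (?, ?, ?, ?)");
foreach ($rows as $row) {
    if (!isset($existing[$row[0]])) {
        $insert->execute($row);
    }
}
// $existing doubles as the feedback list of duplicates for the user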
I have found two different ways to, first, get the next invoice number and, then, save the invoice in a multi-tenant database where, of course, each tenant will have their own invoices with their own incremental numbers.
My first (and current) approach is this (it works fine):
Add a new record to the invoices table, without worrying about the invoice number yet (for example, 0 or empty)
I get the unique ID of THAT created record after insert
Now I do a "SELECT table where ID = $lastcreatedID **FOR UPDATE**"
Here I get the latest saved invoice number with "SELECT @A:=MAX(NUMBER)+1 FROM TABLE WHERE......"
Finally I update the previously saved record with that invoice number with an "UPDATE table SET NUMBER = $mynumber WHERE ID = $lastcreatedID"
This works fine, but I don't know if the "for update" is really needed or if this is the correct way to do this in a multi-tenant DB, due to performance, etc.
The second (and simpler) approach is this (and works too, but I don't know if it is a secure approach):
INSERT INTO table (NUMBER,TENANT) SELECT COALESCE(MAX(NUMBER),0)+1,$tenant FROM table WHERE....
That's it
Both methods are working, but I would like to know the differences between them regarding speed, performance, if it may create duplicates, etc.
Or... is there any better way to do this?
I'm using MySQL and PHP. The application is an invoice/sales cloud software that will be used by a lot of customers (tenants).
Thanks
Regardless of whether you're using these values as database IDs or not, re-using IDs is virtually guaranteed to cause problems at some point. Even if you're not re-using IDs, you're going to run into the case where two invoice-creation requests run at the same time and get the same MAX()+1 result.
To get around all this you need to implement a simple sequence generator that locks its storage while a value is being issued, e.g.:
CREATE TABLE client_invoice_serial (
-- note: also FK this back to the client record
client_id INTEGER UNSIGNED NOT NULL PRIMARY KEY,
serial INTEGER UNSIGNED NOT NULL DEFAULT 0
);
$dbh = new PDO('mysql:...');
/* this defaults to 'on', making every query an implicit transaction. it needs to
be off for this. you may or may not want to set this globally, or just turn it off
before this, and back on at the end. */
$dbh->setAttribute(PDO::ATTR_AUTOCOMMIT,0);
// simple best practice: ensures that SQL errors MUST be dealt with. Assumed to be enabled for the try/catch below.
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->beginTransaction();
try {
// the below will lock the selected row
$select = $dbh->prepare("SELECT * FROM client_invoice_serial WHERE client_id = ? FOR UPDATE;");
$select->execute([$client_id]);
if( $select->rowCount() === 0 ) {
$insert = $dbh->prepare("INSERT INTO client_invoice_serial (client_id, serial) VALUES (?, 1);");
$insert->execute([$client_id]);
$invoice_id = 1;
} else {
$invoice_id = $select->fetch(PDO::FETCH_ASSOC)['serial'] + 1;
$update = $dbh->prepare("UPDATE client_invoice_serial SET serial = serial + 1 WHERE client_id = ?");
$update->execute([$client_id]);
}
$dbh->commit();
} catch(\PDOException $e) {
// make sure that the transaction is cleaned up ASAP, then let the exception bubble up into your general error handling.
$dbh->rollback();
throw $e; // or throw a more pertinent error/exception of your choosing.
}
// both committing and rolling back will release the lock
At a very basic level, this is what MySQL does in the background for AUTO_INCREMENT columns.
Do not use MAX(id)+1. It will, someday, bite you. There will be two invoices with the same number, and it will take us a few paragraphs to explain why it happened.
Instead, use AUTO_INCREMENT the way it is intended.
INSERT INTO Invoices (id, ...) VALUES (NULL, ...);
SELECT LAST_INSERT_ID(); -- specific to the connection
That is safe even when multiple connections are doing the same thing. No FOR UPDATE, no BEGIN, etc is necessary. (You may want such for other purposes.)
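In PHP the same pattern might look roughly like this (a sketch assuming a PDO connection in $dbh; the column names other than id are made up):
// id is AUTO_INCREMENT, so it is simply omitted from the INSERT
$stmt = $dbh->prepare("INSERT INTO Invoices (tenant_id, amount) VALUES (?, ?)");
$stmt->execute([$tenant_id, $amount]);
// lastInsertId() is per-connection, just like SELECT LAST_INSERT_ID()
$invoice_id = $dbh->lastInsertId();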
And, never delete rows. Instead, use the standard business practice of invalidating bad invoices. Imagine being audited.
All that said, there is still a potential problem. After a ROLLBACK or system crash, an id may be "burned". Also things like INSERT IGNORE allocate the id before checking to see whether it will be needed.
If you can live with the caveats, use AUTO_INCREMENT.
If not, then create a 1-row, 2-column table to simulate a sequence number generator: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#sequence
Or use MariaDB's SEQUENCE
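If you go the MariaDB (10.3+) route, a SEQUENCE could be used roughly like this; the sequence name is made up, and per-tenant numbering would need one sequence per tenant:
CREATE SEQUENCE invoice_seq START WITH 1 INCREMENT BY 1;
-- atomically issues the next number, safe across concurrent connections
SELECT NEXT VALUE FOR invoice_seq;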
Both approaches work, but each has its own drawbacks in high-traffic situations.
The first approach runs 3 queries for every invoice you create, putting extra load on your server.
The second approach can lead to duplicates when two invoices are generated with very little time between them (such that the SELECT query returns the same max number for both invoices).
Both approaches may therefore cause problems under high traffic. Two ways to work around them are listed below:
Use generated columns: MySQL supports generated columns, whose values are derived from other columns of the same row; see the MySQL documentation on generated columns.
Calculate the invoice number on the fly: since you're using the primary key as part of the invoice, let the DB handle generating unique primary keys, and then generate invoice numbers on the fly in your business logic using the id of each invoice (a sketch follows below).
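A sketch of that second idea: store only the AUTO_INCREMENT id, and derive the displayed invoice number from it in your business logic. The format below is purely hypothetical:
// Hypothetical display format: tenant prefix + zero-padded AUTO_INCREMENT id
function invoiceNumber(int $tenantId, int $invoiceId): string
{
    return sprintf('%04d-%06d', $tenantId, $invoiceId);
}
echo invoiceNumber(42, 1375); // "0042-001375"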
This one might be a bit complicated. I searched for similar questions and found nothing that seemed relevant.
Let me start by establishing my database structure.
I have several tables, but the relevant ones are as follows:
Right now I have the decklist stored as a comma-delimited string of cardids. I realize this is inefficient, and when I get around to improving my code I will make a new table tcg_card_in_deck that has relationid, cardid, deckid. For now my code assumes a decklist string.
I'm building a function to allow purchases of a deck. In order to give them the actual cards, I have the following query (generated with PHP; in practice it will contain about 50 entries):
$db->query_write("INSERT INTO
`tcg_card`
(masterid, userid, foil)
VALUES
('159', '15', '0'),
('209', '15', '0'),
('209', '15', '0'),
('318', '15', '0')");
This part is easy. My issue now is making sure the cards that have just been added can have their ids grabbed and put together in an array (to enter in as a string currently, and as entries into the separate table once the code is updated). If it was one entry I could use LAST_INSERT_ID(). If I did 50 separate insert queries I could grab the id on each iteration and add them into the array. But because it's all done with one query, I don't know how to effectively find the correct cards to put in the decklist. I suppose I could add a dateline field to the cards table to specify date acquired, but that seems sloppy and it may produce flawed results if a user gets cards from a trade or a booster pack in a similar timeframe.
Any advice would be appreciated!
Change tcg_card by removing cardid, and make masterid and userid a compound key. Then add a column called quantity. Since you cannot distinguish between two copies of a card in any meaningful way (except for foils, which you could still handle with this schema), there is no need for every card to get its own ID.
Presumably you aren't entering new tcg_master rows dynamically, so you don't have to worry about pulling their IDs back out.
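A rough sketch of that schema, with the names taken from the question and the foil flag folded into the key as suggested:
-- One row per (card, owner, foil flag) with a count, instead of one row per copy
CREATE TABLE tcg_card (
    masterid INT UNSIGNED NOT NULL,
    userid   INT UNSIGNED NOT NULL,
    foil     TINYINT(1)   NOT NULL DEFAULT 0,
    quantity INT UNSIGNED NOT NULL DEFAULT 1,
    PRIMARY KEY (masterid, userid, foil)
);
-- Granting a card then becomes an upsert rather than a plain insert
INSERT INTO tcg_card (masterid, userid, foil, quantity)
VALUES (209, 15, 0, 1)
ON DUPLICATE KEY UPDATE quantity = quantity + 1;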
Reading comments on this question I thought of a very simple and easy solution:
I already track booster pack purchases with a table tcg_history. This table can also track other types of purchases, such as a starter deck.
I simply need to add a field on the tcg_card table that references a tcg_history.recordid, then I will be able to select all cards that are from that purchase.
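As a rough sketch (the historyid column name is made up; recordid comes from tcg_history):
-- Link each granted card back to the purchase that produced it
ALTER TABLE tcg_card ADD COLUMN historyid INT UNSIGNED NULL;
-- Include it in the multi-row INSERT for the purchase, then fetch the cards back:
INSERT INTO tcg_card (masterid, userid, foil, historyid)
VALUES (159, 15, 0, 123), (209, 15, 0, 123);
SELECT cardid FROM tcg_card WHERE historyid = 123;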
I'm working with importing CSV files into a database, and it is a multi-step process.
Here are the steps:
Step 1: User uploads a CSV file.
Step 2: User associates the data to a data type. For example, if a record in the CSV contains the following data: John,Doe,johndoe@gmail.com, the user would select firstname from a dropdown box to associate with the data value John, lastname from a dropdown box to associate with the data value Doe, and emailaddress from a dropdown box to associate with the data value johndoe@gmail.com.
Step 3: Insert data into database
My questions are the following:
1./ On step 3, I would have in my possession the columns which the user chose and the original data.
Here is what the original data looks like:
$data = array(
0 => array('John','Doe','johndoe#gmail.com'),
1 => array('Foo','Bar','foobar#gmail.com')
);
And here is what my columns chosen from step 2 looks like:
$columns = array('firstname','lastname','emailaddress')
How do I create a sql query that can be like the following:
INSERT INTO contacts (id,firstname,lastname,emailaddress) VALUES (null,'John','Doe','johndoe@gmail.com')
As you can see, the SQL query has the columns in the order they were chosen within the array, and then the values in the same order. I was thinking that since the columns are chosen in the order of the data, I can just assume that the data is in the correct order and is associated with the column at the same position (for example, I can assume that the data value 'John' corresponds to the first entry of the columns array, firstname).
2./ I was thinking of a possible scenario that when the user does the initial upload of the file, they could potentially send a csv file with the first record having a blank field. The problem is, I determine how many columns to have the user associate to the data based on the number of columns within a csv record. In this case, we have 2 columns and every subsequent record has 3 columns. Well, I'm not going to loop through the entire set of records to determine the correct number of columns. How do I resolve this issue? Any ideas?
EDIT
I think I figured out the answer to question 2: while parsing the CSV file, I can get a field count for each record, and the highest count at the end of the parsing is my column count. Seems right? Any issues with that?
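A minimal sketch of tracking the widest record while parsing (it assumes the file is already opened as $handle):
$data = array();
$maxCols = 0;
while (($lineFields = fgetcsv($handle)) !== false) {
    $data[] = $lineFields;
    $maxCols = max($maxCols, count($lineFields));
}
// $maxCols is the number of dropdowns to offer in step 2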
To parse the data from the CSV file, look at fgetcsv. http://php.net/manual/en/function.fgetcsv.php
It'll load a line from the file and return an array of the CSV fields.
$data = array();
while (($lineFields = fgetcsv($handle)) !== false) {
$data[] = $lineFields;
}
This assumes you are using PHP5 and opened the file with $handle. In PHP4 fgetcsv needs a second parameter for max length of line to read.
For the query:
$sql = "INSERT into contacts (id," + implode(',', $columns) + ") values";
I'm not including anything after the values: you should be creating prepared statements to protect against SQL injection. Also, if you are using MySQL, id should be an AUTO_INCREMENT field and omitted from inserts (let MySQL generate it). If you are using Postgres, you'll need to create a sequence for the id field. In either case, let the database generate the id for you.
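As a rough sketch of the prepared-statement version, assuming a PDO connection in $pdo and that $columns can only ever contain the fixed set of column names offered in the step 2 dropdowns:
$cols = implode(',', $columns);
$placeholders = implode(',', array_fill(0, count($columns), '?'));
// id is omitted and left to the database to generate
$stmt = $pdo->prepare("INSERT INTO contacts ($cols) VALUES ($placeholders)");
foreach ($data as $row) {
    $stmt->execute($row);
}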
I've got a problem that I just can't seem to find the answer to. I've developed a very small CRM-like application in PHP that's driven by MySQL. Users of this application can import new data to the database via an uploaded CSV file. One of the issues we're working to solve right now is duplicate, or more importantly, near duplicate records. For example, if I have the following:
Record A: [1, Bob, Jones, Atlanta, GA, 30327, (404) 555-1234]
and
Record B: [2, Bobby, Jones, Atlanta, GA, 30327, Bob's Shoe Store, (404) 555-1234]
I need a way to see that these are both similar, take the record with more information (in this case record B) and remove record A.
But here's where it gets even more complicated. This must happen both when importing new data and via a function I can execute at any time to remove duplicates from the database. I have been able to put something together in PHP that gets all duplicate rows from the MySQL table and matches them up by phone number, or by using implode() on all columns in the row and then using strlen() to decide which record is the longest.
There has got to be a better way of doing this, and one that is more accurate.
Do any of you have any brilliant suggestions that I may be able to implement or build on? It's obvious that when importing new data I'll need to open their CSV file into an array or temporary MySQL table, do the duplicate/similar search, then recompile the CSV file or add everything from the temporary table to the main table. I think. :)
I'm hoping that some of you can point out something that I may be missing that can scale somewhat decently and that's somewhat accurate. I'd rather present a list of duplicates we're 'unsure' about to a user that's 5 records long, not 5,000.
Thanks in advance!
Alex
If I were you, I'd put a UNIQUE key on name, surname and phone number, since in theory, if all three are equal, the record is a duplicate; after all, a phone number can have only one owner. In any case, you should find a combination of two to four columns and give them a unique key. Once you have such a structure, run something like this:
// assuming that you have defined something like the following in your CREATE TABLE:
UNIQUE(phone, name, surname)
// then you should perform something like:
INSERT INTO your_table (phone, name, surname) VALUES ($val1, $val2, $val3)
ON DUPLICATE KEY UPDATE phone = IFNULL($val1, phone),
name = IFNULL($val2, name),
surname = IFNULL($val3, surname);
So basically, if the inserted value is a duplicate, this code will update the row rather than inserting a new one. The IFNULL function checks whether the first expression is null; if it is, it picks the second expression, which in this case is the column value that already exists in your table. Hence, it will update your row with as much information as possible.
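With a prepared statement, the same idea could look roughly like this (a sketch assuming a PDO connection in $pdo and the three values in $phone, $name, $surname; VALUES(col) refers to the value the INSERT attempted for that column):
$sql = "INSERT INTO your_table (phone, name, surname) VALUES (?, ?, ?)
        ON DUPLICATE KEY UPDATE
            phone   = IFNULL(VALUES(phone), phone),
            name    = IFNULL(VALUES(name), name),
            surname = IFNULL(VALUES(surname), surname)";
$stmt = $pdo->prepare($sql);
$stmt->execute([$phone, $name, $surname]);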
I don't think there are brilliant solutions. You need to decide which data fields you can rely on for detecting similarity, and in what order of priority: for example the phone number, some kind of ID, or a normalized address or official name.
You can store some cleaned-up values (reduced to a common format, such as only digits for phone numbers, or a concatenated full address) alongside each row, and use them for similarity searches when adding records.
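For example, a phone number could be normalized like this before it is stored and compared (a sketch; the function name is made up):
// Keep digits only, so "(404) 555-1234" and "404.555.1234" compare equal
function normalizePhone(string $phone): string
{
    return preg_replace('/\D+/', '', $phone);
}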
Then, for each match, you need to compare data completeness and decide whether to update the existing row with the more complete fields, or to delete the old row and add the new one.
I don't know of any ready-made solutions for such a variable task, and I doubt they exist.
I'm trying to figure out the best way to store and retrieve multiple entries in a database, using explode, split, or preg_split. What I need is for a user to use a text field in a form either to send messages to several users or to share data with several users by entering their IDs like "101,102,103", and for the PHP code to be smart enough to grab each ID by splitting on the ",". I know this is asking a lot, but I need help from people more skilled in this area. I need to know how to make the PHP code grab the IDs and use functions with them, like grabbing "101,102,103" from a database cell and then fetching the different information stored in the database for the IDs grabbed from that one string.
How can I achieve this? Example will be very helpful.
Thanks
If I understand your question correctly, since you're dealing with comma-delimited strings of ID numbers, it would probably be simplest to keep them in this format, because you can then use the string directly in your SQL statement when querying the database.
I'm assuming that you want to run a SELECT query to grab the users whose IDs have been entered, correct? You'd want to use a SELECT ... WHERE IN ... type of statement, like this:
// Get the ids the user submitted
$ids = $_POST['ids'];
// perform some sanitizing of $ids here to make sure
// you're not vulnerable to an SQL injection
$sql = "SELECT * FROM users WHERE ID IN ($ids)";
// execute your SQL statement
Alternatively, you could use explode to create an array of each individual ID, and then loop through so you could do some checking on each value to make sure it's correct, before using implode to concatenate them back together into a string that you can use in your SELECT ... WHERE IN ... statement.
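For example, one simple way to sanitize the list is to force every piece to an integer before rebuilding the string (a sketch; it assumes the submitted list is not empty):
$ids = implode(',', array_map('intval', explode(',', $_POST['ids'])));
$sql = "SELECT * FROM users WHERE ID IN ($ids)";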
Edit: sorry, forgot to add: in terms of storing the list of user ids in the database, you could store the comma-delimited list as a string against a message id, but that has drawbacks (it is difficult to JOIN other tables when you need to). The better option would be to create a lookup table, which basically consists of two columns: messageid, userid. You could then store each individual userid against the messageid, e.g.
messageid | userid
1 | 1
1 | 3
1 | 5
The benefit of this approach is that you can then use this table to join other tables (maybe you have a separate message table that stores details of the message itself).
Under this method, you'd create a new entry in the message table, get the id back, then explode the userids string into its separate parts, and finally create your INSERT statement to insert the data using the individual ids and the message id. You'd need to work out other mechanisms to handle any editing of the list of userids for a message, and deletion as well.
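A rough sketch of that flow, assuming a PDO connection in $pdo; apart from the messageid/userid columns above, the table and column names are made up:
// 1. create the message and get its id back
$pdo->prepare("INSERT INTO message (body) VALUES (?)")->execute([$body]);
$messageId = $pdo->lastInsertId();
// 2. explode the submitted "101,102,103" string and link each user to the message
$link = $pdo->prepare("INSERT INTO message_users (messageid, userid) VALUES (?, ?)");
foreach (explode(',', $userIds) as $userId) {
    $link->execute([$messageId, (int) $userId]);
}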
Hope that made sense!
Well, considering the three functions you suggested:
explode() will work fine if you have a simple pattern that's always the same.
For instance, always ', ', but never ','
split() uses POSIX regex -- which are deprecated -- and should not be used anymore.
preg_split() uses a regex as its pattern and so will accept more situations than explode().
Then: do not store several values in a single database column: it'll be impossible to do any kind of useful work with that!
Create a separate table to store that data, with a single value per row, and with several rows corresponding to one row in the first table.
I think your problem is more with SQL than with PHP.
Technically you could store the ids in a single MySQL field, as a 'set' field, and query against it using IN or FIND_IN_SET in your conditions. The lookups are actually super fast, but this is not considered best practice and creates a de-normalized database.
What is best practice, and normalized, is to create separate relationship tables. So, using your example of messages, you would probably have a 'users' table, a 'messages' table, and a 'users_messages' table for relating messages between users. The 'messages' table would contain the message information and maybe a 'user_id' field for the original sender (since there can only be one), and the 'users_messages' table would simply contain a 'user_id' and 'message_id' field, containing rows linking messages to the various users they belong to. Then you just need to use JOIN queries to retrieve the data, so if you were retrieving a user's inbox, a query would look something like this:
SELECT
messages.*
FROM
messages
LEFT JOIN users_messages ON users_messages.message_id = messages.message_id
WHERE
users_messages.user_id = '(some user id)'
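For reference, the tables behind that query might be laid out roughly like this (the column types and the body column are assumptions):
CREATE TABLE messages (
    message_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,   -- original sender
    body       TEXT NOT NULL
);
CREATE TABLE users_messages (
    message_id INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    PRIMARY KEY (message_id, user_id)
);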