MySQL Insert: Test first? - php

As an example, when inserting a record into a table with a unique index, is it best to test first? e.g.,
$mysqli->query("SELECT email FROM tblUser WHERE email = 'foo@bar.org'");
then make sure 0 rows are returned, then do the insert?
$mysqli->query('INSERT INTO tblUser ...');
Or is it better to just skip the test and handle the error in the event there's a duplicate entry?
THANKS!

It's better to insert and handle any duplicate key errors.
The reason is that if you test first, some other client can still insert the value in the brief moment between your test and your insert. So you'd need to handle errors anyway.
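For instance, a minimal sketch with mysqli configured to throw exceptions (1062 is MySQL's ER_DUP_ENTRY code; the statement itself is illustrative):
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // make mysqli throw on errors
$stmt = $mysqli->prepare('INSERT INTO tblUser (email) VALUES (?)');
$email = 'foo@bar.org';
$stmt->bind_param('s', $email);
try {
    $stmt->execute();
} catch (mysqli_sql_exception $e) {
    if ($e->getCode() === 1062) { // duplicate entry on the unique email index
        // tell the user the address is already registered
    } else {
        throw $e; // anything else is a genuine failure
    }
}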

Broadly speaking, there are three ways to handle this situation with a single query (fewer queries is usually a good thing to shoot for), but none of them is a universal "best way". Which you should use depends on your needs.
The first is, as you mention, running the INSERT blindly and handling any errors in PHP. This is the best approach when a duplicate key indicates a procedural problem (a bug in the software, a user trying to register a name that's already been used, etc.), as it allows you to perform additional operations before committing to a database update.
Second, there is the INSERT IGNORE … syntax. I would tend to call this the least commonly-useful approach, as it discards your INSERT completely if the key already exists. Primarily useful when a row (or rows) may or may not have been added to the table previously, but the data is known not to have changed.
Lastly, you can use an INSERT … ON DUPLICATE KEY UPDATE … statement. These can get rather verbose, but are very handy, as they allow you to insert data into your table without worrying about whether older data exists. If so, the existing row is updated. If not, a new one is inserted. Either way, your table will have the latest data available.
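To make the three variants concrete, here is a hypothetical sketch against the question's tblUser table (the name column is invented for illustration):
-- 1. Plain INSERT: raises a duplicate-key error (1062) if the email already exists.
INSERT INTO tblUser (email, name) VALUES ('foo@bar.org', 'Foo');
-- 2. INSERT IGNORE: silently discards the new row if the email already exists.
INSERT IGNORE INTO tblUser (email, name) VALUES ('foo@bar.org', 'Foo');
-- 3. IODKU: updates the existing row instead of failing.
INSERT INTO tblUser (email, name) VALUES ('foo@bar.org', 'Foo')
ON DUPLICATE KEY UPDATE name = VALUES(name);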

MySQL supports INSERT IGNORE if you want to ignore an insert that would create a row whose key value already exists in another row.
Just make sure there's a unique index on email in tblUser and do
$mysqli->query('INSERT IGNORE INTO tblUser ...');

It depends on whether you need to ensure that the values you are inserting don't already exist. If you have a unique key on the field, then it is important that you do not create a duplicate key (which will throw an error). Often you also want to test whether a record exists so that, if it does, you can return its primary key and update the record, and if not, insert it.
But if you have no unique keys and don't care whether information is duplicated across a field or combination of fields, then the test isn't necessary and skipping it can save a little time. It just depends on the situation.
HTH

Often depends on what rules about data duplication apply.
In your example, does your app permit more than one user to have the same email address? If not then you'd need to perform that check.

You definitely want to test first and you may want to test a few things so you can tell the user what went wrong.
For example, I just finished a job where a user needed a unique username and a unique email address.

Related

How to avoid inserting duplicates into MySQL?

I have an analytics platform with lots of users and hundreds of clicks being inserted per minute.
Sometimes I see that the exact same click is inserted into the database twice within the same second, becoming a duplicate of the other.
I have a system that checks whether the table already contains the same value and blocks the second insert if it finds one.
In this case, however, it looks like they're inserted into the DB in the exact same millisecond.
What can I do here?
My favorite: INSERT IGNORE INTO myTable (col1, col2, ...) ...
where unique key(s) are set up beforehand to forbid the duplicate insert. It would appear that you do not care so much whether a row was previously inserted as you care that the end result contains no dupes.
Note: the unique keys can be multi-column (composite) keys.
A word of warning about INSERT IGNORE: it should not be implemented without careful thought about its ramifications for sensitive systems that need to know whether the row was truly already there. It is ideal for "make sure it is there".
Option B: One could look into intention locks, crafted for your particular use-case (a minimal sketch follows at the end of this answer). Steer toward InnoDB row-level locking, which is swift, and certainly not table locks. Most things come with a trade-off; the downside of locking is diminished concurrency.
Option C: For the faint-of-heart (sometimes me), and what I would do if hired out and wishing to avoid peer backlash later: perform an INSERT ... ON DUPLICATE KEY UPDATE (IODKU), with a bogus int column like touches that you increment in the UPDATE part. Example below:
INSERT INTO myTable (col1, col2, col3) VALUES (p1, p2, p3)
ON DUPLICATE KEY UPDATE touches = touches + 1;
That above is a most minimalist form. In my own C# code I care about more columns in the "update part of the IODKU", but the statement takes the same shape.
A final thought on IODKU: it is mandatory to have a unique key (primary or just unique) that causes the "clash" to occur; that is how the statement knows whether to perform the insert or the update. Without such a unique-key clash, a new row is simply inserted.
Back to the OP's issue: the reason your system probably already had the row there is high-concurrency use without locking.
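And the promised sketch for Option B, with invented table and column names; note that blocking inserts of a not-yet-existing row relies on InnoDB gap locks under REPEATABLE READ and an index on the searched columns:
START TRANSACTION;
-- Lock the matching row, or (under REPEATABLE READ) the gap where it would go:
SELECT id FROM clicks WHERE user_id = 42 AND click_hash = 'abc' FOR UPDATE;
-- If the SELECT returned no row, it is now safe to insert:
INSERT INTO clicks (user_id, click_hash, clicked_at) VALUES (42, 'abc', NOW());
COMMIT;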
If the system's architecture allows it, I would create a two-tier solution. First, a temporary table where possibly-duplicate data is inserted. The temporary table's name can contain a sharding parameter, for example an hour number. The system periodically exports data from the temporary tables into the main storage table, discarding duplicates, and can then drop the temporary tables.
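A hypothetical sketch of that idea, with all names invented (the main clicks table carries the unique key; the hour-sharded tables do not):
-- Clients write to the hour-sharded table without any uniqueness check:
INSERT INTO clicks_hour_14 (user_id, click_hash, clicked_at) VALUES (42, 'abc', NOW());
-- A periodic job flushes it into the main table, discarding duplicates:
INSERT IGNORE INTO clicks (user_id, click_hash, clicked_at)
SELECT user_id, click_hash, clicked_at FROM clicks_hour_14;
DROP TABLE clicks_hour_14;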

Searching for duplicate entries with PDO

I'm having a spot of trouble with a bit of code meant to find duplicates of a name along with the platform. This will also be adapted to find unique IDs later on.
So for example, if there is a server named "Apple" on the Xbox and you try to insert a record with the name "Apple" with the same platform it will reject it. However, another platform with the same name is allowed, such as "Apple" with PS3.
I've tried coming up with ideas and searching for answers, but I'm kind of in the dark as to what is the best way to go about checking for duplicates.
So far this is what I have:
$nameDuplicate_sql = $db->prepare("SELECT * FROM `servers` WHERE name=':name' AND platform=':platform'");
$nameDuplicate_sql->bindValue(':name', $name);
$nameDuplicate_sql->bindValue(':platform', $platform);
$nameDuplicate_sql->execute();
I've tried a bunch of different solutions, some from here, others from the PHP's manual and etc. None appear to work though.
I'm trying to stick with PDO; however, this is one instance where I cannot figure out where to turn. If this were mysql_* I could probably just use mysql_affected_rows, but with PDO I have no clue. rowCount seemed promising, but it always returns 0, since this is not an INSERT, UPDATE, or DELETE statement.
Oh, and I've tried the SQL statement in phpMyAdmin and it works; I tried it with a simple name/platform and it found rows properly.
If anyone can help me out here I'd appreciate it.
For most databases, PDOStatement::rowCount() does not return the number of rows affected by a SELECT statement. Instead, use PDO::query() to issue a SELECT COUNT(*) statement with the same predicates as your intended SELECT statement, then use PDOStatement::fetchColumn() to retrieve the number of rows that will be returned.
Your application can then perform the correct action.
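Applied to the question's query, a sketch might look like this (same $db, $name, and $platform as above; a prepared statement is used instead of PDO::query() since the values are user-supplied):
$count_sql = $db->prepare('SELECT COUNT(*) FROM `servers` WHERE name = :name AND platform = :platform');
$count_sql->bindValue(':name', $name);
$count_sql->bindValue(':platform', $platform);
$count_sql->execute();
if ((int) $count_sql->fetchColumn() > 0) {
    // duplicate name/platform combination: reject the insert
}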
Instead of checking for duplicates, why not just enforce it on the database table directly? Create a composite key that will prohibit entries being made if they are already there?
CREATE TABLE servers (
serverName varchar(50),
platform varchar(50),
PRIMARY KEY (serverName, platform)
)
This way, you will never get duplicates, and it also allows you to use the mysql insert... on duplicate key update... syntax which sounds like it might be rather handy for you.
If you already have a Primary Key on it or you don't want to make a new table, you can use the following:
ALTER TABLE servers DROP PRIMARY KEY, ADD PRIMARY KEY(serverName, platform);
Edit: A primary key is either a single column or a combination of columns that must hold unique data. A single-column key cannot contain the same value twice, but a composite key (which is what I am suggesting here) means that the same combination of values cannot appear twice across the two columns.
In this case, what you want to do is add a server name and have it associated with a platform. The table will let you add as many rows containing the same server name as you like, as long as each one has a unique platform associated with it - and vice versa: you can have a platform listed as many times as you like, as long as all the server names for it are unique.
If you try to insert a record where the same serverName/platform combination already exists, the database simply won't let you do it. There is another golden benefit, though. Due to this key constraint, MySQL allows a special type of query to be used: the insert... on duplicate key update syntax. That means if you try to insert the same data twice (i.e., the database says no) you can catch it and update the row you already have in the table. For example:
You have a row with serverName=Fluffeh and it is on platform=Boosh but you don't know about it right now, so you try to insert a record with the intention of updating the server IP address.
Normally you would simply write something like this:
insert into servers (serverName, platform, IPAddress)
values ('$serverName', '$platform', '$IPAddy')
But with a nice primary key identified you can do this:
insert into servers (serverName, platform, IPAddress)
values ('$serverName', '$platform', '$IPAddy')
on duplicate key update IPAddress='$IPAddy';
The second query will insert the row with all the data if it doesn't exist already. If it does, bam! It will update the IP address of the server, which was your intention all along.
Remove the single quotes from the parameter tokens in your query... they will be quoted once they are bound... that's part of the point of a prepared statement.
$nameDuplicate_sql = $db->prepare("SELECT * FROM `servers` WHERE name= :name AND platform= :platform");
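With that change, the check itself can be done with fetch() rather than rowCount(), for example:
$nameDuplicate_sql->bindValue(':name', $name);
$nameDuplicate_sql->bindValue(':platform', $platform);
$nameDuplicate_sql->execute();
if ($nameDuplicate_sql->fetch(PDO::FETCH_ASSOC) !== false) {
    // a row came back: this name/platform combination already exists
}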

How do I guarantee uniqueness of a table value only within a subset of the table rows?

I have a table of projects belonging to various users:
project_id, owner_user_id, project_name
I do not need the project_names to be globally unique to the table, so making project_name UNIQUE does not help. I would just like to prevent the user from creating duplicate project_names on INSERT or UPDATE.
Upon INSERT/UPDATE, I simply want to check if there is already a project_name belonging to a specific owner_user_id, and if it already exists, the INSERT/UPDATE should fail.
I could use a SELECT to first check for existence of the project_name within the user's projects, and then only do an INSERT/UPDATE if the select returns no results. But this is multi-threaded and another thread could INSERT the same project_name immediately after I perform the SELECT but before the INSERT/UPDATE. Putting this all into a transaction feels like overkill. Is there a single query that can perform this instead?
You could add a UNIQUE constraint on the two columns as a pair:
alter table your_table add unique (owner_user_id, project_name)
That will ensure that project_name values are unique per user. You'll want to have a look at your collation setup to make sure your project_name values are compared without regard to case. Or you could standardize the project names to title case before hitting the database.
Don't try to maintain data integrity by hand unless you have to, let the database take care of your constraints whenever possible.
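With the constraint in place, the application only has to attempt the INSERT and catch the violation. A minimal PDO sketch, assuming exception error mode and a hypothetical projects table matching the question's columns:
try {
    $stmt = $pdo->prepare('INSERT INTO projects (owner_user_id, project_name) VALUES (:uid, :name)');
    $stmt->execute([':uid' => $ownerUserId, ':name' => $projectName]);
} catch (PDOException $e) {
    if ($e->errorInfo[1] == 1062) { // MySQL ER_DUP_ENTRY: name already taken for this user
        // report the duplicate to the user
    } else {
        throw $e;
    }
}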
This does need to be in a transaction. You need to retrieve some information ("which names are already in use?") and then act on it ("if my name is not in use, then use it"). This must be done atomically.
As you have correctly surmised, there is a race condition if the insert does not happen atomically after the check.
This is what transactions are for.
You can add a unique constraint on both fields
CONSTRAINT C_UNICITY UNIQUE (owner_user_id, project_name)
Each time you try to insert or update a record that would create a duplicate, you'll get a SQL error.
$result = mysql_query("select * from Project where owner_user_id='1';");
if (mysql_affected_rows()==0) {
$result = mysql_query("insert into Project (projectname) values ('pojectname');");
Depending on database referential integrity violations to throw errors for you to trap is not generally a preferred form of UI validation - you generally want something at a higher abstraction level anyway. But there's nothing particularly "overkill" about using transactions and UNIQUE constraints liberally to protect your data as much as your users.

3 in 1 mysql statement

Is it possible, in one SQL statement, to insert a record, take its auto-increment id, and update one specific column of that same record with this auto-increment value?
Thanks in advance.
Strictly speaking you can not do it in a single SQL statement (as others have already pointed out).
However, since you mention that you want to avoid making changes to the legacy application, let me clarify some options that might work for you.
If you had a trigger on the table that would update the second column, then issuing a single insert would give you what you want, and you might not need to change anything in the application.
If possible, you could rename the table and in its place put a VIEW with the same name. With such a simple view it might be transparent to your application (I'm not sure the VIEW would remain updatable with your framework, but generally speaking it should).
Finally, with the mysqli library you are free to issue multiple SQL statements in a single call to the database - which might be enough for you, depending on how exactly you define 'single statement'.
None of the above will ever be comparable to fixing the application, in terms of maintainability for the person who inherits your code.
Doing an insert automatically fills in the value for an auto_increment column (just define it to use AUTO_INCREMENT). There is no need to have the same value twice in one record.
Doing an UPDATE + INSERT together is not possible in a single query.
I found this article that may be of interest to you:
http://www.daniweb.com/forums/thread107837.html
They suggest it is possible to do the insert and update in one query.
They show a query like:
INSERT INTO table (FIELD) VALUES (value) ON DUPLICATE KEY UPDATE FIELD=value
I hope this helps, and to all the naysayers: anything is possible.
While I believe it is possible, your safest bet is probably to split this operation up into three stages.
I successfully did this on my own database locally with this code:
INSERT INTO status SET status_id = 5 ON DUPLICATE KEY UPDATE status_id = 5;
SELECT LAST_INSERT_ID();
You should be able to transform it to work for you.
You can write an AFTER INSERT trigger which takes MAX(id) and updates the record - though note that MySQL does not allow a trigger to modify the table it fired on, which limits this approach.
That's not possible at all.
You have to either do this separately, or create a function/stored procedure to achieve it.
Multiple statements can be separated by a semicolon, but I believe you need to use a function in PHP to get the autoincrement value. Your best bet might be to use a stored procedure.
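For what it's worth, the two-statement version is short in PHP; a sketch with a hypothetical table t that has an auto-increment id and a copy_of_id column:
$mysqli->query("INSERT INTO t (name) VALUES ('example')");
$id = $mysqli->insert_id; // auto-increment id generated by that INSERT
$mysqli->query("UPDATE t SET copy_of_id = $id WHERE id = $id");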

Key problem: Which key strategy should I use in my database?

Problem: When I use an auto-incrementing primary key in my database, this happens all the time:
I want to store an Order with 10 Items. The ordered Items belong to the Order. So I store the order, ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?), and then store the 10 Items with the foreign key (order_id).
So I always have to do:
INSERT ...
last_inserted_id = db.lastInsertId();
INSERT ...
INSERT ...
INSERT ...
and I believe this prevents me from using transactions in almost all INSERT cases where I need a foreign key.
So... here are some solutions, and I don't know if they're really good:
A) Don't use auto_increment keys! Use a key table?
The key table would have two fields: table_name, next_key. Every time I need a key for a table to insert a new dataset, I first ask for the next_key by accessing a special static KeyGenerator class method. This does a SELECT and an UPDATE, if possible in one transaction (would that work?). Of course I would request that for every affected table. Then I can INSERT my entire object graph in one transaction without playing ping-pong with the database, because I know the keys in advance.
B) Use a GUID / UUID algorithm for keys?
These are supposed to be really unique worldwide, and they're LARGE. I mean ... L_A_R_G_E. So a large amount of memory would go into these gigantic keys. Indexing will be hard, right? And data retrieval will be a pain for the database - at least I guess - integer keys are much faster to handle. On the other hand, they also provide some security: visitors can't iterate over all orders or all users or all pictures just by incrementing the id parameter.
C) Stick with auto_incremented keys?
OK, if so, what about transactions like the one described in the example above? How can I solve that? Maybe by inserting a ghost row first and then doing a transaction with one UPDATE + n INSERTs?
D) What else?
When storing orders, you need transactions to prevent situations where only half your products are added to the database.
Depending on your database and your connector, the value returned by the last-insert-id function might be transaction-independent. For instance, with MySQL, mysql_insert_id returns the identifier for the last query from that particular client (without being affected by what other clients are doing concurrently).
Which database are you using?
Yes, typically inserting a record and then trying to select it again to find the auto-generated key is bad, especially if you are using a naive select max(id) from table query. This is because as soon as two threads are creating records max(id) may not actually return the last id your current thread used.
One way to avoid this is to create a sequence in the database. From your code you select sequence.NextValue then use that value to then execute your inserts (or you can craft a more complex SQL statement that does this selection and the inserts in one go). Sequences are atomic / thread-safe.
In MySQL you can ask for the last inserted id from the execution results which I believe will always give you the correct answer.
SQL Server supports SCOPE_IDENTITY (Transact-SQL), which should take care of your transaction and concurrency issues.
I would say stick with auto_increment.
(Assuming you are using MySQL)
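A sketch of the order-plus-items pattern with mysqli (table and column names invented; assumes mysqli is configured to throw exceptions; insert_id is per-connection, so concurrent clients do not interfere):
$mysqli->begin_transaction();
try {
    $mysqli->query('INSERT INTO orders (customer_id) VALUES (42)');
    $orderId = $mysqli->insert_id; // scoped to this connection
    $stmt = $mysqli->prepare('INSERT INTO order_items (order_id, product_id) VALUES (?, ?)');
    foreach ($productIds as $productId) {
        $stmt->bind_param('ii', $orderId, $productId);
        $stmt->execute();
    }
    $mysqli->commit(); // all or nothing: no half-written orders
} catch (Throwable $e) {
    $mysqli->rollback();
    throw $e;
}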
"ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?)"
If you use MySQL's last_insert_id() function, you only see what happened in your session. So this is safe. You mention this:
db.last_insert_id()
I don't know what framework or language that is, but I would assume it uses MySQL's last_insert_id() under the covers (if not, it is a pretty useless database abstraction framework).
"I believe this prevents me from using transactions in almost all INSERT cases where I need a foreign key"
I don't see why. Please explain.
D) Sequence: may not be available in your DBMS, but if it is, it solves your problem elegantly.
For PostgreSQL, have a look at Sequence Functions.
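A PostgreSQL sketch of the pattern (names invented): fetch the key first, then use it everywhere it is needed.
CREATE SEQUENCE order_id_seq;
SELECT nextval('order_id_seq'); -- returns, say, 101
INSERT INTO orders (id, customer_id) VALUES (101, 42);
INSERT INTO order_items (order_id, product_id) VALUES (101, 7);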
There is no final and general answer to this question.
Auto-incrementing columns are easy to use when you add new records. Using them as foreign keys within the same transaction is not so straightforward: you need database-specific commands to get the newly created key. This approach is common for certain databases, for instance SQL Server.
Sequences seem harder to use, because you need to get a key before you insert a row, but in the end it's easier to use them as foreign keys. This approach is common for certain databases, for instance Oracle.
When you use Hibernate or NHibernate, auto-incrementing keys are discouraged, because some optimizations are no longer possible. Using a hi-lo algorithm, which uses an additional table, is recommended.
GUIDs are strong, for instance when sharing data between different databases, systems, disconnected scenarios, import/export, etc. In many databases most tables contain only a few hundred records, so memory and performance are not such an issue. When using NHibernate, you get a GUID generator that produces sequential GUIDs, because some databases perform better when keys are sequential.
