How to avoid inserting duplicates into MySQL? - php

I have an analytics platform with lots of users and hundreds of click inserts per minute.
Sometimes the exact same click is inserted into the database twice within the same second, and one row becomes a duplicate of the other.
I have a check that looks whether the table already has the same value and skips the insert if it does.
However, in this case it looks to me like both rows are inserted into the DB in the exact same millisecond.
What can I do here?

My favorite: INSERT IGNORE INTO myTable (col1, col2, ...) ...
where unique key(s) are set up beforehand to forbid the duplicate insert. It would appear that you do not care so much that the row was previously inserted as much as you care that the end result contains no dupes.
Note: the unique keys can be multi-column keys (composites)
A word of warning about insert ignore: it should not be implemented without careful thought of its ramifications for sensitive systems that need to know that the row was truly already there. It is ideal for "make sure it is there".
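For illustration, a minimal sketch of that setup, assuming (hypothetically) that a click is identified by the composite of user, URL and timestamp; the column names are made up:
-- unique key set up beforehand so a second identical click cannot get in
ALTER TABLE myTable ADD UNIQUE KEY uq_click (user_id, url, clicked_at);
-- with the key in place, the duplicate insert is silently dropped
INSERT IGNORE INTO myTable (user_id, url, clicked_at)
VALUES (42, '/landing-page', '2016-01-01 12:00:00');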
Option B: One could look into intention locks, like here, but crafted for your particular use-case. Steer toward InnoDB row-level locking, which is swift, and certainly not table locks. Most things come with a trade-off. The downside of locking is diminished concurrency.
Option C: For the faint-of-heart (sometimes me). And this is what I would do if I were hired out and wished to avoid peer backlash later. Perform an Insert ... on Duplicate Key Update (IODKU), and have a bogus column like touches, an int that you increment in the Update part of the IODKU. Example below:
insert myTable (col1, col2, col3) values (p1,p2,p3)
on duplicate key update touches=touches+1;
That above is the most minimalist form. In C# I use a version that cares about more columns in the update part of the IODKU, but the idea is the same.
A final thought on IODKU: it is mandatory to have a unique key (primary or just unique) that causes the "clash" to occur. That is how the statement knows whether to perform the insert or the update. Without such a unique key clash, a new row is simply inserted.
Back to the OP's issue: the reason your system probably already had the row there was high-concurrency use without locking.

If the system's architecture allows it, I would create a two-tier solution. First, incoming data (duplicates and all) is inserted into a temporary staging table. The temporary table's name can contain a sharding parameter, for example the hour number. The system then periodically exports data from the temporary tables into the main storage table, discarding duplicate data, and finally drops the temporary tables.
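A hedged sketch of what that periodic export could look like, assuming an hourly staging table named clicks_staging_14 and a unique key on the main table; all names are hypothetical:
-- move rows from the hourly staging table into the main table,
-- letting the unique key drop any duplicates on the way in
INSERT IGNORE INTO clicks_main (user_id, url, clicked_at)
SELECT DISTINCT user_id, url, clicked_at
FROM clicks_staging_14;
DROP TABLE clicks_staging_14;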

Related

SQL - auto increment within group inside one table [duplicate]

I have got a table which has an id (primary key with auto increment), a uid (a key referring to a user's id, for example) and something else which doesn't matter for my question.
I want to have, let's call it, a separate auto-increment sequence on id for each uid value.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...Very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such a functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
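The manual's example boils down to something like this sketch, adapted to the uid/id naming of the question (the table name is made up):
-- MyISAM only: id restarts at 1 for each distinct uid because it is the
-- second column of a multiple-column (composite) primary key
CREATE TABLE per_user_counter (
    uid INT NOT NULL,
    id  INT NOT NULL AUTO_INCREMENT,
    something VARCHAR(100),
    PRIMARY KEY (uid, id)
) ENGINE=MyISAM;
INSERT INTO per_user_counter (uid, something) VALUES (10, 'a');  -- stored as (10, 1)
INSERT INTO per_user_counter (uid, something) VALUES (4,  'b');  -- stored as (4, 1)
INSERT INTO per_user_counter (uid, something) VALUES (4,  'c');  -- stored as (4, 2)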
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. He gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. He also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. The whole table has to be locked, because you're trying to restrict INSERT and there's no specific row to lock yet (if you were governing access to a given row with UPDATE, you could lock just that row). But locking the table makes access to it serial, which limits your throughput. (A sketch of this approach follows after this list.)
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
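As a hedged illustration of the first option, the lock-then-insert sequence could look roughly like this (table and column names are hypothetical):
-- serialize writers so MAX()+1 cannot race; a WRITE lock blocks all other sessions
LOCK TABLES mytable WRITE;
SELECT COALESCE(MAX(id), 0) + 1 INTO @next_id FROM mytable WHERE userid = 4;
INSERT INTO mytable (userid, id, other_column) VALUES (4, @next_id, 'some value');
UNLOCK TABLES;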
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask about efficiency. Unless you are dealing with extreme volumes, storing an 8-byte DATETIME isn't much of an overhead compared to using, for example, a 4-byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle), the general logic is simple (see the sketch after the list of caveats below)...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
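For what it's worth, a hedged MySQL-flavoured sketch of such a trigger (names are hypothetical, and it still suffers from the concurrency caveat described in the earlier answer unless inserts are serialized):
DELIMITER //
CREATE TRIGGER property_amendments_bi
BEFORE INSERT ON property_amendments
FOR EACH ROW
BEGIN
    -- only fill in amendment_id when the caller has not supplied one
    IF NEW.amendment_id IS NULL OR NEW.amendment_id = 0 THEN
        SET NEW.amendment_id = (
            SELECT COALESCE(MAX(amendment_id), 0) + 1
            FROM property_amendments
            WHERE property_id = NEW.property_id
        );
    END IF;
END//
DELIMITER ;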
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
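If you do go that route, a hedged sketch of keeping such a table current, assuming a unique key on entity_id in the latest-version table (all names are made up):
-- on every new version, upsert the row for that entity so the table always
-- holds exactly one (most recent) record per entity_id
INSERT INTO latest_version (entity_id, version_number, payload)
VALUES (42, '2014-06-01 10:30:00', 'new payload')
ON DUPLICATE KEY UPDATE
    version_number = VALUES(version_number),
    payload        = VALUES(payload);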

How to implement a global incrementing id for multiple tables in SQL?

I decided back when I was coding to have different tables for each type of content. Now I am stuck solving this. Basically my notification system currently ranks the newest content by its timestamp. This is inaccurate, however, because there is a small chance that someone submits content at the same time as another person, and incorrect ranking would occur.
Now if I had all my content in a single table, I would simply rank it by an auto-incrementing variable. Is there a way to implement this auto-incrementing integer across multiple tables (e.g. something is inserted into table1 and gets id=0, then something is inserted into table2 and gets id=1)? Or do I have to recode all my stuff into a single table?
NOTE:
The reason I have content in multiple tables is that it's organized and I thought it would reduce load stress. I don't really care about the organization anymore, because I can just access the data through a GUI I coded; I'm just wondering about the load stress.
EDIT:
I'm using PHP 5 with MySQL.
Your question, particularly the need for an ID spanning multiple tables, clearly signals that your database design needs to change. You should make one table for all content types (as a generalization), with an auto-incrementing ID. Then, for each particular content type, you can define another table (the equivalent of inheritance in OOP) with extra fields and a foreign key pointing to the base table.
In other words, you need something like inheritance in SQL.
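A hedged sketch of that layout (table and column names are made up):
-- base table: every piece of content, of any type, gets its id here
CREATE TABLE content (
    id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    content_type VARCHAR(32)  NOT NULL,
    created_at   DATETIME     NOT NULL
) ENGINE=InnoDB;
-- one "child" table per content type, keyed by the base table's id
CREATE TABLE article (
    content_id INT UNSIGNED NOT NULL PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    body       TEXT,
    FOREIGN KEY (content_id) REFERENCES content (id)
) ENGINE=InnoDB;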
You can create a table with auto increment id just to keep track of ids. Your program would do an insert on that table, get the id, use it as necessary.
Something along the lines of:
function getNextId() {
    $res = mysql_query("INSERT INTO seq_table(id) VALUES (NULL)");
    $id = mysql_insert_id();
    if ($id % 10 == 0) {
        mysql_query("DELETE FROM seq_table");
    }
    return $id;
}
Where seq_table is a table you have to create just to generate the ids. Make it a function so it can be used wherever you need it. Every 10 generated ids I delete all of them; you don't need them in that table anyway. I don't delete every time, since that would slow things down. If another insert happens in the meantime and I delete 11 or more records, it doesn't affect the behaviour of this procedure; it is safe for the purpose it has to serve.
Even if the table is empty new ids will just keep on growing since you've declared id as auto-increment.
UPDATE: I want to clarify why the ID generation is not wrapped in a transaction and why it shouldn't.
If you generate an auto id and then roll back the transaction, the next auto id will be incremented anyway. Excerpt from a MySQL bug report:
[...] this is not a bug but expected behavior that happens in every RDBMS we know. Generated values are not a part of transaction and they don't care about other statements.
Getting the ID with this procedure is perfectly thread safe. Your logic after the ID is obtained should be wrapped in a transaction, especially if you deal with multiple tables.
Getting a sequence this way isn't a new concept; for instance, metabase_mysql.php, a stable DB access library, has a method called GetSequenceNextValue() which is quite similar.
In a single table, you could have a field for the content type and clustered index that includes the content type field. This effectively keeps all of one content type in one place on the disc, and another content type in another place, etc. (It's actually organised into pages, but this physical organisation is still true.)
Assuming that each content type has the same fields, this would likely meet your needs and behave similarly to multiple tables. In some cases you may even find that, with appropriate indexes, a single-table solution can be faster, more convenient and more maintainable, such as when you are trying to create globally unique identifiers across all content types.
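In MySQL/InnoDB terms that could look something like this hedged sketch (names hypothetical). InnoDB clusters rows by the primary key, so leading with content_type keeps each type physically grouped, while the AUTO_INCREMENT id stays unique across all types:
CREATE TABLE content (
    content_type TINYINT UNSIGNED NOT NULL,
    id           INT UNSIGNED     NOT NULL AUTO_INCREMENT,
    title        VARCHAR(255)     NOT NULL,
    body         TEXT,
    PRIMARY KEY (content_type, id),  -- clusters rows of one type together
    KEY (id)                         -- InnoDB wants the AUTO_INCREMENT column leading an index
) ENGINE=InnoDB;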
If you're unable to merge these back into a single table, you could create a central link table...
CREATE TABLE content_link (
id INT IDENTITY(1,1), -- MS SQL SERVER syntax
content_type INT,
content_id INT -- The id from the real table
)
As you insert into the content tables, also insert into the link table to create your globally unique id.
More simply, but even more manually, just hold a single value somewhere in the database. Whenever you need a new id, use that centrally stored value and increment it by one. Be sure to wrap the increment and collection in a single transaction to stop race conditions. (This can be done in a number of ways, depending on your flavor of SQL.)
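In MySQL specifically, a hedged way to make that increment-and-read atomic without an explicit transaction is the LAST_INSERT_ID(expr) idiom (the counter table name here is made up):
-- single-row counter table, seeded once
CREATE TABLE global_counter (next_id INT UNSIGNED NOT NULL);
INSERT INTO global_counter VALUES (0);
-- atomically bump the counter and stash the new value for this connection
UPDATE global_counter SET next_id = LAST_INSERT_ID(next_id + 1);
SELECT LAST_INSERT_ID();  -- the freshly allocated id, unaffected by other clients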
EDIT
A couple of MySQL example lines of code from the web...
START TRANSACTION;
INSERT INTO foo (auto,text)
VALUES(NULL,'text'); # generate ID by inserting NULL
INSERT INTO foo2 (id,text)
VALUES(LAST_INSERT_ID(),'text'); # use ID in second table
COMMIT;
Personally, I'd actually store the value in a variable, commit the transaction, and then continue with my business logic. This would keep the locks on the tables to a minimum.
You could have a separate ID table, insert into that, and use the newly-inserted ID.
e.g.
CREATE TABLE ids (id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, timeadded DATETIME);
In the script:
<?php
$r = mysql_query('INSERT INTO ids (timeadded) VALUES (NOW())');
$id = mysql_insert_id();
mysql_query("INSERT INTO someOtherTable (id, data) VALUES ('$id', '$data)");
Add error checking etc. to taste.
The MySQL manual states:
The ID that was generated is maintained in the server on a
per-connection basis. This means that the value returned by the
function to a given client is the first AUTO_INCREMENT value generated
for the most recent statement affecting an AUTO_INCREMENT column by that
client. This value cannot be affected by other clients, even if they
generate AUTO_INCREMENT values of their own. This behavior ensures
that each client can retrieve its own ID without concern for the
activity of other clients, and without the need for locks or
transactions.
(Source) So I don't think concerns about ACID compliance are a problem.

Ids from mysql massive insert from simultaneous sources

I've got an application in PHP & MySQL where users write to and read from a particular table. One of the write modes is a batch, doing only one query with multiple values. The table has an ID which auto-increments.
The idea is that for each row in the table that is inserted, a copy is inserted in a separate table, as a history log, including the ID that was generated.
The problem is that multiple users can do this at once, and I need to be sure that the ID read back is the correct one.
Can I be sure that if I do for example:
INSERT INTO table1 VALUES ('','test1'),('','test2')
that the ids generated are sequential?
How can I get the Id's that were just loaded, and be sure that those are the ones that were just loaded?
I've thought of LOCK TABLE, but the users shouldn't notice it.
Hope I made myself clear...
Building an application that requires generated IDs to be sequential usually means you're taking a wrong approach - what happens when you have to delete a value some day, are you going to re-sequence the entire table? Much better to just let the values fall as they may, using a primary key to prevent duplication.
Based on the current implementation of MyISAM and InnoDB, yes. However, this is not guaranteed to stay that way in the future, so I would not rely on it.
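As a hedged illustration: after a multi-row INSERT, LAST_INSERT_ID() returns the id of the first inserted row of that statement, so if the ids were allocated consecutively (innodb_autoinc_lock_mode 0 or 1) the whole range can be reconstructed. Column names are hypothetical:
INSERT INTO table1 (name) VALUES ('test1'), ('test2'), ('test3');
-- first auto-generated id of THIS statement, tracked per connection
SELECT LAST_INSERT_ID() AS first_id, ROW_COUNT() AS rows_inserted;
-- with consecutive allocation the ids are first_id .. first_id + rows_inserted - 1,
-- which can then be copied into the history table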

Key problem: Which key strategy should I use in my database?

Problem: When I use an auto-incrementing primary key in my database, this happens all the time:
I want to store an Order with 10 Items. The ordered Items belong to the Order. So I store the order, ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?), and then store the 10 Items with the foreign key (order_id).
So I always have to do:
INSERT ...
last_inserted_id = db.lastInsertId();
INSERT ...
INSERT ...
INSERT ...
and I believe this prevents me from using transactions in almost all INSERT cases where I need a foreign key.
So... here some solutions, and I don't know if they're really good:
A) Don't use auto_increment keys! Use a key table?
Key Table would have two fields: table_name, next_key. Every time I need a key for a table to insert a new dataset, first I ask for the next_key by accessing a special static KeyGenerator class method. This does a SELECT and an UPDATE, if possible in one transaction (would that work?). Of course I would request that for every affected table. Next, I can INSERT my entire object graph in one transaction without playing ping-pong with the database, because I already know the keys in advance.
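A hedged sketch of what that SELECT-and-UPDATE could look like (names hypothetical, InnoDB row locking assumed):
START TRANSACTION;
-- FOR UPDATE locks the 'orders' counter row, so two sessions cannot read the same value
SELECT next_key FROM key_table WHERE table_name = 'orders' FOR UPDATE;
UPDATE key_table SET next_key = next_key + 1 WHERE table_name = 'orders';
COMMIT;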
B) Using a GUID / UUID algorithm for keys?
These are supposed to be truly unique worldwide, and they're LARGE. I mean ... L_A_R_G_E. So a big amount of memory would go into these gigantic keys. Indexing will be hard, right? And data retrieval will be a pain for the database - at least I guess - integer keys are much faster to handle. On the other hand, these also provide some security: visitors can't iterate over all orders or all users or all pictures anymore by just incrementing the id parameter.
C) Stick with auto_incremented keys?
OK, if so, what about transactions like the one described in the example above? How can I solve that? Maybe by inserting a ghost row first and then doing a transaction with one UPDATE + n INSERTs?
D) What else?
When storing orders, you need transactions to prevent situations where only half your products are added to the database.
Depending on your database and your connector, the value returned by the last-insert-id function might be transaction-independent. For instance, with MySQL, mysql_insert_id returns the identifier for the last query from that particular client (without being affected by what other clients are doing concurrently).
Which database are you using?
Yes, typically inserting a record and then trying to select it again to find the auto-generated key is bad, especially if you are using a naive select max(id) from table query. This is because, when two threads are creating records concurrently, max(id) may not actually return the last id your current thread used.
One way to avoid this is to create a sequence in the database. From your code you select the sequence's next value, then use that value to execute your inserts (or you can craft a more complex SQL statement that does the selection and the inserts in one go). Sequences are atomic / thread-safe.
In MySQL you can ask for the last inserted id from the execution results which I believe will always give you the correct answer.
Sql Server supports SCOPE_IDENTITY (Transact-SQL) which should take care of your transaction issue and concurrency issue.
I would say stick with auto_increment.
(Assuming you are using MySQL)
"ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?)"
If you use MySQL's last_insert_id() function, you only see what happened in your session. So this is safe. You mention this:
db.last_insert_id()
I don't know what framework or language it is, but I would assume that it uses MySQL's last_insert_id() under the covers (if not, it is a pretty useless database abstraction framework).
"I believe this prevents me from using transactions in almost all INSERT cases ..."
I don't see why. Please explain.
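For instance, a hedged sketch of the order/items case from the question, using only per-connection state inside one transaction (names hypothetical):
START TRANSACTION;
INSERT INTO orders (customer_id, created_at) VALUES (42, NOW());
SET @order_id = LAST_INSERT_ID();  -- per-connection, unaffected by other clients
INSERT INTO order_items (order_id, product_id, qty)
VALUES (@order_id, 1, 2),
       (@order_id, 7, 1);
COMMIT;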
D) Sequence: may not be available in your DBMS, but if it is, it solves your problem elegantly.
For Postgresql, have a look at Sequence Functions
There is no final and general answer to this question.
Auto-incrementing columns are easy to use when you add new records. Using them as foreign keys within the same transaction is not so straightforward, because you need database-specific commands to get the newly created key. This approach is common for certain databases, for instance SQL Server.
Sequences seem to be harder to use, because you need to get a key before you insert a row, but in the end they are easier to use as foreign keys. This approach is common for certain databases, for instance Oracle.
When you use Hibernate or NHibernate, using auto-incrementing keys is discouraged, because some optimizations are no longer possible. A hi-lo algorithm, which uses an additional table, is recommended instead.
Guids are strong, for instance when sharing data between different databases or systems, in disconnected scenarios, for import / export, etc. In many databases, most of the tables contain only a few hundred records, so memory and performance are not such an issue. When using NHibernate, you get a guid generator which produces sequential guids, because some databases perform better when keys are sequential.

Fastest way to fill a table

I am trying to find the fastest way to insert data into a table (data from a select)
I always clear the table:
TRUNCATE TABLE table;
Then I do this to insert the data:
INSERT INTO table(id,total) (SELECT id, COUNT(id) AS Total FROM table2 GROUP BY id);
Someone told me I shouldn't do this.
He said this would be much faster:
CREATE TABLE IF NOT EXISTS table (PRIMARY KEY (inskey)) SELECT id, count(id) AS total FROM table2 GROUP BY id
Any ideas on this one?
I think my solution is cleaner, because I don't have to check for the table.
This will be ran in a cron job a few times a day
EDIT: I wasn't clear. The truncate is always run. It's just a matter of the fastest way to insert all the data.
I also think your solution is cleaner, plus the solution by "someone" looks to me to have some problems:
it does not actually delete old data that may be in the table
create table...select will create table columns with types based on what the select returns. That means changes in the table structure of table2 will propagate to table. That may or may not be what you want. It at least introduces an implicit coupling, which I find to be a bad idea.
As for performance, I see no reason why one should be faster than the other. So the usual advice applies: Choose the cleanest, most maintainable solution, test it, only optimize if performance is a problem :-).
Your solution would be my choice; the performance loss (if any, and I'm not sure there is one, because you don't drop/create the table and re-compute column types) is negligible and, IMHO, outweighed by cleanliness.
CREATE TABLE IF NOT EXISTS table (PRIMARY KEY (inskey))
SELECT id, count(id) AS total
FROM table2
GROUP BY id
This will not delete old values from the table.
If that's what you want, it will be faster indeed.
Perhaps something has been lost in the translation between your Someone and yourself. One possibility s/he might have been referring to is DROP/SELECT INTO vs TRUNCATE/INSERT.
I have heard that the latter is faster as it is minimally logged (but then again, what's the eventual cost of the DROP here?). I have no hard stats to back this up.
I agree with "sleske"s suggestion in asking you test it and optimize the solution yourself. DIY!
Every self respecting DB will give you the opportunity to rollback your transaction.
1. Rolling back your INSERT INTO... will require the DB to keep track of every row inserted into the table.
2. Rolling back the CREATE TABLE... is super easy for the DB - simply get rid of the table.
Now, if you were designing & coding the DB, which would be faster? 1 or 2?
"someone"s suggestion DOES have merit especially if you are using Oracle.
I'm sure that any time difference is indistinguishable, but yours is IMHO preferable because it's one SQL statement rather than two; any change in your INSERT statement doesn't require more work on the other statement; and yours doesn't require the host to validate that your INSERT matches the fields in the table.
From the manual: Beginning with MySQL 5.1.32, TRUNCATE is treated for purposes of binary logging and replication as DROP TABLE followed by CREATE TABLE — that is, as DDL rather than DML. This is due to the fact that, when using InnoDB and other transactional storage engines where the transaction isolation level does not allow for statement-based logging (READ COMMITTED or READ UNCOMMITTED), the statement was not logged and replicated when using STATEMENT or MIXED logging mode.
You can simplify your insert to:
INSERT INTO table
( SELECT id, COUNT(id) FROM table2 GROUP BY id );
