I am trying to find the fastest way to insert data into a table (data from a select)
I always clear the table:
TRUNCATE TABLE table;
Then I do this to insert the data:
INSERT INTO table(id,total) (SELECT id, COUNT(id) AS Total FROM table2 GROUP BY id);
Someone told me I shouldn't do this.
He said this would be much faster:
CREATE TABLE IF NOT EXISTS table (PRIMARY KEY (inskey)) SELECT id, count(id) AS total FROM table2 GROUP BY id
Any ideas on this one?
I think my solution is cleaner, because I don't have to check for the table.
This will be run in a cron job a few times a day.
EDIT: I wasn't clear. The TRUNCATE is always run. It's just a matter of the fastest way to insert all the data.
I also think your solution is cleaner, plus the solution by "someone" looks to me to have some problems:
it does not actually delete old data that may be in the table
create table...select will create table columns with types based on what the select returns. That means changes in the table structure of table2 will propagate to table. That may or may not be what you want. It at least introduces an implicit coupling, which I find to be a bad idea.
As for performance, I see no reason why one should be faster than the other. So the usual advice applies: Choose the cleanest, most maintainable solution, test it, only optimize if performance is a problem :-).
Your solution would be my choice; the performance loss (if any, which I'm not sure about, since you don't drop/create the table or re-derive the column types) is negligible and IMHO is outweighed by cleanliness.
CREATE TABLE IF NOT EXISTS table (PRIMARY KEY (inskey))
SELECT id, COUNT(id) AS total
FROM table2
GROUP BY id
This will not delete old values from the table.
If that's what you want, it will be faster indeed.
Perhaps something has been lost in the translation between your Someone and yourself. One possibility is that s/he was referring to DROP/SELECT INTO vs. TRUNCATE/INSERT.
I have heard that the latter is faster as it is minimally logged (but then again, what's the eventual cost of the DROP here?). I have no hard stats to back this up.
I agree with "sleske"s suggestion in asking you test it and optimize the solution yourself. DIY!
Every self-respecting DB will give you the opportunity to roll back your transaction.
1. Rolling back your INSERT INTO ... requires the DB to keep track of every row inserted into the table.
2. Rolling back the CREATE TABLE ... is super easy for the DB: simply get rid of the table.
Now, if you were designing and coding the DB, which would be faster? 1 or 2?
"Someone"'s suggestion DOES have merit, especially if you are using Oracle.
Regards,
Shiva
I'm sure that any time difference is indistinguishable, but yours is IMHO preferable because it's one SQL statement rather than two; any change in your INSERT statement doesn't require more work on the other statement; and yours doesn't require the host to validate that your INSERT matches the fields in the table.
From the manual: Beginning with MySQL 5.1.32, TRUNCATE is treated for purposes of binary logging and replication as DROP TABLE followed by CREATE TABLE — that is, as DDL rather than DML. This is due to the fact that, when using InnoDB and other transactional storage engines where the transaction isolation level does not allow for statement-based logging (READ COMMITTED or READ UNCOMMITTED), the statement was not logged and replicated when using STATEMENT or MIXED logging mode.
You can simplify your insert to:
INSERT INTO table
( SELECT id, COUNT(id) FROM table2 GROUP BY id );
Related
I have an analytics platform with lots of users and hundreds of click inserts per minute.
Sometimes I see that the exact same click is inserted into the database within the same second, so it becomes a duplicate of the other.
I have a check that looks for the same value in the table and does not let the second row be inserted if it finds one.
However, in this case it looks to me as if they're inserted into the DB in the exact same millisecond.
What can I do here?
My favorite: insert ignore myTable (col1, col2, ...) ...
where unique key(s) are set up beforehand to forbid the insert. It would appear that you do not care so much that it was previously inserted as much as you care that the end result is not dupes.
Note: the unique keys can be multi-column keys (composites)
A word of warning about insert ignore: it should not be implemented without careful thought of its ramifications for sensitive systems that need to know that the row was truly already there. It is ideal for "make sure it is there".
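To make that concrete, a minimal sketch (the clicks table, its columns, and the key name are all made up for illustration):
-- hypothetical table; the composite unique key is what lets INSERT IGNORE forbid dupes
CREATE TABLE clicks (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id INT UNSIGNED NOT NULL,
  url VARCHAR(255) NOT NULL,
  clicked_at DATETIME NOT NULL,
  UNIQUE KEY uq_click (user_id, url, clicked_at)
);
-- a duplicate arriving within the same second is silently skipped instead of erroring
INSERT IGNORE INTO clicks (user_id, url, clicked_at)
VALUES (42, '/product/7', NOW());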
Option B: One could look into intention locks, like here, but crafted for your particular use-case. Steer toward InnoDB row-level locking, which is swift, and certainly not table locks. Most things come with a trade-off; the downside of locking is diminished concurrency.
Option C: For the faint-of-heart (sometimes me). And this is what I would do if hired out and wish not to have peer backlash later. Perform an Insert ... on Duplicate Key Update (IODKU), and have a bogus column like touches that is an int that you increment for the Update part of the IODKU. Example below:
insert myTable (col1, col2, col3) values (p1,p2,p3)
on duplicate key update touches=touches+1;
That above would be in a most minimalist form. A view below is what I use in C# where I care about more columns in the "update part of IODKU", but just to show that, if it benefits anyone:
A final thought on IODKU: it is mandatory to have a unique key (primary or just unique) that causes the "clash" to occur. That is how the statement knows whether to perform the insert or the update. Without such a unique key clash, a new row will be inserted.
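As a sketch of that prerequisite, reusing the column names from the minimal example above (the key name is made up):
-- the bogus counter column used by the update branch of the IODKU
ALTER TABLE myTable ADD COLUMN touches INT NOT NULL DEFAULT 0;
-- the unique key that turns a would-be duplicate insert into the update branch
ALTER TABLE myTable ADD UNIQUE KEY uq_dedupe (col1, col2, col3);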
Back to the OP's issue: the reason your system probably already had the row there is high-concurrency use without locking.
If the system's architecture allows it, I would create a two-tier solution. First, a temporary table into which possibly duplicate data is inserted. The temporary table's name can contain a sharding parameter, for example an hour number. The system periodically exports data from the temporary tables into the main storage table, discarding duplicate data, and can then discard the temporary tables.
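A rough sketch of that idea, with made-up names (clicks as the main table carrying a unique key that defines a duplicate, clicks_stage_13 as the shard for the 13:00 hour):
-- staging table for the current hour; no unique key, so duplicates land here freely
CREATE TABLE clicks_stage_13 (
  user_id INT UNSIGNED NOT NULL,
  url VARCHAR(255) NOT NULL,
  clicked_at DATETIME NOT NULL
);
-- periodic export: duplicates collapse against the main table's unique key
INSERT IGNORE INTO clicks (user_id, url, clicked_at)
SELECT DISTINCT user_id, url, clicked_at FROM clicks_stage_13;
DROP TABLE clicks_stage_13;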
I'm facing a challenge that has never come up for me before and having trouble finding an efficient solution. (Likely because I'm not a trained programmer and don't know all the terminology).
The challenge:
I have a feed of data which I need to use to maintain a mysql database each day. To do this requires checking if a record exists or not, then updating or inserting accordingly.
This is simple enough by itself, but with thousands of records it seems very inefficient to run a query for each record just to check whether it already exists in the database.
Is there a more efficient way than looping through my data feed and running an individual query for each record? Perhaps a way to somehow prepare them into one larger query (assuming that is a more efficient approach).
I'm not sure a code sample is needed here, but if there is any more information I can provide please just ask! I really appreciate any advice.
Edits:
@Sgt AJ - Each record in the data feed has a number of different columns, but they are indexed by an ID. I would check against that ID in the database to see if a record exists. In this situation I'm only updating one table, albeit a large one (30+ columns, mostly text).
What is the problem?
If the problem is performance for checking, inserting and updating, use INSERT ... ON DUPLICATE KEY UPDATE:
insert into your_table
(email, country, reach_time)
values ('mike@gmail.com','Italy','2016-06-05 00:44:33')
on duplicate key update reach_time = '2016-06-05 00:44:33';
I assume that your key is email.
Old style, don't use:
if email exists
update your_table set
reach_time = '2016-06-05 00:44:33'
where email = 'mike@gmail.com';
else
insert into your_table
(email, country, reach_time)
values ('mike@gmail.com','Italy','2016-06-05 00:44:33');
It depends on how many 'feed' rows you have to load. If it's like 10 then doing them record by record (as shown by mustafayelmer) is probably not too bad. Once you go into the 100 and above region I would highly suggest to use a set-based approach. There is some overhead in creating and loading the staging table, but this is (very) quickly offset by the reduction of queries that need to be executed and the amount of round-trips going on over the network.
In short, what you'd do is :
-- create new, empty staging table
SELECT * INTO stagingTable FROM myTable WHERE 1 = 2
-- adding a PK to make JOIN later on easier
ALTER TABLE stagingTable ADD PRIMARY KEY (key1)
-- load the data either using INSERTS or using some other method
-- [...]
-- update existing records
UPDATE myTable
SET field1 = s.field1,
field2 = s.field2,
field3 = s.field3
FROM stagingTable s
WHERE s.key1 = myTable.key1
-- insert new records
INSERT myTable (key1, field1, field2, field3)
SELECT key1, field1, field2, field3
FROM stagingTable new
WHERE NOT EXISTS ( SELECT *
FROM myTable old
WHERE old.key1 = new.key1 )
-- get rid of staging table again
DROP TABLE stagingTable
to bring your data up to date.
Notes:
you might want to make the name of the stagingTable 'random' to avoid the situation where 2 'loads' run in parallel and start re-using the same table, giving all kinds of weird results (and errors). Since all this code is 'generated' in PHP anyway, you can simply add a timestamp or something to the table name.
on MSSQL I would load all the data into the staging table using a bulk-insert mechanism. It can use bcp or BULK INSERT; .Net actually has the SqlBulkCopy class for this. Some quick googling shows me MySQL has mysqlimport if you don't mind writing to a temp file first and then loading from there, or you can build big multi-row INSERT blocks rather than inserting one by one. I'd avoid doing 10k inserts in one go though; rather do them per 100 or 500 or so. You'll need to test what's most efficient.
PS: you'll need to adapt my syntax a bit here and there; like I said, I'm more familiar with MSSQL's T-SQL dialect. Also, you may be able to use the ON DUPLICATE KEY methodology on the staging table directly, thus combining the UPDATE and INSERT in one command. [MSSQL uses MERGE for this, but it would look completely different so I won't bother to include it here.]
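Purely as an illustration, here is roughly what the MySQL flavour of the above could look like, using the table and column names from the example and the ON DUPLICATE KEY shortcut just mentioned (untested, adjust to your schema):
-- empty staging table with the same structure as the target (the key1 PK is copied too)
CREATE TABLE stagingTable LIKE myTable;

-- load the feed in multi-row batches of a few hundred rows each
INSERT INTO stagingTable (key1, field1, field2, field3)
VALUES (1,'a','b','c'), (2,'d','e','f'), (3,'g','h','i');

-- update existing rows and insert new ones in a single statement
INSERT INTO myTable (key1, field1, field2, field3)
SELECT key1, field1, field2, field3 FROM stagingTable s
ON DUPLICATE KEY UPDATE
  field1 = s.field1,
  field2 = s.field2,
  field3 = s.field3;

DROP TABLE stagingTable;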
Good luck.
Does anyone have any recommendations how to implement this?
table1 will constantly be INSERTed into. This necessitates that every row on table2 be UPDATEd upon each table1 INSERT. Also, an algorithm that I don't know if MySQL would be best responsible for (vs PHP calculation speed) also has to be applied to each row of table2.
I wanted to have PHP handle it whenever the user did the INSERT, but I found out that PHP pages are not persistent once the connection to the user is closed (or so I understand; please tell me that's wrong so I can go that route).
So now my problem is that if I use a total table UPDATE in a TRIGGER, I'll have locks galore (or so I understand from InnoDB's locking when UPDATing an entire table with a composite primary key since part of that key will be UPDATEd).
Now, I'm thinking of using a cron job, but I'd rather they fire upon a user's INSERT on table1 instead of on a schedule.
So I was thinking maybe a CURSOR...
What approach would be fastest, with "ABSOLUTELY" NO LOCKING on table2?
Many thanks in advance!
Table structure
table2 is all INTs for speed. However, it has a 2 column primary key. 1 of those columns is what's being UPDATEd. That key is for equally important rapid SELECTs.
table1 averages about 2.5x the number of rows of table2.
table2 is actually very small, ~200 MB.
First of all: what you are trying to do is close to impossible - I don't know of an RDBMS that can escalate INSERTs into one table into UPDATEs of another with "ABSOLUTELY NO LOCKING".
That said:
my first point of research would be, whether the schema could be overhauled to optimize this hotspot away.
if this cannot be achieved, you might want to look into making table2 an in-memory type that can be recreated from existing data (such as keeping snapshots of it together with the max PK of table1 and rolling forward if a DB restart is required). Since you need to update all rows on every INSERT into table1 it cannot be very big.
Next point of research would be to put the INSERT and the UPDATE into a stored procedure, that is called by the insertion logic. This would make a runaway situation with the resulting locking hell on catchup much less likely.
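Just as a sketch of that last idea (column names are made up; your real algorithm would replace the simple running-total update):
DELIMITER //
CREATE PROCEDURE insert_and_recalc(IN p_val INT)
BEGIN
  -- keep the INSERT into table1 and the dependent UPDATE of table2 in one unit of work
  START TRANSACTION;
  INSERT INTO table1 (val) VALUES (p_val);
  UPDATE table2 SET running_total = running_total + p_val;  -- placeholder for the real algorithm
  COMMIT;
END //
DELIMITER ;

-- the application calls this instead of issuing a bare INSERT
CALL insert_and_recalc(42);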
Just to give you an example:
I have a PHP script that manages users votes.
When a user votes, the script runs a query to check if someone has already voted for the same ID/product. If nobody has voted, it runs another query to insert the ID into a general ID-votes table, and another one to insert the data into a per-user ID-votes table. This kind of behavior is repeated in other scripts.
The question is: if two different users vote simultaneously, is it possible that the two instances of the code try to insert a new ID (or run some similar query), and that this will give an error?
If yes, how do I prevent this from happening?
Thanks!
Important note: I'm using MyISAM! My web host doesn't allow InnoDB.
The question is: if two different users vote simultaneously, is it possible that the two instances of the code try to insert a new ID (or run some similar query), and that this will give an error?
Yes, you might end up with two queries doing the insert. Depending on the constraints on the table, one of them will either generate an error, or you'll end up with two rows in your database.
You could solve this, I believe, by applying some locking; e.g. if you need to add a vote to the product with id theProductId (pseudo code):
START TRANSACTION;
//lock on the row for our product id (assumes the product really exists)
select 1 from products where id=theProductId for update;
//assume the vote exist, and increment the no.of votes
update votes set numberOfVotes = numberOfVotes + 1 where productId=theProductId ;
//if the last update didn't affect any rows, the row didn't exist
if(rowsAffected == 0)
insert into votes(numberOfVotes,productId) values(1,theProductId )
//insert the new vote in the per user votes
insert into user_votes(productId,userId) values(theProductId,theUserId);
COMMIT;
Some more info here
MySQL offers another solution as well that might be applicable here: INSERT ... ON DUPLICATE KEY UPDATE.
e.g. you might be able to just do:
insert into votes(numberOfVotes,productId) values(1,theProductId ) on duplicate key
update numberOfVotes = numberOfVotes + 1;
If your votes table has a unique key on the product id column, the above will do an insert if that particular theProductId doesn't exist; otherwise it will do an update, incrementing the numberOfVotes column by 1.
You could probably avoid a lot of this if you created a row in the votes table at the same time you added the product to the database. That way you could be sure there's always a row for your product, and just issue an UPDATE on that row.
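A sketch of that, with guessed column names:
-- when the product is created, seed its counter row as well
INSERT INTO products (id, name) VALUES (123, 'Widget');
INSERT INTO votes (productId, numberOfVotes) VALUES (123, 0);
-- every vote afterwards is just an update, no existence check needed
UPDATE votes SET numberOfVotes = numberOfVotes + 1 WHERE productId = 123;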
The question is: if two different users vote simultaneously, is it possible that the two instances of the code try to insert a new ID (or run some similar query), and that this will give an error?
Yes, in general this is possible. This is an example of a very common problem in concurrent systems, called a race condition.
Avoiding it can be rather tricky, but in general you need to make sure that the operations cannot interleave in the way you describe, e.g. by locking the database for a while.
There are several practical solutions to this, all with their own advantages and risks (e.g. dead locks). See the Wikipedia article for a discussion and further pointers to information.
The easiest way:
LOCK TABLES table1 WRITE, table2 WRITE, table3 WRITE
-- check for record, insert if not exists, etc...
UNLOCK TABLES
If voting does not occur many times per second, then the above should be sufficient.
InnoDB tables offer transactions, which might be useful here as well. Others have already commented on it, so I won't go into any detail.
Alternatively, you could solve it at the code level via using some sort of shared memory mutex that disables concurrent execution of that section of PHP code.
This is when the singleton pattern comes in handy. It ensures that a piece of code is executed by only one process at a time.
http://en.wikipedia.org/wiki/Singleton_pattern
You would make a singleton class for the database access; this will help prevent the type of error you are describing.
Cheers.
Problem: When I use an auto-incrementing primary key in my database, this happens all the time:
I want to store an Order with 10 Items. The ordered Items belong to the Order. So I store the order, ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?), and then store the 10 Items with the foreign key (order_id).
So I always have to do:
INSERT ...
last_inserted_id = db.lastInsertId();
INSERT ...
INSERT ...
INSERT ...
and I believe this prevents me from using transactions in almost all INSERT cases where I need a foreign key.
So... here some solutions, and I don't know if they're really good:
A) Don't use auto_increment keys! Use a key table?
The Key Table would have two fields: table_name, next_key. Every time I need a key for a table to insert a new dataset, I first ask for the next_key by accessing a special static KeyGenerator class method. This does a SELECT and an UPDATE, if possible in one transaction (would that work?). Of course I would request that for every affected table. Next, I can INSERT my entire object graph in one transaction without playing ping-pong with the database, because I know the keys in advance.
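Roughly the flow I have in mind (names made up, untested):
CREATE TABLE key_table (
  table_name VARCHAR(64) NOT NULL PRIMARY KEY,
  next_key   INT UNSIGNED NOT NULL
);

-- KeyGenerator.nextKey('orders') would run something like:
START TRANSACTION;
SELECT next_key FROM key_table WHERE table_name = 'orders' FOR UPDATE;  -- blocks other clients briefly
UPDATE key_table SET next_key = next_key + 1 WHERE table_name = 'orders';
COMMIT;
-- the selected value is now reserved; the order and its items can be inserted with it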
B) Use a GUID / UUID algorithm for keys?
These are supposed to be really unique worldwide, and they're LARGE. I mean ... L_A_R_G_E. So a big amount of memory would go into these gigantic keys. Indexing will be hard, right? And data retrieval will be a pain for the database - at least I guess - integer keys are much faster to handle. On the other hand, these also provide some security: visitors can't iterate over all orders or all users or all pictures any more by just incrementing the id parameter.
C) Stick with auto_incremented keys?
OK, if so, what about transactions like the one described in the example above? How can I solve that? Maybe by inserting a ghost row first and then doing a transaction with one UPDATE + n INSERTs?
D) What else?
When storing orders, you need transactions to prevent situations where only half your products are added to the database.
Depending on your database and your connector, the value returned by the last-insert-id function might be transaction-independent. For instance, with MySQL, mysql_insert_id returns the identifier for the last query from that particular client (without being affected by what other clients are doing concurrently).
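For MySQL specifically, a sketch of the whole order-plus-items write in one transaction (table and column names are invented); LAST_INSERT_ID() is per connection, so concurrent clients don't disturb it:
START TRANSACTION;
INSERT INTO orders (customer_id) VALUES (7);
SET @order_id = LAST_INSERT_ID();   -- per-connection value, unaffected by other clients
INSERT INTO order_items (order_id, product_id, qty) VALUES (@order_id, 101, 2);
INSERT INTO order_items (order_id, product_id, qty) VALUES (@order_id, 102, 1);
COMMIT;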
Which database are you using?
Yes, typically inserting a record and then trying to select it again to find the auto-generated key is bad, especially if you are using a naive select max(id) from table query. This is because, as soon as two threads are creating records, max(id) may not actually return the last id your current thread used.
One way to avoid this is to create a sequence in the database. From your code you select sequence.NextValue, then use that value to execute your inserts (or you can craft a more complex SQL statement that does the selection and the inserts in one go). Sequences are atomic / thread-safe.
In MySQL you can ask for the last inserted id from the execution results which I believe will always give you the correct answer.
Sql Server supports SCOPE_IDENTITY (Transact-SQL) which should take care of your transaction issue and concurrency issue.
I would say stick with auto_increment.
(Assuming you are using MySQL)
"ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?)"
If you use MySQL's last_insert_id() function, you only see what happened in your session. So this is safe. You mention this:
db.last_insert_id()
I don't know what framework or language that is, but I would assume it uses MySQL's last_insert_id() under the covers (if not, it is a pretty useless database abstraction framework).
" I believe this prevents me from using transactions in almost all INSERT cases w"
I don't see why. Please explain.
D) Sequences: may not be available in your DBMS, but if they are, they solve your problem elegantly.
For Postgresql, have a look at Sequence Functions
There is no final and general answer to this question.
Auto-incrementing columns are easy to use when you add new records. Using them as foreign keys within the same transaction is not so straightforward: you need database-specific commands to get the newly created key. This technology is common in certain databases, for instance SQL Server.
Sequences seem harder to use, because you need to get a key before you insert a row, but in the end it's easier to use them as foreign keys. This technology is common in certain databases, for instance Oracle.
When you use Hibernate or NHibernate, using auto-incrementing keys is discouraged, because some optimizations are no longer possible. Using a hi-lo algorithm, which uses an additional table, is recommended instead.
GUIDs are strong, for instance when sharing data between different databases, systems, disconnected scenarios, import / export, etc. In many databases, most tables contain only a few hundred records, so memory and performance are not such an issue. When using NHibernate, you get a GUID generator that produces sequential GUIDs, because some databases perform better when keys are sequential.