I have hundreds of thousands of elements to insert into a database. I realized that calling one INSERT statement per element is far too costly and I need to reduce the overhead.
I reckon each INSERT can specify multiple rows, such as
INSERT INTO example (Parent, DataNameID) VALUES (1,1), (1,2)
My issue is that the "DataName" string repeats for every element, so I thought it would save space to store these names in another table and reference them by ID.
However, that complicates my idea of the bulk insert, which now needs a way to resolve each ID from its name before the bulk insert runs.
Any recommendations?
Should I simply denormalize and insert the data as a plain string into the table every time?
Also, is there a limit on the size of the query string? Mine amounts to almost 1.2 MB.
I am using PHP with a MySQL backend.
You haven't given us a lot of info on the database structure or size, but this may be a case where absolute normalization isn't worth the hassle.
However if you want to keep it normalized and the strings are already in your other table (let's call it datanames), you can do something like
INSERT INTO example (Parent, DataNameID) VALUES
(1, (select id from datanames where name='Foo')),
(1, (select id from datanames where name='Bar'))
First, insert the name into the lookup table.
Then call LAST_INSERT_ID() to get its id.
Then you can do your normal inserts.
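To avoid one round-trip per row, you can also cache the name-to-id mapping in PHP so the database is only hit once per distinct name. A minimal sketch; the $lookup callback is a stand-in for whatever query you actually run (e.g. INSERT followed by LAST_INSERT_ID(), or a SELECT for names that already exist):

```php
// Resolve DataName strings to IDs through an in-memory cache, so the
// database is queried once per *distinct* name instead of once per row.
// $lookup stands in for the real round-trip to the datanames table.
function getDataNameId(string $name, array &$cache, callable $lookup): int
{
    if (!array_key_exists($name, $cache)) {
        $cache[$name] = $lookup($name); // only called for unseen names
    }
    return $cache[$name];
}
```

With the IDs resolved up front, the bulk INSERT then only has to deal with integers.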
If your table is MyISAM-based you can use INSERT DELAYED to improve performance: http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html (note that INSERT DELAYED was deprecated in MySQL 5.6 and removed in 5.7).
You might want to read up on LOAD DATA [LOCAL] INFILE. It works great; I use it all the time.
EDIT: this answer only addresses the sluggishness of individual inserts. As @bemace points out, it says nothing about the string IDs.
I'm facing a challenge that has never come up for me before, and I'm having trouble finding an efficient solution (likely because I'm not a trained programmer and don't know all the terminology).
The challenge:
I have a feed of data which I need to use to maintain a MySQL database each day. Doing so requires checking whether each record exists, then updating or inserting accordingly.
That is simple enough by itself, but for thousands of records it seems very inefficient to run a query per record just to check whether it already exists in the database.
Is there a more efficient way than looping through my data feed and running an individual query for each record? Perhaps a way to prepare them into one larger query (assuming that is the more efficient approach)?
I'm not sure a code sample is needed here, but if there is any more information I can provide please just ask! I really appreciate any advice.
Edits:
@Sgt AJ - Each record in the data feed has a number of different columns, but they are indexed by an ID. I would check against that ID in the database to see if a record exists. In this situation I'm only updating one table, albeit a large one (30+ columns, mostly text).
What is the problem?
If the problem is the performance of checking, inserting, and updating, use a single upsert:
insert into your_table
(email, country, reach_time)
values ('mike@gmail.com','Italy','2016-06-05 00:44:33')
on duplicate key update reach_time = '2016-06-05 00:44:33';
I assume that your unique key is email.
The old style; don't use it:
if email exists
update your_table set
reach_time = '2016-06-05 00:44:33'
where email = 'mike@gmail.com';
else
insert into your_table
(email, country, reach_time)
values ('mike@gmail.com','Italy','2016-06-05 00:44:33')
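In PHP you would typically run the upsert through a prepared statement. A sketch of a hypothetical helper that generates the statement text for any column list (it assumes MySQL's VALUES() syntax in the ON DUPLICATE KEY UPDATE clause, which is still supported, though MySQL 8.0.20+ prefers a row alias):

```php
// Build an "INSERT ... ON DUPLICATE KEY UPDATE" statement with ? placeholders.
// $updateCols lists the columns to refresh when the unique key already exists.
function buildUpsertSql(string $table, array $cols, array $updateCols): string
{
    $placeholders = implode(',', array_fill(0, count($cols), '?'));
    $updates = implode(', ', array_map(fn($c) => "$c = VALUES($c)", $updateCols));
    return "INSERT INTO $table (" . implode(', ', $cols) . ")"
         . " VALUES ($placeholders)"
         . " ON DUPLICATE KEY UPDATE $updates";
}
```

You would then prepare the result once with PDO and execute it per row, binding (email, country, reach_time) each time.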
It depends on how many 'feed' rows you have to load. If it's like 10, then doing them record by record (as shown by mustafayelmer) is probably not too bad. Once you go into the hundreds and above, I would highly suggest using a set-based approach. There is some overhead in creating and loading the staging table, but it is (very) quickly offset by the reduction in queries that need to be executed and the number of round-trips over the network.
In short, what you'd do is :
-- create new, empty staging table
SELECT * INTO stagingTable FROM myTable WHERE 1 = 2
-- adding a PK to make JOIN later on easier
ALTER TABLE stagingTable ADD PRIMARY KEY (key1)
-- load the data either using INSERTS or using some other method
-- [...]
-- update existing records
UPDATE myTable
SET field1 = s.field1,
field2 = s.field2,
field3 = s.field3
FROM stagingTable s
WHERE s.key1 = myTable.key1
-- insert new records
INSERT myTable (key1, field1, field2, field3)
SELECT key1, field1, field2, field3
FROM stagingTable new
WHERE NOT EXISTS ( SELECT *
FROM myTable old
WHERE old.key1 = new.key1 )
-- get rid of staging table again
DROP TABLE stagingTable
to bring your data up to date.
Notes:
you might want to make the name of the stagingTable 'random', to avoid two 'loads' running in parallel re-using the same table and producing all kinds of weird results (and errors). Since all this code is generated in PHP anyway, you can simply add a timestamp or something to the table name.
on MSSQL I would load all the data into the staging table using a bulk-insert mechanism: bcp, BULK INSERT, or .NET's SqlBulkCopy class. Some quick googling shows me MySQL has mysqlimport, if you don't mind writing to a temp file first and then loading from there, or you could use this to do big INSERT blocks rather than inserting one by one. I'd avoid doing 10k inserts in one go, though; rather do them per 100 or 500 or so. You'll need to test what's most efficient.
PS: you'll need to adapt my syntax a bit here and there; like I said, I'm more familiar with MSSQL's T-SQL dialect. Also, you may be able to use the ON DUPLICATE KEY approach on the staging table directly, thus combining the UPDATE and INSERT in one command. (MSSQL uses MERGE for this, but it would look completely different, so I won't include it here.)
Good luck.
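Translated into MySQL syntax, the staging flow above might look roughly like this (an untested sketch; table and column names are illustrative, and the timestamp suffix follows the 'random name' suggestion):

```sql
-- create an empty staging table with the same structure as the target
CREATE TABLE stagingTable_20160605 LIKE myTable;

-- load the data (multi-row INSERTs, mysqlimport, or LOAD DATA INFILE)
-- [...]

-- update existing records (MySQL multi-table UPDATE syntax)
UPDATE myTable m
JOIN stagingTable_20160605 s ON s.key1 = m.key1
SET m.field1 = s.field1,
    m.field2 = s.field2,
    m.field3 = s.field3;

-- insert new records
INSERT INTO myTable (key1, field1, field2, field3)
SELECT s.key1, s.field1, s.field2, s.field3
FROM stagingTable_20160605 s
WHERE NOT EXISTS (SELECT 1 FROM myTable old WHERE old.key1 = s.key1);

-- get rid of the staging table again
DROP TABLE stagingTable_20160605;
```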
I'm working on a PHP web-scraping project, and my aim is to store the data in a MySQL database. I'm using a unique key over 3 of the 9 columns in the table, and there are more than 5k records.
Should I check for duplicate data at the program level, e.g. by putting values in arrays and comparing them before inserting into the database?
Is there any way to speed up my database insertion?
Never ever create a duplicate table; this is an SQL anti-pattern and it makes your data more difficult to work with.
PDO and prepared statements may give you a little boost, but don't expect wonders from them.
A multi-row INSERT IGNORE may also give you a little boost, but again, don't expect wonders.
You should generate a multi-row insert query like so:
INSERT INTO database.table (columns) VALUES (values),(values),(values)
Keep the statement under MySQL's maximum packet size (max_allowed_packet).
This way the index only has to be updated once per statement.
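A sketch of such a generator in PHP, chunked so each statement stays well under max_allowed_packet (the helper name and chunk size are illustrative):

```php
// Build multi-row "INSERT IGNORE" statements with ? placeholders,
// splitting $rows into chunks so each statement stays a manageable size.
function buildBatchInserts(string $table, array $cols, array $rows, int $chunkSize): array
{
    $rowPh = '(' . implode(',', array_fill(0, count($cols), '?')) . ')';
    $sqls = [];
    foreach (array_chunk($rows, $chunkSize) as $chunk) {
        $sqls[] = "INSERT IGNORE INTO $table (" . implode(',', $cols) . ') VALUES '
                . implode(',', array_fill(0, count($chunk), $rowPh));
    }
    return $sqls;
}
```

When executing, flatten each chunk's values into one array and bind it to the corresponding prepared statement.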
You could create a duplicate of your current table, but without indices on any field, and store the incoming data there.
Then use events to move the data from the temp table into the main table. Once a row has been moved to the main table, delete it from the temp table.
You can follow your updates with a trigger: when you update the table, you have to write a trigger for that table.
Use PDO or the mysqli_* functions (with prepared statements) to speed up insertion into the database.
You could use "INSERT IGNORE" in your query. That way the record will not be inserted if any unique constraints are violated.
Example:
INSERT IGNORE INTO table_name SET name = 'foo', value = 'bar', id = 12345;
I have a table users that has an auto-incrementing id column. For every new user I essentially need to insert three rows into three different tables: 1. user, 2. user_content, and 3. user_preferences. The rows inserted into user_content and user_preferences are referenced by ids that correspond to each user's id (held in user).
How do I accomplish this?
Should I do the INSERT INTO user query first, obtaining that auto-incremented id with last_insert_id(), and then the other two INSERT INTO queries using the obtained user id? Or, is there a more concise way to do this?
(note: I am using MySQL and PHP, and if it makes a difference, I am using a bigint to store the id values in all three tables.)
Thank you!
The approach that you've described (insert into user first, take the result of last_insert_id(), and use that to insert to the other two tables) is perfectly reasonable; I see nothing wrong with it.
It might be technically possible to combine the three queries and use the LAST_INSERT_ID() MySQL function to insert values to the other two tables, but this would be significantly more complex without any corresponding benefits. Not really worth doing, in other words.
I see 3 options:
PHP side using some *_last_insert_id (as you describe)
Create a trigger
Use a stored procedure.
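Concretely, the PDO version of option 1 might look like the sketch below. SQLite in-memory is used only so the snippet is self-contained; against MySQL you would use a DSN like 'mysql:host=localhost;dbname=app' with the same PDO calls, and lastInsertId() maps to LAST_INSERT_ID():

```php
// Sketch of the last_insert_id() approach, wrapped in a transaction so that
// either all three rows are written or none are.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE user (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)');
$pdo->exec('CREATE TABLE user_content (user_id INTEGER, content TEXT)');
$pdo->exec('CREATE TABLE user_preferences (user_id INTEGER, prefs TEXT)');

$pdo->beginTransaction();
$pdo->prepare('INSERT INTO user (name) VALUES (?)')->execute(['alice']);
$userId = (int) $pdo->lastInsertId();           // the auto-incremented user id
$pdo->prepare('INSERT INTO user_content (user_id, content) VALUES (?, ?)')
    ->execute([$userId, 'first post']);
$pdo->prepare('INSERT INTO user_preferences (user_id, prefs) VALUES (?, ?)')
    ->execute([$userId, '{}']);
$pdo->commit();
```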
There is a large table that holds millions of records. phpMyAdmin reports 1.2G size for the table.
There is a calculation that needs to be done for every row. The calculation is not simple (it cannot be put in SET col = calc format); it uses a stored function to get the values, so currently we run a single UPDATE per row.
This is extremely slow and we want to optimize it.
Stored function:
https://gist.github.com/a9c2f9275644409dd19d
And this is called by this method for every row:
https://gist.github.com/82adfd97b9e5797feea6
This is performed on an off-live server, and usually it is updated once per week.
What options do we have here?
Why not set up a separate table to hold the computed values and take the load off your current table? It can have two columns: a primary key matching each row in your main table, and a column for the computed value.
Then your process can be:
a) Truncate computedValues table - This is faster than trying to identify new rows
b) Compute the values and insert into the computed values table
c) Whenever you need your computed values, you join to the computedValues table using a primary-key join, which is fast; and if you need more computations, you just add new columns.
d) You can also update the main table using the computed values if you have to
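Sketched in SQL (an untested outline; the table and column names are hypothetical, and computeGpcd() stands in for your stored function):

```sql
-- a) empty the cache table
TRUNCATE TABLE computedValues;

-- b) compute once per row and store the result
INSERT INTO computedValues (id, value)
SELECT id, computeGpcd(id)   -- your stored function here
FROM mainTable;

-- c) fast primary-key join whenever the value is needed
SELECT m.*, c.value
FROM mainTable m
JOIN computedValues c ON c.id = m.id;

-- d) optionally push the values back into the main table
UPDATE mainTable m
JOIN computedValues c ON c.id = m.id
SET m.computedCol = c.value;
```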
Well, the problem doesn't seem to be the UPDATE query because no calculations are performed in the query itself. As it seems the calculations are performed first and then the UPDATE query is run. So the UPDATE should be quick enough.
When you say "this is extremely slow", I assume you are not referring to the UPDATE query but the complete process. Here are some quick thoughts:
As you said there are millions of records; updating that many entries is always time-consuming. And if there are many columns and indexes defined on the table, it will add to the overhead.
I see that there are many REPLACE INTO queries in the function getNumberOfPeople(). These might well be a reason for the slow process. Have you checked how efficient those REPLACE INTO queries are? Can you try removing them and then see whether that has any impact on the UPDATE process?
There are a couple of SELECT queries too in getNumberOfPeople(). Check if they might be impacting the process and if so, try optimizing them.
In procedure updateGPCD(), you may try replacing SELECT COUNT(*) INTO _has_breakdown with SELECT COUNT(1) INTO _has_breakdown. In the same query, the WHERE condition is reading _ACCOUNT but this will fail when _ACCOUNT = 0, no?
On another suggestion, if it is the UPDATE that you think is slow because of reason 1, it might make sense to move the updating column gpcd outside usage_bill to another table. The only other column in the table should be the unique ID from usage_bill.
Hope the above make sense.
I am developing a web app using Zend Framework, and my problem is about combining 2 SQL queries to improve efficiency. My table structure is like this:
>table message
id(int auto incr)
body(varchar)
time(datetime)
>table message_map
id(int auto incr)
message_id(foreign key from message table's id column)
sender(int ) comment 'user id of sender'
receiver(int) comment 'user id of receiver'
To get the code working, I first insert the message body and time into the message table, then, using the last inserted id, I insert the message's sender and receiver into the message_map table. Now what I want is to do this task in a single query, as one query would be more efficient. Is there any way to do so?
No, there isn't. You can insert into only one table at a time.
But I can't imagine you need to insert so many messages that performance really becomes an issue. Even with these separate statements, any database can easily insert thousands of records a minute.
bulk inserts
Of course, when inserting multiple records into the same table, that's a different matter. This is indeed possible in MySQL and it will make your query a lot faster. It will give you trouble, though, if you need the insert ids of all those records.
mysql_insert_id() returns the first id inserted by the last insert statement, if it was a bulk insert. So you could query all ids that are >= that id. That should give you all the records you just inserted, although the result may contain ids that other sessions inserted between your insert and the following query for those ids.
If it's only for these two tables, why don't you create a single table having all these columns in one, as:
>table message
id(int auto incr)
body(varchar)
sender(int ) comment 'user id of sender'
receiver(int) comment 'user id of receiver'
time(datetime)
then it will work the way you want.
I agree with GolezTrol; otherwise, if you want optimized performance for your query, you may choose to use stored procedures.
Indeed, combining those two inserts isn't possible. While you can use JOIN in SELECT queries, you can't combine INSERT queries. If you're really worried about performance, is there any way to join those two tables together? As far as I can see there's no point in keeping them separated; they're both about the message.
As stated before, executing a second insert query isn't that much of a server load anyway.
As others pointed out, you cannot really update multiple tables at once. And, you should not really be worried about performance, unless you are inserting thousands of messages in a short period of time.
Now, there is one thing you could worry about. Imagine you first insert the message body, and then try to insert the receiver/sender IDs. Suppose the first succeeds while the second (for whatever reason) fails. That would corrupt your data a bit. To avoid that, you can use transactions, e.g.
mysql_query("START TRANSACTION", $connection);
//your code
mysql_query("COMMIT", $connection);
That would ensure that either both inserts get into the database, or neither does. Note that the mysql_* functions are deprecated (and removed in PHP 7); if you are using PDO, look at http://www.php.net/manual/en/pdo.begintransaction.php for examples.
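With PDO, the same pattern, including the rollback when the second insert fails, might look like the sketch below. SQLite in-memory is used only to keep the snippet self-contained; against MySQL you would change the DSN and keep the same calls:

```php
// Insert into message and message_map atomically: if the second insert
// fails, rollBack() undoes the first one as well.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE message (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT, time TEXT)');
$pdo->exec('CREATE TABLE message_map (message_id INTEGER NOT NULL,
            sender INTEGER NOT NULL, receiver INTEGER NOT NULL)');

function sendMessage(PDO $pdo, string $body, ?int $sender, int $receiver): bool
{
    try {
        $pdo->beginTransaction();
        $pdo->prepare('INSERT INTO message (body, time) VALUES (?, ?)')
            ->execute([$body, date('Y-m-d H:i:s')]);
        $id = (int) $pdo->lastInsertId();
        $pdo->prepare('INSERT INTO message_map (message_id, sender, receiver) VALUES (?, ?, ?)')
            ->execute([$id, $sender, $receiver]);
        $pdo->commit();
        return true;
    } catch (Exception $e) {
        $pdo->rollBack();   // the message row is undone too
        return false;
    }
}
```

Passing a NULL sender violates the NOT NULL constraint, so the whole transaction, including the already-inserted message row, is rolled back.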