How to improve insert performance on a billion row table? - php

I have a billion-row table that no longer fits in memory.
When I insert new rows in bulk, the overhead of maintaining the primary key index kills the performance. I HAVE to have this index because otherwise SELECT statements are really slow. But since the inserts come in a random order, each inserted row has to be written to a different area of the disk.
And since the HDD is capped at 200 I/O operations per second, this slows the inserting to a crawl.
Can I "have my cake and eat it too" in this situation? Maybe by creating another table in which the data would be grouped by a different column (by having a different primary key)? But this seems wasteful to me and I don't even know if it would help...
Or maybe I could use some staging table? Insert 1,000,000 rows there and then insert them into the target table, sorted by the primary key?
Am I doomed?
EDIT:
I've partitioned the table horizontally.
When I removed the primary key on this field that I need and placed it on the autoincrement field, the inserts were blazingly fast.
Unfortunately, since the data on disk is ordered by the primary key value, this killed the SELECT performance, because the selects don't filter on the autoincrement value but on the original key column.
So either I insert rows fast or I select them fast. Isn't there any solution that could help in both cases?
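For reference, the staging-table idea from the question could look roughly like this in SQL; the table and column names are made up, and the point is that the final insert reaches the big table already sorted by its primary key:
-- hypothetical names: big_table is the billion-row table, staging is a small helper table
CREATE TABLE staging LIKE big_table;
-- bulk-insert the next batch (say 1,000,000 rows) into the small staging table; it fits in memory, so this is cheap
INSERT INTO staging (pk_col, payload) VALUES (123456, 'a'), (987654, 'b');
-- push the whole batch into the big table sorted by its primary key,
-- so the clustered index is written in large, mostly sequential chunks
INSERT INTO big_table SELECT * FROM staging ORDER BY pk_col;
TRUNCATE TABLE staging;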

Each time you insert a new row, the index is updated after the data is written, and that takes extra time.
You can wrap the inserts in a transaction:
START TRANSACTION
...your insert queries...
COMMIT
Try it like this:
mysql_query("START TRANSACTION");
// ...your insert queries...
mysql_query("COMMIT");

Related

MySQL auto increment and select count

I am having a strange issue with a MySQL query. I have a table with the fields
slno, mobileno, contractor, with slno as the primary key and auto-increment starting at 1.
While testing with up to 100 records, the count and the auto-increment values were the same,
so I truncated the table to reset the auto-increment and inserted a huge Excel file with around 40k rows via PHP. I then issued a select query, which shows
the max of slno is 40000, as expected, but the count shows 39920.
I was puzzled and tried to search on Google; maybe my lack of keyword-search ability prevented me from finding a result, so I am posting here. For reference I added a screenshot. Any ideas and clarifications? Thanks
EDIT:
min slno is 1
EDIT:
A related question, with a solution for finding gaps in an auto-increment column in MySQL, has been asked and solved here.
There are specific cases in which auto-incremented values can be lost. One example is if you roll back an insertion. As per the doco:
"Lost" auto-increment values and sequence gaps
In all lock modes (0, 1, and 2), if a transaction that generated auto-increment values rolls back, those auto-increment values are "lost". Once a value is generated for an auto-increment column, it cannot be rolled back, whether or not the "INSERT-like" statement is completed, and whether or not the containing transaction is rolled back. Such lost values are not reused. Thus, there may be gaps in the values stored in an AUTO_INCREMENT column of a table.
In that case, although the insert is backed out, the auto-increment may not be. That would certainly allow for the possibility that your bulk insertion from Excel is occasionally failing and retrying, with the subsequent retry working. It really depends on how your insertion process works.
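A quick way to reproduce this with a throwaway InnoDB table (the table and column names here are made up):
CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY, val VARCHAR(10)) ENGINE=InnoDB;
START TRANSACTION;
INSERT INTO t (val) VALUES ('a');  -- consumes auto-increment value 1
ROLLBACK;                          -- the row is gone, but value 1 is not given back
INSERT INTO t (val) VALUES ('b');  -- gets id 2, leaving a permanent gap at 1
SELECT MAX(id), COUNT(*) FROM t;   -- 2 and 1: max and count no longer match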
In any case, assuming those values will always be contiguous is a bad assumption to make.
This is because, even if insertions were guaranteed to be contiguous, it is possible to delete rows, which would result in gaps appearing. You can certainly fix this each time you delete (or bulk insert, for that matter) but the workload is high - you basically have to find gaps and then "move" higher entries into those gaps.
This movement is likely to be non-trivial as it's most likely that there will be other tables holding key look-ups to that column, and each of those will need to be changed as well.
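For reference, the usual way to locate such gaps (the approach behind the question linked in the edit above) is a self-join that looks for an id whose successor is missing; your_table stands in for the real table name:
SELECT t1.slno + 1 AS gap_starts_at
FROM your_table t1
LEFT JOIN your_table t2 ON t2.slno = t1.slno + 1
WHERE t2.slno IS NULL
  AND t1.slno < (SELECT MAX(slno) FROM your_table);
-- the last condition stops the current maximum from being reported as a gap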
So the best use of an auto-increment field is simply to provide a unique identifier for a row where no other one exists; it should not be relied on to be contiguous.

Load a full list of IDs from DB or perform one record at a time? What's best?

I am migrating a custom-made web site to WordPress. First I have to migrate the data from the previous web site, and then, every day, I have to perform some data insertion using an API.
The data I'd like to insert comes with a unique ID representing a single football game.
In order to avoid inserting the same game multiple times, I made a db table with the following structure:
CREATE TABLE `ss_highlight_ids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`highlight_id` int(10) unsigned zerofill NOT NULL DEFAULT '0000000000',
PRIMARY KEY (`id`),
UNIQUE KEY `highlight_id_UNIQUE` (`highlight_id`),
KEY `highlight_id_INDEX` (`highlight_id`) COMMENT 'Contains a list with all the highlight IDs. This is used as index, and dissalow the creation of double records.'
) ENGINE=InnoDB AUTO_INCREMENT=2967 DEFAULT CHARSET=latin1
and when I try to insert a new record into my WordPress db, I'd first like to look up this table to see if the ID already exists.
The question now :)
What's preferable? To load all the IDs using a single SQL query and then use plain PHP to check if the current game ID exists, or is it better to query the DB for every single row I insert?
I know that MySQL queries are resource-expensive, but on the other hand, I currently have about 3k records in this table, and this will grow to 30-40k in the next few years, so I don't know if it's good practice to load all of those records into PHP.
What is your opinion / suggestion ?
UPDATE #1
I just found that my table is 272 KiB with 2966 rows. This means that in the near future it will likely be around 8000 KiB or more, and growing.
UPDATE #2
Maybe I did not make it clear enough. For the first insertion I have to iterate over a CSV file with about 12K records, and after the CSV insertion I will insert about 100-200 records every day. All of those records require a lookup in the table with the IDs.
So the exact question is: is it better to run 12K MySQL queries at CSV insertion time and then about 100-200 MySQL queries every day, or to just load the IDs into server memory and use PHP for the lookup?
Your table has a column id which is AUTO_INCREMENT; that means there is no need to insert anything into that column. It will fill itself.
highlight_id is UNIQUE, so it may as well be the PRIMARY KEY; get rid of id.
A PRIMARY KEY is a UNIQUE key is an INDEX. So this is redundant:
KEY `highlight_id_INDEX` (`highlight_id`)
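Put together, a trimmed-down version of the table might look like this (engine, charset, and the zerofill are kept from the original definition):
CREATE TABLE `ss_highlight_ids` (
  `highlight_id` int(10) unsigned zerofill NOT NULL,
  PRIMARY KEY (`highlight_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;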
Back to your question... SQL is designed to do things in batches. Don't defeat that by doing things one row at a time.
How can the table be 272 KiB if it has only two columns and 2966 rows? If there are more columns in the table, show them. They often give good clues about what you are doing and how to make it more efficient.
2966 rows is 'trivial'; you will have to look closely to see performance differences.
Loading from CSV...
If this is a replacement, use LOAD DATA, building a new table, then RENAME to put it into place. One CREATE, one LOAD, one RENAME, one DROP. Much more efficient than 100 queries of any kind.
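A sketch of that replace-the-table flow, using the ss_highlight_ids table from the question; the file path, delimiters, and the _new/_old names are assumptions:
CREATE TABLE ss_highlight_ids_new LIKE ss_highlight_ids;
LOAD DATA LOCAL INFILE '/path/to/highlights.csv'
    INTO TABLE ss_highlight_ids_new
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    (highlight_id);
RENAME TABLE ss_highlight_ids TO ss_highlight_ids_old,
             ss_highlight_ids_new TO ss_highlight_ids;
DROP TABLE ss_highlight_ids_old;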
If the CSV is updates/inserts, LOAD into a temp table, then do INSERT ... ON DUPLICATE KEY UPDATE ... to perform the updates/inserts into the real table. One CREATE, one LOAD, one IODKU. Much more efficient than 100 queries of any kind.
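And a sketch of the updates/inserts variant (again, the file path is a placeholder):
CREATE TEMPORARY TABLE csv_stage LIKE ss_highlight_ids;
LOAD DATA LOCAL INFILE '/path/to/highlights.csv'
    INTO TABLE csv_stage
    (highlight_id);
INSERT INTO ss_highlight_ids (highlight_id)
    SELECT highlight_id FROM csv_stage
    ON DUPLICATE KEY UPDATE highlight_id = VALUES(highlight_id);
-- with only the key column, the ON DUPLICATE KEY UPDATE effectively just skips duplicates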
If the CSV is something else, please elaborate.

MySQL concurrent multi-row inserts - Insert ID Assumption

In my application, I have a lot of foreign key dependencies, and often insert large numbers of rows. What I have done up until now is run a single insert at a time, and record the insert ID. This tends to take a long time when inserting a large number of rows, even when apache and mysql are run on the same server.
My question is, if I were to alter my application to add a number of rows with a single INSERT, would I be able to assume the ids of each row based strictly upon the last insert id returned by the mysql connection? The issue is that there is the occasional situation where more than one person will be putting large amounts of information into the database at a time.
From what I have been able to determine, it seems safe to assume that when you insert 500 rows, your insert ids will range from (lastInsertID) to (lastInsertID+499), regardless of whether a query from another connection has begun or ended in the time it took to complete, but I want to be sure this is accepted as safe practice.
I am primarily running InnoDB, but there is the occasional MyISAM in there as well.
Thanks All!
-Jer
mysql_insert_id, and its now recommended alternative mysqli_insert_id, return the id of the first row generated in the affected table's AUTO_INCREMENT column by the last query you ran.
MySQL multi-row INSERT statements assign AUTO_INCREMENT values sequentially, starting from that first entry's value. So yes, it is safe to assume that the entries inserted by a single INSERT statement are contiguous.
Comment from the MySQL reference: "The ID that was generated is maintained in the server on a per-connection basis. This means that the value returned by the function to a given client is the first AUTO_INCREMENT value generated for the most recent statement affecting an AUTO_INCREMENT column by that client. This value cannot be affected by other clients, even if they generate AUTO_INCREMENT values of their own. This behavior ensures that each client can retrieve its own ID without concern for the activity of other clients, and without the need for locks or transactions."
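A small illustration with a hypothetical items table; this assumes InnoDB's default "consecutive" auto-increment lock mode (innodb_autoinc_lock_mode = 1), under which a single multi-row INSERT is allocated one contiguous block of ids:
INSERT INTO items (name) VALUES ('a'), ('b'), ('c');
-- LAST_INSERT_ID() is the id generated for the FIRST of the three rows;
-- under the consecutive lock mode, 'b' and 'c' get that value +1 and +2,
-- even if another connection is inserting at the same time
SELECT LAST_INSERT_ID();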

Optimized ways to update every record in a table after running some calculations on each row

There is a large table that holds millions of records. phpMyAdmin reports 1.2G size for the table.
There is a calculation that needs to be done for every row. The calculation is not simple (it cannot be expressed as SET col = <expression>); it uses a stored function to get the values, so currently we run a single UPDATE for each row.
This is extremely slow and we want to optimize it.
Stored function:
https://gist.github.com/a9c2f9275644409dd19d
And this is called by this method for every row:
https://gist.github.com/82adfd97b9e5797feea6
This is performed on a server that is not live, and usually it is updated once per week.
What options do we have here?
Why not set up a separate table to hold the computed values and take the load off your current table? It can have two columns: the primary key of each row in your main table and a column for the computed value.
Then your process can be:
a) Truncate the computedValues table - this is faster than trying to identify new rows
b) Compute the values and insert them into the computedValues table
c) Whenever you need your computed values, you join to the computedValues table using a primary-key join, which is fast; and if you need more computations, you just add new columns
d) You can also update the main table using the computed values if you have to
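A minimal sketch of that layout, borrowing the usage_bill and gpcd names referenced elsewhere in this question; the column types, the id column, and your_stored_function are placeholders:
CREATE TABLE computedValues (
  usage_bill_id INT UNSIGNED NOT NULL PRIMARY KEY,
  gpcd DECIMAL(12,4) NOT NULL
);
-- weekly refresh:
TRUNCATE TABLE computedValues;
INSERT INTO computedValues (usage_bill_id, gpcd)
  SELECT id, your_stored_function(id) FROM usage_bill;
-- read side, whenever the computed value is needed:
SELECT u.*, c.gpcd
FROM usage_bill u
JOIN computedValues c ON c.usage_bill_id = u.id;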
Well, the problem doesn't seem to be the UPDATE query itself, because no calculations are performed in it. It seems the calculations are performed first and then the UPDATE query is run, so the UPDATE should be quick enough.
When you say "this is extremely slow", I assume you are not referring to the UPDATE query but the complete process. Here are some quick thoughts:
As you said, there are millions of records; updating that many entries is always time-consuming. And if there are many columns and indexes defined on the table, that adds to the overhead.
I see that there are many REPLACE INTO queries in the function getNumberOfPeople(). These may well be a reason for the slow process. Have you checked how efficient these REPLACE INTO queries are? Can you try removing them and then see if that has any impact on the UPDATE process?
There are a couple of SELECT queries too in getNumberOfPeople(). Check if they might be impacting the process and if so, try optimizing them.
In procedure updateGPCD(), you may try replacing SELECT COUNT(*) INTO _has_breakdown with SELECT COUNT(1) INTO _has_breakdown. In the same query, the WHERE condition is reading _ACCOUNT but this will fail when _ACCOUNT = 0, no?
On another note, if it is the UPDATE that you think is slow because of reason 1, it might make sense to move the column being updated, gpcd, out of usage_bill into another table. The only other column in that table should be the unique ID from usage_bill.
Hope the above makes sense.

Does MySQL (MyISAM) fill table holes in a multirow insert?

I'm working on a project for which I need to frequently insert ~500 or so records at a remote location. I will be doing this in a single INSERT to minimize network traffic. I do, however, need to know the exact id field (AUTO_INCREMENT) values.
My web searches seem to indicate I could probably use the last_insert_id and calculate all the id values from there. However, this wouldn't work if the rows get ids that are all over the place.
Could anyone please clarify what would or should happen, and if the mathematical solution is safe?
A multi-row insert is an atomic operation in MySQL (both MyISAM and InnoDB). Since the table will be locked for writing during this operation, no other rows will be inserted/updated during its execution.
This means the IDs will in fact be consecutive (unless the auto_increment_increment option is set to something other than 1).
Auto increment does exactly that, it auto-increments - i.e. each new row gets the numerically next ID. MySQL does not re-use IDs of rows that were deleted.
Your solution is safe because write operations acquire a table lock, so no other inserts can happen while your operation completes - so you will get n contiguous auto-increment values for n inserted rows.
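To make the no-reuse point concrete (log is a hypothetical table with an AUTO_INCREMENT id column):
INSERT INTO log (msg) VALUES ('a'), ('b'), ('c');  -- gets ids 1, 2, 3
DELETE FROM log WHERE id = 2;                      -- leaves a hole at id 2
INSERT INTO log (msg) VALUES ('d');                -- gets id 4; the hole at 2 is never refilled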
