I'm working on a project for which I need to frequently insert ~500 records at a remote location. I will be doing this in a single INSERT to minimize network traffic. I do, however, need to know the exact id field (AUTO_INCREMENT) values.
My web searches seem to indicate I could probably use LAST_INSERT_ID() and calculate all the id values from there. However, this wouldn't work if the rows get ids that are scattered all over the place.
Could anyone please clarify what would or should happen, and if the mathematical solution is safe?
A multirow insert is an atomic operation in MySQL (both MyISAM and InnoDB). Since the table will be locked for writing during this operation, no other rows will be inserted or updated while it executes.
This means the IDs will in fact be consecutive (unless the auto_increment_increment option is set to something other than 1).
Auto-increment does exactly that: it auto-increments, i.e. each new row gets the numerically next ID. MySQL does not reuse IDs of rows that were deleted.
Your solution is safe because write operations acquire a table lock, so no other inserts can happen while your operation completes; you will get n contiguous auto-increment values for n inserted rows.
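For illustration, a minimal sketch of the arithmetic (the table and column names are made up):

INSERT INTO remote_records (payload)
VALUES ('a'), ('b'), ('c');          -- ... up to ~500 rows in one statement

SELECT LAST_INSERT_ID() AS first_id, -- id generated for the FIRST row of the batch
       ROW_COUNT()      AS inserted; -- number of rows the INSERT just added
-- With auto_increment_increment = 1, the ids are
-- first_id, first_id + 1, ..., first_id + inserted - 1.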
I have a table which has an id (primary key with auto-increment), a uid (a key referring to a user's id, for example) and something else which doesn't matter for my question.
I want to make, let's call it, a different auto-increment sequence on id for each uid value.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. Then I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such functionality natively? (non-Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know of a non-SQL database engine providing such functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
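From memory, the manual's example is along these lines (treat it as a sketch rather than a verbatim copy):

CREATE TABLE animals (
    grp ENUM('fish','mammal','bird') NOT NULL,
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    name CHAR(30) NOT NULL,
    PRIMARY KEY (grp, id)
) ENGINE=MyISAM;

INSERT INTO animals (grp, name) VALUES
    ('mammal','dog'), ('mammal','cat'),
    ('bird','penguin'), ('fish','lax'),
    ('mammal','whale'), ('bird','ostrich');

-- Each grp gets its own id sequence: mammal 1..3, bird 1..2, fish 1.
SELECT * FROM animals ORDER BY grp, id;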
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. He gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. He also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It has to be the whole table: since you're trying to restrict INSERTs, there is no specific row to lock (if you were governing access to a given row with UPDATE, you could lock just that row). But locking the table makes access to it serial, which limits your throughput. (A sketch of this approach follows this list.)
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
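A rough sketch of the table-lock approach, using made-up table and column names:

LOCK TABLES my_table WRITE;

-- The MAX() is computed while we hold the write lock, so no other
-- session can squeeze an insert in between the read and the write.
INSERT INTO my_table (userid, id, other_col)
SELECT 4, COALESCE(MAX(id), 0) + 1, 'some value'
FROM my_table
WHERE userid = 4;

UNLOCK TABLES;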
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask about efficiency. Unless you are dealing with extreme volumes, storing an 8-byte DATETIME isn't much of an overhead compared to using, for example, a 4-byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than not MS/Oracle), the general logic is simple (a sketch follows the lists below)...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
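A minimal MySQL-flavoured sketch of such a trigger; the table and column names are assumptions, and in MySQL the assignment happens via NEW in a BEFORE INSERT trigger rather than a separate UPDATE. The concurrency caveats above still apply:

DELIMITER //
CREATE TRIGGER amendment_before_insert
BEFORE INSERT ON property_amendment
FOR EACH ROW
BEGIN
    -- Only fill in amendment_id if the caller didn't supply one.
    IF NEW.amendment_id IS NULL THEN
        SET NEW.amendment_id = (
            SELECT COALESCE(MAX(amendment_id), 0) + 1
            FROM property_amendment
            WHERE property_id = NEW.property_id
        );
    END IF;
END//
DELIMITER ;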
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
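Again MySQL-flavoured and with assumed names, a sketch of the stored-procedure route; the SELECT ... FOR UPDATE serialises concurrent inserts for the same property_id:

DELIMITER //
CREATE PROCEDURE insert_amendment(IN p_property_id INT, IN p_details TEXT)
BEGIN
    DECLARE next_id INT;

    START TRANSACTION;

    -- Lock the existing rows for this property while we pick the next number.
    SELECT COALESCE(MAX(amendment_id), 0) + 1 INTO next_id
    FROM property_amendment
    WHERE property_id = p_property_id
    FOR UPDATE;

    INSERT INTO property_amendment (property_id, amendment_id, details)
    VALUES (p_property_id, next_id, p_details);

    COMMIT;
END//
DELIMITER ;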
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer; you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
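A sketch of what that looks like, using the column names from this answer and assuming everything else:

CREATE TABLE entity_version (
    entity_id      INT NOT NULL,
    version_number INT NOT NULL,
    payload        TEXT,
    PRIMARY KEY (entity_id, version_number)
);

-- Most recent version of one entity: constrain the version number directly.
SELECT *
FROM entity_version v
WHERE v.entity_id = 42
  AND v.version_number = (SELECT MAX(version_number)
                          FROM entity_version
                          WHERE entity_id = 42);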
I am having a strange issue with a MySQL query. I have a table with the fields slno, mobileno, contractor, with slno as the primary key and auto-increment starting at 1. While testing with, say, up to 100 records, the count and the auto-increment values are the same. So I truncated the table to reset the auto-increment and inserted a huge Excel file with around 40k rows via PHP, then issued a SELECT query which shows the max of slno is 40000 as expected, but the count shows 39920. I am puzzled and tried searching Google; maybe my lack of keyword-search skill prevented me from finding a result, so I am posting here. For reference I added a screenshot. Any ideas and clarifications? Thanks
EDIT:
min slno is 1
EDIT:
A related question with a solution for finding gaps in auto-increment numbering in MySQL has been asked and answered here.
There are specific cases in which auto-incremented values can be lost. One example is if you roll back an insertion. As per the doco:
"Lost" auto-increment values and sequence gaps
In all lock modes (0, 1, and 2), if a transaction that generated auto-increment values rolls back, those auto-increment values are "lost". Once a value is generated for an auto-increment column, it cannot be rolled back, whether or not the "INSERT-like" statement is completed, and whether or not the containing transaction is rolled back. Such lost values are not reused. Thus, there may be gaps in the values stored in an AUTO_INCREMENT column of a table.
In that case, although the insert is backed out, the auto-increment may not be. That would certainly allow for the possibility that your bulk insertion from Excel is occasionally failing and retrying, with the subsequent retry working. It really depends on how your insertion process works.
In any case, assuming those values will always be contiguous is actually a bad assumption to make.
This is because, even if insertions were guaranteed to be contiguous, it's possible to delete rows which would result in gaps appearing. You can certainly fix this each time you delete (or bulk insert for that matter) but the workload is high - you basically have to find gaps and then "move" higher entries into those gaps.
This movement is likely to be non-trivial as it's most likely that there will be other tables holding key look-ups to that column, and each of those will need to be changed as well.
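For reference, finding the gaps can be done with a self-join along these lines (the table name is made up; slno is the auto-increment column from the question):

-- Each row whose successor is missing marks the start of a gap.
SELECT t1.slno + 1 AS gap_starts_at
FROM mytable t1
LEFT JOIN mytable t2 ON t2.slno = t1.slno + 1
WHERE t2.slno IS NULL
  AND t1.slno < (SELECT MAX(slno) FROM mytable);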
So the best use of an auto-increment field is simply to provide a unique identifier for each row where no other one exists, not to be necessarily contiguous.
I have a billion-row table that no longer fits in memory.
When I insert new rows in bulk, the overhead of maintaining the primary index kills the performance. I HAVE to have this index because otherwise SELECT statements are really slow. But since the inserts come in a random order, with each row inserted the data has to be written to a different area of the disk.
And since the HDD is capped at 200 I/O operations per second, this slows the inserting to a crawl.
Can I "have my cake and eat it" at the same time in this situation? Maybe by creating another table in which the data would be grouped by different column ( by having a different primary key )? But this seems wasteful to me and I don't even know if that would help...
Or maybe I could use some staging table? Insert there 1,000,000 rows and then insert them to the target table, grouped up by the primary key?
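Something along these lines, with made-up table and column names:

CREATE TABLE staging LIKE target;

-- ...bulk-load ~1,000,000 rows into staging in whatever order they arrive...

-- Flush the batch into the target in primary-key order, so writes hit the
-- clustered index roughly sequentially instead of at random disk locations.
INSERT INTO target
SELECT * FROM staging
ORDER BY pk_col;

TRUNCATE TABLE staging;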
Am I doomed?
EDIT:
I've partitioned the table horizontally.
When I removed the primary key from the field that I need and placed it on the auto-increment field, the inserts were blazingly fast.
Unfortunately, since the data on disk is ordered by the primary key value, this killed the SELECT performance, because selects don't query by the auto-increment value but by the PK value.
So either I insert rows fast or I select them fast. Isn't there any solution that could help in both cases?
When you insert rows one at a time, the index is updated after every insert, which takes more time.
You can wrap the inserts in a transaction:
START TRANSACTION
...your insert queries...
COMMIT
Try it like this:
mysql_query("START TRANSACTION");
// ...run your insert queries here...
mysql_query("COMMIT");
In my application, I have a lot of foreign key dependencies, and I often insert large numbers of rows. What I have done up until now is run a single insert at a time and record the insert ID. This tends to take a long time when inserting a large number of rows, even when Apache and MySQL run on the same server.
My question is, if I were to alter my application to add a number of rows with a single INSERT, would I be able to assume the ids of each row based strictly upon the last insert id returned by the mysql connection? The issue is that there is the occasional situation where more than one person will be putting large amounts of information into the database at a time.
From what I have been able to determine, it seems safe to assume that when you insert 500 rows, your insert ids will range from (lastInsertID) to (lastInsertID+499), regardless of whether a query from another connection has begun or ended in the time it took to complete, but I want to be sure this is accepted as safe practice.
I am primarily running InnoDB, but there is the occasional MyISAM in there as well.
Thanks All!
-Jer
The mysql_insert_id function, and the now-recommended alternative mysqli_insert_id, returns the AUTO_INCREMENT id of the first row inserted by the last query you ran.
MySQL grouped INSERT statements assign AUTO_INCREMENT values sequentially from that first entry's index. So yes, it is safe to assume that the entries inserted by a single INSERT statement are contiguous.
A comment from the MySQL reference: "The ID that was generated is maintained in the server on a per-connection basis. This means that the value returned by the function to a given client is the first AUTO_INCREMENT value generated for the most recent statement affecting an AUTO_INCREMENT column by that client. This value cannot be affected by other clients, even if they generate AUTO_INCREMENT values of their own. This behavior ensures that each client can retrieve its own ID without concern for the activity of other clients, and without the need for locks or transactions."
There is a large table that holds millions of records. phpMyAdmin reports a size of 1.2G for the table.
A calculation needs to be done for every row. The calculation is not simple (it cannot be expressed as SET col = calc); it uses a stored function to get the values, so currently we run a single UPDATE for each row.
This is extremely slow and we want to optimize it.
Stored function:
https://gist.github.com/a9c2f9275644409dd19d
And this is called by this method for every row:
https://gist.github.com/82adfd97b9e5797feea6
This is performed on an off-live server, and usually it is run once per week.
What options do we have here?
Why not set up a separate table to hold the computed values and take the load off your current table? It can have two columns: the primary key of each row in your main table, and a column for the computed value.
Then your process can be (a sketch follows these steps):
a) Truncate the computedValues table - this is faster than trying to identify new rows
b) Compute the values and insert them into the computedValues table
c) Whenever you need your computed values, you join to the computedValues table using a primary-key join, which is fast; and if you need more computations, you just add new columns
d) You can also update the main table from the computed values if you have to
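Roughly like this; usage_bill and gpcd are taken from the question, everything else (names, the placeholder for the linked stored function) is assumed:

CREATE TABLE computedValues (
    usage_bill_id INT NOT NULL PRIMARY KEY,  -- PK of the main table
    gpcd          DECIMAL(10,2)              -- the computed value
);

-- a) reset
TRUNCATE TABLE computedValues;

-- b) recompute everything in one pass
INSERT INTO computedValues (usage_bill_id, gpcd)
SELECT ub.id, your_stored_function(ub.id)    -- placeholder for the linked function
FROM usage_bill ub;

-- c) read the computed value via a primary-key join
SELECT ub.*, cv.gpcd
FROM usage_bill ub
JOIN computedValues cv ON cv.usage_bill_id = ub.id;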
Well, the problem doesn't seem to be the UPDATE query itself, because no calculations are performed in the query. It seems the calculations are performed first and then the UPDATE query is run, so the UPDATE should be quick enough.
When you say "this is extremely slow", I assume you are not referring to the UPDATE query but the complete process. Here are some quick thoughts:
As you said, there are millions of records, and updating that many entries is always time-consuming. If there are many columns and indexes defined on the table, that adds to the overhead.
I see that there are many REPLACE INTO queries in the function getNumberOfPeople(). These might well be a reason for the slow process. Have you checked how efficient these REPLACE INTO queries are? Can you try removing them and then see if that has any impact on the UPDATE process?
There are a couple of SELECT queries too in getNumberOfPeople(). Check if they might be impacting the process and if so, try optimizing them.
In the procedure updateGPCD(), you may try replacing SELECT COUNT(*) INTO _has_breakdown with SELECT COUNT(1) INTO _has_breakdown. In the same query, the WHERE condition reads _ACCOUNT, but won't this fail when _ACCOUNT = 0?
As another suggestion, if it is the UPDATE that you think is slow because of reason 1, it might make sense to move the column being updated, gpcd, out of usage_bill and into another table. The only other column in that table should be the unique ID from usage_bill.
Hope the above makes sense.