Generating my own Eloquent model insert IDs - How to avoid PK Collisions?

Maybe this is a stupid question because I should defer PK increments to MySql itself, but I'm in a weird situation.
Basically, to handle versioning and approvals in my system, I have a revision_batches table, which is a collection of things in a submission that a user wishes to insert or update in the database. It has columns like batch_id, the user_id of the submitter, and an approved value.
It also is parent to a collection of items in the revisions table. The revisions table has things like table_name, key, old_value, and new_value. I use this to store the changes someone wishes to make that may not be approved automatically.
When someone who doesn't have permission on, say, the "tasks" table changes the name of a task, a new revision_batch will be created, and a new revision will be created with table_name="tasks", key=[whatever the task's ID is], old_value="my old task name", new_value="my new task name".
When an approver approves of this batch, my code will rocket through the revisions in the batch and perform the update or inserts to the database.
My problem is when performing parent-child relationships within the same batch. If I'm creating a new task and want to assign a task_item to it, in the same batch, then I need to know what PK the task is getting so that I can give the task_item a "task_id".
If I'm handling the creation of a new revision for a task, I might do something like a
select max(id)+1 as newId from tasks
to inject as the new id. But since I might already have a pending task insert revision with that ID or higher, I also check
select max(key) + 1 as newId
from revisions
inner join revision_batches on revisions.batch_id = revision_batches.id
where table_name='tasks' and approved = 'P'
for a higher id to assign. That way, if I have ids 1-9 in the tasks table and 10-12 pending in the revisions table, any new direct insert using Laravel's Eloquent model class is overridden to check both tasks and revisions and will insert with id 13. This avoids collisions between actual cemented rows and possible revision rows. It also allows me to create a parent and many layers of children within a single batch because I determine their IDs as I go along.
This all works fine.
My problem is that if I have two revision creations happening at the exact same time (like, within a millisecond), they'll both fetch the same next ID to use, both create revisions where key = the same number, and then only one will get through and the other fails on a PK collision.
My question is: is there a way to force this to be thread safe or to be done synchronously, to avoid two instances of the same controller method executing at the same time and both fetching the same ID to use? Can I lock a method down to a single instance at a time? If not, is there a better way I could be handling PK generation? The only reason I do this is to know beforehand the key to insert. But since custom code in the framework is handling PK generation and not the database, it's causing me this major issue. It happens sporadically, but only when I force the same method to execute maybe 4 times at the same time.
I know that I could avoid the majority of cases where I have many things being inserted at the exact same time, but that doesn't mean that two users won't randomly hit enter at the same time in the future and recreate this issue.
Any ideas?
Thanks!

For this type of issue I use UUID version 4 (universally unique identifiers). My case is a little bit different: I have a system running in 74 different locations and need to extract all the transaction records and integrate them into a consolidation system, so my PKs need to be unique across all servers to avoid collisions.
In Laravel I use this excellent package to generate the UUID.
I hope this works for you.
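To illustrate the idea outside Laravel, here is a minimal Python sketch (not the package mentioned above). It assumes revisions are stored as simple dictionaries; build_batch, new_id and the field layout are hypothetical names chosen for the example. The point is only that a version-4 UUID can be chosen before anything is inserted, so a child can reference its parent within the same batch without asking the database for the next ID.

```python
import uuid


def new_id() -> str:
    """Return a random version-4 UUID as a 36-character string."""
    return str(uuid.uuid4())


def build_batch(task_name, item_names):
    """Build pending revision rows for a new task and its task_items.

    Keys are generated up front, so each child can carry the parent's
    key before any row has been written to the database.
    """
    task_id = new_id()
    revisions = [
        {"table_name": "tasks", "key": task_id,
         "new_value": {"name": task_name}},
    ]
    for name in item_names:
        revisions.append(
            {"table_name": "task_items", "key": new_id(),
             "new_value": {"name": name, "task_id": task_id}}
        )
    return revisions


batch = build_batch("my new task", ["first item", "second item"])
```

The trade-off is that the key column has to hold a 36-character string (or 16 bytes) instead of an integer.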

Use Queues for saving your revisions.
Jobs on a queue that is consumed by a single worker are processed one at a time, so two revisions can never fetch the same next ID and the key collision cannot occur.
Source: http://laravel.com/docs/4.2/queues
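As a rough illustration of why a single consumer removes the race, here is a Python sketch using the standard queue and threading modules rather than Laravel's queue system. The starting id of 13 and the job names are assumptions made for the example. Many producers submit at once, but only one worker ever reads and increments the counter.

```python
import queue
import threading

next_id = 13      # assumed starting point: ids 1-12 are already taken
assigned = []     # (job, id) pairs in the order they were processed
jobs = queue.Queue()


def worker():
    """Single consumer: jobs are handled strictly one at a time."""
    global next_id
    while True:
        job = jobs.get()
        if job is None:       # sentinel value: stop the worker
            break
        assigned.append((job, next_id))
        next_id += 1


def submit(n):
    """Producer: simulates a request that creates one revision."""
    jobs.put(f"revision-{n}")


consumer = threading.Thread(target=worker)
consumer.start()

producers = [threading.Thread(target=submit, args=(n,)) for n in range(50)]
for t in producers:
    t.start()
for t in producers:
    t.join()

jobs.put(None)    # all producers are done, so the sentinel is last
consumer.join()
```

This only holds while exactly one worker consumes the queue; running several workers in parallel brings the original race back.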

Related

SQL - auto increment within group inside one table [duplicate]

I have a table which has an id (primary key with auto increment), a uid (a key referring to a user's id, for example) and something else which won't matter for my question.
I want to make, let's call it, a different auto-increment sequence on id for each uid value.
So, if I add an entry with uid 10, the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. If I add a new one with uid 4, its id will be 3 because there were already two entries with uid 4.
...Very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway, I am curious.
Thanks.

MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table because you're trying to restrict INSERT, so there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just the specific row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
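The first option (serialize the writers with a lock) can be sketched in Python with SQLite. This is an illustration under assumptions, not MySQL code: SQLite has no table locks, so BEGIN IMMEDIATE, which takes the database-wide write lock, stands in for the exclusive table lock, and the entries table and insert_entry function are hypothetical names.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the explicit BEGIN/COMMIT statements below control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE entries ("
    " uid INTEGER NOT NULL,"
    " id  INTEGER NOT NULL,"
    " PRIMARY KEY (uid, id))"
)


def insert_entry(conn, uid):
    """Insert a row for uid, numbering it MAX(id)+1 within that uid."""
    # Take the write lock BEFORE reading MAX(id), so no other writer
    # can compute the same value between our read and our insert.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (next_id,) = conn.execute(
            "SELECT COALESCE(MAX(id), 0) + 1 FROM entries WHERE uid = ?",
            (uid,),
        ).fetchone()
        conn.execute(
            "INSERT INTO entries (uid, id) VALUES (?, ?)", (uid, next_id)
        )
        conn.execute("COMMIT")
        return next_id
    except Exception:
        conn.execute("ROLLBACK")
        raise


ids = [insert_entry(conn, 4), insert_entry(conn, 4),
       insert_entry(conn, 10), insert_entry(conn, 4)]
```

As the answer says, this makes inserts serial, and gaps can still appear after rollbacks or deletes.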

SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.

In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
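For the trigger route, the four steps above might look like the following in SQLite, driven from Python. This is a sketch under assumptions: the amendments table, its note column and the trigger name are made up for the example, and trigger syntax differs in other engines. It also does nothing about the concurrency concerns listed above; SQLite simply allows one writer at a time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE amendments (
    property_id  INTEGER NOT NULL,
    amendment_id INTEGER,          -- filled in by the trigger
    note         TEXT
);

-- Runs only when the caller did not supply amendment_id.
CREATE TRIGGER number_amendment AFTER INSERT ON amendments
WHEN NEW.amendment_id IS NULL
BEGIN
    UPDATE amendments
    SET amendment_id = (
        -- MAX ignores the new row because its amendment_id is still NULL
        SELECT COALESCE(MAX(amendment_id), 0) + 1
        FROM amendments
        WHERE property_id = NEW.property_id
    )
    WHERE rowid = NEW.rowid;
END;
""")

with conn:  # commit on success, roll back on error
    conn.executemany(
        "INSERT INTO amendments (property_id, note) VALUES (?, ?)",
        [(1, "first"), (1, "second"), (2, "other property")],
    )

numbered = conn.execute(
    "SELECT property_id, amendment_id FROM amendments ORDER BY rowid"
).fetchall()
```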

It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
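A small sketch of the multi-part key, again with Python and SQLite. The entity_versions table and its payload column are hypothetical. It shows the composite primary key and the kind of query that constrains the version number to the latest one per entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entity_versions (
    entity_id      INTEGER NOT NULL,
    version_number INTEGER NOT NULL,
    payload        TEXT,
    PRIMARY KEY (entity_id, version_number)   -- the multi-part key
)
""")
with conn:
    conn.executemany(
        "INSERT INTO entity_versions VALUES (?, ?, ?)",
        [(1, 1, "draft"), (1, 2, "final"), (2, 1, "only")],
    )

# Latest version of every entity: keep the row whose version_number
# equals the maximum for that entity_id.
latest = conn.execute("""
SELECT v.entity_id, v.version_number, v.payload
FROM entity_versions AS v
WHERE v.version_number = (
    SELECT MAX(i.version_number)
    FROM entity_versions AS i
    WHERE i.entity_id = v.entity_id
)
ORDER BY v.entity_id
""").fetchall()
```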

mysql historical data and record id

I am setting up a new part of an application with historical data requirements for the transactions table in MySQL. In the old version, transactions were not historical and had a structure like this:
id|buyerid|prodid|price|status
And other fields, with the id being referenced in links to access Transaction Details page, as well as used as foreign key in other tables across the application to reference particular transactions for various purposes.
Now the requirement is to answer reporting questions like "Show all transactions that had a particular status in Feb 2014" AND "What did a transaction look like in Feb 2014".
The new design I'm testing at the moment is below:
id|buyerid|prodid|price|status|active|start_date|end_date
Here active indicates the latest record and start_date is when the record was created. No records are modified; instead, end_date is populated and a new record is created with the same details plus the modification.
Now the question is - what to do about transaction id field? Because in this new design it is more of a history id, and can not be used for a foreign key across the application since it is going to change with every update.
I can think of two options:
Create a separate table, transaction_ids, with just one column (tid, an auto-increment primary key), and add a foreign key column for tid to the main transactions table. Every time a brand new transaction is created, insert into the ids table and use that id as the tid to trace this particular transaction across the system.
Use the buyerid and prodid combination as the key. It is always unique in my application; no buyer can get the same product twice.
Is the second solution better? Does anyone know of a better way to handle this?

What you are trying to achieve is called Event Sourcing.
Think in terms of events changing the status of your transaction, rather than tracing the status itself in time.
You still have your transaction with its own primary key, and you rebuild the current (or past) status applying each event.
I would also suggest you to start coding your business models, and only after that, to think about the persistence and the best way to map it to a database.

Second Solution looks better although I will say that there is a lot of ambiguity in your question.
I am saying that second solution is better because the transaction_ids table which you are talking about in solution 1 is basically REDUNDANT. It is not solving any purpose. Even if the transaction id is repeating itself in the transaction table, it does not mean that you need to have a separate table to generate the ids and make it as PK-FK relation. Most probably you will still be querying the data by user-id and prod-id and not by transaction-id
Basically what you need is some kind of audit history table where you insert a record for every operation/transaction/modification done and capture some basic details like - Username, Date/time, old value, new value etc. You do not need status or start date and end date columns. Once a record is inserted in this audit history table then it is never going to be touched again.
You will have to design your report carefully.

Taking two previous answers into consideration, here is the solution I will go with: All of the data updates in my application come through one single function, that is already set up to audit particular fields of my choosing, so I will mark the transaction status to be audited among the others. Table structure for the audit table is similar to this:
|id|table|table_id|column|old_val|new_val|who|when|
Only that there is a bit more advanced object mapping via object id's instead of simple table name. I can then use this data in a Join to the main, normal not historical transactions table to provide the reporting required.
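To show how such an audit table could answer "what was the status in Feb 2014", here is a hedged Python/SQLite sketch. The column names tbl, col and changed_at replace table, column and when (which are reserved words), the sample data is invented, and status_at is a hypothetical helper. It assumes the transaction already existed on the date asked about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE audit (
    id INTEGER PRIMARY KEY, tbl TEXT, table_id INTEGER, col TEXT,
    old_val TEXT, new_val TEXT, who TEXT, changed_at TEXT  -- ISO dates
);
INSERT INTO transactions VALUES (1, 'shipped'), (2, 'new');
INSERT INTO audit (tbl, table_id, col, old_val, new_val, who, changed_at)
VALUES ('transactions', 1, 'status', 'new',  'paid',    'ann', '2014-01-15'),
       ('transactions', 1, 'status', 'paid', 'shipped', 'bob', '2014-03-02');
""")


def status_at(conn, txn_id, as_of):
    """Return the status a transaction had on the ISO date as_of."""
    # 1. The last change on or before the date decides the status.
    row = conn.execute(
        "SELECT new_val FROM audit WHERE tbl = 'transactions'"
        " AND table_id = ? AND col = 'status' AND changed_at <= ?"
        " ORDER BY changed_at DESC LIMIT 1", (txn_id, as_of)).fetchone()
    if row:
        return row[0]
    # 2. Only later changes exist: the earliest one holds the old value.
    row = conn.execute(
        "SELECT old_val FROM audit WHERE tbl = 'transactions'"
        " AND table_id = ? AND col = 'status'"
        " ORDER BY changed_at ASC LIMIT 1", (txn_id,)).fetchone()
    if row:
        return row[0]
    # 3. Never changed: the current value is also the historical one.
    return conn.execute(
        "SELECT status FROM transactions WHERE id = ?", (txn_id,)
    ).fetchone()[0]
```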

What is an elegant / efficient way of storing the status of 100 lessons for multiple users?

I'm working on an app in JavaScript, jQuery, PHP & MySQL that consists of ~100 lessons. I am trying to think of an efficient way to store the status of each user's progress through the lessons, without having to query the MySQL database too much.
Right now, I am thinking the easiest implementation is to create a table for each user, and then store each lesson's status in that table. The only problem with that is if I add new lessons, I would have to update every user's table.
The second implementation I considered would be to store each lesson as a table, and record the user ID for each user that completed that lesson there - but then generating a status report (what lessons a user completed, how well they did, etc.) would mean pulling data from 100 tables.
Is there an obvious solution I am missing? How would you store your users' progress through 100 lessons, so it's quick and simple to generate a status report showing their progress?
Cheers!

The table structure I would recommend would be to keep a single table with non-unique fields userid and lessonid, as well as the relevant progress fields. When you want the progress of user x on lesson y, you would do this:
SELECT * FROM lessonProgress WHERE userid=x AND lessonid=y LIMIT 1;
You don't need to worry about performance unless you see that it's actually an issue. Having a table for each user or a table for each lesson are bad solutions because there aren't meant to be a dynamic number of tables in a database.
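A minimal sketch of that single-table layout, using Python and SQLite instead of PHP and MySQL. The status and score columns, the composite primary key on (userid, lessonid) and the record/report helpers are assumptions added for the example; the answer itself only names userid and lessonid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE lessonProgress (
    userid   INTEGER NOT NULL,
    lessonid INTEGER NOT NULL,
    status   TEXT NOT NULL,
    score    INTEGER,
    PRIMARY KEY (userid, lessonid)   -- one row per user per lesson
)
""")


def record(conn, userid, lessonid, status, score=None):
    """Create or overwrite the progress row for one user and lesson."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO lessonProgress VALUES (?, ?, ?, ?)",
            (userid, lessonid, status, score),
        )


def report(conn, userid):
    """Status report for one user: a single query, however many lessons."""
    return conn.execute(
        "SELECT lessonid, status, score FROM lessonProgress"
        " WHERE userid = ? ORDER BY lessonid", (userid,)
    ).fetchall()


record(conn, 1, 1, "started")
record(conn, 1, 1, "completed", 90)
record(conn, 1, 2, "started")
record(conn, 2, 1, "completed", 75)
```

Adding lesson 101 later needs no schema change; rows for it simply start to appear.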

If reporting is restricted to one user at a time - that is, when generating a report, it's for a specific user and not a large clump of users - why not consider JSON (JavaScript Object Notation) stored in a file? If extensibility is key, it would make it a simple matter.
Obviously, if you're going to run reports against an arbitrarily large number of users at once, separate data files would become inefficient.
Discarding the efficiency argument, json would also give you a very human-readable and interchangeable format.
Lastly, if the security of the report output isn't a big sticking point, you'd also gain the ability to easily offload view rendering onto the client.

Use relations between 2 tables. One for users with user-specific columns like ID, username, email, and whatever else you want to store about them.
Then a status table that has a UID foreign key. ID UID Status etc.
It's good to keep datecreated and dateupdated on tables as well.
Then just join the tables ON status.UID = users.ID

A good option would be to create one table with a user_ID as primary key and a status (int); each row of the table will represent a user. Accessing a user's progress would be fast and simple since you have an index on user IDs.
This way, adding new lessons would not make you change the DB.

How can I update multiple tables while guaranteeing no duplicate ids?

I'm used to building websites with user accounts, so I can simply auto-increment the user id, then let them log in while I identify that user by user id internally. What I need to do in this case is a bit different. I need to anonymously collect a few rows of data from people, and tie those rows together so I can easily discern which data rows belong to which user.
The difficulty I'm having is in generating the id to tie the data rows together. My first thought was to poll the database for the highest user ID in existence, and write to the database with user ID +1. This will fail, however, if two submissions poll the database before either of them writes to it - they will each share the same user ID.
Another thought I had was to create a separate user ID table that would be set to auto-increment, and simply generate a new row, then poll that table for the id of the last row created. That also fails for the same reason as above - if two submissions create a row before either of them polls for the latest user ID, then they'll end up sharing an ID.
Any ideas? I get the impression I'm missing something obvious.

I think I'm understanding you right; I was having a similar issue. There's a super handy PHP function, though. After you run the query that inserts a new row and auto-increments the user ID, do:
$user_id = mysql_insert_id();
That just returns the auto-increment value from the previous query on the current MySQL connection. You can read more about it here if you need to. Note that the mysql_* extension was removed in PHP 7; mysqli_insert_id() and PDO::lastInsertId() are the current equivalents.
You can then use this to populate the second table's data, being sure nobody will get a duplicate ID from the first one.

You need to insert the user, get the auto-generated id, and then use that id as a foreign key in the couple of rows you need to associate with the parent record. The hat rack must exist before you can hang hats on it.

This is a common issue, and to solve it, you would use a transaction. This gives you atomicity: you can do more than one thing, but have it all succeed or fail as a package. It's an advanced db feature, and does require awareness of some more advanced programming in order to implement it in as fault-tolerant a manner as possible.
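The answers in this thread combine naturally: insert the parent row, read back the id the database generated on this connection, and write the child rows in the same transaction. Here is a sketch in Python with SQLite, where cursor.lastrowid plays the role of the PHP insert-id function. The submissions and answers tables and save_submission are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE answers (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    submission_id INTEGER NOT NULL REFERENCES submissions(id),
    value         TEXT NOT NULL
);
""")


def save_submission(conn, values):
    """Store one anonymous submission and its data rows atomically."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute("INSERT INTO submissions DEFAULT VALUES")
        # lastrowid is per cursor/connection, so a concurrent insert
        # on another connection cannot change what we read here.
        submission_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO answers (submission_id, value) VALUES (?, ?)",
            [(submission_id, v) for v in values],
        )
    return submission_id


first = save_submission(conn, ["a", "b"])
second = save_submission(conn, ["c"])
```

Because the generated id is read from the same connection that performed the insert, two simultaneous submissions cannot end up sharing it.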

How to do monthly refresh of large DB tables without interrupting user access to them

I have four DB tables in an Oracle database that need to be rewritten/refreshed every week or every month. I am writing this script in PHP using the standard OCI functions, that will read new data in from XML and refresh these four tables. The four tables have the following properties
TABLE A - up to 2mil rows, one primary key (One row might take max 2K data)
TABLE B - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 1100 bytes of data)
TABLE C - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 1100 bytes of data)
TABLE D - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 120 bytes of data)
So I need to repopulate these tables without damaging the user experience. I obviously can't delete the tables and just repopulate them as it is a somewhat lengthy process.
I've considered just a big transaction where I DELETE FROM all of the tables and just regenerate them. I get a little concerned about the length of the transaction (don't know yet but it could take an hour or so).
I wanted to create temp table replicas of all of the tables and populate those instead. Then I could DROP the main tables and rename the temp tables. However, you can't do the DROP and ALTER TABLE statements within a transaction as they always do an auto commit. This should be able to be done quickly (four DROP and four ALTER TABLE statements), but it can't guarantee that a user won't get an error within that short period of time.
Now, as a combination of the two ideas, I'm considering populating the temp tables, then doing a DELETE FROM on all four original tables and then an INSERT INTO from the temp tables to repopulate the main tables. Since there are no DDL statements here, this would all work within a transaction. Then, however, I'm wondering if the memory it takes to process some 60 million records within a transaction is going to get me in trouble (this would be a concern for the first idea as well).
I would think this would be a common scenario. Is there a standard or recommended way of doing this? Any tips would be appreciated. Thanks.

You could have a synonym for each of your big tables. Create new incarnations of your tables, populate them, drop and recreate the synonyms, and finally drop your old tables. This has the advantage of (1) only one actual set of DML (the inserts) avoiding redo generation for your deletes and (2) the synonym drop/recreate is very fast, minimizing the potential for a "bad user experience".
Reminds me of a minor peeve of mine about Oracle's synonyms: why isn't there an ALTER SYNONYM command?

I'm assuming your users don't actually modify the data in these tables since it is deleted and reloaded from another source every week, so it doesn't really matter if you lock the tables for a full hour. The users can still query the data; you just have to size your rollback segment appropriately. A simple DELETE+INSERT therefore should work fine.
Now if you want to speed things up AND if the new data differs little from the previous data, you could load the new data into temporary tables and update the tables with the delta, using a combination of MERGE+DELETE like this:
Setup:
CREATE TABLE a (ID NUMBER PRIMARY KEY, a_data CHAR(200));
CREATE GLOBAL TEMPORARY TABLE temp_a (
ID NUMBER PRIMARY KEY, a_data CHAR(200)
) ON COMMIT PRESERVE ROWS;
-- Load A
INSERT INTO a
(SELECT ROWNUM, to_char(ROWNUM) FROM dual CONNECT BY LEVEL <= 10000);
-- Load TEMP_A with extra rows
INSERT INTO temp_a
(SELECT ROWNUM + 100, to_char(ROWNUM + 100)
FROM dual
CONNECT BY LEVEL <= 10000);
UPDATE temp_a SET a_data = 'x' WHERE mod(ID, 1000) = 0;
This MERGE statement will insert the new rows and update the old rows only if they are different:
SQL> MERGE INTO a
USING (SELECT temp_a.id, temp_a.a_data
FROM temp_a
LEFT JOIN a ON (temp_a.id = a.id)
WHERE decode(a.a_data, temp_a.a_data, 1) IS NULL) temp_a
ON (a.id = temp_a.id)
WHEN MATCHED THEN
UPDATE SET a.a_data = temp_a.a_data
WHEN NOT MATCHED THEN
INSERT (id, a_data) VALUES (temp_a.id, temp_a.a_data);
Done
You will then need to delete the rows that aren't in the new set of data:
SQL> DELETE FROM a WHERE a.id NOT IN (SELECT temp_a.id FROM temp_a);
100 rows deleted
You would insert into A first and then into the child tables, and delete in the reverse order.

Am I the only one (except Vincent) who would first test the simplest possible solution, i.e. DELETE/INSERT, before trying to build something more advanced?
"Then, however, I wondering if the memory it takes to process some 60 million records within a transaction is going to get me in trouble (this would be a concern for the first idea as well)."
Oracle manages memory quite well, it hasn't been written by a bunch of Java novices (oops it just came out of my mouth!). So the real question is, do you have to worry about the performance penalties of thrashing REDO and UNDO log files... In other words, build a performance test case and run it on your server and see how long it takes. During the DELETE / INSERT the system will be not as responsive as usual but other sessions can still perform SELECTs without any fears of deadlocks, memory leaks or system crashes. Hint: DB servers are usually disk-bound, so getting a proper RAID array is usually a very good investment.
On the other hand, if the performance is critical, you can select one of the alternative approaches described in this thread:
partitioning if you have the license
table renaming if you don't, but be mindful that DDLs on the fly can cause some side effects such as object invalidation, ORA-06508...

In Oracle you can partition your tables and indexes based on a date or time column. That way, to remove a lot of data you can simply drop the partition instead of performing a DELETE command.
We used to use this to manage monthly archives of 100 Million+ records and not have downtime.
http://www.oracle.com/technology/oramag/oracle/06-sep/o56partition.html is a super handy page for learning about partitioning.

I assume that this refreshing activity is the only way of data changing in these tables, so that you don't need to worry about inconsistencies due to other writing processes during the load.
All that deleting and inserting will be costly in terms of undo usage; you also would exclude the option of using faster data loading techniques. For example, your inserts will go much, much faster if you insert into the tables with no indexes, then apply the indexes after the load is done. There are other strategies as well, but all of them preclude the "do it all in one transaction" technique.
Your second choice would be my choice - build the new tables, then rename the old ones to a dummy name, rename the temps to the new name, then drop the old tables. Since the renames are fast, you'd have a less than one second window when the tables were unavailable, and you'd then be free to drop the old tables at your leisure.
If that one second window is unacceptable, one method I've used in situations like this is to use an additional locking object - specifically, a table with a single row that users would be required to select from before they access the real tables, and that your load process could lock in exclusive mode before it does the rename operation.
Your PHP script would use two connections to the db - one where you do the lock, the other where you do the loading, renaming and dropping. This way the implicit commits in the work connection won't terminate the lock in the other table.
So, in the script, you'd do something like:
Connection 1:
Create temp tables, load them, create new indexes
Connection 2:
LOCK TABLE Load_Locker IN SHARE ROW EXCLUSIVE MODE;
Connection 1:
Perform renaming swap of old & new tables
Connection 2:
Rollback;
Connection 1:
Drop old tables.
Meanwhile, your clients would issue the following command immediately after starting a transaction (or a series of selects):
LOCK TABLE Load_Locker IN SHARE MODE;
You can have as many clients as you like locking the table this way - your process above will block behind them until they have all released the lock, at which point subsequent clients will block until you perform your operations. Since the only thing you're doing inside the context of the SHARE ROW EXCLUSIVE lock is renaming tables, your clients would only ever block for an instant. Additionally, putting this level of granularity allows you to control how long the clients would have a read consistent view of the old table; without it, if you had a client that did a series of reads that took some time, you might end up changing the tables mid-stream and wind up with weird results if the early queries pulled old data & the later queries pulled new data. Using SET TRANSACTION READ ONLY would be another way of addressing this issue if you weren't using my approach.
The only real downside to this approach is that if your client read transactions take some time, you run the risk of other clients being blocked for longer than an instant, since any locks in SHARE MODE that occur after your load process issues its SHARE ROW EXCLUSIVE lock will block until the load process finishes its task. For example:
10:00 user 1 issues SHARE lock
10:01 user 2 issues SHARE lock
10:03 load process issues SHARE ROW EXCLUSIVE lock (and is blocked)
10:04 user 3 issues SHARE lock (and is blocked by load's lock)
10:10 user 1 releases SHARE
10:11 user 2 releases SHARE (and unblocks loader)
10:11 loader renames tables & releases SHARE ROW EXCLUSIVE (and releases user 3)
10:11 user 3 commences queries, after being blocked for 7 minutes
However, this is really pretty kludgy. Kinlan's solution of partitioning is most likely the way to go. Add an extra column to your source tables that contains a version number, partition your data based on that version, then create views that look like your current tables that only show data that shows the current version (determined by the value of a row in a "CurrentVersion" table). Then just do your load into the table, update your CurrentVersion table, and drop the partition for the old data.
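The version-column idea from the last paragraph can be sketched with Python and SQLite. This is an approximation under assumptions: SQLite has no partitioning, so a plain DELETE stands in for dropping the old partition, and table_a_versions, current_version and refresh are hypothetical names. Readers query the view, and the switch to the new data is a single UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a_versions (
    version INTEGER NOT NULL,
    id      INTEGER NOT NULL,
    a_data  TEXT,
    PRIMARY KEY (version, id)
);
CREATE TABLE current_version (version INTEGER NOT NULL);
INSERT INTO current_version VALUES (0);

-- Users query this view; it only ever shows the current version.
CREATE VIEW table_a AS
SELECT id, a_data
FROM table_a_versions
WHERE version = (SELECT version FROM current_version);
""")


def refresh(conn, rows):
    """Load a complete new data set, then switch readers over to it."""
    (old,) = conn.execute("SELECT version FROM current_version").fetchone()
    new = old + 1
    with conn:   # 1. load the new rows; readers still see the old version
        conn.executemany(
            "INSERT INTO table_a_versions VALUES (?, ?, ?)",
            [(new, row_id, data) for row_id, data in rows],
        )
    with conn:   # 2. the switch itself is one small UPDATE
        conn.execute("UPDATE current_version SET version = ?", (new,))
    with conn:   # 3. clean up (Oracle would drop the old partition)
        conn.execute(
            "DELETE FROM table_a_versions WHERE version = ?", (old,)
        )
    return new


refresh(conn, [(1, "a"), (2, "b")])
before = conn.execute("SELECT id, a_data FROM table_a ORDER BY id").fetchall()
refresh(conn, [(2, "B"), (3, "c")])
after = conn.execute("SELECT id, a_data FROM table_a ORDER BY id").fetchall()
```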

Why not add a version column? That way you can add the new rows with a different version number. Create a view against the table that specifies the current version. After the new rows are added recompile the view with the new version number. When that's done, go back and delete the old rows.

What we do in some cases is have two versions of the tables, say SalesTargets1 and SalesTargets2 (an active and an inactive one). Truncate the inactive one and populate it. Since no one but you uses the inactive one, there should be no locking issues or impact on the users while it is populating. Then have a view that selects all the information from the active table (it should be named what the current table is now, say SalesTargets in my example). Then to switch to the refreshed data, all you have to do is run an ALTER VIEW statement.

Have you evaluated the size of the delta (of changes)?
If the number of rows that get updated (as opposed to inserted) every time you put up a new rowset is not too high, then I think you should consider importing the new set of data into a set of staging tables and doing an update-where-exists and insert-where-not-exists (UPSERT) solution and just refreshing your indexes (ok ok indices).
Treat it like ETL.

I'm going with an upsert method here.
I added an additional "delete" column to each of the tables.
When I begin processing the feed, I set the delete field for every record to '1'.
Then I go through a series of updates if the record exists, or inserts if it does not. For each of those inserts/updates, the delete field is then set to zero.
At the end of the process I delete all records that still have a delete value of '1'.
Thanks everybody for your answers. I found it very interesting/educational.
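The steps described in this answer can be sketched as follows, in Python with SQLite rather than PHP with Oracle. The column is called delete_flag here because DELETE is a reserved word, and table_a and apply_feed are assumed names for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE table_a (
    id          INTEGER PRIMARY KEY,
    a_data      TEXT,
    delete_flag INTEGER NOT NULL DEFAULT 0
)
""")


def apply_feed(conn, feed):
    """Refresh table_a from feed, an iterable of (id, a_data) pairs."""
    with conn:  # one transaction: readers never see a half-applied feed
        # 1. Mark every existing row as a deletion candidate.
        conn.execute("UPDATE table_a SET delete_flag = 1")
        for row_id, data in feed:
            # 2. Update the row if it exists and clear its flag ...
            cur = conn.execute(
                "UPDATE table_a SET a_data = ?, delete_flag = 0"
                " WHERE id = ?", (data, row_id))
            # ... otherwise insert it with the flag already cleared.
            if cur.rowcount == 0:
                conn.execute(
                    "INSERT INTO table_a (id, a_data, delete_flag)"
                    " VALUES (?, ?, 0)", (row_id, data))
        # 3. Rows still flagged were absent from the feed: remove them.
        conn.execute("DELETE FROM table_a WHERE delete_flag = 1")


apply_feed(conn, [(1, "a"), (2, "b")])
apply_feed(conn, [(2, "B"), (3, "c")])
final_rows = conn.execute(
    "SELECT id, a_data, delete_flag FROM table_a ORDER BY id"
).fetchall()
```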
