mysql historical data and record id - php

I am setting up a new part of an application with historical data requirements for the transactions table in MySQL. Originally, in the old version, transactions were not historical, with a structure like this:
id|buyerid|prodid|price|status
And other fields, with the id being referenced in links to access the Transaction Details page, as well as used as a foreign key in other tables across the application to reference particular transactions for various purposes.
Now the requirement is to answer reporting questions like "Show all transactions that had a particular status in Feb 2014" AND "What did a transaction look like in Feb 2014".
The new design I'm testing at the moment is below:
id|buyerid|prodid|price|status|active|start_date|end_date
Where active indicates the latest record, start_date is when the record is created, and no records are ever modified in place; instead, the end_date is populated and a new record is created with the same details plus the modification.
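To illustrate, answering "what did a transaction look like in Feb 2014" against this structure would be something like the sketch below (the date boundaries are only example placeholders):

-- Sketch: transactions as they looked at the end of Feb 2014 (dates are example values)
SELECT *
FROM transactions
WHERE start_date <= '2014-02-28 23:59:59'
  AND (end_date IS NULL OR end_date > '2014-02-28 23:59:59');
-- add e.g. AND status = 'pending' for the "all transactions with a particular status" report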
Now the question is - what to do about the transaction id field? In this new design it is more of a history id, and it cannot be used as a foreign key across the application, since it is going to change with every update.
I can think of two options:
1. Create a separate table, transaction_ids, with just one column, an auto-increment primary key tid, and add a foreign key column tid to the main transactions table. Every time a brand new transaction is created, insert into the ids table and use that id as the tid to trace this particular transaction across the system.
2. Use the buyerid and prodid combination as the identifier; it is always unique in my application, since no buyer can get the same product twice.
Is the second solution better? Does anyone know of a better way to handle this?

What you are trying to achieve is called Event Sourcing.
Think in terms of events changing the status of your transaction, rather than tracking the status itself over time.
You still have your transaction with its own primary key, and you rebuild the current (or past) status by applying each event.
I would also suggest that you start by coding your business models, and only after that think about the persistence and the best way to map it to a database.
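As a rough illustration of the idea (table and column names here are assumptions, not a prescribed schema), the events get their own append-only table while the transaction keeps a single stable id:

-- Sketch: one immutable event row per change to a transaction
CREATE TABLE transaction_events (
  event_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  trans_id   INT UNSIGNED NOT NULL,      -- stable transaction id, safe to use as a foreign key elsewhere
  event_type VARCHAR(50)  NOT NULL,      -- e.g. 'created', 'status_changed', 'price_changed'
  new_value  TEXT,                       -- the value introduced by the event
  created_at DATETIME     NOT NULL
);

-- Rebuild the status a given transaction had in Feb 2014 by replaying its events up to that date
SELECT new_value
FROM transaction_events
WHERE trans_id = 123
  AND event_type = 'status_changed'
  AND created_at < '2014-03-01'
ORDER BY created_at DESC, event_id DESC
LIMIT 1;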

The second solution looks better, although I will say that there is a lot of ambiguity in your question.
I am saying that the second solution is better because the transaction_ids table you are talking about in solution 1 is basically REDUNDANT. It serves no purpose. Even if the transaction id repeats itself in the transactions table, that does not mean you need a separate table to generate the ids and make it a PK-FK relation. Most probably you will still be querying the data by user-id and prod-id and not by transaction-id.
Basically what you need is some kind of audit history table where you insert a record for every operation/transaction/modification and capture some basic details like username, date/time, old value, new value etc. You do not need status or start date and end date columns. Once a record is inserted in this audit history table, it is never touched again.
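A possible shape for such a table, sticking to the columns mentioned above (names are illustrative only):

-- Sketch of an append-only audit history table; rows are inserted once and never updated
CREATE TABLE transaction_audit (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  trans_id    INT UNSIGNED NOT NULL,   -- which transaction the change belongs to
  username    VARCHAR(50)  NOT NULL,   -- who made the change
  changed_at  DATETIME     NOT NULL,   -- when the change was made
  column_name VARCHAR(64)  NOT NULL,   -- which field changed
  old_value   TEXT,
  new_value   TEXT
);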
You will have to design your report carefully.

Taking the two previous answers into consideration, here is the solution I will go with: all of the data updates in my application come through one single function, which is already set up to audit particular fields of my choosing, so I will mark the transaction status to be audited among the others. The table structure for the audit table is similar to this:
|id|table|table_id|column|old_val|new_val|who|when|
Only that there is slightly more advanced object mapping via object ids instead of simple table names. I can then use this data in a join to the main, normal non-historical transactions table to provide the reporting required.
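A reporting query along those lines might look roughly like this; the audit table name is assumed and the status value and date range are placeholders, but the columns follow the structure above:

-- Placeholder values: all transactions whose status became 'shipped' during Feb 2014
SELECT t.id, t.buyerid, t.prodid, a.new_val, a.`when`
FROM transactions t
JOIN audit a
  ON a.`table` = 'transactions'
 AND a.table_id = t.id
WHERE a.`column` = 'status'
  AND a.new_val  = 'shipped'
  AND a.`when` BETWEEN '2014-02-01' AND '2014-02-28 23:59:59';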

Related

Better approach for updating multiple data

I have this MySQL table, where contact_id is unique for each user_id.
history:
- hist_id: int(11) auto_increment primary key
- user_id: int(11)
- contact_id: int(11)
- name: varchar(50)
- phone: varchar(30)
From time to time, the server will receive a new list of contacts for a specific user_id and needs to update this table, inserting, deleting or updating data that is different from the previous information.
For example, the current data is:
The server then receives this data:
And the new data becomes:
As you can see, the first row (John) was updated, the second row (Mary) was deleted and another row (Jeniffer) was added.
Today what I am doing is deleting all rows with a specific user_id, and inserting the new data. But the autoincrement field (hist_id) is getting bigger and bigger...
Note: the table has about 80 thousand records, and this update will occur 30 times a day or more.
I have some (related) questions:
1. In this scenario, do you think deleting all records from a specific user_id and inserting updated data is a good approach?
2. What about removing the autoincrement field? I don't need it, but I think it is not a good idea to have a table without a primary key.
3. Or maybe the better approach is to loop through the new data, selecting each user_id / contact_id and comparing values to update?
PS: by better approach I mean the most efficient way.
Thank you so much for any help!
In this scenario, do you think deleting all records from a specific user_id and inserting updated data is a good approach?
Short Answer
No. You should be taking advantage of 'upsert', which is short for 'insert on duplicate key update'. What this means is that if the key pair you're inserting already exists, the specified columns are updated with the specified data. You then shorten your logic and reduce increments. Here's an example using your table structure that should work. This also assumes that you have set the user_id and contact_id pair to unique.
INSERT INTO history (user_id, contact_id, name, phone)
VALUES
(1, 23, 'James Jr.', '(619)-543-6222')
ON DUPLICATE KEY UPDATE
name=VALUES(name),
phone=VALUES(phone);
This query should retain the contact_id but overwrite the preexisting data with the new data.
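If that unique constraint is not already in place, something along these lines would add it (a sketch; it assumes no duplicate user_id/contact_id pairs currently exist in the table):

-- Needed so that ON DUPLICATE KEY UPDATE fires on the (user_id, contact_id) pair
ALTER TABLE history
  ADD UNIQUE KEY uq_user_contact (user_id, contact_id);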
What about removing the autoincrement field? I don't need it, but I think it is not a good idea to have a table without a primary key.
Primary keys do not imply auto-incremented values. I could have a varchar field as the primary key containing names of fruits and vegetables. Is this optimized for performance? Probably not. There are many situations that might call for auto increment and there are definite reasons to avoid it. It all depends on how you wish to access the data and how this can impact future expansion. In your situation, I would start over on the table structure and re-think how you wish to store and access the data. Do you want to write more logic to control the data, OR do you want the data to flow naturally by itself? You've made a history table that is functioning more like a hybrid many-to-one crosswalk at first glance. Without looking at the remaining table structure, I can't necessarily say on a whim that it's not a good idea. What I can say is that I would do this a bit differently. I will answer this more specifically in the next question.
Or maybe the better approach is to loop new data, selecting each user_id / contact_id for comparing values to update?
I would avoid looping through the data in order to update it. That is a job for SQL and it does this job well. Sometimes we might find ourselves in a situation where we must do this to either extract data in a specific format or to repair data in some way. However, avoid doing this for inserting or updating data. It can negatively impact performance and you will likely paint yourself into a corner.
Back to what I said toward the end of your second question, which will help you see what I am talking about. I am going to assume that user_id is an auto-incremented primary key in your user table. I will do some guesstimation here and show you an example of how you can redesign your user, contact and phone number structure. The following is a quick model I threw together that shows the foreign key relationships between the tables.
Note: The column names and overall data arrangement could be done differently but I did this quickly to give you a decent example of a normalized database structure. All of the foreign keys have a structural layout which separates your data in a way that enables you to control the flow of data as it enters and leaves your system. Here's the screenshot of the database model I threw together using MySQL Workbench.
(screenshot of the database model - source: xonos.net)
Here's the SQL so that you can look at it more closely.
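(The exported SQL isn't reproduced here; the sketch below is only a rough approximation of the structure being described - a shared person table referenced by the user, contact and phone tables, with cascading foreign keys - and every name in it is a guess.)

-- Rough approximation only; names, columns and engine choices are guesses
CREATE TABLE person (
  person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(50)  NOT NULL
) ENGINE=InnoDB;

CREATE TABLE user (
  user_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  person_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (person_id) REFERENCES person (person_id)
    ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB;

CREATE TABLE contact (
  contact_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  person_id  INT UNSIGNED NOT NULL,
  FOREIGN KEY (person_id) REFERENCES person (person_id)
    ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB;

CREATE TABLE phone (
  phone_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  person_id INT UNSIGNED NOT NULL,
  phone     VARCHAR(30)  NOT NULL,
  FOREIGN KEY (person_id) REFERENCES person (person_id)
    ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB;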
You'll notice that the "person" table is extracted from users but shares data with contacts. This enables you to store all "people" in one place, all "users" in another and all "contacts" in another. Now, why would we do this? The number one reason can be explained in two scenarios.
1.) Say we have someone; in this example I'll call him "Jim Bean". "Jim Bean" works for the company, so he is a user of the system. But "Jim Bean" happens to own a side business and does contract work for the company at the same time. So, he is both a contact and a user of the system. In a more "flat table" environment, we would have two records for Jim Bean that contain the same data, which could become outdated or incorrect quickly.
2.) Let's say that Jim did some bad things and the company wants nothing to do with him anymore. They don't want any record of him - as if he never existed. All we have to do is delete Jim Bean from the person table. That's it. Since the foreign relationships have "CASCADE" on update/delete, this automatically propagates and clears out the other tables related to him.
I highly recommend that you do some reading on normalized data structure. It has saved me many hours once I got the hang of it and I will never go back.

Generating my own Eloquent model insert IDs - How to avoid PK Collisions?

Maybe this is a stupid question because I should defer PK increments to MySql itself, but I'm in a weird situation.
Basically, to handle versioning and approvals in my system, I have a revision_batch table which is a collection of things in a submission that a user wishes to insert or update in the database. It has columns like batch_id, the user_id of the submitter, and an approved value.
It is also the parent of a collection of items in the revisions table. The revisions table has things like table_name, key, old_value, and new_value. I use this to store the changes someone wishes to make that may not be approved automatically.
When someone who doesn't have permission to, say, the "tasks" table changes the name of a task, a new revision_batch will be created, and a new revision will be created with table_name="tasks", key=[whatever the task's ID is], old_value="my old task name", new_value="my new task name".
When an approver approves of this batch, my code will rocket through the revisions in the batch and perform the update or inserts to the database.
My problem is when performing parent-child relationships within the same batch. If I'm creating a new task and want to assign a task_item to it, in the same batch, then I need to know what PK the task is getting so that I can give the task_item a "task_id".
If I'm handling the creation of a new revision for a task, I might do something like
select max(id)+1 as newId from tasks
to inject as the new id. But since I might already have a pending task insert revision with that ID or higher, I also check
select max(`key`) + 1 as newId
from revisions
inner join revision_batches on revisions.batch_id = revision_batches.id
where table_name = 'tasks' and approved = 'P'
for a higher id to assign. That way, if I have ids 1-9 in the tasks table and 10-12 pending in the revisions table, any new direct insert using Laravel's Eloquent model class is overridden to check both tasks and revisions and will insert with id 13. This avoids collisions between actual cemented rows and possible revision rows. It also allows me to create a parent and many layers of children within a single batch, because I determine their IDs as I go along.
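Folded into one statement, that lookup might be sketched like this (the 'tasks' table is the example from above; column names follow the description):

-- Sketch: next id that is free in both the real tasks table and the pending revisions
SELECT GREATEST(
  (SELECT COALESCE(MAX(id), 0) FROM tasks),
  (SELECT COALESCE(MAX(r.`key`), 0)
   FROM revisions r
   INNER JOIN revision_batches b ON r.batch_id = b.id
   WHERE r.table_name = 'tasks' AND b.approved = 'P')
) + 1 AS newId;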
This all works fine.
My problem is that if I have two revision creations happening at the exact same time (like, within a millisecond), they'll asynchronously both fetch the same next ID to use, both create revisions where key = the same number, and then only one will get through and the other fails on a PK collision.
My question is: is there a way to force this to be thread safe or to be done synchronously, to avoid two instances of the same controller method executing at the same time and both fetching the same ID to use? Can I lock a method down to a single instance at a time? If not, is there a better way I could be handling PK generation? The only reason I do this is to know beforehand the key to insert. But since custom code in the framework is handling PK generation and not the database, it's causing me this major issue. Happens sporadically, but only when I force the same method to execute maybe 4 times at the same time.
I know that I could avoid the majority of cases where I have many things being inserted at the exact same time, but that doesn't mean that randomly in the future that two users won't hit enter at the same time and recreate this issue.
Any ideas?
Thanks!
For this type of issue I use UUID v4 (universally unique identifier). My case is a little bit different because I have a system in 74 different locations, but I need to extract all the transaction records and integrate them into a consolidation system, so my PKs need to be unique across all servers to avoid collisions.
In Laravel I use this excellent package to generate the UUIDs.
I hope this works for you.
Use Queues for saving your revisions.
Queue jobs are processed one at a time per worker, and hence the key collision will never occur.
Source: http://laravel.com/docs/4.2/queues

Soft delete best practices (PHP/MySQL)

Problem
In a web application dealing with products and orders, I want to maintain information and relationships between former employees (users) and the orders they handled. I want to maintain information and relationships between obsolete products and orders which include these products.
However I want employees to be able to de-clutter the administration interfaces, such as removing former employees, obsolete products, obsolete product groups etc.
I'm thinking of implementing soft-deletion. So, how does one usually do this?
My immediate thoughts
My first thought is to stick a "flag_softdeleted TINYINT NOT NULL DEFAULT 0" column in every table of objects that should be soft deletable. Or maybe use a timestamp instead?
Then, I provide a "Show deleted" or "Undelete" button in each relevant GUI. Clicking this button you will include soft-deleted records in the result. Each deleted record has a "Restore" button. Does this make sense?
Your thoughts?
Also, I'd appreciate any links to relevant resources.
That's how I do it. I have an is_deleted field which defaults to 0. Then queries just check WHERE is_deleted = 0.
I try to stay away from any hard-deletes as much as possible. They are necessary sometimes, but I make that an admin-only feature. That way we can hard-delete, but users can't...
Edit: In fact, you could use this to have multiple "layers" of soft-deletion in your app. So each could be a code:
0 -> Not Deleted
1 -> Soft Deleted, shows up in lists of deleted items for management users
2 -> Soft Deleted, does not show up for any user except admin users
3 -> Only shows up for developers.
Having the other 2 levels will still allow managers and admins to clean up the deleted lists if they get too long. And since the front-end code just checks for is_deleted = 0, it's transparent to the frontend...
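As a sketch of how that looks in queries (the products table name is just an example from the question's domain):

-- Soft delete: mark the row instead of removing it
UPDATE products SET is_deleted = 1 WHERE id = 42;

-- Front-end queries only ever see live rows
SELECT * FROM products WHERE is_deleted = 0;

-- Management screens can list what was soft-deleted (levels 1 and up)
SELECT * FROM products WHERE is_deleted >= 1;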
Using soft-deletes is a common thing to implement, and they are dead useful for lots of things, like:
Saving a user's data when they deleted something
Saving your own data when you delete something
Keep a track record of what really happened (a kind of audit)
etcetera
There is one thing I want to point out that almost everyone misses, and it always comes back to bite you in the rear. The users of your application do not have the same understanding of a delete as you have.
There are different degrees of deletion. The typical user deletes stuff when (s)he:
Made a mistake and wants to remove the bad data
Doesn't want to see something on the screen anymore
The problem is that if you don't record the intention of the delete, your application cannot distinguish between erroneous data (that should never have been created) and historically correct data.
Have a look at the following data:
PRICES
+------+-------+---------+
| item | price | deleted |
+------+-------+---------+
| A    |   101 |       1 |
| B    |   110 |       1 |
| C    |   120 |       0 |
+------+-------+---------+
Some user doesn't want to show the price of item B, since they don't sell that item anymore. So he deletes it. Another user created the price for item A by mistake, so he deleted it and created the price for item C, as intended. Now, can you show me a list of the prices for all products? No, because either you have to display potentially erroneous data (A), or you have to exclude all but current prices (C).
Of course the above can be dealt with in any number of ways. My point is that YOU need to be very clear about what YOU mean by a delete, and make sure that there is no way for the users to misunderstand it. One way would be to force the user to make a choice (hide/delete).
If I had existing code that hits that table, I would add the column and change the name of the table. Then I would create a view with the same name as the current table which selects only the active records. That way none of the existing code would break and you could have the soft delete column. If you want to see the deleted records, you select from the base table; otherwise you use the view.
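A rough sketch of that rename-plus-view approach (table and column names are illustrative):

-- Sketch: the view takes the old table name, so existing queries keep working unchanged
RENAME TABLE products TO products_all;

ALTER TABLE products_all
  ADD COLUMN is_deleted TINYINT NOT NULL DEFAULT 0;

CREATE VIEW products AS
  SELECT * FROM products_all WHERE is_deleted = 0;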
I've always just used a deleted column as you mentioned. There's really not much more to it than that. Instead of deleting the record, just set the deleted field to true.
Some components I build allow the user to view all deleted records and restore them, others just display all records where deleted = 0
Your idea does make sense and is used frequently in production, but to implement it you will need to update quite a bit of code to account for the new field. Another option could be to archive (move) the "soft-deleted" records to a separate table or database. This is done frequently as well and makes the issue one of maintenance rather than (re)programming. (You could have a table trigger react to the delete and archive the deleted record.)
I would do the archiving to avoid a major update to production code. But if you want to use a deleted-flag field, use it as a timestamp to give you additional useful info beyond a boolean (null = not deleted). You might also want to add a DeletedBy field to track the user responsible for deleting the record. Using two fields gives you a lot of info: it tells you who deleted what and when. (The two extra fields are also something that can be done in an archive table/database.)
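If you go the archive route, a trigger along these lines could move the bookkeeping out of application code (a sketch; the products table and its columns are assumptions, and CURRENT_USER() records the database account rather than the application user):

-- Sketch: assumes a products table with (id, title, price); adjust to your real columns
CREATE TABLE products_archive (
  id         INT UNSIGNED NOT NULL,
  title      VARCHAR(100),
  price      DECIMAL(10,2),
  deleted_at DATETIME,
  deleted_by VARCHAR(100)
);

CREATE TRIGGER products_before_delete
BEFORE DELETE ON products
FOR EACH ROW
  INSERT INTO products_archive (id, title, price, deleted_at, deleted_by)
  VALUES (OLD.id, OLD.title, OLD.price, NOW(), CURRENT_USER());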
The most common scenario I've come across is what you describe, a tinyint or even bit representing a status of IsActive or IsDeleted. Depending on whether this is considered "business" or "persistence" data it may be baked into the application/domain logic as transparently as possible, such as directly in stored procedures and not known to the application code. But it sounds like this is legitimate business information for your needs so would need to be known throughout the code. (So users can view deleted records, as you suggest.)
Another approach I've seen is to use a combination of two timestamps to show a "window" of activity for a given record. It's a little more code to maintain it, but the benefit is that something can be scheduled to soft-delete itself at a pre-determined time. Limited-time products can be set that way when they're created, for example. (To make a record active indefinitely one could use a max value (or just some absurdly distant future date) or just have the end date be null if you're ok with that.)
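A sketch of that two-timestamp window (column names are illustrative):

-- Sketch: a record is "live" while the current time falls inside its window
ALTER TABLE products
  ADD COLUMN active_from DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,   -- default needs MySQL 5.6.5+
  ADD COLUMN active_to   DATETIME NULL;                                 -- NULL = active indefinitely

SELECT *
FROM products
WHERE active_from <= NOW()
  AND (active_to IS NULL OR active_to > NOW());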
Then of course there's further consideration of things being deleted/undeleted from time to time and tracking some kind of audit for that. The flag approach knows only the current status, the timestamp approach knows only the most recent window. But anything as complex as an audit trail should definitely be stored separately than the records in question.
Instead, I would use a bin table into which you move all the records deleted from the other tables. The main problem with the delete flag is that with linked tables you will eventually run into a duplicate key error when trying to insert a new record.
The bin table could have a structure like this:
id, table_name, data, date_time, user
Where
id is the primary key with auto increment
table_name is the name of the table from which the record was deleted
data contains the record in JSON format with name and value of all fields
date_time is the date and time of the deletion
user is the identifier of the user (if the system provides for it) who performed the operation
This method will not only save you from checking the delete flag on each query (imagine the ones with many joins), but will also let you keep only the really necessary data in the tables, facilitating any searches and corrections using SQL client programs.
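A sketch of that bin table, and of moving a deleted record into it (JSON_OBJECT assumes MySQL 5.7+; on older versions the JSON string would be built in application code, and the products columns here are placeholders):

-- Sketch of the bin table described above
CREATE TABLE bin (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  table_name VARCHAR(64)  NOT NULL,
  data       TEXT         NOT NULL,   -- the deleted record serialized as JSON
  date_time  DATETIME     NOT NULL,
  user       VARCHAR(50)
);

-- Move a product into the bin, then remove it from the live table
INSERT INTO bin (table_name, data, date_time, user)
SELECT 'products', JSON_OBJECT('id', id, 'title', title, 'price', price), NOW(), 'jdoe'
FROM products
WHERE id = 42;

DELETE FROM products WHERE id = 42;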

Revision control for multiple pieces of related data

I'm trying to figure out how to best keep revision/history information on revisions to multiple rows of data, in case for some reason we need to revert to that data.
This is the general sort of layout:
item
---------------
id
title
etc...
region
---------------
id
title
etc...
release_type
-----------------
id
title
etc...
items_released_dates_data
---------------------
item_id
region_id
release_type_id (these three form the primary key)
date
So you can have one release date per item + region_id + release_type and we basically only track the date (For the purposes of this question the 'date' could be a number, a string, or whatever. I'm certain to run into this issue again)
Changes are submitted in bulk: when new data is added, everything in items_released_dates_data where item_id = your_id is first deleted, then an insert statement adds the new values (perhaps this isn't the best way to do this?).
My thought was to create a table like:
items_release_dates_data_history
-------------------------------------
item_id
timestamp
description
raw_data
Making description a short summary of what was updated, and including the data in some format like JSON or XML or something that could be quickly decoded on the client side to give the user a review of the changes and a choice to revert to a given version. Then every entry to items_released_dates_data also requires an entry to items_released_dates_data_history (doesn't sound like a question, does it? :| )
I've read something about mysql triggers that would be helpful here, but quite frankly I don't know a thing about them so I'm working with what I understand.
My question is, am I following the right path to version this stuff, and is there any advice/best practices anyone can give me on how to improve this method?
I second Alex Miller's comment. Everything you write makes sense so far.
I'd strongly recommend looking into triggers though, despite your reservations. They're fairly easy to grasp, and make for a very powerful tool in such scenarios. Using triggers you can store a copy of the row into a separate table each time a record is updated (or deleted). If you want to go all fancy you can, within the trigger, compare the incoming data to the existing data, and write only what has changed.
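A minimal sketch of such a trigger for the items_released_dates_data table above (the history table layout here is an assumption; a matching BEFORE DELETE trigger would cover the bulk delete-then-insert flow as well):

-- Sketch: snapshot the old row before every update
CREATE TABLE items_released_dates_data_old (
  item_id         INT NOT NULL,
  region_id       INT NOT NULL,
  release_type_id INT NOT NULL,
  `date`          DATE,
  archived_at     DATETIME NOT NULL
);

CREATE TRIGGER items_released_dates_data_bu
BEFORE UPDATE ON items_released_dates_data
FOR EACH ROW
  INSERT INTO items_released_dates_data_old
    (item_id, region_id, release_type_id, `date`, archived_at)
  VALUES
    (OLD.item_id, OLD.region_id, OLD.release_type_id, OLD.`date`, NOW());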
Also consider the Archive storage engine instead of MyISAM or InnoDB for these kinds of tables - they're made for this kind of job.
Also, the search phrase you're probably looking for is "audit trail".
I'd say that you're definitely on the right track. Although, you may want to store the region ID in the history so you can check release history based on a region rather than just by entire items.
As for the delete+insert, that's fine as long as you don't end up with too much traffic, as those are both locking actions. Inserting or deleting a row takes a fair amount of time to update the index. If you're using a MyISAM table, it's also going to halt all reads on the table until those actions complete. An update will as well, but for a much shorter time. InnoDB will only lock the row, so that's not really a concern.

How can I update multiple tables while guaranteeing no duplicate ids?

I'm used to building websites with user accounts, so I can simply auto-increment the user id, then let them log in while I identify that user by user id internally. What I need to do in this case is a bit different. I need to anonymously collect a few rows of data from people, and tie those rows together so I can easily discern which data rows belong to which user.
The difficulty I'm having is in generating the id to tie the data rows together. My first thought was to poll the database for the highest user ID in existence, and write to the database with user ID +1. This will fail, however, if two submissions poll the database before either of them writes to it - they will each share the same user ID.
Another thought I had was to create a separate user ID table that would be set to auto-increment, and simply generate a new row, then poll that table for the id of the last row created. That also fails for the same reason as above - if two submissions create a row before either of them polls for the latest user ID, then they'll end up sharing an ID.
Any ideas? I get the impression I'm missing something obvious.
I think I'm understanding you right; I was having a similar issue. There's a super handy PHP function, though. After you query the database to insert a new row, auto-incrementing its user ID, do:
$user_id = mysql_insert_id();
That just returns the auto-increment value from the previous query on the current mysql connection. You can read more about it here if you need to.
You can then use this to populate the second table's data, being sure nobody will get a duplicate ID from the first one.
You need to insert the user, get the auto-generated id, and then use that id as a foreign key in the couple of rows you need to associate with the parent record. The hat rack must exist before you can hang hats on it.
This is a common issue, and to solve it you would use a transaction. This gives you atomicity: the ability to do more than one thing, but have it tied to either success or failure as a package. It's an advanced db feature, and does require awareness of some more advanced programming in order to implement it in as fault-tolerant a manner as possible.
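A bare-bones sketch of that pattern in MySQL (table and column names are placeholders):

-- Sketch: create the anonymous user row and its data rows as one atomic unit
START TRANSACTION;

INSERT INTO users () VALUES ();     -- id is auto-generated by the database
SET @uid = LAST_INSERT_ID();        -- scoped to this connection, so no race with other submissions

INSERT INTO data_rows (user_id, answer) VALUES (@uid, 'first row');
INSERT INTO data_rows (user_id, answer) VALUES (@uid, 'second row');

COMMIT;   -- or ROLLBACK on error, so no half-written submission is left behind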
