PostgreSQL: best way to create new/duplicate existing tables every year - php

Referring to this question, I've decided to duplicate the tables every year, creating tables that hold the data for that year, for example:
orders_2008
orders_2009
orders_2010
etc...
Well, I know the speed problem could probably be solved with just two tables per entity, like orders_history and order_actual, but I thought that once the handler code has been written, there would be no difference... just many tables.
Those tables will also have child tables linked by foreign keys;
for example, orders_2008 will have the child table items_2008:
CREATE TABLE orders_2008 (
    id serial NOT NULL,
    code character(5),
    customer text
);
ALTER TABLE ONLY orders_2008
    ADD CONSTRAINT orders_2008_pkey PRIMARY KEY (id);

CREATE TABLE items_2008 (
    id serial NOT NULL,
    order_id integer,
    item_name text,
    price money
);
ALTER TABLE ONLY items_2008
    ADD CONSTRAINT items_2008_pkey PRIMARY KEY (id);
ALTER TABLE ONLY items_2008
    ADD CONSTRAINT "$1" FOREIGN KEY (order_id) REFERENCES orders_2008(id) ON DELETE CASCADE;
So, my problem is: what do you think is the best way to replicate those tables every 1st of January and, of course, keep the table dependencies?
A PHP/Python script that, query after query, rebuilds the structure for the new year (called by a cron job)?
Can PostgreSQL functions be used in that way?
If yes, how (a little example would be nice)?
Actually I'm going with the first approach (a .sql file containing the structure, and a PHP/Python script run by a cron job that rebuilds the structure), but I'm wondering whether this is the best way.
Edit: I've seen the CREATE TABLE ... LIKE syntax, but the foreign keys must be added afterwards... otherwise the new tables would keep referencing the old ones.

PostgreSQL has a feature that lets you create a table that inherits fields from another table. The documentation can be found in their manual. That might simplify your process a bit.
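A minimal sketch of what that could look like; the parent orders table and the order_date column are assumptions, not part of the schema posted in the question:
CREATE TABLE orders (
    id serial PRIMARY KEY,
    code character(5),
    customer text,
    order_date date NOT NULL
);

-- child table holding only the 2008 rows; note that inheritance copies columns
-- and CHECK constraints, but not primary key, unique or foreign key constraints
CREATE TABLE orders_2008 (
    CHECK (order_date >= DATE '2008-01-01' AND order_date < DATE '2009-01-01')
) INHERITS (orders);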

You should look at Partitioning in Postgresql. It's the standard way of doing what you want to do. It uses inheritance as John Downey suggested.
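For what it's worth, newer PostgreSQL releases (11+) also offer declarative partitioning, which handles this without manual inheritance. A rough sketch, again assuming an order_date column that is not in the original schema:
CREATE TABLE orders (
    id bigserial,
    code character(5),
    customer text,
    order_date date NOT NULL,
    PRIMARY KEY (id, order_date)
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2008 PARTITION OF orders
    FOR VALUES FROM ('2008-01-01') TO ('2009-01-01');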

Very bad idea.
Have a look at partitioning, and keep your eyes on your real goal:
You don't want a table set for every year, because that is not your actual problem; lots of systems work perfectly well without them :)
You want to solve some performance and/or storage space issues.

I'd recommend orders and order_history... just periodically roll the old orders into the history table (a sketch follows below). It becomes a read-only dataset from then on, so you can add an index to cater for every single query you require, and it should (if your data structures are half decent) remain performant.
If your history table starts getting "too big", it's probably time to start thinking about data warehousing... which really is marvelous, but it certainly ain't cheap.
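A minimal sketch of such a roll-over, assuming an order_date column that is not in the schema shown in the question:
BEGIN;
INSERT INTO order_history
    SELECT * FROM orders WHERE order_date < DATE '2009-01-01';
DELETE FROM orders WHERE order_date < DATE '2009-01-01';
COMMIT;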

As others mentioned in your previous question, this is probably a bad idea. That said, if you are dead set on doing it this way, why not just create all the tables up front (say 2008-2050)?
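Under that assumption, a single anonymous PL/pgSQL block could create all the years in one go. This is only a sketch: it mirrors the structure posted in the question, assumes a reasonably recent PostgreSQL, and starts at 2009 because the 2008 tables already exist:
DO $$
DECLARE
    yr int;
BEGIN
    FOR yr IN 2009..2050 LOOP
        EXECUTE format('CREATE TABLE orders_%s (
                            id serial PRIMARY KEY,
                            code character(5),
                            customer text)', yr);
        EXECUTE format('CREATE TABLE items_%s (
                            id serial PRIMARY KEY,
                            order_id integer REFERENCES orders_%s (id) ON DELETE CASCADE,
                            item_name text,
                            price money)', yr, yr);
    END LOOP;
END $$;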

Related

MySQL: How to ensure thread safety and avoid duplicate key errors, when Primary Key is generated in PHP?

I have a table in which the primary key is a 20-character VARCHAR field that gets generated in PHP before being inserted into the table. The key generation logic uses a grouping and sequencing mechanism, as given below.
SELECT
    SUBSTR(prod_code, 15) AS prod_num
FROM
    items,
    products
WHERE
    items.cat_type = $category
    AND items.sub_grp = $sub_grp
    AND items.prod_id = products.prod_id
ORDER BY
    prod_num DESC
LIMIT 1
The prod_num obtained this way is incremented in PHP and prefixed with a product code to create a unique primary key. However, multiple users can run the transaction concurrently for the same category and sub_grp, leading to the same key being generated for several of them. Since it is a unique primary key, this may cause a duplicate key error. What is the best way to handle such a situation?
Don't use "Smart IDs".
Smart IDs were all the rage in the 1980s, and went out of fashion for several reasons:
The only requirement of a PK is that it has to be unique. A PK doesn't need to have a format, or to be sexy or good looking. Its specific sequence, case, or composition is not relevant and actually counter-productive.
They are not relational. Parts of the ID may establish a relationship with other tables, and that can cause a lot of issues. This goes against the Normal Forms defined in database design.
Now, if you still need a Smart ID, then create a secondary column (which can also be unique) and populate it after the row is created. If you are facing thread-safety issues, you can run a single deferred process that assigns the nice-looking values a few minutes later. Alternatively, you can implement a queue, which can resolve this in seconds.
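A hypothetical layout along those lines (the table and constraint names are illustrative; cat_type, sub_grp and prod_code come from the question):
CREATE TABLE product_items (
    item_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cat_type  INT NOT NULL,
    sub_grp   INT NOT NULL,
    prod_code VARCHAR(20) NULL,          -- the "smart" code, filled in later by a deferred job
    UNIQUE KEY uq_prod_code (prod_code)
);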
Agree with "The Impaler".
But if you decide to proceed that way, your concurrency issue could be handled with a retry mechanism.
This is similar to how deadlocks are typically handled.
If the insertion fails because of violation of the unique primary key, just try again in PHP with a new key.
Your framework might have retry functions already. Otherwise it's easy to implement yourself.

Better approach for updating multiple data

I have this MySQL table, where contact_id is unique for each user_id.
history:
- hist_id: int(11) auto_increment primary key
- user_id: int(11)
- contact_id: int(11)
- name: varchar(50)
- phone: varchar(30)
From time to time, the server will receive a new list of contacts for a specific user_id and needs to update this table, inserting, deleting or updating whatever differs from the previous information.
For example, comparing the current data with the newly received list: the first row (John) was updated, the second row (Mary) was deleted, and another row (Jeniffer) was added.
Today what I am doing is deleting all rows with a specific user_id and inserting the new data. But the auto-increment field (hist_id) keeps getting bigger and bigger...
Note: the table has about 80 thousand records, and this update will occur 30 times a day or more.
I have some (related) questions:
1. In this scenario, do you think deleting all records from a specific user_id and inserting updated data is a good approach?
2. What about removing the autoincrement field? I don't need it, but I think it is not a good idea to have a table without a primary key.
3. Or maybe the better approach is to loop through the new data, selecting each user_id / contact_id and comparing values to decide what to update?
PS. By "better approach" I mean the most efficient way.
Thank you so much for any help!
In this scenario, do you think deleting all records from a specific user_id and inserting updated data is a good approach?
Short Answer
No. You should be taking advantage of 'upsert', which is short for 'insert ... on duplicate key update'. What this means is that if the key pair you're inserting already exists, the specified columns are updated with the specified data. You then shorten your logic and reduce increments. Here's an example, using your table structure, that should work. This also assumes that you have set the user_id and contact_id pair to unique.
INSERT INTO history (user_id, contact_id, name, phone)
VALUES
(1, 23, 'James Jr.', '(619)-543-6222')
ON DUPLICATE KEY UPDATE
name=VALUES(name),
phone=VALUES(phone);
This query should retain the contact_id but overwrite the pre-existing data with the new data.
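The composite unique key that the upsert relies on would look something like this (the constraint name is illustrative):
ALTER TABLE history
    ADD UNIQUE KEY uq_user_contact (user_id, contact_id);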
What about removing the autoincrement field? I don't need it, but I think it is not a good idea to have a table without a primary key.
Primary keys do not imply auto-incremented values. I could have a varchar field as the primary key containing the names of fruits and vegetables. Is this optimized for performance? Probably not. There are many situations that might call for auto increment, and there are definite reasons to avoid it. It all depends on how you wish to access the data and how this can impact future expansion. In your situation, I would start over on the table structure and re-think how you wish to store and access the data. Do you want to write more logic to control the data, or do you want the data to flow naturally by itself? You've made a history table that, at first glance, functions more like a hybrid many-to-one crosswalk. Without looking at the remaining table structure, I can't say offhand that it's a bad idea. What I can say is that I would do this a bit differently. I will answer this more specifically in the next question.
Or maybe the better approach is to loop new data, selecting each user_id / contact_id for comparing values to update?
I would avoid looping through the data in order to update it. That is a job for SQL and it does that job well. Sometimes we may find ourselves in a situation where we must do this, either to extract data in a specific format or to repair data in some way; however, avoid doing it for inserting or updating data. It can negatively impact performance and you will likely paint yourself into a corner.
Back to what I said toward the end of your second question, which will help you see what I am talking about. I am going to assume that user_id is an auto-incremented primary key in your user table. I will do some guesstimation here and show you an example of how you could redesign your user, contact and phone number structure. The following is a quick model I threw together that shows the foreign key relationships between the tables.
Note: The column names and overall data arrangement could be done differently, but I did this quickly to give you a decent example of a normalized database structure. All of the foreign keys have a structural layout that separates your data in a way that lets you control the flow of data as it enters and leaves your system. I modeled it in MySQL Workbench (the original screenshot was hosted at xonos.net).
Here's the SQL so that you can look at it more closely.
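(The original SQL dump isn't included here; what follows is a rough reconstruction of the layout described below, with illustrative table and column names.)
CREATE TABLE person (
    person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(50) NOT NULL,
    phone     VARCHAR(30) NULL
) ENGINE=InnoDB;

CREATE TABLE users (
    user_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id INT UNSIGNED NOT NULL,
    FOREIGN KEY (person_id) REFERENCES person (person_id)
        ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB;

CREATE TABLE contacts (
    contact_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,   -- the user this contact belongs to
    person_id  INT UNSIGNED NOT NULL,   -- the person record holding name/phone
    FOREIGN KEY (user_id) REFERENCES users (user_id)
        ON UPDATE CASCADE ON DELETE CASCADE,
    FOREIGN KEY (person_id) REFERENCES person (person_id)
        ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB;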
You'll notice that the "person" table is extracted from users but shares data with contacts. This enables you to store all "people" in one place, all "users" in another and all "contacts" in another. Now, why would we do this? The number one reason can be explained in two scenarios.
1.) Say we have someone; in this example I'll call him "Jim Bean". "Jim Bean" works for the company, so he is a user of the system. But "Jim Bean" also happens to own a side business and does contract work for the company at the same time. So he is both a contact and a user of the system. In a more "flat table" environment, we would have two records for Jim Bean containing the same data, which could quickly become outdated or incorrect.
2.) Let's say that Jim did some bad things and the company wants nothing more to do with him. They don't want any record of him, as if he never existed. All we have to do is delete Jim Bean from the person table. That's it. Since the foreign relationships have CASCADE on update/delete, this automatically propagates and clears out the related records in the other tables.
I highly recommend that you do some reading on normalized data structure. It has saved me many hours once I got the hang of it and I will never go back.

Do you need to set foreign keys in MySQL?

Let's say you have got two tables like the following in a MySQL database:
TABLE people:
primary key: PERSON_ID,
NAME,
SURNAME, etc.
TABLE addresses:
primary key: ADDRESS_ID,
foreign key: PERSON_ID,
addressLine1, etc.
If you manage the creation of rows (in both tables) and the retrieval of data through PHP, do you still need to create a physical relationship in the database? If yes, why?
Yes. One concrete reason is faster retrieval of rows when you join the tables: creating a foreign key constraint automatically creates an index on the column.
So the addresses table's schema should look like this (assuming the people table's primary key is PERSON_ID):
CREATE TABLE Address
(
    Address_ID INT,
    Person_ID INT,
    ......,
    CONSTRAINT tb_pk PRIMARY KEY (Address_ID),
    CONSTRAINT tb_fk FOREIGN KEY (Person_ID)
        REFERENCES People(Person_ID)
)
Strictly speaking, you don't need to use FKs; careful indexing and well-written queries might seem sufficient. However, FKs, and certainly FK constraints, are very useful when it comes to securing data consistency (avoiding orphaned data, for example).
Suppose you wrote your application, everything is tested and it works like a charm. Great, but who's to say you'll be around every time something has to be changed? Are you going to maintain the code by yourself, or is it likely that someone else might end up doing a quick fix/tweak or implementing another feature down the road? In reality, you're never going to be the only one writing and maintaining the code, and even if you are, you're almost certainly going to encounter bugs as time passes... Foreign keys inform both you and your co-workers that the data in tbl1 depends on the data in tbl2 and vice versa. Just like comments, this makes the application easier to maintain.
Bugs are easier to detect: imagine a method that deletes a record from tbl1 but forgets to update tbl2 to reflect the change. When this happens, the data is corrupted, but the query that caused it won't raise any errors: the SQL is syntactically correct and the action it performs is the desired action. These kinds of bugs could remain hidden for quite some time, and by the time they are spotted, god knows how much data has been corrupted...
Lastly, and this is an argument that is used all too often: what if the connection to the DB is lost midway through a series of update/delete queries? FK constraints enable you to cascade certain actions. I haven't actually seen this happen, but I don't know of anybody who writes code to protect against just such a scenario: deleting or updating several related records when, midway through, the connection with the DB gets cut off for some reason. You might have edited tbl2, but the connection was lost before the query for tbl1 was sent. Again, we end up with corrupted data. FK CASCADEs are very useful here: delete from tbl1 with an ON DELETE CASCADE rule, and you can rest assured that the related records are deleted from tbl2. In the same situation, ON DELETE RESTRICT can be a fairly useful rule, too.
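As a small sketch using the people/addresses tables from the question (the constraint name is illustrative), such a rule could be declared like this, so that deleting a person automatically removes that person's addresses:
ALTER TABLE addresses
    ADD CONSTRAINT fk_addresses_people
    FOREIGN KEY (PERSON_ID) REFERENCES people (PERSON_ID)
    ON DELETE CASCADE;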
Note that FKs aren't the ultimate answer to life, the universe and everything (that's 42, as we all know), but they are a vital part of truly relational database design.
Referential integrity is an article that you should read and comprehend.
There are two ways:
-The first is to handle everything in your code: when deleting or updating a record, you manage the related rows yourself. When you use a foreign key instead, you are enforcing the relation, and the DB won't let you delete records that are still referenced; that is exactly what you want when the related records must not be orphaned, and such situations do occur.
-The second is to manage things on the DB side. If you have one-to-many or many-to-many relations in the database, foreign keys will be very useful. They also come with referential actions - RESTRICT, CASCADE, SET NULL, NO ACTION - that can do some of the work for you.

Trigger to multiple tables on INSERT

quick question.
In my user database I have 5 separate tables all containing different information. 4 tables are connected by foreign key to the primary key of the first table.
I want to trigger row inserts on the other 4 tables when I do an insert on the first (primary) table. I thought that ON UPDATE CASCADE would do this for me, but after trying it I realised it did not... I know, the clue is in the name: ON UPDATE!
I also tried, and failed, to put multiple triggers on the same table, and found this was not possible either.
What I am planning to do is put a trigger on the first table to INSERT into the second, then a trigger on the second to insert into the third... etc.
I would just like to know whether this is a wise thing to do, or whether I am missing a better and simpler way of doing it.
Any help/advice much appreciated.
Based on the given information, it "feels" as if there might be a flaw in the database design if each of the child tables requires a row for every single row in the parent table. There is a reason that "ON INSERT CASCADE" does not exist; it is typically not considered meaningful.
The first thought that comes to mind is that the child tables should actually be part of the parent table; it sounds as if there is a one-to-one relationship. It still may make sense to have separate tables from an organizational standpoint (and size of records), but it is something to think about.
If there is not a one-to-one relationship, then the ability to add meaningful data beyond default values to the child tables would imply there might be a bit more normalization of data required. If the only values to be added are NULLs, then one could maybe argue that there is no real point in having the record because a LEFT JOIN could produce the same results without that record.
Having said all that, if it is required, I would think it better to have a single trigger on the parent table add all the records to the child tables, rather than chaining several triggers. That way the logic is contained in a single location.
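A minimal sketch of that single trigger (MySQL syntax; the parent/child table and column names are illustrative):
DELIMITER //
CREATE TRIGGER parent_after_insert
AFTER INSERT ON parent
FOR EACH ROW
BEGIN
    INSERT INTO child1 (parent_id) VALUES (NEW.id);
    INSERT INTO child2 (parent_id) VALUES (NEW.id);
    INSERT INTO child3 (parent_id) VALUES (NEW.id);
    INSERT INTO child4 (parent_id) VALUES (NEW.id);
END//
DELIMITER ;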
Not understanding your structure (the information you need in each of these tables is pertinent to answering correctly), I can only guess that a trigger might not be the way to do this. If your tables have fields beyond what is in table 1 and those fields do not have default values, how will you get the values for them in the trigger? Personally, I would use a stored proc that inserts into table1, gets the id value back from the insert, then inserts into the other tables with the additional information needed, and wraps it all in a transaction so that if one insert fails, all are rolled back.
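A rough sketch of that stored-procedure approach (MySQL; the table names and the extra column are illustrative):
DELIMITER //
CREATE PROCEDURE insert_full_record(IN p_name VARCHAR(50), IN p_extra VARCHAR(50))
BEGIN
    -- roll everything back if any statement fails
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        ROLLBACK;
        RESIGNAL;
    END;

    START TRANSACTION;
    INSERT INTO table1 (name) VALUES (p_name);
    SET @new_id = LAST_INSERT_ID();
    INSERT INTO table2 (table1_id, extra_info) VALUES (@new_id, p_extra);
    INSERT INTO table3 (table1_id) VALUES (@new_id);
    COMMIT;
END//
DELIMITER ;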

Key problem: Which key strategy should I use in my database?

Problem: When I use an auto-incrementing primary key in my database, this happens all the time:
I want to store an Order with 10 Items. The ordered Items belong to the Order. So I store the order, ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?), and then store the 10 Items with the foreign key (order_id).
So I always have to do:
INSERT ...
last_inserted_id = db.lastInsertId();
INSERT ...
INSERT ...
INSERT ...
and I believe this prevents me from using transactions in almost all INSERT cases where I need a foreign key.
So... here some solutions, and I don't know if they're really good:
A) Don't use auto_increment keys! Use a key table?
The key table would have two fields: table_name and next_key. Every time I need a key for a table in order to insert a new dataset, I first ask for the next_key by calling a special static KeyGenerator class method. This does a SELECT and an UPDATE, if possible in one transaction (would that work?). Of course I would request that for every affected table. Then I can INSERT my entire object graph in one transaction without playing ping-pong with the database, because I know the keys in advance.
B) Use a GUID / UUID algorithm for keys?
These are supposed to be truly unique worldwide, and they're LARGE. I mean... L_A_R_G_E. So a big amount of memory would go into these gigantic keys. Indexing will be hard, right? And data retrieval will be a pain for the database - at least I guess - integer keys are much faster to handle. On the other hand, they also provide some security: visitors can't iterate over all orders, users or pictures anymore by just incrementing the id parameter.
C) Stick with auto_incremented keys?
OK, if so, what about transactions like the one described in the example above? How can I solve that? Maybe by inserting a ghost row first and then doing a transaction with one UPDATE + n INSERTs?
D) What else?
When storing orders, you need transactions to prevent situations where only half your products are added to the database.
Depending on your database and your connector, the value returned by the last-insert-id function might be transaction-independent. For instance, with MySQL, mysql_insert_id returns the identifier for the last query from that particular client (without being affected by what other clients are doing concurrently).
Which database are you using?
Yes, typically inserting a record and then trying to select it again to find the auto-generated key is bad, especially if you are using a naive select max(id) from table query. As soon as two threads are creating records, max(id) may not actually return the last id your current thread used.
One way to avoid this is to create a sequence in the database. From your code, you select the sequence's next value and then use that value in your inserts (or you can craft a more complex SQL statement that does the selection and the inserts in one go). Sequences are atomic / thread-safe.
In MySQL you can ask for the last inserted id from the execution results which I believe will always give you the correct answer.
SQL Server supports SCOPE_IDENTITY (Transact-SQL), which should take care of your transaction and concurrency issues.
I would say stick with auto_increment.
(Assuming you are using MySQL)
"ask the database for the last inserted id (which is dangerous when it comes to concurrency, right?)"
If you use MySQL's last_insert_id() function, you only see what happened in your own session, so this is safe. You mention this:
db.last_insert_id()
I don't know what framework or language that is, but I would assume it uses MySQL's last_insert_id() under the covers (if not, it is a pretty useless database abstraction framework).
" I believe this prevents me from using transactions in almost all INSERT cases w"
I don't see why. Please explain.
D) Sequences: they may not be available in your DBMS, but if they are, they solve your problem elegantly.
For Postgresql, have a look at Sequence Functions
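A small sketch of how a sequence sidesteps the last-insert-id dance in PostgreSQL (orders_id_seq is the sequence behind a serial id column; the literal 42 stands in for the value the SELECT returns to your application):
BEGIN;
SELECT nextval('orders_id_seq');  -- suppose it returns 42

INSERT INTO orders (id, customer) VALUES (42, 'Some customer');
INSERT INTO items (order_id, item_name) VALUES (42, 'First item');
INSERT INTO items (order_id, item_name) VALUES (42, 'Second item');
COMMIT;
PostgreSQL also lets you skip the extra round trip entirely with INSERT ... RETURNING id.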
There is no final and general answer to this question.
Auto-incrementing columns are easy to use when you add new records, but using them as foreign keys within the same transaction is not so straightforward: you need database-specific commands to get the newly created key. This approach is common in certain databases, for instance SQL Server.
Sequences seem harder to use, because you need to get a key before you insert a row, but in the end they are easier to use as foreign keys. This approach is common in certain databases, for instance Oracle.
When you use Hibernate or NHibernate, auto-incrementing keys are discouraged, because some optimizations are no longer possible. Using a hi-lo algorithm backed by an additional table is recommended instead.
GUIDs are strong, for instance when sharing data between different databases, systems, disconnected scenarios, import/export, etc. In many databases most tables contain only a few hundred records, so memory and performance are not such an issue. When using NHibernate, you get a GUID generator that produces sequential GUIDs, because some databases perform better when keys are sequential.
