EAV vs. Column based organization for my data

I'm in the process of rebuilding an application (lone developer here) using PHP and PostgreSQL. For most of the data, I'm storing it in a table with one column per attribute. However, I'm now starting to build some of the tables for the content storage. The content, in this case, consists of multiple sections that each contain different data sets; some of the data is common and shared (and foreign key'd) and other data is unique. In the current iteration of the application we have a table structure like this:
id | project_name | project_owner | site | customer_name | last_updated
-----------------------------------------------------------------------
1 | test1 | some guy | 12 | some company | 1/2/2012
2 | test2 | another guy | 04 | another co | 2/22/2012
Now, this works - but it gets hard to maintain for a few reasons. Adding new columns (happens rarely) requires modifying the database table. Audit/history tracking requires a separate table that mirrors the main table with additional information - which also requires modification if the main table is changed. Finally, there are a lot of columns - over 100 in some tables.
I've been brainstorming alternative approaches, including breaking out one large table into a number of smaller tables. That introduces other issues that I feel also cause problems.
The approach I am currently considering seems to be called the EAV model. I have a table that looks like this:
id | project_name | col_name | data_varchar | data_int | data_timestamp | update_time
--------------------------------------------------------------------------------------------------
1 | test1 | site | | 12 | | 1/2/2012
2 | test1 | customer_name | some company | | | 1/2/2012
3 | test1 | project_owner | some guy | | | 1/2/2012
...and so on. This has the advantage that I'm never updating, always inserting. Data is never overwritten, only added. Of course, the table will eventually grow to be rather large. I have an 'index' table that lists the projects and is used to reference the 'data' table. However, I feel I am missing something large with this approach. Will it scale? I originally wanted to do a simple key -> value type table, but realized I need to be able to have different data types within the table. This seems manageable because the database abstraction layer I'm using will include a type that selects data from the proper column.
Am I making too much work for myself? Should I stick with a simple table with a ton of columns?

My advice is that if you can avoid using an EAV table, do so. They tend to be performance killers. They are also difficult to query properly, especially for reporting: you have to join to the table an unknown number of times to get all the data you need out of it, and because you don't know which attributes exist, you have no idea which columns the report will need to contain. It is hard to get the kind of database constraints you need to ensure data integrity (how do you ensure that the required fields are filled in, for instance?), and the design can push you toward bad datatypes. It is far better in the long run to define tables that store the data you need.
If you really need the functionality, then at least look into NoSQL databases, which are better optimized for this sort of undefined data.
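To make the reporting problem concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL. The table name project_data and the extra 'new guy' row are made up for illustration, and "latest value" is assumed to mean the row with the highest id. Note that every attribute needs its own conditional aggregate, and the list of attributes has to be known in advance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_data (
    id             INTEGER PRIMARY KEY,
    project_name   TEXT NOT NULL,
    col_name       TEXT NOT NULL,
    data_varchar   TEXT,
    data_int       INTEGER,
    data_timestamp TEXT,
    update_time    TEXT NOT NULL
);
INSERT INTO project_data
    (project_name, col_name, data_varchar, data_int, update_time)
VALUES
    ('test1', 'site',          NULL,           12,   '2012-01-02'),
    ('test1', 'customer_name', 'some company', NULL, '2012-01-02'),
    ('test1', 'project_owner', 'some guy',     NULL, '2012-01-02'),
    ('test1', 'project_owner', 'new guy',      NULL, '2012-03-01');
""")

# Rebuild ONE logical row from the insert-only EAV table.
# Step 1 (CTE): find the newest row per (project, attribute).
# Step 2: pivot, with one CASE expression per attribute we already know about.
row = conn.execute("""
    WITH latest AS (
        SELECT MAX(id) AS id
        FROM project_data
        GROUP BY project_name, col_name
    )
    SELECT d.project_name,
           MAX(CASE WHEN d.col_name = 'site'          THEN d.data_int     END) AS site,
           MAX(CASE WHEN d.col_name = 'customer_name' THEN d.data_varchar END) AS customer_name,
           MAX(CASE WHEN d.col_name = 'project_owner' THEN d.data_varchar END) AS project_owner
    FROM project_data AS d
    JOIN latest AS l ON l.id = d.id
    GROUP BY d.project_name
""").fetchone()

print(row)  # ('test1', 12, 'some company', 'new guy')
```

With the plain column-based table, the same result is a single-row SELECT, and the database can enforce NOT NULL and type constraints per column.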

Moving your entire structure to EAV can lead to a lot of problems down the line, but it might be acceptable for the audit-trail portion of your problem since often foreign key relationships and strict datatyping may disappear over time anyway. You can probably even generate your audit tables automatically with triggers and stored procedures.
Note, however, that reconstructing old versions of records is non-trivial with an EAV audit trail and will require a fair amount of application code. The database will not be able to do it by itself.
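As a rough sketch of the trigger idea, the following uses SQLite trigger syntax through Python's sqlite3 module; in PostgreSQL you would write a trigger function instead, so treat the syntax as illustrative only. The names projects, projects_audit and audit_project_owner are hypothetical, and only one column is audited to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (
    id            INTEGER PRIMARY KEY,
    project_name  TEXT NOT NULL,
    project_owner TEXT,
    customer_name TEXT
);

-- EAV-style audit table: one row per changed attribute.
CREATE TABLE projects_audit (
    audit_id   INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    col_name   TEXT NOT NULL,
    old_value  TEXT,
    new_value  TEXT,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Log a row only when the value really changes (IS NOT is NULL-safe in SQLite).
CREATE TRIGGER audit_project_owner
AFTER UPDATE OF project_owner ON projects
WHEN OLD.project_owner IS NOT NEW.project_owner
BEGIN
    INSERT INTO projects_audit (project_id, col_name, old_value, new_value)
    VALUES (OLD.id, 'project_owner', OLD.project_owner, NEW.project_owner);
END;
""")

conn.execute(
    "INSERT INTO projects (project_name, project_owner, customer_name) VALUES (?, ?, ?)",
    ("test1", "some guy", "some company"),
)
conn.execute("UPDATE projects SET project_owner = 'another guy' WHERE id = 1")
conn.execute("UPDATE projects SET project_owner = 'another guy' WHERE id = 1")  # no change, not logged

audit_rows = conn.execute(
    "SELECT project_id, col_name, old_value, new_value FROM projects_audit"
).fetchall()
print(audit_rows)  # [(1, 'project_owner', 'some guy', 'another guy')]
```

The main table keeps its typed columns and constraints; only the history is stored as attribute/value pairs. One trigger per audited column is needed in this form, which is why generating them automatically is attractive.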
An alternative you could consider is to store all your data (new and old records) in the same table. You can either include audit fields in the same table and leave them NULL when unnecessary, or mark some rows in the table as "current" and store the audit-related fields in another table. To simplify your application, you can create a view which only shows current rows and issue queries against the view.
You can accomplish this with a joined table inheritance pattern. With joined table inheritance, you put common attributes into a base table along with a "type" column, and you can join to additional tables (which have the same primary key which is also a foreign key) based on type. Many Data-Mapper-Pattern ORMs have native support for this pattern, often called "polymorphism".
You could also use PostgreSQL's native table inheritance mechanism, but note the caveats carefully!
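For the single-table alternative with a view of current rows, a minimal sketch could look like this. It uses sqlite3 for convenience; the names project_versions, current_projects, superseded_at and save_new_version are assumptions made up for the example, and a NULL superseded_at is what marks a row as current.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_versions (
    version_id    INTEGER PRIMARY KEY,
    project_id    INTEGER NOT NULL,
    project_name  TEXT NOT NULL,
    project_owner TEXT,
    valid_from    TEXT NOT NULL,
    superseded_at TEXT              -- NULL means "this is the current row"
);

-- The application reads from the view and never sees old versions.
CREATE VIEW current_projects AS
    SELECT project_id, project_name, project_owner
    FROM project_versions
    WHERE superseded_at IS NULL;
""")

def save_new_version(conn, project_id, project_name, project_owner, now):
    """Retire the current row (if any) and insert the new one atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE project_versions SET superseded_at = ? "
            "WHERE project_id = ? AND superseded_at IS NULL",
            (now, project_id),
        )
        conn.execute(
            "INSERT INTO project_versions "
            "(project_id, project_name, project_owner, valid_from) VALUES (?, ?, ?, ?)",
            (project_id, project_name, project_owner, now),
        )

save_new_version(conn, 1, "test1", "some guy", "2012-01-02")
save_new_version(conn, 1, "test1", "another guy", "2012-02-22")

current = conn.execute("SELECT * FROM current_projects").fetchall()
history_count = conn.execute("SELECT COUNT(*) FROM project_versions").fetchone()[0]
print(current)        # [(1, 'test1', 'another guy')]
print(history_count)  # 2
```

Old versions stay in the same typed columns as the current data, so reconstructing a past state is an ordinary query on valid_from and superseded_at.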

Related

How to manage huge data inside mysql table

I have an issue with a huge amount of data in my table.
Let's suppose I have a table named employee_details and its columns are
Emp-name | Emp-email | Emp-mobile | Emp-designation | Emp-salary
Suppose I have 100,000 rows in it. How should I structure the table for best performance?

100,000 rows is not a big deal for MySQL, which is designed to handle millions of rows. Make sure your data types and indexes are set up properly and you should be fine.
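To illustrate what "indexes set up properly" buys you, here is a small sketch. It runs on SQLite via Python's sqlite3 module, not MySQL (where you would use EXPLAIN instead of EXPLAIN QUERY PLAN), and the column names, the sample data and the index name idx_employee_email are assumptions. The idea carries over: index the columns you filter on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee_details (
        emp_id          INTEGER PRIMARY KEY,
        emp_name        TEXT NOT NULL,
        emp_email       TEXT NOT NULL,
        emp_mobile      TEXT,
        emp_designation TEXT,
        emp_salary      INTEGER
    )
""")
conn.executemany(
    "INSERT INTO employee_details "
    "(emp_name, emp_email, emp_mobile, emp_designation, emp_salary) "
    "VALUES (?, ?, ?, ?, ?)",
    (
        (f"name{i}", f"user{i}@example.com", f"555-{i:06d}", "engineer", 50000 + i)
        for i in range(100_000)
    ),
)

def query_plan(conn, sql, params=()):
    """Return the human-readable part of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description

lookup = "SELECT * FROM employee_details WHERE emp_email = ?"
email = ("user99999@example.com",)

plan_before = query_plan(conn, lookup, email)   # full table scan
conn.execute("CREATE INDEX idx_employee_email ON employee_details (emp_email)")
plan_after = query_plan(conn, lookup, email)    # index search

print(plan_before)
print(plan_after)
```

Before the index exists the plan is a scan of the whole table; afterwards it is a search using idx_employee_email. The exact wording of the plan text varies between SQLite versions.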

Normalizing a simple SQL Table

I have two different tables and I am not sure of the best way to get them out of the first normal form and into the second normal form. The first table holds the user information, while the second holds the products associated with the account. If I do it this way, I know it is only in NF1 and that the foreign key User_ID will be repeated many times in Table 2. See the tables below.
Table 1
|User_ID (primary)| Name | Address | Email | Username | Password |
Table 2
| Product_ID (Primary Key) | User_ID (Foreign Key) |
Is there a better way to make table two in which the user ID is not repeated? I have thought about having a separate table in the database for each user, but from all of the other questions I read on Stack Overflow, this is not a good idea.
The constraints I am working with are 1-1000 users and Table Two will have approximately 1-1000 indexes per user. Is there a better way to create this set of tables?

I don't see NF2 violated. It states:
a table is in 2NF if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table.
quoted from Wikipedia article "Second normal form", 2016-11-26
Table 2 has only one candidate key, the primary key. The primary key consists of only one column. So, there is no proper subset of a candidate key. So, NF2 can't be violated unless NF1 is not fulfilled.

You say you want "to make table two in which the user ID is not repeated".
Then why not do this?
Table 1
|User_ID (primary)| Name | Address | Email | Username | Password | Product_ID ( Foreign Key nullable)|
Table 2
| Product_ID (Primary Key)|

There's nothing wrong with a value appearing many times. Redundancy arises when two queries that aren't syntactically equivalent always both return the same value. Only uncontrolled redundancy is bad. Normalization controls some redundancy by replacing a table by smaller ones that join to it.
Normalization decomposes a table independently of other tables. (We define the normal form of a database as the lowest normal form that all of its tables are in.) Foreign keys have nothing to do with violating normal forms.
Learn what it means for a table to be in a given normal form. You will need to learn a definition. And the definitions of the terms it uses. And the definitions of the terms they use. Etc. A table is in 2NF when every non-prime column has a functional dependency that is full on every candidate key. Also learn the algorithm for decomposing a table into components that are in a given normal form. Assuming that these tables can hold more than one row, so that {} is not a candidate key, both these tables are in 2NF.
A table in 2NF is also in 1NF. So you don't want "to get it out of the first normal form".
2NF is unimportant. When dealing with functional dependencies, what matters is BCNF, which decomposes as much as possible but requires certain higher-cost constraints, and 3NF, which doesn't decompose as much as possible but requires certain lower-cost constraints.

Storing assignments between 2 tables in MySQL

I am wondering what the best solution is for storing relations between 2 tables in MySQL.
I have the following structure:
Table: categories
id | name | etc...
_______________________________
1 | Graphic cards | ...
2 | Processors | ...
3 | Hard Drives | ...
Table: properties_of_categories
id | name
_____________________
1 | Capacity
2 | GPU Speed
3 | Memory size
4 | Clock rate
5 | Cache
Now I need them to be connected, and the question is which is the better, more efficient and lighter solution. This matters because there may be hundreds of categories and thousands of properties assigned to them.
Should I just create another table with a structure like
categoryId | propertyId
Or perhaps add another column to categories table and store properties in text field like 1,7,19,23
Or maybe create json files named for example 7.json with content like
{1,7,19,23}

As this question pertains to the relational world, I would suggest adding another table to store the many-to-many relationship between Category and Property.
You can also use a JSON column to store many values in one of the tables.
The JSON datatype was introduced in MySQL 5.7 and it comes with various features for retrieving and updating JSON data. However, if you are using an older version, you would need to manage it with a string column and some cumbersome string-manipulation queries.

The required structure depends on the relationship type: one-to-many, many-to-one, or many-to-many (M2M).
For a one-to-many, a foreign key (FK) on the 'many' side relates many items to the 'one' side. The reverse is correct for many-to-one.
For many-to-many (M2M) you need an intermediate relational (or junction) table exactly as you suggest. This allows you to "reuse" both categories and properties in any combinations. However it's slightly more SQL - requiring 2 JOINs.
If you are looking for performance, then using FKs to primary keys (PKs) would be very efficient and the queries are pretty simple. Using JSON would presumably require you to parse in PHP and construct on-the-fly second queries which would multiply your coding work and testing, data transfer, CPU overhead, and limit scalability.
In your case I'm guessing that both "graphics cards" and "hard drives" could have e.g. "memory size" plus other properties, so you would need a M2M relational table as you suggest.
As long as your keys are indexed (which PKs are), your JOIN to this relational table will be very quick and efficient.
If you use CONSTRAINTs with your relations, then you ensure you maintain data integrity: you cannot delete a category to which a property is "attached". This is a good feature in the long run.
Hundreds and thousands of records is a tiny amount for MySQL. You would use this technique even with millions of records. So there's no worry about size.
RDBMS databases are designed specifically to do this, so I would recommend using the native features rather than trying to do it yourself in JSON. (unless I'm missing some new JSON MySQL feature! *)
* Since posting this, I indeed stumbled across a new JSON MySQL feature. It seems, from a quick read, you could implement all sorts of new structures and relations using JSON and virtual column keys, possibly removing the need for junction tables. This will probably blur the line between MySQL as an RDBMS and NoSQL.
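Here is a minimal sketch of the junction-table approach, run on SQLite through Python's sqlite3 module for convenience (MySQL with InnoDB behaves similarly for this purpose). The junction table name category_properties, the sample pairings and the helper properties_for are made up for the example. It shows the two JOINs and a foreign key blocking the delete of a category that still has properties attached.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE properties_of_categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Junction table: one row per (category, property) pair.
CREATE TABLE category_properties (
    categoryId INTEGER NOT NULL REFERENCES categories(id),
    propertyId INTEGER NOT NULL REFERENCES properties_of_categories(id),
    PRIMARY KEY (categoryId, propertyId)
);

INSERT INTO categories VALUES (1, 'Graphic cards'), (2, 'Processors'), (3, 'Hard Drives');
INSERT INTO properties_of_categories VALUES
    (1, 'Capacity'), (2, 'GPU Speed'), (3, 'Memory size'), (4, 'Clock rate'), (5, 'Cache');
INSERT INTO category_properties VALUES (1, 2), (1, 3), (3, 1), (3, 3), (3, 5);
""")

def properties_for(conn, category_name):
    """Names of all properties attached to a category (the two JOINs)."""
    rows = conn.execute("""
        SELECT p.name
        FROM categories AS c
        JOIN category_properties AS cp ON cp.categoryId = c.id
        JOIN properties_of_categories AS p ON p.id = cp.propertyId
        WHERE c.name = ?
        ORDER BY p.name
    """, (category_name,)).fetchall()
    return [name for (name,) in rows]

gpu_props = properties_for(conn, "Graphic cards")
hdd_props = properties_for(conn, "Hard Drives")
print(gpu_props)  # ['GPU Speed', 'Memory size']
print(hdd_props)  # ['Cache', 'Capacity', 'Memory size']

# With the default (restricting) FK action, a category in use cannot be deleted.
try:
    conn.execute("DELETE FROM categories WHERE id = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
print(delete_blocked)  # True
```

"Memory size" is stored once and reused by both categories, which is exactly what the comma-separated text field and the per-category JSON file cannot guarantee.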

The first solution is better when it comes to relational databases. You should create a table that pairs each category with multiple properties. Since a property can also belong to several categories, this is effectively a many-to-many relationship.
You could structure the table like so:
CREATE TABLE categories_properties_match(
    categoryId INTEGER NOT NULL,
    propertyId INTEGER NOT NULL,
    PRIMARY KEY(categoryId, propertyId),
    FOREIGN KEY(categoryId) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE,
    FOREIGN KEY(propertyId) REFERENCES properties_of_categories(id) ON UPDATE CASCADE ON DELETE CASCADE
);
The primary key ensures that there will be no duplicate entries, that is, entries that match one category to the same property twice.
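A quick way to see both effects of that definition (the composite primary key and ON DELETE CASCADE) is to run it. This sketch uses SQLite via Python's sqlite3 module instead of MySQL; the parent table definitions and the sample rows are assumptions added so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for FK enforcement in SQLite
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE properties_of_categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

CREATE TABLE categories_properties_match(
    categoryId INTEGER NOT NULL,
    propertyId INTEGER NOT NULL,
    PRIMARY KEY(categoryId, propertyId),
    FOREIGN KEY(categoryId) REFERENCES categories(id) ON UPDATE CASCADE ON DELETE CASCADE,
    FOREIGN KEY(propertyId) REFERENCES properties_of_categories(id) ON UPDATE CASCADE ON DELETE CASCADE
);

INSERT INTO categories VALUES (1, 'Graphic cards'), (3, 'Hard Drives');
INSERT INTO properties_of_categories VALUES (2, 'GPU Speed'), (3, 'Memory size');
INSERT INTO categories_properties_match VALUES (1, 2), (1, 3), (3, 3);
""")

# 1) The composite primary key rejects pairing the same category and property twice.
try:
    conn.execute("INSERT INTO categories_properties_match VALUES (1, 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# 2) ON DELETE CASCADE removes the pairings when a category is deleted.
conn.execute("DELETE FROM categories WHERE id = 1")
remaining = conn.execute(
    "SELECT categoryId, propertyId FROM categories_properties_match "
    "ORDER BY categoryId, propertyId"
).fetchall()

print(duplicate_rejected)  # True
print(remaining)           # [(3, 3)]
```

Whether you want CASCADE here is a design choice: it cleans up pairings automatically, whereas leaving the default action in place would refuse to delete a category that is still in use.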

MySQL BLOB data in same table or not

I have one varchar and two BLOB types of data for recipes. I don't need relations between data. For example I don't need to know which meals need potato etc.
I'll get a meal's materials from the database, edit them and save them again as a BLOB. Then I will create a binary text file (~100KB) on the fly and save it in another column named binary data.
So my question is, does splitting the table into two make sense? Does putting one BLOB in one table and the other BLOB in another table change performance (in theory)? Or doesn't it change anything except backup issues?
+-id--+-meal name (varchar)----+-materials (BLOB)------------+-binary data (BLOB)---+
| 1 | meatball | (meat, potato, bread etc.) | (some binary files) |
| 2 | omelette | (potato, egg, etc.) | (other binary files) |
+-----+------------------------+-----------------------------+----------------------+

If you will be using an ORM, it is better to use the split-table approach.
Otherwise, when you ask for the materials, the ORM will usually fetch all available fields, and so read big and unnecessary "binary" objects.
On the other hand, if you'll be serving the binary results, a better approach would be to save them as files and serve them directly.

It's more a design choice than a specific performance improvement. This assumes your query is not doing a catch-all "SELECT *". Your queries should always target the specific columns you are interested in for a given purpose.
If you do not anticipate the BLOB types for a specific meal growing past your current expectation, then keeping it in one table is an appropriate choice. This is assuming there is a one-to-one relationship between them.
However, if there is any chance you might need more BLOB objects for a meal, then yes, I would consider splitting them out into a new table with a cross-reference. Sometimes it is better to be safe than sorry.
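A minimal sketch of the split layout, assuming hypothetical table names meals and meal_files and using SQLite through Python's sqlite3 module in place of MySQL. Each query names only the columns it needs, so listing meals never touches the large BLOB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meals (
    id        INTEGER PRIMARY KEY,
    meal_name TEXT NOT NULL,
    materials BLOB
);

-- One-to-one with meals: the primary key is also the foreign key.
CREATE TABLE meal_files (
    meal_id     INTEGER PRIMARY KEY REFERENCES meals(id),
    binary_data BLOB NOT NULL
);
""")

conn.execute(
    "INSERT INTO meals (id, meal_name, materials) VALUES (?, ?, ?)",
    (1, "meatball", b"meat, potato, bread"),
)
conn.execute(
    "INSERT INTO meal_files (meal_id, binary_data) VALUES (?, ?)",
    (1, bytes(100 * 1024)),  # stand-in for the ~100KB generated file
)

# Listing meals reads only the small columns.
names = [name for (name,) in conn.execute("SELECT meal_name FROM meals ORDER BY id")]

# Editing materials reads only the materials BLOB.
materials = conn.execute("SELECT materials FROM meals WHERE id = ?", (1,)).fetchone()[0]

# The big file is fetched (or here, just measured) only when it is actually needed.
file_size = conn.execute(
    "SELECT length(binary_data) FROM meal_files WHERE meal_id = ?", (1,)
).fetchone()[0]

print(names)      # ['meatball']
print(materials)  # b'meat, potato, bread'
print(file_size)  # 102400
```

If you always write explicit column lists, the single-table layout gives the same access pattern; the split mostly protects you from "SELECT *" and from ORMs that load every column.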

Implementing order in a PHP/MySQL CMS & dealing with concurrency

I have the following tables:
======================= =======================
| galleries | | images |
|---------------------| |---------------------|
| PK | gallery_id |<--\ | PK | image_id |
| | name | \ | | title |
| | description | \ | | description |
| | max_images | \ | | filename |
======================= \-->| FK | gallery_id |
=======================
I need to implement a way for the images that are associated with a gallery to be sorted into a specific order. It is my understanding that relational databases are not designed for hierarchical ordering.
I also wish to prepare for the possibility of concurrency, even though it is highly unlikely to be an issue in my current project, as it is a single-user app. (So, the priority here is dealing with allowing the user to rearrange the order).
I am not sure the best way to go about this, as I have never implemented ordering in a database and am new to concurrency. Because of this I have read about locking MySQL tables and am not sure if this is a situation where I should implement it.
Here are my two ideas:
1. Add a column named order_num to the images table. Lock the table and allow the client to rearrange the order of the images, then update the table and unlock it.
2. Add a column named order_num to the images table (just as in idea 1 above). Allow the client to update one image's place at a time without locking.
Thanks!

Here's my thought: you don't want to put too many man-hours into a problem that isn't likely to happen. Therefore, take a simple solution that's not going to cause a lot of side effects, and fix it later if it's a problem.
In a web-based world, you don't want to lock a table for a user to do edits and then wait until they're done to unlock the table. User 1 in this scenario may never come back, they may lose their session, or their browser could crash, etc. That means you have to do a lot of work to figure out when to unlock the table, plus code to let user 2 know that the table's locked, and they can't do anything with it.
I'd suggest this design instead: let them both go into edit mode, probably in their browser, with some JavaScript. They can drag images around until they're happy with the order, then submit the order in full. You update your order_num field in a single transaction to the database.
In this scenario the worst thing that happens is that user 1 and user 2 are editing at the same time, and whoever edits last is the one whose order is preserved. Maybe they update at the exact same time, but the database will handle that, as it's going to queue up transactions.
The fallback to this problem is that whoever got their order overwritten has to do it again. Annoying but there's no loss, and the code to implement this is much simpler than the code to handle locking.
I hate to sidestep your question, but those are my thoughts about it.
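As a sketch of "submit the order in full and update order_num in a single transaction", here is one way it could look. It uses Python's sqlite3 module instead of PHP and MySQL, and the function name save_order and the sample rows are made up; the same pattern applies with PDO and InnoDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE galleries (gallery_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE images (
    image_id   INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    gallery_id INTEGER NOT NULL REFERENCES galleries(gallery_id),
    order_num  INTEGER NOT NULL
);
INSERT INTO galleries VALUES (1, 'Holiday');
INSERT INTO images VALUES (1, 'beach', 1, 1), (2, 'hotel', 1, 2), (3, 'sunset', 1, 3);
""")

def save_order(conn, gallery_id, ordered_image_ids):
    """Write the complete submitted order in one transaction (all or nothing)."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "UPDATE images SET order_num = ? WHERE image_id = ? AND gallery_id = ?",
            [
                (position, image_id, gallery_id)
                for position, image_id in enumerate(ordered_image_ids, start=1)
            ],
        )

def current_order(conn, gallery_id):
    rows = conn.execute(
        "SELECT image_id FROM images WHERE gallery_id = ? ORDER BY order_num",
        (gallery_id,),
    )
    return [image_id for (image_id,) in rows]

save_order(conn, 1, [3, 1, 2])           # user 1 submits
order_after_user1 = current_order(conn, 1)
save_order(conn, 1, [2, 3, 1])           # user 2 submits later: last write wins
order_after_user2 = current_order(conn, 1)

print(order_after_user1)  # [3, 1, 2]
print(order_after_user2)  # [2, 3, 1]
```

Because each submission rewrites every position inside one transaction, the stored order is always one user's complete ordering, never a mix of two.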

If you don't want "per-user sorting", the order_num column seems the right way to go.
If you choose InnoDB as your storage engine, you can use transactions and won't have to lock the table.

Relational database and hierarchy:
I use id (auto increment) and parent columns to achieve hierarchy. A parent of zero is always the root element. You could order by id, parent.
Concurrency:
This is an easy way to deal with concurrency. Use a version column. If the version has changed since user 1 started editing, block the save, offer to reload edit. Increment the version after each successful edit.
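The version-column check can be sketched like this. It is written with Python's sqlite3 module rather than PHP and MySQL, and it assumes a hypothetical version column on the galleries table and a made-up function name save_order_if_unchanged. The key point is that the version test and the increment happen in the same UPDATE statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE galleries (
    gallery_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    version    INTEGER NOT NULL DEFAULT 0   -- bumped on every successful edit
);
CREATE TABLE images (
    image_id   INTEGER PRIMARY KEY,
    gallery_id INTEGER NOT NULL REFERENCES galleries(gallery_id),
    order_num  INTEGER NOT NULL
);
INSERT INTO galleries (gallery_id, name) VALUES (1, 'Holiday');
INSERT INTO images VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);
""")

def save_order_if_unchanged(conn, gallery_id, ordered_image_ids, expected_version):
    """Save the order only if nobody else saved since expected_version was read."""
    with conn:  # one transaction for the version check and the reordering
        cur = conn.execute(
            "UPDATE galleries SET version = version + 1 "
            "WHERE gallery_id = ? AND version = ?",
            (gallery_id, expected_version),
        )
        if cur.rowcount == 0:
            return False  # stale version: block the save, offer a reload
        conn.executemany(
            "UPDATE images SET order_num = ? WHERE image_id = ? AND gallery_id = ?",
            [
                (position, image_id, gallery_id)
                for position, image_id in enumerate(ordered_image_ids, start=1)
            ],
        )
    return True

# Both users opened the editor when the gallery was at version 0.
first_ok = save_order_if_unchanged(conn, 1, [3, 1, 2], expected_version=0)
second_ok = save_order_if_unchanged(conn, 1, [2, 3, 1], expected_version=0)

final_order = [
    image_id
    for (image_id,) in conn.execute(
        "SELECT image_id FROM images WHERE gallery_id = 1 ORDER BY order_num"
    )
]
version = conn.execute("SELECT version FROM galleries WHERE gallery_id = 1").fetchone()[0]

print(first_ok, second_ok)  # True False
print(final_order)          # [3, 1, 2]
print(version)              # 1
```

Compared with last-write-wins, the second user is told their copy is out of date instead of silently overwriting the first user's order, and no table locks are held while anyone is editing.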
