database table design for some unknown data - php

So, not having come from a database design background, I've been tasked with designing a web app where the end user will be entering products, and specs for their products. Normally I think I would just create columns for each of the types of spec that they would be entering. Instead, they have a variety of products that don't share the same spec types, so my question is, what's the most efficient and future-proof way to organize this data? I was leaning towards pushing a serialized object into a generic "data" column, but then are you able to do full-text searches on this data? Any other avenues to explore?

split products and specifications into two tables like this:
products
id name
specifications
id name value product_id
get all the specifications of a product when you know the product id:
SELECT name,
value
FROM specifications
WHERE product_id = ?;
add a specification to a product when you know the product id, the specification's name and the value of said specification:
INSERT INTO specifications(
name,
value,
product_id
) VALUES(
?,
?,
?
);
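a minimal runnable sketch of the two-table layout, assuming SQLite through Python's sqlite3 module as a stand-in for the real database (the SQL is the part that carries over to PHP/MySQL). the sample product and spec values are made up for illustration:

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE products (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE specifications (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        value      TEXT NOT NULL,
        product_id INTEGER NOT NULL REFERENCES products(id)
    );
""")

# The product has to exist before its specifications can point at it.
product_id = conn.execute(
    "INSERT INTO products (name) VALUES (?)", ("running shoe",)
).lastrowid

# Add specifications for that product (hypothetical sample values).
conn.executemany(
    "INSERT INTO specifications (name, value, product_id) VALUES (?, ?, ?)",
    [("size", "8", product_id), ("colour", "red", product_id)],
)

# Get all the specifications of the product by its id.
specs = conn.execute(
    "SELECT name, value FROM specifications WHERE product_id = ? ORDER BY name",
    (product_id,),
).fetchall()
print(specs)
```

every value is stored as text here, so numeric comparisons need a cast at query time.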
so before you can add specifications to a product, this product must exist. also, you can't reuse specifications for several products. that would require a somewhat more complex solution :) namely...
three tables this time:
products
id name
specifications
id name value
products_specifications
product_id specification_id
get all the specifications of a product when you know the product id:
SELECT specifications.name,
specifications.value
FROM specifications
JOIN products_specifications
ON products_specifications.specification_id = specifications.id
WHERE products_specifications.product_id = ?;
now, adding a specification becomes a little bit more tricky, cause you have to check if that specification already exists. so this will be a little heavier than the first way of doing this, since there are more queries on the db, and there's more logic in the application.
first, find the id of the specification:
SELECT id
FROM specifications
WHERE name = ?
AND value = ?;
if no id is returned, this means that said specification doesn't exist, so it must be created:
INSERT INTO specifications(
name,
value
) VALUES(
?,
?
);
next, either use the id from the select query, or get the last insert id to find the id of the newly created specification. use that id together with the id of the product that's getting the new specification, and link the two together:
INSERT INTO products_specifications(
product_id,
specification_id
) VALUES(
?,
?
);
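the find-or-create steps above can be sketched as one helper. this assumes SQLite through Python's sqlite3 module; the function name attach_specification, the UNIQUE constraint and the INSERT OR IGNORE are my additions for illustration, not part of the original design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE specifications (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        value TEXT NOT NULL,
        UNIQUE (name, value)          -- assumption: one row per distinct spec
    );
    CREATE TABLE products_specifications (
        product_id       INTEGER NOT NULL REFERENCES products(id),
        specification_id INTEGER NOT NULL REFERENCES specifications(id),
        PRIMARY KEY (product_id, specification_id)
    );
""")


def attach_specification(conn, product_id, name, value):
    """Link a product to a (name, value) specification, creating it if needed."""
    # Step 1: look for an existing specification.
    row = conn.execute(
        "SELECT id FROM specifications WHERE name = ? AND value = ?",
        (name, value),
    ).fetchone()
    if row is None:
        # Step 2: not found, so create it and use the last insert id.
        spec_id = conn.execute(
            "INSERT INTO specifications (name, value) VALUES (?, ?)",
            (name, value),
        ).lastrowid
    else:
        spec_id = row[0]
    # Step 3: link product and specification (ignored if already linked).
    conn.execute(
        "INSERT OR IGNORE INTO products_specifications VALUES (?, ?)",
        (product_id, spec_id),
    )
    return spec_id


shoe_a = conn.execute("INSERT INTO products (name) VALUES ('shoe A')").lastrowid
shoe_b = conn.execute("INSERT INTO products (name) VALUES ('shoe B')").lastrowid
first = attach_specification(conn, shoe_a, "size", "8")
second = attach_specification(conn, shoe_b, "size", "8")
spec_count = conn.execute("SELECT COUNT(*) FROM specifications").fetchone()[0]
link_count = conn.execute(
    "SELECT COUNT(*) FROM products_specifications"
).fetchone()[0]
print(first == second, spec_count, link_count)
```

both shoes end up sharing the single "size 8" row, which is the reuse this layout is meant to buy.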
however, this means that you have to create one row for every specific specification. e.g. if you have size for shoes, there would be one row for every known shoe size
specifications
id name value
1 size 7
2 size 7½
3 size 8
and so on. i think this should be enough though.

You could take a look at using an EAV model.

I've never built a products database, but I can point you to a data model for that. It's one of over 200 models available for the taking, at Database Answers. Here is the model
If you don't like this one, you can find 15 different data models for Product oriented databases. Click on "Data Models" to get a list and scroll down to "Products".
You should pick up some good design ideas there.

This is a pretty common problem - and there are different solutions for different scenarios.
If the different types of product and their attributes are fixed and known at development time, you could look at the description in Craig Larman's book (http://www.amazon.com/Applying-UML-Patterns-Introduction-Object-Oriented/dp/0131489062/ref=sr_1_1/002-2801511-2159202?ie=UTF8&s=books&qid=1194351090&sr=1-1) - there's a section on object-relational mapping and how to handle inheritance.
This boils down to "put all the possible columns into one table", "create one table for each sub class" or "put all base class items into a common table, and put sub class data into their own tables".
This is by far the most natural way of working with a relational database - it allows you to create reports, use off-the-shelf tools for object relational mapping if that takes your fancy, and you can use standard concepts such as "not null", indexing etc.
Of course, if you don't know the data attributes at development time, you have to create a flexible database schema.
I've seen 3 general approaches.
The first is the one described by davogotland. I built a solution on similar lines for an ecommerce store; it worked great, and allowed us to be very flexible about the product database. It performed very well, even with half a million products.
Major drawbacks were creating retrieval queries - e.g. "find all products with a price under x, in category y, whose manufacturer is z". It was also tricky bringing in new developers - they had a fairly steep learning curve.
It also forced us to push a lot of relational concepts into the application layer. For instance, it was hard to create foreign keys to other tables (e.g. "manufacturer") and enforce them using standard SQL functionality.
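To make the retrieval drawback concrete, here is a rough sketch of the "price under x, in category y, manufacturer z" query against a name/value specifications table. It assumes SQLite through Python's sqlite3 module and the simple two-table layout from the earlier answer; the sample rows are invented. Each filter costs one extra self-join, plus a cast because values are stored as text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE specifications (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        value      TEXT NOT NULL,
        product_id INTEGER NOT NULL REFERENCES products(id)
    );
    INSERT INTO products (id, name)
    VALUES (1, 'trail shoe'), (2, 'road shoe'), (3, 'kettle');
    INSERT INTO specifications (name, value, product_id) VALUES
        ('price', '59.90', 1), ('category', 'shoes', 1), ('manufacturer', 'Acme', 1),
        ('price', '120.00', 2), ('category', 'shoes', 2), ('manufacturer', 'Acme', 2),
        ('price', '25.00', 3), ('category', 'kitchen', 3), ('manufacturer', 'Acme', 3);
""")

# One join per attribute that takes part in the filter.
FIND_PRODUCTS = """
    SELECT p.id, p.name
    FROM products AS p
    JOIN specifications AS price
      ON price.product_id = p.id AND price.name = 'price'
    JOIN specifications AS cat
      ON cat.product_id = p.id AND cat.name = 'category'
    JOIN specifications AS maker
      ON maker.product_id = p.id AND maker.name = 'manufacturer'
    WHERE CAST(price.value AS REAL) < ?
      AND cat.value = ?
      AND maker.value = ?
    ORDER BY p.id
"""

matches = conn.execute(FIND_PRODUCTS, (100, "shoes", "Acme")).fetchall()
print(matches)
```

With ordinary columns the same question would be a single WHERE clause on one table, which is the trade-off described above.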
The second approach I've seen is the one you mention - storing the variable data in some kind of serialized format. This is a pain when querying, and suffers from the same drawbacks with the relational model. Overall, I'd only want to use serialization for data you don't have to be able to query or reason about.
The final solution I've seen is to accept that the addition of new product types will always require some level of development effort - you have to build the UI, if nothing else. I've seen applications which use a scaffolding style approach to automatically generate the underlying database structures when a new product type is created.
This is a fairly major undertaking - only really suitable for major projects, though the use of ORM tools often helps.

Related

Approaches for user extensible entities with a relational database (mySQL)

We have a php/mysql system with about 5 core entities. We now need to add the ability for customers to create custom fields for some of these entities on a per project basis.
They would contain a label, key, type, default value, and possible allowed values.
This is so they could add a custom date field, or a custom dropdown to the UI and save this value against the specific entity.
What is the best approach for storing this kind of data in a mySQL database? I need to store both the config for the field, and then the current value for a specific entity.
I've had a look at various options here.. https://ayende.com/blog/3498/multi-tenancy-extensible-data-model
But this is not really at a tenancy level, more a project level.
I was thinking...
A CustomFields table to hold the configuration of a field against an entity type and project id.
A CustomFieldValues table to hold the value saved against the field - a row per field ( entity_id | field_id | field_value)
Then we create relationships between the entities and these custom values when retrieving the entities.
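As a rough sketch of that two-table idea (assuming SQLite through Python's sqlite3 module rather than MySQL; the column names, the custom_values helper and the sample rows are all hypothetical), the config and value tables and the lookup for one entity could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Field configuration per project and entity type.
    CREATE TABLE custom_fields (
        id            INTEGER PRIMARY KEY,
        project_id    INTEGER NOT NULL,
        entity_type   TEXT NOT NULL,
        field_key     TEXT NOT NULL,
        label         TEXT NOT NULL,
        field_type    TEXT NOT NULL,
        default_value TEXT,
        UNIQUE (project_id, entity_type, field_key)
    );
    -- One row per field value saved against an entity.
    CREATE TABLE custom_field_values (
        entity_id   INTEGER NOT NULL,
        field_id    INTEGER NOT NULL REFERENCES custom_fields(id),
        field_value TEXT,
        PRIMARY KEY (entity_id, field_id)
    );
    INSERT INTO custom_fields
        (id, project_id, entity_type, field_key, label, field_type, default_value)
    VALUES (1, 7, 'task', 'due_date', 'Due date', 'date', NULL),
           (2, 7, 'task', 'priority', 'Priority', 'dropdown', 'normal');
    INSERT INTO custom_field_values VALUES (42, 1, '2024-05-01');
""")


def custom_values(conn, project_id, entity_type, entity_id):
    """Return {field_key: value} for one entity, falling back to defaults."""
    rows = conn.execute(
        """
        SELECT f.field_key, COALESCE(v.field_value, f.default_value)
        FROM custom_fields AS f
        LEFT JOIN custom_field_values AS v
               ON v.field_id = f.id AND v.entity_id = ?
        WHERE f.project_id = ? AND f.entity_type = ?
        """,
        (entity_id, project_id, entity_type),
    ).fetchall()
    return dict(rows)


task_fields = custom_values(conn, 7, "task", 42)
print(task_fields)
```

Allowed values for dropdowns and versioning are left out of the sketch; they would add columns or tables but not change the basic join.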
The issue with this is that there will be as many rows in the Values table as there are custom fields - so saving an entity will result in X extra rows. On top of that, these are versioned, so once a new version is created, there will be another X rows created for that new version.
Also, you can't index the fields on name, and joins would become pretty complex, I think, as you have to join to the configuration and the values to build the key-value pair to return against the entity. And how would you select based on a custom field name, when the field name is actually a value?
I don't want to add dynamic columns to the table, as this will affect ALL the entities in the whole system - not just the ones in the current client / project.
The other option is to store the values in a JSON column.
This could be on the entity row itself customFields or similar. This would prevent the extra rows per field, but also has issues with lack of indexing etc, and still need to join to the config table. However, you could perform queries by the property name if the key=value was stored in the JSON... WHERE entity.customFields->"$.myCustomFieldName" > 1.
Storing the field name in the JSON does mean you cannot change it once created, without a lot of pain.
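For comparison, a small sketch of the JSON-column variant. It assumes SQLite's json_extract function through Python's sqlite3 module (the MySQL `->` path operator mentioned above plays the same role there) and a SQLite build with the JSON functions available; the entity rows are made up:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT, customFields TEXT)"
)
conn.executemany(
    "INSERT INTO entity (id, name, customFields) VALUES (?, ?, ?)",
    [
        (1, "alpha", json.dumps({"myCustomFieldName": 3})),
        (2, "beta", json.dumps({"myCustomFieldName": 1})),
        (3, "gamma", json.dumps({})),  # entity without the custom field
    ],
)

# Filter on a property stored inside the JSON document.
rows = conn.execute(
    "SELECT id FROM entity "
    "WHERE json_extract(customFields, '$.myCustomFieldName') > 1 "
    "ORDER BY id"
).fetchall()
print(rows)
```

Without an index on the extracted value, a query like this has to read and parse every row.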
If anyone has any advice on approaches for this, or articles to point me at, that would be much appreciated - I'm sure this has been solved many times before....
JSON records: No! A thousand times no! If you do that, just wait until somebody actually uses your system for a few tens of millions of records, then asks you to search on one of your extra fields. Your support people will curse your name.
Key-value store. Probably yes. There's a very widely deployed existence proof of this design: WordPress. It has a table called wp_postmeta, containing metadata fields applying to wp_posts (blog pages and posts). It's proven successful.
You will need to do some multiple joining to use this stuff. For example, to search on height and eye-color, you'd need
SELECT p.person_id, p.first, p.last, h.value height, e.value eye_color
FROM person p
LEFT JOIN attrib h ON p.person_id = h.person_id AND h.`key`='height'
LEFT JOIN attrib e ON p.person_id = e.person_id AND e.`key`='eye_color'
WHERE e.value='green' and CAST(h.value AS SIGNED) < 160
As the CAST in that WHERE clause shows, you'll have some struggles with data type as well.
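You'll need LEFT JOIN operations in this sort of attribute lookup; ordinary inner JOIN operations will suppress rows with missing attributes, and that might not work for you. (A WHERE condition on a joined attribute, like the two in the query above, still filters out rows that lack that attribute.)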
But, if you do a good job with indexes, you'll be able to get decent performance from this approach.
The table structure envisioned in my example doesn't have your table describing each additional field, but you know how to add that. It also doesn't have explicit support for multi-project / multitenant data separation. But you can add that as well.
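Here is a runnable sketch of that key-value lookup, assuming SQLite through Python's sqlite3 module instead of MySQL (so the cast target is INTEGER and the key column is quoted with double quotes). The people and their attributes are invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    CREATE TABLE attrib (
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        "key"     TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (person_id, "key")
    );
    INSERT INTO person VALUES (1, 'Ann', 'Able'), (2, 'Bob', 'Baker'), (3, 'Cy', 'Cole');
    INSERT INTO attrib VALUES
        (1, 'height', '158'), (1, 'eye_color', 'green'),
        (2, 'height', '178'), (2, 'eye_color', 'green'),
        (3, 'height', '152');          -- no eye_color stored for person 3
""")

JOINS = """
    FROM person AS p
    LEFT JOIN attrib AS h ON h.person_id = p.person_id AND h."key" = 'height'
    LEFT JOIN attrib AS e ON e.person_id = p.person_id AND e."key" = 'eye_color'
"""

# Listing: the LEFT JOINs keep person 3 even though eye_color is missing.
people = conn.execute(
    "SELECT p.person_id, h.value, e.value" + JOINS + "ORDER BY p.person_id"
).fetchall()

# Searching: values are text, so height needs a cast before comparing.
short_green = conn.execute(
    "SELECT p.person_id, p.first" + JOINS +
    "WHERE e.value = 'green' AND CAST(h.value AS INTEGER) < 160"
).fetchall()
print(people)
print(short_green)
```

The primary key on (person_id, key) doubles as the index that keeps each join cheap.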

Using Views for Multilingual Database

This is a subject that has been discussed multiple times and it always depends on the situation, but I'd like to share my idea.
I'm building a new CMS that must support multilingual applications and can be installed behind existing applications.
The solutions I know and found are:
[Product]
id
price
name_en
name_de
name_fr
only getting the fields you need in your language.
or using multiple tables like:
[product]
id
price
[languages]
id
tag
[product_translation]
product_id
language_id
name
Joining the correct language
Both approaches work and have their pros and cons. Based on your choice you have to rewrite your queries.
my idea:
[product]
id
price
name
[product_translations]
product_id
language_id
name
[product_es_view]
id -- references the product table
price -- references the product table
name -- references the translation table
Now the idea is that you create a view for every language, but the view is identical to the product table.
Why?
With this setup I can make non-multilingual sites, and multilingual ones, without editing the existing model/table. Now the only thing I have to do in my code is use another table and I get a translated version of my model (in PHP it could be done by adding a simple trait to your model). With SQL Server and MySQL you can use updatable views, which save the values in the referenced tables.
I'd love to hear what you guys think of the idea, and most of all what the biggest cons are of using views for this problem.
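The per-language view idea can be sketched like this, assuming SQLite through Python's sqlite3 module (read-only here; SQLite views are not updatable without triggers). Storing the language code directly in language_id and falling back to the base name when no translation exists are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        price REAL NOT NULL,
        name  TEXT NOT NULL
    );
    CREATE TABLE product_translations (
        product_id  INTEGER NOT NULL REFERENCES product(id),
        language_id TEXT NOT NULL,   -- assumption: language code, e.g. 'es'
        name        TEXT NOT NULL,
        PRIMARY KEY (product_id, language_id)
    );
    -- Same columns as product, but name comes from the translation table.
    CREATE VIEW product_es_view AS
        SELECT p.id, p.price, COALESCE(t.name, p.name) AS name
        FROM product AS p
        LEFT JOIN product_translations AS t
               ON t.product_id = p.id AND t.language_id = 'es';
    INSERT INTO product VALUES (1, 9.5, 'cheese'), (2, 3.0, 'bread');
    INSERT INTO product_translations VALUES (1, 'es', 'queso');
""")

ALLOWED_SOURCES = {"product", "product_es_view"}


def product_names(conn, source):
    """Run the same query against the base table or a language view."""
    if source not in ALLOWED_SOURCES:  # never interpolate user input
        raise ValueError(f"unknown source: {source}")
    return conn.execute(f"SELECT id, name FROM {source} ORDER BY id").fetchall()


base_names = product_names(conn, "product")
spanish_names = product_names(conn, "product_es_view")
print(base_names, spanish_names)
```

The application code only swaps the table name, which is the appeal of the idea; the cost is one more database object per language.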
I prefer the second option, where every entity is in its own table. Using product_es_view may be easier, but the code is less clean.
Adding a new language should usually not involve adding new database tables. Adding a new row to the languages table is better.

CakePHP database design: associations for job board

Hi everyone.
I'm building a job board with CakePHP and a little help designing the database would be appreciated!
I have a table jobs with different foreign keys:
id, recruiter_id, title, sector_id, division_id, experience_id etc.
The associated tables (sectors, divisions and experiences) have the same configuration: id, name and job_count, and sometimes one or two other fields (like company_count for sectors).
So I would like to know if there is a better way to design these tables. I thought of putting the three of them in one table named lists with the keys: id, value and list_name. With this configuration I need just one request to get all the lists, not 3.
My question is: what is the "good way" solution? Maybe there's another one?
Seems kind of repetitive to have them in separate tables, when really they're all the same thing - properties of a job, and would have VERY similar table structures.
I would think you could create a single table for "job_properties" or something.
Each property could have a unique slug (if you wanted) or just use its id.
// job_properties table example
id
slug // (optional or could be called "key" if you prefer)
type // (optional - "sector", "division", "min_exp")
name // (for use on the names of things like "marketing" or "technology")
value // (int - for use on things like minimum experience)
Then each Job would hasMany JobProperty. It would also allow any job to have more than one sector if that is ever needed.
This would allow you to pull based on if a job has a particular property or set of properties and seems overall cleaner and more consolidated while not making it too obfuscated.
I think I found a solution by using a system of taxonomy. I created a table terms which contains the list of all terms that can be associated (sector, division, type of contract, etc.).
Table terms id, name, type
And I created a second table term_relationships which contains all the associations, including the name of the model that is associated.
Table term_relationships id, ref, ref_id, term_id
"ref" refers to the associated model (example: Job or Applicant in my case), "ref_id" refers to the associated record (which job or which applicant) and term_id refers to which term is associated. I think it is the most extensible and cleanest solution.
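A small sketch of those two tables, assuming SQLite through Python's sqlite3 module instead of CakePHP's ORM. The terms_for helper and the sample rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE terms (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        type TEXT NOT NULL            -- 'sector', 'division', 'contract', ...
    );
    CREATE TABLE term_relationships (
        id      INTEGER PRIMARY KEY,
        ref     TEXT NOT NULL,        -- model name: 'Job' or 'Applicant'
        ref_id  INTEGER NOT NULL,     -- id of the job or applicant
        term_id INTEGER NOT NULL REFERENCES terms(id)
    );
    INSERT INTO terms VALUES
        (1, 'Marketing', 'sector'),
        (2, 'Sales', 'division'),
        (3, 'Permanent', 'contract');
    INSERT INTO term_relationships (ref, ref_id, term_id) VALUES
        ('Job', 10, 1), ('Job', 10, 3), ('Applicant', 10, 2);
""")


def terms_for(conn, ref, ref_id):
    """Return (type, name) pairs for one record of the given model."""
    return conn.execute(
        """
        SELECT t.type, t.name
        FROM term_relationships AS r
        JOIN terms AS t ON t.id = r.term_id
        WHERE r.ref = ? AND r.ref_id = ?
        ORDER BY t.type
        """,
        (ref, ref_id),
    ).fetchall()


job_terms = terms_for(conn, "Job", 10)
applicant_terms = terms_for(conn, "Applicant", 10)
print(job_terms, applicant_terms)
```

One thing to keep in mind: because ref_id can point into different tables, the database cannot enforce it with a foreign key, so that check moves into the application.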
Thanks all for your help (especially Grafikart from where I get the idea) and hope that this topic can help someone else !

Data model for subscriptions, single purchase products and variable services

I am designing a database for a system that will handle subscription based products, standard one off set price purchase products, and billing of variable services.
A customer can be related to many domains and a domain may have many subscriptions, set priced products or billed variable services related to it.
I am unsure whether to separate each of these categories into their own 'orders' table or figure out a solution to compile them all into a single orders table.
Certain information about subscriptions is required such as start date or expiry date which is irrelevant for stand alone products. Variable services could be any price so having a single products table would mean I would have to add a new product which may be never used again or might be at a different cost.
What would be the best way to tackle this, and is splitting each into separate order tables the best way?
Have you looked at sub-typing - it might work for you. By this I mean a central "order" table containing the common attributes and optional "is a"/one-to-one relationships to specific order tables. The specific order tables contain the attributes specific only to that type of order, and a foreign key back to the central order table.
This is one way of trying to get the best of "both" worlds, by which I mean (a) the appropriate level of centralisation for reporting on things common and (b) the right level of specialisation. The added nuance comes at the cost of an extra join some times but gives you flexibility and reporting.
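A minimal sketch of the sub-typing layout, assuming SQLite through Python's sqlite3 module. The table names, columns and sample orders are illustrative only, not a recommended final schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central table: attributes common to every order.
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_type  TEXT NOT NULL
            CHECK (order_type IN ('subscription', 'product', 'service')),
        price       REAL NOT NULL
    );
    -- Subtype tables: one-to-one with orders, only type-specific attributes.
    CREATE TABLE subscription_orders (
        order_id    INTEGER PRIMARY KEY REFERENCES orders(id),
        start_date  TEXT NOT NULL,
        expiry_date TEXT NOT NULL
    );
    CREATE TABLE service_orders (
        order_id    INTEGER PRIMARY KEY REFERENCES orders(id),
        description TEXT NOT NULL
    );
    INSERT INTO orders VALUES
        (1, 5, 'subscription', 10.0),
        (2, 5, 'product', 25.0),
        (3, 6, 'service', 140.0);
    INSERT INTO subscription_orders VALUES (1, '2024-01-01', '2025-01-01');
    INSERT INTO service_orders VALUES (3, 'server migration');
""")

# Reporting on common attributes needs only the central table.
totals = dict(conn.execute(
    "SELECT order_type, SUM(price) FROM orders GROUP BY order_type"
).fetchall())

# Type-specific questions cost one extra join.
expiries = conn.execute(
    """
    SELECT o.id, s.expiry_date
    FROM orders AS o
    JOIN subscription_orders AS s ON s.order_id = o.id
    """
).fetchall()
print(totals, expiries)
```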
Although I am a little biased in favour of subtypes, some people feel that unless your order subtypes are very different, it may not be worth the effort of separating them out. There is some seemingly good discussion here on dba.stackexchange as well as here. That being said, books (or chapters at least) have been written on the subject!
As with any data model, the "best" way depends on your needs. Since you don't specify a lot of details regarding your specific needs, it's difficult to say what the best model is.
However, in general you need to consider what level of detail is necessary. For example if all subscriptions cost the same and are billed on the 1st of the month, it may be sufficient to have a field like is_subscription ENUM ('Y', 'N') in your orders table. If billing dates and prices for subscriptions can vary however, you need to store that information too. In that case it may be better to have a separate table for subscriptions.
Another consideration is exactly what an "order" represents in your model. One interpretation is that an order includes all the purchases included in one transaction, including both one-off purchases, variable services and subscriptions. A completed order would then result in a bill, and subscriptions would be automatically billed on the proper day of the month without a new order being made.
You should aim for a database design that is not hardwired to the specifics of its contents, and where it has to be, it should keep the specialization separate from the core DB design.
There are certain fields that are common to each order. Put these in one table, and have it refer to the rows in the respective (specialized) tables. That's DB normalization for you.
You could have the main table contain ID, OrderID, ItemType, ItemID, where ItemType determines the table ItemID refers to. I advise against this, but must admit that I use it sometimes.
Better would be to have these tables:
Clients: ID, Name, Address, Phone
Sellers: ID, Name, CompanyAlias
Orders: ID, ClientID, SellerID, Date, Price
OrderItems: ID, OrderID, DiscountAmount, DiscountPercentage,
ProductDomainID, ProductBottleID, ProductCheeseID, ..
Now OrderItems is where the magic happens. The first four fields explain themselves, I guess. The rest refer to tables in which you never alter or delete anything:
Products_Cheese ID, ProductCode, ProductName, Price
And if you do need a variant product for a specific order, add a field VariantOfID that's otherwise NULL. Then refer to that ID.
You can add fields to the OrderItems table without disturbing the established DB structure. Just restrict yourself to NULL values in all Product*ID fields except one. Taking this further, you might even have scenarios where you want to combine two or more fields, for example adding an ExtraSubscriptionTimespanID or an ExtraServicelevelagreementID field that is set alongside the ProductCheeseID.
This way if you use ProductCheeseID + ExtraSubscriptionTimespanID + ExtraServicelevelagreementID a customer can order a Cheese subscription with SLA, and your database structure does not repeat itself.
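The "all Product*ID fields NULL except one" rule can be pushed into the database with a CHECK constraint. A rough sketch, assuming SQLite through Python's sqlite3 module (the boolean arithmetic in the CHECK works in SQLite and MySQL but would need rewriting for some other DBMSes); the snake_case column names and the try_insert helper are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_items (
        id                             INTEGER PRIMARY KEY,
        order_id                       INTEGER NOT NULL,
        product_domain_id              INTEGER,
        product_cheese_id              INTEGER,
        extra_subscription_timespan_id INTEGER,   -- optional add-on
        -- Exactly one product reference may be set per row.
        CHECK ((product_domain_id IS NOT NULL)
             + (product_cheese_id IS NOT NULL) = 1)
    )
""")


def try_insert(order_id, domain_id, cheese_id, timespan_id=None):
    """Insert an order item; return False if the CHECK constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO order_items (order_id, product_domain_id, "
            "product_cheese_id, extra_subscription_timespan_id) "
            "VALUES (?, ?, ?, ?)",
            (order_id, domain_id, cheese_id, timespan_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(1, None, 3, timespan_id=2),  # cheese subscription: accepted
    try_insert(1, 4, 3),                    # two products on one row: rejected
    try_insert(1, None, None),              # no product at all: rejected
]
print(results)
```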
This basic design is open to alternative ideas and solutions. But keep your structure separated. Don't make one huge table that includes all the fields you may ever need; things will break horribly once you have to change something later on.
When designing database tables, you want to focus on what an entity represents and having two or more slightly different versions of what is essentially the same entity makes the schema harder to maintain. Orders are orders, they're just different order types for different products, for different customers. You'd need a number of link tables to make it all work, but you'd have to make those associations somehow and it beats having different entity types. How about this for a very rough starting point?
What would be the best way to tackle this, and is splitting each into separate order tables the best way?
That depends. And it will change over time. So the crucial part here is that you create a model of the orders and you separate the storage of them from just writing code dealing with those models.
That done, you can develop and change the database structure over time to store and query all the information you need to store and you need to query.
So much for the general advice.
More concretely, you still have to map the model onto a data structure. There are many ways to solve that, e.g. a single table of which not all columns are used all the time (flat table), subtype tables (a new table for each type), a main table with the common fields plus subtype tables containing the additional columns, or even attribute tables. All these approaches have pros and cons, and an answer here on Stack Overflow is most likely the wrong place to discuss these in full. However, you can find an insightful entry-level discussion of your problem here:
Entity-Attribute-Value (Chapter 6); page 61 in SQL Antipatterns - Avoiding the Pitfalls of Database Programming by Bill Karwin.

Database Structure Advice Needed

Im currently working on a site which will contain a products catalog. I am a little new to database design so I'm looking for advice on how best to do this. I am familiar with relational database design so I understand "many to many" or "one to many" etc (took a good db class in college). Here is an example of what an item might be categorized as:
Propeller -> aircraft -> wood -> brand -> product.
Instead of trying to write what I have so far, just take a quick look at this image I created from the phpmyadmin designer feature.
[Image: http://www.usfultimate.com/temp/db_design.jpg]
Now, this all seemed fine and dandy, until I realized that the category "wood" would also be used under propeller -> airboat -> (wood). This would mean, that "wood" would have to be recreated every time I want to use it under a different parent. This isn't the end of the world, but I wanted to know if there is a more optimal way to go about this.
Also, I am trying to keep this thing as dynamic as possible so the client can organize his catalog as his needs change.
*Edit. Was thinking about just creating a "tags" table. So I could assign the tag "wood" or "metal" or "50inch" to 1 to many items. I would still keep a parenting type thing for the main categories, but this way the categories wouldnt have to go so deep and there wouldnt be the repetition.
First, the user interface: as a user I hate searching for a product in a catalog organized in a strictly hierarchical way. I never remember which sub-sub-sub-sub...-category an "exotic" product is in, and this forces me to waste time exploring "promising" categories just to discover it is categorized in a (for me, at least) strange way.
What Kevin Peno suggests is good advice and is known as faceted browsing. As Marcia Bates wrote in After the Dot-Bomb: Getting Web Information Retrieval Right This Time, " .. faceted classification is to hierarchical classification as relational databases are to hierarchical databases. .. ".
In essence, faceted search allows users to search your catalog starting from whatever "facet" they prefer and lets them filter information by choosing other facets along the way. Note that, contrary to how tag systems are usually conceived, nothing prevents you from organizing some of these facets hierarchically.
To quickly understand what faceted search is all about, there are some demos to explore at The Flamenco Search Interface Project - Search Interfaces that Flow.
Second, the application logic: what Manitra proposes is also good advice (as I understand it), i.e. separating the nodes and links of a tree/graph into different relations. What he calls "ancestor table" (which is a much more intuitive name, however) is known as the transitive closure of a directed acyclic graph (DAG) (reachability relation). Beyond performance, it simplifies queries greatly, as Manitra said.
But I suggest a view for such an "ancestor table" (transitive closure), so that updates are real-time and incremental, not periodic via a batch job. There is SQL code (but I think it needs to be adapted a little to specific DBMSes) in the papers I mentioned in my answer to query language for graph sets: data modeling question. In particular, look at Maintaining Transitive Closure of Graphs in SQL (.ps - postscript).
Products-Categories relationship
Manitra's first point is also worth emphasizing.
What he is saying is that between products and categories there is a many-to-many relationship. I.e.: each product can be in one or more categories and in each category there can be zero or more products.
Given relation variables (relvars) Products and Categories, such a relationship can be represented, for example, as a relvar PC with at least attributes P# and C#, i.e. product and category numbers (identifiers), in foreign-key relationships with the corresponding Products and Categories numbers.
This is complementary to management of categories' hierarchies. Of course, this is only a design sketch.
On faceted browsing in SQL
A useful concept to implement "faceted browsing" is relational division, or, even, relational comparisons (see bottom of linked page). I.e. dividing PC (Products-Categories) by a (growing) list of categories chosen by a user (facet navigation), one obtains only the products in all such categories (of course, categories are presumed not to be all mutually exclusive, otherwise choosing two categories will yield zero products).
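As a sketch of the division idea in plain SQL: one common workaround is to count, per product, how many of the chosen categories it matches and keep only the products that match all of them. This assumes SQLite through Python's sqlite3 module; the pc table mirrors the PC relvar, and the products_in_all helper and the sample ids are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pc (
        product_id  INTEGER NOT NULL,
        category_id INTEGER NOT NULL,
        PRIMARY KEY (product_id, category_id)
    );
    -- 10 = aircraft, 20 = wood, 30 = airboat (made-up ids)
    INSERT INTO pc VALUES (1, 10), (1, 20), (2, 10), (3, 20), (3, 30);
""")


def products_in_all(conn, category_ids):
    """Return ids of products that belong to every category in category_ids."""
    wanted = sorted(set(category_ids))
    if not wanted:
        raise ValueError("choose at least one category")
    placeholders = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"""
        SELECT product_id
        FROM pc
        WHERE category_id IN ({placeholders})
        GROUP BY product_id
        HAVING COUNT(DISTINCT category_id) = ?
        ORDER BY product_id
        """,
        [*wanted, len(wanted)],
    ).fetchall()
    return [product_id for (product_id,) in rows]


aircraft = products_in_all(conn, [10])
aircraft_wood = products_in_all(conn, [10, 20])
everything = products_in_all(conn, [10, 20, 30])
print(aircraft, aircraft_wood, everything)
```

Each facet the user adds just extends the list, so the result narrows the way facet navigation expects.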
SQL-based DBMSes usually lack these operators (division and comparisons), so I give below some interesting papers that implement/discuss them:
ON MAKING RELATIONAL DIVISION COMPREHENSIBLE (.pdf from FIE 2003 Session Index);
A simpler (and better) SQL approach to relational division (.pdf from Journal of Information Systems Education - Contents Volume 13, Number 2 (2002));
Processing frequent itemset discovery queries by division and set containment join operators;
Laws for Rewriting Queries Containing Division Operators;
Algorithms and Applications for Universal Quantification in Relational Databases;
Optimizing Queries with Universal Quantification in Object-Oriented and Object-Relational Databases;
(ACM access required) On the complexity of division and set joins in the relational algebra;
(ACM access required) Fast algorithms for universal quantification in large databases;
and so on...
I will not go into details here but interaction between categories hierarchies and facet browsing needs special care.
A digression on "flatness"
I briefly looked at the article linked by Pras, Managing Hierarchical Data in MySQL, but I stopped reading after these few lines in the introduction:
Introduction
Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for. The tables of a relational database are not hierarchical (like XML), but are simply a flat list. Hierarchical data has a parent-child relationship that is not naturally represented in a relational database table. ...
To understand why this insistence on the flatness of relations is just nonsense, imagine a cube in a three-dimensional Cartesian coordinate system: it will be identified by 8 coordinates (triplets), say P1(x1,y1,z1), P2(x2,y2,z2), ..., P8(x8, y8, z8) [here we are not concerned with the constraints on these coordinates that make them really represent a cube].
Now, we will put this set of coordinates (points) into a relation variable and we will name this variable Points. We will represent the relation value of Points as the table below:
Points| x | y | z |
=======+====+====+====+
| x1 | y1 | z1 |
+----+----+----+
| x2 | y2 | z2 |
+----+----+----+
| .. | .. | .. |
| .. | .. | .. |
+----+----+----+
| x8 | y8 | z8 |
+----+----+----+
Is this cube being "flattened" by the mere act of representing it in a tabular way? Is a relation (value) the same thing as its tabular representation?
A relation variable assumes as values sets of points in an n-dimensional discrete space, where n is the number of relation attributes ("columns"). What does it mean for an n-dimensional discrete space to be "flat"? Just nonsense, as I wrote above.
Don't get me wrong, it is certainly true that SQL is a badly designed language and that SQL-based DBMSes are full of idiosyncrasies and shortcomings (NULLs, redundancy, ...), especially the bad ones, the DBMS-as-dumb-store type (no referential constraints, no integrity constraints, ...). But that has nothing to do with the fantasized limitations of the relational data model; on the contrary, the more they turn away from it, the worse the outcome.
In particular, the relational data model, once you understand it, poses no problem in representing any structure, even hierarchies and graphs, as I detailed with references to the published papers mentioned above. Even SQL can, if you gloss over its deficiencies, for lack of something better.
On the "The Nested Set Model"
I skimmed the rest of that article and I'm not particularly impressed by such a logical design: it suggests muddling two different entities, nodes and links, into one relation, and this will probably cause awkwardness. But I'm not inclined to analyze that design more thoroughly, sorry.
EDIT: Stephan Eggermont objected, in comments below, that " The flat list model is a problem. It is an abstraction of the implementation that makes performance difficult to achieve. ... ".
Now, my point is, precisely, that:
this "flat list model" is a fantasy: just because one lays out (represents) relations as tables ("flat lists") does not mean that relations are "flat lists" (an "object" and its representations are not the same thing);
a logical representation (relation) and physical storage details (horizontal or vertical decompositions, compression, indexes (hashes, b+tree, r-tree, ...), clustering, partitioning, etc.) are distinct; one of the points of relational data model (RDM) is to decouple logical from "physical" model (with advantages to both users and implementors of DBMSes);
performance is a direct consequence of physical storage details (implementation) and not of logical representation (Eggermont's comment is a classic example of logical-physical confusion).
The RDM does not constrain implementations in any way; one is free to implement tuples and relations as one sees fit. Relations are not necessarily files and tuples are not necessarily records of a file. Such a correspondence is a dumb direct-image implementation.
Unfortunately SQL-based DBMS implementations are, too often, dumb direct-image implementations and they suffer poor performance in a variety of scenarios - OLAP/ETL products exist to cover these shortcomings.
This is slowly changing. There are commercial and free software/open source implementations that finally avoid this fundamental pitfall:
Vertica, which is a commercial successor of..
C-Store: A Column-Oriented DBMS;
MonetDB;
LucidDB;
Kdb in a way;
and so on...
Of course, the point is not that there must exist an "optimal" physical storage design, but that whatever physical storage design can be abstracted away by a nice declarative language based on relational algebra/calculi (and SQL is a bad example) or more directly on a logic programming language (like Prolog, for example - see my answer to "prolog to SQL converter" question). A good DBMS should be able to change the physical storage design on the fly, based on data access statistics (and/or user hints).
Finally, in Eggermont's comment the statement " The relational model is getting squeezed between the cloud and prevayler. " is more nonsense, but I cannot give a rebuttal here; this comment is already too long.
Before you create a hierarchical category model in your database, take a look at this article which explains the problems and the solution (using nested sets).
To summarize, using a simple parent_category_id doesn't scale very well and you'll have a hard time writing performant SQL queries. The answer is to use nested sets which make you visualize your many-to-many category model as sets which are nested inside other sets.
If you want categories to have multiple parent categories, then it's just a "many to many" relationship instead of a "one to many" relationship. You'll need to put a bridging table between category and itself.
However, I doubt this is what you want. If I'm looking in the category Aircraft > Wood then I wouldn't want to see items from Boating > Wood. There are two Wood categories because they contain different items.
My suggestions
put a many-to-many relation between Item and Category so that a product can be displayed in many hierarchy nodes (as on eBay, SourceForge ...)
keep the category hierarchy
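The two suggestions above could be sketched roughly like this with Python's sqlite3 module. The table and column names (category, item, item_category, parent_id) and the sample rows are assumptions chosen for illustration. The hierarchy stays in category, and the bridging table lets one item appear under several nodes.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a category tree and item links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE category (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            parent_id INTEGER REFERENCES category(id)
        );
        CREATE TABLE item (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        -- Bridging table: one row per (item, category) pair.
        CREATE TABLE item_category (
            item_id INTEGER NOT NULL REFERENCES item(id),
            category_id INTEGER NOT NULL REFERENCES category(id),
            PRIMARY KEY (item_id, category_id)
        );
        -- Hypothetical data: two separate Wood categories.
        INSERT INTO category VALUES
            (1, 'Aircraft', NULL), (2, 'Wood', 1),
            (3, 'Boating', NULL), (4, 'Wood', 3);
        INSERT INTO item VALUES
            (1, 'Wooden propeller'), (2, 'Teak deck plank');
        -- The propeller is listed under both Wood categories.
        INSERT INTO item_category VALUES (1, 2), (1, 4), (2, 4);
        """
    )
    return conn


def items_in_category(conn, category_id):
    """Return the names of items linked directly to one category."""
    rows = conn.execute(
        """
        SELECT item.name
        FROM item
        JOIN item_category ON item_category.item_id = item.id
        WHERE item_category.category_id = ?
        ORDER BY item.name
        """,
        (category_id,),
    )
    return [name for (name,) in rows]
```

With this sample data, Aircraft > Wood shows only the propeller, while Boating > Wood shows both items.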
Performance on the category hierarchy
If your category hierarchy is deep, then you could generate an "Ancestors" table. This table will be generated by a batch job and will contain:
ChildId (the id of a category)
AncestorId (the id of its parent, grandparent ... every ancestor category)
This means that if you have 3 categories: 1-Propeller > 2-aircraft > 3-wood
Then the Ancestor table will contain one row for each category paired with each of its ancestors:
ChildId AncestorId
2 1
3 1
3 2
This means that to get all the children of category 1, you need just one query and you don't have to write nested queries. By the way, this works no matter how deep your category hierarchy is.
Thanks to this table, you need only one join to query against a category (with its children).
If you need help on how to create the Ancestor table, just let me know.
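One possible way to generate such a table is sketched below using Python's sqlite3 module and a recursive query. It assumes the categories are stored with a parent_id column, and it reads "1-Propeller > 2-aircraft > 3-wood" as parent > child, so every row pairs a category with one of its ancestors. The table and column names are made up for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a three-level category chain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE category (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            parent_id INTEGER REFERENCES category(id)
        );
        CREATE TABLE ancestor (
            child_id INTEGER NOT NULL REFERENCES category(id),
            ancestor_id INTEGER NOT NULL REFERENCES category(id),
            PRIMARY KEY (child_id, ancestor_id)
        );
        INSERT INTO category VALUES
            (1, 'Propeller', NULL), (2, 'Aircraft', 1), (3, 'Wood', 2);
        """
    )
    return conn


def rebuild_ancestors(conn):
    """Batch job: recompute the ancestor table from category.parent_id."""
    conn.execute("DELETE FROM ancestor")
    conn.execute(
        """
        INSERT INTO ancestor (child_id, ancestor_id)
        WITH RECURSIVE chain(child_id, ancestor_id) AS (
            -- Start with each category and its direct parent.
            SELECT id, parent_id FROM category WHERE parent_id IS NOT NULL
            UNION
            -- Then climb one level: the parent of the current ancestor.
            SELECT chain.child_id, category.parent_id
            FROM chain
            JOIN category ON category.id = chain.ancestor_id
            WHERE category.parent_id IS NOT NULL
        )
        SELECT child_id, ancestor_id FROM chain
        """
    )
    conn.commit()


def descendant_names(conn, category_id):
    """One join, any depth: every category below the given one."""
    rows = conn.execute(
        """
        SELECT category.name
        FROM ancestor
        JOIN category ON category.id = ancestor.child_id
        WHERE ancestor.ancestor_id = ?
        ORDER BY category.id
        """,
        (category_id,),
    )
    return [name for (name,) in rows]
```

The batch job has to be rerun (or the table updated incrementally) whenever the hierarchy changes, so this fits best when categories change much less often than they are read.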
Before you create a hierarchical category model in your database, take a look at this article which explains the problems and the solution (using nested sets).
To summarize, using a simple parent_category_id doesn't scale very well and you'll have a hard time writing performant SQL queries. The answer is to use nested sets which make you visualize your many-to-many category model as sets which are nested inside other sets.
It's worth pointing out that the "multiple categories" idea is basically how "tagging" works, except that in "tagging" we allow any product to have many categories. By allowing any product to be in many categories, you give the customer full freedom to filter their search by starting where they believe they need to start. It could be clicking on "airplanes", then "wood", then "turbojet engine" (or whatever). Or they could start their search with Wood, and get the same result.
This will give you the greatest flexibility, and the customer will enjoy a better UX, yet still allow you to maintain the hierarchy structure. So, while the quoted answer suggests letting categories be M:N to categories, my suggestion is to allow products to have M:N categories instead.
All in all the result is mostly the same: the categories will have a natural hierarchy, but this approach lends itself to even greater flexibility.
I should also note that this doesn't prevent a strict hierarchy either. You could quite easily enforce the hierarchy in the code where necessary (e.g. only showing the categories "cars", "airplanes", and "boats" on your initial page). It just moves the "strictness" to your business logic, which might make it better in the long run.
EDIT: I just realized that you vaguely mentioned this in your answer. I actually didn't notice it, but I think this is along the lines of what you would want to do instead. Otherwise you are mixing two hierarchy systems into your program without much benefit.
I've done this before. I recommend starting with tagging (many-to-many relationship table to products). You can build a hierarchy relationship on top of your tags (tree, or nested sets, or whatever) a lot easier than on your products. Because tagging is relatively freeform, this also gives you the ability to allow people to categorize naturally and then later codify certain expected behaviors.
For instance, we had special tags like 2009-Nov-Special. Any product with this tag was eligible to show as a special on the front page for that month. So we didn't have to build a special system to handle rotating specials onto the front page; we just used the existing tag system. Later this could be enhanced to hide those tags from consumers, etc.
Similarly, you can use tagging prefixes like: style:wood mfg:Nike to allow you to do relatively complex categorization and drilldowns without the difficulties of complex database reshuffling or the nightmares of EAV, all in a tagging system which gives you more flexibility to accommodate user expectations. Remember that users might expect to navigate the products in ways different than you as a database and business owner might expect. Using the tagging system can help you enable the shopping interface without compromising your inventory or sales tracking or anything else.
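A tag-based drilldown along these lines might look like the following sketch, written with Python's sqlite3 module. The schema (product, tag, product_tag) and the sample products are assumptions for illustration; only the tag names come from the text above. The query returns the products that carry every requested tag.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with products, tags, and their links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        CREATE TABLE product_tag (
            product_id INTEGER NOT NULL REFERENCES product(id),
            tag_id INTEGER NOT NULL REFERENCES tag(id),
            PRIMARY KEY (product_id, tag_id)
        );
        -- Hypothetical products; the tag names follow the examples above.
        INSERT INTO product VALUES
            (1, 'Wooden propeller'), (2, 'Running shoe'),
            (3, 'Wooden shoe tree');
        INSERT INTO tag VALUES
            (1, 'style:wood'), (2, 'mfg:Nike'), (3, '2009-Nov-Special');
        INSERT INTO product_tag VALUES
            (1, 1), (1, 3), (2, 2), (2, 3), (3, 1), (3, 2);
        """
    )
    return conn


def products_with_all_tags(conn, tags):
    """Return names of products that carry every tag in `tags`."""
    wanted = sorted(set(tags))
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        f"""
        SELECT product.name
        FROM product
        JOIN product_tag ON product_tag.product_id = product.id
        JOIN tag ON tag.id = product_tag.tag_id
        WHERE tag.name IN ({placeholders})
        GROUP BY product.id, product.name
        -- Keep only products that matched all requested tags.
        HAVING COUNT(DISTINCT tag.name) = ?
        ORDER BY product.name
        """,
        (*wanted, len(wanted)),
    )
    return [name for (name,) in rows]
```

For example, products_with_all_tags(conn, ["2009-Nov-Special"]) would pick the front-page specials, and adding "style:wood" to the list narrows the result in the same way a customer would drill down.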
Now, this all seemed fine and dandy, until I realized that the category "wood" would also be used under propeller -> airboat -> (wood). This would mean, that "wood" would have to be recreated every time I want to use it under a different parent. This isn't the end of the world, but I wanted to know if there is a more optimal way to go about this.
What if you have an aircraft that is of wood construction, but the propeller could be carbon fiber, fiberglass, metal, graphite?
I'd define a table of materials, and use a foreign key reference in the items table. If you want to support more than one material (e.g. say there's metal reinforcement, or screws...), then you'd need a corollary/lookup/xref table.
MATERIALS_TYPE_CODE table
MATERIALS_TYPE_CODE pk
MATERIALS_TYPE_CODE_DESC
PRODUCTS table
PRODUCT_ID, pk
MATERIALS_TYPE_CODE fk IF only one material is ever associated
PRODUCT_MATERIALS_XREF table
PRODUCT_ID, pk
MATERIALS_TYPE_CODE pk
I would also relate products to one another using a corollary/lookup/xref table. A product could be related to more than one kitted product:
KITTED_PRODUCTS table
PARENT_PRODUCT_ID, fk
CHILD_PRODUCT_ID, fk
...and it supports a hierarchical relationship because the child could be the parent of something else.
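To make the xref tables concrete, here is a minimal sketch using Python's sqlite3 module. The table and key names follow the outline above in lowercase; the NAME column on products, the material codes, and the sample rows are assumptions added for the example. A recursive query walks KITTED_PRODUCTS to list everything inside a kit, however deeply nested.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with materials, products, and kits."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE materials_type_code (
            materials_type_code TEXT PRIMARY KEY,
            materials_type_code_desc TEXT NOT NULL
        );
        CREATE TABLE products (
            product_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL  -- assumed column, for readable output
        );
        CREATE TABLE product_materials_xref (
            product_id INTEGER NOT NULL REFERENCES products(product_id),
            materials_type_code TEXT NOT NULL
                REFERENCES materials_type_code(materials_type_code),
            PRIMARY KEY (product_id, materials_type_code)
        );
        CREATE TABLE kitted_products (
            parent_product_id INTEGER NOT NULL REFERENCES products(product_id),
            child_product_id INTEGER NOT NULL REFERENCES products(product_id),
            PRIMARY KEY (parent_product_id, child_product_id)
        );
        -- Hypothetical data: a wooden aircraft with a carbon fiber propeller.
        INSERT INTO materials_type_code VALUES
            ('WOOD', 'Wood'), ('CF', 'Carbon fiber'), ('METAL', 'Metal');
        INSERT INTO products VALUES
            (1, 'Aircraft'), (2, 'Propeller'), (3, 'Propeller blade');
        INSERT INTO product_materials_xref VALUES
            (1, 'WOOD'), (1, 'METAL'), (2, 'CF'), (3, 'CF');
        INSERT INTO kitted_products VALUES (1, 2), (2, 3);
        """
    )
    return conn


def materials_of(conn, product_id):
    """Return the material descriptions recorded for one product."""
    rows = conn.execute(
        """
        SELECT m.materials_type_code_desc
        FROM product_materials_xref AS x
        JOIN materials_type_code AS m
          ON m.materials_type_code = x.materials_type_code
        WHERE x.product_id = ?
        ORDER BY m.materials_type_code_desc
        """,
        (product_id,),
    )
    return [desc for (desc,) in rows]


def kit_components(conn, product_id):
    """Return the names of all products nested inside a kit, at any depth."""
    rows = conn.execute(
        """
        WITH RECURSIVE parts(product_id) AS (
            SELECT child_product_id FROM kitted_products
            WHERE parent_product_id = ?
            UNION  -- UNION (not UNION ALL) also stops on accidental cycles
            SELECT k.child_product_id
            FROM kitted_products AS k
            JOIN parts ON k.parent_product_id = parts.product_id
        )
        SELECT products.name
        FROM parts
        JOIN products ON products.product_id = parts.product_id
        ORDER BY products.product_id
        """,
        (product_id,),
    )
    return [name for (name,) in rows]
```

Because materials hang off products rather than categories, the wooden aircraft and its carbon fiber propeller can coexist without creating a separate "wood" category under every parent.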
You can easily test your DB designs at http://cakeapp.com

Categories