I'm currently working on a site which will contain a product catalog. I am a little new to database design, so I'm looking for advice on how best to do this. I am familiar with relational database design, so I understand "many to many", "one to many", etc. (I took a good db class in college). Here is an example of how an item might be categorized:
Propeller -> aircraft -> wood -> brand -> product.
Instead of trying to write out what I have so far, just take a quick look at this image I created with the phpMyAdmin designer feature:
http://www.usfultimate.com/temp/db_design.jpg
Now, this all seemed fine and dandy, until I realized that the category "wood" would also be used under propeller -> airboat -> (wood). This would mean, that "wood" would have to be recreated every time I want to use it under a different parent. This isn't the end of the world, but I wanted to know if there is a more optimal way to go about this.
Also, I am trying to keep this thing as dynamic as possible so the client can organize his catalog as his needs change.
*Edit: I was thinking about just creating a "tags" table, so I could assign the tag "wood" or "metal" or "50inch" to one or many items. I would still keep a parent/child arrangement for the main categories, but this way the categories wouldn't have to go so deep and there wouldn't be the repetition.
First, the user interface: as a user, I hate searching for a product in a catalog organized in a strictly hierarchical way. I never remember which sub-sub-sub-sub...-category an "exotic" product is in, and this forces me to waste time exploring "promising" categories just to discover that it is categorized in a (for me, at least) strange way.
What Kevin Peno suggests is good advice and is known as faceted browsing. As Marcia Bates wrote in After the Dot-Bomb: Getting Web Information Retrieval Right This Time, "... faceted classification is to hierarchical classification as relational databases are to hierarchical databases. ...".
In essence, faceted search allows users to search your catalog starting from whatever "facet" they prefer and lets them filter information by choosing other facets along the way. Note that, contrary to how tag systems are usually conceived, nothing prevents you from organizing some of these facets hierarchically.
To quickly understand what faceted search is all about, there are some demos to explore at The Flamenco Search Interface Project - Search Interfaces that Flow.
Second, the application logic: what Manitra proposes is also good advice (as I understand it), i.e. separating the nodes and links of a tree/graph into different relations. What he calls the "ancestor table" (which is a much more intuitive name, however) is known as the transitive closure of a directed acyclic graph (DAG), i.e. its reachability relation. Beyond performance, it simplifies queries greatly, as Manitra said.
But I suggest a view for such an "ancestor table" (transitive closure), so that updates are real-time and incremental rather than done periodically by a batch job. There is SQL code (but I think it needs to be adapted a little to specific DBMSes) in the papers I mentioned in my answer to query language for graph sets: data modeling question. In particular, look at Maintaining Transitive Closure of Graphs in SQL (.ps - postscript).
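As a rough illustration of an "ancestor table" exposed as a view, here is a minimal sketch using Python's sqlite3 module. It assumes a DBMS with recursive common table expressions, and the table and column names (categories, category_links, category_ancestors) are hypothetical. Note that this is a plain view recomputed at query time, not the incremental maintenance scheme described in the paper cited above.

```python
import sqlite3

# Hypothetical schema: nodes (categories) and links (parent -> child edges)
# live in separate tables; the closure is derived, never stored.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE category_links (
    parent_id INTEGER NOT NULL REFERENCES categories(id),
    child_id  INTEGER NOT NULL REFERENCES categories(id),
    PRIMARY KEY (parent_id, child_id)
);

-- The "ancestor table" as a view: the transitive closure of category_links.
CREATE VIEW category_ancestors AS
WITH RECURSIVE closure(ancestor_id, descendant_id) AS (
    SELECT parent_id, child_id FROM category_links
    UNION
    SELECT c.ancestor_id, l.child_id
    FROM closure AS c
    JOIN category_links AS l ON l.parent_id = c.descendant_id
)
SELECT ancestor_id, descendant_id FROM closure;

-- A small DAG: "wood" (4) sits under both "aircraft" (2) and "airboat" (3).
INSERT INTO categories VALUES
    (1, 'propeller'), (2, 'aircraft'), (3, 'airboat'), (4, 'wood');
INSERT INTO category_links VALUES (1, 2), (1, 3), (2, 4), (3, 4);
""")


def descendants(category_id):
    """Return the ids of every category reachable from category_id."""
    rows = conn.execute(
        "SELECT descendant_id FROM category_ancestors "
        "WHERE ancestor_id = ? ORDER BY descendant_id",
        (category_id,),
    )
    return [r[0] for r in rows]
```

Because the closure is derived from category_links on every read, inserting or deleting a link is reflected immediately; the trade-off is that the recursion runs at query time.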
Products-Categories relationship
Manitra's first point is also worth emphasizing.
What he is saying is that there is a many-to-many relationship between products and categories, i.e. each product can be in one or more categories, and each category can contain zero or more products.
Given relation variables (relvars) Products and Categories, such a relationship can be represented, for example, as a relvar PC with at least the attributes P# and C#, i.e. product and category numbers (identifiers), each in a foreign-key relationship with the corresponding Products and Categories numbers.
This is complementary to management of categories' hierarchies. Of course, this is only a design sketch.
On faceted browsing in SQL
A useful concept for implementing "faceted browsing" is relational division or, even, relational comparisons (see bottom of linked page). That is, dividing PC (Products-Categories) by a (growing) list of categories chosen by a user (facet navigation), one obtains only the products that are in all of those categories (of course, the categories are presumed not to be all mutually exclusive; otherwise, choosing two categories would yield zero products).
SQL-based DBMSes usually lack these operators (division and comparisons), so I list below some interesting papers that implement or discuss them:
ON MAKING RELATIONAL DIVISION COMPREHENSIBLE (.pdf from FIE 2003 Session Index);
A simpler (and better) SQL approach to relational division (.pdf from Journal of Information Systems Education - Contents Volume 13, Number 2 (2002));
Processing frequent itemset discovery queries by division and set containment join operators;
Laws for Rewriting Queries Containing Division Operators;
Algorithms and Applications for Universal Quantification in Relational Databases;
Optimizing Queries with Universal Quantification in Object-Oriented and Object-Relational Databases;
(ACM access required) On the complexity of division and set joins in the relational algebra;
(ACM access required) Fast algorithms for universal quantification in large databases;
and so on...
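For a concrete feel of division in plain SQL, here is a minimal sketch of one common formulation (GROUP BY with a HAVING count), run through Python's sqlite3 module. The table name pc, its columns, and the sample rows are assumptions made for illustration; the papers above discuss other, more general formulations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- PC: the products-categories relation (hypothetical sample data).
CREATE TABLE pc (
    product_id  INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    PRIMARY KEY (product_id, category_id)
);
INSERT INTO pc VALUES
    (10, 1), (10, 2), (10, 4),
    (11, 1), (11, 3), (11, 4),
    (12, 1), (12, 2);
""")


def products_in_all(category_ids):
    """Divide PC by the chosen categories: products found in every one of them."""
    chosen = sorted(set(category_ids))
    if not chosen:
        # Dividing by an empty set places no restriction on the products.
        rows = conn.execute(
            "SELECT DISTINCT product_id FROM pc ORDER BY product_id")
        return [r[0] for r in rows]
    placeholders = ", ".join("?" for _ in chosen)
    sql = (
        "SELECT product_id FROM pc "
        f"WHERE category_id IN ({placeholders}) "
        "GROUP BY product_id "
        "HAVING COUNT(*) = ? "  # the primary key guarantees no duplicate pairs
        "ORDER BY product_id"
    )
    return [r[0] for r in conn.execute(sql, (*chosen, len(chosen)))]
```

Each facet the user clicks simply grows the list passed to products_in_all, so the result set can only shrink or stay the same as navigation proceeds.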
I will not go into details here, but the interaction between category hierarchies and faceted browsing needs special care.
A digression on "flatness"
I briefly looked at the article linked by Pras, Managing Hierarchical Data in MySQL, but I stopped reading after these few lines in the introduction:
Introduction
Most users at one time or another have
dealt with hierarchical data in a SQL
database and no doubt learned that the
management of hierarchical data is not
what a relational database is intended
for. The tables of a relational
database are not hierarchical (like
XML), but are simply a flat list.
Hierarchical data has a parent-child
relationship that is not naturally
represented in a relational database
table. ...
To understand why this insistence on the flatness of relations is just nonsense, imagine a cube in a three-dimensional Cartesian coordinate system: it will be identified by 8 coordinates (triplets), say P1(x1,y1,z1), P2(x2,y2,z2), ..., P8(x8, y8, z8) [here we are not concerned with the constraints on these coordinates that make them really represent a cube].
Now, we will put this set of coordinates (points) into a relation variable and we will name this variable Points. We will represent the relation value of Points as the table below:
Points | x  | y  | z  |
=======+====+====+====+
       | x1 | y1 | z1 |
       +----+----+----+
       | x2 | y2 | z2 |
       +----+----+----+
       | .. | .. | .. |
       | .. | .. | .. |
       +----+----+----+
       | x8 | y8 | z8 |
       +----+----+----+
Is this cube being "flattened" by the mere act of representing it in a tabular way? Is a relation (value) the same thing as its tabular representation?
A relation variable assumes as values sets of points in an n-dimensional discrete space, where n is the number of relation attributes ("columns"). What does it mean for an n-dimensional discrete space to be "flat"? Just nonsense, as I wrote above.
Don't get me wrong: it is certainly true that SQL is a badly designed language and that SQL-based DBMSes are full of idiosyncrasies and shortcomings (NULLs, redundancy, ...), especially the bad ones, the DBMS-as-dumb-store type (no referential constraints, no integrity constraints, ...). But that has nothing to do with the fantasized limitations of the relational data model; on the contrary, the more they turn away from it, the worse the outcome.
In particular, the relational data model, once you understand it, poses no problem in representing any structure whatsoever, even hierarchies and graphs, as I detailed with references to the published papers mentioned above. Even SQL can do it, if you gloss over its deficiencies, for lack of something better.
On the "The Nested Set Model"
I skimmed the rest of that article and I'm not particularly impressed by such a logical design: it suggests muddling two different entities, nodes and links, into one relation, and this will probably cause awkwardness. But I'm not inclined to analyze that design more thoroughly, sorry.
EDIT: Stephan Eggermont objected, in comments below, that " The flat list model is a problem. It is an abstraction of the implementation that makes performance difficult to achieve. ... ".
Now, my point is, precisely, that:
this "flat list model" is a fantasy: just because one lays out (represents) relations as tables ("flat lists") does not mean that relations are "flat lists" (an "object" and its representations are not the same thing);
a logical representation (relation) and physical storage details (horizontal or vertical decompositions, compression, indexes (hashes, b+tree, r-tree, ...), clustering, partitioning, etc.) are distinct; one of the points of relational data model (RDM) is to decouple logical from "physical" model (with advantages to both users and implementors of DBMSes);
performance is a direct consequence of physical storage details (implementation) and not of logical representation (Eggermont's comment is a classic example of logical-physical confusion).
The RDM does not constrain implementations in any way; one is free to implement tuples and relations as one sees fit. Relations are not necessarily files and tuples are not necessarily records of a file. Such a correspondence is a dumb direct-image implementation.
Unfortunately SQL-based DBMS implementations are, too often, dumb direct-image implementations and they suffer poor performance in a variety of scenarios - OLAP/ETL products exist to cover these shortcomings.
This is slowly changing. There are commercial and free software/open source implementations that finally avoid this fundamental pitfall:
Vertica, which is a commercial successor of..
C-Store: A Column-Oriented DBMS;
MonetDB;
LucidDB;
Kdb in a way;
and so on...
Of course, the point is not that there must exist an "optimal" physical storage design, but that whatever the physical storage design, it can be abstracted away by a nice declarative language based on relational algebra/calculi (and SQL is a bad example) or, more directly, on a logic programming language (like Prolog, for example; see my answer to the "prolog to SQL converter" question). A good DBMS should be able to change its physical storage design on the fly, based on data access statistics (and/or user hints).
Finally, the statement in Eggermont's comment, " The relational model is getting squeeezed between the cloud and prevayler. ", is more nonsense, but I cannot give a rebuttal here; this comment is already too long.
Before you create a hierarchical category model in your database, take a look at this article which explains the problems and the solution (using nested sets).
To summarize, using a simple parent_category_id doesn't scale very well and you'll have a hard time writing performant SQL queries. The answer is to use nested sets which make you visualize your many-to-many category model as sets which are nested inside other sets.
If you want categories to have multiple parent categories, then it's just a "many to many" relationship instead of a "one to many" relationship. You'll need to put a bridging table between category and itself.
However, I doubt this is what you want. If I'm looking in the category Aircraft > Wood then I wouldn't want to see items from Boating > Wood. There are two Wood categories because they contain different items.
My suggestions
put a many-to-many relation between Item and Category so that a product can be displayed in many hierarchy nodes (as on eBay, SourceForge, ...)
keep the category hierarchy
Performance on the category hierarchy
If your category hierarchy is deep, then you could generate an "Ancestors" table. This table will be generated by a batch job and will contain:
ChildId (the id of a category)
AncestorId (the id of its parent, grand parent ... all ancestors category)
It means that if you have 3 categories: 1-Propeller > 2-aircraft > 3-wood
Then the Ancestor table will contain:
ChildId AncestorId
2 1
3 1
3 2
This means that to get all the children of category 1 (at any depth), you need just one query and you don't have to write nested queries. By the way, this works no matter how deep your category hierarchy is.
Thanks to this table, you will need only one join to query against a category (together with its children).
If you need help on how to create the Ancestor table, just let me know.
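One way the batch job could look is sketched below with Python's sqlite3 module. It assumes the hierarchy is stored as a plain parent_id column on a hypothetical category table and that it is a tree without cycles (a cycle would make the loop run forever); the table and function names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES category(id)   -- NULL for a root category
);
CREATE TABLE ancestors (
    child_id    INTEGER NOT NULL REFERENCES category(id),
    ancestor_id INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (child_id, ancestor_id)
);
INSERT INTO category VALUES
    (1, 'Propeller', NULL), (2, 'aircraft', 1), (3, 'wood', 2);
""")


def rebuild_ancestors():
    """Batch job: recompute every (child, ancestor) pair from parent_id."""
    parent_of = dict(conn.execute("SELECT id, parent_id FROM category"))
    rows = []
    for child in parent_of:
        ancestor = parent_of[child]
        while ancestor is not None:      # walk up until the root is passed
            rows.append((child, ancestor))
            ancestor = parent_of[ancestor]
    with conn:                           # replace the old contents atomically
        conn.execute("DELETE FROM ancestors")
        conn.executemany("INSERT INTO ancestors VALUES (?, ?)", rows)


def subtree_ids(category_id):
    """All categories below category_id, at any depth, in a single query."""
    rows = conn.execute(
        "SELECT child_id FROM ancestors WHERE ancestor_id = ? ORDER BY child_id",
        (category_id,),
    )
    return [r[0] for r in rows]


rebuild_ancestors()
```

With the table populated, listing the items of a category together with its sub-categories needs only a join between the item-category table and ancestors, whatever the depth.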
Before you create a hierarchical
category model in your database, take
a look at this article which explains
the problems and the solution (using
nested sets).
To summarize, using a simple
parent_category_id doesn't scale very
well and you'll have a hard time
writing performant SQL queries. The
answer is to use nested sets which
make you visualize your many-to-many
category model as sets which are
nested inside other sets.
It is worth pointing out that the "multiple categories" idea is basically how "tagging" works, except that in "tagging" we allow any product to have many categories. By allowing any product to be in many categories, you give the customer the full ability to filter their search by starting where they believe they need to start. They could click on "airplanes", then "wood", then "turbojet engine" (or whatever). Or they could start their search with "wood" and get the same result.
This will give you the greatest flexibility, and the customer will enjoy a better UX, yet still allow you to maintain the hierarchy structure. So, while the quoted answer suggests letting categories be M:N to categories, my suggestion is to allow products to have M:N categories instead.
All in all, the result is mostly the same: the categories will have a natural hierarchy, but this will lend itself to even greater flexibility.
I should also note that this doesn't prevent a strict hierarchy either. You could quite easily enforce the hierarchy in the code where necessary (e.g. only showing the categories "cars", "airplanes", and "boats" on your initial page). It just moves the "strictness" to your business logic, which might make it better in the long run.
EDIT: I just realized that you vaguely mentioned this in your answer. I actually didn't notice it, but I think this is along the lines of what you would want to do instead. Otherwise you are mixing two hierarchy systems into your program without much benefit.
I've done this before. I recommend starting with tagging (many-to-many relationship table to products). You can build a hierarchy relationship on top of your tags (tree, or nested sets, or whatever) a lot easier than on your products. Because tagging is relatively freeform, this also gives you the ability to allow people to categorize naturally and then later codify certain expected behaviors.
For instance, we had special tags like 2009-Nov-Special. Any product with such a tag was eligible to be shown as a special on the front page for that month. So we didn't have to build a special system to handle rotating specials onto the front page; we just used the existing tag system. Later this could be enhanced to hide those tags from consumers, etc.
Similarly, you can use tagging prefixes like: style:wood mfg:Nike to allow you to do relatively complex categorization and drilldowns without the difficulties of complex database reshuffling or the nightmares of EAV, all in a tagging system which gives you more flexibility to accommodate user expectations. Remember that users might expect to navigate the products in ways different than you as a database and business owner might expect. Using the tagging system can help you enable the shopping interface without compromising your inventory or sales tracking or anything else.
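To make the prefix idea concrete, here is a small Python sketch. The helper names (group_tags, matches) and the product identifiers are hypothetical; only the tag strings themselves come from the examples above.

```python
from collections import defaultdict


def group_tags(tags, default_facet="tag"):
    """Split 'facet:value' tags into {facet: {values}}.

    Tags without a prefix are collected under default_facet.
    """
    facets = defaultdict(set)
    for tag in tags:
        facet, sep, value = tag.partition(":")
        if sep and facet and value:
            facets[facet].add(value)
        else:
            facets[default_facet].add(tag)
    return dict(facets)


def matches(product_tags, wanted):
    """True when the product carries every wanted tag (a drilldown filter)."""
    return set(wanted) <= set(product_tags)


# Hypothetical catalog: product id -> list of free-form tags.
catalog = {
    "prop-50": ["style:wood", "mfg:Nike", "2009-Nov-Special"],
    "prop-60": ["style:metal", "mfg:Nike"],
}
```

A drilldown then amounts to filtering the catalog with matches as the user adds tags, while group_tags supplies the facet headings to show in the navigation.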
Now, this all seemed fine and dandy, until I realized that the category "wood" would also be used under propeller -> airboat -> (wood). This would mean, that "wood" would have to be recreated every time I want to use it under a different parent. This isn't the end of the world, but I wanted to know if there is a more optimal way to go about this.
What if you have an aircraft that is of wood construction, but the propeller could be carbon fiber, fiberglass, metal, or graphite?
I'd define a table of materials and use a foreign key reference in the items table. If you want to support more than one material (e.g. say there's metal reinforcement, or screws...), then you'd need a corollary/lookup/xref table.
MATERIALS_TYPE_CODE table
MATERIALS_TYPE_CODE, pk
MATERIALS_TYPE_CODE_DESC
PRODUCTS table
PRODUCT_ID, pk
MATERIALS_TYPE_CODE, fk (only if a single material is ever associated)
PRODUCT_MATERIALS_XREF table
PRODUCT_ID, pk
MATERIALS_TYPE_CODE, pk
I would also relate products to one another using a corollary/lookup/xref table. A product could be related to more than one kitted product:
KITTED_PRODUCTS table
PARENT_PRODUCT_ID, fk
CHILD_PRODUCT_ID, fk
...and it supports a hierarchical relationship because the child could be the parent of something else.
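The tables outlined above could be declared roughly as follows. This is a sketch in Python's sqlite3 module using the multi-material variant; the lowercase column types, the name column, and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE materials_type_code (
    materials_type_code      TEXT PRIMARY KEY,
    materials_type_code_desc TEXT NOT NULL
);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL            -- assumed column, for readable output
);
-- One row per (product, material): supports several materials per product.
CREATE TABLE product_materials_xref (
    product_id          INTEGER NOT NULL REFERENCES products(product_id),
    materials_type_code TEXT NOT NULL
        REFERENCES materials_type_code(materials_type_code),
    PRIMARY KEY (product_id, materials_type_code)
);
-- One row per (parent, child): a child may itself be a parent elsewhere.
CREATE TABLE kitted_products (
    parent_product_id INTEGER NOT NULL REFERENCES products(product_id),
    child_product_id  INTEGER NOT NULL REFERENCES products(product_id),
    PRIMARY KEY (parent_product_id, child_product_id)
);

INSERT INTO materials_type_code VALUES
    ('WOOD', 'wood'), ('METAL', 'metal'), ('CF', 'carbon fiber');
INSERT INTO products VALUES (1, 'aircraft'), (2, 'propeller');
INSERT INTO product_materials_xref VALUES (1, 'WOOD'), (1, 'METAL'), (2, 'CF');
INSERT INTO kitted_products VALUES (1, 2);
""")


def materials_of(product_id):
    """Descriptions of every material recorded for one product."""
    rows = conn.execute(
        "SELECT m.materials_type_code_desc "
        "FROM product_materials_xref AS x "
        "JOIN materials_type_code AS m "
        "  ON m.materials_type_code = x.materials_type_code "
        "WHERE x.product_id = ? ORDER BY m.materials_type_code_desc",
        (product_id,),
    )
    return [r[0] for r in rows]


def children_of(product_id):
    """Names of the products kitted directly under one product."""
    rows = conn.execute(
        "SELECT p.name FROM kitted_products AS k "
        "JOIN products AS p ON p.product_id = k.child_product_id "
        "WHERE k.parent_product_id = ? ORDER BY p.product_id",
        (product_id,),
    )
    return [r[0] for r in rows]
```

Here the wooden aircraft and its carbon fiber propeller each keep their own materials, so "wood" never has to be recreated under a different parent.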
You can easily test your DB designs at http://cakeapp.com
Related
I am building a classified ads website, similar to craigslist.
The site is divided into several sections, e.g. Forums, For Sale, Services Offered, etc.
Under each section there are several [categories], e.g. Forums[pets], Forums[Books], For Sale[barter], Services[Barter], etc. (Notice that some categories are uniquely identified only by their section, as is the case with barter, which appears in both the "For Sale" and "Services" sections.)
Users will post to the categories from post links within each section. Users can upload photos and select certain amenities for their products, if applicable. A forum post will not need an amenity attribute whereas a Vehicle Ad might. Amenities include: auto transmission for vehicle ads, or furnished for housing rentals.
I am trying to figure the best logical setup for the database schema.
Currently I have this type of logical structure for basic input/query:
SECTIONS TABLE- section_id, section
CATEGORIES TABLE- cat-id, category, section_id(foreign key)
AMENITIES TABLE- amen_id, amenity
PHOTOS TABLE- photos-id, file
POST TABLE- post_id, category, timestamp, description
SECTION_POST TABLE- section_id, post_id
POST_AMENITY TABLE- post_id, amenity_id
POST_PHOTO TABLE- post_id, photo_id
I made the [SECTION_POST] my main many-to-many because the category in the [POST] must be related to the section. I related CATEGORY to SECTION in the categories table, which looks to me like a M-to-M with the addition of category attribute. Is this ok?
Also, do you have any other suggestions as to how I should be thinking about this schema? I think the problem I am having is mostly related to ignorance, not lack of organizational skills. Maybe one of you can educate me or refer me to a decent link that tackles my general problem.
Your design is fairly standard. Here are a couple of comments, and some things to consider:
For your keys where you have an implied dependent relationship (SECTION_POST, for example), many ORM libraries have issues with dependent relationships and the resulting concatenated keys. There's also the issue of key allocation. For both of those reasons, many people will instead give that table its own independent key (which can conveniently be made AUTO_INCREMENT) and turn the columns of the original PK into plain foreign keys.
In terms of SECTION/CATEGORY, only you can say how important a concept/entity SECTION is; however, the obvious question would be: how is a SECTION in any way different from a CATEGORY? You could have the same structure, with even more flexibility, by having only CATEGORY with a self-referencing "PARENT_CATEGORY_ID" column. This would allow you to define a tree structure of categories, while at the same time making it simple to get the top-level categories using IS NULL on PARENT_CATEGORY_ID.
I'm not clear on how you plan to relate photos to a POST, but it would be nice from a design standpoint to support an M-M relationship so that you can have multiple photos for a single post.
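A minimal sketch of the self-referencing CATEGORY idea, run through Python's sqlite3 module, might look like this. The column names follow the question's cat_id/category naming plus the suggested parent_category_id; the sample rows and helper functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    cat_id             INTEGER PRIMARY KEY,
    category           TEXT NOT NULL,
    parent_category_id INTEGER REFERENCES category(cat_id)  -- NULL = top level
);
-- Former sections become top-level categories; "barter" exists once per parent.
INSERT INTO category VALUES
    (1, 'Forums',   NULL),
    (2, 'For Sale', NULL),
    (3, 'Services', NULL),
    (4, 'pets',     1),
    (5, 'books',    1),
    (6, 'barter',   2),
    (7, 'barter',   3);
""")


def top_level():
    """The old 'sections': categories with no parent."""
    rows = conn.execute(
        "SELECT category FROM category "
        "WHERE parent_category_id IS NULL ORDER BY cat_id")
    return [r[0] for r in rows]


def children(parent_id):
    """Direct sub-categories of one category."""
    rows = conn.execute(
        "SELECT category FROM category "
        "WHERE parent_category_id = ? ORDER BY cat_id",
        (parent_id,))
    return [r[0] for r in rows]
```

Since a post would reference a cat_id, the two "barter" rows stay distinct without needing a separate SECTION_POST table.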
Otherwise, you seem to have a good handle on relational design basics. I do a lot of database design, and prefer to use a commercial ERD design tool, but there are some free options like MySQL Workbench (assuming you're designing for MySQL) that can help you visualize your design and ensure that all the SQL DDL is correct. It's also nice to have documentation for the development phase of your project.
I need to build a family tree in php and MySQL. I'm pretty surprised at the lack of open source customizable html family tree-building software there is out there, but I digress. I have spent a lot of time reading about storing MySQL digraphs and family trees. Everything makes sense to me: have a table with nodes (people) and a table with edges (relationships).
The only problem I have is I'm not sure of the best way to store relationships that are not necessarily adjacent, for example sibling and grandparent relationships. At first I didn't think this would be a big deal because I can just invisibly enforce a parent (everyone has parents) that would resolve these connections.
However, I also need to be able to store relationships that may not have a common parent, such as romantic partners. Everything I have read suggests a parent-child relationship, but since romantic partners do not share a common parent (hopefully), I'm not sure how to store it in the edges table. Should I use a different table, or what? If it's in the same table, how do I represent this? As long as I am doing this with non-familial relationships, I might as well do it with family too.
To sum up, three questions:
How do I represent lateral relationships?
If a lateral relationship has a common parent, how do I store it? Should this be a family flag on the table where other lateral relationships are stored?
How do I store parent-child relationships where the child is two or more edges away (a grandparent), but the immediate parent is unavailable?
Any help is appreciated, and if anyone has any suggestion for some javascript/html family tree building software, that would be wonderful.
An idea that comes from the Geneapro schema and RootsMagic.
person
------
person_id
name (etc)
life_event_types
----------------
life_event_type_id
life_event_type_description (divorce, marriage, birth, death)
life_events
-----------
life_event_id
life_event_type_id
life_event_description
life_event_date
life_event_roles
----------------
life_event_role_id
life_event_role (mother, father, child)
person_event_role
-----------------
person_id - who
life_event_id - what happened
life_event_role_id - what this person did
So you could have a life event of type "birth", and the role_id tells you who were the parents, and who was the child. This can be extended to marriages, deaths, divorces, foster parents, surrogate parents (where you might have 3 or 4 parents with a very complicated relationship), etc.
As for storing more distant relationships, you can calculate these. For example, you can calculate the Father of anybody by getting the person who has the 'father' role with a matching event_id. You can then get the father of that person, and you have the grandfather of the original person. Anywhere that somebody is unknown, create the person with unknown data.
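The father and grandfather lookup described above could be sketched as follows with Python's sqlite3 module. The tables follow the outline given earlier; the sample people, dates, and function names are hypothetical, and only the paternal line is followed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE life_event_types (
    life_event_type_id          INTEGER PRIMARY KEY,
    life_event_type_description TEXT NOT NULL
);
CREATE TABLE life_events (
    life_event_id          INTEGER PRIMARY KEY,
    life_event_type_id     INTEGER NOT NULL
        REFERENCES life_event_types(life_event_type_id),
    life_event_description TEXT,
    life_event_date        TEXT
);
CREATE TABLE life_event_roles (
    life_event_role_id INTEGER PRIMARY KEY,
    life_event_role    TEXT NOT NULL
);
CREATE TABLE person_event_role (
    person_id          INTEGER NOT NULL REFERENCES person(person_id),
    life_event_id      INTEGER NOT NULL REFERENCES life_events(life_event_id),
    life_event_role_id INTEGER NOT NULL
        REFERENCES life_event_roles(life_event_role_id),
    PRIMARY KEY (person_id, life_event_id, life_event_role_id)
);

INSERT INTO person VALUES (1, 'Grandpa'), (2, 'Dad'), (3, 'Kid');
INSERT INTO life_event_types VALUES (1, 'birth');
INSERT INTO life_event_roles VALUES (1, 'mother'), (2, 'father'), (3, 'child');
INSERT INTO life_events VALUES
    (100, 1, 'birth of Dad', '1950-01-01'),
    (101, 1, 'birth of Kid', '1980-01-01');
INSERT INTO person_event_role VALUES
    (2, 100, 3), (1, 100, 2),   -- Dad is the child, Grandpa the father
    (3, 101, 3), (2, 101, 2);   -- Kid is the child, Dad the father
""")

# Find the 'father' participant of the birth event in which the given
# person has the 'child' role.
FATHER_SQL = """
SELECT parent.person_id
FROM person_event_role AS child
JOIN life_events       AS e  ON e.life_event_id = child.life_event_id
JOIN life_event_types  AS t  ON t.life_event_type_id = e.life_event_type_id
JOIN life_event_roles  AS cr ON cr.life_event_role_id = child.life_event_role_id
JOIN person_event_role AS parent ON parent.life_event_id = e.life_event_id
JOIN life_event_roles  AS pr ON pr.life_event_role_id = parent.life_event_role_id
WHERE child.person_id = ?
  AND t.life_event_type_description = 'birth'
  AND cr.life_event_role = 'child'
  AND pr.life_event_role = 'father'
"""


def father_of(person_id):
    row = conn.execute(FATHER_SQL, (person_id,)).fetchone()
    return row[0] if row else None


def grandfather_of(person_id):
    """Paternal grandfather: the father of the father, if both are known."""
    father = father_of(person_id)
    return father_of(father) if father is not None else None
```

Lateral relationships such as siblings can be derived the same way, by looking for other people with the 'child' role in birth events that share a parent.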
person
-------
person_id
other_stuff
relation
----------
person_id_1
person_id_2
relationship_type
begin_dt
end_dt
Just populate the relationship type with any value you are interested in (an FK to some picklist would be great).
I put the dates in as an interesting subdiscussion/thought provocation.
The GEDCOM data model and the Gramps data model are two of the most popular formats for exchanging genealogical data between different tools. Using either of these data models should both (1) make your tool more compatible with other tools and (2) ensure that your data model is compatible with many special cases, considering both data models are specially designed to deal with genealogical data.
Tools like Oxy-Gen or the Gramps PHP exporter should get you on your way with respect to how to import GEDCOM data into a database.
For more details, see also my answer to “Family Tree” Data Structure.
So, not having come from a database design background, I've been tasked with designing a web app where the end user will be entering products, and specs for their products. Normally I think I would just create rows for each of the types of spec that they would be entering. Instead, they have a variety of products that don't share the same spec types, so my question is, what's the most efficient and future-proof way to organize this data? I was leaning towards pushing a serialized object into a generic "data" row, but then are you able to do full-text searches on this data? Any other avenues to explore?
split products and specifications into two tables like this:
products
id name
specifications
id name value product_id
get all the specifications of a product when you know the product id:
SELECT name,
value
FROM specifications
WHERE product_id = ?;
add a specification to a product when you know the product id, the specification's name and the value of said specification:
INSERT INTO specifications(
name,
value,
product_id
) VALUES(
?,
?,
?
);
so before you can add specifications to a product, this product must exist. also, you can't reuse specifications for several products. that would require a somewhat more complex solution :) namely...
three tables this time:
products
id name
specifications
id name value
products_specifications
product_id specification_id
get all the specifications of a product when you know the product id:
SELECT specifications.name,
specifications.value
FROM specifications
JOIN products_specifications
ON products_specifications.specification_id = specifications.id
WHERE products_specifications.product_id = ?;
now, adding a specification becomes a little bit more tricky, cause you have to check if that specification already exists. so this will be a little heavier than the first way of doing this, since there are more queries on the db, and there's more logic in the application.
first, find the id of the specification:
SELECT id
FROM specifications
WHERE name = ?
AND value = ?;
if no id is returned, this means that said specification doesn't exist, so it must be created:
INSERT INTO specifications(
name,
value
) VALUES(
?,
?
);
next, either use the id from the select query, or get the last insert id to find the id of the newly created specification. use that id together with the id of the product that's getting the new specification, and link the two together:
INSERT INTO products_specifications(
product_id,
specification_id
) VALUES(
?,
?
);
however, this means that you have to create one row for every specific specification. e.g. if you have size for shoes, there would be one row for every known shoe size
specifications
id name value
1 size 7
2 size 7½
3 size 8
and so on. i think this should be enough though.
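the three-table steps above (look up the specification, create it if missing, then link it) can be put together in one function. here is a rough sketch with python's sqlite3 module; the function name add_specification, the UNIQUE constraint and the sample products are assumptions, and INSERT OR IGNORE is sqlite syntax (mysql spells it INSERT IGNORE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE specifications (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    value TEXT NOT NULL,
    UNIQUE (name, value)          -- one row per distinct specification
);
CREATE TABLE products_specifications (
    product_id       INTEGER NOT NULL REFERENCES products(id),
    specification_id INTEGER NOT NULL REFERENCES specifications(id),
    PRIMARY KEY (product_id, specification_id)
);
INSERT INTO products (id, name) VALUES (1, 'running shoe'), (2, 'hiking boot');
""")


def add_specification(product_id, name, value):
    """Find or create the specification, then link it to the product."""
    with conn:  # run all three steps in one transaction
        row = conn.execute(
            "SELECT id FROM specifications WHERE name = ? AND value = ?",
            (name, value),
        ).fetchone()
        if row is None:
            spec_id = conn.execute(
                "INSERT INTO specifications (name, value) VALUES (?, ?)",
                (name, value),
            ).lastrowid
        else:
            spec_id = row[0]
        # Linking twice is harmless: the duplicate pair is simply skipped.
        conn.execute(
            "INSERT OR IGNORE INTO products_specifications "
            "(product_id, specification_id) VALUES (?, ?)",
            (product_id, spec_id),
        )
    return spec_id
```

calling add_specification(1, "size", "8") and then add_specification(2, "size", "8") reuses the same specification row for both products.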
You could take a look at using an EAV model.
I've never built a products database, but I can point you to a data model for that. It's one of over 200 models available for the taking, at Database Answers. Here is the model
If you don't like this one, you can find 15 different data models for Product oriented databases. Click on "Data Models" to get a list and scroll down to "Products".
You should pick up some good design ideas there.
This is a pretty common problem - and there are different solutions for different scenarios.
If the different types of product and their attributes are fixed and known at development time, you could look at the description in Craig Larman's book (http://www.amazon.com/Applying-UML-Patterns-Introduction-Object-Oriented/dp/0131489062/ref=sr_1_1/002-2801511-2159202?ie=UTF8&s=books&qid=1194351090&sr=1-1) - there's a section on object-relational mapping and how to handle inheritance.
This boils down to "put all the possible columns into one table", "create one table for each sub class" or "put all base class items into a common table, and put sub class data into their own tables".
This is by far the most natural way of working with a relational database - it allows you to create reports, use off-the-shelf tools for object relational mapping if that takes your fancy, and you can use standard concepts such as "not null", indexing etc.
Of course, if you don't know the data attributes at development time, you have to create a flexible database schema.
I've seen 3 general approaches.
The first is the one described by davogotland. I built a solution on similar lines for an ecommerce store; it worked great, and allowed us to be very flexible about the product database. It performed very well, even with half a million products.
Major drawbacks were creating retrieval queries - e.g. "find all products with a price under x, in category y, whose manufacturer is z". It was also tricky bringing in new developers - they had a fairly steep learning curve.
It also forced us to push a lot of relational concepts into the application layer. For instance, it was hard to create foreign keys to other tables (e.g. "manufacturer") and enforce them using standard SQL functionality.
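To illustrate why such retrieval queries get awkward, here is a sketch of the "price under x, in category y, manufacturer z" search against a name/value specifications table, using Python's sqlite3 module. The schema mirrors the two-table layout from the answer referred to above; the sample data and the search function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE specifications (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    value      TEXT NOT NULL,        -- every value is stored as text
    product_id INTEGER NOT NULL REFERENCES products(id)
);
INSERT INTO products VALUES (1, 'prop A'), (2, 'prop B'), (3, 'prop C');
INSERT INTO specifications (name, value, product_id) VALUES
    ('price', '120', 1), ('category', 'propellers', 1), ('manufacturer', 'Acme', 1),
    ('price', '80',  2), ('category', 'propellers', 2), ('manufacturer', 'Acme', 2),
    ('price', '60',  3), ('category', 'propellers', 3), ('manufacturer', 'Other', 3);
""")

# One extra join per attribute in the filter, plus a cast because the
# price is stored as text and would otherwise compare as a string.
SEARCH_SQL = """
SELECT p.name
FROM products AS p
JOIN specifications AS price
  ON price.product_id = p.id AND price.name = 'price'
JOIN specifications AS cat
  ON cat.product_id = p.id AND cat.name = 'category'
JOIN specifications AS mfg
  ON mfg.product_id = p.id AND mfg.name = 'manufacturer'
WHERE CAST(price.value AS REAL) < ?
  AND cat.value = ?
  AND mfg.value = ?
ORDER BY p.id
"""


def search(max_price, category, manufacturer):
    rows = conn.execute(SEARCH_SQL, (max_price, category, manufacturer))
    return [r[0] for r in rows]
```

With ordinary columns the same search would be a single-table WHERE clause, and the manufacturer could be a real foreign key instead of a free-text value.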
The second approach I've seen is the one you mention: storing the variable data in some kind of serialized format. This is a pain when querying, and suffers from the same drawbacks with respect to the relational model. Overall, I'd only want to use serialization for data you don't have to be able to query or reason about.
The final solution I've seen is to accept that the addition of new product types will always require some level of development effort - you have to build the UI, if nothing else. I've seen applications which use a scaffolding style approach to automatically generate the underlying database structures when a new product type is created.
This is a fairly major undertaking - only really suitable for major projects, though the use of ORM tools often helps.
I am developing an application for a real-estate company. The problem I am facing is about implementing the database. I am confused about which approach to adopt, and I would appreciate it if you could help me reason through the database implementation.
Here is my situation.
a) I have to store the property details in the database.
b) Each property belongs to one of approximately 4-5 categories, e.g. residential, commercial, industrial, etc.
c) The categories have sub-categories. For example, the residential category will have sub-categories such as Apartment / Independent House / Villa / Farm House / Studio Apartment, etc., and in the same way commercial, industrial, or agricultural will have sub-categories too.
d) Each sub-category will have to store different values. For example, a residential property will have features like bedrooms / kitchens / hall / bathrooms, etc. The features depend on the sub-category.
For an example of how I would want to implement my application, you can have a look at this site.
http://www.magicbricks.com/bricks/postProperty.html
I could possibly think of a solution like this.
a) Create four to five tables, depending on the categories that exist (the problem is that the categories might increase in the future).
b) Create different tables for all the features, location, price, and description, and merge the common properties into one table. For example, all properties will have common attributes such as location, total area, etc.
What would you advise, given the current situation?
Thank you
In order to implement this properly you need to know (read) about database normalization.
Every entity needs its own table. You will have tables for:
objects (real estate objects)
categories
transactionTypes
... etc.
If you have hierarchical categories, strictly organised in a tree structure, you may want to implement this as a tree structure, all stored in one table. If there are possibilities of overlaps, then it means you need to have different tables for each, like:
propertyTypes
propertyRatings
propertyAvailability
... etc.
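A minimal sketch of the "tree stored in one table" idea, written in Python with the standard sqlite3 module purely for illustration. The table name categories, the parent_id column, and the sample rows are assumptions, not part of the question. Each row points at its parent, and a recursive query walks a whole subtree (MySQL supports WITH RECURSIVE from version 8.0; older versions need a loop in application code).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    category_id INTEGER PRIMARY KEY,
    parent_id   INTEGER REFERENCES categories(category_id),  -- NULL for a top-level category
    name        TEXT NOT NULL
);
INSERT INTO categories (category_id, parent_id, name) VALUES
    (1, NULL, 'Residential'),
    (2, NULL, 'Commercial'),
    (3, 1,    'Apartment'),
    (4, 1,    'Villa'),
    (5, 3,    'Studio Apartment');
""")


def subtree_names(conn, root_id):
    """Return the sorted names of root_id and all of its descendants."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(category_id, name) AS (
            SELECT category_id, name FROM categories WHERE category_id = ?
            UNION ALL
            SELECT c.category_id, c.name
            FROM categories c
            JOIN subtree s ON c.parent_id = s.category_id
        )
        SELECT name FROM subtree
        """,
        (root_id,),
    ).fetchall()
    return sorted(name for (name,) in rows)
```

Adding a new category or sub-category is then just an INSERT, so the schema does not change when the catalog grows.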
Generally you could have a table for each property "type" containing the "type" specific information but also have a corresponding "common" table that would contain common fields between all types such as "price", "address", etc...
This is how MLS data is structured.
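The common-table-plus-type-table layout could look roughly like the sketch below (Python with sqlite3 for illustration only). All table and column names here (properties, residential_details, commercial_details, floor_area_sqft, and so on) are hypothetical and do not come from any MLS specification. The type-specific tables reuse the common table's primary key, so a full record is one join away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Fields shared by every property, whatever its type.
CREATE TABLE properties (
    property_id   INTEGER PRIMARY KEY,
    property_type TEXT NOT NULL,
    price         INTEGER NOT NULL,
    address       TEXT NOT NULL
);
-- One table per type, keyed by the same property_id.
CREATE TABLE residential_details (
    property_id INTEGER PRIMARY KEY REFERENCES properties(property_id),
    bedrooms    INTEGER NOT NULL,
    bathrooms   INTEGER NOT NULL
);
CREATE TABLE commercial_details (
    property_id     INTEGER PRIMARY KEY REFERENCES properties(property_id),
    floor_area_sqft INTEGER NOT NULL
);
INSERT INTO properties VALUES (1, 'residential', 250000, '12 Hill Road');
INSERT INTO properties VALUES (2, 'commercial',  900000, '4 Market Street');
INSERT INTO residential_details VALUES (1, 3, 2);
INSERT INTO commercial_details  VALUES (2, 5000);
""")


def residential_listing(conn):
    """Join the common fields with the residential-only fields."""
    return conn.execute(
        """
        SELECT p.address, p.price, r.bedrooms, r.bathrooms
        FROM properties p
        JOIN residential_details r ON r.property_id = p.property_id
        ORDER BY p.property_id
        """
    ).fetchall()
```

Queries that only need price or address never touch the type tables, and a new property type means one new details table rather than a change to the shared one.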
Categorizing properties is yet another example of the gen-spec (generalization-specialization) design pattern.
For a prior discussion on gen-spec here is the link.
I'm currently designing an application using PHP and MySQL, built on the Kohana framework. I'm making use of the built in ORM and it has proved to be extremely useful. Everything works fine, but I'm very concerned with the number of queries being run on certain pages.
Setting
For example, there's a page on which you can view a category full of sections, which are in turn full of products. This is listed out in tabular format. Each product has (possibly) many attributes, flags, tier pricing breaks. This must all be represented in the table.
How many queries?
As far as queries are concerned: the category must query all the sections within it, and those sections must query all the products they contain. Not too bad, but each product must then query all its product attributes, tier pricing, and flags. So, adding more products to a category increases the queries many times over (since I'm currently using the ORM primarily). Having a few hundred products in a section will result in a couple hundred queries. Small queries, but that is still not good.
So far...
All the keys are indexed. I can pull all of the information with a single query (see edit below), however, as you could imagine, this will result in a lot of redundant data spread out across multiple rows per each product, per each extra (e.g.) attribute, flag, etc.
I'm not opposed to ditching the ORM for the displaying part of the application and going with query building or even raw SQL.
The solution for this could actually be quite simple and I'm just ignorant of it right now, which would be a relief honestly. Or maybe it's not. I'm not sure. If any of my explanation was not adequate to understand the problem, just ask and I'll try to give a better example. (Edit: Better example given, see below.)
Although, a side note...
One thing that may be relevant: while I always want the application designed as efficiently as possible, this isn't a site that's going to be hit dozens or hundreds of times a day. It's more of an administrative application, which probably won't be used by more than a few individuals at once. I can't foresee much reloading, as most of the editing of data on the page is done through AJAX. So, should I care as much if a couple hundred queries (fluctuating with how many products are in the currently viewed section) run each time this particular page is loaded? Just a side thought; even so, if it is possible to solve the main problem described above, I would prefer that.
Thank you very much!
EDIT
Based on a couple answers, it seems I didn't explain myself adequately. So, let me post an example so you see what's going on.
Before the example though, I should also make two clarifications: (1) there are also a couple of many-to-many relationships, and (2) you could liken what I'm looking for to a crosstab query.
Let's simplify and say we have 3 main tables:
products (product_id, product_name, product_date_added)
product_attributes (product_attribute_id, product_id, value)
notifications (notification_id, notification_label)
And 1 pivot table:
product_notifications (notification_id, product_id)
We're going to list all the products in a table. It's simple enough in the ORM to call all the products.
So for each row in 'products' we list the product_name and product_date_added. However, we also need to list all of the product's attributes. There are 0 or more of these per product. We also have to show what notifications a product has, of which there are 0 or more as well.
So at the moment, how it works is basically:
foreach ($products->find_all() as $product) // given that $products is an ORM object
{
    echo $product->product_id; // let's just pretend these are surrounded by HTML
    echo $product->product_name;
    // Related rows are read from the current $product, not from $products:
    // this runs one attributes query per product.
    foreach ($product->product_attributes->find_all() as $attribute)
    {
        echo $attribute->value;
    }
    // ...and one notifications query per product.
    foreach ($product->notifications->find_all() as $notification)
    {
        echo $notification->notification_label;
    }
}
This is oversimplified of course, but this is the principle I'm talking about. This works great already. However, as you can see, for each product it must query all of its attributes to get the appropriate collection of rows.
The find_all() function will return the query results of something along the lines of:
SELECT product_attributes.* FROM product_attributes WHERE product_id = '#', and similarly for the notifications. And it makes these queries for each product.
So the total number of queries grows as a multiple of the number of products: here, one query for the product list plus two more for every product.
So, although this works well, it does not scale well, as it may potentially result in hundreds of queries.
If I perform a query to grab all the data in one query, along the lines of:
SELECT p.*, pa.*, n.*
FROM products p
LEFT JOIN product_attributes pa ON pa.product_id = p.product_id
LEFT JOIN product_notifications pn ON pn.product_id = p.product_id
LEFT JOIN notifications n ON n.notification_id = pn.notification_id
(Again oversimplified). This gets the data per se, but per each attribute and notification a product has, an extra row with redundant information will be returned.
For example, if I have two products in the database, where one has 1 attribute and 1 notification and the other has 2 attributes and 2 notifications, it will return:
product_id, product_name, product_date_added, product_attribute_id, value, notification_id, notification_label
1, My Product, 10/10/10, 1, Color: Red, 1, Add This Product
2, Busy Product, 10/11/10, 2, Color: Blue, 1, Add This Product
2, Busy Product, 10/11/10, 2, Color: Blue, 2, Update This Product
2, Busy Product, 10/11/10, 3, Style: New, 1, Add This Product
2, Busy Product, 10/11/10, 3, Style: New, 2, Update This Product
Needless to say that's a lot of redundant information. The number of rows returned per product would be the number of attributes it has times the number of notifications it has.
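The row multiplication can be reproduced with a small in-memory database. This is only an illustrative sketch in Python with sqlite3 (not the Kohana/MySQL setup from the question); the sample rows mirror the listing above, and the query is the same chain of LEFT JOINs, grouped to count how many result rows each product produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, product_date_added TEXT);
CREATE TABLE product_attributes (product_attribute_id INTEGER PRIMARY KEY, product_id INTEGER, value TEXT);
CREATE TABLE notifications (notification_id INTEGER PRIMARY KEY, notification_label TEXT);
CREATE TABLE product_notifications (notification_id INTEGER, product_id INTEGER);

INSERT INTO products VALUES (1, 'My Product', '10/10/10'), (2, 'Busy Product', '10/11/10');
INSERT INTO product_attributes VALUES (1, 1, 'Color: Red'), (2, 2, 'Color: Blue'), (3, 2, 'Style: New');
INSERT INTO notifications VALUES (1, 'Add This Product'), (2, 'Update This Product');
INSERT INTO product_notifications VALUES (1, 1), (1, 2), (2, 2);
""")

# Same joins as the single big query; COUNT(*) shows how many joined rows
# come back for each product (attributes x notifications).
rows_per_product = dict(conn.execute("""
    SELECT p.product_id, COUNT(*)
    FROM products p
    LEFT JOIN product_attributes pa ON pa.product_id = p.product_id
    LEFT JOIN product_notifications pn ON pn.product_id = p.product_id
    LEFT JOIN notifications n ON n.notification_id = pn.notification_id
    GROUP BY p.product_id
""").fetchall())
```

Product 2 has 2 attributes and 2 notifications, so it comes back as 2 x 2 = 4 joined rows, while product 1 comes back as a single row.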
The ORM (or, just creating the new queries in the loop in general) consolidates all of the information in each row into its own object, allowing for the data to be handled more logically. That's the rock. Calling the information in one query eliminates the need for possibly hundreds of queries, but creates lots of redundant data in rows and therefore does not return the (one/many)-to-many relationship data in succinct sets. That's the hard place.
Sorry it's so long, trying to be thorough, haha, thanks!
An interesting alternative is to handle your reads and your writes with completely separate models. (Command Query Separation). Sophisticated object models (and ORMS) are great for modeling complex business behavior, but are lousy as interfaces for querying and displaying information to users. You mentioned that you weren't opposed to ditching the ORM for rendering displays -- well, that's exactly what many software architects nowadays suggest. Write a totally different interface (with its own optimized queries) for reading and reporting on data. The "read" model could query the same database that you use with your ORM backed "write" model, or it could be a separate one that is denormalized and optimized for the reports/screens you need to generate.
Check out these two presentations. It may sound like overkill (and it may be if your performance requirements are very low), but it's amazing how this technique makes so many problems just go away.
Udi Dahan: "Command-Query Responsibility Segregation"
Greg Young: "Unshackle Your Domain"
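As a rough illustration of a dedicated "read" query that bypasses the ORM, the sketch below (Python with sqlite3, using the simplified tables from the question) returns exactly one row per product by aggregating the child rows in correlated subqueries. The function name product_report and the '|' separator are assumptions made for this example, and it presumes no value contains '|'. MySQL's equivalent aggregate is GROUP_CONCAT with a SEPARATOR clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, product_date_added TEXT);
CREATE TABLE product_attributes (product_attribute_id INTEGER PRIMARY KEY, product_id INTEGER, value TEXT);
CREATE TABLE notifications (notification_id INTEGER PRIMARY KEY, notification_label TEXT);
CREATE TABLE product_notifications (notification_id INTEGER, product_id INTEGER);

INSERT INTO products VALUES (1, 'My Product', '10/10/10'), (2, 'Busy Product', '10/11/10');
INSERT INTO product_attributes VALUES (1, 1, 'Color: Red'), (2, 2, 'Color: Blue'), (3, 2, 'Style: New');
INSERT INTO notifications VALUES (1, 'Add This Product'), (2, 'Update This Product');
INSERT INTO product_notifications VALUES (1, 1), (1, 2), (2, 2);
""")


def product_report(conn):
    """Read-side query: one result row per product, child values pre-aggregated."""
    rows = conn.execute("""
        SELECT p.product_id,
               p.product_name,
               (SELECT group_concat(pa.value, '|')
                  FROM product_attributes pa
                 WHERE pa.product_id = p.product_id) AS attributes,
               (SELECT group_concat(n.notification_label, '|')
                  FROM product_notifications pn
                  JOIN notifications n ON n.notification_id = pn.notification_id
                 WHERE pn.product_id = p.product_id) AS notifications
        FROM products p
        ORDER BY p.product_id
    """).fetchall()

    def split(joined):
        # group_concat yields NULL (None) when a product has no child rows.
        return sorted(joined.split("|")) if joined else []

    return [
        {"id": pid, "name": name, "attributes": split(attrs), "notifications": split(notes)}
        for pid, name, attrs, notes in rows
    ]
```

Because each subquery aggregates independently, attributes and notifications are never multiplied against each other the way they are in a flat join.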
A good ORM should handle this for you. If you feel you must do it manually, you can do this.
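Fetch all the categories you need in a single query and store the primary key IDs in a PHP array.
Run a query similar to this, filtering on the products' category column (assumed here to be Category_id). Note that the mysql_* functions were removed in PHP 7, so on current PHP use mysqli or PDO instead:
$ids = implode(',', array_map('intval', $categoryIDs)); // cast to int so the IN list is safe to interpolate
mysql_query('SELECT yourListOfFieldsHere FROM Products WHERE Category_id IN ('.$ids.')');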
This should give you all the products that you need in a single query. Then use PHP to map these to the correct categories and display accordingly.
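Applied to the products example from the question, the same batching idea might look like the sketch below (Python with sqlite3 rather than PHP/MySQL, for illustration only). The function name load_products_with_children is hypothetical. It issues three queries in total, however many products there are, and groups the child rows in application code.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, product_date_added TEXT);
CREATE TABLE product_attributes (product_attribute_id INTEGER PRIMARY KEY, product_id INTEGER, value TEXT);
CREATE TABLE notifications (notification_id INTEGER PRIMARY KEY, notification_label TEXT);
CREATE TABLE product_notifications (notification_id INTEGER, product_id INTEGER);

INSERT INTO products VALUES (1, 'My Product', '10/10/10'), (2, 'Busy Product', '10/11/10');
INSERT INTO product_attributes VALUES (1, 1, 'Color: Red'), (2, 2, 'Color: Blue'), (3, 2, 'Style: New');
INSERT INTO notifications VALUES (1, 'Add This Product'), (2, 'Update This Product');
INSERT INTO product_notifications VALUES (1, 1), (1, 2), (2, 2);
""")


def load_products_with_children(conn):
    """Load products plus their attributes and notifications in three queries."""
    # Query 1: the parent rows.
    products = conn.execute(
        "SELECT product_id, product_name FROM products ORDER BY product_id"
    ).fetchall()
    ids = [pid for pid, _ in products]
    if not ids:
        return []
    # One bound parameter per ID, so no values are interpolated into the SQL.
    placeholders = ",".join("?" * len(ids))

    # Query 2: all attributes for those products at once.
    attributes = defaultdict(list)
    for pid, value in conn.execute(
        "SELECT product_id, value FROM product_attributes "
        f"WHERE product_id IN ({placeholders}) ORDER BY product_attribute_id",
        ids,
    ):
        attributes[pid].append(value)

    # Query 3: all notifications for those products at once.
    notifications = defaultdict(list)
    for pid, label in conn.execute(
        "SELECT pn.product_id, n.notification_label "
        "FROM product_notifications pn "
        "JOIN notifications n ON n.notification_id = pn.notification_id "
        f"WHERE pn.product_id IN ({placeholders}) ORDER BY n.notification_id",
        ids,
    ):
        notifications[pid].append(label)

    # Map the child rows back onto their parents in application code.
    return [
        {"id": pid, "name": name,
         "attributes": attributes[pid], "notifications": notifications[pid]}
        for pid, name in products
    ]
```

The result has the same nested shape the ORM loop produces, without the per-product queries and without the duplicated columns of the flat join.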