I am lost on how best to approach the site search component. I have a user content site similar to Yelp. People can search for local places, local events, local photos, members, etc. So if I enter "Tom" in the search box, I expect the search to return results from all user objects that match Tom. The word Tom can be anywhere: in a restaurant name, in the description of the restaurant, in a review, in someone's comment, etc.
So if I design this purely using normalized SQL, I will need to join about 15 object tables to scan all the different user objects, and scan multiple columns in each table to search all the fields. I don't know if this is how it is normally done, or is there a better way? I have seen stuff like Solr/Apache/Elasticsearch, but I am not sure how these fit my use case, and even if I use them I assume I still need to scan all 15 tables and 30-40 columns, correct? My platform is PHP/MySQL. Also, is there any coding / component architecture / DB design practice to follow for this? A friend said I should combine all objects into 1 table, but that won't work, as you can't combine photos, videos, comments, pages, profiles, etc. into 1 table, so I am lost on how to implement this.
Probably your friend meant combining all the searchable fields into one table.
The basic idea would be to create a table that acts as the index. One column is indexable and stores words, whereas the other column contains a list of references to objects that contain that word in one of those fields (for example, an object may be a picture, and its searchable fields might be title and comments).
The list of references can be stored in many ways. For example, you could use a variable-length string column, say a BLOB, and store in it a JSON-encoded array of the ids and types of the objects. You can then find each object by looking up its id in the table that corresponds to its type.
Of course, on any addition, removal, or modification of indexable data, you should update your index accordingly. You can use lazy update techniques that eventually update the index in the background, because most people only expect an index to be accurate to within maybe a few minutes of the current state of the data. One implementation of such an index is Apache Cassandra, but I wouldn't use it for small-scale projects, where you don't need distributed databases and such.
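A minimal sketch of the index table described above, under some loud assumptions: it uses Python with an in-memory SQLite database rather than PHP/MySQL, and the names search_index, index_object and search are made up for illustration. Each row maps one word to a JSON-encoded list of (type, id) references.

```python
import json
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# One indexed column for the word, one column holding JSON-encoded references.
conn.execute("CREATE TABLE search_index (word TEXT PRIMARY KEY, refs TEXT NOT NULL)")


def index_object(conn, obj_type, obj_id, texts):
    """Add every word found in an object's searchable fields to the index."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    for word in words:
        row = conn.execute(
            "SELECT refs FROM search_index WHERE word = ?", (word,)
        ).fetchone()
        refs = json.loads(row[0]) if row else []
        ref = [obj_type, obj_id]
        if ref not in refs:
            refs.append(ref)
        conn.execute(
            "INSERT OR REPLACE INTO search_index (word, refs) VALUES (?, ?)",
            (word, json.dumps(refs)),
        )


def search(conn, word):
    """Return (type, id) pairs for objects containing the word."""
    row = conn.execute(
        "SELECT refs FROM search_index WHERE word = ?", (word.lower(),)
    ).fetchone()
    return [tuple(ref) for ref in json.loads(row[0])] if row else []


# Hypothetical sample data: a place, a review and a photo.
index_object(conn, "place", 1, ["Tom's Diner", "Classic breakfast spot"])
index_object(conn, "review", 7, ["Great pancakes", "Tom served us quickly"])
index_object(conn, "photo", 3, ["Sunset over the bay"])

results = search(conn, "Tom")
```

Each (type, id) pair returned by search tells you which object table to query next, so a search touches one index row instead of 15 tables. This sketch only matches whole words and updates the index synchronously; the lazy background update mentioned above is left out.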
I do not have much experience in table design. My goal is to create one or more product tables that meet the requirements below:
Support many kinds of products (TV, Phone, PC, ...). Each kind of product has a different set of parameters, like:
Phone will have Color, Size, Weight, OS...
PC will have CPU, HDD, RAM...
The set of parameters must be dynamic. You can add or edit any parameter you like.
How can I meet these requirements without a separate table for each kind of product?
You have at least these five options for modeling the type hierarchy you describe:
Single Table Inheritance: one table for all Product types, with enough columns to store all attributes of all types. This means a lot of columns, most of which are NULL on any given row.
Class Table Inheritance: one table for Products, storing attributes common to all product types. Then one table per product type, storing attributes specific to that product type.
Concrete Table Inheritance: no table for common Products attributes. Instead, one table per product type, storing both common product attributes, and product-specific attributes.
Serialized LOB: One table for Products, storing attributes common to all product types. One extra column stores a BLOB of semi-structured data, in XML, YAML, JSON, or some other format. This BLOB allows you to store the attributes specific to each product type. You can use fancy Design Patterns to describe this, such as Facade and Memento. But regardless you have a blob of attributes that can't be easily queried within SQL; you have to fetch the whole blob back to the application and sort it out there.
Entity-Attribute-Value: One table for Products, and one table that pivots attributes to rows, instead of columns. EAV is not a valid design with respect to the relational paradigm, but many people use it anyway. This is the "Properties Pattern" mentioned by another answer. See other questions with the eav tag on StackOverflow for some of the pitfalls.
I have written more about this in a presentation, Extensible Data Modeling.
Additional thoughts about EAV: Although many people seem to favor EAV, I don't. It seems like the most flexible solution, and therefore the best. However, keep in mind the adage TANSTAAFL. Here are some of the disadvantages of EAV:
No way to make a column mandatory (equivalent of NOT NULL).
No way to use SQL data types to validate entries.
No way to ensure that attribute names are spelled consistently.
No way to put a foreign key on the values of any given attribute, e.g. for a lookup table.
Fetching results in a conventional tabular layout is complex and expensive, because to get attributes from multiple rows you need to do JOIN for each attribute.
The degree of flexibility EAV gives you requires sacrifices in other areas, probably making your code as complex as (or worse than) it would have been to solve the original problem in a more conventional way.
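To make the JOIN-per-attribute point concrete, here is a small sketch. It assumes Python with in-memory SQLite instead of MySQL, and the table and column names (products, product_attributes, attr_name, attr_value) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- EAV table: one row per (product, attribute) pair, every value stored as text.
CREATE TABLE product_attributes (
    product_id INTEGER NOT NULL REFERENCES products(id),
    attr_name  TEXT NOT NULL,
    attr_value TEXT,
    PRIMARY KEY (product_id, attr_name)
);
INSERT INTO products VALUES (1, 'Phone A'), (2, 'Phone B');
INSERT INTO product_attributes VALUES
    (1, 'color', 'black'), (1, 'os', 'Android'), (1, 'weight', '180'),
    (2, 'color', 'white'), (2, 'os', 'iOS');
""")

# Getting a conventional tabular row back needs one JOIN per attribute.
pivot_sql = """
SELECT p.name,
       c.attr_value AS color,
       o.attr_value AS os,
       w.attr_value AS weight
FROM products p
LEFT JOIN product_attributes c ON c.product_id = p.id AND c.attr_name = 'color'
LEFT JOIN product_attributes o ON o.product_id = p.id AND o.attr_name = 'os'
LEFT JOIN product_attributes w ON w.product_id = p.id AND w.attr_name = 'weight'
ORDER BY p.id
"""
rows = conn.execute(pivot_sql).fetchall()
```

Note that nothing stops 'Phone B' from lacking a weight, and the weight that does exist comes back as the string '180', which illustrates the missing NOT NULL and data type checks listed above.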
And in most cases, it's unnecessary to have that degree of flexibility. In the OP's question about product types, it's much simpler to create a table per product type for product-specific attributes, so you have some consistent structure enforced at least for entries of the same product type.
I'd use EAV only if every row must be permitted to potentially have a distinct set of attributes. When you have a finite set of product types, EAV is overkill. Class Table Inheritance would be my first choice.
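For comparison, a rough sketch of Class Table Inheritance for the phone/PC example. The assumptions: Python with in-memory SQLite instead of MySQL, and illustrative table and column names (products, phones, pcs, ram_gb and so on) that are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Attributes common to every product type.
CREATE TABLE products (
    id           INTEGER PRIMARY KEY,
    product_type TEXT NOT NULL,
    name         TEXT NOT NULL,
    price        REAL NOT NULL
);
-- One table per product type, holding only type-specific attributes.
CREATE TABLE phones (
    product_id INTEGER PRIMARY KEY REFERENCES products(id),
    color      TEXT NOT NULL,
    os         TEXT NOT NULL
);
CREATE TABLE pcs (
    product_id INTEGER PRIMARY KEY REFERENCES products(id),
    cpu        TEXT NOT NULL,
    ram_gb     INTEGER NOT NULL
);
INSERT INTO products VALUES (1, 'phone', 'Phone A', 299.0), (2, 'pc', 'PC B', 899.0);
INSERT INTO phones VALUES (1, 'black', 'Android');
INSERT INTO pcs VALUES (2, 'x86-64', 16);
""")

# A single JOIN returns a full phone row with real columns and types.
phone_rows = conn.execute("""
    SELECT p.name, p.price, ph.color, ph.os
    FROM products p
    JOIN phones ph ON ph.product_id = p.id
""").fetchall()

# Mandatory attributes are enforced by the database: a PC without RAM is rejected.
try:
    conn.execute("INSERT INTO products VALUES (3, 'pc', 'PC C', 500.0)")
    conn.execute("INSERT INTO pcs (product_id, cpu) VALUES (3, 'x86-64')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
conn.rollback()  # discard the half-inserted product
```

The trade-off is that adding a new product type means adding a table, which is acceptable when the set of types is finite and changes rarely.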
Update 2019: The more I see people using JSON as a solution for the "many custom attributes" problem, the less I like that solution. It makes queries too complex, even when using special JSON functions to support them. It takes a lot more storage space to store JSON documents, versus storing in normal rows and columns.
Basically, none of these solutions are easy or efficient in a relational database. The whole idea of having "variable attributes" is fundamentally at odds with relational theory.
What it comes down to is that you have to choose one of the solutions based on which is the least bad for your app. Therefore you need to know how you're going to query the data before you choose a database design. There's no way to choose one solution that is "best" because any of the solutions might be best for a given application.
@StoneHeart
I would go here with EAV and MVC all the way.
@Bill Karwin
Here are some of the disadvantages of EAV:
No way to make a column mandatory (equivalent of NOT NULL).
No way to use SQL data types to validate entries.
No way to ensure that attribute names are spelled consistently.
No way to put a foreign key on the values of any given attribute, e.g. for a lookup table.
All those things that you have mentioned here:
data validation
attribute names spelling validation
mandatory columns/fields
handling the destruction of dependent attributes
in my opinion, do not belong in a database at all, because no database can handle those interactions and requirements as well as the application's programming language does.
In my opinion, using a database in this way is like using a rock to hammer a nail. You can do it with a rock, but aren't you supposed to use a hammer, which is more precise and specifically designed for this sort of activity?
Fetching results in a conventional tabular layout is complex and expensive, because to get attributes from multiple rows you need to do JOIN for each attribute.
This problem can be solved by making a few queries on partial data and processing the results into a tabular layout in your application. Even if you have 600GB of product data, you can process it in batches if you require data from every single row in this table.
Going further, if you would like to improve the performance of the queries, you can pick certain operations, e.g. reporting or global text search, and prepare index tables for them that store the required data and are regenerated periodically, let's say every 30 minutes.
You don't even need to be concerned with the cost of extra data storage because it gets cheaper and cheaper every day.
If you are still concerned about the performance of operations done by the application, you can always use a language such as Erlang, C++, or Go to pre-process the data, and then process the optimised data further in your main app.
If I use Class Table Inheritance meaning:
one table for Products, storing attributes common to all product types. Then one table per product type, storing attributes specific to that product type.
-Bill Karwin
This is the one I like best of Bill Karwin's suggestions. I can foresee one drawback, though, and I will try to explain how to keep it from becoming a problem.
What contingency plan should I have in place when an attribute that is only common to 1 type, then becomes common to 2, then 3, etc?
For example: (this is just an example, not my real issue)
If we sell furniture, we might sell chairs, lamps, sofas, TVs, etc. The TV type might be the only type we carry that has a power consumption. So I would put the power_consumption attribute on the tv_type_table. But then we start to carry home theater systems, which also have a power_consumption property. OK, it's just one other product, so I'll add this field to the stereo_type_table as well, since that is probably easiest at this point. But over time, as we start to carry more and more electronics, we realize that power_consumption is broad enough that it should be in the main_product_table. What should I do now?
Add the field to the main_product_table. Write a script to loop through the electronics and put the correct value from each type_table to the main_product_table. Then drop that column from each type_table.
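A sketch of what such a migration script might look like, assuming Python with in-memory SQLite instead of MySQL. The table names come from the example above; the product_id and id columns and the sample rows are assumptions. SQLite only supports DROP COLUMN from version 3.35, so that step is guarded here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_product_table (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tv_type_table (product_id INTEGER PRIMARY KEY, power_consumption INTEGER);
CREATE TABLE stereo_type_table (product_id INTEGER PRIMARY KEY, power_consumption INTEGER);
INSERT INTO main_product_table VALUES (1, 'TV'), (2, 'Home theater'), (3, 'Chair');
INSERT INTO tv_type_table VALUES (1, 120);
INSERT INTO stereo_type_table VALUES (2, 300);
""")

# Step 1: add the field to the main table.
conn.execute("ALTER TABLE main_product_table ADD COLUMN power_consumption INTEGER")

# Step 2: copy the value from each type table, then drop the old column.
# The table names come from this fixed tuple, never from user input.
for type_table in ("tv_type_table", "stereo_type_table"):
    conn.execute(f"""
        UPDATE main_product_table
        SET power_consumption = (
            SELECT t.power_consumption
            FROM {type_table} t
            WHERE t.product_id = main_product_table.id
        )
        WHERE id IN (SELECT product_id FROM {type_table})
    """)
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        conn.execute(f"ALTER TABLE {type_table} DROP COLUMN power_consumption")
conn.commit()

migrated = conn.execute(
    "SELECT id, power_consumption FROM main_product_table ORDER BY id"
).fetchall()
```

Products that never had the attribute (the chair here) simply end up with NULL in the new column.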
Now, if I always use the same GetProductData class to interact with the database and pull the product info, then any code changes that need refactoring should be confined to that class.
You can have a Product table and a separate ProductAdditionalInfo table with 3 columns: product ID, additional info name, additional info value. If color is used by many but not all kinds of Products, you could have it be a nullable column in the Product table, or just put it in ProductAdditionalInfo.
This approach is not a traditional technique for a relational database, but I have seen it used a lot in practice. It can be flexible and have good performance.
Steve Yegge calls this the Properties pattern and wrote a long post about using it.
We are working on a platform which allows the user to create 'promotion' instances, wherein there are an arbitrary number of pages and 'modules' associated with those pages. Each module has its own customizable collection of attributes. One of the modules I am developing is the entry form, which is the main component.
The form includes some default fields e.g. name, date of birth and email address. The module then allows the user to add as many additional fields of any type that they require (e.g. a '25 words or less' text field, extra opt-in checkboxes, etc).
I'm trying to plan out how I will deal with these X additional fields in terms of storage in MySQL. Sorting and filtering on these fields will be a requirement.
This seems like something that would be a known and solved problem, but I haven't had any luck either wording it correctly or coming across relevant information. I've had some thoughts while searching; but each has a downfall that makes me think there must be a better way:
Create a new table for each form module which contains the additional fields as new columns - this seems really messy / clunky.
Store the additional information in an extra column as JSON (or some other data format). Pull all the data into PHP, expand the JSON and work with all the data in PHP - we envision a high number of entries (5-10k) so I assume this would be too inefficient.
Have an upper limit on additional fields and append a bunch of columns to the entries table, i.e. 'custom1', 'custom2', 'custom3', etc. This seems very messy as well.
Looking again at point 2, I thought there might be a way to take the block of data in the extra column and create a derived table from it, but I haven't had any luck finding out whether that's possible. For example:
SELECT * FROM( JSON_DECODE(entries.extra) ) ...
If this were possible, that would probably be my preference.
What is the correct way to approach this problem of needing a dynamic number of additional columns?
The correct solution depends a lot on how you're going to use the data. All solutions have strengths and weaknesses, and you have to understand that you're basically trying to do something that relational databases were not designed for.
I gave a presentation about this topic earlier this year called Extensible Data Modeling, in which I tried to provide a survey of different solutions, and their pros and cons.
You could also throw in the towel, and use a non-relational database to store non-relational data.
I have this situation where i need suggestions on database tables design.
BACKGROUND
I am developing an application in PHP (CakePHP to be precise) where we upload an XML file; the application parses the file and saves the data in the database. The XML can come from files or URL feeds, which are purchased from various data suppliers. The goal is to collect data about various venues from the source URLs; venues can be anything, like hotels, cinemas, schools, restaurants, etc.
Problem
The initial table structure for these venues is below. The table is designed to store generic information initially.
id
Address
Postcode
Lat
Long
SourceURL
Source
Type
Phone
Email
Website
With more data coming from different sources, I realized that different types of venues have many different attributes.
For example
a hotel can have some attributes like
price_for_one_day, types_of_accommodation, Number_of_rooms etc
whereas schools will not have them but have a different set of attributes. Restaurants will have some other attributes.
My first idea is to create two tables called venue_attribute_names and venue_attributes.
##table venue_attribute_names
_____________________________
id
name
##table venue_attributes
________________________
id
venue_id
venue_attribute_name_id
value
So if I detect any new attribute, I want to create it in the names table and store its value in the attributes table with a relation. But I doubt this is the correct approach. Could there be another approach for this? Besides, if the table grows huge there could be performance issues because of the increase in joins and SQL queries.
Is creating the widest possible table, with all possible attributes as columns, the right approach? Please let me know. If there are any links I could refer to, I can follow them. Thanks.
This is a surprisingly common problem.
The design you describe is commonly known as "Entity/Attribute/Value" or EAV. It has the benefit of allowing you to store all kinds of data without knowing in advance what the schema for that data is. It has the drawback of being hard to query - imagine finding all hotels in a given location, where the daily room rate is between $100 and $150, whose name starts with "Waldorf". Writing queries against all the attributes and applying boolean logic quickly becomes harder than you'd want it to be. You also can't easily apply database-level consistency checks like "hotel_name must not be null", or "daily_room_rate must be a number".
If neither of those concerns worries you, maybe your design works.
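To give a feel for how such a query looks against the two tables from the question, here is a sketch. It assumes Python with in-memory SQLite instead of MySQL, a simplified venues table, and made-up sample rows; storing the venue name as an attribute is also an assumption, and the location condition is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE venues (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
CREATE TABLE venue_attribute_names (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE venue_attributes (
    id INTEGER PRIMARY KEY,
    venue_id INTEGER NOT NULL REFERENCES venues(id),
    venue_attribute_name_id INTEGER NOT NULL REFERENCES venue_attribute_names(id),
    value TEXT
);
INSERT INTO venues VALUES (1, 'hotel'), (2, 'hotel'), (3, 'school');
INSERT INTO venue_attribute_names VALUES (1, 'name'), (2, 'price_for_one_day');
INSERT INTO venue_attributes (venue_id, venue_attribute_name_id, value) VALUES
    (1, 1, 'Waldorf Astoria'), (1, 2, '140'),
    (2, 1, 'Waldorf Budget'),  (2, 2, '90'),
    (3, 1, 'Waldorf School');
""")

# Every attribute in the WHERE clause costs two extra joins, and numeric
# comparisons need a cast because all values are stored as text.
sql = """
SELECT v.id
FROM venues v
JOIN venue_attributes n       ON n.venue_id = v.id
JOIN venue_attribute_names nn ON nn.id = n.venue_attribute_name_id
                             AND nn.name = 'name'
JOIN venue_attributes p       ON p.venue_id = v.id
JOIN venue_attribute_names pn ON pn.id = p.venue_attribute_name_id
                             AND pn.name = 'price_for_one_day'
WHERE v.type = 'hotel'
  AND n.value LIKE 'Waldorf%'
  AND CAST(p.value AS REAL) BETWEEN 100 AND 150
"""
matching = [row[0] for row in conn.execute(sql)]
```

Two filter conditions already need four joins, and nothing prevents a row like ('price_for_one_day', 'cheap') from being stored, which is the consistency problem described above.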
The second option is to store the "common" fields in a traditional relational structure, but to store the variant data in some kind of document. MySQL has XML functions such as ExtractValue(), for instance, so you can define an XML schema for each document type and query the stored documents using XPath.
This approach gives you better data integrity than EAV, because you can apply schema constraints. It does mean that you have to create a schema for each type of data you're dealing with. That might be okay for you - I'm guessing that the business doesn't add dozens of new venue types every week.
Performance with XML querying can be tricky, and general tooling and the development approach will make it harder to build than "just SQL".
The final option if you want to stick with a relational database is to simply bite the bullet and use "pure" SQL. You can create a "master" table with the common attributes, and a "restaurant" table with the restaurant-specific attributes, a "hotel" table with the hotel attributes. This works as long as you have a manageable number of venue types, and they don't crop up unpredictably.
Finally, you could look at NoSQL options.
If you are sticking with a relational database, that's it. The options you listed are pretty much what they can give you.
For your situation, MongoDB (or another document-oriented NoSQL system) could be a good option. These database systems are very good if you have a lot of records with different attributes.
I have recently started doing freelance PHP + MySQL development in my free time, to supplement my income from a full-time job where I write C#/SQL Server code. One of the big database-related differences I've noticed is that MySQL has an enum datatype, whereas SQL Server does not.
When I noticed the enum datatype, I immediately decided to flatten my data model in favor of having a big table that makes use of enumerations rather than many smaller tables for discrete entities and one big "bridge" sort of table.
The website I'm currently working on is for a record label. I only have one table to store the releases for the label, the "releases" table. I have used enumerations everywhere I would normally use a foreign key to a separate table--Artist name, Label name, and several others. The user has the ability to edit these enumeration columns through the backend. The major advantage I see for enumerations over using a text field for this is that artist names will be reused, which should improve data integrity. I also see an advantage in having fewer tables in the database.
Incidentally, I do still have one additional table and a bridge table--there is a "Tags" feature to add tags to a particular release, and since this is a many-to-many relationship, I feel a discrete tag table and a bridge table to join tags to releases is appropriate.
Having never encountered an ENUM datatype in a database before, I wonder if I am making wise use of this feature, or if there are problems I haven't foreseen that might come back to bite me as a result of this data architecture. Experienced MySQL'ers, what do you think?
In short, this is not a good design. Foreign keys have a purpose.
From the documentation for the ENUM type:
An enumeration can have a maximum of 65,535 elements.
Your design will not allow you to store more than 65k distinct artist names.
Have you considered what happens when you add a new artist name? I assume you are running an ALTER TABLE to add new enum types? According to a similar SO question this is a very expensive operation. Contrast this with the cost of simply adding another row to the artist table.
What happens if you have more than one table that needs to refer to an artist/artist's name? How do you re-use enum values across tables?
There are many other problems with this approach as well. I think that simplifying your database design like this does you a real disservice (foreign keys or having multiple tables are not a bad thing!).
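As a rough illustration of the foreign-key alternative: the sketch below assumes Python with in-memory SQLite (which has no ENUM type, so only the lookup-table side is shown), and the artists/releases column names and the add_artist helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE artists (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE releases (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    artist_id INTEGER NOT NULL REFERENCES artists(id)
);
""")


def add_artist(conn, name):
    """Adding an artist is a plain INSERT; no ALTER TABLE is involved."""
    cur = conn.execute("INSERT INTO artists (name) VALUES (?)", (name,))
    return cur.lastrowid


artist_id = add_artist(conn, "Example Artist")
conn.execute(
    "INSERT INTO releases (title, artist_id) VALUES (?, ?)", ("First EP", artist_id)
)

# The foreign key still guarantees that releases only reference known artists.
try:
    conn.execute(
        "INSERT INTO releases (title, artist_id) VALUES (?, ?)", ("Ghost LP", 999)
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Any other table that needs to refer to an artist can carry the same artist_id column, which answers the question of re-use across tables.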
I'm going to be honest - I stopped when I read...
I have used enumerations everywhere I would normally use a foreign key to a separate table--Artist name, Label name, and several others.
If I understand correctly, that means there is an enumeration of all artists. But that enumeration of artists is definitely going to be a point of variation: there will be more artists. I sincerely doubt the record label never plans on increasing or changing the list of artists ;)
As such, in my opinion, that is an incorrect use of an enumeration.
I also don't think it's appropriate to perform an ALTER TABLE for what is inevitably a rather mundane use case (Create/Read/Update/Destroy artist). I have no numbers to back up that opinion.
You have to look at it as a question of what information is an entity or an attribute of an entity: for a record label, artists are entities, but media types may not be. Artists have lots of information associated with them (name, genre, awards, web site url, seniority...) which suggests they are an entity, not an attribute of another entity such as Release. Also, Artists are Created/Read/Updated and Destroyed as part of regular everyday use of the system, further suggesting they are entities.
Entities tend to get their own table. Now, when you look at the Media Type of these Releases, you have to ask yourself whether Media Type has any other information... if it's anything more than Name you have a new Entity. For example, if your system has to keep track of whether a media type is obsolete, now there are 2 attributes for Media Type (name, is obsolete) and it should be a separate entity. If the Media Types only have a Name within the scope of what you're building, then it's an attribute of another entity and should only be a column, not a table. At that point I would consider using an enumeration.
I don't think you can use enumerations for fields like artists. It's like restricting your application from growing, and it will be really hard to maintain the column. Using ENUM is not a problem on its own, but it will be an issue in the following situations:
When you need to add additional options to the enum column. If your table contains lots of data, it will take a good while to rebuild the table when adding an option.
When you need to port the database to another technology (enum is not available in all database products, e.g. MSSQL).
I'm trying to build (right now just thinking/planning/drawing relations :] ) a little modular system for building basic websites (mostly to simplify common tasks we as web designers do routinely).
I got a little stuck with the database design and the whole idea of storing content.
1. What is most painful on most websites (in my experience) is pages that share almost the same layout/skeleton but hold different information, e.g. a title, a picture, and a set of information. Making special templates or special modules in the CMS happens to cost more energy than editing the page as text. However, here we lose some operational potential: we can't get "only titles", because the CMS/system treats the whole content as one text field.
So, I would like to use two tables. One holds information about what structure the content has (e.g. just a variable number of photos (1 to 500) :], or title & text & photo (large) & gallery): the HOW. The other holds all contents, modules and parts of "collections" (my working name for various structured information): the WHAT.
table module_descriptors (HOW)
id int
structure - *???*
table modules (WHAT)
id int
module_type - #link to module_descriptors id
content - *???*
2. What I like about this is that I don't need many tables. I don't like databases with 6810 tables, one for each module, for its description, for misc. number-to-text relations, ... and I also don't like tables with 60 columns, like content_us, content_it, category_id, parent_id.
I'm thinking I could hold the structure description and the content itself (note the ???) as either XML or CSV, but maybe I'm trying to reinvent the wheel and the answer to this is hidden in some design pattern I haven't looked into.
Hope I make any sense at all and would get some replies - give me your opinion, pros, cons... or send me to hell. Thank you
EDIT: My question is also this: Does this approach make sense? Is it edit-friendly? Isn't there something better? Is it moral? Do kittens die when I do this? Isn't it too much for the server if I want to read and compare 30 XMLs pulled from the DB (e.g. I want to compare something)? The technical part - how to do it - is just one part of the question :)
The design pattern you're hinting at is called Serialized LOB. You can store some data in the conventional way (as columns) for attributes that are the same for every entry. For attributes that are variable, format them as XML or Markdown or whatever you want, and store the result in a TEXT or BLOB column.
Of course you lose the ability to use SQL expressions to query individual elements within the BLOB. Anything you need to use in searching or sorting should be in conventional columns.
Re comment: If your text blob is in XML format, you could search it with XML functions supported by MySQL 5.1 and later. But this cannot benefit from an index, so it's going to result in very slow searches.
The same is true if you try to use LIKE or RLIKE with wildcards. Without using an index, searches will result in full table-scans.
You could also try to use a MySQL FULLTEXT index, but this isn't a good solution for searching XML data, because it won't be able to tell the difference between text content and XML tag names and XML attributes.
So just use conventional columns for any fields you want to search or sort by. You'll be happier that way.
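A small sketch of that split, assuming Python with in-memory SQLite instead of MySQL and JSON instead of XML for the serialized part. The modules table loosely follows the question; the title column, the index name and the save_module helper are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE modules (
    id          INTEGER PRIMARY KEY,
    module_type TEXT NOT NULL,
    title       TEXT NOT NULL,   -- searched and sorted, so a real column
    content     TEXT NOT NULL    -- variable structure, serialized as JSON
)
""")
conn.execute("CREATE INDEX idx_modules_title ON modules (title)")


def save_module(conn, module_type, title, extra):
    """Store fixed fields as columns and the variable part as one serialized blob."""
    conn.execute(
        "INSERT INTO modules (module_type, title, content) VALUES (?, ?, ?)",
        (module_type, title, json.dumps(extra)),
    )


save_module(conn, "gallery", "Summer trip", {"photos": ["a.jpg", "b.jpg"]})
save_module(conn, "article", "About us", {"text": "We build websites."})

# "Only titles" is a plain query that can use the index on the column.
titles = [row[0] for row in conn.execute("SELECT title FROM modules ORDER BY title")]

# The variable part is fetched whole and decoded in the application.
row = conn.execute(
    "SELECT content FROM modules WHERE title = ?", ("Summer trip",)
).fetchone()
gallery = json.loads(row[0])
```

Anything that later turns out to need searching or sorting would be promoted from the blob to its own column.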
Re question: If your documents really require variable structure, you have few choices. When used properly, SQL assumes that every row has the same structure (that is, columns). Your alternatives are:
Single Table Inheritance or Concrete Table Inheritance or Class Table Inheritance
Serialized LOB
Non-relational databases
Some people resort to an antipattern called Entity-Attribute-Value (EAV) to store variable attributes, but honestly, don't go there. For a story about how badly this can go wrong, read this article: Bad CaRMa.