Optimal database structure design for one-to-many - PHP

I am building an inventory tracking system for internal use at my company. I am working on the database structure and want to get some feedback on which design is better*.
I need a recursive (I might be using this term wrong...) system where a part could be made up of zero or more parts. I thought of two ways to do this but am not sure which one to use. I am not an expert in database design, so maybe there is a third option that I haven't thought of.
Option 1:
Two tables: one with part_id, and another with part_id, sub_part_id (which refers to another part_id), and quantity. In the first table part_id would be unique, and in the second table there could be zero or more rows listing all the parts that make up a certain part.
Option 2:
One table with part_id and assembly. assembly would be a text field that looks something like this: part_id,quantity;part_id,quantity;.... I would then use the PHP explode() function to split on the semicolons, and again on the commas, to get an array of the sub-parts.
I hope this all makes sense. I am using PHP/MySQL.
*community wiki because this may be subjective.

Generally, option 1 is preferable to option 2, not least because some of the part IDs in the assembly would themselves be assemblies.
You do have to deal with recursive or tree-structured queries. That is not particularly easy in any dialect of SQL. Some systems have better support for them than others: Oracle has its CONNECT BY PRIOR syntax (weird, but it sort of works), DB2 has recursive WITH clauses, and so on.
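MySQL itself has since gained recursive WITH clauses (8.0+). A minimal sketch against the option 1 layout, with table and column names assumed (part_component being the part_id/sub_part_id/quantity table):
WITH RECURSIVE bom AS (
    -- direct components of part 1
    SELECT sub_part_id, quantity, 1 AS depth
    FROM part_component
    WHERE part_id = 1
    UNION ALL
    -- components of those components, and so on down the tree
    SELECT pc.sub_part_id, pc.quantity, b.depth + 1
    FROM part_component pc
    JOIN bom b ON pc.part_id = b.sub_part_id
)
SELECT sub_part_id, quantity, depth FROM bom;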

NEVER, never ever use procedural languages like PHP or C# to process data structures when you have a database engine for that. Relational data structures are much faster, more flexible, and more reliable than stored text. Forget about Option 2.
You could use recursive UDFs to retrieve the whole tree without much fuss.

How about a nullable foreign key on the same table? Something like:
CREATE TABLE part (
    part_id int not null auto_increment primary key,
    parent_part_id int null,    -- NULL parent means a top-level part
    constraint fk_parent_part foreign key (parent_part_id) references part (part_id)
);

Definitely not option 2. That is a recipe for trouble. The correct answer depends on how many potential levels of assemblies are possible, and how you think of the assemblies. Do you think of an assembly (a composite object consisting of 2 or more atomic parts) as a part in its own right, which can itself be used as a subpart in another assembly? Or are assemblies a fundamentally different kind of thing from an atomic part?
If the former is the case, then put all assemblies and parts in one table, with a PartID, and add a second table that just has the construction details for those parts that are composed of multiple other parts (which themselves may be assemblies of yet more atomic parts). This second table would look like this:
ConstructionDetails
PartId, SubPartId, QuantityRequired
If you think of things more like the second way, then only put the atomic parts in the first table, and put the assemblies in the second table
Assemblies
AssemblyId, PartId, QuantityRequired
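For the former case, a minimal DDL sketch (names from the answer, column types assumed):
CREATE TABLE Parts (
    PartId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);
CREATE TABLE ConstructionDetails (
    PartId INT NOT NULL,        -- the composite part
    SubPartId INT NOT NULL,     -- a component, which may itself be composite
    QuantityRequired INT NOT NULL,
    PRIMARY KEY (PartId, SubPartId),
    FOREIGN KEY (PartId) REFERENCES Parts (PartId),
    FOREIGN KEY (SubPartId) REFERENCES Parts (PartId)
);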

Related

Multilanguage Database: which method is better?

I have a Website in 3 languages.
Which is the best way to structure the DB?
1) Create 3 tables, one for every language (e.g. Product_en, Product_es, Product_de) and retrieve data from the right table with an identifier:
e.g. on the php page I have a string:
$language = 'en';
so I query only that table:
SELECT * FROM Product_$language
2) Create 1 table with:
ID LANGUAGE NAME DESCR
and fetch for the page only the rows matching:
WHERE LANGUAGE = '$language'
3) Create 1 table with:
ID NAME_EN DESCR_EN NAME_ES DESCR_ES NAME_DE DESCR_DE
Thank you!
I'd rather go for the second option.
The first option seems not flexible enough for searching records. What if you need to search in two languages? The best you could do is UNION the results of two SELECT statements. The third one has data redundancy: you would repeat the name and description columns for every language.
The second one is very flexible and handy. You can do whatever operations you want without any special methods, unless you want to pivot the records.
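A sketch of option 2 (column types assumed; the composite primary key keeps one row per product per language):
CREATE TABLE Product (
    ID INT NOT NULL,
    LANGUAGE CHAR(2) NOT NULL,    -- 'en', 'es', 'de'
    NAME VARCHAR(100) NOT NULL,
    DESCR TEXT,
    PRIMARY KEY (ID, LANGUAGE)
);
-- one language:
SELECT NAME, DESCR FROM Product WHERE ID = 7 AND LANGUAGE = 'en';
-- two languages in a single query, no UNION needed:
SELECT LANGUAGE, NAME, DESCR FROM Product WHERE ID = 7 AND LANGUAGE IN ('en', 'es');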
I would opt for option one or two. Which one really depends on your application and how you plan to access your data. When I have done similar localization in the past, I have used the single table approach.
My preference for this approach is that you don't need to change the DB schema at all should you add additional localizations. You also should not need to change your related code in this case either, as the language identifier just becomes another value that is used in the query.
With separate tables or columns per language, you would be killing the database in no time.
Just do a table like:
TABLE languages with fields:
-- product name
-- product description
-- two-letter language code
This will allow you not only to have a better-structured database, but also to have products with only one translation. If you want, you can even show the default language when no other is specified. That you'll do programmatically of course, but I think you get the idea.
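The fallback can also be pushed into the query itself; a hedged sketch, reusing the Product(ID, LANGUAGE, NAME, DESCR) layout sketched earlier, with 'en' as the assumed default:
-- prefer the requested language, fall back to the default row
-- when no translation exists for that product
SELECT d.ID,
       COALESCE(t.NAME, d.NAME) AS NAME,
       COALESCE(t.DESCR, d.DESCR) AS DESCR
FROM Product d
LEFT JOIN Product t ON t.ID = d.ID AND t.LANGUAGE = 'es'
WHERE d.LANGUAGE = 'en';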

Storing database info as array

Which is good practice? To store data as a comma separated list in the database or have multiple rows?
I have a table for accounts, classes, and enrolments.
If the enrolment table has 3 fields: ID, AccountID and ClassID, is it better for ClassID to be a varchar containing a comma separated list such as this: "24,21,182,12" or for it to be just an int and have one entry per enrolment?
tldr: Don't do this. That is, don't use a "packed array" here.
Use a correctly normalized design with "multiple rows". This is likely a good candidate for a Many-to-Many relationship. Consider this structure:
Classes 1:M Enrollments(Class,Student) M:1 Students
Following a properly normalized design will reduce pain. In addition, here are some other advantages:
Referential integrity (use InnoDB)
Consistent model described with relationships
Type enforcement (can't have "foo,,")
JOIN and query without needing custom code
"What are the names of the students in class A?"
"Who is taking more than one class?"
Columns can be usefully indexed (query performance)
Generally faster than handling locally in code
More flexible and consistent
Can attach attributes to enrollments such as status
No need to have code to handle serialization at access sites
More accommodating of placeholders and ORMs
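A minimal sketch of that structure, using the question's accounts/classes/enrolments names (column types assumed; InnoDB for the referential integrity mentioned above):
CREATE TABLE enrolment (
    AccountID INT NOT NULL,
    ClassID INT NOT NULL,
    PRIMARY KEY (AccountID, ClassID),                 -- one row per enrolment
    FOREIGN KEY (AccountID) REFERENCES account (ID),
    FOREIGN KEY (ClassID) REFERENCES class (ID)
) ENGINE=InnoDB;
-- "Which accounts are enrolled in class 24?" -- indexed, no string parsing:
SELECT AccountID FROM enrolment WHERE ClassID = 24;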
Never ever ever cram multiple values into a single database field by combining them with some sort of delimiter, like a comma, or as fixed-length substrings. In the rare cases where this clearly gives a benefit in storage requirements or performance ... see rule #1: never ever ever. Ever.
When you cram multiple values into a single field, you sabotage all the clever features built into the database engine to help you retrieve and manipulate values.
Like let's say you have this -- I guess it's some sort of student database.
Plan A
student (student_id, account_id, class_id_mash)
Plan B
student (student_id, account_id)
student_class (student_id, class_id)
Okay, let's say you want a list of all the students taking class #27. With Plan B you write:
select student.student_id
from student join student_class on student.student_id = student_class.student_id
where class_id = 27
Easy.
How would you do it with Plan A? You might think
select student_id
from student
where class_id_mash like '%27%'
But that will not only find all students in class 27, but also all those in class 127 or 272.
Okay, how about:
select student_id
from student
where class_id_mash like '%,27,%'
There, now we won't find 127 or 272! But, oops, we also won't find it if the 27 happens to be the first or last one in the list, because then there aren't commas on both sides.
So okay, maybe we could get around that with more rules about delimiters or with a more complex matching expression. But it would be unnecessarily complex and painful.
And even if we did it, every search by class id would have to be a full sequential scan. With one value per field and multiple records, you can create an index on the class_id field for fast, efficient retrieval. (Some database engines have ways to index into the middle of text fields, but again, why get into complicated solutions when there's an easy solution?)
How do we validate the class_ids? With separate fields, we can say "class_id references class" and the database engine will ensure that we don't enter an illegal value. With the mash, no such free validation.
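With Plan B, the fast lookup and the free validation are one line each; a minimal sketch (column types assumed):
CREATE TABLE student_class (
    student_id INT NOT NULL,
    class_id INT NOT NULL,
    PRIMARY KEY (student_id, class_id),
    KEY idx_class (class_id),                           -- fast "who is in class 27?"
    FOREIGN KEY (class_id) REFERENCES class (class_id)  -- free validation
) ENGINE=InnoDB;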
I have done both, but instead of storing the information in the database comma-separated, I use another delimiter, such as | (so that I don't have to worry about formatting on insert into the db). It's more about how often you will query the data.
If you are only going to need the complete list, it is fine to store it as a comma separated value. But if you need to query the list, they should be stored separately.

Normalized table structure in MySql... Sort of?

I am wondering what thoughts are on the following table structures for MySQL.
I have a relationship between exercises and exercise parameters, where a single exercise can have multiple parameters.
For example, the exercise 'sit-ups' could have the parameters 'sets' & 'reps'.
All exercises start with a default set of parameters. For example: sets, reps, weight, hold & rest.
This list is fully customizable. Users can add parameters, remove parameters, or rename them, for each exercise in the database.
To express this relationship, I have the following one-to-many structure:
TABLE exercises
ID
Name
Table exerciseParameters
ID
exerciseID -> exercises(ID)
Name
What is concerning me is that I am noticing that even though users have the option to rename / customize parameters, a lot of the time they don't. So my exerciseParameters table is filling up with repeated words like "Sets" and "Reps" quite a bit.
Is there a better way something like this should be organized, to avoid so much repetition? (Bearing in mind that the names of the parameters have to be user-customizable. For example "Reps" might get changed to "Hard Reps" by the user.) (Or am I making a big deal out of nothing, and this is ok as is?)
Thanks, in advance, for your help.
Unless you are dealing with millions of rows, I'd leave the structure as it is. It is straightforward and easy to query.
If you are dealing with millions of rows and you have measured the storage impact and deem it unacceptable, then you have couple of options (not necessarily mutually exclusive):
Don't store the defaults
If a parameter is not present in exerciseParameters simply assume it has a default value. The actual defaults can be stored in a separate table or outside the database altogether (depending on your querying needs).
If user changes the default parameter, store it in exerciseParameters.
If user deletes the default parameter, represent it as an exerciseParameters row containing a NULL value.
If user restores the default parameter to its original value, remove it from exerciseParameters.
This exploits the assumption that there will be many more unchanged defaults than edited or deleted ones. The cost is increased complexity (in both modification and querying) and potentially performance.
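A sketch of that read side. It assumes a defaultParameters lookup table and an extra defaultID column on exerciseParameters tying an override to the default it replaces; both are assumptions, not part of the original schema:
-- effective parameter names for exercise 42;
-- an override row with a NULL Name marks a deleted default
SELECT COALESCE(ep.Name, dp.Name) AS Name
FROM defaultParameters dp
LEFT JOIN exerciseParameters ep
    ON ep.exerciseID = 42 AND ep.defaultID = dp.ID
WHERE ep.ID IS NULL OR ep.Name IS NOT NULL
UNION ALL
-- plus any extra parameters the user added themselves
SELECT Name
FROM exerciseParameters
WHERE exerciseID = 42 AND defaultID IS NULL;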
Reorganize your data model
so that names (and values) are stored only once, making the repetitions cheaper. ParameterNameID and ParameterValueID would be integers, so each repetition in exerciseParameters is much cheaper (storage-wise) than if they were strings. OTOH, you lose simplicity and potentially pay a price in querying performance (more JOINs needed).
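For example, a sketch of the name side (table and column names assumed from the IDs mentioned above):
CREATE TABLE parameterNames (
    ParameterNameID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(50) NOT NULL UNIQUE    -- "Sets", "Reps", ... stored once
);
CREATE TABLE exerciseParameters (
    exerciseID INT NOT NULL,
    ParameterNameID INT NOT NULL,       -- a cheap integer instead of a repeated string
    FOREIGN KEY (exerciseID) REFERENCES exercises (ID),
    FOREIGN KEY (ParameterNameID) REFERENCES parameterNames (ParameterNameID)
);
-- reading the names back costs the extra JOIN mentioned above:
SELECT pn.Name
FROM exerciseParameters ep
JOIN parameterNames pn ON pn.ParameterNameID = ep.ParameterNameID
WHERE ep.exerciseID = 42;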
Use a different DBMS
One that supports clustering and leading-edge index compression (for example, Oracle's ORGANIZATION INDEX COMPRESS tables can greatly diminish the storage impact of repeated values).
You could add another table, defaultExerciseParams, with the default parameters and values. Whenever a user decides to override any of those, remove the param from this table and push it into the exerciseParameters table.

mysql: use SET or lots of columns?

I'm using PHP and MySQL. I have records for:
events with various "event types" that are hierarchical (events can have multiple categories and subcategories, but there is a fixed number of such categories and subcategories) (timestamped)
What is the best way to set up the table? Should I have a bunch of columns (30 or so) with enums for yes or no indicating membership in that category? Or should I use the MySQL SET datatype?
http://dev.mysql.com/tech-resources/articles/mysql-set-datatype.html
Basically I have performance in mind and I want to be able to retrieve all of the ids of the events for a given category. Just looking for some insight on the most efficient way to do this.
It sounds like you're chiefly concerned with performance.
A couple people have suggested splitting into 3 tables (category table plus either simple cross-reference table or a more sophisticated way of modeling the tree hierarchy, like nested set or materialized path), which is the first thing I thought when I read your question.
With indexes, a fully normalized approach like that (which adds two JOINs) will still have "pretty good" read performance. One issue is that an INSERT or UPDATE to an event may now also require one or more INSERT/UPDATE/DELETEs against the cross-reference table. On MyISAM that means the cross-reference table is locked; on InnoDB the affected rows are locked. So if your database is busy with a significant number of writes, you're going to have larger contention problems than if just the event rows were locked.
Personally, I would try out this fully normalized approach before optimizing. But, I'll assume you know what you're doing, that your assumptions are correct (categories never change) and you have a usage pattern (lots of writes) that calls for a less-normalized, flat structure. That's totally fine and is part of what NoSQL is about.
SET vs. "lots of columns"
So, as to your actual question "SET vs. lots of columns", I can say that I've worked with two companies with smart engineers (whose products were CRM web applications ... one was actually events management), and they both used the "lots of columns" approach for this kind of static set data.
My advice would be to think about all of the queries you will be doing on this table (weighted by their frequency) and how the indexes would work.
First, with the "lots of columns" approach you are going to need indexes on each of these columns so that you can do SELECT * FROM events WHERE CategoryX = TRUE. With the indexes, that is a super-fast query.
With SET, by contrast, you must use bitwise AND (&), LIKE, or FIND_IN_SET() to do this query. That means the query can't use an index and must do a linear scan of all rows (you can use EXPLAIN to verify this). Slow query!
That's the main reason SET is a bad idea -- its index is only useful if you're selecting by exact groups of categories. SET works great if you'd be selecting categories by event, but not the other way around.
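To make the contrast concrete (column and value names assumed):
-- "lots of columns": the index on category_concert is usable
SELECT id FROM events WHERE category_concert = TRUE;
-- SET: FIND_IN_SET() cannot use an index, so every row is scanned;
-- EXPLAIN will show a full table scan (type = ALL)
SELECT id FROM events WHERE FIND_IN_SET('concert', categories) > 0;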
The primary problem with the less-normalized "lots of columns" approach (versus fully normalized) is that it doesn't scale. If you have 5 categories and they never change, fine, but if you have 500 and are changing them, it's a big problem. In your scenario, with around 30 that never change, the primary issue is that there's an index on every column, so if you're doing frequent writes, those queries become slower because of the number of indexes that have to be updated. If you choose this approach, you might want to check the MySQL slow query log to make sure there aren't outlier slow queries because of contention at busy times of day.
In your case, if yours is a typical read-heavy web app, I think going with the "lots of columns" approach (as the two CRM products did, for the same reason) is probably sane. It is definitely faster than SET for that SELECT query.
TL;DR Don't use SET because the "select events by category" query will be slow.
It's good that the number of categories is fixed. If it wasn't you couldn't use either approach.
Check the "Why You Shouldn't Use SET" section on the page you linked. I think that should give you a comprehensive guide.
I think the most important one is about indexes. Also, modifying a SET is slightly more complex.
The relationship between events and event types/categories is a many-to-many relationship, as echo says, but a simple xref table will leave you with a problem: if you want to query for all descendants of any given node, you must make multiple recursive queries. On a deep tree, that will be very inefficient.
So when you say "retrieve all ids for a given category", if you do mean all descendants, then you want to use a Nested Set Model:
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
The Nested Set model makes writes and updates a bit slower, but makes it very easy to retrieve subtrees:
To get the Televisions subtree, you query for all categories with left >= 2 and right <= 9.
Leaf nodes always have left = right - 1
You can find the count of descendants without pulling those rows: (right - left - 1)/2
Finding inheritance paths and depth is also very easy (single query stuff). See the article for full details.
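The queries from that article look roughly like this (lft/rgt column names and the Televisions numbers taken from its example):
-- everything under Televisions (lft = 2, rgt = 9):
SELECT name FROM nested_category WHERE lft >= 2 AND rgt <= 9;
-- leaf nodes:
SELECT name FROM nested_category WHERE rgt = lft + 1;
-- descendant count without fetching the rows:
SELECT (rgt - lft - 1) / 2 AS descendants
FROM nested_category
WHERE name = 'Televisions';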
You might try using a cross-reference (Xref) table, to create a many-to-many relationship between your events and their types.
create table event_category_event_xref
(
    event_id int,
    event_category_id int,
    foreign key (event_id) references event (id),
    foreign key (event_category_id) references event_category (id)
);
Event / category membership is defined by records in this table. So if you have a record with {event_id = 3, event_category_id = 52}, it means event #3 is in category #52. Similarly you can have records for {event_id = 3, event_category_id = 27}, and so on.
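Fetching a category's events is then a plain indexed join; e.g. all events in category #52:
SELECT e.id
FROM event e
JOIN event_category_event_xref x ON x.event_id = e.id
WHERE x.event_category_id = 52;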

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine, I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two: one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I tried this model once, with a table of 120+ attributes (growing by 5-10 every year) and about 100k+ rows added every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for every situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you usually pick a generous column length, thus increasing the index size. In my case, attribs could have from 3 to 130 chars. The value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP suggests, the search needs to be done on 30-40 attributes, and I can't imagine how 30-40 joins would be efficient, or even a GROUP_CONCAT() given its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.
EDIT: Also, there are no normalization problems: either have lookup tables for attribute values or make them ENUMs.
EDIT 2: Of course, one could say I should have a lookup table for the possible attribute values (reducing index sizes), but that would then require a join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
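A minimal sketch of that key => val table (column types assumed; `option` back-quoted to be safe around keywords):
CREATE TABLE user_profile (
    user_id INT NOT NULL,
    `option` VARCHAR(64) NOT NULL,    -- e.g. 'profile_image', 'questions_answered'
    `value` TEXT,
    PRIMARY KEY (user_id, `option`),  -- one value per option per user
    FOREIGN KEY (user_id) REFERENCES user (id)
) ENGINE=InnoDB;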
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing I would do about this is create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than on an expression of intention)...
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for it). So most people would probably start with something like:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
AND candidate.id <> $current_user
AND candidate.id = candidate_attrs.user_id
AND candidate_attrs.attr_type = current_user_attrs.attr_type
AND candidate_attrs.attr_value = current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However, this forces the system to compare every available candidate to find the best match. Apply a little heuristics, though, and you can get a much more efficient query:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
AND candidate.id <> $current_user
AND candidate.id = candidate_attrs.user_id
AND candidate_attrs.attr_type = current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value - $tolerance
    AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(The value of $tolerance will affect the number of rows returned and, if you've got an index on (attr_type, attr_value), the query performance.)
This can be further refined into a points scoring system:
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
           * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
AND candidate.id <> $current_user
AND candidate.id = candidate_attrs.user_id
AND candidate_attrs.attr_type = current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value - $tolerance
    AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
           * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs,
     attribute_subsets s
WHERE current_user_attrs.user_id = $current_user
AND candidate.id <> $current_user
AND candidate.id = candidate_attrs.user_id
AND candidate_attrs.attr_type = current_user_attrs.attr_type
AND s.subset_name = $required_subset
AND s.attr_type = current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value - $tolerance
    AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, you don't need to make any changes to your PHP code or the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes, i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, and then just apply the same approach to find the best match within the subset of candidates who have also been matched to that stereotype.
Can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in one table.
Also, you can easily enforce referential integrity with foreign keys, using transactions and the InnoDB engine.
