I've encountered a table that has ~20 'is_X' fields (is_active, is_banned, is_allowed_to_view_something and so on), and it seems just plain wrong.
I am familiar with the bitwise method of storing an INT in one field and then breaking it into bits to use as flags, but is there any other way to store a lot of information (most of it yes/no) in a MySQL table without cluttering the table with tons of fields?
It is good practice to use a single statuses column instead of a separate column for every status. This can be done by serializing an object with the user's statuses, or simply by saving JSON.
Benefits:
easy to manage
your table gets smaller
Bit columns are also useful because they are easy to use.
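If you go the JSON route, a minimal sketch looks roughly like this (assuming MySQL 5.7+ with the native JSON type; the users table and flag names are just placeholders):
-- one JSON column instead of twenty is_X columns
ALTER TABLE users ADD COLUMN statuses JSON;
-- write all flags as one document (1/0 instead of true/false keeps comparisons simple)
UPDATE users
SET statuses = JSON_OBJECT('is_active', 1, 'is_banned', 0, 'is_allowed_to_view_something', 1)
WHERE id = 1;
-- read a single flag back; note that a query like this cannot use an ordinary index on the column
SELECT id FROM users WHERE JSON_EXTRACT(statuses, '$.is_active') = 1;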
Another way could be a rights table (I assume you are working on a user table):
user <=> user_has_right <=> rights
The rights table stores one row per right: active, banned, etc.
Basically, user_has_right has 2 foreign keys: fk_id_user and fk_id_right.
If an entry exists, then the user has that right.
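A rough sketch of the supporting tables (assuming user already has an integer primary key id_user; the lookup table is called rights here because RIGHT is a reserved word in MySQL):
create table rights (
    id_right int auto_increment primary key,
    label varchar(50)                  -- 'active', 'banned', ...
);
create table user_has_right (
    fk_id_user int not null,
    fk_id_right int not null,
    primary key (fk_id_user, fk_id_right),
    foreign key (fk_id_user) references user(id_user),
    foreign key (fk_id_right) references rights(id_right)
);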
Example :
Suppose you want all active users.
Suppose id_right for active users is 1.
SELECT u.* FROM user u
INNER JOIN user_has_right uhr ON uhr.fk_id_user = u.id_user
INNER JOIN rights r ON r.id_right = uhr.fk_id_right
WHERE r.id_right = 1;
MySQL does have a BIT type, but all the bit functions and operators currently take and return BIGINT (64-bit) values. If you can live with the cast to/from BIGINT, you can use any sufficiently large (for your application) integer type and use bitmasks with bitwise ORs and ANDs to set and clear (respectively) individual bits.
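A minimal sketch of that approach (the flags column and mask values are illustrative, not from the question):
-- bit 0 = active, bit 1 = banned, bit 2 = allowed_to_view_something, ...
ALTER TABLE users ADD COLUMN flags BIGINT UNSIGNED NOT NULL DEFAULT 0;
UPDATE users SET flags = flags | 1 WHERE id = 42;   -- set the "active" bit
UPDATE users SET flags = flags & ~2 WHERE id = 42;  -- clear the "banned" bit
SELECT id FROM users WHERE flags & 4;               -- everyone allowed to view something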
Clearly, the intended semantic of each bit-position is not explicit, and so less clear than individual is_A, is_B, ... columns, but you might be able to ameliorate that somewhat with a table of set/clear bitmasks.
It depends on what you're doing, how you use the fields, and (of course) on your own particular preferences and biases as to where the balance lies and whether one approach is better than the other for a particular situation.
Also, if you Google around, you'll find that there may be some issues and bugs with BIT in older versions (around 5.0.x) of MySQL.
I have a few tables with columns that could be either ENUM type or INT type. I tend to always use an integer type, assuming that searching on it will be faster.
For example, one of my tables has a column StatusType, which can have only 4 possible values: Completed, In Progress, Failed, Todo.
Instead of storing the above as ENUM strings, I store them as:
1, 2, 3, 4 respectively. Then in my PHP code I have constants that define these values, like this:
define('COMPLETED', 1);
define('IN_PROGRESS', 2);
define('FAILED', 3);
define('TODO', 4);
Now my question is: am I doing it the right way, or should I just change it to the ENUM type and use strings to compare in queries? I have many other columns that can only have a set of at most 4-5 possible values.
Enum values look really cool in MySQL, yet I am not a fan of them. They are limited to 65,535 distinct values, so if you keep adding values you can eventually hit that limit. Also, as you describe, you need to synchronize the values in your application code with the values in the database -- something that seems potentially dangerous.
In addition, they make certain future changes more difficult. For instance, other databases do not support enums. And, if you want to add multi-lingual support, having codes embedded in data type definitions in the database is a bit hard to deal with.
The more standard method is one or more reference tables, where you use a join to get the values. You can use a hybrid approach with a reference table in the database; then you can load that table into the application to get the mapping from numbers to strings, so you can avoid the joins in your code.
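A rough sketch of that hybrid approach (the table and column names are illustrative):
-- the reference table lives in the database
CREATE TABLE status_type (
    id TINYINT PRIMARY KEY,
    name VARCHAR(20) NOT NULL
);
INSERT INTO status_type VALUES (1, 'Completed'), (2, 'In Progress'), (3, 'Failed'), (4, 'Todo');
-- the join stays available whenever you need readable output
SELECT t.id, st.name
FROM tasks t
JOIN status_type st ON st.id = t.status_type_id;
The application can load status_type once at startup to build its number-to-string map and skip the join on hot paths.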
You are half-correct. Enum is very bad from a performance perspective: MySQL Enum performance advantage?
That said, binding the definitions of the INTs to your code is also not a great thing. Ideally, if you follow standard data normalization patterns, you would define the values of the INTs in the database as well, in another table, and use the key of that definition row as the value for the assignment.
See: http://en.wikipedia.org/wiki/Database_normalization#Normal_forms
The reason for this is that the data stays portable and useful without requiring the codebase to interpret it (you can easily dump a CSV for Excel by executing a join).
God Speed.
Example SQL:
SELECT students.*, states.name AS state FROM students
JOIN states ON students.state_id = states.id
Just to get state names.
Or to filter:
SELECT students.* FROM students
JOIN states ON students.state_id = states.id
WHERE states.name = 'Maine' OR states.code = 'ME'
Yeah, strange example, but the idea is that INTs are tiny and VARCHARs are... variable. Storing 'Maine' as opposed to 16 adds up over millions of rows. Further, indexing on an INT is much faster than on a VARCHAR, so your look-ups are going to be much faster, particularly if you already know the number ahead of time and can build your query without the JOIN. This is not advisable as a common practice, but it can be done if you want to make something even faster and you can ensure the validity of the assumed value.
I am using WordPress with some custom post types (just to give a description of my DB structure -- it's WP's).
Each post has custom meta, which is stored in a separate table (postmeta table). In my case, I am storing city and state.
I've added some actions to WP's save_post/trash_post hooks so that the city and state are also stored in a separate table (cities) like so:
ID (auto increment) | postID (int) | city (varchar) | state (varchar)
I did this because I assumed that this table would be faster than querying the rather large postmeta table for a list of available cities and states.
My logic also forced me to add/update cities and states for every post, even though this causes duplicates (in the city/state fields). This must be so because I need to keep track of which states/cities actually have a post associated with them. When a post is added or deleted, its record in the cities table is added or deleted along with it.
This brings me to my question(s).
Does this logic make sense or do I suck at DB design?
If it does make sense, my real question is this: **would it be faster to use MySQL's "SELECT DISTINCT" or just "SELECT *" and then use PHP's array_unique on the results?**
Edits for comments/answers thus far:
The structure of the table is exactly how I typed it out above. There is an index on ID, but the point of this table isn't to retrieve an indexed list, but to retrieve ALL results (that are unique) for a list of ALL available city/state combos.
I think I may go with (I don't know why I didn't think of this before) just adding a serialized list of city/state combos in ONE record in the wp_options table. Then I can just get that record, and filter out the unique records I need.
Can I get some feedback on this? I would imagine that retrieving and filtering a serialized array would be faster than storing the data in a separate table for retrieval.
To answer your question about SELECT DISTINCT vs. array_unique: I would almost always prefer to limit the result set in the database, assuming of course that you have an appropriate index on the field for which you are trying to get distinct values. This saves you the time spent transmitting extra data from the DB to the application and having the application read that data into memory before you can work with it.
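For example, with the cities table described above, something along these lines (the index name is made up):
ALTER TABLE cities ADD INDEX idx_city_state (city, state);
-- MySQL can usually satisfy this from the index alone
SELECT DISTINCT city, state FROM cities;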
As far as your separate table design goes, it is hard to speculate whether this is a good approach or not; it largely depends on how you are actually performing your query (i.e. are you doing two separate queries -- one for post info and one for city/state info -- or querying across a join?).
There is really only one definitive way to determine which approach is fastest: test both in your environment.
1) A fully normalized table (where the main table holds only integer values and the other tables hold just an int plus a varchar) has the advantage when you are not doing full table joins often and are doing a lot of searching on the normalized fields. The downside is that it requires large join/sort buffers and results in more complex queries, so there is much less chance a query will be auto-optimized by MySQL; you have to optimize your queries yourself.
2) SELECT DISTINCT will be faster in almost all cases. The only case where it will be slower is when you have a small sort buffer configured in /etc/my.cnf and a much larger memory buffer available to PHP.
A DISTINCT select can use indexes, while your PHP code can't.
Also, sending a large amount of data to your app requires a lot of MySQL CPU time and wall-clock time.
I'm using PHP and MySQL. I have records for:
events with various "event types" that are hierarchical (events can have multiple categories and subcategories, but there is a fixed number of such categories and subcategories) (timestamped)
What is the best way to set up the table? Should I have a bunch of columns (30 or so) with ENUMs for yes or no indicating membership in each category, or should I use MySQL's SET datatype?
http://dev.mysql.com/tech-resources/articles/mysql-set-datatype.html
Basically I have performance in mind and I want to be able to retrieve all of the ids of the events for a given category. Just looking for some insight on the most efficient way to do this.
It sounds like you're chiefly concerned with performance.
A couple people have suggested splitting into 3 tables (category table plus either simple cross-reference table or a more sophisticated way of modeling the tree hierarchy, like nested set or materialized path), which is the first thing I thought when I read your question.
With indexes, a fully normalized approach like that (which adds two JOINs) will still have "pretty good" read performance. One issue is that an INSERT or UPDATE to an event may now also involve one or more INSERT/UPDATE/DELETEs to the cross-reference table, which on MyISAM means the cross-reference table is locked and on InnoDB means the rows are locked; so if your database is busy with a significant number of writes, you're going to have a larger contention problem than if just the event rows were locked.
Personally, I would try out this fully normalized approach before optimizing. But, I'll assume you know what you're doing, that your assumptions are correct (categories never change) and you have a usage pattern (lots of writes) that calls for a less-normalized, flat structure. That's totally fine and is part of what NoSQL is about.
SET vs. "lots of columns"
So, as to your actual question "SET vs. lots of columns", I can say that I've worked with two companies with smart engineers (whose products were CRM web applications ... one was actually events management), and they both used the "lots of columns" approach for this kind of static set data.
My advice would be to think about all of the queries you will be doing on this table (weighted by their frequency) and how the indexes would work.
First, with the "lots of columns" approach you are going to need indexes on each of these columns so that you can do SELECT ... FROM events WHERE CategoryX = TRUE. With the indexes, that is a super-fast query.
Versus with SET, you must use bitwise AND (&), LIKE, or FIND_IN_SET() to do this query. That means the query can't use an index and must do a linear search of all rows (you can use EXPLAIN to verify this). Slow query!
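As a rough illustration (assuming an events table; the boolean column CategoryX and the SET column categories are made up to match the example):
-- "lots of columns": an index on CategoryX can satisfy this directly
SELECT id FROM events WHERE CategoryX = TRUE;
-- SET column: no usable index, so every row has to be examined
SELECT id FROM events WHERE FIND_IN_SET('CategoryX', categories) > 0;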
That's the main reason SET is a bad idea -- its index is only useful if you're selecting by exact groups of categories. SET works great if you'd be selecting categories by event, but not the other way around.
The primary problem with the less-normalized "lots of columns" approach (versus fully normalized) is that it doesn't scale. If you have 5 categories and they never change, fine, but if you have 500 and they are changing, it's a big problem. In your scenario, with around 30 that never change, the primary issue is that there's an index on every column, so if you're doing frequent writes, those queries become slower because of the number of indexes that have to be updated. If you choose this approach, you might want to check the MySQL slow query log to make sure there aren't outlier slow queries because of contention at busy times of day.
In your case, if yours is a typical read-heavy web app, I think going with the "lots of columns" approach (as the two CRM products did, for the same reason) is probably sane. It is definitely faster than SET for that SELECT query.
TL;DR Don't use SET because the "select events by category" query will be slow.
It's good that the number of categories is fixed. If it wasn't you couldn't use either approach.
Check the "Why You Shouldn't Use SET" section on the page you linked; I think that should give you a comprehensive guide.
I think the most important one is about indexes. Also, modifying a SET is slightly more complex.
The relationship between events and event types/categories is a many-to-many relationship, as echo says, but a simple xref table will leave you with a problem: if you want to query for all descendants of any given node, you must make multiple recursive queries. On a deep tree, that will be very inefficient.
So when you say "retrieve all ids for a given category", if you do mean all descendants, then you want to use a Nested Set Model:
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
The Nested Set model makes writes and updates a bit slower, but makes it very easy to retrieve subtrees:
To get the Televisions subtree, you query for all categories with left >= 2 and right <= 9.
Leaf nodes always have left = right - 1
You can find the count of descendants without pulling those rows: (right - left - 1)/2
Finding inheritance paths and depth is also very easy (single query stuff). See the article for full details.
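For instance, a subtree query looks roughly like this (assuming a categories table with lft/rgt columns as in the article):
-- all descendants of a given node
SELECT child.*
FROM categories parent
JOIN categories child ON child.lft > parent.lft AND child.lft < parent.rgt
WHERE parent.name = 'Televisions';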
You might try using a cross-reference (Xref) table, to create a many-to-many relationship between your events and their types.
create table event_category_event_xref
(
    event_id int not null,
    event_category_id int not null,
    primary key (event_id, event_category_id),   -- one membership row per event/category pair
    foreign key (event_id) references event(id),
    foreign key (event_category_id) references event_category(id)
);
Event / category membership is defined by records in this table. So if you have a record with {event_id = 3, event_category_id = 52}, it means event #3 is in category #52. Similarly you can have records for {event_id = 3, event_category_id = 27}, and so on.
I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine, I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two: one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my own experience with it.
I once tried this model with a table of 120+ attributes (growing by 5-10 every year) and about 100k+ new rows every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for every situation) is that you need to put a primary key on (user_id, attrib) in that second table. Not knowing the potential length of attrib, you usually pick a generous length, which inflates the index. In my case, attribs could be anywhere from 3 to 130 chars, and the value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or at least 50% of them) needs to exist.
Also, as the OP says, the search needs to be done on 30-40 attributes, and I can't imagine how 30-40 joins would be efficient, or even a GROUP_CONCAT() given its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.
EDIT: Also, there are no normalization problems: either have lookup tables for attribute values or make them ENUMs.
EDIT 2: Of course, one could say I should have a lookup table for the possible attribute values (reducing index sizes), but then I would have to join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
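A rough sketch of the two tables (column names and sizes are just examples; OPTION is a reserved word, hence the backticks):
CREATE TABLE user (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    firstname VARCHAR(50),
    lastname VARCHAR(50),
    email VARCHAR(100),
    username VARCHAR(50),
    role_id INT,
    registration_date DATETIME
);
CREATE TABLE user_profile (
    user_id INT NOT NULL,
    `option` VARCHAR(50) NOT NULL,
    value TEXT,
    PRIMARY KEY (user_id, `option`),
    FOREIGN KEY (user_id) REFERENCES user(user_id)
);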
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than having 100 (and growing) columns, add one table with three columns:
user_id, property, value.
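A couple of illustrative queries against such a table (the table name user_attributes is made up):
-- everything stored for one user
SELECT property, value FROM user_attributes WHERE user_id = 42;
-- users matching one searched attribute
SELECT user_id FROM user_attributes WHERE property = 'eye_color' AND value = 'green';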
In general, you shouldn't sacrifice database integrity for performance.
The first thing I would do about this is create a table with 1 million rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine -- 1 million rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
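One low-effort way to generate that dummy data (a rough sketch; the table and columns are placeholders for your real schema):
CREATE TABLE users_test (
    id INT AUTO_INCREMENT PRIMARY KEY,
    age TINYINT,
    city VARCHAR(50)
);
INSERT INTO users_test (age, city) VALUES (25, 'A'), (32, 'B'), (41, 'C'), (19, 'D');
-- each run doubles the row count; repeating it ~18 times gets past 1 million rows
INSERT INTO users_test (age, city) SELECT age, city FROM users_test;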
If you find the performance poor and the database really does turn out to be a bottleneck, consider general optimizations like caching (at all levels, from the MySQL query cache to HTML caching), getting better hardware, etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You should definitely go down the two-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an Ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)....
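For concreteness, the queries below assume an attribute table roughly like this (the names are guesses, not taken from the question):
CREATE TABLE attributes (
    user_id INT NOT NULL,
    attr_type INT NOT NULL,
    attr_value INT NOT NULL,
    PRIMARY KEY (user_id, attr_type),
    KEY idx_type_value (attr_type, attr_value)
);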
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for it). So most people would probably start with something like:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value=current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However, this forces the system to compare every available candidate to find the best match. Applying a little heuristics, you could get a very effective query:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and the query performance, assuming you've got an index on (attr_type, attr_value)).
This can be further refined into a points scoring system:
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value))
)) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value))
)) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND s.subset_name=$required_subset
AND s.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes, i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier; then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to that stereotype.
I can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in one table.
Also, you can easily enforce referential integrity with foreign keys by using transactions and the InnoDB engine.
I have a table and one of the columns is co_com.
This holds communication preferences.
There are three options (and there only ever will be).
I don't want to have a separate column for each of these values,
so I was thinking of storing them as
sms/email/fax
sms = yes
email = no
fax = yes
which would be stored as: 101
But I'm thinking that's not the best way.
What other ways can you see?
Yes, I am aware that this is a subjective question, but I'm not sure how else to ask.
You're correct. That is in fact not the best way.
You say you don't want to have separate columns for these values, but that's exactly what you should be doing.
Storing combinations of logical values as coded binary is... 1900's. Seriously, how much does disk space cost these days, and how much do you save by cramming three bits of information into a single number rather than three bytes or characters?
Go on, create three columns with sensible names, and store either 0s and 1s in them, or if your DB is weird that way, store 'Y' and 'N'. But don't do this binary cleverness stuff. It will bite you eventually when you try to write sensible queries.
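A rough sketch of that (the table name and column names are placeholders):
ALTER TABLE my_table
    ADD COLUMN pref_sms TINYINT(1) NOT NULL DEFAULT 0,
    ADD COLUMN pref_email TINYINT(1) NOT NULL DEFAULT 0,
    ADD COLUMN pref_fax TINYINT(1) NOT NULL DEFAULT 0;
-- queries stay readable
SELECT * FROM my_table WHERE pref_sms = 1 AND pref_fax = 1;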
In my mind, columns are the best way to go, for ease of use if nothing else. The columns are straightforward and won't be confusing in the future. BUT I wouldn't say storing them in a single column as 3 digits is necessarily bad, just confusing. Save yourself the headaches later and do 3 columns.
Another option would be a separate table called com_options, for example, with an ID field and an options field. Store each of the different communication-option combinations in the options field along with a unique ID, and in your co_com table have an opt_id field referencing the ID in the com_options table. Then use an INNER JOIN to bring the two tables together.
If your DB is MySQL, then you can use SET datatype.
It's OK, don't worry -- sometimes we should denormalize tables :)
But if your DB isn't MySQL, you can still use this method, although the implementation won't be native to your DB. Also, bitwise logic on a big bunch of data works very well compared to the default normalized one-to-many relation, because it's more computer-oriented.
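A minimal sketch of the SET approach (assuming the same sms/email/fax preferences; the table name is a placeholder):
ALTER TABLE my_table ADD COLUMN co_com SET('sms', 'email', 'fax') NOT NULL DEFAULT '';
UPDATE my_table SET co_com = 'sms,fax' WHERE id = 1;
-- everyone who accepts sms (note: FIND_IN_SET cannot use an index, as discussed earlier)
SELECT * FROM my_table WHERE FIND_IN_SET('sms', co_com) > 0;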