I've been thinking about creating a forum in PHP so I did a little research to see what the standard is for the tables that people create in the database. On most websites I've looked up, they always choose to have one table for the threads and a second for the posts on the threads.
Having a table for the threads seems perfectly rational to me, but one table to hold all the posts on all the threads seems like a little too much. Would it be better to create a table for each thread that will hold that thread's posts, instead of sticking a few hundred thousand posts in one table?
The tables should represent the structure of the data in your database. If you have 2 objects, which in this case are your threads and your posts, you should put them in 2 tables.
Trust me, it will be a nightmare trying to figure out the right table to show for each post if you do it the way you're thinking. What would the SQL look like? Something like
SELECT *
FROM PostTable17256
and you would have to dynamically construct this query on each request.
However, by using 1 table, you can simply get a ThreadID and pass it as a variable to your query.
SELECT *
FROM Posts
WHERE ThreadID = $ThreadID
Relational databases are designed to have tables which hold lots of rows. You would probably be surprised what DBAs consider to be a "lot" by the way. A table with 1,000,000 rows is considered small to medium in most places.
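For reference, here is a minimal sketch of that two-table layout (the names and types are illustrative, not a standard):

-- one row per thread
CREATE TABLE Threads (
    ThreadID  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    Title     VARCHAR(255) NOT NULL,
    CreatedAt DATETIME NOT NULL,
    PRIMARY KEY (ThreadID)
);

-- one row per post, for all threads together
CREATE TABLE Posts (
    PostID    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ThreadID  INT UNSIGNED NOT NULL,
    Body      TEXT NOT NULL,
    CreatedAt DATETIME NOT NULL,
    PRIMARY KEY (PostID),
    KEY idx_thread (ThreadID),  -- makes WHERE ThreadID = ? fast
    FOREIGN KEY (ThreadID) REFERENCES Threads (ThreadID)
);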
Nope, nope, nope. Databases love huge tables. Splitting posts into multiple tables will cause many headaches.
Storing all the posts in one table is the best solution.
MySQL can easily hold millions of rows in a table.
Creating multiple tables will cause several problems.
For example, you will not be able to use JOIN across posts from different threads without generating SQL dynamically.
I'm creating a classic PHP blog and have a dilemma about a single-table versus two-table MySQL approach.
In the two-table case, current posts would be placed in an "actual" table (100 rows max) and archived posts in an "archive" table (20,000 rows max).
Both tables have the same structure.
The actual table is queried very often; the archive is queried much less often.
But sometimes there are JOIN and UNION queries covering both tables.
Logically, performance is much better on a smaller table, but is that, in my case, enough reason to create two tables instead of a single one?
There is also a third solution: a single table with two partitions, actual (100 rows) and archive (20,000 rows).
What to do?
You wrote:
Logically, performances are much better on a smaller table
With respect, your intuition about this is entirely incorrect for tables containing fewer than about ten million rows. The purpose of SQL is to allow rapid retrieval of a few items from among many. Thousands of person-years of programmer labor (not an exaggeration) have gone into making this kind of thing very fast. You won't be able to outsmart that collective effort.
Put your items in one table. If you need to distinguish between active and inactive items, make a column called active or some such thing, and retrieve them with WHERE active=1 or some such query term.
If you think you're having performance problems, you can add indexes to your tables. Read this: https://use-the-index-luke.com/
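As a concrete sketch of the one-table approach with an active flag (the table and column names here are assumptions, not from the question):

-- one table; a flag distinguishes current posts from archived ones
ALTER TABLE posts ADD COLUMN active TINYINT(1) NOT NULL DEFAULT 1;
CREATE INDEX idx_active ON posts (active);

-- current posts
SELECT * FROM posts WHERE active = 1;

-- archived posts: same table, same query shape
SELECT * FROM posts WHERE active = 0;

At 20,000 rows the index is hardly even necessary; it's there for when the table grows.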
When designing databases, don't think only about how you will store the data; think about all the possible cases:
How are you going to retrieve and update information?
Will there be different views and different permissions for different people?
In your case, the archive seems like a subset of the actual table. So a single table would be preferred, with a column for keeping track of which records are archived.
Let's say we have a MySQL table with user posts. We need to find all posts where user_id = 1 and show them in descending order by date. But the more posts there are in the table, the slower the search becomes, right? What if there are 10,000 posts in the table and we need to find just 3 of them? How long will it take? How do we optimize? Can you please explain how to design the data correctly, or just the general concept?
This is a bit long for a comment.
A table with 10,000 rows is not a large table. SQL databases regularly handle queries on tables tens, hundreds, even thousands of times bigger than that.
You need to learn about indexes and partitioning. A good place to start is with the MySQL documentation on these topics.
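As a concrete starting point, a composite index that covers both the filter and the sort is the usual fix (the table and column names here are assumed from the question):

-- one index serves both WHERE user_id = ? and ORDER BY post_date DESC
CREATE INDEX idx_user_date ON posts (user_id, post_date);

SELECT *
FROM posts
WHERE user_id = 1
ORDER BY post_date DESC
LIMIT 3;

-- EXPLAIN shows whether the index is actually being used
EXPLAIN SELECT * FROM posts WHERE user_id = 1 ORDER BY post_date DESC LIMIT 3;

With that index in place, finding 3 posts out of 10,000 (or 10 million) is effectively instant, because MySQL reads only the matching index entries instead of scanning the whole table.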
I am working on a web application whose main functionality will be to present some data to the user. However, there are several types of data, and each of them has to be presented in a different way.
For example I have to list 9 results - 3 books, 3 authors and 3 files.
Book is described with (char)TITLE, (text)DESCRIPTION.
Author is described with (char)TITLE, (char)DESCRIPTION.
File is described with (char)URL.
Moreover, every type has fields like ID, DATE, VIEWS etc.
Books and Authors are presented with simple HTML code; Files use an external reader embedded in the website.
Should I build three different tables and use JOINs when fetching the data, or build one table and store all the types in there? Which approach is more efficient?
Additional info: there are going to be really huge numbers of records.
The logical way of doing this is keeping things separate, which follows the 3NF rules of database design. This gives more flexibility when retrieving different kinds of results, especially when there is a huge amount of data. Putting everything in a single table is absolutely bad DB practice.
That depends on the structure of your data.
If you have 1:1 relationships, say one book has one author, you can put the records in one row. If one book has several authors, or one author has several books, you should set up separate books and authors tables and link them with a table author_has_books that holds both foreign keys. This way you won't store duplicate data, and you avoid inconsistencies.
More information about db normalization here:
http://en.wikipedia.org/wiki/Database_normalization
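A minimal sketch of the many-to-many layout described above (the column types are assumptions):

CREATE TABLE books (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE authors (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
);

-- link table: one row per (author, book) pair
CREATE TABLE author_has_books (
    author_id INT UNSIGNED NOT NULL,
    book_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (author_id, book_id),
    FOREIGN KEY (author_id) REFERENCES authors (id),
    FOREIGN KEY (book_id) REFERENCES books (id)
);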
Separate them and create a relationship. That way, when you start to get a lot of data, you'll notice a performance boost because you are only fetching 3 fields at a time (i.e., when you are just looking at a book) instead of 7.
I'm using PHP and MySQL. I have records for:
events with various "event types" that are hierarchical (events can have multiple categories and subcategories, but there is a fixed number of such categories and subcategories) (timestamped)
What is the best way to set up the table? Should I have a bunch of columns (30 or so) with enums for yes or no indicating membership in that category? Or should I use the MySQL SET datatype?
http://dev.mysql.com/tech-resources/articles/mysql-set-datatype.html
Basically I have performance in mind and I want to be able to retrieve all of the ids of the events for a given category. Just looking for some insight on the most efficient way to do this.
It sounds like you're chiefly concerned with performance.
A couple people have suggested splitting into 3 tables (category table plus either simple cross-reference table or a more sophisticated way of modeling the tree hierarchy, like nested set or materialized path), which is the first thing I thought when I read your question.
With indexes, a fully normalized approach like that (which adds two JOINs) will still have "pretty good" read performance. One issue is that an INSERT or UPDATE to an event may now also include one or more INSERT/UPDATE/DELETEs against the cross-reference table, which on MyISAM means the cross-reference table is locked and on InnoDB means the rows are locked, so if your database is busy with a significant number of writes you're going to have larger contention problems than if just the event rows were locked.
Personally, I would try out this fully normalized approach before optimizing. But, I'll assume you know what you're doing, that your assumptions are correct (categories never change) and you have a usage pattern (lots of writes) that calls for a less-normalized, flat structure. That's totally fine and is part of what NoSQL is about.
SET vs. "lots of columns"
So, as to your actual question "SET vs. lots of columns", I can say that I've worked with two companies with smart engineers (whose products were CRM web applications ... one was actually events management), and they both used the "lots of columns" approach for this kind of static set data.
My advice would be to think about all of the queries you will be doing on this table (weighted by their frequency) and how the indexes would work.
First, with the "lots of columns" approach you are going to need indexes on each of these columns so that you can do SELECT * FROM events WHERE CategoryX = TRUE. With the indexes, that is a super-fast query.
Versus with SET, you must use bitwise AND (&), LIKE, or FIND_IN_SET() to do this query. That means the query can't use an index and must do a linear search of all rows (you can use EXPLAIN to verify this). Slow query!
That's the main reason SET is a bad idea -- its index is only useful if you're selecting by exact groups of categories. SET works great if you'd be selecting categories by event, but not the other way around.
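To make the comparison concrete, here are the two query shapes side by side (the column names are made up; run EXPLAIN on both to see the difference):

-- "lots of columns": an index on the flag column makes this fast
SELECT id FROM events WHERE category_x = 1;

-- SET column: neither form can use an index, so every row is scanned
SELECT id FROM events WHERE FIND_IN_SET('category_x', categories) > 0;
SELECT id FROM events WHERE categories & 4;  -- 4 assumes category_x is the third SET member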
The primary problem with the less-normalized "lots of columns" approach (versus fully normalized) is that it doesn't scale. If you have 5 categories and they never change, fine, but if you have 500 and are changing them, it's a big problem. In your scenario, with around 30 that never change, the primary issue is that there's an index on every column, so if you're doing frequent writes, those queries become slower because of the number of indexes that have to be updated. If you choose this approach, you might want to check the MySQL slow query log to make sure there aren't outlier slow queries caused by contention at busy times of day.
In your case, if yours is a typical read-heavy web app, I think going with the "lots of columns" approach (as the two CRM products did, for the same reason) is probably sane. It is definitely faster than SET for that SELECT query.
TL;DR Don't use SET because the "select events by category" query will be slow.
It's good that the number of categories is fixed. If it wasn't you couldn't use either approach.
Check the "Why You Shouldn't Use SET" section on the page you linked. I think that should give you a comprehensive guide.
I think the most important one is about indexes. Also, modifying a SET is slightly more complex.
The relationship between events and event types/categories is a many to many relationship, as echo says, but a simple xref table will leave you with a problem: If you want to query for all descendants of any given node, then you must make multiple recursive queries. On a deep tree, that will be very inefficient.
So when you say "retrieve all ids for a given category", if you do mean all descendants, then you want to use a Nested Set Model:
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
The Nested Set model makes writes and updates a bit slower, but makes it very easy to retrieve subtrees:
To get the Televisions sub tree, you query for all categories left >= 2 and right <= 9.
Leaf nodes always have left = right - 1
You can find the count of descendants without pulling those rows: (right - left - 1)/2
Finding inheritance paths and depth is also very easy (single query stuff). See the article for full details.
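A sketch of the subtree query, using the lft/rgt column names from that article:

-- all descendants of the 'Televisions' node
SELECT child.*
FROM categories AS parent
JOIN categories AS child
  ON child.lft BETWEEN parent.lft AND parent.rgt
WHERE parent.name = 'Televisions';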
You might try using a cross-reference (Xref) table, to create a many-to-many relationship between your events and their types.
create table event_category_event_xref
(
    -- one row per (event, category) pair
    event_id          int,
    event_category_id int,
    foreign key (event_id) references event (id),
    foreign key (event_category_id) references event_category (id)
);
Event / category membership is defined by records in this table. So if you have a record with {event_id = 3, event_category_id = 52}, it means event #3 is in category #52. Similarly you can have records for {event_id = 3, event_category_id = 27}, and so on.
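Given that table, fetching all events in a category is a single JOIN:

-- all events in category #52
SELECT e.*
FROM event AS e
JOIN event_category_event_xref AS x ON x.event_id = e.id
WHERE x.event_category_id = 52;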
This question has risen on many different occasions for me but it's hard to explain without giving a specific example. So here goes:
Let's imagine for a while that we are creating an issue tracker database in PHP/MySQL. There is a "tasks" table. Now you need to keep track of the people who are associated with a particular task (have commented on it or whatnot). These persons will get an email when a task changes.
There are two ways to solve this type of situation. One is to create a separate table task_participants:
CREATE TABLE IF NOT EXISTS `task_participants` (
`task_id` int(10) unsigned NOT NULL,
`person_id` int(10) unsigned NOT NULL,
UNIQUE KEY `task_id_person_id` (`task_id`,`person_id`)
);
And to query this table with SELECT person_id FROM task_participants WHERE task_id = 'XXX'.
If there are 5,000 tasks and each task has 4 participants on average (the reporter, the subject for whom the task brought benefit, the solver, and one commenter), then the task_participants table would hold 5,000 * 4 = 20,000 rows.
There is also another way: create a field in the tasks table and store a serialized array (JSON or PHP serialize()) of person_ids. Then there would be no need for that 20,000-row table.
What are your comments, which way would you go?
Go with the multiple records. It promotes database normalization. Normalization is very important. Updating a serialized value is no fun to maintain. With multiple records I can let the database do the work with INSERT, UPDATE and DELETE. Also, you are limiting your future joins by having a multivalued column.
Definitely do the cross reference table (the first option you listed). Why?
First of all, do not worry about the size of the cross reference table. Relational databases would have been out on their ear decades ago if they could not handle the scale of a simple cross reference table. Stop worrying about 20K or 200K records, etc. In fact, if you're going to worry about something like this, it's better to start worrying about why you've chosen a relational DB instead of a key-value DB. After that, and only when it actually starts to be a problem, then you can start worrying about adding an index or other tuning techniques.
Second, if you serialize the association info, you're probably opaque-ifying a whole dimension of your data that only your specialized JSON-enabled app can query. Serializing data into a single cell of a table typically only makes sense if the embedded structure is (a) data you would never need to query outside your app, (b) not something you need to query the internals of efficiently (e.g., avg count(*) of people with tasks), and (c) just something you either do not have time to model out properly or that is in a prototypical state. I say "probably" above because it's not usually the case that data worth persisting fits these criteria.
Finally, by serializing your data, you are now forced to solve any computation on that serialized data in your code, which is just a big waste of time that you could have spent doing something more productive. Your database already can slice and dice that data any way you need, yet because your data is not in a format it understands, you need to now do that in your code. And now imagine what happens when you change the serialized data structure in V2.
I won't say there aren't use cases for serializing data (I've done it myself), but based on your case above, this probably isn't one of them.
There are a couple of great answers already, but they explain things in rather theoretical terms. Here's my (essentially identical) answer, in plain English:
1) 20k records is nothing to MySQL. If it gets up into the 20 million record range, then you might want to start getting concerned - but it still probably won't be an issue.
2) OK, let's assume you've gone with concatenating all the people involved with a ticket into a single field. Now... Quick! Tell me how many tickets Alice has touched! I have a feeling that Bob is screwing things up and Charlie is covering for him - can you get me a list of tickets that they both worked on, divided up by who touched them last?
With a separate table, MySQL itself can find answers to all kinds of questions about who worked on what tickets and it can find them fast. With everything crammed into a single field, you pretty much have to resort to using LIKE queries to find the (potentially) relevant records, then post-process the query results to extract the important data and summarize it yourself.
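With the task_participants table from the question, those questions become plain SQL (assuming a people table with a name column, which is hypothetical here):

-- how many tickets has Alice touched?
SELECT COUNT(*)
FROM task_participants tp
JOIN people p ON p.id = tp.person_id
WHERE p.name = 'Alice';

-- tickets that both Bob and Charlie worked on
SELECT tp1.task_id
FROM task_participants tp1
JOIN task_participants tp2 ON tp2.task_id = tp1.task_id
JOIN people p1 ON p1.id = tp1.person_id AND p1.name = 'Bob'
JOIN people p2 ON p2.id = tp2.person_id AND p2.name = 'Charlie';

With the serialized field, neither question can be answered without scanning and deserializing every row.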