Why are forum databases not in 3rd NF? - php

I was taking a look at some popular forums such as SMF, phpBB, or vBulletin, and I realized they are not in 3rd NF.
They have many NULL fields; for example, in an SMF forum a member row can have all of these columns set to NULL:
pm_ignore_list, messageLabels, personalText, websiteTitle, websiteUrl, location, ICQ, AIM, YIM, MSN, timeFormat, userTitle, notifyAnnouncements, secretQuestion, secretAnswer, validation_code, additionalGroups, smileySet
So... let's say 18 fields which can be NULL in any row of the table.
That's not 3rd NF...
Why do they do it? I am sure they know a lot about DBs...
Thanks.

The number one reason for denormalization is performance, which is a notorious problem with many discussion forums.
Originally, SQL was not designed to store hierarchical data easily, and there are many less-than-optimal schema designs trying to work around this limitation.
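The classic workaround is an adjacency list, where each post stores the id of its parent. A minimal sketch, with invented names rather than any particular forum's actual schema:

-- Each reply points at its parent; the thread starter has
-- parent_id = NULL.
CREATE TABLE posts (
    post_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    parent_id INT UNSIGNED NULL,
    body      TEXT NOT NULL,
    FOREIGN KEY (parent_id) REFERENCES posts (post_id)
);

-- One level of replies costs one self-join; every extra level
-- of nesting costs another join, which is the pain point.
SELECT p.post_id, r.post_id AS reply_id
FROM posts p
LEFT JOIN posts r ON r.parent_id = p.post_id
WHERE p.parent_id IS NULL;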

One or more of these reasons might apply.
The database wasn't "designed" at all; it gradually accumulated more and more columns as any programmer who worked on it decided to add one. (Programmers are often only minimally trained in database design.)
The "design", such as it is, is the result of committee decisions. (See above.)
The "design" was known to be not the best idea, but was implemented in order to get the software to ship. The underlying fantasy is usually to fix it properly before the next release. (Often never gets fixed.)
The table was denormalized for faster SELECT performance. In my experience, though, SELECT speed usually suffers more from a) the overuse of ID numbers and b) misunderstanding normalization than from high degrees of normalization.
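For comparison, a stricter normalization of those sparse member columns would move each optional group into its own table, so that a row exists only when the data does. A sketch with invented names, not SMF's actual schema:

CREATE TABLE members (
    member_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username  VARCHAR(80) NOT NULL
    -- only the always-present attributes live here
);

-- A member gets a row here only if they actually set a website,
-- so the NULL placeholders disappear, at the cost of a join.
CREATE TABLE member_websites (
    member_id     INT UNSIGNED NOT NULL PRIMARY KEY,
    website_title VARCHAR(255) NOT NULL,
    website_url   VARCHAR(255) NOT NULL,
    FOREIGN KEY (member_id) REFERENCES members (member_id)
);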

Related

MySQL many tables or few tables

I'm building a very large website. Currently it uses around 13 tables, and by the time it's done it should be about 20.
I came up with an idea to change the preferences table to use ID, Key, Value instead of many columns; however, I have recently thought I could also store other data in that table.
Would it be efficient / smart to store almost everything in one table?
Edit: Here is some more information. I am building a social network that may end up with thousands of users. MySQL Cluster will be used when the site is launched; for now I am testing on a development VPS, but everything will be moved to a dedicated server before launch. I know barely anything about NDB so this should be fun :)
This model is called EAV (entity-attribute-value).
It is usable for some scenarios; however, it's less efficient due to larger records, a larger number of joins, and the impossibility of creating composite indexes on multiple attributes.
Basically, it's used when entities have lots of attributes which are extremely sparse (rarely filled) and/or cannot be predicted at design time, like user tags, custom fields etc.
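A minimal sketch of the EAV version of such a preferences table (all names assumed for illustration):

-- One row per (user, preference) pair instead of one column
-- per preference.
CREATE TABLE user_preferences (
    user_id    INT UNSIGNED NOT NULL,
    pref_key   VARCHAR(64)  NOT NULL,
    pref_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, pref_key)
);

-- Reading one user's preferences is cheap:
SELECT pref_key, pref_value
FROM user_preferences
WHERE user_id = 42;

-- But filtering on several attributes at once costs one
-- self-join per attribute, which is where EAV gets slow:
SELECT a.user_id
FROM user_preferences a
JOIN user_preferences b ON b.user_id = a.user_id
WHERE a.pref_key = 'theme'    AND a.pref_value = 'dark'
  AND b.pref_key = 'language' AND b.pref_value = 'en';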
Granted, I don't know too much about large database designs, but from what I've seen, even extremely large applications store their data in a fairly small number of tables (20 GB per table).
For me, I would rather have more info in one table, as it means the data is not littered everywhere and I don't have to perform operations on multiple tables. Though one table can also get messy (usually, for me, each object has its own table, where an object is something from your application logic, like a User class or a BlogPost class).
I guess what I'm trying to say is: do whatever makes sense. Don't put information about the same thing in two different tables, and don't put information about two things in one table. Stick with one table describing one particular kind of object (this is difficult to explain, but if you do object-oriented programming, you should understand).
Nope. Preferences should be stored as they are (in the users table).
For example, private messages can't be stored in the users table ...
And you don't have to think about joining different tables ...
I would first say that 20 tables is not a lot.
In general (it's hard to say from the limited info you give), the key-value model is not as efficient speed-wise, though it can be more efficient space-wise.
I would definitely not do this. Basically, if you have a large set of data stored in a single table, you will see performance issues pretty fast when you're constantly querying the same table. Then think about the joins and the complexity of the queries you're going to need (depending on your site)... not a task I would personally like to undertake.
Using multiple tables splits the data into smaller sets, the resources required for each query are lower, and as an extra bonus it's easier to program!
There are some applications for doing this, but they are rare: more or less when you have a large table with a ton of columns and most of them aren't going to have a value.
I hope this helps :-)
I think 20 tables in a project is not a lot. I do see your point and interest in using EAV but I don't think it's necessary. I would stick to tables in 3NF with proper FK relationships etc and you should be OK :)
The simple answer is that 20 tables won't make it a big DB, and MySQL won't need any optimization for that. So focus on clean DB structures and normalization instead.

Which is better database design?

Given a site like StackOverflow, would it be better to create a num_comments column to store how many comments a submission has and then update it when a comment is made, or just to query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant, but the former would be more efficient. What does SO think?
Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but a write lock on the row containing the comment count.
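Side by side, the two approaches look roughly like this (table and column names are assumed for illustration):

-- Normalized: derive the count on demand.
SELECT COUNT(*) AS num_comments
FROM comments
WHERE submission_id = 123;

-- Denormalized: every new comment also touches the parent row,
-- which is the write lock mentioned above.
INSERT INTO comments (submission_id, body) VALUES (123, 'First!');
UPDATE submissions
SET num_comments = num_comments + 1
WHERE submission_id = 123;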
The former is not normalized but will produce better performance (assuming many more reads than writes).
The latter is more normalized, but will require more resources and hence be less performant.
Which is better boils down to application requirements.
I would suggest counting comment records. Although the other method would be faster, this lends itself to a cleaner database. Adding a count column would be a sort of data duplication, not to mention require an additional code step and write on every insert.
If you were to expect millions of comments, then you may want to pick the count column approach.
I agree with @Oded. It depends on the app requirements and also on how active the site is; however, here are my two cents as well.
I would try to avoid the extra writes that would have to be done by triggers or UPDATEs to the posts table when new comments are added.
If you are concerned about reporting the data then don't do that on a transactional system. Create a reporting DB and update that periodically.
The "correct" way to design is to use another table, join it and COUNT. This is consistent with what database normalization teaches.
The problem with normalization is that it cannot scale. There are only so many ways to skin a cat, so if you have millions of queries per day and a lot of them involve table X, the database performance is going below ground as the server also has to deal with concurrent writes, transactions, etc.
To deal with this problem, a common practice is sharding. Sharding has the side effect that the rows of a table are not stored in the same physical location, and a primary consequence of this is that you cannot JOIN anymore; how can you JOIN against half a table and receive meaningful results? And obviously, trying to JOIN against all partitions of a table and merge the results is going to be worse than the disease.
So you can see that not only is the alternative you're examining used in practice to achieve high performance, but there are even more radical steps that engineers can and do take.
Of course, unless you do have performance issues, sharding or even de-normalizing is just making your life harder for no tangible benefit.

Drupal convert nid/vid from int to bigint

I am busy with a project where the nid and vid values may reach their limit. I need a mechanism to modify current and future nid and vid data types from int to bigint.
I figured maybe there was a schema alter hook, or something similar. I see there is a hook called hook_schema_alter.
How reliable would it be to build a module that simply checks for nid and vid in the schema and modifies them to be bigint? Would this be a practical way of solving the problem? Will it work with all content types, including module-defined ones and CCK?
G.
As hook_schema_alter will only be fired on module install, rather than build a complex module that manages this automatically, you should pick a subset of modules that you know you will be using, install them, and manually update the schema.
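For reference, the manual change itself is one ALTER TABLE per affected table. A sketch against Drupal 6's core node tables; verify the column definitions against your actual schema and back up first:

ALTER TABLE node
    MODIFY nid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    MODIFY vid BIGINT UNSIGNED NOT NULL DEFAULT 0;

ALTER TABLE node_revisions
    MODIFY nid BIGINT UNSIGNED NOT NULL DEFAULT 0,
    MODIFY vid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- Any contributed module's tables that carry nid/vid columns
-- need the same widening, or inserts will fail once the values
-- pass the 32-bit range.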
If you are going to have 4 billion nodes (other poster said 2bn, but nid is unsigned which doubles the available range) you really should not be turning modules on and off at random. Your architecture should be rock solid and planned out well in advance.
Also, what's your use case for wanting that many nodes in Drupal? Any kind of database operation with that many rows is going to be very, very intensive, even when fully optimized and without the weight of the Drupal stack (and its love of expensive JOIN queries) on top of it.
Drupal will be fine for prototyping whatever you're building but by the time you hit xxx,000 nodes you'll already be spending the majority of your time hand-tuning everything for performance. You may get x,000,000 nodes if you have serious world-class expertise and funding. For anything more, you will probably want to start looking at offloading that data into a database system that is specifically optimized for huge datasets and then access it from Drupal as a service.
Take a look at Hadoop and Cassandra for examples of systems that can scale to billions of items (Google, Facebook, Twitter, etc. use them).
If your nids/vids are going to get past 4 billion, you might have some other issues to deal with before you care about this :) Also, since you are on D6: if this isn't, say, 200,000,000 pieces of content with 20 revisions each, but rather something else like stock price change information, I would store it in its own table.

Speed of data manipulation in PHP vs MySQL

Apologies in advance if this is a silly question but I'm wondering which might be faster/better in the following simplified scenario...
I've got registered users (in a users table) and I've got countries (in a countries table) roughly as follows:
USERS TABLE:
user_id (PK, INT) | country_id (FK, TINYINT) | other user-related fields...
COUNTRIES TABLE:
country_id (PK, TINYINT) | country_name (VARCHAR) | other country-related fields...
Now, every time I need to display a user's country, I need to do a MySQL join. However, I often need to do lots of other joins with regard to the users and the big picture seems quite "join-heavy".
I'm wondering what the pros & cons might be of taking the countries out of the database and sticking them into a class as an array, from which I could easily retrieve them with public method calls using country_id? Would there be a speed advantage/disadvantage?
Thanks a lot.
EDIT: Thanks for the all the views, very useful. I'll pick the first answer as the accepted solution although all contributions are valued.
Do you have a serious performance problem now? I recently went through a round of performance improvements on a php/mysql website I developed for my company. Certain areas were too slow, and it turned out a lot of the fault was with the queries themselves. I used timers to figure out which queries were slow, and I reorganized them (added indexes, etc.). In a few cases, it was faster to make two separate queries and join them in php (I had some pretty complicated joins).
Do not try to optimize until you know you have a problem. Figure out if you have a problem first by measuring it, and then if you need to rearrange your queries you will be able to know if you made an improvement.
It would ease stress on your MySQL server to have fewer JOIN statements, but not significantly so (there aren't that many countries in the world). However, you'd give that time back because you'd have to implement the join yourself in PHP, and since you're writing it yourself, you will probably write it less efficiently than the SQL statement, which means it will take more time. I would recommend keeping it in the SQL server; the advantages of moving it out are so few (and if the PHP instance and the MySQL instance are on the same box, there are no real advantages).
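For scale, the join under discussion is about as cheap as joins get; a sketch based on the schema in the question:

-- countries holds at most a few hundred rows, so it stays
-- resident in memory; each joined row costs one primary-key
-- lookup.
SELECT u.user_id, c.country_name
FROM users u
INNER JOIN countries c ON c.country_id = u.country_id;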
What you suggest should be faster. Granted, the join probably doesn't cost much, but looking it up in a dictionary should be just about free as far as compute power goes.
This is really just a trade off of memory for speed. The only downsides I could see would of course be the increased memory usage to store the country info and the fact that you would have to invalidate that cache if you ever update the countries table (which is probably not very often).
I don't think you'd gain anything from removing the join, as you'd have to iterate over all your result rows and manually lookup the country name, which I doubt would be quicker than MySQL can do.
I also would not consider such an approach, for the following reason: if you want to change the name of a country (say you've got a typo), you can do so just by updating a row in the database. But if the names of the countries are in your PHP code, you'd have to redeploy the code in order to make a change. I don't know PHP, but that might not be as straightforward as a DB change in a production system.
So for maintainability reasons, IMHO let the DB do the work.
The general rule in the database world is to NORMALIZE first (which results in more tables) and figure out performance issues later.
You will want to DENORMALIZE only for simplicity of code, not for performance. Use indexes and stored procedures; DBMSs are designed to optimize joins.
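For example, the users-countries join above needs nothing more than an index on the foreign key (index name assumed):

-- InnoDB creates this index automatically when country_id is
-- declared as a FOREIGN KEY; otherwise add it explicitly.
CREATE INDEX idx_users_country_id ON users (country_id);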
The reason not to "normalize as you go" is that you would have to modify the code you have already written nearly every time you modify the database design.

Need Help regarding Optimization

First of all, I am an autodidact, so I don't have great know-how about optimization and stuff. I created a social networking website.
It contains 29 tables right now. I want to extend its functionality by adding things like yellow pages, events, etc. to make it more like a portal.
Now the question is should I simply add the tables in the same database or should I use a different database?
And in case I create a new database, I also want users to be able to comment on business listings etc., just like reviews. So how will I be able to pull out entries, since the reviews will be in one database and the user details in the other?
Is it possible to join tables in 2 different databases?
You can join tables in separate databases by fully qualifying the name, but the real question is why you want the information in separate databases at all. If the information you are storing all relates together, it should go in one database unless there is a compelling (usually performance-related) reason against it.
The main reason I could see for separating your YellowPages out is if you wished to have one YellowPages accessible to several different, non-interacting websites. That said, presumably you wouldn't want cross-talk comments on the listings, so comments would need to be stored in the website databases rather than the YellowPages database. And that just sounds like a maintenance nightmare.
Don't Optimize until you need to.
If performance is ok, go for the easiest to maintain solution.
Monitor the performance of your site and if it starts to get slow, figure out exactly what is causing the slowdown and focus on performance on that section only.
You definitely can query and join tables from two different databases - you just need to specify the tables in a dbname.tablename format.
-- Fully qualified table names let one query span two databases
-- on the same MySQL server.
SELECT a.username, b.post_title
FROM dbOne.users AS a
INNER JOIN dbTwo.posts AS b USING (user_id);
However, it might make management and maintenance a lot more complicated for you. For example, you'll have to track which table belongs in which database, and will continually need to be adding the database names into all your queries. When it comes time to back up the data, your work will increase there as well. MySQL databases can easily contain hundreds of tables so I see no benefit in splitting it up - just stick with one.
You can prove an algorithm is as fast as it can possibly be. math.h and the C standard libraries have been optimized for half a century; another very advanced body of optimization work is Perl's data structures. Just avoid inlining everything only to ease debugging. There are conventions; try to keep every programmer on the team following the same convention. Which convention is "right" matters less than being consistent. Performance is the last thing you address; security and intelligibility are the top priorities. Read up on big-O notation: it depends on the software alone, while suboptimal software can still be faster than optimal software on different hardware. Totally bug-infested spaghetti code with no structure can respond many times faster than the most provably optimal software, depending on the hardware.
