I use PHP and mysql.
Let's say I have a database table with 10 000 rows. Which of the cases below is the best performance-wise?
Case 1
Two tables, products and categories.
SELECT * FROM products INNER JOIN categories ON products.category_id = categories.id
Products
id
name
category_id
Categories
id
name
Case 2
One table, products, containing all the data.
SELECT * FROM products
Products
id
name
category_name
Question(s)
Which of these cases has the best performance?
Roughly, would it take long to fetch data from 10 000 rows with a structure like this?
Any pitfalls with one of the cases?
From my perspective Case 1 is the "correct" way of doing it, but I would save some development time by using Case 2. Maybe performance too?
The first is the correct (i.e. SQLish) way of storing this data. It allows you to do the following:
Validate the category names as they are inserted and updated, using standard foreign key relationships.
Change a category name and have it affect all products.
Include other information about a category, such as short names, long descriptions, date added, and so on.
Performance is not the main consideration. The SQL engine takes care of performance through the use of fancy join algorithms and indexes. It does this so you can structure the data in the most sensible and maintainable way for your application.
That said, which performs better depends on a number of factors (how long the category names are, how many different names there are, how wide the product record is). Differences in performance between the two scenarios are probably not at all important in getting an application to work optimally.
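As a rough illustration of the points above, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only to keep the example self-contained). The tables follow Case 1; the sample rows are invented. It shows a category rename touching a single row and a foreign key rejecting an unknown category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        category_id INTEGER NOT NULL REFERENCES categories(id)
    );
    INSERT INTO categories (id, name) VALUES (1, 'drinks'), (2, 'snacks');
    INSERT INTO products (name, category_id) VALUES
        ('cola', 1), ('lemonade', 1), ('crisps', 2);
""")

# Rename one category: one row changes, every joined product sees the new name.
conn.execute("UPDATE categories SET name = 'drink' WHERE id = 1")

rows = conn.execute("""
    SELECT products.name, categories.name
    FROM products
    INNER JOIN categories ON products.category_id = categories.id
    ORDER BY products.id
""").fetchall()
print(rows)

# A product pointing at a category that does not exist is rejected.
try:
    conn.execute("INSERT INTO products (name, category_id) VALUES ('tea', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

In Case 2 the same rename would have to update every product row that carries the old name.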
Case 1 is better than Case 2 because Case 2 leaves you with duplicated data: the same value appears many times in the "category_name" field. This is bad for two reasons. First, the redundant data makes every row larger than it needs to be, which can slow things down. Second, it makes changes inefficient: renaming a category, say from drinks to drink, means updating every matching product row in Case 2 but only a single row in Case 1.
So to answer your first question, case 1 is the way to do it.
And as you can imagine from my answer to question one, Case 1 is faster than Case 2 because Case 2 carries unnecessary data.
As for your last question: as I explained in my answer to question one, one pitfall of Case 2 is that if you want to change a category name, you end up with far more work than in Case 1. To my knowledge, Case 1 has no pitfalls.
I think the question is database-design centric.
Now answer to your questions:
Which case will give the best performance?
Answer - Case 1.
Why?
It follows the basic SQL rule of normalization, which will help you in the long run. If you have more than 10,000 rows in future, a single table with redundant data will be tedious to handle.
If you index the key columns, join queries will execute faster over a large number of rows.
Two separate tables will help you in reducing data redundancy.
Why not case 2?
A single table violates the normalization rules, as your example shows.
Will it take long to get 10,000 rows with a structure like it?
With case 1: It will take a bit longer than Case 2 because a join is involved, but the difference will be negligible and can be reduced further with indexing.
With case 2: It will take a bit less time than Case 1, but its performance may suffer from the redundant data as the number of records grows.
Possible pitfalls?
With case 1 -
You may end up writing complex join queries for some difficult scenario.
With case 2 -
Data redundancy / duplication
Low performance in longer run
Poor readability
Hope this helps you.
Related
I'm creating a classic PHP blog and have a dilemma: one MySQL table or two?
In the two-table case, current posts would go in an actual table (100 rows max) and archived posts in an archive table (20,000 rows max).
Both tables have the same structure.
The actual table is queried very often and the archive table not so often.
But sometimes there are join and union queries covering both tables.
Logically, performance is much better on a smaller table, but is that enough reason in my case to create two tables instead of a single one?
There is also a third solution: a single table with two partitions, actual (100 rows) and archive (20,000 rows).
What to do?
You wrote:
Logically, performance is much better on a smaller table
With respect, your intuition about this is entirely incorrect for tables containing less than about ten million rows. A purpose of SQL is to allow rapid retrieval of a few items from among many. Thousands of years of programmer labor (not an exaggeration) have gone into making this kind of thing very fast. You won't be able to outsmart that collective effort.
Put your items in one table. If you need to distinguish between active and inactive items, make a column called active or some such thing, and retrieve them with WHERE active=1 or some such query term.
If you think you're having performance problems you can add indexes to your tables. Read this. https://use-the-index-luke.com/
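A minimal sketch of the single-table layout suggested above, again using Python's sqlite3 as a stand-in for MySQL. The posts table, its title column, and the index name are hypothetical; only the active column and the WHERE active=1 filter come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        active INTEGER NOT NULL DEFAULT 0
    );
    CREATE INDEX idx_posts_active ON posts (active);
""")

# 20,000 archived posts plus 100 current ones, all in the same table.
conn.executemany(
    "INSERT INTO posts (title, active) VALUES (?, ?)",
    [(f"post {i}", 1 if i >= 20000 else 0) for i in range(20100)],
)

active_count = conn.execute(
    "SELECT COUNT(*) FROM posts WHERE active = 1"
).fetchone()[0]
total_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(active_count, total_count)
```

Queries that need both current and archived posts simply drop the filter, so no UNION across tables is required.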
While designing databases, don't only think about how you will store the data; but think about all the possible cases:
How are you going to retrieve and update information?
Will there be different views and different permissions for different people?
In your case, the archive seems like a subset of the actual table. So a single table would be preferred, with a column for keeping track of which posts are archived.
I've got a MySQL INNODB table containing about 2,000,000 rows with 10 fields (table "cars"). It'll keep increasing progressively at a current rate of about 500,000 rows a year. It's a busy table getting different type of queries on average 2-3 times a second 24/7.
The situation right now is that I need to expand the information to include an INT field ("country_id"). But, this field will for at least 99 % of all rows be default "1".
My question is: Would there be any specific reasons to do either of the following solutions:
Add the INT field to the table and index it ("cars"."country_id")
Add a relational table ("car_countries") which includes the fields "car_id" and "country_id"
I set up these examples in the test environment and ran a few thousand iterations of queries against the tables to find this out:
Database/table size will increase by 19 % (~21 MB) due to the index
Queries will take on average 16 % longer (0.37717 secs vs 0.32431 secs for 1,000 queries each)
I've previously tried to keep tables filled with appropriate information for all fields and added relational tables where non-mandatory information was needed for a table but now I've read there's little gain in this as long as there's no need to have arrayed data (which MySQL doesn't handle (and PostgreSQL does)) in the table. In my example a specific car will never be sold to 2 countries so there will never be a need to add more countries to a specific car.
Almost everything is easier with solution 1, and disk space doesn't really matter. Should I still consider solution 2 anyway? If so, why?
Best regards,
/Thomas
The theoretical answer is that option 1 reflects your underlying relationships - a car can be sold to only one country, and therefore a "many to many" relationship (which option 2 suggests) is not appropriate. It would confuse future developers and pollute the data model.
The pragmatic answer is that option 2 doesn't appear to have a dramatic performance improvement today, and - crucially - it's likely to introduce complexity into your code. If 99% of the queries don't need the country data, you either have to write the query to include it (thus negating the performance benefit), or build nasty "if I need country THEN query = xxx ELSE query = yyy" logic.
Finally, apropos the indexing question - MySQL generally uses one index per table in a query, so unless you're writing a query where "country" is in the WHERE clause or being joined on, the new index is unlikely to have an impact.
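For reference, option 1 can be sketched as a single column with a default, so the 99 % of rows that belong to country 1 need no special handling. This uses Python's sqlite3 as a stand-in for MySQL/InnoDB, and the model column and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT NOT NULL);
    INSERT INTO cars (model) VALUES ('a'), ('b'), ('c');
    -- Option 1: one extra column, defaulting to the common value.
    ALTER TABLE cars ADD COLUMN country_id INTEGER NOT NULL DEFAULT 1;
""")

# Only the rare exceptions ever need to be written explicitly.
conn.execute("UPDATE cars SET country_id = 2 WHERE model = 'c'")

by_country = conn.execute(
    "SELECT country_id, COUNT(*) FROM cars GROUP BY country_id ORDER BY country_id"
).fetchall()
print(by_country)
```

Existing queries that do not care about the country keep working unchanged, which is the "no extra branching logic" point made above.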
Thanks to bwoebi, Raphaël Althaus, AgRizzo, Alfons and Ed Gibbs for the input to the question!
Short summary:
Since there can't be two countries to a car and there's only one extra field needed:
Go with Solution 1
Also, an index is probably not needed; check cardinality and performance for your specific scenario
/Thomas
Excuse me if this has been asked before, but I tried looking for something similar but couldn't find anything.
I have three tables: users, hobbies and user_hobbies (linking the first two). I want to calculate the similarity between two users based on their hobbies. For this, I need, first of all, two sets: user A's hobbies and user B's hobbies, which I can acquire with two simple queries. I have to calculate these two sets for other reasons too, in a PHP file, so they are available to me, in two arrays, for the next step:
I have to calculate their common hobbies (i.e. the intersection of the sets).
Idea #1: Having two arrays, I can calculate through some method the common elements.
Idea #2: I can make a third query (e.g. SELECT hobby FROM user_hobbies WHERE user_id IN ('uid_A', 'uid_B') GROUP BY hobby HAVING COUNT(*) = 2) and not bother doing it myself.
I suppose my question is about performance. Is it quicker to calculate manually or are mysql queries much faster?
You already have a normalized table to hold the user-hobbies table, so why not go with that?
Generally speaking, SQL will be much faster, at least for the first 100k records or so. Then you'll see a performance drop on queries that vet through columns that aren't indexed, or from queries that use the 'filesort' to order large datasets brought on by the ORDER BY keyword.
For scalability, I recommend using an inner join to narrow down the possibilities for starters.
Think critically about this. Are there any other columns not mentioned that could indicate that the user has more than one hobby? These are the things you consider when looking to scale your application.
Otherwise, you should be fine for starters, lest you should be optimizing prematurely.
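The two ideas from the question can be put side by side in a small sketch. It uses Python and sqlite3 rather than PHP and MySQL (an assumption made for a self-contained example), and it assumes each (user_id, hobby) pair is unique, which is what makes HAVING COUNT(*) = 2 mean "both users have it".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_hobbies (
        user_id INTEGER NOT NULL,
        hobby TEXT NOT NULL,
        PRIMARY KEY (user_id, hobby)
    );
    INSERT INTO user_hobbies VALUES
        (1, 'chess'), (1, 'hiking'), (1, 'piano'),
        (2, 'hiking'), (2, 'piano'), (2, 'sailing');
""")

def hobbies_of(user_id):
    rows = conn.execute(
        "SELECT hobby FROM user_hobbies WHERE user_id = ?", (user_id,)
    )
    return {hobby for (hobby,) in rows}

# Idea #1: intersect the two sets that are already in memory.
common_in_app = hobbies_of(1) & hobbies_of(2)

# Idea #2: let the database find hobbies listed for both users.
common_in_sql = {
    hobby
    for (hobby,) in conn.execute(
        """
        SELECT hobby FROM user_hobbies
        WHERE user_id IN (?, ?)
        GROUP BY hobby
        HAVING COUNT(*) = 2
        """,
        (1, 2),
    )
}
print(sorted(common_in_app), sorted(common_in_sql))
```

Since the question states that both sets are already loaded for other reasons, the in-memory intersection avoids an extra round trip; the SQL form is the one to reach for when the sets are not otherwise needed.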
I would go with Option #2.
In short: if your operation is NOT a set-based operation, it is better to move it out of MySQL or any RDBMS.
Because you cannot scale MySQL easily.
I'm using PHP and MySQL. I have records for:
events with various "event types" that are hierarchical (events can have multiple categories and subcategories, but there are a fixed amount of such categories and subcategories) (timestamped)
What is the best way to set up the table? Should I have a bunch of columns (30 or so) with enums for yes or no indicating membership in that category? or should I use MySQL SET datatype?
http://dev.mysql.com/tech-resources/articles/mysql-set-datatype.html
Basically I have performance in mind and I want to be able to retrieve all of the ids of the events for a given category. Just looking for some insight on the most efficient way to do this.
It sounds like you're chiefly concerned with performance.
A couple people have suggested splitting into 3 tables (category table plus either simple cross-reference table or a more sophisticated way of modeling the tree hierarchy, like nested set or materialized path), which is the first thing I thought when I read your question.
With indexes, a fully normalized approach like that (which adds two JOINs) will still have "pretty good" read performance. One issue is that an INSERT or UPDATE to an event now may also include one or more INSERT/UPDATE/DELETEs to the cross-reference table, which on MyISAM means the cross-reference table is locked and on InnoDB means the rows are locked, so if your database is busy with a significant number of writes you're going to have larger contention problems than if just the event rows were locked.
Personally, I would try out this fully normalized approach before optimizing. But, I'll assume you know what you're doing, that your assumptions are correct (categories never change) and you have a usage pattern (lots of writes) that calls for a less-normalized, flat structure. That's totally fine and is part of what NoSQL is about.
SET vs. "lots of columns"
So, as to your actual question "SET vs. lots of columns", I can say that I've worked with two companies with smart engineers (whose products were CRM web applications ... one was actually events management), and they both used the "lots of columns" approach for this kind of static set data.
My advice would be to think about all of the queries you will be doing on this table (weighted by their frequency) and how the indexes would work.
First, with the "lots of columns" approach you are going to need indexes on each of these columns so that you can do SELECT id FROM events WHERE CategoryX = TRUE. With the indexes, that is a super-fast query.
Versus with SET, you must use bitwise AND (&), LIKE, or FIND_IN_SET() to do this query. That means the query can't use an index and must do a linear search of all rows (you can use EXPLAIN to verify this). Slow query!
That's the main reason SET is a bad idea -- its index is only useful if you're selecting by exact groups of categories. SET works great if you'd be selecting categories by event, but not the other way around.
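The difference can be observed with a query plan. The sketch below is an approximation under stated assumptions: it uses Python's sqlite3 instead of MySQL, an integer bitmask column stands in for SET (MySQL stores SET values numerically as bits), and the column and index names are hypothetical. On MySQL itself the equivalent check is EXPLAIN, as mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        is_music INTEGER NOT NULL DEFAULT 0,  -- one flag column per category
        categories INTEGER NOT NULL DEFAULT 0 -- bitmask standing in for SET
    );
    CREATE INDEX idx_events_is_music ON events (is_music);
    CREATE INDEX idx_events_categories ON events (categories);
""")
MUSIC = 1  # bit 0 of the mask means "music"
conn.executemany(
    "INSERT INTO events (is_music, categories) VALUES (?, ?)",
    [(1, MUSIC | 4), (0, 2), (1, MUSIC), (0, 6)],
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

flag_sql = "SELECT id FROM events WHERE is_music = 1"
mask_sql = "SELECT id FROM events WHERE (categories & 1) != 0"

flag_plan = plan(flag_sql)  # equality on an indexed column
mask_plan = plan(mask_sql)  # bitwise test: every row has to be examined
print(flag_plan)
print(mask_plan)

flag_ids = [r[0] for r in conn.execute(flag_sql + " ORDER BY id")]
mask_ids = [r[0] for r in conn.execute(mask_sql + " ORDER BY id")]
print(flag_ids, mask_ids)
```

Both queries return the same events; the plans differ in that the flag column allows an index search while the bitwise test falls back to a scan, even though the mask column is indexed.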
The primary problem with the less-normalized "lots of columns" approach (versus fully normalized) is that it doesn't scale. If you have 5 categories and they never change, fine, but if you have 500 and are changing them, it's a big problem. In your scenario, with around 30 that never change, the primary issue is that there's an index on every column, so if you're doing frequent writes, those queries become slower because of the number of indexes that have to be updated. If you choose this approach, you might want to check the MySQL slow query log to make sure there aren't outlier slow queries because of contention at busy times of day.
In your case, if yours is a typical read-heavy web app, I think going with the "lots of columns" approach (as the two CRM products did, for the same reason) is probably sane. It is definitely faster than SET for that SELECT query.
TL;DR Don't use SET because the "select events by category" query will be slow.
It's good that the number of categories is fixed. If it wasn't you couldn't use either approach.
Check the Why You Shouldn't Use SET on the page you linked. I think that should give you a comprehensive guide.
I think the most important one is about indexes. Also, modifying a SET is slightly more complex.
The relationship between events and event types/categories is a many to many relationship, as echo says, but a simple xref table will leave you with a problem: If you want to query for all descendants of any given node, then you must make multiple recursive queries. On a deep tree, that will be very inefficient.
So when you say "retrieve all ids for a given category", if you do mean all descendants, then you want to use a Nested Set Model:
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
The Nested Set model makes writes a bit slower, but makes it very easy to retrieve subtrees:
To get the Televisions sub tree, you query for all categories left >= 2 and right <= 9.
Leaf nodes always have left = right - 1
You can find the count of descendants without pulling those rows: (right - left - 1)/2
Finding inheritance paths and depth is also very easy (single query stuff). See the article for full details.
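The three properties listed above can be tried on a small tree. This is a sketch using Python's sqlite3 as a stand-in for MySQL; the category rows are illustrative sample data chosen so that Televisions spans 2 to 9 as in the text, and the columns are named lft and rgt because LEFT and RIGHT are reserved words in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        lft INTEGER NOT NULL,
        rgt INTEGER NOT NULL
    );
    INSERT INTO category (name, lft, rgt) VALUES
        ('Electronics', 1, 20),
        ('Televisions', 2, 9),
        ('Tube', 3, 4),
        ('LCD', 5, 6),
        ('Plasma', 7, 8),
        ('Portable Electronics', 10, 19),
        ('MP3 Players', 11, 14),
        ('Flash', 12, 13),
        ('CD Players', 15, 16),
        ('2 Way Radios', 17, 18);
""")

# Whole Televisions subtree: everything nested inside (2, 9).
subtree = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM category WHERE lft >= 2 AND rgt <= 9 ORDER BY lft"
    )
]

# Leaf nodes have no room between their left and right values.
leaves = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM category WHERE rgt = lft + 1 ORDER BY lft"
    )
]

# Descendant count from the node's own row, without reading the children.
lft, rgt = conn.execute(
    "SELECT lft, rgt FROM category WHERE name = 'Televisions'"
).fetchone()
descendants = (rgt - lft - 1) // 2
print(subtree, leaves, descendants)
```

To list events for a category and all of its descendants, the subtree query would be joined to the event/category cross-reference table.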
You might try using a cross-reference (Xref) table, to create a many-to-many relationship between your events and their types.
create table event_category_event_xref
(
event_id int,
event_category_id int,
foreign key(event_id) references event(id),
foreign key (event_category_id) references event_category(id)
);
Event / category membership is defined by records in this table. So if you have a record with {event_id = 3, event_category_id = 52}, it means event #3 is in category #52. Similarly you can have records for {event_id = 3, event_category_id = 27}, and so on.
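A minimal sketch of querying that cross-reference table for "all event ids in a category", using Python's sqlite3 as a stand-in for MySQL. The event titles and category names are invented, and the composite primary key leading with event_category_id is an assumption added so that the lookup by category can use the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE event_category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE event_category_event_xref (
        event_id INTEGER NOT NULL REFERENCES event(id),
        event_category_id INTEGER NOT NULL REFERENCES event_category(id),
        PRIMARY KEY (event_category_id, event_id)
    );
    INSERT INTO event (id, title) VALUES (3, 'concert'), (4, 'lecture');
    INSERT INTO event_category (id, name) VALUES (27, 'outdoor'), (52, 'music');
    INSERT INTO event_category_event_xref VALUES (3, 52), (3, 27), (4, 27);
""")

def event_ids_in(category_id):
    """Return the ids of all events that belong to one category."""
    rows = conn.execute(
        "SELECT event_id FROM event_category_event_xref"
        " WHERE event_category_id = ? ORDER BY event_id",
        (category_id,),
    )
    return [event_id for (event_id,) in rows]

print(event_ids_in(52), event_ids_in(27))
```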
I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary to make a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I once tried this model on a table with 120+ attributes (growing by 5-10 every year) and about 100k+ rows added every 6 months. The indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's unfit for every situation) is that you need to put a primary key on (user_id, attrib) in that second table. Not knowing the potential length of attrib, you would usually allow a generous length, which inflates the indexes. In my case, attribs could have from 3 to 130 chars. The value column most certainly suffers from the same problem.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or, say, at least 50% of them) NEEDS to exist.
Also, as the OP suggests, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a group_concat(), given its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.
EDIT: Also, there are no normalization problems: either use lookup tables for attribute values or make them ENUM().
EDIT 2: Of course, one could say I should have a look-up table for possible attribute values (reducing index sizes), but I would then have to join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
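A minimal sketch of that three-column layout, using Python's sqlite3 as a stand-in for MySQL; the table name user_profile, the sample properties, and the search helper are hypothetical. It also makes visible the cost raised elsewhere in this thread: searching on several attributes needs one self-join per attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_profile (
        user_id INTEGER NOT NULL,
        property TEXT NOT NULL,
        value TEXT NOT NULL,
        PRIMARY KEY (user_id, property)
    );
    INSERT INTO user_profile VALUES
        (1, 'city', 'Oslo'),   (1, 'smoker', 'no'),
        (2, 'city', 'Oslo'),   (2, 'smoker', 'yes'),
        (3, 'city', 'Bergen'), (3, 'smoker', 'no');
""")

def search(criteria):
    """Build one self-join per searched attribute (criteria: dict name -> value)."""
    joins, params = [], []
    for i, (prop, value) in enumerate(criteria.items()):
        joins.append(
            f"JOIN user_profile p{i} ON p{i}.user_id = u.user_id"
            f" AND p{i}.property = ? AND p{i}.value = ?"
        )
        params += [prop, value]
    sql = (
        "SELECT u.user_id FROM (SELECT DISTINCT user_id FROM user_profile) u "
        + " ".join(joins)
        + " ORDER BY u.user_id"
    )
    return [uid for (uid,) in conn.execute(sql, params)]

print(search({"city": "Oslo", "smoker": "no"}))
```

With two criteria the generated query has two joins; a search over 30-40 attributes would have 30-40, which is worth measuring before committing to this design.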
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for mysql. So, before trying to solve a problem make sure you actually have it.
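One way to run such an experiment is sketched below. It is not the ab-based approach mentioned above (ab drives HTTP requests); instead it times queries directly, and it uses Python's sqlite3 as a stand-in for MySQL. The users table, its columns, and the helper names are all hypothetical.

```python
import random
import sqlite3
import time

def build_dummy_users(conn, n_rows, seed=0):
    """Fill a hypothetical users table with random profile data."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER, city_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users (age, city_id) VALUES (?, ?)",
        ((rng.randint(18, 80), rng.randint(1, 500)) for _ in range(n_rows)),
    )
    conn.execute("CREATE INDEX idx_users_city_age ON users (city_id, age)")

def time_query(conn, sql, params=(), repeats=100):
    """Return the average wall-clock seconds per execution of one query."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / repeats

conn = sqlite3.connect(":memory:")
build_dummy_users(conn, 100_000)  # raise towards 1_000_000 for a fuller test
avg = time_query(
    conn,
    "SELECT id FROM users WHERE city_id = ? AND age BETWEEN ? AND ?",
    (42, 25, 35),
)
print(f"{avg * 1000:.3f} ms per query")
```

The numbers only mean something when the dummy data and the queries resemble the real workload, so the typical searches from the application should be substituted for the example query.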
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2 table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an Ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)....
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
-- Count how many attributes each candidate shares exactly with the current user.
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value=current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However, this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you could get a very effective query:
-- Accept values within $tolerance of the current user's value (lower bound first).
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points scoring system:
-- Each matching attribute contributes 1/(1+d*d), where d is the difference in values.
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value)))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
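To make the scoring idea concrete, here is a small sketch in Python. It assumes the intended per-attribute score is 1/(1 + d²), where d is the difference between the two users' ordinal values, and that attributes further apart than the tolerance contribute nothing. The function name and the sample attributes are hypothetical.

```python
def match_score(current, candidate, tolerance):
    """Sum 1 / (1 + d*d) over shared attributes whose values differ by <= tolerance.

    current and candidate map attribute type -> ordinal value.
    """
    score = 0.0
    for attr_type, value in current.items():
        if attr_type not in candidate:
            continue
        d = candidate[attr_type] - value
        if abs(d) <= tolerance:
            score += 1.0 / (1.0 + d * d)
    return score

me = {"age_band": 3, "height_band": 5, "fitness": 2}
close = {"age_band": 3, "height_band": 4, "fitness": 2}
far = {"age_band": 1, "height_band": 7, "fitness": 2}

print(match_score(me, close, tolerance=2))
print(match_score(me, far, tolerance=2))
```

An exact match on an attribute adds 1, a difference of one step adds 0.5, and a difference of two steps adds 0.2, so candidates with many near matches rank above those with a few exact ones only when the near matches add up.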
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
-- Same scoring, restricted to the attribute types listed in one named subset.
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value)))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND s.subset_name=$required_subset
AND s.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space - then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to the stereotype.
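A rough sketch of the stereotype idea in Python, under several assumptions: each user is a tuple of ordinal attribute values, closeness is measured by squared Euclidean distance, and the three reference points are invented for the example. In practice the composite identifier would be stored as a column so candidates can be filtered by it first.

```python
def nearest_stereotype(user, stereotypes):
    """Return the name of the reference point closest to user.

    user is a tuple of ordinal attribute values; stereotypes maps a name to a
    tuple of the same length. Distance is squared Euclidean.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(stereotypes, key=lambda name: sq_dist(user, stereotypes[name]))

# Hypothetical reference points over three ordinal attributes.
STEREOTYPES = {
    "homebody": (1, 1, 4),
    "athlete": (5, 4, 1),
    "traveller": (3, 5, 3),
}

users = {101: (1, 2, 4), 102: (5, 5, 1), 103: (2, 1, 5)}
bucket = {uid: nearest_stereotype(vals, STEREOTYPES) for uid, vals in users.items()}

# Candidates for user 101 are only those in the same bucket.
candidates = [uid for uid, b in bucket.items() if b == bucket[101] and uid != 101]
print(bucket, candidates)
```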
Can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It sounds like it is not normalized right now, with 100 columns in 1 table.
Also - you can easily enforce referential integrity with foreign keys using transactions and INNODB engine.