I'm wondering which method below is faster?
Suppose:
Maximum 10,000 products, each product has 1 user id, 1 cat id, 3 extra fields, and 5 images.
90-99% of users come to the website just for the information, not to post.
Method 1: get all the data from a single table with a query that has no JOIN:
SELECT * FROM products WHERE ...
Table: products
id | name | poster_name | cat_name | code_1 | code_2 | content |
dimensions | contact | message | images |
Method 2: get all data from multiple tables with "JOIN":
SELECT ... FROM products
LEFT JOIN cats ON products.cat_id = cats.id
LEFT JOIN users ON ....
table: products
id | name | code_1 | code_2 | content | cat_id | poster_id |
table: cats
id | cat_name |
table: users
id | poster_name |
table: extra
id | product_id | extra_info | extra_data |
table: images
id | product_id | img_src |
The first method will usually be faster for reads, while the second will help you maintain data integrity and will usually be faster for writes.
The transition from the latter form to the former is called denormalization; it is commonly used in data warehouses, while operational ("live") databases usually prefer the latter form (the second method).
You have not finished asking the question. Method 2 has no WHERE, so it will deliver 10K rows, plus it has to do 20K lookups into the other tables. That makes it the loser.
Since your real question is about performance, let's discuss the WHERE clause. With it, we can optimize so that the desired data tends to be in RAM.
Back to your question... JOIN is probably the 'right' way to do it. And it is not that much of a performance hit assuming you have the proper indexes. So provide SHOW CREATE TABLE (even if tentative) and complete WHERE clauses.
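For illustration, here is a minimal, tentative sketch of what that might look like; the column types, index names, and the WHERE filter are assumptions, not taken from the question:
-- Tentative sketch only; types and index names are placeholders
CREATE TABLE products (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(255) NOT NULL,
    code_1    VARCHAR(50),
    code_2    VARCHAR(50),
    content   TEXT,
    cat_id    INT UNSIGNED NOT NULL,
    poster_id INT UNSIGNED NOT NULL,
    INDEX idx_cat (cat_id),        -- lets the JOIN to cats use an index lookup
    INDEX idx_poster (poster_id)   -- same for the JOIN to users
) ENGINE=InnoDB;

-- With those indexes, the joined query stays cheap:
SELECT p.name, c.cat_name, u.poster_name
FROM products AS p
LEFT JOIN cats  AS c ON c.id = p.cat_id
LEFT JOIN users AS u ON u.id = p.poster_id
WHERE p.cat_id = 42;               -- placeholder filter; the real WHERE clause was not given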
Don't over-normalize. For example, do not normalize datetime or any other 'continuous' values.
Normalization can save space, especially in huge tables (e.g., millions or billions of rows, with large, frequently repeated strings being normalized). This is especially helpful when the table is too big to stay cached in RAM.
Related
When I started designing my application's database schema a few months ago, I was told not to store the same data or calculated data in more than one place in the database (normalization). If I did, I would create a source of bugs where I update the data in one place and leave the other without updating it. So I made an orders table and an orderDetail table, something like this:
-- orders table
+-----+---------+----------+
| ID  | clintID | date     |
+-----+---------+----------+
| 1   | 1       |2018-02-22|
| 2   | 1       |2018-02-23|
| 3   | 2       |2018-02-24|
+-----+---------+----------+
-- orderDetail table
+-----+---------+------------+----------+----------+
| ID  | orderID | itemNumber | quantity | unitPrice|
+-----+---------+------------+----------+----------+
| 1   | 1       | 12345      | 3        | 100.75   |
| 2   | 1       | 12346      | 3        | 100.75   |
| 3   | 2       | 12347      | 3        | 100.75   |
| 4   | 2       | 12345      | 3        | 100.75   |
| 5   | 3       | 12347      | 3        | 100.75   |
| 6   | 3       | 12345      | 3        | 100.75   |
+-----+---------+------------+----------+----------+
And to make the queries easier for me, I made a view "allOrdersSummary" like this:
-- allOrdersSummary
SELECT
orders.*, SUM(orderDetail.quantity * orderDetail.unitPrice) totalAmount
FROM orders INNER JOIN orderDetail ON orders.ID = orderDetail.orderID
GROUP BY orders.ID;
and I used this view later for my queries, but now I have started getting the MAX_JOIN_SIZE error.
So I thought of saving the calculated total order amount in the orders table (ID, clintID, date, totalAmount), and whenever I change something in the orderDetail table I would update the calculated totalAmount column in the orders table. I don't know if this is good or bad!
This problem (I don't know if it is even considered a problem) comes up many times; for example, to know how many unread messages the client making the request has, I have to run something like SELECT COUNT(*) AS unread FROM messages WHERE `to` = ? AND isRead = 0 every time.
A) Should I make another column for the calculated totalAmount in the orders table, or is it a normal thing in databases to calculate totalAmount from the orderDetail table every time I need it?
B) If you recommend adding another column to the orders table, what is the best way to update it every time a change happens in the orderDetail table? Should I update it in the PHP layer whenever I update the orderDetail table, or is this something that needs a stored procedure?
Yes, it is normal to store pre-calculated values, derived from other data in the database, in the database. But not necessarily for the reason you mention; I have never had a problem with MAX_JOIN_SIZE.
The main, and probably only, reason for storing calculated values is speed. So you do it for values that don't change that often and that may be used in queries that touch a lot of data and would therefore be too slow without them.
For instance: If you want to know the average value of all the orders in your database the query would be a lot faster if you already have the order totals.
Why, and how, you update the values is completely up to you. However, you have to be consistent about it. If you use the MVC pattern, it would make sense to integrate it into the controller. In simple terms: whenever a form is submitted that could change one of the values from which the pre-calculated value is computed, you need to recompute it.
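As a rough sketch of that recompute step (assuming a totalAmount column has been added to orders, as considered in the question; the exact statement is mine, not part of this answer), you would run something like this from the controller, or from a trigger, right after any insert, update, or delete on orderDetail:
-- Recompute the stored total for the one order that was touched
UPDATE orders
SET totalAmount = (
    SELECT COALESCE(SUM(quantity * unitPrice), 0)
    FROM orderDetail
    WHERE orderDetail.orderID = orders.ID
)
WHERE orders.ID = ?;   -- the ID of the order whose details just changed
Running it inside the same transaction as the orderDetail change keeps the stored value consistent.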
This is a clear demonstration of a case where 'normalization' is not entirely maintained. It's not really pretty, but sometimes worth it. You could, of course, argue that the calculated value represents 'new' information and therefore does not violate 'normalization'.
You have an "inflate-deflate" problem.
Your view JOINs the two tables, building a much larger intermediate table, then GROUPs BY to shrink it back to one row per row of the original (orders) table.
This formulation avoids the problem:
SELECT *,
( SELECT SUM(quantity * unitPrice)
FROM orderDetail WHERE orderID = orders.ID
) AS totalAmount
FROM orders;
Please let me know how your experience is with this one. It is one of the simplest examples of the inflate-deflate problem.
I'm wondering if there's a best practice when it comes to having multiple many-to-many relationships between the same tables.
Currently I have a many-to-many relationship between user and item for the items that users have created.
---------------------
| user_id | item_id |
---------------------
| 1       | 3       |
---------------------
I'd like to create another junction table for user and item to reference their watchlist. Should I create separate many-to-many tables?
user_item_inventory   user_item_watchlist
--------------------- ---------------------
| user_id | item_id | | user_id | item_id |
--------------------- ---------------------
| 1       | 3       | | 2       | 3       |
--------------------- ---------------------
OR should I create one many-to-many table that has a many-to-one relationship with a user_item_type table?
user_item                       user_item_type
------------------------------- ------------------
| user_id | item_id | type_id | | id | name      |
------------------------------- ------------------
| 1       | 3       | 1       | | 1  | inventory |
------------------------------- ------------------
| 2       | 3       | 2       | | 2  | watchlist |
------------------------------- ------------------
While the decision ultimately rests on just how conceptually different the inventory and the watchlist are, based on prior experience I would suggest using separate tables.
Currently, you have no additional data attached to either the inventory or watchlist, but that will not necessarily be the case in the future. Without knowing more details about the inventory and watchlist, it's hard to make predictions, but as soon as you want to start tracking additional data on an inventory relation vs. a watchlist relation, having separate tables will make things much simpler. As soon as you want to add columns that only pertain to one of your types of association, you'll want to have separate tables.
As has been pointed out in the other answer, having separate tables is certainly faster from a pure data storage and retrieval standpoint: you'll have one less column/index to populate and filter by. And if your combined inventory/watchlist associations table becomes "large", those extra type_id references will start to add up to something significant. (It won't matter at smaller sizes, but besides the obvious disk storage factors, more data requires more memory and more cache to manage, especially when indexes are involved.)
Separate tables would be a complication if you need to know all the items a user has an interest in (the combination of inventory, watchlist, and any other similar tables you might create), but if that is an actual need, then you could generate that list easily with a UNION query on all of the tables. (You could even create another table that contains a copy of all the user-item references as a performance enhancement if necessary.)
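For example, a minimal sketch of that UNION (the user_id value is just a placeholder):
-- All distinct items user 1 has any interest in, across both junction tables
SELECT item_id FROM user_item_inventory WHERE user_id = 1
UNION                              -- UNION (not UNION ALL) removes duplicate item_ids
SELECT item_id FROM user_item_watchlist WHERE user_id = 1;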
It depends on how you want to use the data. If the two lists are logically separate and there is no relation between them, then you can use two tables.
If you're going to display inventory and watchlist records together in some module and only need to distinguish the type of record, then you can use one table with a record type.
It is clear that, for the database engine, it is better to do
SELECT * FROM user_item_inventory
rather than
SELECT * FROM user_item WHERE type_id = 1
I have a table with all my invoice items as packages:
Table: invoice_items
invoice_item_id | package_id | addon_1 | addon_2 | addon_3 | ...
----------------|------------|---------|---------|---------|----
1               | 6          | 2       | 5       | 3       |
Then my other table:
Table: addons
addon_id | addon_name   | addon_desc               |
---------|--------------|--------------------------|
1        | Dance Lights | Brighten up the party... |
2        | Fog Machine  | Add some fog for an e... |
Instead of taking up space storing the addon name in my invoice_items table, I'd like to just include the addon_id in the addon_1, addon_2, etc columns.
How do I then get the name of the addon when doing a query for invoice_item rows?
Right now I just have it programmed into the page that if addon_id == 1, echo "Dance Lights", etc but I'd like to do it in the query. Here is my current query:
$invoice_items_SQL = "
SELECT invoice_items.*, packages.*
FROM `invoice_items`
INNER JOIN packages ON invoice_items.invoice_item_id = packages.package_id
WHERE `event_id` = \"$event_id\"
";
So I'm able to do this with packages, but only because there's just one package_id per row, but there are up to 9 addons :(
The most direct way of doing it is to join onto the table multiple times. That's a bit naff though because you'll write almost the same thing 9 times.
Another, better way would be to restructure your tables - you need another table with 2 data columns: invoice_id and addon_id. You then need either an auto-inc primary column, or use both of those existing columns as a dual primary key. So this is a many-to-many junction table.
From there you can query without having 9 repetitive joins, but you will get a row for each invoice item for every addon it has (so if it has three addons it will appear three times in the results). From there you can use GROUP_CONCAT to concatenate the names of the addons into a single field so that you only get one row per invoice item.
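Here is a rough sketch of that design; the junction table name and column types are my assumptions layered on top of the existing invoice_items and addons tables:
-- Junction table: one row per (invoice item, addon) pair
CREATE TABLE invoice_item_addons (
    invoice_item_id INT UNSIGNED NOT NULL,
    addon_id        INT UNSIGNED NOT NULL,
    PRIMARY KEY (invoice_item_id, addon_id)   -- the dual primary key mentioned above
);

-- One result row per invoice item, with all addon names in a single field
SELECT invoice_items.invoice_item_id,
       GROUP_CONCAT(addons.addon_name SEPARATOR ', ') AS addon_names
FROM invoice_items
LEFT JOIN invoice_item_addons
       ON invoice_item_addons.invoice_item_id = invoice_items.invoice_item_id
LEFT JOIN addons
       ON addons.addon_id = invoice_item_addons.addon_id
GROUP BY invoice_items.invoice_item_id;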
Below is a gross oversimplification of two very large tables I'm working with.
campaign table
| id | uid | name | contact | pin | icon |
| 1  | 7   | bob  | ted     | y6w | yuy  |
| 2  | 7   | ned  | joe     | y6e | ygy  |
| 3  | 6   | sam  | jon     | y6t | ouy  |
records table
| id | uid | cid | fname | lname | address | city | phone |
| 1  | 7   | 1   | lars  | jack  | 13 main | lkjh | 55555 |
| 2  | 7   | 1   | rars  | jock  | 10 maun | oyjh | 55595 |
| 3  | 7   | 1   | ssrs  | frck  | 10 eaun | oyrh | 88595 |
The page loops through the records table and prints the results to an HTML table. The existing code, for some reason, does a separate query for each record: "select name from campaign where id = $res['cid']". I'd like to get rid of the second query and do some kind of join, but what is the most effective way to do it?
I need to
SELECT * FROM records
and also
SELECT name FROM campaigns WHERE campaigns.id = records.cid
in a single query.
How can I do this efficiently?
Simply join the two tables. You already have the required WHERE condition. Select all columns from one but only one column from the other. Like this:
SELECT records.*, campaigns.name
FROM records, campaigns
WHERE campaigns.id = records.cid
Note that a record row without matching campaign will get lost. To avoid that, rephrase your query like this:
SELECT records.*, campaigns.name
FROM records LEFT JOIN campaigns
ON campaigns.id = records.cid
Now you'll get NULL names instead of missing rows.
The "most efficient" part is where the answer becomes very tricky. Generally a great way to do this would be to simply write a query with a join on the two tables and happily skip away singing songs about kittens. However, it really depends on a lot more factors: how big are the tables? Are they indexed nicely on the right columns for the query? When the query runs, how many records are generated? Are the results being ordered in the query?
This is where it starts being a little bit of an art rather than a science. Have a look at the EXPLAIN plan, understand what is happening, and look for ways to make it more efficient or simpler. Sometimes running two subqueries in the FROM clause, each generating only a subset of the data, is much more efficient than joining the entire tables and selecting the data you need from there.
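For illustration only (the uid filter is a placeholder, not something from the question):
-- Inspect the plan for the straightforward join
EXPLAIN
SELECT records.*, campaigns.name
FROM records
JOIN campaigns ON campaigns.id = records.cid;

-- Or pre-filter in a derived table so the join only touches a subset of rows
SELECT r.*, campaigns.name
FROM (SELECT * FROM records WHERE uid = 7) AS r
JOIN campaigns ON campaigns.id = r.cid;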
Answering this question in more detail, while hoping to be accurate for your particular case, would need a LOT more information.
If I were to guess at some of these things in your database, I would suggest a simple join if your tables have fewer than a few million rows and your database performance is decent. If you are re-running the EXACT same query multiple times, even a slow query can be cached by MySQL VERY nicely, so look at that as well. I have an application running on a terribly specc'ed machine, where I wrote a cron job that simply re-runs a few queries after new data is loaded overnight, and all my users think the queries are instant because I make sure they are cached. Sometimes it is the little tricks that really pay off.
Lastly, if you are just starting out with SQL, or aren't as familiar with it as you think you might eventually get, you might want to read this Q&A that I wrote, which covers a lot of basic to intermediate topics on queries, such as joins, subqueries, aggregate queries, and a lot more that is worth knowing.
You can use this query
SELECT records.*, campaigns.name
FROM records, campaigns
WHERE campaigns.id = records.cid
But it's much better to use the explicit INNER JOIN syntax (the ANSI-92 standard) because it's more readable and you can easily replace INNER with LEFT or other types of join.
SELECT records.*, campaigns.name
FROM records INNER JOIN campaigns
ON campaigns.id = records.cid
More explanation here:
SQL Inner Join. ON condition vs WHERE clause
INNER JOIN ON vs WHERE clause
SELECT *
FROM records
LEFT JOIN campaigns
on records.cid = campaigns.id;
Using a left join instead of inner join guarantees that you will still list every records entry.
I have one table GAMES and another PLAYERS. Currently each "game" has a column for players_in_game, but I have nothing reciprocal in the PLAYERS table. Since this column is an array (a comma-separated list of player ID #s), I'm thinking it would probably be better to have each player's record also contain a list of the games they are a member of. On the other hand, duplicating the information in two separate tables might actually require more DB calls.
For perspective, there aren't likely to be more than a dozen players in a game (generally 4-6 is the norm), but there could potentially be a large number of games.
Is there a good way to figure out which would be more efficient?
Thanks.
Normalization is generally a good thing. A comma-delimited list in a column is a sign that the table is in desperate need of a foreign key. If you're worried about extra queries, check out JOINs:
dbo.games
+----+----------+
| id | name     |
+----+----------+
| 1  | war      |
| 2  | invaders |
+----+----------+
dbo.players
+----+----------+---------+
| id | name     | game_id |
+----+----------+---------+
| 1  | john     | 1       |
| 2  | mike     | 1       |
+----+----------+---------+
SELECT games.name, COUNT(players.id) AS total_players FROM games LEFT JOIN players ON games.id = players.game_id GROUP BY games.name;
Result:
+-----------+--------------+
| name      |total_players |
+-----------+--------------+
| war       | 2            |
| invaders  | 0            |
+-----------+--------------+
Sidenote: Go Hokies :)
Oh god, please don't use CSVs!! I know it's tempting when you're new to SQL, but it becomes unqueryable...
You need 3 tables: games, players, and players_in_games. games and players should each have a primary auto-incrementing key like id, and then players_in_games needs just two fields, player_id and game_id. This is called a "many to many" relationship. A player can play many games, and a game can have many players.
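A minimal sketch of those three tables; the column types and constraint details are assumptions, not part of the answer itself:
CREATE TABLE games (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE players (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- Junction table: one row per (player, game) membership
CREATE TABLE players_in_games (
    player_id INT UNSIGNED NOT NULL,
    game_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (player_id, game_id),
    FOREIGN KEY (player_id) REFERENCES players(id),
    FOREIGN KEY (game_id)   REFERENCES games(id)
);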
The right answer is a table called PlayersInGames that has a player id and a game id per row.
I would create a third table that links the players and games. Your comma-delimited list is effectively a third table, but parsing your list is almost certainly going to be less efficient than letting the database do it for you.
Ask yourself what happens if you remove a row from the GAME table. Now you'll have to loop over all the PLAYER rows, parse the list, figure out which ones contain a reference to the removed GAME, and then update all the lists.
Bad design. Let SQL do what it was born for. The query will be fast enough if you index it properly. Micro-optimizations like this are the wrong approach.
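To illustrate the removal scenario with the junction-table design sketched above (the game id 42 is a placeholder): removing a game becomes two simple statements, or just one if the foreign key is declared with ON DELETE CASCADE, and no player rows ever need to be parsed or rewritten.
-- Drop the memberships first, then the game itself
DELETE FROM players_in_games WHERE game_id = 42;
DELETE FROM games WHERE id = 42;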