I'm doing a fairly simple system where users can find computers by searching by option type. I want to search by brand, model, and "options".
Essentially I have 5 tables in this scenario-
brand
model
selection
options_group
options
The selection table is a multi-column lookup table containing:
brand_id
model_id
options_group_id
The options_group table is a lookup table with an ID for "groups of options" and an entry for each option_id.
Basically, the options_group table allows for lots of entries to have the same group of options without storing it more than once.
Right. So. I want to select a specific selection of parts that generates a table:
brand
model
options
where "options" is generated based off the options_group.
My question is this: Do I do this with multiple select statements, where I select just from the selection table first, and then use options_group to do a second select and get all of the options for each row, or do I do a join and get a table with lots of rows?
Before you suggest it, I'm not finding that any of the other answers on SO are answering this exact question.
Or is there some other, better way to do it? I read that joins are orders of magnitude faster than multiple selects, but parsing it at the end could take more time.
Use a single statement with SELECT DISTINCT to weed out duplicates. The relational calculus and relational algebra that underlie SQL eliminate duplicates automatically as part of the project operator. SQL does not do so by default, so you have to ask for it with DISTINCT. Because the underlying relational theory encourages a single statement, and the query fits neatly into its operators, I recommend it as a best practice.
With two tables, parent (id) and child (id, parent_id, property), do: select distinct parent.id from parent join child on parent.id = child.parent_id where child.property in ('X', 'Z');
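Applied to the schema in the question, the single-statement approach might look like the sketch below. It uses Python's sqlite3 module only so that it runs anywhere; the column names (id, name, option_id), the sample rows, and the helper find_computers are assumptions for illustration, not the asker's real schema.

```python
import sqlite3

# In-memory database with an assumed version of the five tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE brand (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
-- one row per (group, option) pair
CREATE TABLE options_group (id INTEGER, option_id INTEGER);
CREATE TABLE selection (brand_id INTEGER, model_id INTEGER, options_group_id INTEGER);

INSERT INTO brand VALUES (1, 'Acme');
INSERT INTO model VALUES (1, 'A100'), (2, 'A200');
INSERT INTO options VALUES (1, 'SSD'), (2, 'Wifi'), (3, 'Webcam');
INSERT INTO options_group VALUES (10, 1), (10, 2), (20, 2), (20, 3);
INSERT INTO selection VALUES (1, 1, 10), (1, 2, 20);
""")


def find_computers(option_names):
    """Return distinct (brand, model) pairs offering any of the given options."""
    placeholders = ", ".join("?" for _ in option_names)
    sql = f"""
        SELECT DISTINCT b.name, m.name
        FROM selection s
        JOIN brand b ON b.id = s.brand_id
        JOIN model m ON m.id = s.model_id
        JOIN options_group g ON g.id = s.options_group_id
        JOIN options o ON o.id = g.option_id
        WHERE o.name IN ({placeholders})
        ORDER BY b.name, m.name
    """
    return conn.execute(sql, list(option_names)).fetchall()
```

Without DISTINCT, a model whose group contains two of the requested options would come back twice, which is the duplicate the answer is talking about.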
Since you asked for good practice, I'll throw in the fact that this doesn't have to be a db-only solution. It's good practice to cache static/lookup data (sounds like models and/or parts don't change very often) in the app layer or something like memcached, etc, and it will save you the joins and reduce your resultset size.
I'm working on an existing application that uses some JOIN statements to create "immutable" objects (i.e. the results are always JOINed to create a processable object - results from only one table will be meaningless).
For example:
SELECT r.*,u.user_username,u.user_pic FROM articles r INNER JOIN users u ON u.user_id=r.article_author WHERE ...
will yield a result of type, let's say, ArticleWithUser that is necessary to display an article with the author details (like a blog post).
Now, I need to make a table featured_items which contains the columns item_type (article, file, comment, etc.) and item_id (the article's, file's or comment's id), and query it to get a list of the featured items of some type.
Assuming tables other than articles contain whole objects that do not need JOINing with other tables, I can simply pull them with a dynamically generated query like
SELECT some_table.* FROM featured_items RIGHT JOIN some_table ON some_table.id = featured_items.item_id WHERE featured_items.item_type = X
But what if I need to get a featured item from the aforementioned type ArticleWithUser? I cannot use the dynamically generated query because the syntax will not suit two JOINs.
So, my question is: is there a better practice to retrieve results that are always combined together? Maybe do the second JOIN on the application end?
Or do I have to write special code for each of those combined results types?
Thank you!
A view can be thought of as a table for the faint of heart.
https://dev.mysql.com/doc/refman/5.0/en/create-view.html
Views can incorporate joins, and other views. Keep in mind that upon creation they take a snapshot of the columns in existence at that time on the underlying tables, so ALTER TABLE statements that add columns to those tables are not picked up by a select * inside the view.
An old article which I consider required reading on the subject of MySQL Views:
By Peter Zaitsev
To answer your question as to whether they are widely used, they are a major part of the database developer's toolkit, and in some situations offer significant benefits, which have more to do with indexing than with the nature of views, per se.
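As a rough sketch of how a view could solve the featured_items problem: wrap the two-table join in a view, and the dynamically generated query only ever has to join featured_items to one source. The view name article_with_user, the columns article_id and article_title, the SOURCES whitelist and the sample rows are all assumptions; sqlite3 is used here just to make the example runnable, and the CREATE VIEW statement is the same idea in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_username TEXT, user_pic TEXT);
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, article_title TEXT,
                       article_author INTEGER);
CREATE TABLE featured_items (item_type TEXT, item_id INTEGER);

-- The "always joined" object is defined once, as a view.
CREATE VIEW article_with_user AS
    SELECT r.article_id AS id, r.article_title, u.user_username, u.user_pic
    FROM articles r
    INNER JOIN users u ON u.user_id = r.article_author;

INSERT INTO users VALUES (1, 'alice', 'alice.png');
INSERT INTO articles VALUES (7, 'Views in practice', 1), (8, 'Not featured', 1);
INSERT INTO featured_items VALUES ('article', 7);
""")

# Hypothetical whitelist mapping an item type to the table or view that holds it.
# The name is interpolated into SQL, so it must never come from user input.
SOURCES = {"article": "article_with_user"}


def featured(item_type):
    """Fetch featured items of one type using the same query shape for every type."""
    source = SOURCES[item_type]
    sql = (
        f"SELECT t.* FROM featured_items f "
        f"JOIN {source} t ON t.id = f.item_id "
        f"WHERE f.item_type = ?"
    )
    return conn.execute(sql, (item_type,)).fetchall()
```

Types that live in a single table would simply map to that table in SOURCES, so no special code per combined type is needed.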
I pull a range (e.g. limit 72, 24) of games from a database according to which have been voted most popular. I have one table for tracking game data, and a separate one for tracking individual votes for a game (rating from 1 to 5, one vote per user per game). A game is considered "more popular" than another when it has a higher average rating across all of its votes. Games with fewer than 5 votes are not considered. Here is what the tables look like (two tables, "games" and "votes"):
games:
gameid(key)
gamename
thumburl
votes:
userid(key)
gameid(key)
rating
Now, I understand that there is something called an "index" which can speed up my queries by essentially pre-querying my tables and constructing a separate table of indices (I don't really know.. that's just my impression).
I've also read that mysql operates fastest when multiple queries can be condensed into one longer query (containing joins and nested select statements, I presume).
However, I am currently NOT using an index, and I am making multiple queries to get my final result.
What changes should be made to my database (if any -- including constructing index tables, etc.)? And what should my query look like?
Thank you.
Your query that calculates the average for every game could look like:
SELECT gamename, AVG(rating)
FROM games INNER JOIN votes ON games.gameid = votes.gameid
GROUP BY games.gameid
HAVING COUNT(*)>=5
ORDER BY avg(rating) DESC
LIMIT 0,25
You should have an index on gameid in both games and votes (if you have defined gameid as the primary key of games, that side is already covered).
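Here is a small runnable sketch of that ranking query, with the page window passed in as parameters. It uses sqlite3 rather than MySQL purely for convenience; the sample games and votes and the helper name most_popular are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (gameid INTEGER PRIMARY KEY, gamename TEXT, thumburl TEXT);
CREATE TABLE votes (
    userid INTEGER,
    gameid INTEGER REFERENCES games (gameid),
    rating INTEGER,
    PRIMARY KEY (userid, gameid)   -- one vote per user per game
);
INSERT INTO games VALUES (1, 'Alpha', 'a.png'), (2, 'Beta', 'b.png'), (3, 'Gamma', 'c.png');
""")

votes = (
    [(u, 1, 4) for u in range(1, 6)]                   # Alpha: five votes of 4
    + [(u, 2, 5) for u in range(1, 5)] + [(5, 2, 4)]   # Beta: four 5s and one 4
    + [(u, 3, 5) for u in range(1, 5)]                 # Gamma: only four votes
)
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)", votes)


def most_popular(offset, limit):
    """Return (gamename, average rating) for games with at least 5 votes."""
    sql = """
        SELECT g.gamename, AVG(v.rating) AS avg_rating
        FROM games g
        INNER JOIN votes v ON v.gameid = g.gameid
        GROUP BY g.gameid, g.gamename
        HAVING COUNT(*) >= 5
        ORDER BY avg_rating DESC
        LIMIT ? OFFSET ?
    """
    return conn.execute(sql, (limit, offset)).fetchall()
```

Calling most_popular(72, 24) corresponds to the "limit 72, 24" window from the question. Gamma never appears because the HAVING clause filters out games with fewer than five votes.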
According to the MySQL documentation, an index is created when you designate a primary key at table creation. This is worth mentioning, because not all RDBMS's function this way.
I think you have the right idea here, with your "votes" table acting as a bridge between "games" and "user" to handle the many-to-many relationship. Just make sure that "userid" and "gameid" are indexed on the "votes" table.
If you have access to use InnoDB storage for your tables, you can create foreign keys on gameid in the votes table which will use the index created for your primary key in the games table. When you then perform a query which joins these two tables (e.g. ... INNER JOIN votes ON games.gameid = votes.gameid) it will use that index to speed things up.
Your understanding of an index is essentially correct: it basically creates a separate lookup table which it can use behind the scenes when the query is executed.
When using an index it is useful to use the EXPLAIN syntax (simply prepend your SELECT with EXPLAIN to try this out). The output it gives shows you the list of possible keys available for the query as well as which key the query is actually using. This can be very helpful when optimising your query.
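To see the effect of an index on a query plan, something like the following can be tried. Note the assumption: this sketch uses SQLite's EXPLAIN QUERY PLAN, which is the closest equivalent to MySQL's EXPLAIN but produces differently formatted output; the index name idx_votes_gameid and the helper plan are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votes (userid INTEGER, gameid INTEGER, rating INTEGER, "
    "PRIMARY KEY (userid, gameid))"
)


def plan(sql):
    """Return the planner's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


query = "SELECT rating FROM votes WHERE gameid = 1"

# The primary key starts with userid, so it cannot help a lookup by gameid alone.
plan_before = plan(query)

conn.execute("CREATE INDEX idx_votes_gameid ON votes (gameid)")

# With the new index in place the planner can search instead of scanning.
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the second plan should mention idx_votes_gameid while the first one cannot.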
An index is a PHYSICAL DATA STRUCTURE which is used to help speed up retrieval-type queries; it's not simply a table upon a table, although that is good as a concept. Another way to picture it is the index at the back of your textbook: a search key points to the pages where the term appears (with a unique index, such as one on a primary key, a search key points to exactly one match). An index is defined by its data structure, so you could use a B+ tree index, and there are even hash indexes. This is database/query optimization at the physical/internal level of the database; I'm assuming you know that you're working at the higher levels of the DBMS, which is easier. An index is rooted in the internal levels, and that makes DB query optimization much more effective and interesting.
I've noticed from your question that you have not even developed the query yet. Focus on the query first. Indexing comes after; as a matter of fact, in any graduate or postgraduate database course, indexing falls under the maintenance of a database and not necessarily the development.
Also N.B. I have seen quite a few people say, as a rule, to make all primary keys indexes. This is not true. There are many instances where a primary key index would slow down the database. In fact, if we were to go with only primary indexes, then we should use hash indexes, since they work better than B+ trees for exact-match lookups!
In summary, it doesn't make sense to ask for a query and an index in the same question. Ask for help with the query first. Then, given your tables (relational schema) and SQL query, then and only then could I advise you on the best index. Remember, it's maintenance. We can't do maintenance if there is zero development.
Kind Regards,
N.B. most questions concerning indexes at the postgraduate level of many computing courses are as follows: we give the students a relational schema (i.e. your tables) and a query and then ask them to critically suggest a suitable index for that query on those tables. We can't ask a question like this if they don't have a query.
A friend told me that I should include the table name in the field name of the same table, and I'm wondering why? And should it be like this?
Example:
(Table) Users
(Fields) user_id, username, password, last_login_time
I see that the prefix 'user_' is meaningless since I know it's already for a user. But I'd like to hear from you too.
note: I'm programming in php, mysql.
I agree with you. The only place I am tempted to put the table name or a shortened form of it is on primary and foreign keys or if the "natural" name is a keyword.
Users: id or user_id, username, password, last_login_time
Post: id or post_id, user_id, post_date, content
I generally use 'id' as the primary key field name but in this case I think user_id and post_id are perfectly OK too. Note that the post date was called 'post_date' because 'date' is a keyword.
At least that's my convention. Your mileage may vary.
I see no reason to include the table name, it's superfluous. In the queries you can refer to the fields as <table name>.<field name> anyway (eg. "user.id").
With generic fields like 'id' and 'name', it's good to put the table name in.
The reason is it can be confusing when writing joins across multiple tables.
It's personal preference, really, but that is the reasoning behind it (and I always do it this way).
Whatever method you choose, make sure it is consistent within the project.
Personally I don't add table names to field names in the main table, but when using a field as a foreign key in another table, I prefix it with the name of the source table. E.g. the id field on the users table will be called id, but on the comments table, where comments are linked to the user who posted them, it will be user_id.
This I picked up from CakePHP's naming scheme and I think it's pretty neat.
Prefixing the column name with the table name is a way of guaranteeing unique column names, which makes joining easier.
But it is a tiresome practice, especially when we have long table names. It's generally easier to just use aliases when appropriate. Besides, it doesn't help when we are self-joining.
As a data modeller I do find it hard to be consistent all the time. With ID columns I theoretically prefer to have just ID but I usually find I have tables with columns called USER_ID, ORDER_ID, etc.
There are scenarios where it can be positively beneficial to use a common column name across multiple tables. For instance, when a logical super-type/sub-type relationship has been rendered as just the child tables it is useful to retain the super-type's column on all the sub-type tables (e.g. ITEM_STATUS) instead of renaming it for each sub-type (ORDER_ITEM_STATUS, INVOICE_ITEM_STATUS, etc). This is particularly true when they are enums with a common set of values.
For example, if your database has tables which store information about the Sales and Human Resources departments, you could name all your tables related to the Sales department as shown below:
SL_NewLeads
SL_Territories
SL_TerritoriesManagers
You could name all your tables related to Human resources department as shown below:
HR_Candidates
HR_PremierInstitutes
HR_InterviewSchedules
This kind of naming convention makes sure, all the related tables are grouped together when you list all your tables in alphabetical order. However, if your database deals with only one logical group of tables, you need not use this naming convention.
Note that, sometimes you end up vertically partitioning tables into two or more tables, though these partitions effectively represent the same entity. In this case, append a word that best identifies the partition, to the entity name
Actually, there is a reason for that kind of naming, especially when it comes to fields you're likely to join on. In MySQL at least, you can use the USING keyword instead of ON when the column has the same name in both tables, so users u JOIN posts p ON p.user_id = u.user_id becomes users u JOIN posts p USING(user_id), which is cleaner IMO.
Regarding other types of fields, you may benefit when selecting *, because you wouldn't have to spell out the list of fields you need and you could still be sure which field comes from which table. But generally the use of SELECT * is discouraged on performance and maintenance grounds, so I consider prefixing such fields with the table name a bad practice, although it may differ from application to application.
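A quick way to convince yourself that ON and USING are interchangeable when the join column is named the same in both tables is the sketch below. It runs against SQLite through Python's sqlite3 module (an assumption for portability; the join syntax is the same in MySQL), and the tables and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO posts VALUES (10, 1, 'hello'), (11, 2, 'hi'), (12, 1, 'bye');
""")

# Explicit ON condition: both columns have to be qualified.
on_rows = conn.execute("""
    SELECT u.username, p.content
    FROM users u JOIN posts p ON p.user_id = u.user_id
    ORDER BY p.post_id
""").fetchall()

# USING works because the column is called user_id on both sides.
using_rows = conn.execute("""
    SELECT username, content
    FROM users JOIN posts USING (user_id)
    ORDER BY post_id
""").fetchall()

# With USING, SELECT * reports the shared column only once.
star = conn.execute("SELECT * FROM users JOIN posts USING (user_id)")
star_columns = [d[0] for d in star.description]
```

If the primary key on users were called plain id, the USING form would not be available and the ON form would be required.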
Sounds like the conclusion is:
If the field name is unique across tables - prefix with table name. If the field name has the potential to be duplicated in other tables, name it unique.
I found field names such as "img, address, phone, year" since different tables may include different images, addresses, phone numbers, and years.
We should define primary keys with the table name as a prefix.
We should use user_id instead of id, and post_id instead of just id.
Benefits:-
1) Easily Readable
2) Easily differentiate in join queries. We can minimize the use of alias in query.
user table : user_id(PK)
post table : post_id(PK) user_id(FK) here user table PK and post table FK are same
3) This way we can get the benefit of NATURAL JOIN and JOIN with USING.
As per the documentation:
Natural joins and joins with USING, including outer join variants, are
processed according to the SQL:2003 standard. The goal was to align
the syntax and semantics of MySQL with respect to NATURAL JOIN and
JOIN ... USING according to SQL:2003. However, these changes in join
processing can result in different output columns for some joins.
Also, some queries that appeared to work correctly in older versions
(prior to 5.0.12) must be rewritten to comply with the standard.
These changes have five main aspects:
1) The way that MySQL determines the result columns of NATURAL or USING join operations (and thus the result of the entire FROM clause).
2) Expansion of SELECT * and SELECT tbl_name.* into a list of selected columns.
3) Resolution of column names in NATURAL or USING joins.
4) Transformation of NATURAL or USING joins into JOIN ... ON.
5) Resolution of column names in the ON condition of a JOIN ... ON.
Examples:-
SELECT * FROM user NATURAL LEFT JOIN post;
SELECT * FROM user NATURAL JOIN post;
SELECT * FROM user JOIN post USING (user_id);
I want to list the recent activities of a user on my site without doing too many queries. I have a table where I list all the things the user did with the date.
page_id - reference_id - reference_table - created_at - updated_at
The reference_id is the ID I need to search for in the reference_table (example: comments). If I would do a SELECT on my activity table I would then have to query:
SELECT * FROM reference_table where id = reference_id LIMIT 1
An activity can be a comment, a page update or a subscription. Depending which one it is, I need to fetch different data from other tables in my database
For example, if it is a comment, I need to fetch the author's name and the comment; if it is a reply, I need to fetch the original comment's username, etc.
I've looked into UNION keyword to union all my tables but I'm getting the error
1222 - The used SELECT statements have a different number of columns
and it seems rather complicated to make it work because the number of columns has to match, none of my tables have the same number of columns, and I'm not too fond of creating columns just for the fun of it.
I've also looked into the CASE statement which also requires the amount of columns to match if I remember correctly (I could be wrong for this one though).
Does anyone have an idea of how I could list the recent activities of a user without doing too many queries?
I am using PHP and MySQL.
You probably want to split out the different activities into different tables. This will give you more flexibility in how you query the data.
If you choose to use UNION, make sure that you use the same number of columns in each SELECT that the UNION is composed of.
EDIT:
I was down-voted for my response, so perhaps I can give a better explanation.
Split Table into Separate Tables and UNION
I recommended this technique because it will allow you to be more explicit about the resources you are querying. Having a single table for inserting is convenient, but you will always have to do separate queries to join with other tables to get meaningful information. Also, your database schema will be obfuscated by a single column being a foreign key for different tables depending on the data stored in that row.
You could have tables for comment, update and subscription. These would have their own data which could be queried on individually. If, say, you wanted to look at ALL user activity, you could somewhat easily use a UNION as follows:
(SELECT 'comment', title, comment_id AS id, created FROM comment)
UNION
(SELECT 'update', title, update_id AS id, created FROM `update`)
UNION
(SELECT 'subscription', title, subscription_id as id, created
FROM subscription)
ORDER BY created desc
This will provide you with a listing view. You could then link to the details of each type or load it on an ajax call.
You could accomplish this with the method that you are currently using, but this will actually eliminate the need for the 'reference_table' and will accomplish the same thing in a cleaner way (IMO).
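A runnable sketch of that listing view follows. The assumptions: it uses sqlite3, so the SELECTs are written without the surrounding parentheses that MySQL accepts; the update table is named page_update here to sidestep the reserved word UPDATE; UNION ALL is used since rows from different tables cannot be true duplicates once tagged with their type; and the sample rows and the recent_activity helper are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comment (comment_id INTEGER PRIMARY KEY, title TEXT, created TEXT);
CREATE TABLE page_update (update_id INTEGER PRIMARY KEY, title TEXT, created TEXT);
CREATE TABLE subscription (subscription_id INTEGER PRIMARY KEY, title TEXT, created TEXT);

INSERT INTO comment VALUES (1, 'First comment', '2024-01-03');
INSERT INTO page_update VALUES (1, 'Edited intro', '2024-01-02');
INSERT INTO subscription VALUES (1, 'Subscribed to page', '2024-01-04');
""")


def recent_activity(limit):
    """Return (kind, title, id, created) rows across all activity tables, newest first."""
    sql = """
        SELECT 'comment' AS kind, title, comment_id AS id, created FROM comment
        UNION ALL
        SELECT 'update', title, update_id, created FROM page_update
        UNION ALL
        SELECT 'subscription', title, subscription_id, created FROM subscription
        ORDER BY created DESC
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()
```

The kind column tells the application which detail query (or ajax endpoint) to use for each row.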
The problem is that UNION should be used just to get similar recordsets together. If you try to unify two different queries (for example, with different columns being fetched) it's an error.
If the nature of the queries is different (having different column count, or data types) you'll need to make several different queries and treat them all separately.
Another approach (less elegant, I guess) would be LEFT JOINing your activities table with all the others, so you'll end up with a recordset with a lot of columns, and you'll need to check for each row which columns should be used depending on the activity nature.
Again, I'd rather stick with the first one, since the second produces a rather sparse recordset.
With UNION you don't have to get all of the columns from each table, as long as each SELECT returns the same number of columns with compatible data types.
So you could do something like this:
SELECT name, comment as description
FROM Comments
UNION
SELECT name, reply as description
FROM Replies
And it wouldn't matter whether Comments and Replies have the same number of columns.
This really depends on the amount of traffic on your site. The union approach is straightforward and possibly the correct one, logically, but you'll suffer on performance if your site is heavily loaded, since indexing a UNIONed query is hard.
Joining might be good, but again, in terms of performance and code clarity, it's not the best of ways.
Another totally different approach is to create an 'activities' table, which is updated on every activity (in addition to recording the real activity, just for this purpose). In old terms of DB correctness, you should avoid this approach since it creates duplicate data in your system. I, however, found it very useful in terms of performance.
Another side note about the UNION approach, if you decide to take it: if the SELECTs differ in their number of columns, you can select placeholder values in some of them, for example: (SELECT UserId, UserName FROM users) UNION (SELECT 0, UserName FROM notes)
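The padding trick can be sketched like this, using NULL as the placeholder so that missing values are easy to tell apart from real ones. The tables comments and subscriptions, their columns, and the sample rows are assumptions made for the example; sqlite3 is used only to keep it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (id INTEGER PRIMARY KEY, author TEXT, body TEXT, created_at TEXT);
CREATE TABLE subscriptions (id INTEGER PRIMARY KEY, page_title TEXT, created_at TEXT);

INSERT INTO comments VALUES (1, 'alice', 'Nice page', '2024-02-02');
INSERT INTO subscriptions VALUES (1, 'Home', '2024-02-01');
""")

# Each SELECT returns four columns. A subscription has no author, so that
# slot is padded with NULL to keep the column counts equal.
feed = conn.execute("""
    SELECT 'comment' AS kind, author, body AS detail, created_at FROM comments
    UNION ALL
    SELECT 'subscription', NULL, page_title, created_at FROM subscriptions
    ORDER BY created_at DESC
""").fetchall()
```

Leaving out the NULL would reproduce the "different number of columns" error from the question.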
I'm building a movies website... I need to display info about each movie, including genres, actors, and a lot of info (IMDB.com like)...
I created a 'movies' table including an ID and some basic information.
For the genres I created a 'genres' table including 2 columns: ID and genre.
Then I use a 'genres2movies' table with two columns, movieID and genreID, to connect the genres and the movies tables...
This way, for example, if a movie has 5 different genres I get the movieID in 5 different rows of the 'genres2movies' table. It's better than including the genre each time for each movie, but...
Is there a better way of doing this?
I need to do this also for actors, languages and countries so performance and database size is really important.
Thanks!!!
It sounds like you are following proper normalisation rules at the moment, which is exactly what you want.
However, you may find that if performance is a key factor you may want to de-normalise some parts of your data, since JOINs between tables are relatively expensive operations.
It's usually a trade-off between proper/full normalisation and performance
You are on the right track. That's the way to do many-to-many relationships. Database size won't grow much because you use integers, and for speed you must set up correct indexes on those IDs. When writing SELECT queries, check them with EXPLAIN; it helps to find the speed bottlenecks.
You're on exactly the right track - this is the correct, normalized, approach.
The only thing I would add is to ensure that your index on the join table (genres2movies) includes both genre and movie id and it is generally worthwhile (depending upon the selects used) to define indexes in both directions - ie. two indexes, ordered genre-id,movie-id and movie-id,genre-id. This ensures that any range select on either genre or movie will be able to use an index to retrieve all the data it needs and not have to resort to a full table scan, or even have to access the table rows themselves.
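To illustrate the two-direction indexing, here is a sketch in which the primary key covers lookups by movie and a second index covers lookups by genre. It runs on SQLite, whose EXPLAIN QUERY PLAN output differs in format from MySQL's EXPLAIN; the index name idx_genre_movie, the title column, the sample data and the helper functions are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (ID INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE genres (ID INTEGER PRIMARY KEY, genre TEXT);
CREATE TABLE genres2movies (
    movieID INTEGER REFERENCES movies (ID),
    genreID INTEGER REFERENCES genres (ID),
    PRIMARY KEY (movieID, genreID)      -- index ordered movie-id, genre-id
);
-- Second index for the opposite direction: genre-id, movie-id.
CREATE INDEX idx_genre_movie ON genres2movies (genreID, movieID);

INSERT INTO movies VALUES (1, 'Heat'), (2, 'Alien');
INSERT INTO genres VALUES (1, 'Drama'), (2, 'Sci-Fi'), (3, 'Thriller');
INSERT INTO genres2movies VALUES (1, 1), (1, 3), (2, 2), (2, 3);
""")


def plan(sql):
    """Return the planner's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_by_genre = plan("SELECT movieID FROM genres2movies WHERE genreID = 1")
plan_by_movie = plan("SELECT genreID FROM genres2movies WHERE movieID = 1")


def titles_in_genre(genre):
    """List movie titles for one genre by going through the join table."""
    sql = """
        SELECT m.title
        FROM genres g
        JOIN genres2movies gm ON gm.genreID = g.ID
        JOIN movies m ON m.ID = gm.movieID
        WHERE g.genre = ?
        ORDER BY m.title
    """
    return [row[0] for row in conn.execute(sql, (genre,))]
```

Printing plan_by_genre and plan_by_movie should show each lookup being answered from an index alone, which is the "not even access the table rows" case described above.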