I am developing a MySQL db for a user list, and I am trying to determine the most efficient way to design it.
My issue is that there are 3 types of users: "general", "normal", and "super". General and normal users differ only in the values of certain columns, so the schema to store them is identical. However, super users have at least 4 extra columns of info that need to be stored.
In addition, each user needs a unique user_id for reference from other parts of the site.
So, I can keep all 3 user types in the same table, but then I would have a lot of NULL values stored for the general and normal user rows.
Or, I can split the users into 2 tables: general/normal and super. This would get rid of the abundance of NULLs, but would require a lot more work to keep track of the user_ids and ensure they are unique, as I would have to handle that in my PHP instead of just doing a SERIAL column in the single table solution above.
Which solution is more efficient in terms of memory usage and performance?
Or is there another, better solution I am not seeing?
Thanks!
If each user needs a unique id, then you have the answer to your question: You want one users table with a UserId column. Often, that column would be an auto-incremented integer primary key column -- a good approach to the implementation.
What to do about the other columns? This depends on a number of different factors, which are not well explained in your question.
You can store all the columns in the same table. In fact, you could then implement views so you can see users of only one type. However, if a lot of the extra columns are fixed-width (such as numbers) then space is still allocated. Whether or not this is an issue is simply a question of the nature of the columns and the relative numbers of different users.
You can also store the extra columns for each type in its own table. This would have a foreign key relationship to the original table, using the UserId. If both these keys are primary keys, then the joins should be very fast.
There are more exotic possibilities as well. If the columns do not need to be indexed, then MySQL 5.7 has support for JSON, so they could all go into one column. Some databases (particularly columnar-oriented ones) allow "vertical partitioning", where different columns in a single table are stored in separate allocation units. MySQL does not (yet) support vertical partitioning.
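As a minimal sketch of the second option (the separate extra-columns table with a foreign key back to users), assuming made-up column names since the actual extra columns were not given:

CREATE TABLE users (
    user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_type ENUM('general', 'normal', 'super') NOT NULL,
    -- ...columns shared by all three user types...
    email VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE super_user_details (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,    -- same id as in users, so the join is on two primary keys
    -- ...the 4+ extra super-user columns go here...
    extra_info VARCHAR(255),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;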
Why not build an extra table, but only for the extra columns you need for super users? So 2 tables: one with all the users and one with the super users' extra info.
If you want to have this type of schema, try to create a relation
like:
tb_user > user_id , user_type_id(int)
tb_user_type > user_type_id(int) , type_name
This way you will have just 2 tables, and if the type is not set you can assign a default value to the user.
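Spelled out as DDL, that relation could look like this (the column types and the DEFAULT value of 1 are assumptions):

CREATE TABLE tb_user_type (
    user_type_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    type_name VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE tb_user (
    user_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_type_id INT NOT NULL DEFAULT 1,    -- falls back to a default type when none is set
    FOREIGN KEY (user_type_id) REFERENCES tb_user_type (user_type_id)
) ENGINE=InnoDB;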
I have to create a system to save users' votes for two different types of modules: News and Video.
This table should have the same fields:
id
entry_id
vote
user_id
So I thought to add a new field (module) to also save the name of the module; this way I can have just one table in the DB, filter it when needed, and create two views for statistics purposes.
I don't really know if the best solution is one table with the new field, or if it is better to have two different tables.
Let's assume that I have 1000 news items and 1000 users, and all of them vote on each news item: I will have 1,000,000 rows in the table.
Now assume that I also have 1000 videos, and in this case too all my users vote on them: another 1,000,000 rows, for a total of 2,000,000 rows in a single table.
Do I have any performance problem in this case? And what if I have many more videos, news items, and users?
Operation that I should do:
Insert
Update
Search
If you need more info, please ask.
I think the way to answer this question is based on entry_id. The votes are going to be about something and that something is going to reference another table.
So, if you have two separate tables for News and Videos, then you should have two separate votes tables. Neither will have entry_id. One will have news_id and the other video_id.
If you have one table, say Entries for both News and Videos, then have one table.
In other words, I am advising against having one table conditionally reference multiple other tables. It becomes very difficult to express foreign key constraints, for one thing. In addition, join operations are cumbersome to express. Someone else might visit the table and not realize that entry_id can refer to multiple tables, and incorrectly set up queries.
All of these problems can be overcome (and there are situations where one table may be the preferred solution). However, if the original entities are in different tables, then put the votes in different tables.
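A minimal sketch of that two-table layout; the news, videos and users table names and their id columns are assumptions based on the question:

CREATE TABLE news_votes (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    news_id INT NOT NULL,
    user_id INT NOT NULL,
    vote TINYINT NOT NULL,
    -- a UNIQUE KEY (news_id, user_id) would also prevent double voting
    FOREIGN KEY (news_id) REFERENCES news (id),
    FOREIGN KEY (user_id) REFERENCES users (id)
) ENGINE=InnoDB;

CREATE TABLE video_votes (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    video_id INT NOT NULL,
    user_id INT NOT NULL,
    vote TINYINT NOT NULL,
    FOREIGN KEY (video_id) REFERENCES videos (id),
    FOREIGN KEY (user_id) REFERENCES users (id)
) ENGINE=InnoDB;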
I am trying to build a robust php function that allows me to traverse over my normalized database. My mySQL database has 6 tables with the following column names (I am only including the primary and foreign keys, as well as some limited table columns for simplicity) so that you can see how they are related.
tableA:
partID (primary key)
tableABJunction
itemID (foreign key)
partID (foreign key)
tableB
itemID (primary key)
itemName
sales
customerID (foreign key)
itemID (foreign key)
partDate
itemID (foreign key)
customer
customerID (primary key)
nameFirst
nameLast
When I need to generate a query, such as: What are the names of the customers that ordered itemID = 12? I have to first do a query on the sales table for all customerIDs where itemID = 12, and then query the customer table to find out their first and last names. Sometimes I may need to perform a query where I have to return data from all 6 tables, based on a request asking for all information pertaining to the customer whose name is John Smith. Is there any easy way to build a function to handle this variety of queries, without having to build a query for every possible type of search?
Currently, my approach is to pass the following to php via AJAX:
web_conditionArray (contains the column name and value of the data provided. Such as nameFirst => 'John', nameLast => 'Smith'); web_resultArray (contains the table name and the columns that I am requesting: sales => 'itemID, itemName').
The issue that I am having with this approach is a way to store the relationships between all of the mySQL datatables with their foreign keys so that my php program knows how to link all the tables together to run the correct query to get from the data provided from one table to the data requested in another table. Any suggestions or a better way to solve this? I was initially thinking of a doubly linked list but the flow from table to table is not linear given that there is a fork where the tableB links to the sales and partDate tables.
I tried to be as specific as I could in describing this situation without writing a novel; however, please let me know if you need any additional information to refine my question further.
Looking at your table structure, I imagine it would be possible to construct logic to calculate the relationships between tables, and dynamically construct queries, but it seems to me that that would be far more work than manually constructing queries for your particular database. I'm assuming that your tables have many more fields in them, but that you've only included the most important, and have definitely included all primary and foreign keys.
Based on that, you have only three information objects in your database: Parts, Items and Customers. You should, therefore, not need more than 12 manually constructed queries to make your system work. You just need to ensure that you simplify your queries to work with whole information objects, and use the PHP layer to filter them later.
So, you reduce your query logic to:
"Fetch me all [Parts, Items or Customers] (and possibly also all [Parts, Items or Customers]) related to [Part, Item or Customer] (and possibly [Part, Item or Customer])"
This results in the following queries:
All Customers for a Part
All Customers for an Item
All Customers for a Part and an Item
All Items for a Part
All Items for a Customer
All Items for a Part and a Customer
All Parts for an Item
All Parts for a Customer
All Parts for a Customer and an Item
All Parts and Customers for an Item
All Customers and Items for a Part
All Items and Parts for a Customer
(This is the full list of logical relationships - some may not make any sense practically, which makes your life easier)
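For instance, "All Customers for an Item" (the itemID = 12 example from the question) reduces to a single query against whole objects - a sketch using the column names listed above:

SELECT DISTINCT c.customerID, c.nameFirst, c.nameLast
FROM customer c
JOIN sales s ON s.customerID = c.customerID
WHERE s.itemID = 12;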
So, your PHP script needs to perform the following tasks:
Identify which object(s) are required for the criteria of the query. This is based on the fields supplied.
Construct a WHERE clause for your query which identifies the primary key for the criteria objects from the fields passed.
Identify which object(s) are required for the result of the query, based on the fields requested.
Select the query based on the criteria and return objects, and insert the constructed WHERE clause.
Perform the query, extracting all information available about the requested objects
Filter the results, extracting only the required information
Return the final results.
First, know that my answer will most likely be downvoted to hell (as this methodology is constantly downvoted despite its correctness). DBAs want you to believe that just because a complex query can be done with a SQL statement, it should be (like how server-siders think all client-side work should be done server-side, or how client-siders think layouts should be done with client-side code instead of CSS). No. Complex queries are for people sitting at command lines needing to come up with on-demand data grabbing for specific, non-routine reasons. For processing speed, SELECTing, UPDATEing, and DELETEing should always be done off the PK server-side.
It sounds like you have a set of legitimately large tables.
Assuming it's large and speed is the primary concern (and not development time), use only a primary key and no other indexes because the more indexes you have, the more those indexes need to be reindexed by the database when really the comparisons that DBAs would have you do are faster server-side.
The primary key will take some finagling, but it's the most important thing past data types and lengths. For instance, the non-FK, independent tables like tableA, tableB, and customer should probably have an ai INT PK (Generally, remember that computers think in terms of integers), but the ones with multiple FKs should probably have no ai INT but instead a composite PK with the less variant SELECTed FK first. For example, with my site, I store vote totals on links by userID and linkID. If a user's logged in, they'll need to know how many votes they've placed on a link, so the userID is the one less likely to change, so that's first in my PK on that table. Counting this on demand database side or server-side was a performance nightmare.
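As a sketch of that vote-totals table (the table and column names here are illustrative, not taken from the original site):

CREATE TABLE link_votes (
    userID INT NOT NULL,     -- the less variant, most commonly SELECTed FK comes first in the PK
    linkID INT NOT NULL,
    voteTotal INT NOT NULL DEFAULT 0,
    PRIMARY KEY (userID, linkID)
) ENGINE=InnoDB;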
For just a few lines of code, you will GREATLY improve speed. Sorting on the PK via php will cut latency by 50%. Absorbing JOINs into php will decrease the rate of latency spikes. Having no on demand MySQL calculations will keep your site from becoming paralyzed.
If you step away from the dogma that, just because a SQL statement can get you the results, you should use a SQL statement instead of a server-side language (C++ being the fastest), you'll see performance skyrocket.
If you can be more specific with the tables you're trying to obfuscate, I can get more specific, but you probably get the idea.
AJAX has changed the game and forced refocus. CSS for layouts; js for client-side programming; server-side for...server-side processing; database for storing everything that lasts longer than a moment.
Bring on the downvotes! LOL
What is the purpose of the secondary key? Say I have a table that logs all the check-ins (similar to Foursquare), with columns id, user_id, location_id, post, time, and there can be millions of rows. Many people have stated that secondary keys speed up the process.
Why does this work? And should both user_id and location_id be secondary keys?
I'm using mySQL btw...
Edit: There will be a page that lists/calculates all the check-ins for a particular user, and another page that lists all the users who have checked in to a particular location.
mySQL Query
Type 1
SELECT location_id FROM checkin WHERE user_id = 1234
SELECT user_id FROM checkin WHERE location_id = 4321
Type 2
SELECT COUNT(location_id) as num_users FROM checkin
SELECT COUNT(user_id) as num_checkins FROM checkin
The key (also called an index) is for speeding up queries. If you want to see all check-ins for a given user, you need a key on the user_id field. If you want to see all check-ins for a given location, you need an index on the location_id field. You can read more in the MySQL documentation.
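For the Type 1 queries above, that means something along these lines against the checkin table (the index names are arbitrary):

ALTER TABLE checkin ADD INDEX ix_user (user_id);
ALTER TABLE checkin ADD INDEX ix_location (location_id);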
I want to comment on your question and your examples.
Let me just suggest strongly to you that, since you are using MySQL, you make sure that your tables are using the InnoDB engine type, for many reasons you can research on your own.
One important feature of InnoDB is that you have referential integrity. What does that mean? In your checkin table, you have a foreign key of user_id, which is the primary key of the user table. With referential integrity, MySQL will not let you insert a row with a user_id that doesn't exist in the user table. Using MyISAM, you can. That alone should be enough to make you want to use the InnoDB engine.
To your question about keys/indexes, essentially when a table is defined and a key is declared for a column or some combination of columns, mysql will create an index.
Indexes are essential for performance as a table grows with the insert of rows.
All relational databases and document databases depend on an implementation of B-tree indexing. What B-trees are very good for is finding an item (or not) using a predictable number of lookups. So when people talk about the performance of a relational database, the essential building block of that is the use of B-tree indexes, which are created via KEY statements or with ALTER TABLE or CREATE INDEX statements.
To understand why this is, imagine that your user table was simply a text file, with one line per row, perhaps separated by commas. As you add a row, a new line in the text file gets added at the bottom.
Eventually you get to the point that you have 10,000 lines in the file.
Now you want to find out if you entered a line for one particular person with the last name of Smith. How can you find that out?
Without any sorting of the file, or a separate index, you have but one option, and that is to start at the first line in the file and scan through every line in the table looking for a match. Even if you found a Smith, that might not be the only 'Smith' in the table, so you have to read the entire file from top to bottom every time you want to do this search.
Obviously as the table grows the performance of searching gets worse and worse.
In relational database parlance, this is known as a "table scan". The database has to start at the first row and scan through reading every row until it gets to the end.
Without indexes, relational databases still work, but they are highly dependent on IO performance.
With a Btree index, the rows you want to find are found in the index first. The indexes have a pointer directly to the data you want, so the table no longer needs to be scanned, but instead the individual data pages required are read. This is how a database can maintain adequate performance even when there are millions or 10's or 100's of millions of rows.
To really start to gain insight into how mysql works, you need to get familiar with EXPLAIN EXTENDED ... and start looking at the explain plans for queries. Simple ones like those you've provided will have simple plans that show you how many rows are being examined to get a result and whether or not they are using one or more indexes.
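For example, against the first Type 1 query (the plan output will depend on your data and indexes):

EXPLAIN EXTENDED SELECT location_id FROM checkin WHERE user_id = 1234;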
For your summary queries, indexes are not helpful because you are doing a COUNT(). The table will need to be scanned when you have no other criteria constraining the search.
I did notice what looks like a mistake in your summary queries. Just based on your labels, I would think that these are the right queries to get what you would want given your column alias names.
SELECT COUNT(DISTINCT user_id) as num_users FROM checkin
SELECT COUNT(*) as num_checkins FROM checkin
This is yet another reason to use InnoDB, which when properly configured has a data cache (innodb buffer pool) similar to other rdbms's like oracle and sql server. MyISAM doesn't cache data at all, so if you are repeatedly querying the same sorts of queries that might require a lot of IO, MySQL will have to do all that data reading work over and over, whereas with InnoDB, that data could very well be sitting in cache memory and have the result returned without having to go back and read from storage.
Primary vs Secondary
There really is no such concept internally. A Primary key is special because it allows the database to find one single row. Primary keys must be unique, and to reflect that, the associated Btree index is unique, which simply means that it will not allow you to have 2 keys with the same data to exist in the index.
Whether or not an index is unique is an excellent tool that allows you to maintain the consistency of your database in many other cases. Let's say you have an 'employee' table with the SS_Number column to store social security #. It makes sense to have an index on that column if you want the system to support finding an employee by SS number. Without an index, you will tablescan. But you also want to have that index be unique, so that once an employee with a SS# is inserted, there is no way the database will let you enter a duplicate employee with the same SS#.
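A sketch of such a unique index on the employee table mentioned above (the index name is arbitrary):

ALTER TABLE employee ADD UNIQUE INDEX ux_ss_number (SS_Number);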
But to demystify this for you, when you declare keys these indexes are just being created for you and used automagically in most cases, when you define the tables.
It's when you aren't dealing with keys (primary or foreign), as in the example of usernames, first & last names, SS#'s, etc., that you need to also be aware of how to create an index, because you are searching (using WHERE clause criteria) on one or more columns that aren't keys.
I have some location data, which is in a table locations with the key being the unique location_id
I have some user data, which is in a table users with the key being the unique user_id
Two ways I was thinking of linking these two together:
I can put the 'location' in each user's data.
'SELECT user_id FROM users WHERE location = "LOCATIONID";'
//this IS NOT searching with the table's key
//this does not require an explode
//this stores 1 integer per user
I can also put the 'userIDs' as a comma delimited string of ids into each location's data.
'SELECT userIDs FROM locations WHERE location_id = "LOCATIONID";'
//this IS searching with the tables key
//this needs an explode() once the comma delimited list is retrieved
//this stores 1 string of user ids per location
So I wonder which would be most efficient. I'm not really sure how much the size of the data stored could also impact the speed. I want retrievals that are as fast as possible when trying to find out which users are at which location.
This is just an example, and there will be many other tables like location to compare to the users, so the efficiency, or lack of, will be multiplied across the whole system.
Stick with option 1. Keep your database tables normalised as much as possible till you know you have a performance problem.
There's a whole slew of problems with option 2, including the lack of ability to then use the user ID's till you pull them into PHP and then having to fire off lots more SQL queries for each ID. This is extremely inefficient. Do as much inside MySQL as possible, the optimisations that the database layer can do while running the query will easily be a lot quicker than anything you write in PHP.
Regarding your point about not searching on the primary key, you should add an index to the location column. All columns that are in a WHERE clause should be indexed as a general rule. This negates the issue of not searching on the primary key, as the primary key is just another type of index for the purposes of performance.
Use the first one to keep your data normalized. You can then query for all users for a location directly from the database without having to go back to the database for each user.
Be sure to add the correct index on your users table too.
CREATE TABLE locations (
locationId INT PRIMARY KEY AUTO_INCREMENT
) ENGINE=INNODB;
CREATE TABLE users (
userId INT PRIMARY KEY AUTO_INCREMENT,
location INT,
INDEX ix_location (location)
) ENGINE=INNODB;
Or to only add the index
ALTER TABLE users ADD INDEX ix_location(location);
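With either of those in place, all the users at a location come back in one indexed query; locationId 100 below is just an example value, and the join variant is only needed if you also want columns from locations:

SELECT userId FROM users WHERE location = 100;

-- or, if you also need columns from the locations table:
SELECT u.userId
FROM users u
JOIN locations l ON l.locationId = u.location
WHERE l.locationId = 100;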
Have you heard of foreign keys?
Get details from many tables using a join.
You can also use a subquery.
As you said, there are two tables: users and locations.
Keep userid as a foreign key in locations and fetch based on that.
When you store the user IDs as a comma-separated list in a table, that table is not normalized (especially it violates the first normal form, item 4).
It is perfectly valid to denormalize tables for optimization purposes. But only after you have measured that this is where the bottleneck actually is in your specific situation. This, however, can only be determined if you know which query is executed how often, how long they take and whether the performance of the query is critical (in relation to other queries).
Stick with option 1 unless you know exactly why you have to denormalize your table.
I don't really have any idea, although I've done some reading on the topic. All I know is that it is used to make the data in the database more efficient and easy to handle, and that it can also be used to save disk space. And lastly, if you use normalization, you will have to generate more tables.
Now I have a lot of questions to ask.
First, how will normalization help to save disk space, or whatever space is occupied by the database?
Second, is it possible to add data to multiple tables using only 1 query?
Please help, I'm just a newbie wanting to learn from you. Thanks.
Ok, couple of things:
PHP has got nothing to do with this; normalization is about modelling data.
Normalization is not about saving disk space. It is about organizing data so that it is easily maintainable, which in turn is a way to maintain data integrity.
Normalization is typically described in a few stages or 'normal forms'. In practice, people that design relational databases often intuitively 'get it right' most of the time. But it is still good to be aware of the normal forms and what their characteristics are. There is a lot of documentation on that on the internet (e.g. http://en.wikipedia.org/wiki/Database_normalization), and you should certainly do your own research, but the most important stages are:
Unnormalized data: in this stage, data is not truly tabular ('relational'). There is a lot of discussion of what tabular really means, and experts disagree with one another. But most people agree that data is unnormalized in case there are multi-valued attributes (= columns that can, for one row, contain a list as the value), or in case there are repeating groups (= multiple columns, or multiple groups of columns, for storing the same type of data).
Example of multi-valued column: person (first_name, last_name, phonenumbers)
Here, phonenumbers implies there could be multiple phone numbers stored in one column.
Example of repeating group: person(first_name, last_name, child1_first_name, child1_birth_date, child2_first_name, child2_birth_date..., childN_first_name, childN_birth_date)
Here, the person table has a number of column pairs (child_first_name, child_birth_date) to store the person's children.
Note that something like order (shipping_address, billing_address) is not a repeating group: the addresses for billing and shipping may be similar pieces of data, but each has its own distinct role for an order, and both just represent a different aspect of an order. child1 through childN do not - children do not have specific roles, and the list of children is variable (you never know how many groups you should reserve in advance).
In both cases, multi-valued columns and repeating groups, you basically have "nested table" structure - a table within a table. Data is said to be in 1NF (first normal form) if neither of these occur.
The 1NF is about structural characteristics: the tabular form of the data. All subsequent normal forms have to do with eliminating redundancy. Redundancy occurs when the same information is independently stored multiple times. Redundancy is bad: if you want to change some fact, you have to change it in multiple places. If you forget to change one of them, you have inconsistent data - the data is contradicting itself.
There are a lot of processes that can eliminate redundancy, each leading to a higher normal form, all the way from 1NF up to 6NF. However, typically most databases are adequately normalized at 3NF (or a slight variation of that called Boyce-Codd normal form, BCNF). You should study 2NF and 3NF, but the principle is very simple: a table is adequately normalized if:
the table is in 1nf
the table has a key (a column or column combination whose values are required, and which uniquely identifies a row - ie. there can be only one row having that combination of values in the key columns)
there are no functional dependencies between the non-key columns
non-key columns are not functionally dependent upon part of the key (but are completely functionally dependent upon the entire key).
Functional dependency means that a column's value can be derived from another column. A simple example:
order_item (order_id, item_number, customer_id, product_code, product_description, amount)
Let's assume (order_id, item_number) is the key. product_code and product_description are functionally dependent upon each other: for one particular product_code, you will always find the same product_description (as if product_description is a function of product_code). The problem is now: suppose a product description changes for a particular product code; you have to change all orders that use that product_code. Forget only one and you have an inconsistent database.
The way to solve it is to create a new product table with (product_code, product_description), having (product_code) as key, and then instead of storing all product fields in order, only store a reference to a row in the product table in the order_item records (in this case, order_item should only keep product_code, which is sufficient to look up a row in the product table and find the product_description)
So as you can see, with this solution you do actually save space (by not storing all these product descriptions in each order_item that happens to order the product) and you do get more tables (product split off from order_item). But just remember that it is not because of saving disk space: it is because you eliminate redundancy, thus making it easier to maintain the data, because now you only have to change one row in the product table to change the description.
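In DDL, that split could look like this (the column names come from the example above; the column types are assumptions):

CREATE TABLE product (
    product_code VARCHAR(20) NOT NULL PRIMARY KEY,
    product_description VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE order_item (
    order_id INT NOT NULL,
    item_number INT NOT NULL,
    customer_id INT NOT NULL,
    product_code VARCHAR(20) NOT NULL,    -- reference only; the description lives in product
    amount DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (order_id, item_number),
    FOREIGN KEY (product_code) REFERENCES product (product_code)
) ENGINE=InnoDB;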
There are a lot of similar questions on StackOverflow already, for example, Can someone please give an example of 1NF, 2NF and 3NF in plain english?
Look in the Related sidebar to the right for a bunch of them. That'll get you started.
As for your specific questions:
Normalization saves disk space by reducing redundant data storage. This has another benefit: if you have multiple copies of a given entity attribute in your database, they can get out of sync, while if you have a normalized database and use referential integrity, this cannot happen.
The INSERT statement references only one table. A TRIGGER on the insert statement can add rows to other tables, but there's no way to supply data to the trigger other than those columns in the table that spawned it.
When you need to insert dependent rows after inserting a row to the parent table, use the LAST_INSERT_ID() function to retrieve the auto-generated primary key value of the last INSERT statement in your session.
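For example (the customers/orders table and column names here are made up purely for illustration):

INSERT INTO customers (nameFirst, nameLast) VALUES ('John', 'Smith');
-- LAST_INSERT_ID() returns the auto-generated customer id from the statement above
INSERT INTO orders (customer_id, order_date) VALUES (LAST_INSERT_ID(), CURDATE());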
I think you will learn this when you start creating the schema for your database.
Please think reverse when you add a field that exists somewhere else in your database.
By reverse I mean, ask yourself: if I have to modify the field, how many queries do I have to run?
You will probably end up with the answer that you have to run the query 2 or X times to modify the content of your column.
Keep it simple, that means assign an ID to each content you have duplicated in your database.
For example, take an address column.
This is not good:
update clients set address = 'new address' where clientid=500;
update orders set address = 'new address' where orderid=300;
A good approach would be:
create an addresses table
//and run a single query
update addresses set address = 'new address' where addressid=100;
And use the address id 100 everywhere in your database tables as a foreign key reference (clients + orders). This way the id 100 itself never changes, but if you update the content of the address, all linked tables will pick up the change.
Level 3 of normalization (3NF) is enough for you this time.
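A sketch of that layout (the column types and extra columns are assumptions):

CREATE TABLE addresses (
    addressid INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    address VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE clients (
    clientid INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    addressid INT NOT NULL,
    FOREIGN KEY (addressid) REFERENCES addresses (addressid)
) ENGINE=InnoDB;

CREATE TABLE orders (
    orderid INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    addressid INT NOT NULL,
    FOREIGN KEY (addressid) REFERENCES addresses (addressid)
) ENGINE=InnoDB;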
Normalization is a set of rules. The more you follow, the higher a "level" of normalisation your database has. In general, level 3 is the highest level sought after.
Normalised data is theoretically "purer" than non-normalised data. This makes it easier to reason about it, and it removes redundancy, which reduces the chance of data getting out of sync.
From a practical viewpoint, however, normalised data isn't always the best design, even if it is in theory. If you don't really know the finer points, aiming for normalised data isn't such a bad idea though.
In phpMyAdmin > 4.3.0, in Structure -> Table structure, above the table you get:
"Print", "Propose table structure", "Track table", "Move columns", "Improve table structure". In "Improve table structure" you get a wizard which says:
Improve table structure (Normalization):
Select up to what step you want to normalize
First step of normalization (1NF)
Second step of normalization (1NF+2NF)
Third step of normalization (1NF+2NF+3NF)
To question 2: No, it is not possible to insert data into multiple tables with one query.
See the INSERT syntax.
In addition to other answers, you can also search here on SO for normalization and find e.g. the question: Normalization in MySQL