I am a software engineer (I finished my studies a few months ago) and for my work I am developing a large, scalable web application. Another firm does the programming work and builds the database behind it. We defined the data and the relations between them, but did not prescribe a hard database structure they should use.
Now the first (internal) parts are visible. I looked at the database structure behind it and saw something that (in my opinion) is weird.
For users they created a users table which contains fields like id, email and password. Next to this they created a user_meta table which contains an id, key and value.
When I have a user with:
userid
email
password
name
lastname
address
phone
the id, email and password are stored in the users table. For the other information, rows are created in the user_meta table. This means that in this case 4 rows are created for 1 user (in our case it is more than 20 rows for each user). This structure is used for every object that should be saved in the database.
I learned to create a table which contains all the data that is necessary for the user. So in my logic it should be one table with 7 fields...
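To make the comparison concrete, here is roughly what the two structures look like. I am guessing the column types, and I assume the meta rows are linked back to the user with a user_id column (the exact definitions the firm used may differ):

-- The structure the other firm built (types and user_id link are my guesses)
CREATE TABLE users (
    id       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email    VARCHAR(255) NOT NULL,
    password VARCHAR(255) NOT NULL
);

CREATE TABLE user_meta (
    id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,      -- assumed link back to users.id
    `key`   VARCHAR(64) NOT NULL,       -- 'name', 'lastname', 'address', 'phone', ...
    value   TEXT,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

-- The single table I had in mind (named differently here only so both
-- variants can be shown side by side)
CREATE TABLE users_flat (
    id       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email    VARCHAR(255) NOT NULL,
    password VARCHAR(255) NOT NULL,
    name     VARCHAR(100),
    lastname VARCHAR(100),
    address  VARCHAR(255),
    phone    VARCHAR(32)
);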
Extra information:
Our software is built on the Laravel framework (on my advice)
Our software uses a MySQL database
Questions:
Is this a normal way to design a database for large systems? I never saw this during my studies or in other projects.
Is this best or bad practice?
In my opinion this is bad for performance, especially when the user count grows. Is this true? Or is it done to keep performance high (because not a whole record needs to be selected)? In our case we expect AT LEAST 100,000 users at the end of 2017, growing faster and faster as our project takes off. There is a material chance that we grow far above 1,000,000 users in a few years.
I think the reason they do it like this is that the "objects" can be changed very easily, but in my opinion it is always possible to add fields to a table, and you should NEVER delete fields (not even when the structure of the database makes it possible). Am I right about this?
Sorry if these are "noob questions", but this is the first big project of my life, so I lack some experience; still, I try to manage the project professionally. We will discuss this on Wednesday in a meeting, and I want to prepare myself a bit on this topic beforehand.
In my opinion this is bad for performance, especially when the user count grows. Is this true?
No.
There is a small difference in the insert/update cost which will not be dependent on the volume of data previously accumulated. Retrieval cost for a full record will be higher but slightly reduced for a partial record. At the end of the day, the performance differences are negligible as long as the record is still resolved in a single round trip to the DB (unlike a lot of trivial ORM implementations).
The biggest impact is functional. It's no effort to add, say, an attribute for title to the EAV table, but the normalized table may be knocked offline while the table is stretched to accommodate the wider records (or the rows are migrated). OTOH, with EAV you can't enforce a constraint at the database tier such as "every customer must have an email address".
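A minimal sketch of both points, reusing the hypothetical tables sketched in the question (column names assumed):

-- EAV: adding a 'title' attribute needs no schema change at all
INSERT INTO user_meta (user_id, `key`, value) VALUES (42, 'title', 'Dr.');

-- Normalized: adding the attribute means altering the table, which may lock
-- or rebuild it while the rows are widened or migrated
ALTER TABLE users_flat ADD COLUMN title VARCHAR(50) NULL;

-- Normalized: the database itself can require a value for an attribute
ALTER TABLE users_flat MODIFY lastname VARCHAR(100) NOT NULL;
-- EAV has no declarative equivalent: nothing forces a 'lastname' row to
-- exist in user_meta for every user.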
Is this best or bad practice?
Of itself neither. Although not documenting design decisions is bad practice.
far above 1,000,000 users in a few years
A million records is not a lot (unless the database design is really bad).
Related
First of all, I apologize if a similar question has been asked and answered. I searched and found similar questions, but not one quite close enough.
My question is basically whether or not it is a good idea to separate tables of virtually the same data in my particular circumstance. The tables track product licensing data for two very different groups: individual users and enterprise users. I am thinking of separating them into two tables so that the user verification process runs faster, especially for individual users, since the number of records is significantly lower (e.g. ~500 individual records vs ~10,000 enterprise records). Lastly, there is a significant difference in the user types that isn't apparent in the table structure: individual users all have a fixed number of activations, while enterprise users may have up to unlimited activations and the purpose of tracking is more for activation stats.
The reason I think separating the tables would be a good idea is that each table would be smaller, resulting in faster queries (at least I think it would...). On the other hand, I will have to do two queries to obtain analytical data. Additionally, I may wish to change the data I am tracking from time to time, and obviously this is more of a pain with two duplicate tables.
I am guessing the query time difference is probably insignificant, even with tens of thousands of records? However, I would like to hear people's thoughts on this (mainly regarding efficiency and overall best practices) if they would be so kind as to share.
Thanks in advance!
When designing your database structure, you should try to normalize your data as much as possible. So, to answer your question:
"whether or not it is a good idea to separate tables of virtually the same data in my particular circumstance."
If you normalize your database correctly, the answer is no, it's not a good idea to create two tables with almost identical information. With normalization you should be able to separate out similar data into mapping tables which will allow you to create more complex queries that will run faster.
A very basic example of this kind of normalization would be: you have a table of users, and in the table you have a column for role. Instead of storing the literal word "admin" or "member", you store an id that is mapped to another table called roles, where 1 = admin and 2 = member. The idea is that it is more efficient to store repeated ids rather than repeated words like admin and member.
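A minimal sketch of that idea (table and column names are only illustrative):

CREATE TABLE roles (
    id   TINYINT UNSIGNED PRIMARY KEY,
    name VARCHAR(32) NOT NULL            -- 'admin', 'member', ...
);

CREATE TABLE users (
    id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email   VARCHAR(255) NOT NULL,
    role_id TINYINT UNSIGNED NOT NULL,   -- repeated small id instead of a repeated word
    FOREIGN KEY (role_id) REFERENCES roles(id)
);

INSERT INTO roles (id, name) VALUES (1, 'admin'), (2, 'member');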
On the site I am currently working on, members can favorite other members. Then when a member goes to their favorites page they can see all the members they have favorited over time.
I can go about this in 2 ways:
Method #1:
Every time a user favorites another I enter a row in the favorites table which looks like this (the index is user_favoriting_id):
id | user_favorited_id | user_favoriting_id
-------------------------------------------
Then when they load the "My Favorites" page, I do a select on the favorites table to find all the rows where the user_favoriting_id value equals to that of the present logged in user. I then take the user_favorited_ids to build out a single SELECT statement and look up the respective users from a separate users table.
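For clarity, a rough sketch of what I mean (the column types are assumed):

CREATE TABLE favorites (
    id                 INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_favorited_id  INT UNSIGNED NOT NULL,
    user_favoriting_id INT UNSIGNED NOT NULL,
    KEY idx_favoriting (user_favoriting_id)
);

-- Step 1: find everyone the logged-in user (id 42 here) has favorited
SELECT user_favorited_id FROM favorites WHERE user_favoriting_id = 42;

-- Step 2 (or both steps folded into one join): look those users up
SELECT u.*
FROM users u
JOIN favorites f ON f.user_favorited_id = u.id
WHERE f.user_favoriting_id = 42;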
Method #2:
Each time a user favorites another I will update the favorites field on their row in the users table, which looks something like this (albeit with more fields, the index is id):
id | username | password | email | account_status | timestamp | favorites
--------------------------------------------------------------------------
I will CONCAT the id of the user being favorited onto the favorites field, so that column will hold a comma-separated string like so:
10,44,67 etc...
Then to produce the My Favorites page like method #1 I will just grab all the favorite users with one select. That part is the same.
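Sketched out, that would look something like this (assuming favorites is a nullable TEXT column):

-- Append user 67 to the CSV list of the logged-in user (id 42)
UPDATE users
SET favorites = IF(favorites IS NULL OR favorites = '', '67', CONCAT(favorites, ',67'))
WHERE id = 42;

-- After exploding '10,44,67' in PHP, fetch the favorited users in one select
SELECT * FROM users WHERE id IN (10, 44, 67);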
I know method #1 is the normalized way to do it and is much prettier. But my concern for this particular project is scalability and performance above anything else.
If I choose method #2, it will reduce having to look up on the separate favorites table, since the users table will have to be selected anyway as soon as the user logs in.
And I'm pretty sure using PHP's explode function to split up those CSV values in method #2 wouldn't take nearly as much time as method #1's additional db lookup on the favorites table, but just in case I must ask:
From a purely performance perspective, which of these methods is more optimized?
Also please assume that this website will get a trillion page views a day.
You say that scalability is a concern. This seems to imply that Method #2 won't work for you, because that limits the number of favorites that a user can have. (For example, if you have a million users, then most users will have five-digit IDs. How wide do you want to let favorites be? If it's a VARCHAR(1000), that means that fewer than 200 favorites are allowed.)
Also, do you really expect that you will never want to know which users have "favorited" a given user? Your Method #2 might be O.K. if you know that you will always look up favoritings by "favoriter" rather than "favoritee", but it falls apart completely otherwise. (And even here, it only makes sense if you don't expect to lookup anything meaningful about the "favoritee" aside from his/her user-ID; otherwise, if you actually lookup the "favoritees", then you're basically doing all the hard work of a JOIN, just removing any opportunity for MySQL to do the JOIN intelligently.)
Overall, it's better to start out with best-practices, such as normalization, and to move away from them only when performance requires it. Otherwise something that seems like a performance optimization can have negative consequences, forcing you to write very un-optimal code further down the line.
JOINs take time, but I wouldn't make the change until you have some data that suggests that it's necessary.
Normalization is good for a number of reasons; it's not just an academic exercise.
Concatenation of IDs into a column is a heinous crime against normalization. Don't do it.
You're assuming that your code is faster than all the work that's been done to optimize relational databases. That's a big mistake.
Make sure that you have indexes on primary and foreign keys that participate in the JOINs (a quick sketch follows after these points).
Profile your app when you have real performance issues; don't guess.
Make sure that the real problem isn't your app. Bringing back too much unnecessary information will be more of a performance drag than a normalized schema.
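For example (reusing the favorites table from Method #1; the index names are only illustrative):

ALTER TABLE favorites
    ADD INDEX idx_fav_by_user (user_favoriting_id),
    ADD INDEX idx_fav_of_user (user_favorited_id);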
Use both. One (the normalized approach) is preferable from a data normalization, maintainability, and data integrity perspective (and for other reasons as well) - you should always strongly favor this approach.
But there's no reason not to use the other approach as well if the normalized approach is not acceptable for read performance. Often an alternative, denormalized approach will be better for read performance. So, use the first one as the "master" for keeping track of the data and for ensuring data integrity, and then keep a denormalized "copy" of the data in the other structure for read access... Update the copy from the master any time it changes (inserts, updates, deletes).
But measure the performance of your alternative approach to ensure it is indeed faster, and by a margin sufficient to justify its use.
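A minimal sketch, assuming the normalized favorites table from Method #1 is the master and a hypothetical favorites_csv column on users is the read copy (both statements run together by the application, ideally in one transaction):

-- Master: one row per favorite
INSERT INTO favorites (user_favorited_id, user_favoriting_id) VALUES (67, 42);

-- Copy: rebuild the denormalized CSV for that user from the master
UPDATE users
SET favorites_csv = (SELECT GROUP_CONCAT(user_favorited_id)
                     FROM favorites
                     WHERE user_favoriting_id = 42)
WHERE id = 42;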
As far as I know, denormalization in MySQL is really trivial. But if you were using something that is not an RDBMS, such as CouchDB or MongoDB, there is a whole engine for manipulating data in a safe way, and it's really scalable; a non-relational database can work much faster for you.
The only method I prefer for optimizing a web app that uses MySQL, for example, is to denormalize the table and then hand some of the work to PHP; and of course, using HipHop you will get some really big optimization there, because you offload MySQL and load PHP, which with HipHop can be optimized by up to 50%!
Probably not, but it would totally screw your database up for reasons that others have already cited.
Do NOT use a comma-separated-list-of-IDs pattern. It just sucks.
I strongly suspect that you won't have enough users on your site for this to matter, as unless you're Facebook you are unlikely to have > 1M users. Most of those 1M users won't choose anybody as their favourite (because most will be casual users who don't use that feature).
So you're looking at an extremely small table (say, 1M rows maximum if your 1M users have an average of 1 favourite, although most don't use the feature at all) with only two columns. You can potentially improve the scans in InnoDB by making the primary key start with the thing that you most commonly want to search by, BUT - get this - you can still add a secondary index on the other one, and get reasonable lookup times (actually, VERY quick, as the table will fit into memory on the tiniest server!)
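In a sketch, using the layout from Method #1 in the question:

CREATE TABLE favorites (
    user_favoriting_id INT UNSIGNED NOT NULL,
    user_favorited_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_favoriting_id, user_favorited_id),  -- clustered: cheap "my favorites" scans
    KEY idx_favorited (user_favorited_id)                 -- secondary index for the reverse lookup
) ENGINE=InnoDB;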
I'm building a web-app with PHP on an Apache server.
The app contains a lot of optional data about persons. Depending on the category of the person (one person can be in many categories), they can choose to specify data or not: home address (== 5 fields for street, city, country, ...), work address (again 5 fields), age, telephone number, .... The app stores some additional data too, of course (created, last updated, username, password, userlevel, ...).
The current/outdated version of the app has 86 fields in the "users" table, and is (depending on the category of the person) extended with an additional table with another 23 fields (1-1 relationship).
All this is stored in a Postgresql database.
I'm wondering if this is the best way to handle this type of data. Most records have (a lot of) empty fields, making the db larger and the queries slower. Is it worth looking into another solution like a triple store, or am I worrying too much about it and should I just keep the current setup? It seems odd and feels awkward to just add fields to a table for every new purpose of the site. On the other hand, I have the impression that triple stores are not that common yet. Any pointers, or suggestions on how to approach this?
I've read "Programming the Semantic Web" by Toby Segaran and others, but from that book I get the impression that the main advantage of triple stores and RDF is the exchange of information over the web (which is not the goal of my app).
Most records have (a lot of) empty fields
This implies that your data is far from normalized.
The current/outdated version of the app has 86 fields in the "users" table, and is (depending on the category of the person) extended with an additional table with another 23 fields (1-1 relationship).
Indeed, yes, it's a very long way from being normalized.
If you've got a good reason to move away from where you are just now, then the first step would be to structure your data much better, even if you choose to move to a different type of DBMS, e.g. a NoSQL or object db.
This does not just save space in your DBMS, it makes retrieving the data faster and reduces the amount of code you need to write (e.g. you can re-use the same code for maintaining a home address as maintaining a work address if you have a single table for 'address' with a field flagging the type of address).
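For example, a single address table could look roughly like this (the column names are only illustrative, and person_id is assumed to point at your existing users/persons table):

CREATE TABLE address (
    id           SERIAL PRIMARY KEY,
    person_id    INTEGER NOT NULL,        -- references your existing users/persons table
    address_type VARCHAR(10) NOT NULL,    -- 'home' or 'work'
    street       VARCHAR(255),
    city         VARCHAR(100),
    country      VARCHAR(100)
);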
There are lots of resources on the web describing how to apply the rules of normalization (it starts getting a little involved after 1, 2 and 3 - but if you can master these then you're well equipped to take on most tasks).
Next to your normal user table "user" (user_id/user_email/user_pwd/etc), what is the best way to store profile information?
Would one just add fields to the user table like "user"
(user_id/user_email/user_pwd/user_firstname/user_lastname/user_views/etc)
or create another table called "profiles"
(profile_id/user_id/user_firstname/user_lastname/user_views/etc)
or would one go for a table with property definitions and another table to store those values?
I know the last one is the most flexible, as you can add and remove fields easily.
But for a big site (50k users up) would this be fast?
Things to consider with your approaches
Storing User Profile in Users Table
This is generally going to be the fastest approach in terms of getting at the profile data, although you may have a lot of redundant data in here (columns that may not have any information in them).
Quick (especially if you only pull columns you need from the db)
Wasted Data
More difficult to work with / maintain (arguably with interfaces such as PHPMyAdmin)
Storing User Profile in User_Profile Table 1-1 relationship to users
Should still be quite quick with a join and you may eliminate some data redundancy if user profiles aren't created unless a user fills one in.
Easier to work with
Ever so slightly slower due to join (or 2nd query)
Storing User Profile as properties and values in tables
i.e. a table to store the possible options, and a table to store user_id, option_id and value
No redundant data stored, all data is relevant
Most normalised method
Slower to retrieve and update data
My impression is that most websites use the 2nd method and store profile information in a second table. It's common for larger websites (Twitter, Facebook) to de-normalize the database to achieve greater read performance at the expense of slower write performance.
I would think that keeping the profile information in a second table is likely the way to go when you are looking at 50,000 records. For optimum performance you want to keep data that is written heavily separated from data that is read-heavy to ensure the cache can work effectively.
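A rough sketch of that second approach, reusing the column names from the question (types are assumed):

CREATE TABLE users (
    user_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_email VARCHAR(255) NOT NULL,
    user_pwd   VARCHAR(255) NOT NULL
);

CREATE TABLE profiles (
    profile_id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id        INT UNSIGNED NOT NULL UNIQUE,   -- 1-1 with users
    user_firstname VARCHAR(100),
    user_lastname  VARCHAR(100),
    user_views     INT UNSIGNED NOT NULL DEFAULT 0,
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- One join (or a second keyed query) pulls the profile in when it is needed
SELECT u.user_id, u.user_email, p.user_firstname, p.user_lastname
FROM users u
LEFT JOIN profiles p ON p.user_id = u.user_id
WHERE u.user_id = 42;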
A table with property definitions isn't a good idea. I suggest using three tables to store the data:
user(id,login,email,pwd, is_banned, expired, ...)
-- rarely changed, keep small, extremely fast search, easy to cache, admin data
profile(id, user_id, firstname,lastname, hobby,description, motto)
--data often changed by user,...
user_stats(id,user_id,last_login,first_login,post_counter, visit_counter, comment_counter)
-- counters are very often updated; DML invalidates the cache
The better way to store authorisation and authentication data is LDAP.
You need way more than 3 tables. How will he store data like multiple emails, multiple addresses, multiple educational histories, multiple "looking for" relationships, etc.? Each needs its own row, assuming many values will be lookups like city, sex preference, school names, etc. So either normalize it fully or go the NoSQL route; there is no point in hanging in the middle, as you will lose the best of both worlds.
You can duplicate rows, but it won't be good. Social networks do not live with 50,000 users. Either you will be successful and have millions of users, or you will crash and close it, because to run these you need $$$, which will only come if you have a solid user base. With only 50,000 users for life, investors won't invest and ad revenues won't cover the cost, so you will close it. So design it like you want to be the next Facebook right from day one. Think big!