In the site I am currently working on members can favorite other members. Then when a member goes to their favorites page they can see all the members they have favorited throughout time.
I can go about this in 2 ways:
Method #1:
Every time a user favorites another I enter a row in the favorites table which looks like this (the index is user_favoriting_id):
id | user_favorited_id | user_favoriting_id
-------------------------------------------
Then when they load the "My Favorites" page, I do a select on the favorites table to find all the rows where the user_favoriting_id value equals to that of the present logged in user. I then take the user_favorited_ids to build out a single SELECT statement and look up the respective users from a separate users table.
Method #2:
Each time a user favorites another I will update the favorites field on their row in the users table, which looks something like this (albeit with more fields, the index is id):
id | username | password | email | account_status | timestamp | favorites
--------------------------------------------------------------------------
I will CONCAT the id of the user being favorited in the favorites field so that column will hold a comma separated string like so:
10,44,67 etc...
Then to produce the My Favorites page like method #1 I will just grab all the favorite users with one select. That part is the same.
I know method #1 is the normalized way to do it and is much prettier. But my concern for this particular project is scalability and performance above anything else.
If I choose method #2, it will reduce having to look up on the separate favorites table, since the users table will have to be selected anyway as soon as the user logs in.
And I'm pretty sure using php's explode function to split up those CSV values in method #2 wouldn't take nearly as much time as method #1's additional db look up on the favorites table, but just in case I must ask:
From a purely performance perspective, which of these methods is more optimized?
Also please assume that this website will get a trillion page views a day.
You say that scalability is a concern. This seems to imply that Method #2 won't work for you, because that limits the number of favorites that a user can have. (For example, if you have a million users, then most users will have five-digit IDs. How wide do you want to let favorites be? If it's a VARCHAR(1000), that means that fewer than 200 favorites are allowed.)
Also, do you really expect that you will never want to know which users have "favorited" a given user? Your Method #2 might be O.K. if you know that you will always look up favoritings by "favoriter" rather than "favoritee", but it falls apart completely otherwise. (And even here, it only makes sense if you don't expect to lookup anything meaningful about the "favoritee" aside from his/her user-ID; otherwise, if you actually lookup the "favoritees", then you're basically doing all the hard work of a JOIN, just removing any opportunity for MySQL to do the JOIN intelligently.)
Overall, it's better to start out with best-practices, such as normalization, and to move away from them only when performance requires it. Otherwise something that seems like a performance optimization can have negative consequences, forcing you to write very un-optimal code further down the line.
JOINs take time, but I wouldn't make the change until you have some data that suggests that it's necessary.
Normalization is good for a number of reasons; it's not just an academic exercise.
Concatenation of IDs into a column is a heinous crime against normalization. Don't do it.
You're assuming that your code is faster than all the work that's been done to optimize relational databases. That's a big mistake.
Make sure that you have indexes on primary and foreign keys that participate in the JOINs.
Profile your app when you have real performance issues; don't guess.
Make sure that the real problem isn't your app. Bringing back too much unnecessary information will be more of a performance drag than a normalized schema.
Use Both, One (the normalized approach), is preferable from a data normalization, maintainability, and data integrity perspective, (and for other reasons as well) - you should always strongly favor this approach.
But there's no reason not to use the other approach as well if the normalized approach is not acceptable for read performance. Often an alternative, denormalized approach will be better for read performance. So, use the first one as the "master" for keeping track of the data and for ensuring data integrity, and then keep a denormalized "copy" of the data in the other structure for read access... Update the copy from the master any time it changes, (inserts updates, deletes).
But measure the performance of your alternative approach to ensure it is indeed faster, and by a margin sufficient to justify its use.
As far as I know using dernomalization in mysql is really trivial. but if you'd be using something like not a RDBMS but db like couchdb or mongoDB there's the whole engine how to manipulate data in a safe way. And it's really scalable, non relational database will work for you really faster..
The only method which a prefer for optimizing webapp which uses mysql for example, is to dernomalize table and then give some job to php, and ofcourse using HipHop you will get some really big optimization in there, because you offloaded mysql and loaded php which with HipHop will be optimized up to 50%!
Probably not, but it would totally screw your database up for reasons that others have already cited.
Do NOT use a comma-separated-list-of-IDs pattern. It just sucks.
I strongly suspect that you won't have enough users on your site for this to matter, as unless you're Facebook you are unlikely to have > 1M users. Most of those 1M users won't choose anybody as their favourite (because most will be casual users who don't use that feature).
So you're looking at an extremely small table (say, 1M rows maximum if your 1M users have an average of 1 favourite, although most don't use the feature at all) with only two columns. You can potentially improve the scans in innodb by making the primary key start with the thing that you most commonly want to search by, BUT - get this - you can still add a secondary index on the other one, and get reasonable lookup times (actually, VERY quick as the table will fit into memory on the tiniest server!)
Related
I am in the process of creating a website where I need to have the activity for a user (similar to your inbox in stackoverflow) stored in sql. Currently, my teammates and I are arguing over the most effective way to do this; so far, we have come up with two alternate ways to do this:
Create a new table for each user and have the table name be theirusername_activity. Then when I need to get their activity (posting, being commented on, etc.) I simply get that table and see the rows in it...
In the end I will have a TON of tables
Possibly Faster
Have one huge table called activity, with an extra field for their username; when I want to get their activity I simply get the rows from that table "...WHERE username=".$loggedInUser
Less tables, cleaner
(assuming I index the tables correctly, will this still be slower?)
Any alternate methods would also be appreciated
"Create a new table for each user ... In the end I will have a TON of tables"
That is never a good way to use relational databases.
SQL databases can cope perfectly well with millions of rows (and more), even on commodity hardware. As you have already mentioned, you will obviously need usable indexes to cover all the possible queries that will be performed on this table.
Number 1 is just plain crazy. Can you imagine going to manage it, and seeing all those tables.
Can you imagine the backup! Or the dump! That many create tables... that would be crazy.
Get you a good index, and you will have no problem sorting through records.
here we talk about MySQL. So why would it be faster to make separate tables?
query cache efficiency, each insert from one user would'nt empty the query cache for others
Memory & pagination, used tables would fit in buffers, unsued data would easily not be loaded there
But as everybody here said is semms quite crazy, in term of management. But in term of performances having a lot of tables will add another problem in mySQL, you'll maybe run our of file descriptors or simply wipe out your table cache.
It may be more important here to choose the right engine, like MyIsam instead of Innodb as this is an insert-only table. And as #RC said a good partitionning policy would fix the memory & pagination problem by avoiding the load of rarely used data in active memory buffers. This should be done with an intelligent application design as well, where you avoid the load of all the activity history by default, if you reduce it to recent activity and restrict the complete history table parsing to batch processes and advanced screens you'll get a nice effect with the partitionning. You can even try a user-based partitioning policy.
For the query cache efficiency, you'll have a bigger gain by using an application level cache (like memcache) with history-per-user elements saved there and by emptying it at each new insert .
You want the second option, and you add the userId (and possibly a seperate table for userid, username etc etc).
If you do a lookup on that id on an properly indexed field you'd only need something like log(n) steps to find your rows. This is hardly anything at all. It will be way faster, way clearer and way better then option 1. option 1 is just silly.
In some cases, the first option is, in spite of not being strictly "the relational way", slightly better, because it makes it simpler to shard your database across multiple servers as you grow. (Doing this is precisely what allows wordpress.com to scale to millions of blogs.)
The key is to only do this with tables that are entirely independent from a user to the next -- i.e. never queried together.
In your case, option 2 makes the most case: you'll almost certainly want to query the activity across all or some users at some point.
Use option 2, and not only index the username column, but partition (consider a hash partition) on that column as well. Partitioning on username will provide you some of the same benefits as the first option and allow you to keep your sanity. Partitioning and indexing the column this way will provide a very fast and efficient means of accessing data based on the username/user_key. When querying a partitioned table, the SQL Engine can immediately lop off partitions it doesn't need to scan as it can tell based off of the username value queried vs. the ability of that username to reside within a partition. (in this case only one partition could contain records tied to that user) If you have a need to shard the table across multiple servers in the future, partitioning doesn't hinder that ability.
You will also want to normalize the table by separating the username field (and any other elements in the table related to username) into its own table with a user_key. Ensure a primary key on the user_key field in the username table.
This majorly depends now on where you need to retrieve the values. If its a page for single user, then use first approach. If you are showing data of all users, you should use single table. Using multiple table approach is also clean but in sql if the number of records in a single table are very high, the data retrieval is very slow
I'm building a very large website currently it uses around 13 tables and by the time it's done it should be about 20.
I came up with an idea to change the preferences table to use ID, Key, Value instead of many columns however I have recently thought I could also store other data inside the table.
Would it be efficient / smart to store almost everything in one table?
Edit: Here is some more information. I am building a social network that may end up with thousands of users. MySQL cluster will be used when the site is launched for now I am testing using a development VPS however everything will be moved to a dedicated server before launch. I know barely anything about NDB so this should be fun :)
This model is called EAV (entity-attribute-value)
It is usable for some scenarios, however, it's less efficient due to larger records, larger number or joins and impossibility to create composite indexes on multiple attributes.
Basically, it's used when entities have lots of attributes which are extremely sparse (rarely filled) and/or cannot be predicted at design time, like user tags, custom fields etc.
Granted I don't know too much about large database designs, but from what i've seen, even extremely large applications store their things is a very small amount of tables (20GB per table).
For me, i would rather have more info in 1 table as it means that data is not littered everywhere, and that I don't have to perform operations on multiple tables. Though 1 table also means messy (usually for me, each object would have it's on table, and an object is something you have in your application logic, like a User class, or a BlogPost class)
I guess what i'm trying to say is that do whatever makes sense. Don't put information on the same thing in 2 different table, and don't put information of 2 things in 1 table. Stick with 1 table only describes a certain object (this is very difficult to explain, but if you do object oriented, you should understand.)
nope. preferences should be stored as-they-are (in users table)
for example private messages can't be stored in users table ...
you don't have to think about joining different tables ...
I would first say that 20 tables is not a lot.
In general (it's hard to say from the limited info you give) the key-value model is not as efficient speed wise, though it can be more efficient space wise.
I would definitely not do this. Basically, the reason being if you have a large set of data stored in a single table you will see performance issues pretty fast when constantly querying the same table. Then think about the joins and complexity of queries you're going to need (depending on your site)... not a task I would personally like to undertake.
With using multiple tables it splits the data into smaller sets and the resources required for the query are lower and as an extra bonus it's easier to program!
There are some applications for doing this but they are rare, more or less if you have a large table with a ton of columns and most aren't going to have a value.
I hope this helps :-)
I think 20 tables in a project is not a lot. I do see your point and interest in using EAV but I don't think it's necessary. I would stick to tables in 3NF with proper FK relationships etc and you should be OK :)
the simple answer is that 20 tables won't make it a big DB and MySQL won't need any optimization for that. So focus on clean DB structures and normalization instead.
Apologies in advance if this is a silly question but I'm wondering which might be faster/better in the following simplified scenario...
I've got registered users (in a users table) and I've got countries (in a countries table) roughly as follows:
USERS TABLE:
user_id (PK, INT) | country_id (FK, TINYINT) | other user-related fields...
COUNTRIES TABLE:
country_id (PK, TINYINT) | country_name (VARCHAR) | other country-related fields...
Now, every time I need to display a user's country, I need to do a MySQL join. However, I often need to do lots of other joins with regard to the users and the big picture seems quite "join-heavy".
I'm wondering what the pros & cons might be of taking the countries out of the database and sticking them into a class as an array, from which I could easily retrieve them with public method calls using country_id? Would there be a speed advantage/disadvantage?
Thanks a lot.
EDIT: Thanks for the all the views, very useful. I'll pick the first answer as the accepted solution although all contributions are valued.
Do you have a serious problem performance problem now? I recently went through a performance improvement on a php/mysql website I developed for my company. Certain areas were too slow, and it turned out a lot of fault was with the queries themselves. I used timers to figure out which queries were slow, and I reorganized them (added indexes, etc). In a few cases, it was faster to make two separate queries and join them in php (I had some pretty complicated joins).
Do not try to optimize until you know you have a problem. Figure out if you have a problem first by measuring it, and then if you need to rearrange your queries you will be able to know if you made an improvement.
It would ease stress on your MySQL server to have less JOIN statements, but not significantly so (there aren't that many countries in the world). However, you'll make up that time in the fact that you'll have to implement the JOIN yourself in PHP. And since you're writing it yourself, you will probably write it less efficiently than the SQL statement, which means that it will take more time. I would recommend keeping it in the SQL server, since the advantages of moving it out are so few (and if the PHP instance and the MySQL instance are on the same box, there are not real advantages).
What you suggest should be faster. Granted, the join probably doesn't cost much, but looking it up in a dictionary should be just about free as far as compute power goes.
This is really just a trade off of memory for speed. The only downsides I could see would of course be the increased memory usage to store the country info and the fact that you would have to invalidate that cache if you ever update the countries table (which is probably not very often).
I don't think you'd gain anything from removing the join, as you'd have to iterate over all your result rows and manually lookup the country name, which I doubt would be quicker than MySQL can do.
I also would not consider such an approach for the following reason: If you want to change the name of a country (say you've got a typo), you can do so just by updating a row in the database. But if the names of the countries are in your PHP code, you'd have to redeploy the code in order to make a change. I don't know PHP, but that might not be as straightforard than a DB change in a production system.
So for maintainability reasons, IMHO let the DB do the work.
The general rule in a database world is to NORMALIZED first (results in more tables) and figure performance issues later.
You will want to DENORMALIZED only for simplicity of code, not for performance. Use indexes and stored procedures. DBMS are designed to optimize on joins.
The reason not "normalize as you go" is that you would have to modify the code you already have written most every time you modify the database design.
I have a pretty large social network type site I have working on for about 2 years (high traffic and 100's of files) I have been experimenting for the last couple years with tweaking things for max performance for the traffic and I have learned a lot. Now I have a huge task, I am planning to completely re-code my social network so I am re-designing mysql DB's and everything.
Below is a photo I made up of a couple mysql tables that I have a question about. I currently have the login table which is used in the login process, once a user is logged into the site they very rarely need to hit the table again unless editing a email or password. I then have a user table which is basicly the users settings and profile data for the site. This is where I have questions, should it be better performance to split the user table into smaller tables? For example if you view the user table you will see several fields that I have marked as "setting_" should I just create a seperate setting table? I also have fields marked with "count" which could be total count of comments, photo's, friends, mail messages, etc. So should I create another table to store just the total count of things?
The reason I have them all on 1 table now is because I was thinking maybe it would be better if I could cut down on mysql queries, instead of hitting 3 tables to get information on every page load I could hit 1.
Sorry if this is confusing, and thanks for any tips.
alt text http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
should I just create a seperate setting table?
So should I create another table to store just the total count of things?
There is not a single correct answer for this, it depends on how your application is doing.
What you can do is to measure and extrapolate the results in a dev environment.
In one hand, using a separate table will save you some space and the code will be easier to modify.
In the other hand you may lose some performance ( and you already think ) by having to join information from different tables.
About the count I think it's fine to have it there, although it is always said that is better to calculate this kind of stuff, I don't think for this situation it hurt you at all.
But again, the only way to know what's better your you and your specific app, is to measuring, profiling and find out what's the benefit of doing so. Probably you would only gain 2% of improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter-columns and frequently updated timestamps in its own table --- every time you bump them the entire row is written.
I wouldn't consider your user table terrible large in number of columns, just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removal of redundancy. Perhaps you have a lot of users who have the same settings, that would be a case for breaking the table out.
Should take into account the average size of a single row, in order to find out if the retrieval is expensive. Also, should try to use indexes as while looking for data...
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else... depends on the data saved there.
Also, as the socialnetworksite using this data also handles auth and autorization processes (guess so), the separation between login and user tables should offer a good performance, 'cause the data on login is "short enough", while the access to the profile could be done only once, inmediately after the successful login. Just do the right tricks to improve DB performance and it's done.
(Remember to visualize tables as entities, name them as an entity, not as a collection of them)
Two things you will want to consider when deciding whether or not you want to break up a single table into multiple tables is:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths that will help performance at the potential cost of disk space. One thing that from what I can tell is common is taking fixed length data and putting it in its own table while the variable length data will go somewhere else.
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time then it may not be worth splitting it up as you will be slowing down both inserts and quite potentially reads. However, if there is some data in that table that does not get accessed as often then that would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement but I do recall in a MySQL Performance talk given by Jay Pipes that he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is but regardless joins will usually take longer than queries out of a single table.
I'm designing a very simple (in terms of functionality) but difficult (in terms of scalability) system where users can message each other. Think of it as a very simple chatting service. A user can insert a message through a php page. The message is short and has a recipient name.
On another php page, the user can view all the messages that were sent to him all at once and then deletes them on the database. That's it. That's all the functionality needed for this system. How should I go about designing this (from a database/php point of view)?
So far I have the table like this:
field1 -> message (varchar)
field2 -> recipient (varchar)
Now for sql insert, I find that the time it takes is constant regardless of number of rows in the database. So my send.php will have a guaranteed return time which is good.
But for pulling down messages, my pull.php will take longer as the number of rows increase! I find the sql select (and delete) will take longer as the rows grow and this is true even after I have added an index for the recipient field.
Now, if it was simply the case that users will have to wait a longer time before their messages are pulled on the php then it would have been OK. But what I am worried is that when each pull.php service time takes really long, the php server will start to refuse connections to some request. Or worse the server might just die.
So the question is, how to design this such that it scales? Any tips/hints?
PS. Some estiamte on numbers:
number of users starts with 50,000 and goes up.
each user on average have around 10 messages stored before the other end might pull it down.
each user sends around 10-20 messages a day.
UPDATE from reading the answers so far:
I just want to clarify that by pulling down less messages from pull.php does not help. Even just pull one message will take a long time when the table is huge. This is because the table has all the messages so you have to do a select like this:
select message from DB where recipient = 'John'
even if you change it to this it doesn't help much
select top 1 message from DB where recipient = 'John'
So far from the answers it seems like the longer the table the slower the select will be O(n) or slightly better, no way around it. If that is the case, how should I handle this from the php side? I don't want the php page to fail on the http because the user will be confused and end up refreshing like mad which makes it even worse.
the database design for this is simple as you suggest. As far as it taking longer once the user has more messages, what you can do is just paginate the results. Show the first 10/50/100 or whatever makes sense and only pull those records. Generally speaking, your times shouldn't increase very much unless the volume of messages increases by an order of magnatude or more. You should be able to pull back 1000 short messages in way less than a second. Now it may take more time for the page to display at that point, but thats where the pagination should help.
I would suggest though going through and thinking of future features and building your database out a little more based on that. Adding more features to the software is easy, changing the database is comparatively harder.
Follow the rules of normalization. Try to reach 3rd normal form. To go further for this type of application probably isn’t worth it. Keep your tables thin.
Don’t actually delete rows just mark them as deleted with a bit flag. If you really need to remove them for some type of maintenance / cleanup to reduce size. Mark them as deleted and then create a cleanup process to archive or remove the records during low usage hours.
Integer values are easier for SQL server to deal with then character values. So instead of where recipient = 'John' use WHERE Recipient_ID = 23 You will gain this type of behavior when you normalize your database.
Don't use VARCHAR for your recipient. It's best to make a Recipient table with a primary key that is an integer (or bigint if you are expecting extremely large quantities of people).
Then when you do your select statements:
SELECT message FROM DB WHERE recipient = 52;
The speed retrieving rows will be much faster.
Plus, I believe MySQL indexes are B-Trees, which is O(log n) for most cases.
A database table without an index is called a heap, querying a heap results in each row of the table being evaluated even with a 'where' clause, the big-o notation for a heap is O(n) with n being the number of rows in the table. Adding an index (and this really depends on the underlying aspects of your database engine) results in a complexity of O(log(n)) to find the matching row in the table. This is because the index most certainly is implemented in a b-tree sort of way. Adding rows to the table, even with an index present is an O(1) operation.
> But for pulling down messages, my pull.php will take longer as the number of rows
increase! I find the sql select (and delete) will take longer as the rows grow and
this is true even after I have added an index for the recipient field.
UNLESS you are inserting into the middle of an index, at which point the database engine will need to shift rows down to accommodate. The same occurs when you delete from the index. Remember there is more than one kind of index. Be sure that the index you are using is not a clustered index as more data must be sifted through and moved with inserts and deletes.
FlySwat has given the best option available to you... do not use an RDBMS because your messages are not relational in a formal sense. You will get much better performance from a file system.
dbarker has also given correct answers. I do not know why he has been voted down 3 times, but I will vote him up at the risk that I may lose points. dbarker is referring to "Vertical Partitioning" and his suggestion is both acceptable and good. This isn't rocket surgery people.
My suggestion is to not implement this kind of functionality in your RDBMS, if you do remember that select, update, insert, delete all place locks on pages in your table. If you do go forward with putting this functionality into a database then run your selects with a nolock locking hint if it is available on your platform to increase concurrency. Additionally if you have so many concurrent users, partition your tables vertically as dbarker suggested and place these database files on separate drives (not just volumes but separate hardware) to increase I/O concurrency.
So the question is, how to design this such that it scales? Any tips/hints?
Yes, you don't want to use a relational database for message queuing. What you are trying to do is not what a relational database is best designed for, and while you can do it, its kinda like driving in a nail with a screwdriver.
Instead, look at one of the many open source message queues out there, the guys at SecondLife have a neat wiki where they reviewed a lot of them.
http://wiki.secondlife.com/wiki/Message_Queue_Evaluation_Notes
This is an unavoidable problem - more messages, more time to find the requested ones. The only thing you can do is what you already did - add an index and turn O(n) look up time for a complete table scan into O(log(u) + m) for a clustered index look up where n is the number of total messages, u the number of users, and m the number of messages per user.
Limit the number of rows that your pull.php will display at any one time.
The more data you transfer, longer it will take to display the page, regardless of how great your DB is.
You must limit your data in the SQL, return the most recent N rows.
EDIT
Put an index on Recipient and it will speed it up. You'll need another column to distinguish rows if you want to take the top 50 or something, possibly SendDate or an auto incrementing field. A Clustered index will slow down inserts, so use a regular index there.
You could always have only one row per user and just concatenate messages together into one long record. If you're keeping messages for a long period of time, that isn't the best way to go, but it reduces your problem to a single find and concatenate at storage time and a single find at retrieve time. It's hard to say without more detail - part of what makes DB design hard is meeting all the goals of the system in a well-compromised way. Without all the details, its hard to give advice on the best compromise.
EDIT: I thought I was fairly clear on this, but evidently not: You would not do this unless you were blanking a reader's queue when he reads it. This is why I prompted for clarification.