I'm redesigning our database and building new tables to hold data about files users have uploaded. The overarching issue is that users may upload many different types of files: for example an mp3 file as a song, a profile picture, a profile cover photo, etc. I'm running into a few design and practical issues, and am trying to figure out the best way to do this. At the moment the main design looks something like this:
ID | name | type | amazon_S3_info
ID: an auto-incrementing ID for every new upload.
name: name of the upload, e.g. the file name
type: what type of upload it is, for example a profile picture, a cover photo, an audio file, etc.
amazon_S3_info: I'm storing all the files in S3, and this field holds the data I need to generate the URL. I can't store a URL here, since I'm using signed URLs, which always need to be regenerated from the data stored in this field.
After creating a table like this I could then just make matching tables where I, for example, create the relationships between a user ID and the upload ID of the profile picture they uploaded, etc. which is pretty simple.
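A minimal sketch of that single-table layout, using Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere. The names user_profile_pictures, s3_info, add_upload and s3_info_for are made up for illustration; the point is only that one ID space lets a single lookup function serve every upload type.

```python
import sqlite3

# SQLite stands in for MySQL; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uploads (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- one ID space for every upload
    name    TEXT NOT NULL,                      -- e.g. the file name
    type    TEXT NOT NULL,                      -- 'profile_picture', 'audio', ...
    s3_info TEXT NOT NULL                       -- data used to rebuild a signed URL
);
CREATE TABLE user_profile_pictures (            -- hypothetical matching table
    user_id   INTEGER NOT NULL,
    upload_id INTEGER NOT NULL REFERENCES uploads (id),
    PRIMARY KEY (user_id, upload_id)
);
""")

def add_upload(name, upload_type, s3_info):
    """Insert one upload and return its auto-generated ID."""
    cur = conn.execute(
        "INSERT INTO uploads (name, type, s3_info) VALUES (?, ?, ?)",
        (name, upload_type, s3_info),
    )
    return cur.lastrowid

def s3_info_for(upload_id):
    """Look up the S3 data by ID alone, whatever the upload type is."""
    row = conn.execute(
        "SELECT s3_info FROM uploads WHERE id = ?", (upload_id,)
    ).fetchone()
    return row[0] if row else None

pic_id = add_upload("me.jpg", "profile_picture", "bucket/key-1")
song_id = add_upload("track.mp3", "audio", "bucket/key-2")
conn.execute("INSERT INTO user_profile_pictures VALUES (?, ?)", (42, pic_id))
```

Because every upload shares one auto-increment sequence, an ID never has to be paired with a type to find the row.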
My original idea was to break this up into multiple tables: one table for profile pictures, one for cover photos, etc. This would become a headache on the PHP side because I have one standard function that uses an ID to retrieve the file URL. With multiple tables, the same ID would exist once in each table, so an ID alone would no longer identify a file and my current URL retrieval would break. That function is already in use all over the site and would be a nuisance to redo; however, if it has to be done, it has to be done.
To be clear, the reason for splitting into several tables was speed. My logic is that it would be more efficient to break a single table that could hold 2,000,000 rows into 4 tables of 500,000 rows each, since it would be quicker to pull data from each of those 500,000-row tables. Or is that a false premise?
So my question to you lot is which database design is better, particularly when we are talking about scaling to be quite large?
With databases (and computers in general), you're usually worried about factors of 10, not just 2x or 3x.
So splitting up the table by type into multiple tables, say 5 tables altogether instead of 1, ultimately won't solve your performance issues once the data grows extremely large. And like you said, it's a programming pain. (Basically you'd be sharding manually without an algorithm. If you are going to shard, you might as well use a hash-based sharding algorithm to find the database/table.)
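For what a hash-based shard lookup might look like, here is a small sketch. The shard count, the uploads_N naming scheme and the choice of CRC32 are all assumptions made for the example, not something from the question.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(upload_id: int, num_shards: int = NUM_SHARDS) -> str:
    """Map an ID to a table name using a stable hash (CRC32 here)."""
    digest = zlib.crc32(str(upload_id).encode("ascii"))
    return f"uploads_{digest % num_shards}"

# The same ID always lands in the same table, so no lookup table is needed.
print(shard_for(123), shard_for(123))
```

Note that changing NUM_SHARDS moves most IDs to a different table, which is one reason this kind of splitting is painful to do by hand.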
The design you have is standard many to many. Index the tables correctly and that's the best you can do.
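As a rough illustration of what "index the tables correctly" buys you, the sketch below compares the query plan before and after adding an index. It uses SQLite's EXPLAIN QUERY PLAN (MySQL's EXPLAIN plays the same role); the user_uploads table and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_uploads (user_id INTEGER NOT NULL, upload_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO user_uploads VALUES (?, ?)", [(i % 100, i) for i in range(1000)]
)

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT upload_id FROM user_uploads WHERE user_id = 7"
plan_before = plan(query)   # full table scan
conn.execute(
    "CREATE INDEX idx_user_uploads_user ON user_uploads (user_id, upload_id)"
)
plan_after = plan(query)    # lookup through the new index
print(plan_before, plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should report a scan and the second should name the index.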
If performance becomes a problem, you need to scale horizontally. Relational datastores don't do this well, but NoSQL datastores do. You can have those types of references in NoSQL as well. If design changes are still possible, look into AWS DynamoDB (NoSQL service).
Edit: to respond to a comment...
@arian1123 In my experience, there's a point (a table size) at which MySQL suddenly starts performing poorly. The more hardware (especially memory) you have, the larger the tables can grow before this happens. (The killers are joins. If you don't join big tables to big tables, a big table by itself can probably grow very large with adequate hardware. I've dealt with tables of over 1 billion rows where only reads were done, with no joins, and it wasn't a problem.)
On your own laptop, you may see 100k-row tables performing fine and 1M-row tables not. If the data isn't going to grow any more and that's the hardware you'll have in production, then splitting would be a good idea. However, if the table size is always going to increase, to something like the 50M you mention, then splitting would only help if you could keep splitting indefinitely (for example, dividing the table again every 2 million rows). In your case you don't want to keep dividing 1 table into 4, then 20, then 100... so I think it'd be better to leave it as 1 table and, if it doesn't perform, look into other datastore types.
Related
Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has its own table that is created when that Boss signs up, i.e. work_bossID where bossID = the boss's unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
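The volume figures can be checked with a few lines of arithmetic. The number of work days per month is an assumption (21 here); the other numbers come straight from the scenario.

```python
bosses = 1_000
workers_per_boss = 10
orders_per_worker_per_day = 5      # "at least 5", so this is a lower bound
workdays_per_month = 21            # assumption; the scenario only says "work day"

orders_per_day = bosses * workers_per_boss * orders_per_worker_per_day
orders_per_month = orders_per_day * workdays_per_month
print(orders_per_day, orders_per_month)  # 50000 1050000
```

With one table per boss, each table would instead receive about 50 rows per work day.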
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talk of multiple servers and databases linked together... but again, let's focus on a single server here with a single MySQL DB.
If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition only one server (the master) can deal with update (or adding workorder) SQL requests. This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem as long as they are not overwhelmed by queries, because the server can keep most of the database on disk and cache only the recently accessed data in memory.
The other important thing for preventing any single query from running slowly is to add the right index for each query you might perform, so that you avoid linear scans. An index lets the database locate the required record(s) with a logarithmic-time lookup, much like a binary search, instead of reading every row.
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
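One way to picture a counter cache: keep a small table of per-boss counts and bump it in the same transaction as the insert. This is only a sketch using sqlite3 in place of MySQL; the boss_counters table and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workorders (id INTEGER PRIMARY KEY, boss_id INTEGER NOT NULL);
CREATE TABLE boss_counters (
    boss_id         INTEGER PRIMARY KEY,
    workorder_count INTEGER NOT NULL
);
""")

def add_workorder(boss_id):
    """Insert a work order and update the cached count in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO workorders (boss_id) VALUES (?)", (boss_id,))
        updated = conn.execute(
            "UPDATE boss_counters SET workorder_count = workorder_count + 1 "
            "WHERE boss_id = ?",
            (boss_id,),
        ).rowcount
        if updated == 0:  # first work order for this boss
            conn.execute("INSERT INTO boss_counters VALUES (?, 1)", (boss_id,))

def workorder_count(boss_id):
    """Read the cached count instead of running COUNT(*) on the big table."""
    row = conn.execute(
        "SELECT workorder_count FROM boss_counters WHERE boss_id = ?", (boss_id,)
    ).fetchone()
    return row[0] if row else 0

for _ in range(3):
    add_workorder(1)
add_workorder(2)
print(workorder_count(1), workorder_count(2))  # 3 1
```

The trade-off is an extra write per insert in exchange for count reads that no longer touch the large table.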
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there are some straightforward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There are many settings that work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance, whether that is InnoDB or MyISAM.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
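To make the "rotate old data into alternate tables" rule concrete, here is a small sketch that moves rows older than a cutoff date into an archive table. It uses sqlite3 rather than MySQL, and the workorders_archive table, the created_on column and archive_before are names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workorders (
    id INTEGER PRIMARY KEY, boss_id INTEGER NOT NULL, created_on TEXT NOT NULL
);
CREATE TABLE workorders_archive (
    id INTEGER PRIMARY KEY, boss_id INTEGER NOT NULL, created_on TEXT NOT NULL
);
""")
conn.executemany(
    "INSERT INTO workorders (boss_id, created_on) VALUES (?, ?)",
    [(1, "2023-11-15"), (1, "2023-12-20"), (2, "2024-01-05"), (2, "2024-02-10")],
)

def archive_before(cutoff):
    """Move rows dated before cutoff (ISO date string) into the archive table."""
    with conn:  # copy and delete succeed or fail together
        conn.execute(
            "INSERT INTO workorders_archive "
            "SELECT * FROM workorders WHERE created_on < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM workorders WHERE created_on < ?", (cutoff,)
        ).rowcount
    return moved

moved = archive_before("2024-01-01")
print(moved)  # 2
```

Reporting queries can still read workorders_archive, while everyday queries only touch the smaller working table.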
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
I had a database size problem as well in one of my networks: the table was so big that it used to slow the server down whenever I ran a query on it.
In my opinion, you should divide your data by date. Decide what table size would be too big for you, let's say 1 million entries, then calculate how long it will take you to reach that amount. Then have a script run once per that period to either create a new table named with the date and move all current data over, or just back the table up and empty it.
It's like putting outdated material in archives.
If you choose the first option, you'll be able to access the data for a given date easily by referring to that table.
Hope that idea helps.
Just create a workers table, a bosses table, a relationships table for the two, and then all of your other tables. A relationship structure like this is very dynamic, because if it ever got large enough you could create another relationship table between the work orders and the bosses or the workers.
You might want to look into bigints, but I doubt you'll need them. I know that the relationships table will get massive, but that's good DB design.
For reference, a MySQL BIGINT ranges from -9223372036854775808 to 9223372036854775807 when signed, and from 0 to 18446744073709551615 when UNSIGNED.
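Those limits are just the bounds of a 64-bit integer, which a quick calculation confirms:

```python
# A BIGINT is 8 bytes (64 bits).
signed_min = -(2 ** 63)
signed_max = 2 ** 63 - 1
unsigned_max = 2 ** 64 - 1
print(signed_min, signed_max, unsigned_max)
```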
So I'm working on a site that will replace an older site with a lot of traffic, and I will also have a lot of data in the DB, so my question to you guys is: what is the best way to design MySQL tables for growth?
I was thinking of splitting, let's say, a table with 5,000,000 rows into 5 tables with 1,000,000 rows each and creating a relationship between the tables, but I guess this isn't a good option, since I would spend a lot of resources and time figuring out which table my data is in.
Or can you guys give me some tips, maybe some useful articles?
No, you're absolutely right about the relationships. This technique is called normalization: you define separate tables because the data in each table changes over time independently of the other tables.
So if you have a hotel database that keeps track of rooms and guests, you know normalization is necessary because rooms and guests are independent of each other.
But you will have foreign keys/surrogate keys in each table (for instance, room_id) that relate a particular guest to a particular room.
Normalization, in your case, could help you optimize those 5,000,000 rows, as it would not be optimal for a loop to go over 5,000,000 elements and retrieve all of the data.
Here is a strong example for why normalization is essential in database management.
Partitioning as mentioned in a comment is one way to go, but the first path to check out is even determining if you can break down the tables with the large amounts of data into workable chunks based on some internal data.
For instance, let's say you have a huge table of contacts. You can essentially break down the data into contacts that start with a-d, e-j, etc. Then when you go to add records you just make sure you add them to the correct table (I'd suggest checking out stored procedures for handling this, so that the logic is kept in the database). You'd probably also set up stored procedures to get data from the same tables. By doing this, however, you have to realize that auto-incrementing IDs won't work correctly, as you won't be able to maintain unique IDs across all of the tables without doing some work yourself.
These of course are the simple solutions. There are tons of solutions for large data sets, which also include looking at other storage solutions, clustering, partitioning, etc. Doing some of these things manually yourself can give you a little bit of an understanding of some of the possible "manual solutions".
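The routing logic for that kind of split could look roughly like this. Only the a-d and e-j ranges come from the example; the k-p and q-z ranges, the table names and the contacts_other fallback are made up, and in practice the logic might live in a stored procedure rather than application code.

```python
import bisect

# Inclusive upper bound of each letter range and the table that holds it.
RANGE_ENDS = ["d", "j", "p", "z"]
TABLES = ["contacts_a_d", "contacts_e_j", "contacts_k_p", "contacts_q_z"]

def table_for(last_name: str) -> str:
    """Pick the table for a contact based on the first letter of the name."""
    first = last_name.strip().lower()[:1]
    if not ("a" <= first <= "z"):
        return "contacts_other"  # digits, punctuation, non-ASCII or empty names
    return TABLES[bisect.bisect_left(RANGE_ENDS, first)]

print(table_for("Adams"), table_for("Evans"), table_for("Zed"))
```

Every insert and every read has to go through this function, which is the bookkeeping cost of splitting by hand.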
I am making a website with a large pool of images added by users.
I want to choose randomly one image out of this pool, and display it to the user, but I want to make sure that this user has never seen this image before.
So I was thinking: when a user views an image, I make a row INSERT in MySQL that would say "This USER has watched THIS IMAGE at (TIME)" for every entry.
But the thing is, since there might be a lot of users and a lot of images, this table could easily grow to tens of thousands of entries quite rapidly.
So alternatively, it might be done like this:
I was thinking of making a row INSERT for every USER and, in ONE field, inserting an array of all the IDs of images that user has watched.
I can even do this to the array:
base64_encode(gzcompress(serialize($array)))
And then:
unserialize(gzuncompress(base64_decode($array)))
What do you think I should do?
Are the encoding/decoding functions fast enough, or at least faster than the conventional way I was describing at the beginning of the post?
Is that compression good enough to store large chunks of data in only ONE database field? (Imagine the user has viewed thousands of images.)
Thanks a lot
in ONE field, I insert an array all id's
In almost all cases, serializing values like this is bad practice. Let the database do what it's designed to do -- efficiently handle large amounts of data. As long as you ensure that your cross table has an index on the user field, retrieving the list of images that a user has seen will not be an expensive operation, regardless of the number of rows in the table. Tens of thousands of entries is nothing.
You should create a new table UserImageViews with columns user_id and image_id (additionally, you could add more information on the view, such as Date/Time, IP and Browser).
That will make queries like "What images the user has (not) seen" much faster.
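As a sketch of how such a table supports the "random image the user has not seen" query, the example below uses sqlite3 in place of MySQL. The images table and next_unseen_image are invented for illustration, and ORDER BY RANDOM() is SQLite syntax (MySQL spells it RAND()).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE images (image_id INTEGER PRIMARY KEY);
CREATE TABLE UserImageViews (
    user_id  INTEGER NOT NULL,
    image_id INTEGER NOT NULL REFERENCES images (image_id),
    PRIMARY KEY (user_id, image_id)      -- also serves as the index on user_id
);
""")
conn.executemany("INSERT INTO images VALUES (?)", [(i,) for i in range(1, 6)])

def next_unseen_image(user_id):
    """Pick a random unseen image, record the view, and return its ID."""
    with conn:
        row = conn.execute(
            """SELECT i.image_id FROM images AS i
               WHERE NOT EXISTS (SELECT 1 FROM UserImageViews AS v
                                 WHERE v.user_id = ? AND v.image_id = i.image_id)
               ORDER BY RANDOM() LIMIT 1""",
            (user_id,),
        ).fetchone()
        if row is None:
            return None  # the user has already seen every image
        conn.execute("INSERT INTO UserImageViews VALUES (?, ?)", (user_id, row[0]))
    return row[0]

shown = [next_unseen_image(7) for _ in range(6)]
print(shown)  # five distinct IDs in random order, then None
```

Sorting the whole pool randomly gets slow for very large image tables, so treat the ORDER BY as a placeholder for whatever selection strategy you measure to be fast enough.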
You should use a table. Serializing data into a single field in a database is bad practice, as the DBMS has no clue what that data represents and cannot use it in ANY queries. For example, if you wanted to see which users had viewed an image, you wouldn't be able to do so in SQL alone.
Tens of thousands of entries isn't much, BTW. The main application we develop has multiple tables with hundreds of thousands of records, and we're not that big. Some web applications have tables with millions of rows. Don't worry about having "too much data" unless it starts becoming a problem - the solutions for that problem will be complex and might even slow down your queries until you get to that amount of data.
EDIT: Oh yeah, and joins against those 100k+ tables happen in under a second. Just some perspective for ya...
I don't really think that tens of thousands of rows will be a problem for a database lookup. I would recommend the first approach over the second.
I want to choose randomly one image out of this pool, and display it
to the user, but I want to make sure that this user has never seen
this image before.
For what it's worth, that's not a random algorithm; that's a shuffle algorithm. (Knowing that will make it easier to Google when you need more details about it.) But that's not your biggest problem.
So i was thinking that: when a user views an image, I make a row
INSERT in MYSQL that would say "This USER has watched THIS IMAGE at
(TIME)" for every entry.
Good thought. Using a table that stores the fact that a user has seen a specific image makes sense in your case. Unless I've missed something, you don't need to store the time. (And you probably shouldn't. It doesn't seem to serve any useful business purpose.) Something along these lines should work well.
-- Predicate: User identified by [user_id] has seen image identified by
-- [image_filename] at least once.
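create table images_seen (
  user_id integer not null references users (user_id),
  -- The column type was missing; varchar(255) is an assumption and should
  -- match the type of images.image_filename.
  image_filename varchar(255) not null references images (image_filename),
  primary key (user_id, image_filename)
);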
Test that and look at the output of EXPLAIN. If you need a secondary index on image_filename . . .
create index images_seen_img_filename on images_seen (image_filename);
This still isn't your biggest problem.
The biggest problem is that you didn't test this yourself. If you know any scripting language, you should be able to generate 10,000 rows for testing in a matter of a couple of minutes. If you'd done that, you'd find that a table like that will perform well even with several million rows.
I sometimes generate millions of rows to test my ideas before I answer a question on Stack Overflow.
Learning to generate large amounts of random(ish) data for testing is a fundamental skill for database and application developers.
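Generating that kind of test data takes only a few lines in a scripting language. The sketch below fills an images_seen table with 10,000 random rows using sqlite3 in place of MySQL; the user and image counts and the file name pattern are arbitrary choices for the example.

```python
import random
import sqlite3

random.seed(0)  # reproducible test data
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE images_seen (
    user_id        INTEGER NOT NULL,
    image_filename TEXT NOT NULL,
    PRIMARY KEY (user_id, image_filename)
)""")

def generate_views(n_rows, n_users=1_000, n_images=5_000):
    """Yield random (user_id, image_filename) pairs; duplicates can occur."""
    for _ in range(n_rows):
        yield random.randrange(n_users), f"img_{random.randrange(n_images)}.jpg"

with conn:
    # INSERT OR IGNORE skips pairs that collide with the primary key
    # (MySQL's equivalent is INSERT IGNORE).
    conn.executemany(
        "INSERT OR IGNORE INTO images_seen VALUES (?, ?)", generate_views(10_000)
    )

row_count = conn.execute("SELECT COUNT(*) FROM images_seen").fetchone()[0]
seen_by_user_1 = conn.execute(
    "SELECT COUNT(*) FROM images_seen WHERE user_id = 1"
).fetchone()[0]
print(row_count, seen_by_user_1)
```

Raising the argument to generate_views to a few million is an easy way to see how lookups by user_id behave at scale on your own hardware.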
I'm building a very large website; it currently uses around 13 tables, and by the time it's done it should have about 20.
I came up with an idea to change the preferences table to use ID, Key, Value instead of many columns. However, I have recently thought that I could also store other data inside that table.
Would it be efficient / smart to store almost everything in one table?
Edit: Here is some more information. I am building a social network that may end up with thousands of users. MySQL Cluster will be used when the site is launched. For now I am testing on a development VPS, but everything will be moved to a dedicated server before launch. I know barely anything about NDB, so this should be fun :)
This model is called EAV (entity-attribute-value)
It is usable for some scenarios; however, it's less efficient due to larger records, a larger number of joins, and the impossibility of creating composite indexes on multiple attributes.
Basically, it's used when entities have lots of attributes which are extremely sparse (rarely filled) and/or cannot be predicted at design time, like user tags, custom fields etc.
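To show what the join cost looks like in practice, here is a small EAV sketch with sqlite3 standing in for MySQL. The user_id column and the pref_key/pref_value names are assumptions (KEY is a reserved word in MySQL, so the columns are renamed here).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE preferences (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,   -- hypothetical owner column
    pref_key   TEXT NOT NULL,
    pref_value TEXT NOT NULL,      -- every value is stored as text
    UNIQUE (user_id, pref_key)
);
""")
conn.executemany(
    "INSERT INTO preferences (user_id, pref_key, pref_value) VALUES (?, ?, ?)",
    [(1, "theme", "dark"), (1, "language", "en"),
     (2, "theme", "dark"), (2, "language", "de")],
)

def prefs_for(user_id):
    """Reassemble one user's preferences from several rows."""
    rows = conn.execute(
        "SELECT pref_key, pref_value FROM preferences WHERE user_id = ?", (user_id,)
    )
    return dict(rows.fetchall())

# Filtering on two attributes needs one self-join per extra attribute.
both = conn.execute("""
    SELECT a.user_id FROM preferences AS a
    JOIN preferences AS b ON b.user_id = a.user_id
    WHERE a.pref_key = 'theme' AND a.pref_value = 'dark'
      AND b.pref_key = 'language' AND b.pref_value = 'en'
""").fetchall()
print(prefs_for(1), both)
```

With one column per preference, the same filter would be a single WHERE clause on one row and could use a composite index.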
Granted, I don't know too much about large database designs, but from what I've seen, even extremely large applications store their data in a very small number of tables (20GB per table).
For me, I would rather have more info in 1 table, as it means that data is not littered everywhere and I don't have to perform operations on multiple tables. Though 1 table also means messy. (Usually, for me, each object would have its own table, where an object is something you have in your application logic, like a User class or a BlogPost class.)
I guess what I'm trying to say is: do whatever makes sense. Don't put information about the same thing in 2 different tables, and don't put information about 2 things in 1 table. Stick to the rule that 1 table describes only one kind of object. (This is very difficult to explain, but if you do object-oriented programming, you should understand.)
Nope. Preferences should be stored as they are (in the users table).
For example, private messages can't be stored in the users table...
You don't have to think about joining different tables...
I would first say that 20 tables is not a lot.
In general (it's hard to say from the limited info you give) the key-value model is not as efficient speed wise, though it can be more efficient space wise.
I would definitely not do this. The reason is that if you have a large set of data stored in a single table, you will see performance issues pretty fast when constantly querying that same table. Then think about the joins and the complexity of the queries you're going to need (depending on your site)... not a task I would personally like to undertake.
Using multiple tables splits the data into smaller sets, so the resources required for each query are lower, and as an extra bonus it's easier to program!
There are some applications for doing this, but they are rare: more or less, when you have a large table with a ton of columns and most of them aren't going to have a value.
I hope this helps :-)
I think 20 tables in a project is not a lot. I do see your point and interest in using EAV but I don't think it's necessary. I would stick to tables in 3NF with proper FK relationships etc and you should be OK :)
The simple answer is that 20 tables won't make it a big DB, and MySQL won't need any optimization for that. So focus on clean DB structures and normalization instead.
I have a pretty large social network type site that I have been working on for about 2 years (high traffic and 100's of files). I have been experimenting for the last couple of years with tweaking things for maximum performance under that traffic, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am redesigning the MySQL DBs and everything.
Below is a photo I made of a couple of MySQL tables that I have a question about. I currently have the login table, which is used in the login process; once a user is logged into the site, they very rarely need to hit that table again unless they are editing an email or password. I then have a user table, which is basically the user's settings and profile data for the site. This is where I have questions: would it be better for performance to split the user table into smaller tables? For example, if you view the user table you will see several fields that I have marked as "setting_"; should I just create a separate setting table? I also have fields marked with "count", which could be the total count of comments, photos, friends, mail messages, etc. So should I create another table to store just the total count of things?
The reason I have them all in 1 table now is that I was thinking it might be better to cut down on MySQL queries: instead of hitting 3 tables to get information on every page load, I could hit 1.
Sorry if this is confusing, and thanks for any tips.
Image: http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
should I just create a separate setting table?
So should I create another table to store just the total count of things?
There is no single correct answer for this; it depends on how your application behaves.
What you can do is measure and extrapolate the results in a dev environment.
On the one hand, using a separate table will save you some space and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) by having to join information from different tables.
About the counts, I think it's fine to have them there. Although it is often said that it is better to calculate this kind of thing, I don't think it hurts you at all in this situation.
But again, the only way to know what's better for you and your specific app is to measure, profile, and find out what the benefit would be. You would probably only gain a 2% improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter columns and frequently updated timestamps in their own table: every time you bump them, the entire row is written.
I wouldn't consider your user table terribly large in number of columns; that's just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removing redundancy. Perhaps you have a lot of users who have the same settings; that would be a case for breaking the table out.
You should take the average size of a single row into account in order to find out whether retrieval is expensive. Also, you should try to use indexes when looking up data...
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else... depends on the data saved there.
Also, since the social network site using this data also handles the auth and authorization processes (I guess so), the separation between the login and user tables should offer good performance, because the data in login is "short enough", while the profile can be accessed only once, immediately after a successful login. Just do the right tricks to improve DB performance and it's done.
(Remember to visualize tables as entities, name them as an entity, not as a collection of them)
Two things you will want to consider when deciding whether or not to break up a single table into multiple tables are:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths that will help performance at the potential cost of disk space. One thing that from what I can tell is common is taking fixed length data and putting it in its own table while the variable length data will go somewhere else.
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time then it may not be worth splitting it up as you will be slowing down both inserts and quite potentially reads. However, if there is some data in that table that does not get accessed as often then that would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement but I do recall in a MySQL Performance talk given by Jay Pipes that he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is but regardless joins will usually take longer than queries out of a single table.