I have a program that keeps track of the games played during the year. Once the year is done I was planning to let an admin archive that season. To do this I created a copy of the games table, named after the previous season, and stored that table name in a separate table.
The problem is that, as far as I can tell, I can't run an SQL query against a table whose name has to be looked up from another table.
Does anyone have advice on how to handle this? The table could gain 1,000-2,000 records per year, and this could still be in place in 20 years (probably not, the way technology changes). I could add a column for the year and filter on that instead, but will it get slow once I hit 20K-40K records?
Another idea I had is to keep the active year in one table and archive all the history to a separate one. Users don't care if history takes a few seconds to load, but they do care about current data.
Thanks
Slowness is a function of query complexity as much as it is of record count. You can make slow queries with 100 rows or fast ones with millions. If you need specific help with query optimization, post the query; otherwise there isn't much to work with beyond the previously suggested indexes.
Generally, tables on the order of 10k rows are nothing special. Don't try to over-optimize before you actually identify the weak points. Partitioning tables is very rarely necessary before hitting millions, if not tens of millions, of rows. You're falling into the 'premature optimization' trap; don't do it unless you're certain it's going to cause problems.
Me, I'd throw an index on the columns you use to JOIN the table (probably user_id) and be done with it. There's an automatic one on it if you're using foreign keys. Until you see your queries slowing down the application, you'd probably do more harm than good by forcing it to work around partitioning.
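For illustration, a minimal sketch of the single-table-plus-year-column approach the question mentions; the `games` table and column names here are assumptions, not from the original post:

```sql
-- Add a season column and index it, so current-season queries stay fast
-- even with many years of history in the same table.
ALTER TABLE games
    ADD COLUMN season YEAR NOT NULL DEFAULT 2024,
    ADD INDEX idx_games_season (season);

-- Fetch only the active season:
SELECT * FROM games WHERE season = 2024;
```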
Related
I would like to store big chunks of data in RAM using Sphinx, Solr, Elasticsearch, or whatever else suits such needs (the problem is I don't know which tool fits best; I've only heard that people use them).
I build reports about sales. I get nearly 800-900k lines of sales per month, and the user wants to scroll through the page and see them smoothly.
I can't give them all the data at once because the browser will just hang,
and at the same time I can't use MySQL's LIMIT because the queries require joins across several tables.
Recalculating it on the fly is not an option.
Creating a temp table in MySQL is a bad idea because there are a bunch of criteria and more than one user can be viewing data at the same time.
Temporary_table:
`id`, `product_id`, `product_count`, `order_id`, `order_status`, ..., `user_id`
With such a table I would store all the results for the current user and hold them there as long as the user doesn't make a new query. But I don't like this solution; there must be something better.
I feel like it's over my head.
Any ideas?
"Drill down", don't "Scroll down" !
When I need to present a million lines of info, I start by thinking of ways to slice and dice it: subtotals by hour, by region, by product type, by whatever. Each slice might be a hundred lines, which is quite manageable, especially with summary tables.
In that hundred lines would be clickable items that take the user to a more detailed page about one of the items. That page would also have a hundred lines (or 10, or 1000; whatever makes sense, though more than 1000 is usually unreasonable). It may have further links to drill further down, and/or links to move laterally.
With suitable slicing and dicing, you are very unlikely to need to send the user a million lines; only a few hundred.
With suitable summary tables, the "tmp tables", etc., go away.
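As a rough sketch of what such a summary table might look like (assuming a hypothetical `sales` table with `sold_at`, `region`, and `amount` columns):

```sql
CREATE TABLE sales_summary_daily (
    sale_date    DATE          NOT NULL,
    region       VARCHAR(50)   NOT NULL,
    total_count  INT           NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (sale_date, region)
);

-- Refresh yesterday's slice, e.g. from a nightly cron job:
REPLACE INTO sales_summary_daily
SELECT DATE(sold_at), region, COUNT(*), SUM(amount)
FROM sales
WHERE sold_at >= CURDATE() - INTERVAL 1 DAY
  AND sold_at <  CURDATE()
GROUP BY DATE(sold_at), region;
```

The drill-down pages then read from this small table instead of scanning hundreds of thousands of raw sales rows.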
Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has its own table that is created when that Boss signed up, i.e. work_bossID where bossID = the boss's unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talk of multiple servers and databases linked together... but again, let's focus on a single server here with a single MySQL DB.
If you use a single server, I don't think the size of the table is a problem by itself. What matters isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, e.g. if you are searching by product type and any boss can have orders of any product type.
However, you have to store a copy of everything on every mirror. In addition, only one server (the master) can handle writes (updating or adding workorders). This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem, as long as they are not overwhelmed by queries, because the server can keep most of the database on disk and only hold what was accessed recently in the cache (in memory).
The other important thing for preventing any single query from running slowly is to make sure you add the right index for each query you might perform, to avoid linear scans. This lets the database use an index lookup (effectively a binary search) to find the required record(s).
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
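As a rough sketch, with hypothetical `workorders`/`bosses` names: an index for the common lookup, plus a counter cache bumped in the same transaction as the insert.

```sql
-- Index the column the queries filter/join on, to avoid full scans:
ALTER TABLE workorders ADD INDEX idx_workorders_boss (boss_id);

-- Counter cache: bump the parent row together with the insert.
START TRANSACTION;
INSERT INTO workorders (boss_id, description) VALUES (42, 'Fix the thing');
UPDATE bosses SET workorder_count = workorder_count + 1 WHERE id = 42;
COMMIT;
```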
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation, as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper bound for how far you have to scale with your first build.
For collecting large numbers of records there are some straightforward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here; there are many variables that may work for some people but might slow down your application, and they're highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance to the InnoDB buffer pool or the MyISAM key cache.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
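A sketch of that kind of de-normalization (the table and column names are illustrative, not from the post):

```sql
-- Duplicate the author's name onto each message so the list page
-- needs no join back to the users table.
CREATE TABLE messages (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,
    user_name  VARCHAR(50)  NOT NULL,   -- copied from users.name at insert time
    body       TEXT         NOT NULL,
    created_at DATETIME     NOT NULL,
    KEY idx_messages_created (created_at)
);

-- Listing recent messages now touches a single table:
SELECT user_name, body, created_at
FROM messages
ORDER BY created_at DESC
LIMIT 50;
```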
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
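Using EXPLAIN is as simple as prefixing the slow query (a hypothetical query is shown here):

```sql
EXPLAIN
SELECT b.name, COUNT(*) AS orders
FROM workorders w
JOIN bosses b ON b.id = w.boss_id
WHERE w.created_at >= '2024-01-01'
GROUP BY b.name;
-- Watch for type = ALL (full table scan) and large "rows" estimates;
-- those usually point to a missing index.
```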
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
I had a database size problem as well in one of my networks; the table got so big that it used to slow the server down whenever I ran a query on it.
In my opinion, divide your database by dates. Decide what table size would be too big for you, say 1 million entries, then calculate how long it will take to reach that amount. Then have a script run at that interval to either create a new table with the date and move all the current data over, or just back that table up and empty it.
It's like putting outdated material in the archives.
If you choose the first option you'll be able to access that data easily by referring to the dated table.
Hope that idea helps
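If it helps, one way to do that rotation in MySQL (with a hypothetical `workorders` table name): RENAME TABLE swaps both names in one atomic step, so the application keeps writing to a fresh, empty table while the old data stays queryable under its dated name.

```sql
CREATE TABLE workorders_new LIKE workorders;
RENAME TABLE workorders     TO workorders_2024,
             workorders_new TO workorders;
```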
Just create a workers table, a bosses table, a relationship table for the two, and then all of your other tables. With a relationship structure like this it's very dynamic, because if it ever got large enough you could create another relationship table between the work orders and the bosses or the workers.
You might want to look into BIGINTs, but I doubt you'll need that. I know the relationship table will get massive, but that's good DB design.
Of course, in MySQL a BIGINT ranges from -9223372036854775808 to 9223372036854775807 signed, or 0 to 18446744073709551615 unsigned.
A friend of mine has a catalogue that currently holds about 500 items. We are looking at ways we can provide reports on the catalogue, including the number of times an item was viewed and the dates when it was viewed.
His site is averaging around 25,000 page impressions per month, and if we assume for a minute that half of these were catalogue items, that's roughly 12,000 catalogue item views each month.
My question is the best way to manage item views in the database.
The first option is to insert the catalogue ID into a table and then increment the number of times it's viewed. The advantage of this is its compact nature: there will only ever be as many rows in the table as there are catalogue items.
`catalogue_id`, `views`
The disadvantage is that no date information is kept, short of maintaining the last time an item was viewed.
The second option is to insert a new row each time an item is viewed.
`catalogue_id`, `timestamp`
If we continue with the assumed figure of 12,000 item views, that means adding 12,000 rows to the table each month, or 144,000 rows each year. The advantage of this is that we know the number of times an item is viewed, and also the dates when it's viewed.
The disadvantage is the size of the table. Is a table with 144,000 rows becoming too large for MySQL?
Interested to hear any thoughts or suggestions on how to achieve this.
Thanks.
As you have mentioned, the first option is a lot more compact but limited. However, look at option 2 in more detail: you may wish to store more than just a view count, for instance the entry/exit page, the host IP, etc. This information may be invaluable for stats and tracking. The other question is: are these 25,000 impressions unique? If not, and you are able to track by username, IP, or some other unique identifier, you may not need as many rows. The answer to your question depends on how much detail you wish to store and how important the data is to you.
Update:
True, limiting repeats on a given item within a time interval would be a good solution. Also, knowing whether someone visited the same item could be useful for suggested-item prediction widgets similar to what Amazon does. And knowing that someone visited an item many times says to me that it's a good item to promote to them or to others in a mail-out, newsletter, or popular-products page. Tracking unique views will give a more honest view count, which you can choose to display or store. On the issue of limiting the value of repeat visitors, this mainly comes into play depending on what information you display; it's all about framing the information in the way that best suits you.
Your problem statement: We want to be able to track number of views for a particular catalogue item.
Let's review your options.
First Option:
In this option you store the catalogue_id and an integer value for the number of views of the item.
Advantages:
Well, since you really have a one-to-one relationship, the new table is going to be small: if you have 500 items you will have 500 rows. If you choose this route I would suggest not creating a new table but adding another column to the catalogue table to hold the number of views.
Disadvantages:
The problem here is that since you are going to be updating this table relatively frequently, it is going to be a very busy little table. For example, say 10 users are viewing the same item; those 10 updates will have to run one after the other. Assuming you are using InnoDB, the first view action would come in, lock the row, update the counter, and release the lock, while the other updates queue behind it. So while the data in the table is small, it could become a bottleneck later on, especially if you start scaling the system.
You are losing granular data, i.e. you are not keeping the raw data. For example, let's say the website starts growing and you have an interested investor who wants to see a breakdown of views per week over the last 6 months. With this option you won't have the data to show them; essentially you are keeping only a summary.
Second Option:
In this option you would create a logging table with at least the following minimal fields: catalogue_id and a timestamp. You could expand it with a username/IP address or some other information to make it even more granular.
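A sketch of what that logging table might look like (the names here are assumptions):

```sql
CREATE TABLE catalogue_views (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    catalogue_id INT UNSIGNED NOT NULL,
    ip_address   VARCHAR(45)  NULL,   -- optional, for unique-visitor stats
    viewed_at    TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_views_item_date (catalogue_id, viewed_at)
);

-- Recording a view is a single cheap insert:
INSERT INTO catalogue_views (catalogue_id, ip_address)
VALUES (123, '203.0.113.7');
```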
Advantages:
You are keeping granular data, which will allow you to summarise it in a variety of ways. You could, for example, add an IP address column, store the visitor's IP, and then do a monthly report showing products viewed by country (you could do an IP address lookup to get an idea of which country each visitor was from). Another example would be to see which products were viewed the most over the last quarter. This data is pretty essential in helping you make decisions on how to grow your business; if you want to know what is working and what is not working as far as products are concerned, this detail is absolutely critical.
Your new table will be a logging table, with insert operations only, and inserts can pretty much happen in parallel. If you go with this option it will probably scale better as the site grows, compared to a constantly updated table.
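For instance, the quarterly "most viewed" report mentioned above could be a single query against the sketch table shown earlier:

```sql
SELECT catalogue_id, COUNT(*) AS views
FROM catalogue_views
WHERE viewed_at >= CURDATE() - INTERVAL 3 MONTH
GROUP BY catalogue_id
ORDER BY views DESC
LIMIT 20;
```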
Disadvantages:
This table will be bigger, probably the biggest table in the database, but this is not a problem. I regularly deal with tables of 500,000,000+ rows; some of my tables are over 750 GB by themselves and I can still run reporting on them. You just need to understand your queries and how to optimise them. This is really not a problem, as MySQL was designed to handle millions of rows with ease. Keep in mind you could also archive some of the information into other tables; say you archive every 3 years, you could move data older than 3 years into another table. You don't have to keep all the data there. Your estimate of 144,000 rows a year means you could probably keep 15+ years' worth without ever worrying about the performance of the table.
My suggestion to you is to seriously consider the second option. If you decide to go this route, update your question with the proposed table structures and let us have a look. Don't be scared of big data; rather, be scared of BAD design, which is much more difficult to deal with.
However as always the choice is yours.
I have a pretty large social-network-type site that I have been working on for about 2 years (high traffic and hundreds of files). I have been experimenting for the last couple of years with tweaking things for maximum performance under that traffic, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am re-designing the MySQL DBs and everything else.
Below is a picture I made of a couple of the MySQL tables that I have a question about. I currently have a login table which is used in the login process; once a user is logged into the site they very rarely need to hit that table again unless they're editing an email or password. I then have a user table which holds basically the user's settings and profile data for the site. This is where I have questions: would it give better performance to split the user table into smaller tables? For example, if you look at the user table you will see several fields that I have marked as "setting_"; should I just create a separate settings table? I also have fields marked with "count", which hold total counts of comments, photos, friends, mail messages, etc. Should I create another table to store just the total counts of things?
The reason I have them all in one table now is that I was thinking it might be better to cut down on MySQL queries: instead of hitting 3 tables to get information on every page load, I could hit 1.
Sorry if this is confusing, and thanks for any tips.
(Schema image of the login and user tables: http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg)
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
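For example (the column names are assumed from the description of the screenshot):

```sql
-- Instead of SELECT * FROM user WHERE id = 123;
-- ask only for what the page actually needs:
SELECT username, avatar, setting_timezone
FROM user
WHERE id = 123;
```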
should I just create a seperate setting table?
So should I create another table to store just the total count of things?
There is not a single correct answer for this, it depends on how your application is doing.
What you can do is to measure and extrapolate the results in a dev environment.
On one hand, using a separate table will save you some space and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) by having to join information from different tables.
About the counts, I think it's fine to have them there; although it is often said that it's better to calculate this kind of thing, I don't think it will hurt you at all in this situation.
But again, the only way to know what's better for you and your specific app is to measure, profile, and find out what the benefit of doing so is. You might only gain a 2% improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter columns and frequently updated timestamps in their own table, since every time you bump them the entire row gets rewritten.
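A sketch of that split, with assumed names:

```sql
-- Narrow, hot table for counters; the wide profile row stays untouched
-- when a counter is bumped.
CREATE TABLE user_counts (
    user_id       INT UNSIGNED PRIMARY KEY,
    friend_count  INT UNSIGNED NOT NULL DEFAULT 0,
    photo_count   INT UNSIGNED NOT NULL DEFAULT 0,
    comment_count INT UNSIGNED NOT NULL DEFAULT 0
);

UPDATE user_counts SET photo_count = photo_count + 1 WHERE user_id = 123;
```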
I wouldn't consider your user table terribly large in number of columns, just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removing redundancy. Perhaps you have a lot of users who share the same settings; that would be a case for breaking the table apart.
You should take into account the average size of a single row, in order to find out whether retrieval is expensive. Also, try to use indexes when looking up data.
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else; it depends on the data saved there.
Also, since the social network site using this data also handles authentication and authorization (I'm guessing), the separation between the login and user tables should offer good performance, because the data needed at login is short enough, while the profile can be read just once, immediately after a successful login. Just apply the usual tricks to improve DB performance and you're done.
(Remember to visualize tables as entities and name each one as an entity, not as a collection of them.)
Two things you will want to consider when deciding whether or not to break a single table into multiple tables are:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths, that will help performance at the potential cost of disk space. One common approach, from what I can tell, is to keep the fixed-length data in its own table while the variable-length data goes somewhere else.
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time, then it may not be worth splitting it up, as you will be slowing down both inserts and quite possibly reads. However, if there is some data in that table that does not get accessed as often, then that data would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement but I do recall in a MySQL Performance talk given by Jay Pipes that he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is but regardless joins will usually take longer than queries out of a single table.
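As a rough illustration of the fixed-length versus variable-length split (table names assumed):

```sql
-- All fixed-width columns; MyISAM can store this with its static row format.
CREATE TABLE user_stats (
    user_id     INT UNSIGNED PRIMARY KEY,
    login_count INT UNSIGNED NOT NULL DEFAULT 0,
    last_login  DATETIME     NOT NULL
);

-- Variable-length data lives in its own table.
CREATE TABLE user_profiles (
    user_id INT UNSIGNED PRIMARY KEY,
    bio     TEXT,
    website VARCHAR(255)
);
```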
I'm making a forum.
And I'm wondering if I should store the number of replies in the topic table or count the posts of the topic each time.
How much slower will it be if I use SQL and count them? Let's say I have a billion posts.
Will it be much slower? I'm not planning on being that big, but what if? How much slower would it be compared to storing the number in topics?
Thanks
It will be slower as your DB grows in size. If you are planning on having a large posts table, store the value in the topic table.
I just ran some tests on a MySQL 4.0 box we have using a table with over 1 million records.
SELECT COUNT(*) FROM MyTable; ~1 million took 22ms
SELECT COUNT(*) FROM MyTable WHERE Role=1; ~800,000 took 3.2s
SELECT COUNT(*) FROM MyTable WHERE Role=2; ~20 took 12ms
The Role column in this case was indexed and this was connecting to the MySQL remotely.
I think your posts table will have to get very large before the query times really become an issue. I also think it is premature optimization to cache the count in your topics table. Build it without it for now, and if it becomes a problem it's a pretty easy change to add it.
Do not store the value in a table.
Cache the value in the application for some time so the COUNT(*) query won't be executed too often.
Choose the cache time depending on the server load: longer for a very busy server, zero for a couple of users.
COUNT(*) in SQL Server is pretty fast (assuming you have an index on the field you are counting on), so you just need to reduce the number of hits under heavy load.
If you will store the value in a table you will have a lot of hassle maintaining it.
This is going to affect scaling and is an issue of normalization. Hardcore normalization nerds will tell you that you shouldn't keep the number of posts on the topic because it causes redundant data. But keep in mind that if you don't store it there, you need to do an extra query on every page load to fetch the number. The alternative is an extra query on every update/insert instead, which will almost always occur much less often than selects. As you scale a site to support a lot of traffic it becomes almost inevitable that you eventually have to de-normalize some of your data, especially in cases like this.
Redundant data isn't inherently bad. Poorly managed redundancy is. As long as you have the proper checks in place to prevent the data from getting out of sync then the potential benefit of storing the number of posts on the thread is worth the extra bit of code IMO.
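A sketch of what "properly managed" can mean in practice, with hypothetical table names: bump the cached count in the same transaction as the insert, and keep a repair query handy in case the two ever drift apart.

```sql
START TRANSACTION;
INSERT INTO posts (topic_id, user_id, body) VALUES (7, 42, 'New reply');
UPDATE topics SET reply_count = reply_count + 1 WHERE id = 7;
COMMIT;

-- Periodic sanity check / repair:
UPDATE topics t
SET t.reply_count = (SELECT COUNT(*) FROM posts p WHERE p.topic_id = t.id);
```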
I think a lot of this will depend on how rapidly you're pushing data in. If you store the value in a topic table, then you may find that you're needing to increment (or decrement if you delete records) very frequently too.
Indexes (indices?) may be a nicer option, as you can store a tiny subset of the data and still get at richer information. Consider that it's quite quick to count how many Farleys there are in the phone book, because you can go straight to them and easily count them.
So, as is often the case, the answer is probably 'It depends'.
I like storing counts in the table rather than counting them every time. It's such an easy operation and you never have to think about the expense of showing it when you're retrieving it. With a forum you're going to be displaying it more often than you're going to be changing it anyway so it makes sense to make that as cheap as possible. It might be a bit premature but it might save you some headaches later.