Optimization: large MySQL table, only recent records used

I have an optimization question.
The PHP web application that I recently started working with has several large tables in a MySQL database. The information in these tables must be accessible at all times for business purposes, so they eventually grow very large.
The tables are regularly written to and recent records are frequently selected.
Previous developers came up with a very weird way of optimizing the system. They created a separate database for storing recent records in order to keep its tables compact, and they sync the tables once a record grows "old" (more than 24 hours old).
The application uses the current date to pick the right database when performing a SELECT query.
This is a very weird solution in my opinion. We had a big argument over it and I am looking to change it. Before I do, however, I decided to ask:
1) Has anyone ever come across anything similar before? I mean, a separate database for recent records.
2) What are the most common practices to optimize databases for this particular case?
Any opinions are welcome, as there are many ways one can go at this point.

Try adding an index:
CREATE INDEX
An index on the columns you filter by speeds up access to the data.
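
As a rough illustration of what an index changes, here is a minimal sketch in Python. It uses the built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table name records, the column created_at and the index name are assumptions made for the example, not names from the question. In MySQL the equivalent steps would be CREATE INDEX on the date column and EXPLAIN on the SELECT.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)

# Hypothetical table standing in for one of the large tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO records (created_at, payload) VALUES (?, ?)",
    [(f"2024-01-{day:02d} 12:00:00", "x") for day in range(1, 29)],
)

# The "recent records" query the application runs most often.
recent = "SELECT id, payload FROM records WHERE created_at >= ?"
cutoff = ("2024-01-27 00:00:00",)

plan_before = query_plan(conn, recent, cutoff)
conn.execute("CREATE INDEX idx_records_created_at ON records (created_at)")
plan_after = query_plan(conn, recent, cutoff)

print("before:", plan_before)  # a full table scan
print("after: ", plan_after)   # a search that names the new index
```

The exact wording of the plan differs between database engines and versions; the point is only that the second plan refers to the index instead of scanning the whole table.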

I believe RANGE partitioning could help you.

The solution is to partition the table based on a date range.
By splitting a large table into smaller, individual partitions, queries that access only a fraction of the data can run faster because there is less data to scan. Maintenance tasks, such as rebuilding indexes or backing up a table, can run more quickly.
The MySQL documentation is useful here:
https://dev.mysql.com/doc/refman/5.5/en/partitioning-columns-range.html
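
To make the idea concrete, here is a small Python sketch that only builds the MySQL statement for monthly RANGE COLUMNS partitions; it does not connect to a database. The table name records and the column created_at are assumptions for illustration. Keep in mind that MySQL requires the partitioning column to be part of every unique key on the table, including the primary key, so an existing schema may need adjusting first.

```python
from datetime import date

def next_month(d):
    """Return the first day of the month after `d`."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def range_partition_ddl(table, column, first_month, months):
    """Build an ALTER TABLE statement with one partition per month."""
    parts = []
    start = first_month.replace(day=1)
    for _ in range(months):
        end = next_month(start)
        # Each partition holds rows whose date is strictly before `end`.
        parts.append(
            f"  PARTITION p{start:%Y%m} VALUES LESS THAN ('{end.isoformat()}')"
        )
        start = end
    # Catch-all partition so that future rows always have somewhere to go.
    parts.append("  PARTITION pmax VALUES LESS THAN (MAXVALUE)")
    return (
        f"ALTER TABLE {table}\n"
        f"PARTITION BY RANGE COLUMNS({column}) (\n"
        + ",\n".join(parts)
        + "\n);"
    )

# Hypothetical table and column names.
print(range_partition_ddl("records", "created_at", date(2024, 1, 1), 3))
```

With partitions like these, a query that filters on the date column only needs to touch the partitions covering that range, which is the effect the separate "recent records" database was trying to achieve by hand.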

Related

Will splitting one large MySQL query into multiple smaller queries help mitigate table locking / force a script to "yield" to other PHP scripts?

We have a large number of tables in our company's MySQL database, each representing different products (and/or history/transactions for those products) plus a "main" table for parent establishments. Almost all of these tables are using MyISAM (changing everything to InnoDB might help but it's not an option at the moment).
We have a "filter" tool in our backend for finding establishments that match certain criteria. The results are printed in tabular format with all data available for that establishment (ID, name, which products they do/don't have, how many transactions, etc. etc.) and currently this is achieved with a very large MySQL statement with many JOINs.
We had a situation last week where a particularly large filter was run during peak business hours and the resulting READ LOCKs on dependent tables (via the aforementioned JOIN statements) caused the entire database to stop responding for almost 30 minutes even though the filter in question only takes ~43s to run on its own (locally, anyway). Very bad.
While important, this filter tool is only used by a few people on the team and not by clients. The speed/performance of this filter tool is not critical nor the goal of this question. I would prefer for this tool to "yield" to other apps that need access to these tables rather than force them to wait until the entire filter has finished.
Which brings me to my question: will splitting one large query (with multiple JOINs) into multiple smaller queries help mitigate table locking and force a script to "yield" to other, higher priority scripts that might need access to the same tables in between the smaller queries?
Disclaimer: I have reviewed so many other questions here on StackOverflow and on other sites via Google over the last week and they're all interested in speed. That is not what I am asking. If this is a duplicate I apologize and it can be locked, but please provide a link to it so that I may use it. Thank you!
EDIT: I appreciate the comments thus far and the additional information/ideas they provide, though none have answered the question unfortunately. I'm in a position at the company where I have control over the filter's code and that's it. I cannot change the database engine, I cannot initiate replication or create data warehouses, and I'm already aware that MyISAM is the inferior choice for tables, but I don't have control over that. Thank you.

offloading data to another db-table and cache

I have an online store, and our products come from a 4-table join.
I want to move away from these joins for the following reasons:
too expensive on the database.
when I need to query data, I want to use simpler queries.
I am thinking of offloading the data into a simpler form into another DB and table.
Then, in addition, cache that data coming from the new table.
This gives me:
Good performance
simpler querying when I need to perform on the fly lookups using a DB client.
Can anyone weigh in on whether or not this is a good approach?
Am I overdoing it?

This is not a good approach. What you are doing is denormalizing, and that should only be done as a last resort if you really need to increase performance in your system. I've worked on websites with over 10 million views per month, and even on those sites it was only necessary for some specific use cases.
MySQL joins are very fast, and joining 4 tables is nothing. I've written queries joining 15 tables that ran in less than 0.001s. If your indexes are done right, the difference won't be noticeable.
What you're doing is both premature optimization and query-writing laziness. Unless your online store gets hundreds of thousands (or even millions) of visits every day, you are not focusing on the right things: data integrity and consistency are far more important.

How to handle user's data in MySQL/PHP, for large number of users and data entries

Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has its own table that is created when that Boss signs up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talks of multiple servers, and databases linked together... but again, let's focus on a single server here with a single MySQL DB.

If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition only one server (the master) can deal with update (or adding workorder) SQL requests. This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem as long as they are not overwhelmed by queries, because the server can keep most of the database on disk and hold only recently accessed data in cache (in memory).
The other important way to prevent any single query from running slowly is to add the right index for each query you might perform, so that it avoids a linear scan. An index lets the database find the required record(s) with a tree lookup, much like a binary search, instead of reading every row.
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
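
A counter cache can be sketched as follows. This is a minimal Python example using sqlite3 as a stand-in for MySQL; the table boss_counters and the helper names are made up for the illustration. The count is updated in the same transaction as the insert, so the two cannot drift apart (in MySQL this relies on a transactional engine such as InnoDB).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workorders (id INTEGER PRIMARY KEY, boss_id INTEGER, note TEXT);
    CREATE TABLE boss_counters (
        boss_id INTEGER PRIMARY KEY,
        workorder_count INTEGER NOT NULL
    );
""")

def add_workorder(conn, boss_id, note):
    """Insert a work order and bump the boss's cached count atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO workorders (boss_id, note) VALUES (?, ?)", (boss_id, note)
        )
        updated = conn.execute(
            "UPDATE boss_counters SET workorder_count = workorder_count + 1 "
            "WHERE boss_id = ?",
            (boss_id,),
        ).rowcount
        if updated == 0:
            # First work order for this boss: create the counter row.
            conn.execute(
                "INSERT INTO boss_counters (boss_id, workorder_count) VALUES (?, 1)",
                (boss_id,),
            )

def cached_count(conn, boss_id):
    """Read the count from the small counter table instead of COUNT(*)."""
    row = conn.execute(
        "SELECT workorder_count FROM boss_counters WHERE boss_id = ?", (boss_id,)
    ).fetchone()
    return row[0] if row else 0
```

Reading cached_count touches a single small row, whereas SELECT COUNT(*) with a WHERE clause has to walk every matching index entry or row.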
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.

Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
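
The same "what if" arithmetic can be done in a few lines of Python instead of a spreadsheet. The figures for bosses, workers and daily work orders come from the question; the numbers of working days per month and per year are assumptions picked for the example.

```python
def projected_rows(bosses, workers_per_boss, orders_per_worker_per_day, work_days):
    """Rows added to the work orders table over `work_days` working days."""
    return bosses * workers_per_boss * orders_per_worker_per_day * work_days

per_day = projected_rows(1_000, 10, 5, 1)
per_month = projected_rows(1_000, 10, 5, 21)   # assumes ~21 working days a month
per_year = projected_rows(1_000, 10, 5, 250)   # assumes ~250 working days a year

print(per_day, per_month, per_year)
```

Changing one argument at a time (say, doubling the number of bosses) gives a quick upper bound on how many rows the first build has to cope with.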
For collecting large numbers of records there are some straightforward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There are many variables that may work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance, either InnoDB or MyISAM.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
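
The rule above about rotating old data out of the main table could look roughly like this. It is a sketch in Python with sqlite3 standing in for MySQL, and the tables orders and orders_archive are hypothetical. The copy and the delete run in one transaction so that a failure cannot lose or duplicate rows; in MySQL that guarantee needs a transactional engine such as InnoDB.

```python
import sqlite3

def archive_old_rows(conn, cutoff):
    """Move rows older than `cutoff` from orders to orders_archive."""
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM orders WHERE created_at < ?", (cutoff,)
        ).rowcount
    return moved

# Hypothetical working table and archive table with the same columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, total REAL);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT, total REAL);
""")
conn.executemany(
    "INSERT INTO orders (created_at, total) VALUES (?, ?)",
    [("2023-11-30", 10.0), ("2023-12-31", 20.0), ("2024-01-15", 30.0)],
)
conn.commit()

print(archive_old_rows(conn, "2024-01-01"))  # number of rows moved
```

Run from a scheduled job, something like this keeps the working table small while reporting queries can still read the archive table when needed.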

I had a database size problem as well on one of my networks: one table got so big that running a query on it used to slow the server down.
In my opinion, you should divide your data by date. Decide what table size would be too big for you, say 1 million entries, then calculate how long it will take you to reach that amount. Then have a script run once every such period to either create a new table named with the date and move all current data over, or just back the table up and empty it.
It's like putting outdated material in archives.
If you choose the first option, you'll be able to access the data for a given period easily by referring to that table.
Hope that idea helps

Just create a workers table, a bosses table, a relationships table for the two, and then all of your other tables. A relationship structure like this is very flexible: if it ever got large enough, you could create another relationship table between the work orders and the bosses or the workers.
You might want to look into bigints, but I doubt you'll need them. I know that the relationships table will get massive, but that's good DB design.
For reference, a MySQL BIGINT ranges from -9223372036854775808 to 9223372036854775807 signed, or from 0 to 18446744073709551615 UNSIGNED.

How should I version my data in an MS SQL shared server environment?

The server is a shared Windows hosting server with Hostgator. We are allowed "unlimited" MS SQL databases and each is allowed "unlimited" space. I'm writing the website in PHP. The data (not the DB schema, but the data) needs to be versioned such that (ideally) my client can select the DB version he wants from a select box when he logs in to the website, and then (roughly once a month) tag the current data, also through a simple form on the website. I've thought of several theoretical ways to do this and I'm not excited about any of them.
1) Put a VersionNumber column on every table; have a master Version table that lists all versions for the select box at login. When tagged, every row without a version number in every table in the db would be duplicated, and the original would be given a version number.
This seems like the easiest idea for both me and my client, but I'm concerned the db would be awfully slow in just a few months, since every table will grow by (at least) its original size every month. There's not a whole lot of data, and there probably never will be, in any one version. But multiplying versions in the same table just scares me.
2) Duplicate the DB every time we tag.
It looks like this would have to be done manually by my client since the server is shared, so I already dislike the idea. But in addition, the old DBs would have to be able to work with the current website code, and as changes are made to the DB structure over time (which is inevitable) the old DBs will no longer work with the new website code.
3) Create duplicate tables (with the version in their name) inside the same database every time we tag. Like [v27_Employee].
The benefit here over idea (1) would be that no table would get humongous in size, allowing the queries to keep up their speed, and over idea (2) it could theoretically be done easily through the simple website tag form rather than manually by my client. The problems are that the queries in my PHP code are going to get all discombobulated as I try to explain which Employee table is joining with which Address table depending upon which version is selected, since they all have the same name, but different; and also that as the code changes, the old DB tables no longer match, same problem as (2).
So, finally, does anyone have any good recommendations? Best practices? Things they did that worked in the past?
Thanks guys.

Option 1 is the most obvious solution because it has the lowest maintenance overhead and it's the easiest to work with: you can view any version at any time simply by adding a filter on VersionNumber to your queries. If you want or need to, this means you could also implement option 3 at the same time by creating views for each version number instead of real tables. If your application only queries one version at a time, consider making the VersionNumber the first column of a clustered primary key, so that all the data for one version is physically stored together.
And it isn't clear how much data you have anyway. You say it's "not a whole lot", but that means nothing. If you really have a lot of data (say, into hundreds of millions of rows) and if you have Enterprise Edition (you didn't say what edition you're using), you can use table partitioning to 'split' very large tables for better performance.
My conclusion would be to do the simplest, easiest thing to maintain right now. If it works fine then you're done. If it doesn't, you will at least be able to rework your design from a simple, stable starting point. If you do something more complicated now, you will have much more work to do if you ever need to redesign it.
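
Here is one way option 1 might be sketched, in Python with sqlite3 standing in for MS SQL. The table and column names are illustrative only. Instead of duplicating the live rows and stamping the originals, this variant copies the live rows into the new version and leaves the originals live, which has the same net effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE version (version_number INTEGER PRIMARY KEY, tagged_at TEXT);
    -- version_number IS NULL marks the live, untagged rows.
    CREATE TABLE employee (employee_id INTEGER, name TEXT, version_number INTEGER);
""")

def tag_current_data(conn):
    """Snapshot the live rows under a new version number and return it."""
    with conn:  # the version row and the snapshot are written together
        cur = conn.execute("INSERT INTO version (tagged_at) VALUES (datetime('now'))")
        version = cur.lastrowid
        conn.execute(
            "INSERT INTO employee (employee_id, name, version_number) "
            "SELECT employee_id, name, ? FROM employee WHERE version_number IS NULL",
            (version,),
        )
    return version

def employees(conn, version=None):
    """Names as of a tagged version, or the live data when version is None."""
    rows = conn.execute(
        "SELECT name FROM employee WHERE version_number IS ? ORDER BY employee_id",
        (version,),
    )
    return [name for (name,) in rows]
```

The version table doubles as the source for the select box at login, and every query in the application only needs the one extra condition on the version column.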

You could copy your versionable tables into a new database every month. If you need to do a join between a versionable table and a non-versionable table, you'd need to do a cross-schema join - which is supported in SQL Server. This approach is a bit cleaner than duplicating tables in a single schema, since your database explorer will start getting unwieldy with all the old tables.

What I finally wound up doing was creating a new schema for each version and duplicating the tables and triggers and keys each time the DB is versioned. So, for example, I had this table:
[dbo].[TableWithData]
And I duplicated it into this table in the same DB:
[v1].[TableWithData]
Then, when the user wants to view old tables, they select which version and my code automatically changes every instance of [dbo] in every query to [v1]. It's conceptually fairly simple and the user doesn't have to do anything complicated to version -- just type in "v1" to a form and hit a submit button. My PHP and SQL does the rest.
I did find that some tables had to remain separate -- I made a different schema called [ctrl] into which I put tables that will not be versioned, like the username / password table for example. That way I just duplicate the [dbo] tables.
It's been operational for a year or so and seems to work well at the moment. They've only versioned maybe 4 times so far. The only problem I seem to have consistently that I can't figure out is that triggers seem to get lost somehow. That's probably a problem with my very complex PHP rather than the DB versioning concept itself though.
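
The schema-swapping step described above might look something like the following Python sketch. It is not the code from this answer (which was PHP); the function name and the rule that version schemas look like v1, v2 and so on are assumptions. Because the version name comes from a form and schema names cannot be passed as bound query parameters, the sketch validates it before splicing it into the SQL.

```python
import re

# Hypothetical whitelist rule: schema names like v1, v2, ... only.
VERSION_SCHEMA = re.compile(r"v[0-9]+")

def for_version(sql, schema):
    """Point every [dbo]. reference in `sql` at the given version schema."""
    if not VERSION_SCHEMA.fullmatch(schema):
        raise ValueError(f"invalid version schema: {schema!r}")
    # Only the bracketed, dot-qualified [dbo]. prefix is rewritten,
    # so tables in other schemas such as [ctrl] are left alone.
    return sql.replace("[dbo].", f"[{schema}].")

query = (
    "SELECT * FROM [dbo].[TableWithData] t "
    "JOIN [ctrl].[Users] u ON u.id = t.user_id"
)
print(for_version(query, "v1"))
```

A plain string replacement like this would also rewrite any string literal that happens to contain the same prefix, so it only suits queries where that cannot occur.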

Should I break a larger mysql table into multiple?

I have a pretty large social network type site that I have been working on for about 2 years (high traffic and 100's of files). I have been experimenting for the last couple of years with tweaking things for max performance under that traffic, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am re-designing the MySQL DBs and everything.
Below is a picture I made of a couple of MySQL tables that I have a question about. I currently have the login table, which is used in the login process; once a user is logged into the site they very rarely need to hit the table again unless editing an email or password. I then have a user table, which is basically the user's settings and profile data for the site. This is where I have questions: would it be better for performance to split the user table into smaller tables? For example, if you view the user table you will see several fields that I have marked as "setting_". Should I just create a separate setting table? I also have fields marked with "count", which could be the total count of comments, photos, friends, mail messages, etc. So should I create another table to store just the total count of things?
The reason I have them all on 1 table now is because I was thinking maybe it would be better if I could cut down on mysql queries, instead of hitting 3 tables to get information on every page load I could hit 1.
Sorry if this is confusing, and thanks for any tips.
Image: http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg

As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.

"Should I just create a separate setting table?"
"So should I create another table to store just the total count of things?"
There is no single correct answer for this; it depends on how your application behaves.
What you can do is measure in a dev environment and extrapolate from the results.
On the one hand, using a separate table will save you some space and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) by having to join information from different tables.
About the counts, I think it's fine to keep them there. Although it is often said that it is better to calculate this kind of value, I don't think it will hurt you at all in this situation.
But again, the only way to know what's better for you and your specific app is to measure, profile, and find out what the benefit would be. You would probably only gain around 2%.

You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.

You should consider putting the counter columns and frequently updated timestamps in their own table, because every time you bump them the entire row is written.

I wouldn't consider your user table terribly large in number of columns; that's just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removing redundancy. Perhaps you have a lot of users who have the same settings; that would be a case for breaking the table out.

You should take the average size of a single row into account in order to find out whether retrieval is expensive. Also, try to use indexes when looking up data.
The most important thing is to design properly, not to split just because "it looks large". Maybe the IP or IPs could go somewhere else; it depends on the data saved there.
Also, since the social network site using this data presumably also handles the authentication and authorization processes, the separation between the login and user tables should offer good performance, because the data in login is "short enough", while the profile could be accessed only once, immediately after a successful login. Just apply the usual techniques for improving DB performance and you're done.
(Remember to visualize tables as entities, name them as an entity, not as a collection of them)

Two things you will want to consider when deciding whether or not you want to break up a single table into multiple tables are:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths that will help performance at the potential cost of disk space. One thing that from what I can tell is common is taking fixed length data and putting it in its own table while the variable length data will go somewhere else.
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time then it may not be worth splitting it up as you will be slowing down both inserts and quite potentially reads. However, if there is some data in that table that does not get accessed as often then that would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement but I do recall in a MySQL Performance talk given by Jay Pipes that he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is but regardless joins will usually take longer than queries out of a single table.
