I'm doing an RIA with JavaScript, MySQL and PHP on a Windows server.
I have 5,000 identically structured data sets I want to put in a database. Five tables are enough for the data, all of which will be reasonably small except for one table that will have 300,000+ records for a typical data set.
Additionally, 500 users will get read-only access to statistics compiled from those data sets. Those statistics are provided by PHP (no direct access is allowed). What's more, their access to data varies: some users can use only one data set, others several, and a few all of them.
The results users see are relatively small; most requests return well under 100 rows, and the largest will be about 700 rows. All requests go through a JavaScript RIA that uses Ajax to call PHP, which connects to the data, does its work, and returns JSON, which the JavaScript then presents accordingly.
In thinking about how to structure this, three options present themselves:
Put the data sets in the same tables. That could easily give me 1,500,000,000 records in the largest table.
Use separate tables for each data set. That would limit the largest table size, but could mean 25,000 tables.
Forget the database and stick with the proprietary format.
I'm leaning towards #2 for a few reasons.
I'm concerned about issues in using very large tables (e.g., query speeds, implementation limits, etc.).
Separate tables seem safer; they limit the impact of errors and structure changes.
Separate tables allow me to use MySQL's table level security rather than implementing my own row level security. This means less work and better protection: if a query is accidentally sent without the row level filter, users can get unauthorized data; with table level security, the database rejects the query out of hand.
Those are my thoughts, but I'd like yours. Do you think this is the right choice? If not, why not? What considerations have I missed? Should I consider other platforms if scalability is an issue?
1) I'm concerned about issues in using very large tables (e.g., query speeds, implementation limits, etc.).
Whether the DBMS has to...
search through the large index of one table,
or search for the right table and then search through the smaller index of that table
...probably doesn't make much of a difference performance-wise. If anything, the second case has an undocumented component (the performance of locating the right table), so I'd be reluctant to trust it fully.
If you want to physically partition the data, MySQL supports that directly since version 5.1, so you don't have to emulate it via separate tables.
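For illustration, a minimal sketch of what that could look like, assuming a hypothetical measurements table with a dataset_id column (all names here are placeholders, not your actual schema):

-- one physical table, hash-partitioned by data set
CREATE TABLE measurements (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    dataset_id  INT UNSIGNED    NOT NULL,
    recorded_at DATETIME        NOT NULL,
    value       DECIMAL(12,4)   NOT NULL,
    -- the partitioning column must be part of every unique key
    PRIMARY KEY (id, dataset_id)
)
PARTITION BY HASH (dataset_id)
PARTITIONS 64;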
2) Separate tables seem safer; they limit the impact of errors and structure changes.
That's what backups are for.
3) Separate tables allow me to use MySQL's table level security rather than implementing my own row level security.
True enough; however, a similar effect can be achieved through views or stored procedures.
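As a hedged sketch of the view-based approach (database, table, column and account names are all made up, and it assumes each access group maps to its own MySQL account):

-- expose only the rows a given group may see
CREATE VIEW stats_group_a AS
    SELECT s.*
    FROM statistics AS s
    JOIN dataset_access AS a ON a.dataset_id = s.dataset_id
    WHERE a.user_group = 'group_a';

-- grant SELECT on the view only, not on the underlying tables
GRANT SELECT ON mydb.stats_group_a TO 'app_group_a'@'localhost';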
All in all, my instinct is to go with a single table, unless you know in advance that these data sets differ enough structurally to warrant separate tables. BTW, I doubt you'd be able to do better with a proprietary format than with a well-optimized database.
I'm in the process of developing a large-scale application that will contain a few tables with large datasets (potentially 1M+ rows). This application will be a game with multiple users completing tasks at the same time, and it will be very data intensive.
In this application, data will be aggregated for user statistics. I have come up with two scenarios to achieve my desired effect of calculating all the statistics.
Scenario 1
Maintain a separate table to calculate user statistics, meaning that as a move is processed, the corresponding field is incremented.
Table Statistics (Moves, Origins, Points)
$Moves++;
$Origins++;
$Points += $movePoints; // add the points earned by this move ($movePoints is a placeholder)
Scenario 2
Count and sum the data fields as needed across all data.
Table Moves (Points, Origins)
SUM(Points)
SUM(Origins)
COUNT(Moves)
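In full, the Scenario 2 query would be something along these lines (simplified; user_id is an assumed column):

SELECT COUNT(*)     AS total_moves,
       SUM(Points)  AS total_points,
       SUM(Origins) AS total_origins
FROM   Moves
WHERE  user_id = ?;  -- statistics for one user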
My question is: which of these two scenarios would be more efficient on the database? It is my belief that Scenario 2 could possibly be more efficient because there will be far less data manipulation, but I'm unsure of the load that these queries may place on the DB.
I am using MySQL 5.5 InnoDB with a UTF8 Charset
The best route will depend on the frequency of reads vs. writes of points, origins and moves. Those frequencies, in turn, will be dependent upon use cases, code style and use (or lack) of caching.
It's difficult to provide a qualified opinion without more details, but consider that a dedicated statistics table brings some additional complications: every operation requires an extra write, and you must ensure those tallies always stay correct (i.e., match the underlying detail data). In light of that added complication, storing logical data elements once rather than twice in a relational database is usually the best course of action.
If you're worried about performance and scalability, you might want to consider a non-relational approach using database platforms like MongoDB or DynamoDB.
I have a program that creates logs, and these logs are used to calculate balances, trends, etc. for each individual client. Currently, I store everything in separate MySQL tables. I link all the logs to a specific client by joining the two tables. When I access a client, it pulls all the logs from the log_table and generates a report. The report varies depending on what filters are in place, mostly date and category specific.
My concern is the performance of my program as we accumulate more logs and clients. My intuition tells me to store the log information in the user_table in the form of a serialized array so only one query is used for the entire session. I can then take that log array and filter it using PHP, whereas before it was filtered in a MySQL query (using multiple methods, such as BETWEEN for dates and other comparisons).
My question is: do you think performance would be improved if I used serialized arrays to store the logs as opposed to using a MySQL table to store each individual log? We are estimating about 500-1000 logs per client, with around 50,000 clients (and growing).
It sounds like you don't understand what makes databases powerful. It's not about "storing data", it's about "storing data in a way that can be indexed, optimized, and filtered". You don't store serialized arrays, because the database can't do anything with that. All it sees is a single string without any structure that it can meaningfully work with. Using it that way voids the entire reason to even use a database.
Instead, figure out the schema for your array data, and then insert your data properly, with one field per dedicated table column so that you can actually use the database as a database, allowing it to optimize its storage, retrieval, and database algebra (selecting, joining and filtering).
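For example, the logs described above might be modeled something like this (column names are guesses, purely to illustrate the idea):

CREATE TABLE logs (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    client_id INT UNSIGNED    NOT NULL,
    category  VARCHAR(50)     NOT NULL,
    amount    DECIMAL(12,2)   NOT NULL,
    logged_at DATETIME        NOT NULL,
    KEY idx_client_date (client_id, logged_at),     -- supports the date-range filters
    KEY idx_client_category (client_id, category)   -- supports the category filters
);

-- the filtering then stays in SQL instead of PHP:
SELECT category, SUM(amount) AS total
FROM   logs
WHERE  client_id = ?
  AND  logged_at BETWEEN ? AND ?
GROUP  BY category;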
Is serialized arrays in a db faster than native PHP? No, of course not. You've forced the database to act as a flat file with the extra dbms overhead.
Is using the database properly faster than native PHP? Usually, yes, by a lot.
Plus, and this part is important, it means that your database can live "anywhere", including on a faster machine next to your web server, so that your database can return results in 0.1 s rather than PHP pegging the CPU at 100% to filter your data and preventing users of your website from getting page results because you blocked all the threads. For that very reason it makes no sense to keep this task in PHP, even if you're bad at implementing your schema and queries, forget to cache results and do subsequent searches inside those cached results, forget to index the tables on columns for extremely fast retrieval, and so on.
PHP is not for doing all the heavy lifting. It should ask other things for the data it needs, and act as the glue between "a request comes in", "response base data is obtained" and "response is sent back to the client". It should start up, make the calls, generate the result, and die as fast as it can again.
It really depends on how you need to use the data. You might want to look into storing it in MongoDB if you don't need to search that data. If you do, leave it in individual rows and create your indexes in a way that makes lookups fast.
If you have 10 billion rows, and need to look up 100 of them to do a calculation, it should still be fast if you have your indexes done right.
Now if you have 10 billion rows and you want to do a sum on 10,000 of them, it would probably be more efficient to save that total somewhere. Whenever a new row is added, removed or updated that would affect that total, you can change that total as well. Consider a bank, where all items in the ledger are stored in a table, but the balance is stored on the user account and is not calculated based on all the transactions every time the user wants to check his balance.
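That ledger pattern looks roughly like this in SQL (hypothetical transactions and accounts tables, InnoDB assumed so the two writes stay consistent):

START TRANSACTION;

-- record the detail row in the ledger
INSERT INTO transactions (account_id, amount, created_at)
VALUES (42, -19.99, NOW());

-- keep the cached balance in step with the ledger
UPDATE accounts
SET    balance = balance - 19.99
WHERE  id = 42;

COMMIT;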
Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has its own table that is created when that Boss signs up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but those tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talk of multiple servers and databases linked together... but again, let's focus on a single server here with a single MySQL DB.
If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition, only one server (the master) can handle write requests (updating or adding workorders). This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem, if they are not overwhelmed by queries, because they can keep most of the database on hard disk, and only keep what was accessed recently in cache (on memory).
The other important thing for preventing any single query from running slowly is to make sure you add the right index for each query you might perform, so the database can use the index to locate the required record(s) instead of scanning every row.
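For example (hypothetical table and columns), a composite index matching the WHERE clause lets MySQL seek straight to the rows instead of scanning the table:

-- the query to support:
SELECT * FROM workorders WHERE boss_id = ? AND created_at >= ?;

-- a matching index:
ALTER TABLE workorders ADD INDEX idx_boss_created (boss_id, created_at);

-- verify the index is actually used:
EXPLAIN SELECT * FROM workorders WHERE boss_id = 42 AND created_at >= '2013-01-01';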
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation, as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there are some straightforward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There are many variables that may work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine (InnoDB or MyISAM), for instance.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
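As a sketch of that message-board example (names made up), the poster's name is copied into the messages table so the list view needs no join:

CREATE TABLE messages (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED    NOT NULL,
    user_name  VARCHAR(64)     NOT NULL,   -- denormalized copy of users.name
    body       TEXT            NOT NULL,
    created_at DATETIME        NOT NULL
);

-- listing recent messages no longer touches the users table
SELECT user_name, body, created_at
FROM   messages
ORDER  BY created_at DESC
LIMIT  50;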
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
I had a database size problem as well in one of my networks; the table got so big that it used to slow the server down whenever I ran a query on it.
In my opinion, divide your database by date. Decide what table size would be too big for you (let's say 1 million entries), then calculate how long it will take you to get to that amount. Then have a script run at that interval to either create a new table with the date and move all current data over, or just back that table up and empty it.
It's like putting outdated material in archives.
If you choose the first option, you'll be able to access that data easily by referring to the table for that date.
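For example, the rotation script could do something like this (assuming a hypothetical workorders table with a created_at column; adjust names and dates to your setup):

-- create this period's archive table with the same structure
CREATE TABLE workorders_2013_q1 LIKE workorders;

-- move the old rows into it, then remove them from the live table
INSERT INTO workorders_2013_q1
    SELECT * FROM workorders WHERE created_at < '2013-04-01';
DELETE FROM workorders WHERE created_at < '2013-04-01';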
Hope that idea helps
Just create a workers table, a bosses table, a relationship table for the two, and then all of your other tables. With a relationship structure like this, it's very dynamic, because if it ever got large enough you could create another relationship table between the work orders and the bosses or the workers.
You might want to look into BIGINTs, but I doubt you'll need that. I know the relationship table will get massive, but that's good DB design.
For reference, MySQL's BIGINT ranges from -9223372036854775808 to 9223372036854775807 signed, or 0 to 18446744073709551615 unsigned.
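A minimal sketch of that layout (names are just examples):

CREATE TABLE bosses  (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100) NOT NULL);
CREATE TABLE workers (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100) NOT NULL);

-- relationship table linking workers to their boss
CREATE TABLE boss_worker (
    boss_id   INT UNSIGNED NOT NULL,
    worker_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (boss_id, worker_id),
    KEY idx_worker (worker_id)   -- for lookups by worker
);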
I am using Ajax to send a query to the PHP server, which then runs the SQL query to get data. Because the query involves three tables (two of them large), joining the three tables is very slow.
So I split the SQL query into three queries. That improves efficiency (for small datasets). But for large datasets, because the PHP program runs the three queries one by one and processes the result after each, it hits the 30-second timeout (the default). I don't want to change this default setting.
To avoid the timeout, I am also considering running the three queries, returning the results to JS, and letting the client side do the processing.
Is there other way to do that?
Added:
Basically, I want three outputs (title, extviews, allviews) for each item WHERE extviews > somevalue. title comes from one small table; extviews and allviews are aggregated from two different large tables. I have all the fields indexed, but joining the two big tables still takes a long time.
So I first aggregate one table to get extviews for each item, along with a list of item ids. The results are organized as an array for JSON output to JS. Then, using the list of ids, I get the title for each item and aggregate the other table to get allviews. Then I update the array with the new results.
Unless your MySQL server is really overloaded, it's usually quicker to use joins. I guess you've already defined indexes on your tables (for the fields used in join conditions and WHERE clauses)?
Doing the processing on the client side might also be a problem, since you'll have to send a lot of data in order to do the join...
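If the join itself is the bottleneck, one variation worth trying is to aggregate each large table in a derived table first and join only the aggregated results, something along these lines (all table and column names are guesses based on your description):

SELECT i.id,
       i.title,
       e.extviews,
       a.allviews
FROM   items AS i
JOIN  (SELECT item_id, SUM(views) AS extviews
       FROM   external_views
       GROUP  BY item_id
       HAVING SUM(views) > 1000) AS e ON e.item_id = i.id   -- 1000 stands in for "somevalue"
JOIN  (SELECT item_id, SUM(views) AS allviews
       FROM   all_views
       GROUP  BY item_id) AS a ON a.item_id = i.id;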
Edit:
If all "easy" optimisation is done, then you have 2 choices... The one you just described (doing it on client size, if it's possible - what is the size (in bytes) of the json arrays you send to the client?)
Your other choice is to do the processing in the background (via cron) & cache somehow the results.
As already indicated by other people responding to your post, you should give us an idea of the structure of your three tables and the intent of each. Based upon that information, you may be able to get significant performance improvements by optimizing your database structure. To make it easier to understand, let's assume that someone had a website running off an intelligently designed database. I could easily make that application perform ten times worse solely by modifying the structure of the database.
Now, maybe there's some reason why you need to have three distinct tables, but I can't make that judgment without knowing what the fields in the database are, what you're aggregating, and what your web application is doing in the first place. Is it read heavy or write heavy? The solution may be as simple as denormalizing your database so that you don't need to use any joins.
I can say from a cursory glance at your description of what you're doing that this application can't possibly scale efficiently and that you really need to reconsider your design. The first warning sign for me is the fact that you stated that one of the joins is just to link the title to two other tables. To me, being forced to do a join just to get the title of an object seems indicative of over-normalization. Some data redundancy is not necessarily a bad thing, and in some situations it's absolutely mandatory. Also, you say that you have two large tables that you use aggregate functions on and then join everything together. I can tell you right now that you're going to run into some serious performance issues if every hit to your application involves a triple join and two aggregate functions (COUNT, I'm assuming).
Ultimately, we'll be able to give you a better response once you provide more information as to what you're trying to accomplish, and the general structure of the database you set up for it.
I have a pretty large social-network type site I have been working on for about 2 years (high traffic and hundreds of files). I have been experimenting for the last couple of years with tweaking things for max performance for the traffic, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am re-designing the MySQL DBs and everything.
Below is a photo I made up of a couple of MySQL tables that I have a question about. I currently have the login table, which is used in the login process; once a user is logged into the site they very rarely need to hit that table again unless editing an email or password. I then have a user table which is basically the user's settings and profile data for the site. This is where I have questions: would it be better for performance to split the user table into smaller tables? For example, if you view the user table you will see several fields that I have marked as "setting_"; should I just create a separate settings table? I also have fields marked with "count" which could be the total count of comments, photos, friends, mail messages, etc. So should I create another table to store just the total count of things?
The reason I have them all in one table now is because I was thinking maybe it would be better if I could cut down on MySQL queries: instead of hitting 3 tables to get information on every page load, I could hit 1.
Sorry if this is confusing, and thanks for any tips.
(Image of the login and user table structures: http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg)
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
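For example (column names are made up, just to show the idea):

-- login check touches only what it needs
SELECT id, password FROM login WHERE email = ?;

-- profile page pulls only the profile fields, not the whole row
SELECT username, setting_timezone, photo_count FROM user WHERE id = ?;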
should I just create a seperate setting table?
So should I create another table to store just the total count of things?
There is not a single correct answer for this; it depends on what your application is doing.
What you can do is to measure and extrapolate the results in a dev environment.
On one hand, using a separate table will save you some space and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) by having to join information from different tables.
About the counts, I think it's fine to have them there; although it is often said that it's better to calculate this kind of thing on the fly, I don't think it hurts you at all in this situation.
But again, the only way to know what's better for you and your specific app is to measure, profile, and find out what the benefit of doing so would be. You might only gain a 2% improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter columns and frequently updated timestamps in their own table, because every time you bump them the entire row is rewritten.
I wouldn't consider your user table terribly large in number of columns, just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removal of redundancy. Perhaps you have a lot of users who have the same settings; that would be a case for breaking the table out.
You should take into account the average size of a single row in order to find out whether retrieval is expensive. Also, try to use indexes while looking for data.
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else... depends on the data saved there.
Also, since the social network site using this data also handles authentication and authorization processes (I assume so), the separation between login and user tables should offer good performance, because the login data is short enough, while the profile only needs to be accessed once, immediately after a successful login. Just apply the right tricks to improve DB performance and it's done.
(Remember to visualize tables as entities and name each one as an entity, not as a collection of them.)
Two things you will want to consider when deciding whether or not you want to break up a single table into multiple tables is:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths, that will help performance at the potential cost of disk space. From what I can tell, a common approach is to keep fixed-length data in its own table while the variable-length data goes somewhere else (see the sketch after this list).
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time then it may not be worth splitting it up as you will be slowing down both inserts and quite potentially reads. However, if there is some data in that table that does not get accessed as often then that would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement but I do recall in a MySQL Performance talk given by Jay Pipes that he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is but regardless joins will usually take longer than queries out of a single table.
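A rough sketch of that fixed-versus-variable split (hypothetical columns; the fixed-row-length benefit applies mainly to MyISAM):

-- fixed-length rows: every row is the same size on disk
CREATE TABLE user_core (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at  DATETIME     NOT NULL,
    login_count INT UNSIGNED NOT NULL DEFAULT 0,
    country     CHAR(2)      NOT NULL
);

-- variable-length data lives in a companion table, joined only when needed
CREATE TABLE user_text (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,
    bio     TEXT         NOT NULL,
    website VARCHAR(255) NOT NULL DEFAULT ''
);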