I am currently building an application where I am importing statistical data for (currently) around 15,000 products. At current, if I was to maintain one database table for each day statistics from one source it would be increased by 15,000 rows of data (let's say 5-10 fields per row primarily float, int) per day. Obviously equating to over 5 million records per year into one table.
That doesn't concern me so much as the thought of bringing in data from other sources (and thus increasing the size the database by 5 million records for each new source).
Now the data is statistical / trending based data, and will have basically 1 write per day per record, and many reads. For purposes of on the fly reporting and graphing however I need fast access to subsets of the data based on rules (date ranges, value ranges, etc).
What my question is, is this the best way to store the data (MySQL InnoDb tables), or is there a better way to store and handle statistical/trend data?
Other options I have tossed around at this point:
1. Multiple databases (one per product), with separate tables for each data source within.
(ie Database: ProductA, Table(s):Source_A, Source_B, Source_C)
2. One database, multiple tables (one for each product/data source)
(ie Database: Products, Table(s): ProductA_SourceA, ProductA_SourceB, etc.)
3. All factual or specific product information in the database and all statistical data in csv, xml, json, (flat files) in separate directories.
So far, none of these options are very manageable, each has its pros and cons. I need a reasonable solution before I move into the alpha stage of development.
You could try making use of a column based database. These kinds of databases are much better at analytical queries of the kind you're describing. There are several options:
http://en.wikipedia.org/wiki/Column-oriented_DBMS
We've had good experience with InfiniDB:
http://infinidb.org/
and Infobright looks good as well:
http://www.infobright.com/
Both InfiniDB and Infobright have free open source community editions, so I would recommend using these to get some benchmarks on the kinds of performance benefit you might get.
You might also want to look at partitioning your data to improve performance.
It's a little bit dependent upon what your data looks like, and the kind of aggregations/trends you're looking to run. Most relational databases work just fine for this sort of chronological data. Even with billions of records, proper indexing and partitioning can make quick work work of finding the records you need. DB's like Oracle, MySQL, SQL-Server fall within this category.
Lets say the products you work with are stocks, and for each stock you get a new price every day (a very realistic case). New exchanges, stocks, trade frequencies will grow this data exponentially pretty quickly. You could however partition the data by exchange. Or region.
Various Business Intelligence tools are also able to assist in, what effectively amounts to pre-aggregating data prior to retrieval. This is basically a Column-oriented database as was suggested. (Data Warehouses and OLAP structures can assist in massaging and aggregating data sets ahead of time).
Similar to the idea of data warehousing, if it's just a matter of the aggregations taking too long, you can work-off the aggregations overnight into a structure which is more quick to query from. In my previous example, you may only need to retrieve large chunks of data very infrequently, but more often some aggregation such as 52 week high. You can store the large amount of raw data in one format, and then every night have a job work off only what you need into a table which rather than thousands of data points per stock, now has 3 or 4.
If the trends you're tracking are really all over the place, or complex algorithms, a full fledged BI solution might be something to investigate so you can use pre-built analityic and data mining algorithms.
If the data is not very structured, you may have better luck with a NoSQL database like Hadoop or Mongo, although admittedly my knowledge of databases is more focused around relational formats.
Changing the data from relational to non-relational like graphs, Converting data to better and organized forms like using Data Marts and Data Lakes. Using Data Mining algorithms. Processing data together by using techniques like map reduce. Converting ACID properties to BASIC.
Related
I am currently working on a PHP application (pre-release).
Background
We have the a table in our MySQL database which is expected to grow extremely large - it would not be unusual for a single user to own 250,000 rows in this table. Each row in the table is given an amount and a date, among other things.
Furthermore, this particular table is read from (and written to) very frequently - on the majority of pages. Given that each row has a date, I'm using GROUP BY date to minimise the size of the result-set given by MySQL - rows contained in the same year can now be seen as just one total.
However, a typical page will still have a result-set between 1000-3000 results. There are also places where many SUM()'s are performed, totalling many tens - if not hundreds - of thousands of rows.
Trying MySQL
On a usual page, MySQL was usually taking around around 600-900ms. Using LIMIT and offsets weren't helping performance and the data has been heavily normalised, and so it doesn't seem like further normalisation would help.
To make matters worse, there are parts of the application which require the retrieval of 10,000-15,000 rows from the database. The results are then used in a calculation by PHP and formatted accordingly. Given this, the performance of MySQL wasn't acceptable.
Trying MongoDB
I have converted the table to MongoDB, and it's speed is faster - it usually takes around 250ms to retrieve 2,000 documents. However, the $group command in the aggregation pipeline - needed to aggregate fields depending on the year they fall in - slows things down. Unfortunately, keeping a total and updating that whenever a document is removed/updated/inserted is also out of the question, because although we can use a yearly total for some parts of the app, in other parts the calculations require that each amount falls on a specific date.
I've also considered Redis, although I think the complexity of the data is beyond what Redis was designed for.
The Final Straw
On top of all of this, speed is important. So performance is up there it terms of priorities.
Questions:
What is the best way to store data which is frequently read/written and rapidly growing, with the knowledge that most queries will retrieve a very large result-set?
Is there another solution to the problem? I'm totally open to suggestions.
I'm a little stuck at the moment, I haven't been able to retrieve such a large result-set in an acceptable amount of time. It seems most datastores are great for small retrieval sizes - even on large amounts of data - but I haven't been able to find anything on retrieving large amounts of data from an even larger table/collection.
I only read the first two lines but you are using aggregation (GROUP BY) and then expecting it to just do realtime?
I will say you are new to the internals of databases not to undermine you but to try and help you.
The group operator in both MySQL and MongoDB is in-memory. In other words it takes whatever data structure you povide, whether it be an index or a document (row) and it will go through each row/document taking the field and grouping it up.
This means that you can speed it up in both MySQL and MongoDB by making sure you are using an index for the grouping, but still this only goes so far, even with housing the index in your direct working set in MongoDB (memory).
In fact using LIMIT with a OFFSET as well is probably just slowing things down even further frankly. Since after writing out the set MySQL then needs to query again to get your answer.
Once done it will write out the result, MySQL will write it out to a result set (memory and IO being used here) and MongoDB will reply inline if you have not set $out, the maximum size of the inline output being 16MB (the maximum size of a document).
The final point to take away here is: aggregation is horrible
There is no silver bullet that will save you here, some databases will attempt to boast about their speed etc etc but fact is most big aggregators use something called "pre-aggregated reports". You can find a quick introduction within the MongoDB documentation: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/
This means that you put the effort of aggregating and grouping onto some other process which could do it easily enough allowing your reading thread, the one that needs to be realtime to do it's thang in realtime.
Im in the process of developing a large scale application that will contain a few tables with a large dataset. (Potentially 1M+ rows). This application will be a game with multiple users completing tasks at the same time and will be very data intensive.
In this application, data will be aggregated for users statistics. I have came up with two scenarios to achieve my desired affect of calculating all the statistics.
Scenario 1
Maintain a separate table to calculate user statistics. Meaning as a move is processed, the field would increase by one.
Table Statistics (Moves, Origins, Points)
$Moves++;
$Origins++
$Points = $Points + $Points;
Scenario 2
Count and sum the data fields as needed across all data.
Table Moves (Points, Origins)
SUM(Points)
SUM(Origins)
COUNT(Moves)
My question is, which of these two scenarios would be the most efficient on the database driver. It is my belief that Scenario 2 could possibly be more efficient because there will be far less data manipulation, but I'm unsure of the load that these queries may place on the DB.
I am using MySQL 5.5 InnoDB with a UTF8 Charset
The best route will depend on the frequency of reads vs. writes of points, origins and moves. Those frequencies, in turn, will be dependent upon use cases, code style and use (or lack) of caching.
It's difficult to provide a qualified opinion without more details, but consider the fact that a dedicated table brings with it some additional complications in the way of additional writes necessary for each operation and ensuring that those data tallies must always be correct (match the underlying detail data). In light of the additional complication storing logical data elements once rather than twice in a relational database is usually the best course of action.
If you're worried about performance and scaleability you might want to consider a non-relational approach using database platforms like Mongo or DynamoDB.
Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has it's own table that is created when that Boss signed up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talks of multiple servers, and databases linked together... but again, let's focus on a single server here with a singly MySQL DB.
If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition only one server (the master) can deal with update (or adding workorder) SQL requests. This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem, if they are not overwhelmed by queries, because they can keep most of the database on hard disk, and only keep what was accessed recently in cache (on memory).
The other important thing to prevent any single query from running slowly is to make sure you add the right index for each query you might perform to avoid linear searches. This is to allow the database to binary search for the record(s) required.
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you do do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there's some straight-forward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There's many variables that may work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance, either InnoDB or MyISAM.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
i had database size problem as well in one of my networks so big that it use to slow the server down when i run query on that table..
in my opinion divide your database into dates decide what table size would be too big for you - let say 1 million entries then calculate how long it will take you to get to that amount. and then have a script every that period of time to either create a new table with the date and move all current data over or just back that table up and empty it.
like putting out dated material in archives.
if you chose the first option you'll be able to access that date easily by referring to that table.
Hope that idea helps
Just create a workers table, bosses table, a relationships table for the two, and then all of your other tables. With a relationship structure like this, it's very dynamic. Because, if it ever got large enough you could create another relationship table between the work orders to the bosses or to the workers.
You might want to look into bigints, but I doubt you'll need that. I know it that the relationships table will get massive, but thats good db design.
Of course bigint is for mySQL, which can go up to -9223372036854775808 to 9223372036854775807 normal. 0 to 18446744073709551615 UNSIGNED*
I'm building an analytics system for a mobile application and have had some difficulty deciding how to store and process large amounts of data.
Each row will represent a 'view' (like a web page) and store some fixed attributes, like user agent and date. Additionally, each view may have a varying number of extra attributes, which relate to actions performed or content identifiers.
I've looked at Amazon SimpleDb which handles the varying number of attributes well, but has no support for GROUP BY and doesn't seem to perform well when COUNTing rows either. Generating a monthly graph with 30 data points would require a query for each day per dataset.
MySQL handles the COUNT and GROUP modifiers much better but additional attributes require storage in a link table and a JOIN to retrieve views where attributes match a given value, which isn't very fast. 5.1's partitioning feature may help speed things up a bit.
What I have gathered from a lot of reading and profiling queries on the aforementioned systems is that ultimately all of the data needs to be aggregated and stored in tables for quick report generation.
Have I missed anything obvious in my research and is there a better way to do this than use MySQL? It doesn't feel like the right task for the job, but I can't find anything capable of both GROUP/COUNT queries and a flexible table structure.
This is a case where you want to store the data once and read it over and over. Further I think that you'd wish the queries to be preprocessed instead of needing to be calculated on every go.
My suggestion for you is to store your data in CouchDB for the following reasons:
Its tables are structureless
Its queries are pre-processed
Its support for map-reduce allows your queries to handle group by
It has a REST service access model which lets you connect from pretty much anything that handle HTTP requests
You may find this suggestion a little out there considering how new CouchDB is. However I'd suggest for you to read about it because personally I think running a CouchDB database is sweet and lightweight. More light weight than MySQL
Keeping it in MySQL: If the amount of writes are limited / reads are more common, and the data is relatively simple (i.e: you can predict possible characters), you could try to use a text/blob column in the main table, which is updated with comma separated values or key/value pairs with an AFTER INSERT / UPDATE trigger on the join table. You keep the actual data in a separate table, so searching for MAX's / specific 'extra' attributes can still be done relatively fast, but retrieving the complete dataset for one of your 'views' would be a single row in the main table, which you can split into the separate values with the script / application you're using, relieving much of the stress on the database itself.
The downside of this is a tremendous increase in cost of updates / inserts in the join table: every alteration of data would require a query on all related data for a record, and a second insert into the 'normal' table, something like
UPDATE join_table
JOIN main_table
ON main_table.id = join_table.main_id
SET main_table.cache = GROUP_CONCAT(CONCAT(join_table.key,'=',join_table.value) SEPARATOR ';')
WHERE join_table.main_id = 'foo' GROUP BY main_table.id`).
However, as analytics data goes it usually trails somewhat, so possibly not every update has to trigger an update in cache, just a daily cronscript filling the cache with yesterdays data could do.
building a site using PHP and MySQL that needs to store a lot of properties about users (for example their DOB, height, weight etc) which is fairly simple (single table, lots of properties (almost all are required)).
However, the system also needs to store other information, such as their spoken languages, instrumental abilities, etc. All in all their are over a dozen such characteristics. By default I assumed creating a separate table (called maybe languages) and then a link table with a composite id (user_id, language_id).
The problem I foresee though is when visitors attempt to search for users using these criteria. The dataset we're looking to use will have over 15,000 users at time of launch and the primary function will be searching and refining users. That means hundreds of queries daily and the prospect of using queries with up a dozen or more JOINs in them is not appealing.
So my question is, is there an alternative that's going to be more efficient? One way I was thinking is storing the M2M values as a CSV of IDs in the user table and then running a LIKE query against it. I know LIKE isn't the best, but is it better than a join?
Any possible solutions will be much appreciated.
Do it with joins. Then, if your performance goals are not met, try something else.
Start with a normalized database (e.g. a languages table, linked to the users table by a mapping table) to make sure you data is represented cleanly and logically.
If you have performance problems, examine your queries and make sure you have suitable indexes.
If you dislike repeatedly coding up queries with many joins, define some views.
If views are very slow to query, consider materialized views.
If you have several thousand records and a few hundred queries per day (really, that's pretty small and low-usage), these techniques will allow your site to run at full speed, with no compromise on data integrity. If you need to scale to many millions of records and millions of queries per day, even these techniques may not be enough; in which case, investigate cacheing and denormalization.