Data storage for stock price data, daily AND weekly? - php

I am storing price history data for 3500 different stocks from 1970 to present (with a cron job running to update it every day).
What is the best way to store this data? It will be used to run calculations based on both daily data and weekly data. Currently I am storing it as:
stock_id, date, closing_price, high, low, open, volume
Since I want weekly price as well, should I make a separate table to store:
stock_id, week_end_date, weekly_closing_price, weekly_high, weekly_low, week_open_price, average_daily_volume, total_weekly_volume
Since this data is all calculable from the first table, is it necessary to store it again? The only reason I am considering it is that there are a LOT of rows of data to run calculations against.

It depends on how much data you have and what your other transactional requirements are.
It doesn't make sense to duplicate this data in your source/OLTP system if you have one. I'm a SQL Server programmer, not MySQL, but I imagine they have datepart functions like all other RDBMS so determining a week number from a date is trivial.
When you get to OLAP or reporting, though, you may want to make another table with data at your week-level granularity. This will make reporting much faster, especially for things like aggregations which typically don't perform well when run against the output of a function.
Both these depend on the scale of your data. If you have hundreds of rows per day, it may not be worthwhile to do a materialized weekly table for that. If you have tens of thousands of records per day then the performance benefits will probably make it a reasonable option.
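For example, in MySQL the weekly bucketing can be done on the fly with YEARWEEK(). This is only a sketch, assuming the daily table is called daily_prices (a name I've made up) and using the columns from the question:

    SELECT
        stock_id,
        YEARWEEK(`date`, 3) AS iso_week,            -- mode 3 = ISO weeks (Mon-Sun)
        MAX(`date`)         AS week_end_date,
        MAX(high)           AS weekly_high,
        MIN(low)            AS weekly_low,
        AVG(volume)         AS average_daily_volume,
        SUM(volume)         AS total_weekly_volume
    FROM daily_prices
    WHERE stock_id = 42                             -- hypothetical stock id
    GROUP BY stock_id, YEARWEEK(`date`, 3);

The weekly open and close need one extra step (a join back on MIN(`date`) and MAX(`date`) per week), but the grouping itself is trivial.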

You ask if it's necessary? Who knows. That depends on how much disk space you have. However, what you are describing is an "old fashioned" aggregation table and is often used to improve reporting performance. When dealing with historical data, there's no need to recalculate things like weekly totals since the data doesn't change.
In fact, if I were doing this, I'd also define "monthly" and "annual" summary tables for more flexibility, especially for so much history. You can consider "standardizing" the data in such a way that each period is comparable. Calendar months and weeks have different numbers of trading days so things like "average daily volume" might be misleading.
If you really want to get fancy, do some research on ROLAP solutions. It's a very broad topic but you might find it useful.
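As a sketch of what such an aggregation table might look like, reusing the hypothetical daily_prices table from the answer above (all names here are illustrative):

    CREATE TABLE monthly_prices (
        stock_id             INT NOT NULL,
        month_start          DATE NOT NULL,
        monthly_high         DECIMAL(10,2),
        monthly_low          DECIMAL(10,2),
        trading_days         INT,                 -- lets you normalize "average daily volume"
        total_monthly_volume BIGINT,
        PRIMARY KEY (stock_id, month_start)
    );

    INSERT INTO monthly_prices
    SELECT stock_id,
           DATE_FORMAT(`date`, '%Y-%m-01'),
           MAX(high),
           MIN(low),
           COUNT(*),
           SUM(volume)
    FROM daily_prices
    GROUP BY stock_id, DATE_FORMAT(`date`, '%Y-%m-01');

Storing the number of trading days per period is what lets you compare months or weeks of different lengths, per the point above.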

Since this data is all calculable from the first table, is it necessary to store it again?
It's not necessary to summarize it and store it. You can just create a view that does all the summary calculations, and query the view.
If you're going to run reports over the full range of data a lot, though, it makes sense to summarize it once, and store the result. You're going to start with about 40 million rows. (3500 stocks * 43 years * about 265 days/year)
If I were in your shoes, I'd load the data, write the query for weekly prices, and test the performance. If it's too slow, insert the summary data into a table.
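A minimal sketch of both options, again assuming the daily table is called daily_prices:

    -- Option 1: a view, computed at query time
    CREATE VIEW weekly_prices AS
    SELECT stock_id,
           YEARWEEK(`date`, 3) AS iso_week,
           MAX(`date`)         AS week_end_date,
           MAX(high)           AS weekly_high,
           MIN(low)            AS weekly_low,
           SUM(volume)         AS total_weekly_volume
    FROM daily_prices
    GROUP BY stock_id, YEARWEEK(`date`, 3);

    -- Option 2: if the view is too slow over ~40 million rows, materialize it
    -- once and let the daily cron job refresh the most recent week.
    CREATE TABLE weekly_prices_summary AS
    SELECT * FROM weekly_prices;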

Related

MySQL big chunks of data - fast access without recalculation

I would like to store big chunks of data in RAM using Sphinx / Solr / Elasticsearch or whatever else suits such needs (the problem is I don't know which tool suits this best; I have only heard that people use them).
I build reports about sales. I get nearly 800-900k lines of sales per month, and the user wants to scroll the page and see them smoothly.
I can't send them all the data at once because the browser will just hang,
and at the same time I can't use LIMIT in MySQL because the queries require joins across several tables.
Recalculating it on the fly is not an option.
Creating a temp table in MySQL is a bad idea because there are a bunch of criteria and more than one user can view the data.
Temporary_table
id product_id product_count order_id order_status ... user_id
With such a table I would store all the results for the current user and hold them there as long as the user doesn't make a new query. But I don't like this solution. There must be something better.
I feel like it's over my head.
Any ideas?
"Drill down", don't "Scroll down" !
When I need to present a million lines of info, I start by thinking of ways to slice and dice it -- subtotals by hour, by region, by product type, by whatever. Each slice might be a hundred lines -- quite manageable, especially with summary tables.
In that hundred lines would be clickable items that take the user to a more detailed page about one of the items. That page would also have a hundred lines (or 10, or 1000 -- whatever makes sense, though >1000 is usually unreasonable). It may have further links to drill further down, and/or links to move laterally.
With suitable slicing and dicing, you are very unlikely to need to send the user a million lines; only a few hundred.
With suitable Summary Tables, the "tmp tables", etc, go away.
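For example, one such summary table for the sales report might look like this (a sketch only; the sales table and its columns are assumptions, not from the question):

    CREATE TABLE sales_daily_by_region (
        sale_date    DATE NOT NULL,
        region_id    INT NOT NULL,
        order_count  INT NOT NULL,
        total_amount DECIMAL(14,2) NOT NULL,
        PRIMARY KEY (sale_date, region_id)
    );

    -- Refreshed periodically, so the top-level report reads a few hundred
    -- pre-aggregated rows instead of 800-900k raw sales lines.
    REPLACE INTO sales_daily_by_region
    SELECT DATE(created_at), region_id, COUNT(*), SUM(amount)
    FROM sales
    WHERE created_at >= CURDATE() - INTERVAL 1 DAY
    GROUP BY DATE(created_at), region_id;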

How to handle user's data in MySQL/PHP, for large number of users and data entries

Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
1. Each Boss has an ID. There is one table called workorders with a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, which to me seems to add up fast.
2. Each Boss has its own table that is created when that Boss signs up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talks of multiple servers and databases linked together... but again, let's focus on a single server here with a single MySQL DB.
If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition, only one server (the master) can handle writes (updating or adding work orders). This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem, as long as they are not overwhelmed by queries, because they can keep most of the data on disk and only what was accessed recently in cache (in memory).
The other important thing for preventing any single query from running slowly is to make sure you add the right index for each query you might perform, to avoid linear searches. This lets the database use an index lookup (essentially a binary search) to find the record(s) required.
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
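A sketch of both ideas, assuming a workorders table with a boss_id column (all names are illustrative):

    -- Index that serves "work orders for boss X, newest first" without a full scan
    CREATE INDEX idx_workorders_boss_created
        ON workorders (boss_id, created_at);

    -- Counter cache: a running total instead of COUNT(*) over millions of rows
    CREATE TABLE boss_counters (
        boss_id         INT PRIMARY KEY,
        workorder_count INT NOT NULL DEFAULT 0
    );

    -- Run alongside each work order INSERT (or from a trigger)
    UPDATE boss_counters
    SET workorder_count = workorder_count + 1
    WHERE boss_id = 123;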
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there's some straight-forward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There are many settings that may work for some people but might slow down your application, and they're highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine (InnoDB or MyISAM), for instance.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
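For instance, checking one of the slower queries might look like this (hypothetical table and values):

    EXPLAIN
    SELECT * FROM workorders
    WHERE boss_id = 123 AND created_at >= '2013-01-01';
    -- "type: ALL" with a large "rows" estimate means a full table scan;
    -- add or adjust an index and re-check.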
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
I had a database size problem as well in one of my networks; the table got so big that it used to slow the server down whenever I ran a query on it.
In my opinion, divide your database by date. Decide what table size would be too big for you (say 1 million entries), calculate how long it will take to reach that amount, and then have a script run at that interval to either create a new table named by date and move the current data over, or just back that table up and empty it.
It's like putting outdated material into archives.
If you choose the first option you'll be able to access that data easily by referring to the table for that date.
Hope that idea helps.
Just create a workers table, a bosses table, a relationships table for the two, and then all of your other tables. With a relationship structure like this, it's very dynamic, because if it ever got large enough you could create another relationship table between the work orders and the bosses or the workers.
You might want to look into BIGINTs, but I doubt you'll need that. I know the relationships table will get massive, but that's good DB design.
For reference, MySQL's BIGINT ranges from -9223372036854775808 to 9223372036854775807 signed, or 0 to 18446744073709551615 unsigned.
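A minimal sketch of that layout (all names are illustrative, not prescribed by the question):

    CREATE TABLE bosses (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    );

    CREATE TABLE workers (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    );

    -- The boss/worker relationship table
    CREATE TABLE boss_worker (
        boss_id   INT UNSIGNED NOT NULL,
        worker_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (boss_id, worker_id)
    );

    CREATE TABLE workorders (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- BIGINT only if you expect > ~4 billion rows
        worker_id  INT UNSIGNED NOT NULL,
        created_at DATETIME NOT NULL,
        INDEX (worker_id, created_at)
    );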

Insert Total value of order, or calculate every time?

I'm building a simple e-commerce website, and I'm having some doubts about order confirmation and creation...
It's more a question of good practices than a real question...
Should I calculate the total order value and insert it into the database, or should I calculate it every time I read/do something with the order?
Thanks
Best practices in database land mean normalised data, and storing values that can be calculated violates that.
You should never store things that you can calculate unless you're absolutely certain that there won't be a discrepancy between the two values.
For example, what do you think should happen if your order consists of two $100 items but the stored order total says $150?
Sometimes, there are justifications (usually performance related) to storing values that can be otherwise calculated but the performance gains have to be both significant and necessary, and the possibility of inconsistency either removed or planned for.
You can remove the possibility with things like triggers or materialised views, or you can plan for them by changing business logic to detect and fix problems, or by other means.
But usually, the performance gains aren't worth the extra effort of mitigating the potential problems. After all, how many orders do you see with millions of individual items on them? Other than the US DoD, of course :-)
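For example, one way to remove the possibility of inconsistency in MySQL is a trigger that maintains the stored total. This is a sketch, assuming hypothetical orders(id, total) and order_items(order_id, price, quantity) tables; matching UPDATE and DELETE triggers would be needed as well:

    DELIMITER //
    CREATE TRIGGER order_items_after_insert
    AFTER INSERT ON order_items
    FOR EACH ROW
    BEGIN
        -- Keep the denormalized total in step with the line items
        UPDATE orders
        SET total = total + (NEW.price * NEW.quantity)
        WHERE id = NEW.order_id;
    END //
    DELIMITER ;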
In your transactional database (ie the main, live one) you should not store the calculated value. Summing a dozen rows is nothing.
However, in your analytic database (a.k.a. business intelligence DB / data warehouse) you should definitely store the calculated total.
As a good practice, you should not store values that can be derived from existing columns in the table.
There are several reasons not to do this, such as:
If you update any of the underlying columns, then you have to update the derived column as well.
The size of the table will be increased.
And so on.

Is there any advantage to calculating the duration in MySQL as opposed to calculating duration in PHP (and then storing in MySQL)?

QUESTION: Is there any advantage to calculating the duration in MySQL as opposed to calculating duration in PHP (and then storing in MySQL)?
I had intended to calculate the duration each time an activity is done. The duration would be calculated in PHP and then inserted into a MySQL DB (along with other data such as start time, end time, user, activity, etc).
But, based on this question Database normalization: How can I tabulate the data? I got the impression that rather than record duration at the time of insert, I should calculate it based on the start and end values saved in the MySQL DB.
Does this sound right? If yes, can someone explain why? Are there any recommended ways of calculating duration for values stored in MySQL?
EDIT:
After a user completes an activity online, the start and finish times for that activity are inserted into the DB. I was going to use these values to calculate the duration, either in MySQL or prior to insertion using PHP. The duration would later be used for other calculations.
I assume you have a start_time and an end_time as basis for your duration, both of which will be stored in the database anyway? Then yes, there's hardly an advantage to storing the duration in the database as well. It's only duplicated data that you are storing already anyway (duration = end - start, which is a really quick calculation), so why store it again? Furthermore, that only allows for the data to go out of sync. Say some bug causes the end_time to be updated, but not the duration. Now you have inconsistent data with no real way to know which is correct.
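Deriving the duration at query time is a one-liner in MySQL (the table and column names here are hypothetical):

    SELECT user_id,
           activity,
           TIMESTAMPDIFF(SECOND, start_time, end_time) AS duration_seconds
    FROM activity_log
    WHERE end_time IS NOT NULL;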
I think that it depends on the size of the database, server load, etc... I have had instances where processing in PHP is faster, and other times where processing in MySQL is faster. There are lots of factors that could affect performance.
However, the thing to keep in mind is that you want to avoid multiple database calls. If you are going to try this in PHP, and loop through each record and do an update per record, I think that the number of mysql calls could hinder performance. However, if you calculate the duration in PHP prior to the insert, then it makes sense. If the data is already in the database, then perhaps a single update statement would be the best option.
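For example, if you did decide to backfill a stored duration for rows already in the database, a single set-based statement beats a PHP loop of per-row updates (hypothetical names again):

    UPDATE activity_log
    SET duration_seconds = TIMESTAMPDIFF(SECOND, start_time, end_time)
    WHERE duration_seconds IS NULL;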
Just my 2c
In my opinion this depends mostly on the situation, so maybe add a little more detail to your post so we can better understand what you're aiming at.
If your program has a lot of database-related actions, the database server is slower than your PHP server, and it involves thousands and thousands of calculations, it may be better to calculate this in your PHP code.
If your program doesn't leave the database alone very much and your code is already doing a lot of work, then it may be slightly better to let the database do the job.
If you've already stored the start and end time in your table, storing the duration as well is usually unnecessary overhead (it could still be done to improve performance, if database space isn't an issue).
But, taking all of this into consideration, I don't think this decision is critical for most applications; it is most likely more a question of personal flavour and preference.
I think it is better to keep two separate fields (start and end time) in MySQL rather than calculating the duration in PHP and storing it.
The reasons:
While it may be true that MySQL will have to calculate the duration on every retrieval, it is also true that MySQL is very good at this; with a well-made index it should have no negative performance side-effects.
It gives you more data to work with. Let's say you want to find out when users finished a particular action. If you kept only the duration, you would have to calculate the finish time again, making it prone to errors. Keeping the second date may come in handy.
The same is true if you want to calculate differences between the activities of multiple users. In that case a pre-calculated value would be a pain in the a*s, since it would make you do more reverse calculations.
So in my opinion, add the separate fields. It is not a normalization problem, since you are not duplicating any data; a stored duration, however, would be.

What is the best way of storing trend data?

I am currently building an application where I am importing statistical data for (currently) around 15,000 products. If I maintain one database table for daily statistics from one source, it grows by 15,000 rows of data per day (say 5-10 fields per row, primarily float and int), which works out to over 5 million records per year in one table.
That doesn't concern me so much as the thought of bringing in data from other sources (and thus increasing the size of the database by 5 million records per year for each new source).
The data is statistical/trend data, with basically 1 write per day per record and many reads. For on-the-fly reporting and graphing, however, I need fast access to subsets of the data based on rules (date ranges, value ranges, etc).
What my question is, is this the best way to store the data (MySQL InnoDb tables), or is there a better way to store and handle statistical/trend data?
Other options I have tossed around at this point:
1. Multiple databases (one per product), with separate tables for each data source within.
(ie Database: ProductA, Table(s):Source_A, Source_B, Source_C)
2. One database, multiple tables (one for each product/data source)
(ie Database: Products, Table(s): ProductA_SourceA, ProductA_SourceB, etc.)
3. All factual or specific product information in the database and all statistical data in csv, xml, json, (flat files) in separate directories.
So far, none of these options seems very manageable; each has its pros and cons. I need a reasonable solution before I move into the alpha stage of development.
You could try making use of a column based database. These kinds of databases are much better at analytical queries of the kind you're describing. There are several options:
http://en.wikipedia.org/wiki/Column-oriented_DBMS
We've had good experience with InfiniDB:
http://infinidb.org/
and Infobright looks good as well:
http://www.infobright.com/
Both InfiniDB and Infobright have free open source community editions, so I would recommend using these to get some benchmarks on the kinds of performance benefit you might get.
You might also want to look at partitioning your data to improve performance.
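For example, range-partitioning the statistics table by year keeps date-range queries scanning only the relevant partitions. This is only a sketch; the table and column names are assumptions:

    CREATE TABLE product_stats (
        product_id INT NOT NULL,
        source_id  INT NOT NULL,
        stat_date  DATE NOT NULL,
        value      FLOAT,
        PRIMARY KEY (product_id, source_id, stat_date)
    )
    PARTITION BY RANGE (YEAR(stat_date)) (
        PARTITION p2012 VALUES LESS THAN (2013),
        PARTITION p2013 VALUES LESS THAN (2014),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );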
It's a little bit dependent on what your data looks like and the kind of aggregations/trends you're looking to run. Most relational databases work just fine for this sort of chronological data. Even with billions of records, proper indexing and partitioning can make quick work of finding the records you need. DBs like Oracle, MySQL, and SQL Server fall within this category.
Let's say the products you work with are stocks, and for each stock you get a new price every day (a very realistic case). New exchanges, stocks, and trade frequencies will grow this data very quickly. You could, however, partition the data by exchange. Or region.
Various Business Intelligence tools are also able to assist in what effectively amounts to pre-aggregating data prior to retrieval. This is basically the column-oriented database approach suggested above (data warehouses and OLAP structures can help massage and aggregate data sets ahead of time).
Similar to the idea of data warehousing, if it's just a matter of the aggregations taking too long, you can work off the aggregations overnight into a structure that is quicker to query. In my previous example, you may only need to retrieve large chunks of raw data very infrequently, but some aggregation such as the 52-week high much more often. You can store the large amount of raw data in one format, and then every night have a job work off only what you need into a table which, rather than thousands of data points per stock, holds just 3 or 4.
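A sketch of that overnight job for the stock example (all names are illustrative; daily_prices stands for the hypothetical raw daily table):

    CREATE TABLE stock_rolling_stats (
        stock_id INT NOT NULL,
        as_of    DATE NOT NULL,
        high_52w DECIMAL(10,2),
        low_52w  DECIMAL(10,2),
        PRIMARY KEY (stock_id, as_of)
    );

    -- Run nightly (e.g. from cron); reports then read one row per stock
    -- instead of ~250 daily data points.
    REPLACE INTO stock_rolling_stats (stock_id, as_of, high_52w, low_52w)
    SELECT stock_id, CURDATE(), MAX(high), MIN(low)
    FROM daily_prices
    WHERE `date` >= CURDATE() - INTERVAL 52 WEEK
    GROUP BY stock_id;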
If the trends you're tracking are really all over the place, or involve complex algorithms, a full-fledged BI solution might be worth investigating so you can use pre-built analytic and data mining algorithms.
If the data is not very structured, you may have better luck with a NoSQL database like Hadoop or Mongo, although admittedly my knowledge of databases is more focused around relational formats.
Other approaches: changing the data from relational to non-relational forms (such as graphs); converting the data into better organized forms such as data marts and data lakes; using data mining algorithms; processing data in bulk with techniques like MapReduce; and relaxing ACID properties to BASE.
