I'm developing a PHP dashboard with statistics drawn from many different MySQL tables. I need many cumulative and other totals for building charts and so on. Some queries can be very simple, others join and compare many tables at once.
To do this properly I am considering three approaches:
1) Creating a master table which is constantly updated after each front-end (website) and back-end (CRM) interaction and which contains only totals for different purposes. The data is then easily accessible with a simple SELECT statement when building the main dashboard (a rough sketch of what I mean follows this list).
2) Using many and/or complex queries each time the dashboard is generated. This takes less development time but more loading time. Maybe there is a better way to manage the order and execution of each query.
3) Creating cron jobs that update the totals in the background. This is my least favourite approach because it feels outdated for multiple reasons.
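To illustrate approach 1, here is a minimal sketch of what I mean (table and column names are just placeholders; dashboard_totals would need a UNIQUE key on metric):

```php
<?php
// Approach 1: bump a named counter in a totals table on every interaction.
function bumpTotal(PDO $pdo, $metric, $delta = 1)
{
    $sql = "INSERT INTO dashboard_totals (metric, total)
            VALUES (:metric, :delta)
            ON DUPLICATE KEY UPDATE total = total + VALUES(total)";
    $stmt = $pdo->prepare($sql);
    $stmt->execute(array('metric' => $metric, 'delta' => $delta));
}

// e.g. right after an order is saved on the website:
// bumpTotal($pdo, 'orders_total');
```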
Could someone advise me and explain what the best approach would be for the long term?
Thanks in advance.
Fred.
First of all, there is no single "correct approach", only one that fits your needs.
To decide what fits your needs, two things must be considered:
Your business requirements (real time, daily updates, etc.)
Scalability and maintainability of the code
In your case, I would combine the options.
First of all, I would create a dedicated reporting database with its own tables, which is good for performance and for keeping historical data (data that you would lose if you took your 2nd approach).
With that in place, the question of whether to update the dashboard online or via a cron job depends heavily on the business need. I think a cron job is better:
First, it's scalable - you can always ditch it in the future and switch to updating the dashboard directly...
Second, you can time it to run during "slow hours", preventing overload on your production servers. Different cron jobs can update different tables at different frequencies, and so on.
This is, of course, my opinion. Hope it helps.
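To make the cron-job idea concrete, here is a minimal sketch of a script you could run from cron (all database, table and column names are invented for illustration):

```php
<?php
// refresh_dashboard.php - run from cron, e.g.: */15 * * * * php refresh_dashboard.php
// Recomputes a couple of dashboard totals into a dedicated summary table.
$pdo = new PDO('mysql:host=localhost;dbname=reporting', 'user', 'secret',
               array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));

$pdo->beginTransaction();

// Rebuild today's totals from the operational tables.
$pdo->exec("REPLACE INTO dashboard_totals (metric, total)
            SELECT 'orders_today', COUNT(*)
            FROM shop.orders WHERE DATE(created_at) = CURDATE()");

$pdo->exec("REPLACE INTO dashboard_totals (metric, total)
            SELECT 'revenue_today', COALESCE(SUM(amount), 0)
            FROM shop.orders WHERE DATE(created_at) = CURDATE()");

$pdo->commit();
```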
Related
I currently have a MySQL database which handles a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) arriving in real time. The structure is such that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the click balance by 1 (there is actually more processing depending on the event) for each of the user, the sub-affiliate and the affiliate. Currently I do it very simply: once I receive the event, I run sequential queries in PHP - I read the user's balance, increment it by one and store the new value, then I read the sub-affiliate's balance, increment and write it, and so on.
The user's balance is the most important metric for me, so I want to keep it as close to real time as possible. The other metrics on the sub-affiliate and affiliate level are less important, but the closer they are to real time the better; I think a 5 minute delay would be OK.
As the project grows, this is already becoming a bottleneck, and I am now looking at alternatives - how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, so I currently wrap each cycle of changes to the click balances in an SQL transaction (roughly as sketched below).
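For context, the current per-event update looks roughly like this (a simplified sketch; the table names are made up, and the read-then-write pairs are collapsed into atomic UPDATEs, but the shape is the same):

```php
<?php
// One event = one transaction touching three balance rows.
function recordClick(PDO $pdo, $userId, $subAffId, $affId)
{
    $pdo->beginTransaction();
    try {
        $pdo->prepare("UPDATE user_balances      SET clicks = clicks + 1 WHERE user_id = ?")
            ->execute(array($userId));
        $pdo->prepare("UPDATE subaff_balances    SET clicks = clicks + 1 WHERE subaff_id = ?")
            ->execute(array($subAffId));
        $pdo->prepare("UPDATE affiliate_balances SET clicks = clicks + 1 WHERE affiliate_id = ?")
            ->execute(array($affId));
        $pdo->commit();
    } catch (Exception $e) {
        $pdo->rollBack();
        throw $e; // do not silently lose the event
    }
}
```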
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates into the database itself by using stored procedures. I am considering adding a separate database for this; maybe Postgres would be better suited for the job? I tried to find out whether there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Move this particular data stream to something like Hadoop with Parquet (or Apache Kudu?) and just add more servers as needed.
4 - Shard the existing DB, basically adding a separate DB server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize that table and do mass UPDATEs of the counters. When there is a burst of traffic, this becomes more efficient, so it does not keel over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips the tables, then loops back to processing -- as fast as it can.
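A rough sketch of that flip-and-summarize loop (this is the general idea, not the linked article's actual code; all table names are invented):

```php
<?php
// Raw clicks are INSERTed into clicks_staging by the web tier. This worker
// atomically swaps in an empty table, then folds the swapped-out batch into
// the balance tables with one mass statement per level.
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'secret');

while (true) {
    // Atomic swap: staging becomes "processing", a fresh empty table takes its place.
    $pdo->exec("CREATE TABLE IF NOT EXISTS clicks_empty LIKE clicks_staging");
    $pdo->exec("RENAME TABLE clicks_staging TO clicks_processing,
                             clicks_empty   TO clicks_staging");

    // One mass update per level instead of one UPDATE per event.
    $pdo->exec("INSERT INTO user_balances (user_id, clicks)
                SELECT user_id, COUNT(*) FROM clicks_processing GROUP BY user_id
                ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)");
    // ... same pattern for subaff_balances and affiliate_balances ...

    $pdo->exec("DROP TABLE clicks_processing");
}
```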
If I were you, I would use Redis in-memory storage and increment your metrics there. It's very fast and reliable, and you can read from it as well. Then create a cron job which periodically saves that data into the MySQL DB.
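A minimal sketch of that idea, assuming the phpredis extension; the key and table names are invented:

```php
<?php
// On each event: cheap atomic in-memory increments.
// ($userId, $subAffId, $affId come from the incoming event.)
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->incrBy("clicks:user:{$userId}", 1);
$redis->incrBy("clicks:subaff:{$subAffId}", 1);
$redis->incrBy("clicks:aff:{$affId}", 1);

// In a cron job every few minutes: drain the counters into MySQL.
// (KEYS is fine for a sketch; SCAN is kinder to Redis in production.)
$pdo  = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'secret');
$stmt = $pdo->prepare("INSERT INTO user_balances (user_id, clicks) VALUES (?, ?)
                       ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)");

foreach ($redis->keys('clicks:user:*') as $key) {
    $count = (int) $redis->getSet($key, 0);   // read and reset atomically
    if ($count > 0) {
        $userId = (int) substr($key, strlen('clicks:user:'));
        $stmt->execute(array($userId, $count));
    }
}
```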
Is your web tier doing the number crunching as it receives and processes the HTTP request? If so, the very first thing you will want to do is move this work to a queue and process these events asynchronously. I believe you hint at this in your item 3.
There are many solutions, and choosing one is outside the scope of this answer, but some packages to consider (a rough sketch using one of them follows the list):
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
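To illustrate the asynchronous pattern with Gearman from the list above (the function and table names are invented; this is only a sketch):

```php
<?php
// Producer - inside the HTTP request: just enqueue the event and return.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('record_click', json_encode(array(
    'user_id' => $userId, 'subaff_id' => $subAffId, 'aff_id' => $affId,
)));

// Worker - a separate long-running process that does the actual DB writes.
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'secret');
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('record_click', function (GearmanJob $job) use ($pdo) {
    $event = json_decode($job->workload(), true);
    $pdo->prepare("UPDATE user_balances SET clicks = clicks + 1 WHERE user_id = ?")
        ->execute(array($event['user_id']));
    // ... sub-affiliate and affiliate updates, batching, retry logic, etc. ...
});
while ($worker->work()) {
    // keep consuming jobs
}
```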
In terms of storage it really depends on what you're trying to achieve: fast reads, fast writes, bulk reads, sharding/distribution, high availability... the answer to each pulls you in a different direction.
This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you; it is transparent to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that happens to be able to absorb mad amounts of analytical processing on real-time data as well.
What is a more optimal way of designing the database?
I am building an application where different companies register, add their users and use the application. I need up to 10 tables per account, e.g.
table1, table2, table3..table10
This is what I am doing currently: having 10 tables and storing all the information from all accounts in them. But I just got an idea which expands my database design horizontally. The idea is that whenever someone creates a company account, my app would dynamically create these tables, like the following:
c_31_table1, c_31_table2, c_31_table3...c_31_table10
Here 31 is the example account id of the company which joined. My assumption is that, because SQL otherwise only grows vertically, it would get slow as time goes on, with around 40k records or more per table in the future. So this approach would keep the vertical length of each table down and scale the database horizontally instead.
Is this technique a good optimization technique?
Typically not a great idea, though every situation is different. With this you would not benefit from the query cache (as much), your queries would require a bit more work up front, and stored procs, functions, and events would need CONCATs, PREPARE, EXECUTE and DEALLOCATE PREPARE everywhere.
Proper indexing and multi-tenancy should be your focus until you actually have a problem, which may be a long while off. I would focus there, checking EXPLAIN output as you develop. Profile your code. Always be profiling. Find out where routines are slow and fix them.
You will have a mumbo-jumbo housekeeping mess with your proposed solution; those tables don't easily clean themselves up. Also, DDL calls are not cheap, even if they only happen occasionally.
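A rough sketch of the multi-tenant alternative: one shared table with the account id leading a composite index, so one company's rows are found directly regardless of how many other tenants exist (all names are invented):

```php
<?php
// Shared, multi-tenant table: one table for all companies, keyed by account_id.
$pdo->exec("CREATE TABLE IF NOT EXISTS orders (
                id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
                account_id INT UNSIGNED    NOT NULL,
                created_at DATETIME        NOT NULL,
                amount     DECIMAL(10,2)   NOT NULL,
                PRIMARY KEY (id),
                KEY idx_account_created (account_id, created_at)
            ) ENGINE=InnoDB");

// Every query is scoped by account_id, which the composite index covers first.
$stmt = $pdo->prepare("SELECT COUNT(*), COALESCE(SUM(amount), 0)
                       FROM orders
                       WHERE account_id = ? AND created_at >= CURDATE()");
$stmt->execute(array($accountId));

// While developing, prefix such queries with EXPLAIN to confirm the index is used.
```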
If I worked at a place that suddenly wanted to name tables like
c_31_table1, c_31_table2, c_31_table3...c_31_table10
I would quit.
I'd consider such an approach a case of premature optimization.
With just 40k rows per table, the individual companies do not pose any significant load by themselves. Yet you would "poison" your application code with the effects of this optimization even though the current load does not need it.
Maybe you will need to introduce some optimizations once the number of companies grows (likely not before you are far beyond 1000 companies). Then you might, for example, consider partitioning as a means of reducing the "vertical" length, as sketched below.
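For illustration only (this is exactly the kind of change that can wait): MySQL hash partitioning on a shared table might look roughly like this, with invented names:

```php
<?php
// Split one big shared table into hash partitions by company_id. Application
// queries stay exactly the same; MySQL prunes partitions automatically when
// company_id appears in the WHERE clause.
$pdo->exec("CREATE TABLE user_events (
                id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
                company_id INT UNSIGNED    NOT NULL,
                created_at DATETIME        NOT NULL,
                payload    TEXT            NULL,
                PRIMARY KEY (id, company_id)  -- the partition key must be part of every unique key
            ) ENGINE=InnoDB
              PARTITION BY HASH (company_id)
              PARTITIONS 16");
```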
But keep your application code clean of such optimizations; otherwise you will pay a high price for them throughout the life of your application.
Another aspect:
Allowing the application to issue DDL calls also means granting privileges that you might not want to expose.
I have been tasked with writing a complicated invoicing and billing system with reports. The structure of the database is such that returning the invoices' totals and figures requires a complicated string of joins and calculations that is cumbersome to troubleshoot and highly error-prone.
Rather than produce a library of long and complicated queries throughout my reporting system, I opted to use a temporary table that performs the necessary calculations and holds the figures next to a record ID that I can join on when and where needed. This made coding the system much easier, but it also made the database noticeably slower, as every page hit that needs this temp table must populate it first.
In summary, I've got a reporting system that must be able to display calculated figures from a series of several invoice tables. Can anyone offer advice on an approach or solution that would simplify this process without costing me server speed?
Hard to judge, as your description is quite generic.
One possible solution: if the reports you are generating are allowed to be a little outdated, you could create a regular table instead of a temporary one and populate it with all the info using the event scheduler as often as needed (e.g. once a day or once an hour).
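A minimal sketch of that setup; the table, the aggregation and the schedule are all invented, and the event scheduler must be enabled (event_scheduler=ON):

```php
<?php
// One-off setup: a persistent summary table plus a MySQL event that refreshes it hourly.
$pdo->exec("CREATE TABLE IF NOT EXISTS invoice_totals (
                invoice_id  INT UNSIGNED  NOT NULL PRIMARY KEY,
                subtotal    DECIMAL(12,2) NOT NULL,
                tax         DECIMAL(12,2) NOT NULL,
                grand_total DECIMAL(12,2) NOT NULL
            ) ENGINE=InnoDB");

$pdo->exec("CREATE EVENT IF NOT EXISTS refresh_invoice_totals
            ON SCHEDULE EVERY 1 HOUR
            DO REPLACE INTO invoice_totals (invoice_id, subtotal, tax, grand_total)
               SELECT i.id,
                      SUM(l.amount),
                      SUM(l.amount) * i.tax_rate,
                      SUM(l.amount) * (1 + i.tax_rate)
               FROM invoices i
               JOIN invoice_lines l ON l.invoice_id = i.id
               GROUP BY i.id");

// Report pages then simply JOIN against invoice_totals instead of rebuilding a temp table.
```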
I am developing a project at work for which I need to create and maintain Summary Tables for performance reasons. I believe the correct term for this is Materialized Views.
I have 2 main reasons to do this:
Denormalization
I normalized the tables as much as possible, so there are situations where I would have to join many tables to pull data. We work with MySQL Cluster, which has pretty poor performance when it comes to JOINs.
So I need to create denormalized tables that can run faster SELECTs.
Summarize Data
For example, I have a Transactions table with a few million records. The transactions come from different websites. The application needs to generate a report that displays the daily or monthly transaction counts and total revenue amounts per website. I don't want the report script to calculate this every time, so I need to generate a summary table that will have a breakdown by [site, date].
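A rough sketch of what I have in mind for that example (all names are placeholders):

```php
<?php
// Summary table keyed by (site_id, day), plus the statement that folds new
// transactions into it incrementally ($lastId = highest transaction id already summarized).
$pdo->exec("CREATE TABLE IF NOT EXISTS transactions_daily (
                site_id  INT UNSIGNED  NOT NULL,
                day      DATE          NOT NULL,
                tx_count INT UNSIGNED  NOT NULL,
                revenue  DECIMAL(14,2) NOT NULL,
                PRIMARY KEY (site_id, day)
            ) ENGINE=InnoDB");

$stmt = $pdo->prepare("INSERT INTO transactions_daily (site_id, day, tx_count, revenue)
                       SELECT site_id, DATE(created_at), COUNT(*), SUM(amount)
                       FROM transactions
                       WHERE id > ?
                       GROUP BY site_id, DATE(created_at)
                       ON DUPLICATE KEY UPDATE
                           tx_count = tx_count + VALUES(tx_count),
                           revenue  = revenue  + VALUES(revenue)");
$stmt->execute(array($lastId));
```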
That is just one simple example. There are many different kinds of summary tables I need to generate and maintain.
In the past I have done these things by writing several cron scripts to keep each summary table updated. But in this new project, I am hoping to implement a more elegant and proper solution.
I would prefer a PHP based solution, as I am not a server administrator, and I feel the most comfortable when I can control everything through my application code.
Solutions that I have considered:
Copying VIEWs
If the resulting table can be represented as a single SELECT query, I can create a VIEW. Since views like this are slow, a cronjob could copy the VIEW into a real table.
However, some of these SELECT queries can be so slow that it's not acceptable even for cronjobs. It is not very efficient to recreate the whole summary data, if older rows are not even being updated much.
Custom Cronjobs for each Summary Table
This is the solution I have used before, but now I am trying to avoid it if possible. If there will be many summary tables, it can be messy to maintain.
MySQL Triggers
It is possible to add triggers to the main tables so that every time there is an INSERT, UPDATE or DELETE, the summary tables get updated accordingly.
There would be no cronjobs and the summaries would be updated in real time. However, if there is ever a need to rebuild a summary table from scratch, it would have to be done with another solution (probably #1 above).
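A sketch of what this could look like for the [site, date] example above (hypothetical names; only the INSERT trigger is shown, and UPDATE/DELETE triggers would follow the same pattern with compensating arithmetic):

```php
<?php
// AFTER INSERT trigger that keeps transactions_daily in step with transactions.
$pdo->exec("CREATE TRIGGER trg_tx_insert
            AFTER INSERT ON transactions
            FOR EACH ROW
                INSERT INTO transactions_daily (site_id, day, tx_count, revenue)
                VALUES (NEW.site_id, DATE(NEW.created_at), 1, NEW.amount)
                ON DUPLICATE KEY UPDATE
                    tx_count = tx_count + 1,
                    revenue  = revenue  + NEW.amount");
```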
Using ORM Hooks/Triggers
I am using Doctrine as my ORM. There is a way to add event listeners that will trigger stuff on INSERT/UPDATE/DELETE, which in turn can update the summary tables. In a sense this solution is similar to #3 above, but I will have better control over these triggers since they will be implemented in PHP.
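A rough sketch of this option, assuming Doctrine ORM 2's lifecycle events; the Transaction entity, its getters and the summary table are all invented:

```php
<?php
use Doctrine\ORM\Events;
use Doctrine\ORM\Event\LifecycleEventArgs;

// Every persisted Transaction entity bumps the matching summary row.
class TransactionSummaryListener
{
    public function postPersist(LifecycleEventArgs $args)
    {
        $entity = $args->getEntity();
        if (!$entity instanceof Transaction) {
            return;
        }
        $args->getEntityManager()->getConnection()->executeUpdate(
            "INSERT INTO transactions_daily (site_id, day, tx_count, revenue)
             VALUES (?, ?, 1, ?)
             ON DUPLICATE KEY UPDATE tx_count = tx_count + 1,
                                     revenue  = revenue + VALUES(revenue)",
            array($entity->getSiteId(), $entity->getCreatedAt()->format('Y-m-d'), $entity->getAmount())
        );
    }
}

// Registration, e.g. in bootstrap code:
// $entityManager->getEventManager()->addEventListener(array(Events::postPersist), new TransactionSummaryListener());
```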
Implementation Considerations:
Complete Rebuilds
For efficiency I want to avoid having to rebuild the summary tables, and instead only update them with new data. But in case something goes wrong, I need the capability to rebuild the summary table from scratch using the existing data in the main tables.
Ignoring UPDATE/DELETE on Old Data
Some summaries can assume that older records will never be updated or deleted, but only new records will be inserted. The summary process can save a lot of work by making the assumption that it doesn't need to check for updates on older data.
But of course this won't apply to all tables.
Keeping a Log
Let's assume that I won't have access to, or do not want to use the binary MySQL logs.
For summarizing new data, the summary process just needs to remember the primary key id of the last record it summarized. Next time it runs, it can summarize everything after that id. However, to keep track of older records that have been updated or deleted, it needs another log so it can go back and re-summarize that data.
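A sketch of that bookkeeping with invented table names (the summarizing step itself would be the incremental INSERT ... ON DUPLICATE KEY UPDATE shown earlier):

```php
<?php
// High-water mark per summary table, plus a change log for late UPDATEs/DELETEs on old rows.
$pdo->exec("CREATE TABLE IF NOT EXISTS summary_state (
                summary_name VARCHAR(64)     NOT NULL PRIMARY KEY,
                last_id      BIGINT UNSIGNED NOT NULL DEFAULT 0
            ) ENGINE=InnoDB");

$pdo->exec("CREATE TABLE IF NOT EXISTS summary_changelog (
                id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                source_id   BIGINT UNSIGNED NOT NULL,                 -- PK of the changed main-table row
                change_type ENUM('update','delete') NOT NULL,
                logged_at   DATETIME NOT NULL
            ) ENGINE=InnoDB");

// Each run: read the mark, summarize rows with id > mark, advance the mark,
// then re-summarize the buckets listed in the change log and purge it.
$lastId = (int) $pdo->query("SELECT last_id FROM summary_state
                             WHERE summary_name = 'transactions_daily'")->fetchColumn();
// 1. run the incremental INSERT ... SELECT ... WHERE id > $lastId
// 2. UPDATE summary_state SET last_id = <new max id> WHERE summary_name = 'transactions_daily'
// 3. recompute the [site, day] buckets referenced by summary_changelog, then purge those rows
```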
I would appreciate any kind of strategies, suggestions or links that can help. Thank you!
As noted above, materialized views in Oracle are different from indexed views in SQL Server. They are very cool and useful. See http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repmview.htm for details.
MySQL, however, does not have built-in support for these.
One thing you mention several times is poor performance. Have you checked your database design for proper indexing and run EXPLAIN on the queries to see why they are slow? See http://dev.mysql.com/doc/refman/5.1/en/using-explain.html. This is of course assuming that your server is tuned properly and that MySQL itself is set up and tuned (buffer caches, etc.).
To your direct question: what you describe is something we do often in a data warehouse situation. We have a production database and a DW that pulls in all sorts of information, aggregates and pre-calculates it to speed up querying. This may be overkill for you, but you can decide. Depending on the latency you define for your reports, i.e. how often you need them, we normally go through an ETL (extract, transform, load) process periodically (daily, weekly, etc.) to populate the DW from the production system. This keeps the impact on the production system low and moves all reporting to another set of servers, which also lessens the load.
On the DW side, I would normally design the schemas differently, i.e. using star schemas (http://www.orafaq.com/node/2286). Star schemas have fact tables (the things you want to measure) and dimensions (the things you want to aggregate the measures by: time, geography, product categories, etc.). SQL Server also includes an additional engine called SQL Server Analysis Services (SSAS) that looks at fact tables and dimensions, pre-calculates and builds OLAP data cubes. In these data cubes you can drill down and look at all types of patterns, and do data analysis and data mining. Oracle does things slightly differently, but the outcome is the same.
Whether you want to go the above route really depends on the business need and how much value you get from data analysis. As I said, it is likely overkill if you just have a few summary tables, but some of the concepts may be helpful as you think things through. If your business is going toward a business intelligence solution, then this is something to consider.
PS: You can actually set a DW up to work in "real time" using something called ROLAP, if that is the business need. MicroStrategy has a good product that works well for this.
PPS: You may also want to look at PowerPivot from MS (http://www.powerpivot.com/learn.aspx). I have only played with it, so I cannot tell you how it works on very large datasets.
Flexviews (http://flexvie.ws) is an open source PHP/MySQL based project. Flexviews adds incrementally refreshable materialized views (like the materialized views in Oracle) to MySQL, using PHP and stored procedures.
It includes FlexCDC, a PHP based change data capture utility which reads binary logs, and the Flexviews MySQL stored procedures which are used to define and maintain the views.
Flexviews supports joins (inner join only) and aggregation, so it can be used to create summary tables. Moreover, you can use Flexviews in combination with the aggregation designer of Mondrian (a ROLAP server) to create summary tables that the ROLAP tool can automatically use.
If you don't have access to the logs (it can read them remotely, btw, so you don't need server access, but you do need SUPER privileges) then you can use 'COMPLETE' refresh with Flexviews. This automates creating a new table with CREATE TABLE ... AS SELECT under a new table name. It then uses RENAME TABLE to swap the new table for the old one, renaming the old one with an _old postfix, and finally drops the old table. The advantage here is that the SQL to create the view is stored in the database (flexviews.mview) and can be refreshed with a simple API call that automates the swapping process.
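The swap it describes is also easy to do by hand; a sketch of the general technique (not Flexviews' actual code, and all names are invented):

```php
<?php
// Full ("COMPLETE") refresh by atomic table swap.
$pdo->exec("CREATE TABLE sales_summary_new AS
            SELECT site_id, DATE(created_at) AS day,
                   COUNT(*) AS tx_count, SUM(amount) AS revenue
            FROM transactions
            GROUP BY site_id, DATE(created_at)");

// Atomic swap: readers only ever see the old or the new table, never a half-built one.
$pdo->exec("RENAME TABLE sales_summary     TO sales_summary_old,
                         sales_summary_new TO sales_summary");

$pdo->exec("DROP TABLE sales_summary_old");
```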
I am currently in a debate with a coworker about the best practices concerning the database design of a PHP web application we're creating. The application is designed for businesses, and each company that signs up will have multiple users using the application.
My design methodology is to create a new database for every company that signs up. This way everything is sandboxed, modular, and small. My coworker's philosophy is to put everyone into one database. His argument is that if we have 1000+ companies sign up, we wind up with 1000+ databases to deal with, not to mention the mess that doing business intelligence becomes.
For the sake of example, assume that the application is an order entry system. With separate databases, table size can remain manageable even if each company is doing 100+ orders a day. In a single-bucket application, tables can get very big very quickly.
Is there a best practice for this? I tried hunting around the web, but haven't had much success. Links, whitepapers, and presentations welcome.
Thanks in advance,
The1Rob
I talked to the database architect from wordpress.com, the hosting service for WordPress. He said that they started out with one database, hosting all customers together. The content of a single blog site really isn't that much, after all. It stands to reason that a single database is more manageable.
This did work well for them until they got hundreds of thousands of customers and realized that they needed to scale out, running multiple physical servers and hosting a subset of their customers on each server. When they add a server, it is easy to migrate individual customers to the new server, but much harder to separate the data within a single database that belongs to an individual customer's blog.
As customers come and go, and some customers' blogs have high-volume activity while others go stale, the rebalancing over multiple servers becomes an even more complex maintenance job. Monitoring size and activity per individual database is easier too.
Likewise, doing a database backup or restore of a single database containing terabytes of data, versus individual database backups and restores of a few megabytes each, is an important factor. Consider: a customer calls and says their data got SNAFU'd due to some bad data entry, and could you please restore the data from yesterday's backup? How would you restore one customer's data if all your customers share a single database?
Eventually they decided that splitting into a separate database per customer, though complex to manage, offered them greater flexibility and they re-architected their hosting service to this model.
So, while from a data modeling perspective it seems like the right thing to do to keep everything in a single database, some database administration tasks become easier as you pass a certain breakpoint of data volume.
I would never create a new database for each company. If you want a modular design, you can achieve it using tables and properly connected primary and foreign keys. This is where I learned about database normalization and I'm sure it will help you out here.
This is the method I would use. SQL Article
I'd have to agree with your co-worker. Relational databases are designed to handle large amounts of data, and the numbers you're talking about (1000+ companies, multiple users per company, 100+ orders/day) are well within the expected bounds. Separate databases means:
multiple database connections in each script (memory and speed penalty)
maintenance is harder (DB systems generally do not provide tools for acting on databases as a group) so schema changes, backups, and similar tasks will be more difficult
harder to run queries on data from multiple companies
If your site becomes huge, you may eventually need to distribute your data across multiple servers. Deal with that when it happens. To start out that way for performance reasons sounds like premature optimization.
I haven't personally dealt with this situation, but I would think that if you want to do business intelligence, you should aggregate the data into an offline database that you can then run any analysis you want on.
Also, keeping them in separate databases makes it easier to partition across servers (which you will likely have to do if you have 1000+ customers) without resorting to messy replication technologies.
I had a similar question a while back and came to the conclusion that a single database is drastically more manageable. Right now, we have multiple databases (around 10) and it is already becoming a pain to manage, especially when we upgrade the code: we have to migrate every single database.
The upside is that the data is segregated cleanly. Due to the sensitivity of our data, this is a good thing, but it does make it quite a bit more difficult to keep up with.
The separate database methodology has a very big advantage over the other:
+ You can break it up into smaller groups; this architecture scales much better.
+ You can set up standalone servers easily.
That depends on how likely your schemas are to change. If they ever have to change, will you be able to safely make those changes to 1000 separate databases? If a scalability problem is found with your design, how are you going to fix it for 1000 databases?
We run a SaaS (Software-as-a-Service) business with a large number of customers and have elected to keep all customers in the same database. Managing 1000's of separate databases is an operational nightmare.
You do have to be very diligent creating your data model and the business objects / reporting queries that access them. One approach you may want to consider is to carry the company ID in every table and ensure that every WHERE clause includes the company ID for the currently logged-in user. If you use a data access layer, you can enforce that condition there.
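A minimal sketch of enforcing that in a data access layer (the class, column and session names are invented; it assumes the caller's SQL already contains a WHERE clause):

```php
<?php
// Tenant-scoped query helper: the company_id of the logged-in user is appended
// to every query, so individual call sites cannot forget it.
class TenantDb
{
    private $pdo;
    private $companyId;

    public function __construct(PDO $pdo, $companyId)
    {
        $this->pdo       = $pdo;
        $this->companyId = (int) $companyId;
    }

    public function select($sqlWithWhereClause, array $params = array())
    {
        $stmt = $this->pdo->prepare($sqlWithWhereClause . " AND company_id = ?");
        $params[] = $this->companyId;
        $stmt->execute($params);
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}

// Usage: the caller writes the business condition, the layer adds the tenant condition.
// $db     = new TenantDb($pdo, $_SESSION['company_id']);
// $orders = $db->select("SELECT * FROM orders WHERE status = ?", array('open'));
```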
As you grow large, you can still vertically partition by placing groups of companies on each physical server, e.g. the first 100 companies on Server A, the next 100 companies on Server B.