I have been tasked to write a complicated invoicing and billing system with reports. The structure of the database is such that returning the invoices' totals and figures requires a complicated string of joins and calculations that is cumbersome to troubleshoot and highly sensitive to error.
Rather than produce a library of long and complicated queries throughout my reporting system, I opted to use a temporary table that will perform the necessary calculations and hold the figures next to a record ID number that I can join into when and where needed. This made coding the system much easier, but it also made the database noticeably slower as every page hit that needs to use this temp table must populate it first.
In summary, I've got a reporting system that must be able to display calculated figures from a series of several invoice tables. Can anyone offer advice on an approach or solution that would simplify this process without costing me server speed?
Hard to judge as your description is too generic.
One possible solution: if your reports can tolerate being slightly out of date, you could create a regular table instead of a temporary one and populate it with all the info using an event scheduler as often as needed (e.g. once a day or once an hour).
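A minimal sketch of that idea, assuming hypothetical invoices, invoice_lines and invoice_totals tables (none of these names come from the question) and using Python's built-in sqlite3 as a stand-in for MySQL. In MySQL the same INSERT ... SELECT could be run by a scheduled event or a cron job instead of on every page hit.

```python
import sqlite3

def rebuild_invoice_totals(conn: sqlite3.Connection) -> None:
    """Recompute every invoice total into a regular (non-temporary) table."""
    with conn:  # one transaction, so readers never see a half-built table
        conn.execute("DELETE FROM invoice_totals")
        conn.execute(
            """
            INSERT INTO invoice_totals (invoice_id, total)
            SELECT i.id, COALESCE(SUM(l.quantity * l.unit_price), 0)
            FROM invoices AS i
            LEFT JOIN invoice_lines AS l ON l.invoice_id = i.id
            GROUP BY i.id
            """
        )

# Hypothetical schema and data; prices are stored as integer cents.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE invoice_lines (invoice_id INTEGER, quantity INTEGER, unit_price INTEGER);
    CREATE TABLE invoice_totals (invoice_id INTEGER PRIMARY KEY, total INTEGER);
    INSERT INTO invoices VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO invoice_lines VALUES (1, 2, 500), (1, 1, 250), (2, 3, 100);
    """
)
rebuild_invoice_totals(conn)
rows = conn.execute(
    "SELECT invoice_id, total FROM invoice_totals ORDER BY invoice_id"
).fetchall()
print(rows)
```

Report pages then join against invoice_totals by invoice ID and never pay for the calculation themselves.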
I'm developing a PHP dashboard with statistics from a lot of different MySql tables. There is need for many cumulatives and other totals for building charts etcetera. Some queries can be very simple, others join and compare many tables at once.
To do this properly I am considering 3 approaches;
1) Creating a master table which is constantly updated after each front-end (website) and back-end (CRM) interaction and contains only totals for different purposes. So the data is easily accessible with a simple select statement for building the main dashboard.
2) Using many and/or complex queries each time the dashboard is generated. This will take less developing time but more loading time. Maybe there is a better way to manage the order and execution of each query.
3) Creating cron jobs for updating the totals in the background. This is my least favourite approach because it feels outdated for multiple reasons.
Could someone advise me and explain what the best approach would be for the long term?
Thanks in advance.
Fred.
1st of all, there is no "correct approach", only the one that fits your needs.
In fitting your needs 2 things must be considered:
Your business requirement (real time, daily updated etc)
Scalability of the code maintenance
As per your case, I would combine the options.
1st of all, I would create a dedicated database for the dashboard, which is good for performance and for saving historical data (which might change if you take your 2nd approach).
Under that condition, the question whether to update the dashboard online or via a cron job: very dependent on the business need. I think a cron job is better
1st of all, it's scalable - you can ditch it in the future, and update only the dashboard...
2nd of all, you can time it to run during "slow hours" preventing overloading your production servers. Different cron jobs can update different tables in a different frequency etc.
This is, of course, my opinion. Hope it helps.
I have trouble understanding specific things and I will share my experiments:
From what I learned at my previous jobs, where the directions from the CEO included: all MySQL queries must include the parameters I need and query them all at once to get accurate data. Am I right? This does a perfect job if I have, say, an e-commerce store where I load 10 items per query. My conclusion is that this works great for pages where I don't need to load a lot of data all at once, no matter how big my database is. (Considered: good practice.)
I have a website where I have to return a (long, big, heavy, hell of a) CSV report. I wrote all the queries required with INNER JOINs -
BUT this time my database includes millions of rows, and my report loops 50,000 times (once for each customer I have) and uses INNER JOINs to gather data out of 4~6 tables which include millions of rows each, and in some cases even does some calculations. The whole system is, of course, OOP, so each user is a single object (which is a query by itself).
So my code has a lot of small queries to the database requesting data for each users, and while looping, having around 4~6 big-INNER JOIN queries. This took a few minutes to run.
I thought it doesn't make sense so I decided to experiment with it.
I decided to separate everything, and not to treat everything as an "object" but rather get all the data at once from each table, without any joins, and manage it via PHP. So I got all users with the relevant data into one $users array, then got more data from table A, then from table B, organized table B, got data from table C, made calculations on table C, etc. Then I looped again, matching my data into one final array, and output it to a CSV.
AND THIS TOOK LESS THAN 1 minute to run!
Instead of eating memory out of my database, it did affect the CPU. and after re-writing my code for more efficiency - it took less than 30 secs. and didn't have any CPU bumps.
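To make the "fetch everything in bulk, then match in application code" idea concrete, here is a minimal sketch. It is written in Python rather than PHP, and the users/orders shapes are invented for illustration; the point is only that one pass builds a lookup keyed by user ID, so the final loop does a dictionary lookup instead of a query per user.

```python
from collections import defaultdict

def build_report(users, orders):
    """Join two bulk result sets in memory instead of querying per user."""
    totals = defaultdict(int)
    for user_id, amount in orders:  # one pass over the bulk order rows
        totals[user_id] += amount
    # One pass over users; the dict lookup replaces a per-user query.
    return [(user_id, name, totals.get(user_id, 0)) for user_id, name in users]

# Hypothetical rows, as they might come back from two plain SELECTs.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 40), (2, 15), (1, 10)]
report = build_report(users, orders)
print(report)
```

The trade-off is memory in the application: every row of both result sets is held at once, which is fine for a report run but worth measuring.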
If all my code is based on OOP and this way of "direct" scripting works faster: Is that ok to continue with it for specific big heavy outputs? (in terms of "good or bad" practice).
PS: I would have used summary tables but that's not what the CEO wants for now.
PS2: Tables are indexed properly.
Regarding your CEO's advice on parameterizing the queries, he's correct that it will be faster. Adding params will let the server plan the query and cache the plan for future use, saving execution time. It also helps prevent SQL injection attacks by properly handling any ill-intentioned user text. Doing everything in a single query is best because the overhead of connecting and issuing the query can add significant time to the query.
It surprises me that you got such a significant performance increase with the manual coding. You may want to analyze your DB structure to make sure there's not a more efficient way to organize or query the data.
However, writing custom code to do the data construction is a valid approach as long as you are willing to take on the additional responsibility of maintaining it. I assume there was a significantly larger investment of time to write and debug that code than there was in writing the query. There will probably be an equal amount of time necessary to maintain it when requirements change.
((Opinion))
Usually it is possible to improve the query enough to make it faster than shoveling data back and forth between the database and the application.
Better indexes (especially 'composite' indexes)
Reformulating the SELECT
Normalizing, but not overnormalizing
Building and maintaining "Summary" tables - In some situations this speeds up the query 10-fold.
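As a small illustration of the composite index point, here is a sketch using Python's built-in sqlite3 as a stand-in for MySQL (the orders table and index name are made up). It creates an index on (customer_id, placed_on) and asks the database for its query plan; in MySQL you would use EXPLAIN in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, placed_on TEXT, amount INTEGER)"
)
# Composite index: equality column first, range column second.
conn.execute(
    "CREATE INDEX idx_orders_customer_date ON orders (customer_id, placed_on)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT amount FROM orders WHERE customer_id = ? AND placed_on >= ?",
    (42, "2024-01-01"),
).fetchall()
# The last column of each plan row is the human-readable detail text.
plan_text = " ".join(row[-1] for row in plan)
uses_index = "idx_orders_customer_date" in plan_text
print(plan_text)
```

If the plan mentions the index, the filter on both columns is served by one index search rather than a scan of the whole table.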
I am new to PHP (and CakePHP & web development in general) and am currently trying to build a parking management application.
In brief, I have a table that is indexed with the Parking space ID and it contains the current occupancy number i.e. how many slots are in use.
When a car enters the parking lot, I want to increment this number and vice versa when the car leaves.
This is one of the 5-7 things I have to do when a car enters the parking lot.
The issue is I am thinking (maybe too early to think this?) that updating the table(writing to db) at every entry/exit would be too inefficient.
A trivial C++ implementation would be to maintain this as in-memory two dimensional array and register signal handlers to write the array to the db if and when the binary dies.
Is this possible in CakePHP? Is my efficiency concern even legitimate at this early stage of implementation?
Please advise.
Generally it is recommended to first make it work, then make it right, and, finally, make it fast.
For step 3 I would suggest looking at the following factors:
What operations need to be fast (and what does fast mean in this context)? (not only SQL, more a method or service)
Am I meeting the given performance values? (e.g. Request Times, CPU Load...)
How can I optimize the behaviour where I need to? (Caching, SQL Optimization, configuration etc)
Which is the most efficient and effective way to achieve this? Does it pay off (costs vs. profit)?
Technically speaking I would consider the following:
The increment/decrement should be atomic or in a transaction to avoid dirty reads
Consider that indexes (generally) reduce speed of insert (and might on update)
Consider that an update searches for the row first (hopefully by index) and then changes the row
Maybe it is faster to write to a "log-like" table, depending on the number of SELECTs
Try to avoid recursive requests in CakePHP from the beginning
A redis cache can easily be set up together with CakePHP and might speed up your application
For CakePHP optimization in general mark wrote a good blog post on this issue.
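A minimal sketch of the atomic increment/decrement mentioned above, using Python's built-in sqlite3 as a stand-in for MySQL. The parking_spaces table and its capacity column are assumptions for illustration; the question only mentions an ID and an occupancy number. The key point is that the arithmetic happens inside a single UPDATE, so two cars arriving at once cannot overwrite each other's count.

```python
import sqlite3

def open_lot(capacity):
    """Create a hypothetical one-row parking table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parking_spaces ("
        "id INTEGER PRIMARY KEY, occupancy INTEGER NOT NULL, capacity INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO parking_spaces VALUES (1, 0, ?)", (capacity,))
    conn.commit()
    return conn

def car_enters(conn, space_id):
    """Increment in one statement; no read-then-write in application code."""
    with conn:
        cur = conn.execute(
            "UPDATE parking_spaces SET occupancy = occupancy + 1 "
            "WHERE id = ? AND occupancy < capacity",
            (space_id,),
        )
    return cur.rowcount == 1  # False: the lot is full or the id is unknown

def car_leaves(conn, space_id):
    """Decrement in one statement, never going below zero."""
    with conn:
        cur = conn.execute(
            "UPDATE parking_spaces SET occupancy = occupancy - 1 "
            "WHERE id = ? AND occupancy > 0",
            (space_id,),
        )
    return cur.rowcount == 1

conn = open_lot(capacity=2)
print([car_enters(conn, 1) for _ in range(3)])
```

A single indexed UPDATE per entry or exit is cheap, so measuring it before adding an in-memory layer is probably the better first step.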
Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has its own table that is created when that Boss signs up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talks of multiple servers, and databases linked together... but again, let's focus on a single server here with a single MySQL DB.
If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition only one server (the master) can deal with update (or adding workorder) SQL requests. This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem if they are not overwhelmed by queries, because they can keep most of the database on hard disk and only keep what was accessed recently in cache (in memory).
The other important thing to prevent any single query from running slowly is to make sure you add the right index for each query you might perform to avoid linear searches. This is to allow the database to binary search for the record(s) required.
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
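A counter cache can be sketched like this, again with Python's built-in sqlite3 standing in for MySQL. The bosses.workorder_count column is a hypothetical name; the idea is that the insert and the counter update commit together, so reading the count never needs a COUNT(*) over the big table.

```python
import sqlite3

def open_db():
    """Create hypothetical bosses/workorders tables for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE bosses (id INTEGER PRIMARY KEY, workorder_count INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE workorders (id INTEGER PRIMARY KEY, boss_id INTEGER, description TEXT);
        CREATE INDEX idx_workorders_boss ON workorders (boss_id);
        INSERT INTO bosses (id) VALUES (1), (2);
        """
    )
    return conn

def add_workorder(conn, boss_id, description):
    """Insert a work order and bump the cached count in one transaction."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO workorders (boss_id, description) VALUES (?, ?)",
            (boss_id, description),
        )
        conn.execute(
            "UPDATE bosses SET workorder_count = workorder_count + 1 WHERE id = ?",
            (boss_id,),
        )

conn = open_db()
for boss_id in (1, 1, 2):
    add_workorder(conn, boss_id, "replace filter")
counts = conn.execute("SELECT id, workorder_count FROM bosses ORDER BY id").fetchall()
print(counts)
```

Deletes would need the matching decrement, and an occasional full recount is a cheap way to check the cache has not drifted.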
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there are some straightforward rules:
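The same "what if" arithmetic can be done in a few lines of code instead of a spreadsheet. This sketch uses the figures from the question (1,000 bosses, 10 workers each, 5 entries per worker per work day); the 21 work days per month and 250 per year are my assumptions.

```python
def projected_rows(bosses, workers_per_boss, entries_per_worker_per_day, work_days):
    """Rows added to the work orders table over a number of work days."""
    return bosses * workers_per_boss * entries_per_worker_per_day * work_days

per_day = projected_rows(1000, 10, 5, 1)
per_month = projected_rows(1000, 10, 5, 21)   # assuming about 21 work days a month
per_year = projected_rows(1000, 10, 5, 250)   # assuming about 250 work days a year
print(per_day, per_month, per_year)
```

Changing one input, for example doubling the number of bosses, shows straight away how the upper end moves.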
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There are many variables that may work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance, either InnoDB or MyISAM.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
I had a database size problem as well in one of my networks: the table was so big that it used to slow the server down whenever I ran a query on it.
In my opinion, divide your database by date. Decide what table size would be too big for you, let's say 1 million entries, then calculate how long it will take you to reach that amount. Then have a script run at that interval to either create a new table named with the date and move all current data over, or just back the table up and empty it.
It's like putting outdated material in archives.
If you choose the first option, you'll be able to access the data for that date easily by referring to that table.
Hope that idea helps
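A minimal sketch of the archiving step, with Python's built-in sqlite3 standing in for MySQL and hypothetical workorders / workorders_archive tables that share the same columns. Rows older than a cutoff date are copied to the archive and deleted from the working table in one transaction.

```python
import sqlite3

def open_db():
    """Create a hypothetical working table and a same-shaped archive table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE workorders (id INTEGER PRIMARY KEY, boss_id INTEGER, created_on TEXT);
        CREATE TABLE workorders_archive (id INTEGER PRIMARY KEY, boss_id INTEGER, created_on TEXT);
        INSERT INTO workorders VALUES
            (1, 7, '2023-11-30'), (2, 7, '2023-12-15'), (3, 8, '2024-01-02');
        """
    )
    return conn

def archive_before(conn, cutoff):
    """Move rows older than `cutoff` (ISO date string) into the archive table."""
    with conn:  # copy and delete commit together
        conn.execute(
            "INSERT INTO workorders_archive "
            "SELECT * FROM workorders WHERE created_on < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM workorders WHERE created_on < ?", (cutoff,))
    return cur.rowcount  # number of rows moved out of the working table

conn = open_db()
moved = archive_before(conn, "2024-01-01")
print(moved)
```

Run on a schedule, this keeps the working table small while older data stays queryable in the archive table.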
Just create a workers table, bosses table, a relationships table for the two, and then all of your other tables. With a relationship structure like this, it's very dynamic. Because, if it ever got large enough you could create another relationship table between the work orders to the bosses or to the workers.
You might want to look into bigints, but I doubt you'll need that. I know that the relationships table will get massive, but that's good db design.
For reference, a MySQL BIGINT ranges from -9223372036854775808 to 9223372036854775807 when signed, and from 0 to 18446744073709551615 when UNSIGNED.
I am developing a project at work for which I need to create and maintain Summary Tables for performance reasons. I believe the correct term for this is Materialized Views.
I have 2 main reasons to do this:
Denormalization
I normalized the tables as much as possible. So there are situations where I would have to join many tables to pull data. We work with MySQL Cluster, which has pretty poor performance when it comes to JOIN's.
So I need to create Denormalized Tables that can run faster SELECT's.
Summarize Data
For example, I have a Transactions table with a few million records. The transactions come from different websites. The application needs to generate a report that will display the daily or monthly transaction counts, and total revenue amounts per website. I don't want the report script to calculate this every time, so I need to generate a Summary Table that will have a breakdown by [site,date].
That is just one simple example. There are many different kinds of summary tables I need to generate and maintain.
In the past I have done these things by writing several cron scripts to keep each summary table updated. But in this new project, I am hoping to implement a more elegant and proper solution.
I would prefer a PHP based solution, as I am not a server administrator, and I feel the most comfortable when I can control everything through my application code.
Solutions that I have considered:
Copying VIEW's
If the resulting table can be represented as a single SELECT query, I can generate a VIEW. Since they are slow, there can be a cronjob that copies this VIEW into a real table.
However, some of these SELECT queries can be so slow that it's not acceptable even for cronjobs. It is not very efficient to recreate the whole summary data, if older rows are not even being updated much.
Custom Cronjobs for each Summary Table
This is the solution I have used before, but now I am trying to avoid it if possible. If there will be many summary tables, it can be messy to maintain.
MySQL Triggers
It is possible to add triggers to the main tables so that every time there is an INSERT, UPDATE or DELETE, the summary tables get updated accordingly.
There would be no cronjobs and the summaries would be in real time. However if there is ever a need to rebuild a summary table from scratch, it would have to be done with another solution (probably #1 above).
Using ORM Hooks/Triggers
I am using Doctrine as my ORM. There is a way to add event listeners that will trigger stuff on INSERT/UPDATE/DELETE, which in turn can update the summary tables. In a sense this solution is similar to #3 above, but I will have better control over these triggers since they will be implemented in PHP.
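For the database trigger option in the list above, a rough sketch of how insert and delete triggers can keep a [site, date] summary in step with the main table. It uses Python's built-in sqlite3, so the trigger syntax is SQLite's rather than MySQL's, and the transactions / site_daily names and columns are invented for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE transactions (id INTEGER PRIMARY KEY, site TEXT, day TEXT, amount INTEGER);
CREATE TABLE site_daily (
    site TEXT, day TEXT, tx_count INTEGER, revenue INTEGER,
    PRIMARY KEY (site, day)
);
-- On insert: make sure the summary row exists, then add to it.
CREATE TRIGGER trg_tx_insert AFTER INSERT ON transactions
BEGIN
    INSERT OR IGNORE INTO site_daily VALUES (NEW.site, NEW.day, 0, 0);
    UPDATE site_daily
       SET tx_count = tx_count + 1, revenue = revenue + NEW.amount
     WHERE site = NEW.site AND day = NEW.day;
END;
-- On delete: subtract the removed transaction from its summary row.
CREATE TRIGGER trg_tx_delete AFTER DELETE ON transactions
BEGIN
    UPDATE site_daily
       SET tx_count = tx_count - 1, revenue = revenue - OLD.amount
     WHERE site = OLD.site AND day = OLD.day;
END;
"""

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

conn = open_db()
with conn:
    conn.executemany(
        "INSERT INTO transactions (site, day, amount) VALUES (?, ?, ?)",
        [("a.example", "2024-01-01", 500),
         ("a.example", "2024-01-01", 250),
         ("b.example", "2024-01-01", 100)],
    )
    conn.execute("DELETE FROM transactions WHERE id = 2")
summary = conn.execute("SELECT * FROM site_daily ORDER BY site").fetchall()
print(summary)
```

An UPDATE trigger would follow the same pattern (subtract OLD, add NEW). The same bookkeeping could live in ORM event listeners instead, at the cost of missing any writes that bypass the ORM.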
Implementation Considerations:
Complete Rebuilds
I want to avoid having to rebuild the summary tables, for efficiency, and only update for new data. But in case something goes wrong, I need the capability to rebuild the summary table from scratch using existing data on the main tables.
Ignoring UPDATE/DELETE on Old Data
Some summaries can assume that older records will never be updated or deleted, but only new records will be inserted. The summary process can save a lot of work by making the assumption that it doesn't need to check for updates on older data.
But of course this won't apply to all tables.
Keeping a Log
Let's assume that I won't have access to, or do not want to use the binary MySQL logs.
For summarizing new data, the summary process just needs to remember the last primary key id's for the last records it summarized. Next time it runs, it can summarize everything after that id. However, to keep track of older records that have been updated/deleted, it needs another log so it can go back and re-summarize that data.
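The "remember the last primary key" idea for new rows might look roughly like this. It is a sketch with Python's built-in sqlite3 standing in for MySQL; summary_state, site_daily and the column names are hypothetical, and it deliberately covers only newly inserted rows, not the updated/deleted case that needs the extra log.

```python
import sqlite3

SCHEMA = """
CREATE TABLE transactions (id INTEGER PRIMARY KEY, site TEXT, day TEXT, amount INTEGER);
CREATE TABLE site_daily (
    site TEXT, day TEXT, tx_count INTEGER, revenue INTEGER,
    PRIMARY KEY (site, day)
);
CREATE TABLE summary_state (name TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
INSERT INTO summary_state VALUES ('site_daily', 0);
"""

def summarize_new(conn):
    """Fold transactions newer than the stored last id into site_daily."""
    with conn:  # summary rows and the new last id commit together
        (last_id,) = conn.execute(
            "SELECT last_id FROM summary_state WHERE name = 'site_daily'"
        ).fetchone()
        (max_id,) = conn.execute("SELECT MAX(id) FROM transactions").fetchone()
        if max_id is None or max_id <= last_id:
            return 0  # nothing new since the last run
        groups = conn.execute(
            "SELECT site, day, COUNT(*), SUM(amount) FROM transactions "
            "WHERE id > ? AND id <= ? GROUP BY site, day",
            (last_id, max_id),
        ).fetchall()
        for site, day, count, revenue in groups:
            conn.execute("INSERT OR IGNORE INTO site_daily VALUES (?, ?, 0, 0)", (site, day))
            conn.execute(
                "UPDATE site_daily SET tx_count = tx_count + ?, revenue = revenue + ? "
                "WHERE site = ? AND day = ?",
                (count, revenue, site, day),
            )
        conn.execute(
            "UPDATE summary_state SET last_id = ? WHERE name = 'site_daily'", (max_id,)
        )
    return sum(count for _, _, count, _ in groups)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO transactions (site, day, amount) VALUES (?, ?, ?)",
    [("a.example", "2024-01-01", 500), ("a.example", "2024-01-01", 250)],
)
conn.commit()
first_run = summarize_new(conn)
print(first_run)
```

Each run only touches rows after the stored id, so the cost depends on how much new data arrived rather than on the size of the whole table.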
I would appreciate any kind of strategies, suggestions or links that can help. Thank you!
As noted above materialized views in Oracle are different than indexed views in SQL Server. They are very cool and useful. See http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repmview.htm for details
MySql does not have support for these however.
One thing you mention several times is poor performance. Have you checked your database design for proper indexing and run explain plans on the queries to see why they are slow? See here http://dev.mysql.com/doc/refman/5.1/en/using-explain.html. This is of course assuming that your server is tuned properly, you have mysql setup and tuned, e.g. buffer caches, etc. etc. etc.
To your direct question: what it sounds like you want to do is something we do often in a data warehouse situation. We have a production database and a DW that pulls in all sorts of information, then aggregates and pre-calculates it to speed up querying. This may be overkill for you but you can decide. Depending on the latency you define for your reports, i.e. how often you need them, we normally go through an ETL (extract, transform, load) process periodically (daily, weekly, etc.) to populate the DW from the production system. This keeps impact low on the production system and moves all reporting to another set of servers, which also lessens the load. On the DW side, I would normally design my schemas differently, i.e. using star schemas (http://www.orafaq.com/node/2286). Star schemas have fact tables (things you want to measure) and dimensions (things you want to aggregate the measures by: time, geography, product categories, etc.). SQL Server also includes an additional engine called SQL Server Analysis Services (SSAS) to look at fact tables and dimensions, pre-calculate and build OLAP data cubes. In these data cubes you can drill down and look at all types of patterns, do data analysis and data mining. Oracle does things slightly differently but the outcome is the same.
Whether you want to go the above route really depends on the business need and how much value you get from data analysis. As I said, it is likely overkill if you just have a few summary tables, but you may find some of the concepts helpful as you think things through. If your business is going toward a business intelligence solution then this is something to consider.
PS You can actually set a DW up to work in "real-time" using something called ROLAP if that is the business need. Microstrategy has a good product that works well for this.
PPS You also may want to look at PowerPivot from MS (http://www.powerpivot.com/learn.aspx) I have only played with it so I cannot tell you how it works on very large datasets.
Flexviews (http://flexvie.ws) is an open source PHP/MySQL based project. Flexviews adds incrementally refreshable materialized views (like the materialized views in Oracle) to MySQL, using PHP and stored procedures.
It includes FlexCDC, a PHP based change data capture utility which reads binary logs, and the Flexviews MySQL stored procedures which are used to define and maintain the views.
Flexviews supports joins (inner join only) and aggregation so it can be used to create summary tables. Moreover, you can use Flexviews in combination with Mondrian's (a ROLAP server) aggregation designer to create summary tables that the ROLAP tool can automatically use.
If you don't have access to the logs (it can read them remotely, btw, so you don't need server access, but you do need SUPER privs) then you can use 'COMPLETE' refresh with Flexviews. This automates creating a new table with 'CREATE TABLE ... AS SELECT' under a new table name. It then uses RENAME TABLE to swap the new table for the old one, renaming the old one with an _old postfix. Finally it drops the old table. The advantage here is that the SQL to create the view is stored in the database (flexviews.mview) and can be refreshed with a simple API call which automates the swapping process.
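The build-then-swap pattern described above can be sketched by hand as follows. This is not Flexviews' actual code or API, just an illustration of the same three steps using Python's built-in sqlite3, where ALTER TABLE ... RENAME TO plays the role of MySQL's RENAME TABLE. The function and table names are hypothetical.

```python
import sqlite3

def complete_refresh(conn, name, select_sql):
    """Rebuild summary table `name` from `select_sql`, then swap it in.

    `name` and `select_sql` are interpolated into SQL, so they must come
    from trusted code, never from user input.
    """
    new, old = f"{name}_new", f"{name}_old"
    conn.commit()  # finish any pending implicit transaction first
    conn.execute(f"DROP TABLE IF EXISTS {new}")
    conn.execute(f"DROP TABLE IF EXISTS {old}")
    # Slow part: readers keep using the existing table while this runs.
    conn.execute(f"CREATE TABLE {new} AS {select_sql}")
    conn.execute("BEGIN")
    try:
        conn.execute(f"ALTER TABLE {name} RENAME TO {old}")
        conn.execute(f"ALTER TABLE {new} RENAME TO {name}")
        conn.commit()  # both renames take effect together
    except Exception:
        conn.rollback()
        raise
    conn.execute(f"DROP TABLE {old}")

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (site TEXT, amount INTEGER);
    CREATE TABLE site_totals (site TEXT, total INTEGER);
    INSERT INTO transactions VALUES ('a', 10), ('a', 20), ('b', 5);
    """
)
complete_refresh(
    conn,
    "site_totals",
    "SELECT site, SUM(amount) AS total FROM transactions GROUP BY site",
)
totals = conn.execute("SELECT site, total FROM site_totals ORDER BY site").fetchall()
print(totals)
```

The sketch assumes the summary table already exists before the first refresh; a first-run path that simply creates it would be a small addition.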