MySQL table becoming large

I have a table to which approximately 100,000 rows are added every day, and I am supposed to generate reports from it using PHP. Recently the script that generates these reports has started taking too long to complete. How can I improve performance, possibly by moving from MySQL to something that is scalable in the long run?

MySQL is very scalable, that's for sure.
The key is not to switch from MySQL to another database; instead you should:
Optimize your queries. This may sound silly to some, but one huge improvement I made some time ago was changing SELECT * to select only the column(s) I needed. It's a frequent issue I see in other people's code too.
Optimize your table design (normalization, etc.).
Add indexes on the column(s) you use frequently in your queries.
Similar advice here

For generating reports or file downloads with large chunks of data, you should consider using flush and increasing the time limit and memory limit.
I doubt the problem lies in the number of rows, since MySQL can support a lot of rows. But you can of course fetch x rows at a time and process them in chunks.
I do assume your MySQL is properly tweaked for performance.
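The "fetch x rows at a time" idea above can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python with an in-memory SQLite database standing in for MySQL, and the table name report_rows and the helper iter_chunks are hypothetical. The same pattern applies with any driver that can fetch a limited number of rows per call.

```python
import sqlite3


def iter_chunks(cursor, chunk_size=1000):
    """Yield lists of rows, at most chunk_size rows at a time."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Hypothetical demo data: in-memory SQLite stands in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE report_rows (id INTEGER PRIMARY KEY, some_quantity INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO report_rows (some_quantity) VALUES (?)",
    [(i,) for i in range(2500)],
)

# Only the needed column is selected, and rows are processed chunk by chunk
# so the whole result set never has to sit in memory at once.
cursor = conn.execute("SELECT some_quantity FROM report_rows ORDER BY id")
total = 0
chunk_sizes = []
for chunk in iter_chunks(cursor, chunk_size=1000):
    chunk_sizes.append(len(chunk))
    total += sum(row[0] for row in chunk)
```

Each chunk can be written to the report output before the next one is fetched, which keeps memory use roughly constant regardless of table size.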

First analyse why (or: whether) your queries are slow: http://dev.mysql.com/doc/refman/5.1/en/using-explain.html

You should read the following and learn a little about the advantages of a well-designed InnoDB table and how best to use clustered indexes, which are only available with InnoDB!
The example includes a table with 500 million rows with query times of 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
Hope you find this of interest.

Another thought is to move records beyond a certain age to a historical database for archiving, reporting, etc. If you don't need that large volume for transactional processing it might make sense to extract them from the transactional data store.
It's common to separate transactional and reporting databases.
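A minimal sketch of moving old records out of the live table, assuming a hypothetical events table with a created_at column and an events_archive table of the same shape. SQLite is used here only as a runnable stand-in for MySQL; the copy and the delete run in one transaction so a failure cannot lose rows.

```python
import sqlite3


def archive_older_than(conn, cutoff):
    """Move rows with created_at < cutoff from events to events_archive."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO events_archive SELECT * FROM events WHERE created_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM events WHERE created_at < ?", (cutoff,)
        ).rowcount
    return moved


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (created_at TEXT NOT NULL, payload TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE events_archive (created_at TEXT NOT NULL, payload TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2010-01-15", "a"), ("2010-06-01", "b"), ("2010-09-20", "c")],
)

moved = archive_older_than(conn, "2010-07-01")
live_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
archived_count = conn.execute("SELECT COUNT(*) FROM events_archive").fetchone()[0]
```

In a real setup the archive table would typically live in a separate reporting database, and the job would run on a schedule.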

I am going to make some assumptions:
Your 100k rows added every day have timestamps which are either real-time or offset by a relatively short amount of time (hours at most); your 100k rows are added either throughout the day or in a few big batches.
The data are never updated.
You are using the InnoDB engine. (Frankly, you would be insane to use MyISAM for large tables because, in the event of a crash, the index rebuild takes a prohibitive time.)
You haven't explained what kind of reports you're trying to generate, but I'm assuming that your table looks like this:
CREATE TABLE logdata (
dateandtime some_timestamp_type NOT NULL,
property1 some_type_1 NOT NULL,
property2 some_type_2 NOT NULL,
some_quantity some_numerical_type NOT NULL,
... some other columns not required for reports ...
... some indexes ...
);
And that your reports look like
SELECT count(*), SUM(some_quantity), property1 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property1;
SELECT count(*), SUM(some_quantity), property2 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property2;
Now, as we can see, both of these reports are doing a scan of a large amount of the table, because you are reporting on a lot of rows.
The bigger the time range becomes the slower the reports will be. Moreover, if you have a lot of OTHER columns (say some varchars or blobs) which you aren't interested in reporting on, then they slow your report down too (because the server still needs to inspect the rows).
You can use several possible techniques for speeding this up:
Add a covering index for each type of report, to support the columns you need and omit columns you don't. This may help a lot but will slow inserts down.
Summarise data according to the dimension(s) that you want to report on. In this fictitious case, all your reports are either counting rows or SUM()ing some_quantity.
Build mirror tables (containing the same data) which have appropriate primary keys / indexes/ columns to make the reports faster.
Use a column engine (e.g. Infobright)
Summarisation is usually an attractive option if your use-case supports it.
You may wish to ask a more detailed question with an explanation of your use-case.
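The summarisation option could look roughly like the sketch below. It assumes the hypothetical logdata layout from this answer, adds a hypothetical logdata_daily summary table keyed by (day, property1), and uses SQLite as a runnable stand-in for MySQL. Because the raw data are assumed never to be updated, a day can be summarised once it is complete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logdata (
    dateandtime   TEXT    NOT NULL,
    property1     TEXT    NOT NULL,
    some_quantity INTEGER NOT NULL
);
CREATE TABLE logdata_daily (
    day            TEXT    NOT NULL,
    property1      TEXT    NOT NULL,
    row_count      INTEGER NOT NULL,
    total_quantity INTEGER NOT NULL,
    PRIMARY KEY (day, property1)
);
""")
conn.executemany("INSERT INTO logdata VALUES (?, ?, ?)", [
    ("2010-09-01 08:00:00", "a", 5),
    ("2010-09-01 17:30:00", "a", 7),
    ("2010-09-01 09:15:00", "b", 1),
    ("2010-09-02 10:00:00", "a", 2),
])


def summarise_day(conn, day):
    """Rebuild the summary rows for one day from the raw rows."""
    with conn:
        conn.execute("DELETE FROM logdata_daily WHERE day = ?", (day,))
        conn.execute("""
            INSERT INTO logdata_daily (day, property1, row_count, total_quantity)
            SELECT date(dateandtime), property1, COUNT(*), SUM(some_quantity)
            FROM logdata
            WHERE dateandtime >= ? AND dateandtime < date(?, '+1 day')
            GROUP BY date(dateandtime), property1
        """, (day, day))


for day in ("2010-09-01", "2010-09-02"):
    summarise_day(conn, day)

# The report now reads a handful of summary rows instead of scanning logdata.
report = conn.execute("""
    SELECT property1, SUM(row_count), SUM(total_quantity)
    FROM logdata_daily
    WHERE day BETWEEN '2010-09-01' AND '2010-09-02'
    GROUP BY property1
    ORDER BY property1
""").fetchall()
```

With 100k raw rows per day, the summary table holds only one row per day per distinct property1 value, so range reports stay fast as the raw table grows.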

The time limit can be temporarily turned off for a particular script, if you know it may run over the limit, by calling set_time_limit(0); at the start of the script.
Other considerations such as indexing or archiving very old data to a different table should also be looked at.

Your best bet is something like MongoDB or CouchDB, both of which are non-relational databases oriented toward storing massive amounts of data. This is assuming that you've already tweaked your MySQL installation for performance and that your situation wouldn't benefit from parallelization.

Related

MySQL for selecting MAXIMUM differences of two columns

I have a table with following columns:
ItemCode VARCHAR
PriceA DECIMAL(10,4)
PriceB DECIMAL(10,4)
The table has around 1,000 rows.
My requirement is to check the difference (PriceA-PriceB) for each row and then display top 50 items that have maximum price differences.
There are two ways I can implement this
1) Trust that SQL calculation is non-complex, easy and fast, so I run the following query:
SELECT ItemCode, (PriceA - PriceB) AS PDiff FROM testtable ORDER BY PDiff DESC LIMIT 50
and second,
2) Add one more column (called PriceDiff), which will store the difference (PriceA-PriceB).
However, these will have to be inserted manually and need extra space. But it can simply run the MAX(PriceDiff) select query for top 50.
My question is: in terms of speed and efficiency for a web application (displaying results on a website/app), which of the above methods is better?
I have attempted to generate time consumed for each query, but both are reporting similar figures so unable to make any inferences.
Any explanation by the experts, or any fine-tuning of code, will be really appreciated.
Thanks

In general, to improve performance you always have to make a tradeoff between memory and time. Caching results will improve speed, however takes more memory. You can reduce memory usage by calculating stuff on the fly at the expense of performance.
In your case, storing an additional 1000+ values in the DB is a matter of a few extra KB. Calculating the diff on the fly will have a negligible impact on performance. Either option is absolute peanuts to any DB and server.
I would stick with doing calculations on the fly as that is less complex and keeps the db normalized.
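As a rough illustration of the on-the-fly approach, the sketch below runs the query from the question against a tiny sample table. SQLite stands in for MySQL, REAL stands in for DECIMAL(10,4), and the helper name top_differences and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE testtable ("
    "ItemCode TEXT PRIMARY KEY, PriceA REAL NOT NULL, PriceB REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO testtable VALUES (?, ?, ?)",
    [("A1", 10.0, 4.0), ("B2", 8.5, 8.0), ("C3", 20.0, 5.0)],
)


def top_differences(conn, limit=50):
    """Return (ItemCode, PriceA - PriceB) pairs, largest difference first."""
    return conn.execute(
        "SELECT ItemCode, PriceA - PriceB AS PDiff "
        "FROM testtable ORDER BY PDiff DESC LIMIT ?",
        (limit,),
    ).fetchall()


top = top_differences(conn, limit=2)
```

No extra column has to be kept in sync, so the stored prices remain the single source of truth.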

The first method is fastest, but is prone to error, as was mentioned.
May I suggest another solution, using a primary key. You could then set the value of the new column to what you are trying to figure from within the web application.
Then, when wanting to know the top 50, you could use your original method of finding the top 50, using your second method, where you would select from the table which stores the differences.
These links explain primary keys and how to use them:
http://www.mysqltutorial.org/mysql-primary-key/
https://www.w3schools.com/sql/sql_primarykey.asp

Performance - order by in MySQL or in PHP

I know this has been asked before at least in this thread:
is php sort better than mysql "order by"?
However, I'm still not sure about the right option here since the performance on doing the sorting on PHP side is almost 40 times faster.
This MySQL query runs in about 350-400ms
SELECT
keywords as id,
SUM(impressions) as impressions,
SUM(clicks) as clicks,
SUM(conversions) as conversions,
SUM(not_ctr) as not_ctr,
SUM(revenue) as revenue,
SUM(cost) as cost
FROM visits WHERE campaign_id = 104 GROUP BY keywords DESC
The keywords (an integer column) and campaign_id columns are indexed.
The table has about 150k rows and the query returns around 1500 rows in total.
The results are then recalculated (we calculate click through rates, conversion rates, ROI etc, as well as the totals for the whole result set). The calculations are done in PHP.
Now my idea was to store the results with PHP APC for quick retrieval, however we need to be able to order these results by the columns as well as the calculated values, therefore if I wanted to order by click-through rate I'd have to use
(SUM(clicks) / (SUM(impressions) - SUM(not_ctr))) within the query, which makes it around 40ms slower, and the initial 400ms is a really long time already.
In addition we paginate these results, but adding LIMIT 0,200 doesn't really affect the performance.
While testing the APC approach I executed the query, did the additional calculations and stored the array in memory so it would only be executed once during the initial request and that worked like a charm. Fetching and sorting the array from memory only took around 10ms, however the script memory usage was about 25mb. Maybe it's worth loading the results into a memory table and then querying that table directly?
This is all done on my local machine(i7, 8gb ram) which has the default MySQL install and the production server is a 512MB box on Rackspace on which I haven't tested yet, so if possible ignore the server setup.
So the real question is: Is it worth using memory tables or should I just use the PHP sorting and ignore the RAM usage since I can always upgrade the RAM? What other options would you consider in optimizing the performance?

In general, you want to do sorting on the database server and not in the application. One good reason is that the database should be implementing parallel sorts and it has access to indexes. A general rule may not be applicable in all circumstances.
I'm wondering if your indexes are helping you. I would recommend that you try the query:
With no indexes
With an index only on campaign_id
With both indexes
Indexes are not always useful. One particularly important factor is called "selectivity". If you only have two campaigns in the table, then you are probably better off doing a full-table scan rather than indirectly searching through an index. This becomes particularly important when the table does not fit into memory (resulting in a condition where every row requires loading a page into cache).
Finally, if this is going to be an application that expands beyond your single server, be careful. What is optimal on a single machine may not be optimal in a different environment.
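For comparison, the application-side approach described in the question (cache the aggregated rows once, then sort and paginate them in memory) can be sketched like this. It is written in Python rather than PHP, and the function names and sample rows are hypothetical; the click-through rate formula is the one given in the question.

```python
def click_through_rate(row):
    """clicks / (impressions - not_ctr), or 0.0 when the denominator is zero."""
    denominator = row["impressions"] - row["not_ctr"]
    return row["clicks"] / denominator if denominator else 0.0


def sorted_page(rows, key, page, per_page=200, descending=True):
    """Sort the cached rows by key and return one page of them."""
    ordered = sorted(rows, key=key, reverse=descending)
    start = page * per_page
    return ordered[start:start + per_page]


# Stand-in for the ~1500 aggregated rows that would be fetched once and cached.
cached_rows = [
    {"id": 1, "impressions": 1000, "not_ctr": 0, "clicks": 10},
    {"id": 2, "impressions": 500, "not_ctr": 100, "clicks": 40},
    {"id": 3, "impressions": 200, "not_ctr": 200, "clicks": 0},
]

first_page = sorted_page(cached_rows, click_through_rate, page=0, per_page=2)
first_page_ids = [row["id"] for row in first_page]
```

Sorting around 1500 small rows in memory is cheap; whether this beats sorting in the database depends on the environment, as the answer above cautions.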

Is naming tables september_2010 acceptable and efficient for large data sets dependent on time?

I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.
Some members of my team suggest creating tables using months as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Any suggestions on how to deal with this amount of data? Thanks.
Thank you for all the feedback.

I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:
You need to remember to create a new table every year or else your app breaks.
Querying aggregates against all rows regardless of year is harder.
Updating a date potentially means moving a row from one table to another.
It's harder to guarantee the uniqueness of pseudokeys across multiple tables.
My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).

Seems like it should be just fine holding everything in one table. It will make retrieval much easier in the future to maintain 1 table, as opposed to 12 tables per year. At 73,200 records per day it will take you almost 4 years to hit 100,000,000, which is still well within MySQL's capabilities.

Absolutely not.
It will ruin relationship between tables.
Table relations being built based on field values, not table names.
Especially for this very table that will grow by just 300Mb/year

So in 100 days you have 7.3M rows, about 25M a year or so. 25M rows isn't a lot anymore. MySQL can handle tables with millions of rows. It really depends on your hardware, your query types and your query frequency.
But you should be able to partition that table (if MySQL supports partitioning), what you're describing is an old SQL Server method of partition. After building those monthly tables you'd build a view that concatenates them together to look like one big table... which is essentially what partitioning does but it's all under-the-covers and fully optimized.

Usually this creates more trouble than it's worth: it's more maintenance, your queries need more logic, and it's painful to pull data from more than one period.
We store 200+ million time-based records in one (MyISAM) table, and queries are still blazingly fast.
You just need to ensure there's an index on your time/date column and that your queries make use of the index (e.g. a query that messes around with DATE_FORMAT or similar on a date column will likely not use an index). I wouldn't put them in separate tables just for the sake of retrieval performance.
One thing that gets very painful with such a large number of records is deleting old data, which can take a long time (10 minutes to 2 hours for e.g. wiping a month's worth of data in tables with hundreds of millions of rows). For that reason we've partitioned the tables, and use a time_dimension (see e.g. the time_dimension table a bit down here) relation table for managing the periods instead of simple date/datetime columns or strings/varchars representing dates.

Some members of my team suggest creating tables using months as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Don't listen to them. You're already storing a date stamp, what about different months makes it a good idea to split the data that way? The engine will handle the larger data sets just fine, so splitting by month does nothing but artificially segregate the data.

My first reaction is: Aaaaaaaaahhhhhhhhh!!!!!!
Table names should not embed data values. You don't say what the data means, but supposing for the sake of argument it is, I don't know, temperature readings. Just imagine trying to write a query to find all the months in which average temperature increased over the previous month. You'd have to loop through table names. Worse yet, imagine trying to find all 30-day periods -- i.e. periods that might cross month boundaries -- where temperature increased over the previous 30-day period.
Indeed, just retrieving an old record would go from a trivial operation, "select * where id=whatever", to a complex one requiring the program to generate table names from the date on the fly. If you didn't know the date, you would have to scan through all the tables, searching each one for the desired record. Yuck.
With all the data in one properly-normalized table, queries like the above are pretty trivial. With separate tables for each month, they're a nightmare.
Just make the date part of the index and the performance penalty of having all the records in one table should be very small. If the size of the table really becomes a performance problem, I could dimly comprehend making one table for archive data with all the old stuff and one for current data with everything you retrieve regularly. But don't create hundreds of tables. Most database engines have ways to partition your data across multiple drives using "table spaces" or the like. Use the sophisticated features of the database if necessary, rather than hacking together a crude simulation.
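To make the "pretty trivial" claim concrete, here is a sketch of the month-over-month question against a single table with an indexed date column. The readings table, its columns and the sample values are hypothetical (the original post does not say what the data means), and SQLite is used as a runnable stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER PRIMARY KEY, reading_date TEXT NOT NULL, value INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_readings_date ON readings (reading_date)")
conn.executemany(
    "INSERT INTO readings (reading_date, value) VALUES (?, ?)",
    [
        ("2010-07-10", 20), ("2010-07-20", 22),
        ("2010-08-05", 25), ("2010-08-25", 27),
        ("2010-09-03", 18),
    ],
)

# One query covers every month; no table names are generated from dates.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', reading_date) AS month, AVG(value)
    FROM readings
    GROUP BY month
    ORDER BY month
""").fetchall()

# Months whose average is higher than the previous month's average.
increased = [
    month
    for (_, prev_avg), (month, avg) in zip(monthly, monthly[1:])
    if avg > prev_avg
]
```

With one table per month, the same question would need a query per table plus code to stitch the results together.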

Depends on what searches you'll need to do. If normally constrained by date, splitting is good.
If you do split, consider naming the tables like foo_2010_09 so the tables will sort alphanumerically.

What is your DB platform?
In SQL Server 2K5+ you can partition on date.
My bad, I didn't notice the tag. @thetaiko is right though: this is well within MySQL's capabilities.

I would say it depends on how the data is used. If most queries are done over the complete data, it would be an overhead to always join the tables back together again.
If most of the time you only need a part of the data (by date), it is a good idea to segment the tables into smaller pieces.
For the naming I would use tablename_yyyymm.
Edit: You should then also think about another layer between the DB and your app to handle the segmented tables depending on a given date, which can get pretty complicated.

I'd suggest dropping the year and just having one table per month, named after the month. Archive your data annually by renaming all the tables $MONTH_$YEAR and re-creating the month tables. Or, since you're storing a timestamp with your data, just keep appending to the same tables. I assume by virtue of the fact that you're asking the question in the first place, that segregating your data by month fits your reporting requirements. If not, then I'd recommend keeping it all in one table and periodically archiving off historical records when performance gets to be an issue.

I agree with this idea complicating your database needlessly. Use a single table. As others have pointed out, it's not nearly enough data to warrant extraneous handling. Unless you use SQLite, your database will handle it well.
However, it also depends on how you want to access it. If the old entries are really only there for archival purposes, then the archive pattern is an option. It's common for versioning systems to have the infrequently used data separated out. In your case you'd only want everything older than 1 year to move out of the main table. And this is strictly a database administration task, not an application behavior. The application would only join the current list and the _archive list, if at all. Again, this highly depends on the use case. Are the old entries generally needed? Is there too much data to process regularly?

PHP's in_array vs. MySQL SELECT

I need to check if some integer value is already in my database (which is growing all the time). And it should be done several thousand times in one script. I'm considering two alternatives:
Read all those numbers from MySQL database into PHP array and every time I need to check it, use in_array function.
Every time I need to check the number, just execute something like SELECT number FROM table WHERE number='#' LIMIT 1
On the one hand, searching in an array which is stored in RAM should be faster than querying MySQL every time (as I have mentioned, these checks are performed about a thousand times during one script execution). On the other hand, the DB is growing, and that array may become quite big, which may slow things down.
Question is - which way is faster or better by some other aspects?

I have to agree that #2 is your best choice. When performing a query with a LIMIT 1 MySQL stops the query when it finds the first match. Make sure the columns you intend to search by are indexed.

It sounds like you are duplicating a Unique Constraint in code...
CREATE TABLE MyTable(
SomeUniqueValue INT NOT NULL,
CONSTRAINT MyUniqueKey UNIQUE (SomeUniqueValue));
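If the goal of the check is to avoid inserting duplicates, the constraint can do the checking. The sketch below is one way to lean on it, using SQLite as a stand-in for MySQL; the helper insert_if_new is hypothetical, and the exception class would differ with a MySQL driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    "SomeUniqueValue INTEGER NOT NULL, "
    "CONSTRAINT MyUniqueKey UNIQUE (SomeUniqueValue))"
)


def insert_if_new(conn, value):
    """Return True if value was inserted, False if it already existed."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO MyTable (SomeUniqueValue) VALUES (?)", (value,)
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected the duplicate; no separate SELECT needed.
        return False


results = [insert_if_new(conn, v) for v in (7, 42, 7)]
```

This also avoids the race between a separate existence check and the insert that follows it.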

How does the number of times you need to check compare with the number of values stored in the database? If it's 1:100 then you're probably better off searching in the database each time; if it's (some amount) less, then preloading the list will be faster. What happened when you tested it?
However even if the ratio is low enough for it to be faster loading the full table, this will gobble up memory and could, as a result, make everything else run more slowly.
So I would recommend not loading it all into memory. But if you can, then batch the checks up to minimise the number of round trips to the database.
C.
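Batching the checks, as suggested above, might look like the following sketch. It assumes a hypothetical numbers table with an indexed number column and uses SQLite in place of MySQL; the helper find_existing and the batch size are illustrative choices, not a tuned recommendation.

```python
import sqlite3


def find_existing(conn, candidates, batch_size=500):
    """Return the subset of candidates already stored in numbers.number."""
    candidates = list(candidates)
    existing = set()
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        # One round trip per batch instead of one per value.
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT number FROM numbers WHERE number IN ({placeholders})",
            batch,
        )
        existing.update(row[0] for row in rows)
    return existing


# Hypothetical data: the even numbers below 2000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (number INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO numbers VALUES (?)", [(n,) for n in range(0, 2000, 2)]
)

found = find_existing(conn, range(995, 1005), batch_size=4)
```

Only the matching values come back, so memory use depends on the number of checks rather than on the size of the table.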

Querying the database is the best option. Firstly, you said the database is growing, which means new values are being added to the table, whereas with in_array you would be reading old values. Secondly, you might exhaust the RAM allotted to PHP with a very large amount of data. Thirdly, MySQL has its own query optimizer and other optimizations, which make it a far better choice than PHP.

Query Caching in MySQL

I am building a fairly large statistics system, which needs to allow users to request statistics for a given set of filters (e.g. a date range).
e.g. This is a simple query that returns 10 results, including the player_id and amount of kills each player has made:
SELECT player_id, SUM(kills) as kills
FROM `player_cache`
GROUP BY player_id
ORDER BY kills DESC
LIMIT 10
OFFSET 30
The above query will offset the results by 30 (i.e. the 4th 'page' of results, at 10 per page). When the user then selects the 'next' page, it will use OFFSET 40 instead of 30.
My problem is that nothing is cached, even though the LIMIT/OFFSET pair are being used on the same dataset, it is performing the SUM() all over again, just to offset the results by 10 more.
The above example is a simplified version of a much bigger query which just returns more fields, and takes a very long time (20+ seconds, and will only get longer as the system grows).
So I am essentially looking for a solution to speed up the page load, by caching the state before the LIMIT/OFFSET is applied.

You can of course use caching, but I would recommend caching the result, not the query in MySQL.
But first things first, make sure that a) you have the proper indexing on your data, b) that it's being used.
If this does not work, as group by tends to be slow with large datasets, you need to put the summary data in a static table/file/database.
There are several techniques/libraries etc that help you perform server side caching of your data. PHP Caching to Speed up Dynamically Generated Sites offers a pretty simple but self explanatory example of this.
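Caching the result before LIMIT/OFFSET is applied can be sketched as below. This is an in-process illustration in Python under stated assumptions: ResultCache, expensive_ranking and page are hypothetical names, the ranking data is synthetic, and a real deployment would more likely use a shared cache (APC, memcached or similar) so that all requests benefit.

```python
import time


class ResultCache:
    """Tiny in-process cache with a time-to-live, keyed by the filter set."""

    def __init__(self, ttl_seconds=300):
        self.ttl_seconds = ttl_seconds
        self._entries = {}

    def get_or_compute(self, key, compute):
        entry = self._entries.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        value = compute()
        self._entries[key] = (now, value)
        return value


calls = []


def expensive_ranking():
    # Stands in for the slow GROUP BY ... ORDER BY query, run WITHOUT LIMIT/OFFSET.
    calls.append(1)
    kills = {player_id: (player_id * 37) % 101 for player_id in range(1, 61)}
    return sorted(kills.items(), key=lambda item: item[1], reverse=True)


def page(cache, filters, offset, limit=10):
    """Return one page by slicing the cached, fully ordered result."""
    ranking = cache.get_or_compute(filters, expensive_ranking)
    return ranking[offset:offset + limit]


cache = ResultCache()
page_a = page(cache, ("all-time",), offset=30)
page_b = page(cache, ("all-time",), offset=40)
```

The expensive aggregation runs once per filter set and time-to-live window; moving to the next page only slices the cached list.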

Have you considered periodically running your long query and storing all the results in a summary table? The summary table can be quickly queried because there are no JOINs and no GROUPings. The downside is that the summary table is not up-to-the-minute current.
I realize this doesn't address the LIMIT/OFFSET issue, but it does fix the issue of running a difficult query multiple times.

Depending on how often the data is updated, data-warehousing is a straightforward solution to this. Basically you:
Build a second database (the data warehouse) with a similar table structure
Optimise the data warehouse database for getting your data out in the shape you want it
Periodically (e.g. overnight each day) copy the data from your live database to the data warehouse
Make the page get its data from the data warehouse.
There are different optimisation techniques you can use, but it's worth looking into:
Removing fields which you don't need to report on
Adding extra indexes to existing tables
Adding new tables/views which summarise the data in the shape you need it.
