How to approach multi-million data selection


I have a table that stores specific updates for all customers.
Some sample table:
record_id | customer_id | unit_id | time_stamp | data1 | data2 | data3 | data4 | more
When I created the application, I did not realize how much this table would grow -- currently I have over 10 million records after 1 month. I am running into problems where PHP stops executing because the queries take too long. Some queries return top-1 results based on time_stamp + customer_id + unit_id.
How would you suggest handling this type of issue? For example, I could create a new table for each customer, although I don't think that is a good solution.
I am stuck with no good solution in mind.

If you're on the cloud (where you're charged for moving data between server and db), ignore this answer.
Move all logic to the server.
The fastest query is a SELECT with a WHERE on the PRIMARY key. It won't matter much how large your table is; it will come back about as fast as it would from a table of 1 row (as long as your hardware isn't unbalanced).
I can't tell exactly what you're doing with your query, but first download all of the sorting and limiting data into PHP. Once you've got what you need, SELECT the data directly with a WHERE on record_id (I assume that's your PRIMARY key).
It looks like your on-demand data is pretty computationally intensive and huge, so I recommend using a faster language. http://blog.famzah.net/2010/07/01/cpp-vs-python-vs-perl-vs-php-performance-benchmark/
Also, when you start sorting and limiting on the server rather than the db, you can start identifying shortcuts to speed it up even further.
This is what the server's for.

I suggest you partition your data according to some criterion.
You can partition your data horizontally or vertically.
For example, group your customers into 10 partitions using customer_id modulo 10.
So a customer_id ending in 0 goes to partition 0, one ending in 1 goes to partition 1, and so on.
MySQL can do this for you easily.
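As a rough sketch of the modulo idea above: the helper below shows which partition a given customer_id would land in, and the DDL string shows how MySQL's HASH partitioning could express the same rule. The table name `customer_updates` and the function name are assumptions for illustration, not something from the question.

```php
<?php
// Hypothetical helper: which of $partitions buckets a customer lands in.
// MySQL's HASH partitioning computes the same thing: MOD(customer_id, N).
function partitionFor(int $customerId, int $partitions = 10): int
{
    return $customerId % $partitions;
}

// Table name `customer_updates` is assumed; the column comes from the question.
// MySQL requires every unique key (including the primary key) to contain the
// partitioning column, so the key may need to become (record_id, customer_id).
$ddl = 'ALTER TABLE customer_updates PARTITION BY HASH (customer_id) PARTITIONS 10';

foreach ([120, 4711, 35] as $id) {
    echo $id, ' -> partition ', partitionFor($id), PHP_EOL;
}
```

Partitioning only pays off when queries filter on customer_id, so that MySQL can skip the other partitions.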

What is the count of records within the tables? Often, with relational databases, it's not how much data you have (millions are nothing to relational databases), it's how you're retrieving it.
From the look of your select, in fact, you probably just need to optimize the statement itself and avoid the multiple subselects, which are probably the main cause of the slowdown. Try running EXPLAIN on that statement, or just get the ids first and then run the interior select individually on the ids of the records that you've actually found and retrieved in the first run.
Just the fact that you have those subselects within your overall statement means that you haven't optimized that far into the process anyway. For example, you could be running a nightly or hourly cron job that aggregates into a new table the sets like the one created by SELECT gps_unit.idgps_unit, and then you can run your selects against a previously generated table instead of creating blocks of data that are equivalent of a table on the fly.
If you find yourself unable to effectively optimize that select statement, you have "final" options like:
Categorize via some criteria and split into different tables.
Keep a deep archive, such that anything past the first year or so is migrated to a less used table and requires special retrieval.
Finally, if you have so much small data, you may be able to completely archive certain tables and keep them around in file form only and then truncate past a certain date. Often with web tracking data that isn't that important and is kinda spammy, I end up doing this after a few years, when the data is really not going to do anyone any good any more.
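For the top-1 lookups by customer_id, unit_id and time_stamp that the question describes, one common way to optimize the statement is a composite index that matches the WHERE and ORDER BY. The sketch below is only an illustration: it uses an in-memory SQLite database so it runs without a server, the table name `customer_updates` and the sample rows are assumptions, and on MySQL time_stamp would normally be a DATETIME or TIMESTAMP column.

```php
<?php
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// `customer_updates` is an assumed table name; columns follow the question.
$pdo->exec('CREATE TABLE customer_updates (
    record_id   INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    unit_id     INTEGER NOT NULL,
    time_stamp  TEXT    NOT NULL,
    data1       TEXT
)');

// Equality columns first, the sort column last.
$pdo->exec('CREATE INDEX idx_customer_unit_time
            ON customer_updates (customer_id, unit_id, time_stamp)');

$pdo->exec("INSERT INTO customer_updates (customer_id, unit_id, time_stamp, data1) VALUES
    (1, 7, '2016-01-01 10:00:00', 'old'),
    (1, 7, '2016-01-02 10:00:00', 'new'),
    (1, 8, '2016-01-03 10:00:00', 'other unit'),
    (2, 7, '2016-01-04 10:00:00', 'other customer')");

// Returns the most recent row for one customer/unit pair, or null if none.
function latestUpdate(PDO $pdo, int $customerId, int $unitId): ?array
{
    $stmt = $pdo->prepare('SELECT record_id, time_stamp, data1
                             FROM customer_updates
                            WHERE customer_id = ? AND unit_id = ?
                            ORDER BY time_stamp DESC
                            LIMIT 1');
    $stmt->execute([$customerId, $unitId]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

$latest = latestUpdate($pdo, 1, 7);
echo $latest['data1'], PHP_EOL;
```

With such an index the database can jump straight to the newest entry for the pair instead of sorting all matching rows; running EXPLAIN on the real query would confirm whether the index is actually used.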

Related

PHP: Filtering and export large amount of data from MySQL database

I have a very large database table (more than 700k records) that I need to export to a .csv file. Before exporting it, I need to check some options (provided by the user via GUI) and filter the records. Unfortunately this filtering cannot be done in SQL (for example, a column contains serialized data, so I need to unserialize it and then check if the record "passes" the filtering rules).
Doing all records at once leads to memory limit issues, so I decided to break the process into chunks of 50k records. So instead of loading 700k records at once, I load 50k records, apply the filters, save to the .csv file, then load another 50k records and so on (until it reaches the 700k records). This way I avoid the memory issue, but it takes around 3 minutes (and this time will increase if the number of records increases).
Is there any other way of doing this process (better in terms of time) without changing the database structure?
Thanks in advance!

The best thing one can do is to get PHP out of the mix as much as possible. Always the case for loading CSV, or exporting it.
In the below, I have a 26 Million row student table. I will export 200K rows of it. Granted, the column count is small in the student table. Mostly for testing other things I do with campus info for students. But you will get the idea I hope. The issue will be how long it takes for your:
... and then check if the record "passes" the filtering rules.
which naturally could occur via the db engine in theory without PHP. Without PHP should be the mantra. But that is yet to be determined. The point is, get PHP processing out of the equation. PHP is many things. An adequate partner in DB processing it is not.
select count(*) from students;
-- 26.2 million
select * from students limit 1;
+----+-------+-------+
| id | thing | camId |
+----+-------+-------+
| 1 | 1 | 14 |
+----+-------+-------+
drop table if exists xOnesToExport;
create table xOnesToExport
( id int not null
);
insert xOnesToExport (id) select id from students where id>1000000 limit 200000;
-- 200K rows, 5.1 seconds
alter table xOnesToExport ADD PRIMARY KEY(id);
-- 4.2 seconds
SELECT s.id,s.thing,s.camId INTO OUTFILE 'outStudents_20160720_0100.txt'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM students s
join xOnesToExport x
on x.id=s.id;
-- 1.1 seconds
The above 1AM timestamped file with 200K rows was exported as a CSV via the join. It took 1 second.
LOAD DATA INFILE and SELECT INTO OUTFILE are companion functions that, for one thing, cannot be beat for speed short of raw table moves. Secondly, people rarely seem to use the latter. They are flexible too if one looks into all they can do with use cases and tricks.
For Linux, use LINES TERMINATED BY '\n' ... I am on a Windows machine at the moment with the code blocks above. The only differences tend to be with paths to the file, and the line terminator.

Unless you tell it to do otherwise, php slurps your entire result set at once into RAM. It's called a buffered query. It stops working once your result set is too big to fit within php's memory limit, as you have discovered.
php's designers made it use buffered queries to make life simpler for web site developers who need to read a few rows of data and display them.
You need an unbuffered query to do what you're doing. Your php program will read and process one row at a time. But be careful to make your program read all the rows of that unbuffered result set; you can really foul things up if you leave a partial result set dangling in limbo between MySQL and your php program.
You didn't say whether you're using mysqli or PDO. Both of them offer mode settings to make your queries unbuffered. If you're using the old-skool mysql_ interface, it had mysql_unbuffered_query(), but that extension was removed in PHP 7, so you're better off switching.
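A minimal sketch of the row-at-a-time approach described above. The column names (`id`, `name`, `options`), the table name `big_table` and the filter rule are hypothetical; a small array stands in for the unbuffered result set so the example runs on its own, and the comments show where a PDO or mysqli unbuffered result could be plugged in instead.

```php
<?php
// Hypothetical filter for the serialized column described in the question:
// keep a row only if its unserialized options contain the wanted flag.
function passesFilter(array $row, string $wantedFlag): bool
{
    $options = @unserialize($row['options'], ['allowed_classes' => false]);
    return is_array($options) && in_array($wantedFlag, $options, true);
}

// Streams rows one at a time into an open CSV handle; returns rows written.
function exportFiltered(iterable $rows, $out, string $wantedFlag): int
{
    $written = 0;
    foreach ($rows as $row) {
        if (passesFilter($row, $wantedFlag)) {
            fputcsv($out, [$row['id'], $row['name']], ',', '"', '');
            $written++;
        }
    }
    return $written;
}

// With PDO and MySQL, $rows could be an unbuffered statement (not run here):
//   $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
//   $rows = $pdo->query('SELECT id, name, options FROM big_table');
// With mysqli: $rows = $mysqli->query($sql, MYSQLI_USE_RESULT);
$rows = [
    ['id' => 1, 'name' => 'a', 'options' => serialize(['export'])],
    ['id' => 2, 'name' => 'b', 'options' => serialize(['hidden'])],
    ['id' => 3, 'name' => 'c', 'options' => serialize(['export', 'vip'])],
];

$out = fopen('php://memory', 'w+');
$count = exportFiltered($rows, $out, 'export');
rewind($out);
$csv = stream_get_contents($out);
fclose($out);
echo $csv;
```

Because the loop always runs to the end of `$rows`, it also consumes the whole result set, which is what the warning about dangling partial results asks for.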

Which is faster: MySQL queries or PHP switch statements?

I was wondering if anyone could point me in the right direction on which is faster. I have a situation in a game I'm creating where there are over 630,000 combinations, and I want to know whether having my script query the database for one result would be quicker than, say, a large switch statement. The game I'm creating is (I'm hoping) supposed to be a lightweight hit and I don't want any problems.
<?php
// Is this quicker?
mysql_query(....); // remember, this table would have anywhere from 600,000 to 630,000 rows
// Or is this quicker?
switch (...) {
    case ...:
        ....
    case ...:
        ....
    // all of this on one page, with anywhere from 600,000 to 630,000 different cases
}
?>

PHP will take ages to parse the page, which will probably make it slower regardless of execution time, which may or may not be slower as well (though I'd expect the query to be faster here too). You may also consider an associative array instead of the switch, if a plain lookup can do what the query does, but it won't make parsing much faster. And think of memory consumption too.
And you can just try.
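To make the associative-array alternative concrete, here is a small sketch. The combination strings and outcomes are made up; the real game would have around 630,000 entries, and, as noted above, PHP would still have to parse and hold all of them in memory.

```php
<?php
// Hypothetical combinations and outcomes.
$outcomes = [
    'ABCDE' => 'island',
    'FGHIJ' => 'house',
];

// PHP arrays are hash tables, so this is a single keyed lookup
// rather than a comparison against one case after another.
function outcomeFor(array $outcomes, string $combination): ?string
{
    return $outcomes[$combination] ?? null;
}

echo outcomeFor($outcomes, 'FGHIJ'), PHP_EOL;
var_dump(outcomeFor($outcomes, 'ZZZZZ'));
```

An indexed database table gives the same kind of keyed lookup without loading every combination on each request.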

I would almost certainly go with the MySQL query, especially if the table(s) are indexed correctly. The switch method will be very hard to maintain - for example, what if someone other than you needed to add a new combination to the switch?

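Older PHP versions don't use jump tables to optimize case selections, so the switch would be the equivalent of a gigantic if-elseif-else... statement. (PHP 7.2 and later can compile a switch over integer or string literals into a jump table, but a file with 630,000 cases still has to be parsed and held in memory.)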
It's better to use queries to properly indexed tables for this.

Ok, just read that, take it or leave it (don't know what your game looks like).
If that is a high-traffic (many users) game/website, I would split the data into at least two tables, one holding just the ID, some GROUP information and what you call COMBINATION. All other additional data would then be in the second table, accessible via a JOIN on that ID.
table 1
ID | GROUP | COMBINATION
1 | island | ABCDE
2 | house | FGHIJ
table2
ID | MORE INFO
1 | ...
2 | ...
Also I would (if possible) split those GROUPs into table chunks.
-- ok, this is an example for ID and a range of IDs, but I think you can get it
partition by range (id)
(
PARTITION P1 VALUES LESS THAN (10),
PARTITION P2 VALUES LESS THAN (20)
)
Logical Splitting:
- No need to create separate tables
- no need to move chunks of data across files
Main reasons to use partitions:
- to make single inserts and selects faster
- to make range selects faster
- to help split the data across different paths
- to store historical data efficiently
- if you need to delete large chunks of data instantly

To serialize or to keep a separate table?

This question has risen on many different occasions for me but it's hard to explain without giving a specific example. So here goes:
Let's imagine for a while that we are creating an issue tracker database in PHP/MySQL. There is a "tasks" table. Now you need to keep track of the people who are associated with a particular task (have commented or what not). These persons will get an email when a task changes.
There are two ways to solve this type of situation. One is to create a separate table task_participants:
CREATE TABLE IF NOT EXISTS `task_participants` (
`task_id` int(10) unsigned NOT NULL,
`person_id` int(10) unsigned NOT NULL,
UNIQUE KEY `task_id_person_id` (`task_id`,`person_id`)
);
And to query this table with SELECT person_id FROM task_participants WHERE task_id='XXX'.
If there are 5000 tasks and each task has 4 participants on average (the reporter, the subject for whom the task brought benefit, the solver and one commenter), then the task_participants table would have 5000*4 = 20 000 rows.
There is also another way: create a field in the tasks table and store a serialized array (JSON or PHP serialize()) of person_ids. Then there would be no need for this 20 000 row table.
What are your comments, which way would you go?

Go with the multiple records. It promotes database normalization. Normalization is very important. Updating a serialized value is no fun to maintain. With multiple records I can let the database do the work with INSERT, UPDATE and DELETE. Also, you are limiting your future joins by having a multivalued column.

Definitely do the cross reference table (the first option you listed). Why?
First of all, do not worry about the size of the cross reference table. Relational databases would have been out on their ear decades ago if they could not handle the scale of a simple cross reference table. Stop worrying about 20K or 200K records, etc. In fact, if you're going to worry about something like this, it's better to start worrying about why you've chosen a relational DB instead of a key-value DB. After that, and only when it actually starts to be a problem, then you can start worrying about adding an index or other tuning techniques.
Second, if you serialize the association info, you're probably opaque-ifying a whole dimension of your data that only your specialized JSON-enabled app can query. Serialization of data into a single cell in a table typically only makes sense if the embedded structure (a) does not contain data you would ever need to query outside your app, (b) is not something you need to query the internals of efficiently (e.g., avg count(*) of people with tasks), and (c) is just something you either do not have time to model out properly or is in a prototypical state. I say "probably" above because it's not usually the case that data worth persisting fits these criteria.
Finally, by serializing your data, you are now forced to solve any computation on that serialized data in your code, which is just a big waste of time that you could have spent doing something more productive. Your database already can slice and dice that data any way you need, yet because your data is not in a format it understands, you need to now do that in your code. And now imagine what happens when you change the serialized data structure in V2.
I won't say there aren't use cases for serializing data (I've done it myself), but based on your case above, this probably isn't one of them.

There are a couple of great answers already, but they explain things in rather theoretical terms. Here's my (essentially identical) answer, in plain English:
1) 20k records is nothing to MySQL. If it gets up into the 20 million record range, then you might want to start getting concerned - but it still probably won't be an issue.
2) OK, let's assume you've gone with concatenating all the people involved with a ticket into a single field. Now... Quick! Tell me how many tickets Alice has touched! I have a feeling that Bob is screwing things up and Charlie is covering for him - can you get me a list of tickets that they both worked on, divided up by who touched them last?
With a separate table, MySQL itself can find answers to all kinds of questions about who worked on what tickets and it can find them fast. With everything crammed into a single field, you pretty much have to resort to using LIKE queries to find the (potentially) relevant records, then post-process the query results to extract the important data and summarize it yourself.
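The two questions above (how many tickets Alice has touched, and which tickets Bob and Charlie both worked on) can be sketched as plain SQL against the task_participants table. This is only an illustration: in-memory SQLite is used so it runs without a server, and the person ids (Alice = 1, Bob = 2, Charlie = 3) and sample rows are invented.

```php
<?php
// In-memory SQLite is used only so the sketch runs without a server;
// the queries themselves are plain SQL that MySQL also accepts.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE task_participants (
    task_id   INTEGER NOT NULL,
    person_id INTEGER NOT NULL,
    UNIQUE (task_id, person_id)
)');

// Hypothetical ids: Alice = 1, Bob = 2, Charlie = 3.
$insert = $pdo->prepare('INSERT INTO task_participants (task_id, person_id) VALUES (?, ?)');
foreach ([[10, 1], [10, 2], [11, 2], [11, 3], [12, 1], [12, 2], [12, 3]] as $pair) {
    $insert->execute($pair);
}

// How many tasks has Alice touched?
$stmt = $pdo->prepare('SELECT COUNT(*) FROM task_participants WHERE person_id = ?');
$stmt->execute([1]);
$aliceCount = (int) $stmt->fetchColumn();

// Which tasks did Bob and Charlie both work on?
$stmt = $pdo->prepare(
    'SELECT b.task_id
       FROM task_participants b
       JOIN task_participants c ON c.task_id = b.task_id
      WHERE b.person_id = ? AND c.person_id = ?
      ORDER BY b.task_id'
);
$stmt->execute([2, 3]);
$shared = array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));

echo "Alice: $aliceCount tasks; Bob+Charlie: ", implode(', ', $shared), PHP_EOL;
```

With a serialized column, both answers would need every task row to be fetched and decoded in application code.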

Is naming tables september_2010 acceptable and efficient for large data sets dependent on time?

I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.
Some members of my team suggest creating tables using month's as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Any suggestions on how to deal with this amount of data? Thanks.
Thank you for all the feedback.

I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:
You need to remember to create a new table every year or else your app breaks.
Querying aggregates against all rows regardless of year is harder.
Updating a date potentially means moving a row from one table to another.
It's harder to guarantee the uniqueness of pseudokeys across multiple tables.
My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).

Seems like it should be just fine holding everything in one table. Maintaining 1 table, as opposed to 12 tables per year, will make retrieval much easier in the future. At 73,200 records per day it will take you almost 4 years to hit 100,000,000, which is still well within MySQL's capabilities.
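The growth estimate above can be checked with a few lines of arithmetic. The only input is the 73,200 records per day from the question; a 365-day year is assumed.

```php
<?php
$perDay = 73200;                                  // from the question
$perYear = $perDay * 365;                         // rows added per year
$daysTo100M = (int) ceil(100000000 / $perDay);    // days until 100 million rows
$yearsTo100M = round($daysTo100M / 365, 2);

echo number_format($perYear), ' rows per year', PHP_EOL;
echo $daysTo100M, ' days (about ', $yearsTo100M, ' years) to reach 100,000,000 rows', PHP_EOL;
```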

Absolutely not.
It will ruin the relationships between tables.
Table relations are built on field values, not table names.
Especially for this very table, which will grow by just 300 MB/year.

So in 100 days you have 7.3M rows, about 25M a year or so. 25M rows isn't a lot anymore. MySQL can handle tables with millions of rows. It really depends on your hardware, your query types and your query frequency.
But you should be able to partition that table (if MySQL supports partitioning), what you're describing is an old SQL Server method of partition. After building those monthly tables you'd build a view that concatenates them together to look like one big table... which is essentially what partitioning does but it's all under-the-covers and fully optimized.

Usually this creates more trouble than it's worth: it's more maintenance, your queries need more logic, and it's painful to pull data from more than one period.
We store 200+ million time-based records in one (MyISAM) table, and queries are still blazingly fast.
You just need to ensure there's an index on your time/date column and that your queries make use of the index (e.g. a query that messes around with DATE_FORMAT or similar on a date column will likely not use an index). I wouldn't put them in separate tables just for the sake of retrieval performance.
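A short sketch of the index-friendly way to filter by month: compute the month's boundaries in PHP and compare the bare column against them, instead of wrapping the column in DATE_FORMAT. The table and column names (`readings`, `recorded_at`, `value`) and the helper name are assumptions.

```php
<?php
// Returns [start, end) bounds for a month given as 'YYYY-MM'.
function monthBounds(string $yearMonth): array
{
    $start = new DateTimeImmutable($yearMonth . '-01 00:00:00', new DateTimeZone('UTC'));
    $end = $start->modify('+1 month');
    return [$start->format('Y-m-d H:i:s'), $end->format('Y-m-d H:i:s')];
}

// Table and column names are assumptions for illustration.
// Likely to skip the index: WHERE DATE_FORMAT(recorded_at, '%Y-%m') = '2010-09'
// Can use the index:        a half-open range on the bare column.
$sql = 'SELECT id, recorded_at, value FROM readings
         WHERE recorded_at >= ? AND recorded_at < ?';

[$from, $to] = monthBounds('2010-09');
echo $from, ' .. ', $to, PHP_EOL;
// With a PDO connection: $stmt = $pdo->prepare($sql); $stmt->execute([$from, $to]);
```

The half-open range (>= start, < next month) avoids having to know how many days the month has.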
One thing that gets very painful with such a large number of records is deleting old data, which can take a long time (10 minutes to 2 hours for e.g. wiping a month's worth of data in tables with hundreds of millions of rows). For that reason we've partitioned the tables, and use a time_dimension relation table (see e.g. the time_dimension table a bit down here) for managing the periods instead of simple date/datetime columns or strings/varchars representing dates.

Some members of my team suggest creating tables using month's as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Don't listen to them. You're already storing a date stamp, what about different months makes it a good idea to split the data that way? The engine will handle the larger data sets just fine, so splitting by month does nothing but artificially segregate the data.

My first reaction is: Aaaaaaaaahhhhhhhhh!!!!!!
Table names should not embed data values. You don't say what the data means, but supposing for the sake of argument it is, I don't know, temperature readings. Just imagine trying to write a query to find all the months in which average temperature increased over the previous month. You'd have to loop through table names. Worse yet, imagine trying to find all 30-day periods -- i.e. periods that might cross month boundaries -- where temperature increased over the previous 30-day period.
Indeed, just retrieving an old record would go from a trivial operation -- "select * where id=whatever" -- to a complex operation requiring you to have the program generate table names from the date on the fly. If you didn't know the date, you would have to scan through all the tables, searching each one for the desired record. Yuck.
With all the data in one properly-normalized table, queries like the above are pretty trivial. With separate tables for each month, they're a nightmare.
Just make the date part of the index and the performance penalty of having all the records in one table should be very small. If the size of the table really becomes a performance problem, I could dimly comprehend making one table for archive data with all the old stuff and one for current data with everything you retrieve regularly. But don't create hundreds of tables. Most database engines have ways to partition your data across multiple drives using "table spaces" or the like. Use the sophisticated features of the database if necessary, rather than hacking together a crude simulation.

Depends on what searches you'll need to do. If normally constrained by date, splitting is good.
If you do split, consider naming the tables like foo_2010_09 so the tables will sort alphanumerically.

what is your DB platform?
In SQL Server 2K5+ you can partition on date.
My bad, I didn't notice the tag. @thetaiko is right though, and this is well within MySQL's capabilities to deal with.

I would say it depends on how the data is used. If most queries run over the complete data, always joining the tables back together again would be an overhead.
If you mostly need only a part of the data (by date), it is a good idea to segment the tables into smaller pieces.
For the naming I would use tablename_yyyymm.
Edit: For sure you should then also think about another layer between the DB and your app to handle the segmented tables depending on a given date, which can then get pretty complicated.

I'd suggest dropping the year and just having one table per month, named after the month. Archive your data annually by renaming all the tables $MONTH_$YEAR and re-creating the month tables. Or, since you're storing a timestamp with your data, just keep appending to the same tables. I assume by virtue of the fact that you're asking the question in the first place, that segregating your data by month fits your reporting requirements. If not, then I'd recommend keeping it all in one table and periodically archiving off historical records when performance gets to be an issue.

I agree that this idea complicates your database needlessly. Use a single table. As others have pointed out, it's not nearly enough data to warrant extraneous handling. Unless you use SQLite, your database will handle it well.
However it also depends on how you want to access it. If the old entries are really only there for archival purposes, then the archive pattern is an option. It's common for versioning systems to have the infrequently used data separated out. In your case you'd only want everything older than 1 year to move out of the main table. And this is strictly a database administration task, not an application behavior. The application would only join the current list and the _archive list, if at all. Again, this highly depends on the use case. Are the old entries generally needed? Is there too much data to process regularly?

Query Caching in MySQL

I am building a fairly large statistics system, which needs to allow users to request statistics for a given set of filters (e.g. a date range).
e.g. This is a simple query that returns 10 results, including the player_id and amount of kills each player has made:
SELECT player_id, SUM(kills) as kills
FROM `player_cache`
GROUP BY player_id
ORDER BY kills DESC
LIMIT 10
OFFSET 30
The above query will offset the results by 30 (i.e. the 4th 'page' of results, since pages of 10 start at OFFSET 0). When the user then selects the 'next' page, it will then use OFFSET 40 instead of 30.
My problem is that nothing is cached, even though the LIMIT/OFFSET pair are being used on the same dataset, it is performing the SUM() all over again, just to offset the results by 10 more.
The above example is a simplified version of a much bigger query which just returns more fields, and takes a very long time (20+ seconds, and will only get longer as the system grows).
So I am essentially looking for a solution to speed up the page load, by caching the state before the LIMIT/OFFSET is applied.

You can of course use caching, but I would recommend caching the result, not the query in MySQL.
But first things first, make sure that a) you have the proper indexing on your data, b) that it's being used.
If this does not work, as group by tends to be slow with large datasets, you need to put the summary data in a static table/file/database.
There are several techniques/libraries etc that help you perform server side caching of your data. PHP Caching to Speed up Dynamically Generated Sites offers a pretty simple but self explanatory example of this.

Have you considered periodically running your long query and storing all the results in a summary table? The summary table can be quickly queried because there are no JOINs and no GROUPings. The downside is that the summary table is not up-to-the-minute current.
I realize this doesn't address the LIMIT/OFFSET issue, but it does fix the issue of running a difficult query multiple times.
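A minimal sketch of the summary-table idea, under stated assumptions: in-memory SQLite stands in for MySQL so the example is self-contained, `player_kills_summary` is a hypothetical table name, and the sample rows are invented. The expensive GROUP BY runs once when the summary is rebuilt (for example from a scheduled job), and each page view then only reads pre-aggregated rows.

```php
<?php
// SQLite in memory stands in for MySQL so the sketch is self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE player_cache (player_id INTEGER, kills INTEGER)');
$pdo->exec('INSERT INTO player_cache VALUES (1, 5), (2, 9), (1, 7), (3, 2), (2, 1)');

// Run periodically: do the expensive GROUP BY once.
// `player_kills_summary` is a hypothetical table name.
function rebuildSummary(PDO $pdo): void
{
    $pdo->exec('DROP TABLE IF EXISTS player_kills_summary');
    $pdo->exec('CREATE TABLE player_kills_summary AS
                SELECT player_id, SUM(kills) AS kills
                  FROM player_cache
                 GROUP BY player_id');
}

// Per page view: a cheap read with no aggregation. Pages start at 1.
function topPlayers(PDO $pdo, int $page, int $perPage): array
{
    $stmt = $pdo->prepare('SELECT player_id, kills FROM player_kills_summary
                            ORDER BY kills DESC, player_id LIMIT ? OFFSET ?');
    $stmt->bindValue(1, $perPage, PDO::PARAM_INT);
    $stmt->bindValue(2, ($page - 1) * $perPage, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

rebuildSummary($pdo);
$page1 = topPlayers($pdo, 1, 2);
$page2 = topPlayers($pdo, 2, 2);
echo count($page1), ' rows on page 1, ', count($page2), ' on page 2', PHP_EOL;
```

In production one would more likely build the new summary under a temporary name and swap it in, and add an index on the kills column, so readers never see a missing or half-built table.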

Depending on how often the data is updated, data-warehousing is a straightforward solution to this. Basically you:
Build a second database (the data warehouse) with a similar table structure
Optimise the data warehouse database for getting your data out in the shape you want it
Periodically (e.g. overnight each day) copy the data from your live database to the data warehouse
Make the page get its data from the data warehouse.
There are different optimisation techniques you can use, but it's worth looking into:
Removing fields which you don't need to report on
Adding extra indexes to existing tables
Adding new tables/views which summarise the data in the shape you need it.
