Importing and reporting on larger datasets in a MySQL cluster - php

We have a MySQL (MariaDB/Galera) cluster containing several billion unique data points in one table.
We need to migrate that table to a new one while filtering out duplicate entries, which takes a very long time, and we are constrained in that regard. The next step would be to generate reports for a given time window and the UUID of the corresponding NAS (a router in the real world / a location), as well as the unique IDs (MACs) of users that are recurrent or switch NASes.
The MySQL (MariaDB/Galera) DB is currently about 25GB in size, which should not be an issue. But the queries for reports on UIDs/MACs of users in combination with UUIDs of NASes/locations take a very long time.
The table structure is laid out as depicted here. One is the actual table and two would be a possible optimization, but I really don't know if that would do anything.
Is our DB approach the right one, or should we use a different one (DB, table structure, stack, whatever ..)? Open to suggestions.
The query for the migration (which is very slow) is the following:
INSERT INTO `metric_macs`
(`uuid`,`shortname`,`mac`,`start`,`stop`,`duration`)
SELECT uuid, shortname, mac, a, b, duration
FROM import
ON DUPLICATE KEY UPDATE id = id
Query for unique users:
SELECT DISTINCT mac FROM `metric_macs` WHERE uuid in ('xxxx','yyyyy') and ( start BETWEEN '2020-01-01' and '2020-02-01' or stop BETWEEN '2020-01-01' and '2020-02-01') ;
Count of all datasets
Query for recurrent users:
SELECT id FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
HAVING COUNT(*) > 1
Count of all datasets
Query for unique location switching users:
SELECT uuid,mac FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
After that, PHP is used to count all users with more than two distinct UUIDs.
The list is updated every 15 minutes with a list of UIDs (MACs) that are connected to a NAS; that list is checked for activity of a given UID (MAC) in the last 20 minutes. If there was activity, we update the stop timestamp of the last entries, add 15 minutes, and start the calculation again.
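In rough SQL terms, the refresh does something like this (a simplified sketch, not our exact code; we assume duration is stored in seconds):
UPDATE metric_macs
SET stop = stop + INTERVAL 15 MINUTE,
    duration = duration + 900   -- assuming duration is in seconds
WHERE uuid = '$uuid' AND mac = '$mac'
  AND stop >= NOW() - INTERVAL 20 MINUTE;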
Sorry for the mess. We are fairly new to this kind of report generation. What are the possible ways to optimize the database or queries for near instant reporting?
Thanks!
Edit:
CREATE TABLE `metric_macs` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`uuid` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`shortname` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`mac` varchar(80) COLLATE utf8mb4_unicode_ci NOT NULL,
`start` datetime NOT NULL,
`stop` datetime NOT NULL,
`duration` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `metric_macs_uuid_index` (`uuid`),
KEY `metric_macs_mac_index` (`mac`),
KEY `metric_macs_start_stop_index` (`start`,`stop`),
KEY `metric_macs_uuid_start_stop_index` (`uuid`,`start`,`stop`),
KEY `metric_macs_uuid_stop_index` (`uuid`,`stop`)
) ENGINE=InnoDB AUTO_INCREMENT=357850432 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

It is good to move away from 36-byte ids. However, don't stop at 8 bytes; you probably don't need more than 4 bytes (INT UNSIGNED, max of 4 billion) or 3 (MEDIUMINT UNSIGNED, max of 16M).
While you are at it, remove the dashes and unhex the uuids so they can fit in BINARY(16) (16 bytes).
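For example, a sketch of the conversion expressions (uuid_bin is just a placeholder name for the new BINARY(16) column, and the uuid literal is an example value):
-- storing: strip the dashes and pack the 32 hex chars into 16 bytes
SELECT UNHEX(REPLACE('3f06af63-a93c-11e4-9797-00505690773f', '-', '')) AS uuid_bin;
-- reading back (only needed for display)
SELECT LOWER(INSERT(INSERT(INSERT(INSERT(HEX(uuid_bin), 9,0,'-'), 14,0,'-'), 19,0,'-'), 24,0,'-')) AS uuid_text
FROM (SELECT UNHEX(REPLACE('3f06af63-a93c-11e4-9797-00505690773f', '-', '')) AS uuid_bin) AS t;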
I think you have 3 problems to tackle in the conversion:
Efficiently changing the current schema to a better one. Assuming this has old, unchanging, rows, you can do this in the background.
Quickly finishing the final step. (We will actually do this last.)
Changing the ingestion to the new format.
Step 0: Grab the latest timestamp so you know where to do steps 2 and 3 after spending time doing step 1.
Step 1: To build users and stations, it might be simply
INSERT INTO users (user_id)
SELECT UUID2BIN(userID)
FROM ( SELECT DISTINCT userID FROM log ) AS t;
(and similarly for stations)
See this for converting uuids: http://mysql.rjweb.org/doc.php/uuid
That may take some time, but it does the de-dup efficiently.
Let me discuss step 3 before filling in step 2.
Step 3: If the ingestion rate is "high", see this for details on a ping-ponging staging table and for bulk normalization, etc:
http://mysql.rjweb.org/doc.php/staging_table
However, your ingestion rate might not be that fast. Do not use IODKU with the trick of using LAST_INSERT_ID to get the id from users and stations; it will "burn" ids and threaten to overflow your INT/MEDIUMINT id. Instead, see the link above.
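A hedged sketch of the id-safe normalization insert described in that link (the staging table name and the column names are assumptions based on the examples here):
INSERT INTO users (user_id)
SELECT DISTINCT UUID2BIN(s.userID)
FROM staging AS s
LEFT JOIN users AS u ON u.user_id = UUID2BIN(s.userID)
WHERE u.id IS NULL;
This only inserts ids that are genuinely new, so no auto_increment values are burned.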
Inserting into time_table, if no more than 100 rows per second (HDD) or 1000 per second (SSD), can be a simple INSERT while you look up the necessary ids:
INSERT INTO time_table (user_id, station_id, start_time, stop_time)
VALUES (
( SELECT id FROM users WHERE userID = uuid2bin('...') ),
( SELECT id FROM stations WHERE stationID = uuid2bin('...') ),
'...', '...'
);
Back to step 2. You have saved a bunch of rows in the old table. And you saved the starting date for those. So do the bulk normalization and mass insert from log as if it were a "staging table" as discussed in my link.
That should allow you to convert with zero downtime and only a small amount of time when the new table is "incomplete".
I have not covered why the "reports take a long time". I need to see the SELECTs. Meanwhile, here are two thoughts:
If you build the new INT-like ids, assign them in date order so that they are at least somewhat chronologically ordered, and therefore better clustered for some types of queries.
In general, building and maintaining a "summary table" allows reports to be run much faster. See http://mysql.rjweb.org/doc.php/summarytables
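As a hedged sketch only (the table name metric_macs_daily, the daily grain, and keeping uuid as a string are all assumptions; with the new small ids it would shrink further):
CREATE TABLE metric_macs_daily (
`uuid` VARCHAR(255) NOT NULL,   -- or the new small station id once converted
`mac`  VARCHAR(80)  NOT NULL,
`day`  DATE NOT NULL,
`sessions` INT UNSIGNED NOT NULL,
`total_duration` INT UNSIGNED NOT NULL,
PRIMARY KEY (`uuid`, `mac`, `day`)
) ENGINE=InnoDB;

INSERT INTO metric_macs_daily (uuid, mac, day, sessions, total_duration)
SELECT uuid, mac, DATE(start), COUNT(*), SUM(duration)
FROM metric_macs
WHERE start >= CURDATE() - INTERVAL 1 DAY
  AND start <  CURDATE()
GROUP BY uuid, mac, DATE(start)
ON DUPLICATE KEY UPDATE
sessions = sessions + VALUES(sessions),
total_duration = total_duration + VALUES(total_duration);
The "unique users" and "recurrent users" reports can then read this much smaller table instead of scanning metric_macs.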
"Query for recurrent users:" has multiple query performance problems. Unless my approach is not adequate, I don't want do get into the details.

Related

MySQL Select Query to Widen Table Optimization

I have a MySQL table with the following configuration:
CREATE TABLE `MONITORING` (
`REC_ID` int(20) NOT NULL AUTO_INCREMENT,
`TIME` int(11) NOT NULL,
`DEVICE_ID` varchar(30) COLLATE utf8_unicode_ci NOT NULL,
`MON_ID` varchar(10) COLLATE utf8_unicode_ci NOT NULL,
`TEMPERATURE` float NOT NULL,
`HUMIDITY` float NOT NULL,
PRIMARY KEY (`REC_ID`),
KEY `SelectQueryIndex` (`TIME`,`MON_ID`))
ENGINE=MyISAM AUTO_INCREMENT=102069 DEFAULT CHARSET=utf8
COLLATE=utf8_unicode_ci
Multiple Monitoring Devices send data, always exactly on the minute, but not all monitors are always online. I am using PHP to query the database and format the data to put into a Google Line Chart.
To get the data into the Google Chart I am running a SELECT Query which is giving the results with all of the MON_ID's on a single line.
The Query I am currently using is:
SELECT `TIME`, `H5-C-T`, `P-C-T`, `H5-C-H`, `P-C-H`, `A-T`, `A-H` FROM
(SELECT `TIME`, `TEMPERATURE` as 'H5-C-T', `HUMIDITY` as 'H5-C-H' FROM `MONITORING` where `MON_ID` = 'H5-C') AS TAB_1,
(SELECT `TIME` as `TIME2`, `TEMPERATURE` as 'P-C-T', `HUMIDITY` as 'P-C-H' FROM `MONITORING` where `MON_ID` = 'P-C') AS TAB_2,
(SELECT `TIME` as `TIME3`, `TEMPERATURE` as 'A-T', `HUMIDITY` as 'A-H' FROM `MONITORING` where `MON_ID` = 'Ambient') AS TAB_3
WHERE TAB_1.TIME = TAB_2.TIME2 AND TAB_1.TIME = TAB_3.TIME3
The results are exactly what I want (a table with TIME and then a Temp and RH column for each of the three monitors), but it seems like the query is taking a lot longer than it should to return the results.
Opening the full table, or selecting all rows of just one monitoring device takes about 0.0006 seconds (can't ask for much better than that).
If I do the query with 2 of the monitoring devices it takes about 0.09 seconds (still not bad, but a pretty big percentage increase).
When I put in the third monitoring device the query goes up to about 2.5 seconds (this is okay now, but as more data is collected and more of the devices end up needing to be in charts at one time, it is going to get excessive pretty quick).
I have looked at a lot of posts where people were trying to optimize their queries, but could not find any which were doing the query the same way as me (maybe I am doing it a bad way...). From the other things people have done to improve performance I have tried multiple indexing methods, made sure to check, analyze, and optimize the table in PHP MyAdmin, tried several other querying methods, changed sort field / order of the table, etc. but have not been able to find another way to get the results I need which was any faster.
My table has a total of a little under 100,000 rows, and it seems like my query times are WAY longer than should be expected, based on the many people I saw doing queries on tables with tens of millions of records.
Any recommendations on a way to optimize my query?
Maybe the answer is something like multiple MySQL queries and then somehow merge them together in PHP (I tried to figure out a way to do this, but could not get it to work)?
Flip things inside out; performance will be a lot better:
SELECT h.`TIME`,
h.TEMPERATURE AS 'H5-C-T',
p.TEMPERATURE AS 'P-C-T',
h.HUMIDITY AS 'H5-C-H',
p.HUMIDITY AS 'P-C-H',
a.TEMPERATURE AS 'A-T',
a.HUMIDITY AS 'A-H'
FROM MONITORING AS h
JOIN MONITORING AS p ON h.TIME = p.TIME
JOIN MONITORING AS a ON a.TIME = h.TIME
WHERE h.`MON_ID` = 'H5-C'
AND p.`MON_ID` = 'P-C'
AND a.`MON_ID` = 'Ambient'
Note the explicit JOIN ... ON syntax instead of the comma-joins in your original query.
Is the combination of TIME and MON_ID unique? If so, performance would be even better if you switched to InnoDB, got rid of REC_ID, and changed KEY SelectQueryIndex (TIME,MON_ID) into PRIMARY KEY(TIME, MON_ID).
Even if you keep REC_ID, you should consider switching to InnoDB.
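If (TIME, MON_ID) really is unique, a sketch of that conversion might look like this (run it with care; the ALTER rebuilds the table):
ALTER TABLE MONITORING
ENGINE = InnoDB,
DROP PRIMARY KEY,
DROP COLUMN REC_ID,
DROP KEY SelectQueryIndex,
ADD PRIMARY KEY (TIME, MON_ID);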

DELETE performance hit on a large MySQL table with index

Let's say we have a web forum application with a MySQL 5.6 database that is accessed 24/7 by many, many users. Now there is a table like this for metadata of notifications sent to users.
| notifications | CREATE TABLE `notifications` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`user_id` bigint(20) unsigned NOT NULL,
`message_store_id` bigint(20) unsigned NOT NULL,
`status` varchar(10) COLLATE ascii_bin NOT NULL,
`sent_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`,`sent_date`)
) ENGINE=InnoDB AUTO_INCREMENT=736601 DEFAULT CHARSET=ascii COLLATE=ascii_bin |
This table has 1 million rows. With this table, a certain message_store_id suddenly becomes ineffective for some reason, and I'm planning to remove all of the records with that message_store_id with a single delete statement like
DELETE FROM notifications WHERE message_store_id = 12345;
This single statement affects 10% of the table, since this message was sent to so many users. Meanwhile this notifications table is accessed all the time by thousands of users, so the index must be present. Apparently index recreation is very costly when deleting records, so I'm afraid to run this and cause downtime by maxing out the server resources. However, if I drop the index, delete the records, and then add the index again, I would have to shut down the database for some time, which unfortunately is not possible for our service.
I wish MySQL 5.6 were not so stupid that this single statement can kill the database, but I guess it's very likely. My question is: is the index recreation really fatal for a case like this? If so, is there any good strategy for this operation that doesn't require me to halt the database for maintenance?
There can be a lot of tricks/strategies you could employ depending on details of your application.
If you plan to do these operations on a regular basis (e.g. it's not a one-time thing), AND you have few distinct values in message_store_id, you can use partitions. Partition by value of message_store_id, create X partitions beforehand (where X is some reasonable cap on the number of values for the id), and then you can delete all the records in that partition in an instant by truncating that partition. A matter of milliseconds. Downside: message_store_id will have to be part of the primary key. Note: you'll have to create partitions beforehand, because the last time I worked with them, alter table add partition re-created the entire table, which is a disaster on large tables.
Even if the alter table truncate partition does not work for you, you can still benefit from partitioning. If you issue a DELETE on the partition, by supplying corresponding where condition, the rest of the table will not be affected/locked by this DELETE op.
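A hedged sketch of what that partition layout could look like (partition names and the id list are made up; PARTITION BY must come last in the ALTER, and every unique key must include message_store_id):
ALTER TABLE notifications
DROP PRIMARY KEY,
ADD PRIMARY KEY (id, message_store_id),
PARTITION BY LIST (message_store_id) (
PARTITION p12345 VALUES IN (12345),
PARTITION p12346 VALUES IN (12346)
-- ... one partition per expected message_store_id, created up front
);

-- later, dropping all rows for one message_store_id is just:
ALTER TABLE notifications TRUNCATE PARTITION p12345;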
Alternative way of deleting records without locking the DB for too long:
// assuming autocommit mode and an existing PDO connection in $pdo;
// the chunk size of 10000 is arbitrary
while (true) {
    $affected = $pdo->exec(
        "DELETE FROM notifications WHERE message_store_id = 12345 LIMIT 10000"
    );
    // at this moment locks are released and other transactions have a chance
    // to do some stuff
    if ($affected === 0) {
        break;
    }
    // give other transactions more time to do their stuff
    // before the next chunk gets deleted
    sleep(5);
}
One option is to perform the delete as several smaller operations, rather than one huge operation.
MySQL provides a LIMIT clause, which will limit the number of rows matched by the query.
For example, you could delete just 1000 rows:
DELETE FROM notifications WHERE message_store_id = 12345 LIMIT 1000;
You could repeat that, leaving a suitable window of time for other operations (competing for
locks on the same table) to complete. To handle this in pure SQL, we can use the MySQL SLEEP() function, to pause for 2 seconds, for example:
SELECT SLEEP(2);
And obviously, this can be incorporated into a loop, in a MySQL procedure, for example, continuing to loop until the DELETE statement affects zero rows.
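A hedged sketch of such a procedure (the procedure name, chunk size, and sleep interval are arbitrary):
DELIMITER //
CREATE PROCEDURE purge_notifications()
BEGIN
    DECLARE rows_deleted INT DEFAULT 1;
    WHILE rows_deleted > 0 DO
        DELETE FROM notifications WHERE message_store_id = 12345 LIMIT 1000;
        SET rows_deleted = ROW_COUNT();
        -- pause so competing transactions can acquire locks
        DO SLEEP(2);
    END WHILE;
END//
DELIMITER ;

CALL purge_notifications();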

Best log database structure

MySQL/PHP: I want to create a record of the activities that people perform on the site.
Table ADDED -> EventID, UserID, Time, IP
Table DELETED -> EventID, UserID, Time, IP
Table SHARED -> EventID, UserID, Time, IP.
Is it more efficient to join these tables when querying to read, for example, the last 10 actions performed by a UserID, or would it be more efficient to structure it like this?
Table EVERYTHING -> EventID, EventType(eg ADDED, DELETED, SHARED), UserID, Time, IP
Use one table which logs all events and differentiates the event type, as in your second suggestion.
You are storing only one type of data here, and it is therefore appropriate to store it in one table. In the early stages, you ought not worry too much about the size the table will grow to over time. Having only a few columns in a table like this, it can easily grow to many millions of rows before you would even need to consider partitioning it.
If you have a limited number of event types, you might consider using the ENUM() data type for the EventType column.
Using one table is the right thing to do because it is properly normalized. Adding a new event type should not require a new table. It's also much easier to maintain referential integrity and make use of indexes for retrieving and sorting all events for a user. (If you had them in separate tables, getting all events for a user and sorting them by time could be much, much slower than using one table!)
There are ways you can make these tables smaller, though, to save space and keep your indexes small:
Use an enum() to define your event types. If you have a small number of events, you use at most one byte per row.
Use an UNSIGNED integer type to get more EventID and UserIDs out of the same number of bytes.
If you don't need the full range of dates (likely), use a TIMESTAMP type to save 4 bytes per row vs a DATETIME type.
If you are only using ipv4 addresses, store the IP as an unsigned 4-byte integer and use INET_ATON() and INET_NTOA() to convert back and forth. This is the biggest winner here: a VARCHAR type would take at least 16 bytes, and you could potentially use a fixed row length format.
I recommend a table format like this:
CREATE TABLE Events (
`EventID` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
`UserID` MEDIUMINT UNSIGNED NOT NULL COMMENT 'this allows a bit more than 16 million users, and your indexes will be smaller',
`EventType` ENUM('add','delete','share') NOT NULL,
`Time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`IP` INTEGER UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (`EventID`),
FOREIGN KEY (`UserID`) REFERENCES `Users` (`UserId`) ON UPDATE CASCADE ON DELETE CASCADE,
KEY (UserID)
);
If you store this using MyISAM, your row length will be 16 bytes, using a fixed format. This means every million rows requires 16MB of space for the data, and probably half that for indexes (depending on what indexes you use). This is so compact that mysql can probably keep the entire working portion of the table in memory most of the time.
Then it's an issue of creating the indexes you need for the operations that are most common. For example, if you always show all a user's events in a certain time range, replace KEY (UserID) with INDEX userbytime (UserID, Time). Then queries which are like SELECT * FROM Events WHERE UserID=? AND Time BETWEEN ? AND ? will be very fast.
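A hedged usage sketch for the IP handling and the userbytime index suggested above (the literal values are made up, and the insert assumes the referenced user row exists):
-- store a dotted-quad address as a 4-byte integer
INSERT INTO Events (UserID, EventType, IP)
VALUES (42, 'add', INET_ATON('192.168.1.10'));

-- read it back in human-readable form, using the (UserID, Time) index
SELECT EventID, EventType, Time, INET_NTOA(IP) AS IP
FROM Events
WHERE UserID = 42
  AND Time BETWEEN '2020-01-01' AND '2020-02-01'
ORDER BY Time;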

Most efficient way to store data for a graph

I have come up with a total of three different, equally viable methods for saving data for a graph.
The graph in question is "player's score in various categories over time". Categories include "buildings", "items", "quest completion", "achievements" and so on.
Method 1:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`date` DATE NOT NULL,
`category` ENUM('buildings','items',...) NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`userid`, `date`, `category`),
INDEX `userid` (`userid`),
INDEX `date` (`date`)
) ENGINE=InnoDB
This table contains one row for each user/date/category combination. To show a user's data, select by userid. Old entries are cleared out by:
DELETE FROM `graphdata` WHERE `date` < DATE_ADD(NOW(),INTERVAL -1 WEEK)
Method 2:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`buildings-1day` FLOAT UNSIGNED NOT NULL,
`buildings-2day` FLOAT UNSIGNED NOT NULL,
... (and so on for each category up to `-7day`
PRIMARY KEY (`userid`)
)
Selecting by user id is faster due to being a primary key. Every day scores are shifted down the fields, as in:
... SET `buildings-3day`=`buildings-2day`, `buildings-2day`=`buildings-1day`...
Entries are not deleted (unless a user deletes their account). Rows can be added/updated with an INSERT...ON DUPLICATE KEY UPDATE query.
Method 3:
Use one file for each user, containing a JSON-encoded array of their score data. Since the data is being fetched by an AJAX JSON call anyway, this means the file can be fetched statically (and even cached until the following midnight) without any stress on the server. Every day the server runs through each file, shift()s the oldest score off each array and push()es the new one on the end.
Personally I think Method 3 is by far the best, however I've heard bad things about using files instead of databases - for instance if I wanted to be able to rank users by their scores in different categories, this solution would be very bad.
Out of the two database solutions, I've implemented Method 2 on one of my older projects, and that seems to work quite well. Method 1 seems "better" in that it makes better use of relational databases and all that stuff, but I'm a little concerned in that it will contain (number of users) * (number of categories) * 7 rows, which could turn out to be a big number.
Is there anything I'm missing that could help me make a final decision on which method to use? 1, 2, 3 or none of the above?
If you're going to use a relational db, method 1 is much better than method 2. It's normalized, so it's easy to maintain and search. I'd change the date field to a timestamp and call it added_on (or something that's not a reserved word like 'date' is). And I'd add an auto_increment primary key score_id so that user_id/date/category doesn't have to be unique. That way, if a user managed to increment his building score twice in the same second, both would still be recorded.
The second method requires you to update all the records every day. The first method only does inserts, no updates, so each record is only written to once.
... SET buildings-3day=buildings-2day, buildings-2day=buildings-1day...
You really want to update every single record in the table every day until the end of time?!
Selecting by user id is faster due to being a primary key
Since user_id is the first field in your Method 1 primary key, it will be similarly fast for lookups. As first field in a regular index (which is what I've suggested above), it will still be very fast.
The idea with a relational db is that each row represents a single instance/action/occurrence. So when a user does something to affect his score, do an INSERT that records what he did. You can always create a summary from data like this. But you can't get this kind of data from a summary.
Secondly, you seem unduly concerned about getting rid of old data. Why? Your select queries would have a date range on them that would exclude old data automatically. And if you're concerned about performance, you can partition your tables based on row age or set up a cronjob to delete old records periodically.
ETA: Regarding JSON stored in files
This seems to me to combine the drawbacks of Method 2 (difficult to search, every file must be updated every day) with the additional drawbacks of file access. File accesses are expensive. File writes are even more so. If you really want to store summary data, I'd run a query only when the data is requested and I'd store the results in a summary table by user_id. The table could hold a JSON string:
CREATE TABLE score_summaries(
user_id INT unsigned NOT NULL PRIMARY KEY,
gen_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
json_data TEXT NOT NULL DEFAULT '{}'
);
For example:
Bob (user_id=7) logs into the game for the first time. He's on his profile page which displays his weekly stats. These queries ran:
SELECT json_data FROM score_summaries
WHERE user_id=7
AND gen_date > DATE_SUB(CURDATE(), INTERVAL 1 DAY);
//returns nothing so generate summary record
SELECT DATE(added_on), category, SUM(score)
FROM scores WHERE user_id=7 AND added_on < CURDATE() AND added_on >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY DATE(added_on), category; //never include today's data, encode as json with php
INSERT INTO score_summaries(user_id, json_data)
VALUES(7, '$json') //from PHP, in this case $json == NULL
ON DUPLICATE KEY UPDATE json_data=VALUES(json_data)
//use $json for presentation too
Today's scores are generated as needed and not stored in the summary. If Bob views his scores again today, the historical ones can come from the summary table or could be stored in a session after the first request. If Bob doesn't visit for a week, no summary needs to be generated.
Method 1 seems like a clear winner to me. If you are concerned about the size of a single table (graphdata) being too big, you could reduce it by creating:
CREATE TABLE `graphdata` (
`graphDataId` INT UNSIGNED NOT NULL,
`categoryId` INT NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`graphDataId`)
) ENGINE=InnoDB
Then create two more tables, because you obviously need info connecting graphDataId with userId:
CREATE TABLE `graphDataUser` (
`graphDataId` INT UNSIGNED NOT NULL,
`userId` INT NOT NULL
) ENGINE=InnoDB
and a table connecting graphDataId with its date:
CREATE TABLE `graphDataDate` (
`graphDataId` INT UNSIGNED NOT NULL,
`graphDataDate` DATE NOT NULL
) ENGINE=InnoDB
I think you don't really need to worry about the number of rows a table contains, because most databases handle large row counts well. Your job is only to get the data formatted in a way that it is easily retrieved, no matter what the task is for which the data is retrieved. Following that advice should pay off in the long run.

Optimizing an SQL query with generated GROUP BY statement

I have this query:
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC
The purpose is to select data for drawing a graph; I use the average over a pixel's worth of data to make the graph faithful to the data.
So far optimization has included adding indexes, switching between MyISAM and InnoDB but no luck.
Since the time interval changes with graph zoom and the period of data collection, I cannot make a separate column for the GROUP BY statement, and the query is slow. Does anyone have ideas for optimizing this query or the table to make this grouping faster? I currently have an index on the timestamp, sensor_id and project_id columns; the timestamp index is not used, however.
When running explain extended with the query I get the following:
id  select_type  table          type  possible_keys                        key               key_len  ref                               rows   filtered  Extra
1   SIMPLE       sensor_locass  ref   sensor_id_lookup,project_id_lookup   sensor_id_lookup  4        const                             2      100.00    Using where; Using temporary; Using filesort
1   SIMPLE       sensor_data    ref   idsensor_lookup                      idsensor_lookup   4        webstech.sensor_locass.sensor_id  66857  100.00
The sensor_data table contains at the moment 2.7 million datapoints, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments or solutions would be most welcome.
EDIT table definitions:
CREATE TABLE `sensor_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`gateway_id` int(11) NOT NULL,
`timestamp` int(10) NOT NULL,
`v1` int(11) NOT NULL,
`v2` int(11) NOT NULL,
`v3` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`temp` decimal(5,3) NOT NULL,
`oxygen` decimal(5,3) NOT NULL,
`batVol` decimal(4,3) NOT NULL,
PRIMARY KEY (`id`),
KEY `gateway_id` (`gateway_id`),
KEY `time_lookup` (`timestamp`),
KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1
CREATE TABLE `sensor_locass` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`project_id` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`start` date NOT NULL,
`end` date NOT NULL,
`multT` decimal(6,3) NOT NULL,
`conT` decimal(6,3) NOT NULL,
`multO` decimal(6,3) NOT NULL,
`conO` decimal(6,3) NOT NULL,
`xpos` decimal(4,2) NOT NULL,
`ypos` decimal(4,2) NOT NULL,
`lat` decimal(9,6) NOT NULL,
`lon` decimal(9,6) NOT NULL,
`isRef` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `sensor_id_lookup` (`sensor_id`),
KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1
Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't gonna change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either since nothing is querying the timestamp, just calculating on it (unless you can restrict on the timestamp values as galymzhan suggests).
Oh, and you can drop the LEFT in your query, since matching on project_id makes it irrelevant anyway (but doesn't slow anything down). And please don't interpolate variables directly into a query if those variables come from customer input to avoid $project_id = "'; DROP TABLES; --" type sql injection exploits.
Adjusting your heap sizes could work for a while but you'll have to continue adjusting it if you need to scale.
The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value for $secondInterval, which I'm assuming isn't very plausible given the flexibility that you said you needed. In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods of time but if you look back further, the data has been effectively lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc). You could customize those averages initially but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for one hour of time occurring six months ago).
So you'll have to decide whether such restrictions are reasonable given your requirements.
If not, I would then argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP, rather than in your query. As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause is forcing MySQL to crunch all the data in memory, but since it's too much data it's actually writing that data temporarily to disk (hence the "Using filesort"). However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you could pull the data out row by row, combining it on the fly and thereby never needing to keep all the rows in memory in your PHP process. You could then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, then make sure you use FORCE INDEX FOR ORDER BY (timestamp).
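A hedged sketch of that kind of pull (the index name time_lookup is taken from the table definition above; the per-pixel averaging would then happen row by row in PHP while iterating the result):
SELECT d.timestamp, d.temp, l.multT, l.conT
FROM sensor_data AS d FORCE INDEX FOR ORDER BY (time_lookup)
JOIN sensor_locass AS l USING (sensor_id)
WHERE d.sensor_id = '$id'
  AND l.project_id = '$project'
ORDER BY d.timestamp ASC;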
I'd suggest that you find a natural primary key for your tables and switch to InnoDB. This is a guess at what your data looks like:
sensor_data:
PRIMARY KEY (sensor_id, timestamp)
sensor_locass:
PRIMARY KEY (sensor_id, project_id)
InnoDB will order all the data in this way so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble. If you can keep it below the size where it switches over to a file sort (tmp_table_size and max_heap_table_size), it'll be much faster.
How many rows are you generally returning? How long is it taking now?
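A sketch of that switch (only if (sensor_id, project_id) and (sensor_id, timestamp) really are unique; otherwise keep id and add them as secondary indexes instead):
ALTER TABLE sensor_locass
ENGINE = InnoDB,
DROP PRIMARY KEY,
DROP COLUMN id,
ADD PRIMARY KEY (sensor_id, project_id);

ALTER TABLE sensor_data
ENGINE = InnoDB,
DROP PRIMARY KEY,
DROP COLUMN id,
ADD PRIMARY KEY (sensor_id, `timestamp`);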
As Joshua suggested, you should define (sensor_id, project_id) as a primary key for sensor_locass table, because at the moment table has 2 separate indexes on each of the columns. According to mysql docs, SELECT will choose only one index from them (most restrictive, which finds fewer rows), while primary key allows to use both columns for indexing data.
However, EXPLAIN shows that MySQL examined 66857 rows on a joined table, so you should somehow optimize that too. Maybe you could query sensor data for a given interval of time, like timestamp BETWEEN (begin, end) ?
I agree that the first step should be to define sensor_id, project_id as primary key for sensor_locass.
If that is not enough and your data is relatively static, you can create an aggregated table that you can refresh, for example, every day and then query from there.
What you still have to do is define a range for secondInterval, store that in a new table, and add that field to the primary key of your aggregated table.
The query to populate the aggregated table will be something like this:
INSERT INTO aggregated_sensor_data (sensor_id,project_id,secondInterval,timestamp,temp,meh)
SELECT
sensor_locass.sensor_id,
sensor_locass.project_id,
secondInterval,
timestamp,
ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/secondInterval) as meh
FROM
sensor_locass
LEFT JOIN sensor_data
USING(sensor_id)
LEFT JOIN secondIntervalRange
ON 1 = 1
WHERE
sensor_id = '$id'
AND
project_id = '$project'
GROUP BY
sensor_locass.sensor_id,
sensor_locass.project_id,
meh
ORDER BY
timestamp ASC
And you can use this query to extract the aggregated data:
SELECT
temp,
meh
FROM
aggregated_sensor_data
WHERE
sensor_id = '$id'
AND project_id = '$project'
AND secondInterval = $secondInterval
ORDER BY
timestamp ASC
If you want to use the timestamp index, you will have to tell MySQL explicitly to use that index. MySQL 5.1 supports USE INDEX FOR ORDER BY / FORCE INDEX FOR ORDER BY. Have a look at it here: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
