I have come up with a total of three different, equally viable methods for saving data for a graph.
The graph in question is "player's score in various categories over time". Categories include "buildings", "items", "quest completion", "achievements" and so on.
Method 1:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`date` DATE NOT NULL,
`category` ENUM('buildings','items',...) NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`userid`, `date`, `category`),
INDEX `userid` (`userid`),
INDEX `date` (`date`)
) ENGINE=InnoDB
This table contains one row for each user/date/category combination. To show a user's data, select by userid. Old entries are cleared out by:
DELETE FROM `graphdata` WHERE `date` < DATE_ADD(NOW(),INTERVAL -1 WEEK)
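To show the graph, one user's data for the past week can then be fetched with a single query; a minimal sketch (the user id is a placeholder):
-- one row per category per day, ready to be split into series client-side
SELECT `date`, `category`, `score`
FROM `graphdata`
WHERE `userid` = 123
  AND `date` >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
ORDER BY `category`, `date`;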
Method 2:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`buildings-1day` FLOAT UNSIGNED NOT NULL,
`buildings-2day` FLOAT UNSIGNED NOT NULL,
... (and so on for each category, up to `-7day`)
PRIMARY KEY (`userid`)
)
Selecting by user id is faster due to being a primary key. Every day scores are shifted down the fields, as in:
... SET `buildings-3day`=`buildings-2day`, `buildings-2day`=`buildings-1day`...
Entries are not deleted (unless a user deletes their account). Rows can be added/updated with an INSERT...ON DUPLICATE KEY UPDATE query.
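For illustration, the daily shift and the upsert for Method 2 might look roughly like this (only the `buildings` columns are shown, the literal values are placeholders, and in strict SQL mode the remaining NOT NULL columns would also need values or defaults on insert):
-- Shift every score back one day; assignments are applied left to right,
-- so each slot receives the previous day's value.
UPDATE `graphdata`
SET `buildings-7day` = `buildings-6day`,
    `buildings-6day` = `buildings-5day`,
    `buildings-5day` = `buildings-4day`,
    `buildings-4day` = `buildings-3day`,
    `buildings-3day` = `buildings-2day`,
    `buildings-2day` = `buildings-1day`;

-- Write today's score, creating the row if the user has none yet.
INSERT INTO `graphdata` (`userid`, `buildings-1day`)
VALUES (123, 42.5)
ON DUPLICATE KEY UPDATE `buildings-1day` = VALUES(`buildings-1day`);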
Method 3:
Use one file for each user, containing a JSON-encoded array of their score data. Since the data is being fetched by an AJAX JSON call anyway, this means the file can be fetched statically (and even cached until the following midnight) without any stress on the server. Every day the server runs through each file, shift()s the oldest score off each array and push()es the new one on the end.
Personally I think Method 3 is by far the best; however, I've heard bad things about using files instead of databases - for instance, if I wanted to rank users by their scores in different categories, this solution would be very bad.
Out of the two database solutions, I've implemented Method 2 on one of my older projects, and that seems to work quite well. Method 1 seems "better" in that it makes better use of relational databases and all that stuff, but I'm a little concerned that it will contain (number of users) * (number of categories) * 7 rows, which could turn out to be a big number.
Is there anything I'm missing that could help me make a final decision on which method to use? 1, 2, 3 or none of the above?
If you're going to use a relational db, method 1 is much better than method 2. It's normalized, so it's easy to maintain and search. I'd change the date field to a timestamp and call it added_on (or something that's not a reserved word like 'date' is). And I'd add an auto_increment primary key score_id so that user_id/date/category doesn't have to be unique. That way, if a user managed to increment his building score twice in the same second, both would still be recorded.
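A sketch of that revised Method 1 table (the table and column names here are illustrative, not prescriptive):
CREATE TABLE `scores` (
  `score_id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `user_id` INT UNSIGNED NOT NULL,
  `added_on` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `category` ENUM('buildings','items','quests','achievements') NOT NULL,
  `score` FLOAT UNSIGNED NOT NULL,
  PRIMARY KEY (`score_id`),
  INDEX `user_added` (`user_id`, `added_on`) -- regular index, user_id first
) ENGINE=InnoDB;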
The second method requires you to update all the records every day. The first method only does inserts, no updates, so each record is only written to once.
... SET buildings-3day=buildings-2day, buildings-2day=buildings-1day...
You really want to update every single record in the table every day until the end of time?!
Selecting by user id is faster due to being a primary key
Since user_id is the first field in your Method 1 primary key, it will be similarly fast for lookups. As first field in a regular index (which is what I've suggested above), it will still be very fast.
The idea with a relational db is that each row represents a single instance/action/occurrence. So when a user does something to affect his score, do an INSERT that records what he did. You can always create a summary from data like this. But you can't get this kind of data from a summary.
Secondly, you seem unduly concerned about getting rid of old data. Why? Your select queries would have a date range on them that would exclude old data automatically. And if you're concerned about performance, you can partition your tables based on row age or set up a cronjob to delete old records periodically.
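For example, the periodic cleanup could live inside MySQL itself; a sketch assuming the illustrative `scores` table above and that the event scheduler is enabled:
-- Requires the event scheduler (SET GLOBAL event_scheduler = ON);
-- the one-month retention window here is arbitrary.
CREATE EVENT IF NOT EXISTS purge_old_scores
ON SCHEDULE EVERY 1 DAY
DO DELETE FROM scores WHERE added_on < NOW() - INTERVAL 1 MONTH;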
ETA: Regarding JSON stored in files
This seems to me to combine the drawbacks of Method 2 (difficult to search, every file must be updated every day) with the additional drawbacks of file access. File accesses are expensive. File writes are even more so. If you really want to store summary data, I'd run a query only when the data is requested and I'd store the results in a summary table by user_id. The table could hold a JSON string:
CREATE TABLE score_summaries(
user_id INT unsigned NOT NULL PRIMARY KEY,
gen_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
json_data TEXT NOT NULL -- note: TEXT columns cannot take a literal DEFAULT in MySQL
);
For example:
Bob (user_id=7) logs into the game for the first time. He's on his profile page which displays his weekly stats. These queries ran:
SELECT json_data FROM score_summaries
WHERE user_id=7
AND gen_date > DATE_SUB(CURDATE(), INTERVAL 1 DAY);
-- returns nothing, so generate a summary record
SELECT DATE(added_on), category, SUM(score)
FROM scores
WHERE user_id=7
  AND added_on < CURDATE()
  AND added_on >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY DATE(added_on), category; -- never include today's data; encode as JSON with PHP
INSERT INTO score_summaries(user_id, json_data)
VALUES(7, '$json') -- from PHP; in this case $json == NULL
ON DUPLICATE KEY UPDATE json_data=VALUES(json_data);
-- use $json for presentation too
Today's scores are generated as needed and not stored in the summary. If Bob views his scores again today, the historical ones can come from the summary table or could be stored in a session after the first request. If Bob doesn't visit for a week, no summary needs to be generated.
Method 1 seems like a clear winner to me. If you are concerned about the size of the single table (graphdata) being too big, you could reduce it by creating:
CREATE TABLE `graphdata` (
`graphDataId` INT UNSIGNED NOT NULL,
`categoryId` INT NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`graphDataId`)
) ENGINE=InnoDB
Then create two more tables, because you obviously need to have info connecting graphDataId with userId:
CREATE TABLE `graphDataUser` (
`graphDataId` INT UNSIGNED NOT NULL,
`userId` INT NOT NULL
) ENGINE=InnoDB
and a graphDataId/date connection:
CREATE TABLE `graphDataDate` (
`graphDataId` INT UNSIGNED NOT NULL,
`graphDataDate` DATE NOT NULL
) ENGINE=InnoDB
I think you don't really need to worry about the number of rows a table contains, because most DBMSs handle large row counts well. Your only job is to store the data in a form that is easily retrieved, no matter what task it is retrieved for. Following that advice should pay off in the long run.
We have a MySQL (MariaDB/Galera) cluster containing several billion unique data points in one table.
We need to migrate that table to a new one, sorting out duplicate entries, which takes a very long time, and we are constrained in that regard. The next step would be to generate reports for a given time window and the UUID of the corresponding NAS (a router in the real world / a location), as well as unique IDs (MACs) of users that are recurrent or switch NASes.
The MySQL (MariaDB/Galera) DB is currently about 25GB in size, which should not be an issue. But the queries for reports on UIDs/MACs of users in combination with UUIDs of NASes/locations take a very long time.
The table structure is laid out as depicted here. One is the actual table and two would be a possible optimization, but I really don't know if that would do anything.
Is our DB approach the right one, or should we use a different one (DB, table structure, stack, whatever)? We're open to suggestions.
The query for the migration (which is very slow) is the following:
INSERT INTO `metric_macs`
    (`uuid`,`shortname`,`mac`,`start`,`stop`,`duration`)
SELECT uuid, shortname, mac, a, b, duration
FROM import i
ON DUPLICATE KEY UPDATE id = id
Query for unique users:
SELECT DISTINCT mac FROM `metric_macs` WHERE uuid in ('xxxx','yyyyy') and ( start BETWEEN '2020-01-01' and '2020-02-01' or stop BETWEEN '2020-01-01' and '2020-02-01') ;
Count of all datasets
Query for recurrent users:
SELECT id FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
HAVING COUNT(*) > 1
Count of all datasets
Query for unique location switching users:
SELECT uuid,mac FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
After that, PHP is used to count all users with more than two distinct UUIDs.
The table is updated every 15 minutes with a list of UIDs (MACs) that are connected to a NAS; that list is checked for activity of a given UID (MAC) in the last 20 minutes. If there was activity, we update the stop value of the last entries, add 15 minutes, and start the calculation again.
Sorry for the mess. We are fairly new to this kind of report generation. What are the possible ways to optimize the database or queries for near instant reporting?
Thanks!
Edit:
CREATE TABLE `metric_macs` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`uuid` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`shortname` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`mac` varchar(80) COLLATE utf8mb4_unicode_ci NOT NULL,
`start` datetime NOT NULL,
`stop` datetime NOT NULL,
`duration` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `metric_macs_uuid_index` (`uuid`),
KEY `metric_macs_mac_index` (`mac`),
KEY `metric_macs_start_stop_index` (`start`,`stop`),
KEY `metric_macs_uuid_start_stop_index` (`uuid`,`start`,`stop`),
KEY `metric_macs_uuid_stop_index` (`uuid`,`stop`)
) ENGINE=InnoDB AUTO_INCREMENT=357850432 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
It is good to move away from 36-byte ids. However, don't stop at 8 bytes; you probably don't need more than 4 bytes (INT UNSIGNED, max of 4 billion) or 3 (MEDIUMINT UNSIGNED, max of 16M).
While you are at it, remove the dashes and unhex the uuids so they can fit in BINARY(16) (16 bytes).
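For example (the UUID literal is just a sample; MySQL 8.0 has UUID_TO_BIN() built in, and on MariaDB/older MySQL you can emulate it):
-- Pack a dashed 36-character UUID into 16 bytes for a BINARY(16) column.
SELECT UNHEX(REPLACE('3f06af63-a93c-11e4-9797-00505690773f', '-', '')) AS uuid_bin;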
I think you have 3 problems to tackle in the conversion:
Efficiently changing the current schema to a better one. Assuming this has old, unchanging, rows, you can do this in the background.
Quickly finishing the final step. (We will actually do this last.)
Changing the ingestion to the new format.
Step 0: Grab the latest timestamp so you know where to do steps 2 and 3 after spending time doing step 1.
Step 1: To build users and stations, it might be simply
INSERT INTO users (user_id)
SELECT UUID2BIN(userID)
FROM ( SELECT DISTINCT userID FROM log ) AS x;
(and similarly for stations)
See this for converting uuids: http://mysql.rjweb.org/doc.php/uuid
That may take some time, but it does the de-dup efficiently.
Let me discuss step 3 before filling in step 2.
Step 3: If the ingestion rate is "high", see this for details on a ping-ponging staging table and for bulk normalization, etc:
http://mysql.rjweb.org/doc.php/staging_table
However, your ingestion rate might not be that fast. Do not use IODKU with the trick of using LAST_INSERT_ID to get the id from users and stations. It will "burn" ids and threaten to overflow your INT/MEDIUMINT id. Instead, see the link above.
Inserting into time_table, if no more than 100 rows per second (HDD) or 1000 per second (SSD), can be a simple INSERT that looks up the necessary ids as it goes:
INSERT INTO time_table (user_id, station_id, start_time, stop_time)
VALUES (
( SELECT id FROM users WHERE userID = uuid2bin('...') ),
( SELECT id FROM stations WHERE stationID = uuid2bin('...') ),
'...', '...'
);
Back to step 2. While step 1 was running, a bunch of new rows accumulated in the old table, and you saved the timestamp where they start (step 0). So do the bulk normalization and mass insert from log as if it were a "staging table" as discussed in my link.
That should allow you to convert with zero downtime and only a small amount of time when the new table is "incomplete".
I have not covered why the "reports take a long time". I need to see the SELECTs. Meanwhile, here are two thoughts:
If you build the new INT-like ids, assign them in date order so that they are at least somewhat chronologically ordered and therefore better clustered for some types of queries.
In general, building and maintaining a "summary table" allows reports to be run much faster. See http://mysql.rjweb.org/doc.php/summarytables
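A hypothetical daily summary for this workload (all names below are mine, assuming the users/stations/time_table layout sketched above) might be:
-- One row per station (NAS) per day with the number of distinct users seen.
CREATE TABLE daily_station_users (
  `day`      DATE NOT NULL,
  station_id MEDIUMINT UNSIGNED NOT NULL,
  uniq_users INT UNSIGNED NOT NULL,
  PRIMARY KEY (`day`, station_id)
) ENGINE=InnoDB;

-- Refresh once a day from the detail table; safe to re-run.
INSERT INTO daily_station_users (`day`, station_id, uniq_users)
SELECT DATE(start_time), station_id, COUNT(DISTINCT user_id)
FROM time_table
WHERE start_time >= CURDATE() - INTERVAL 1 DAY
  AND start_time <  CURDATE()
GROUP BY DATE(start_time), station_id
ON DUPLICATE KEY UPDATE uniq_users = VALUES(uniq_users);
Monthly reports then read a few hundred summary rows instead of scanning millions of detail rows.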
"Query for recurrent users:" has multiple query performance problems. Unless my approach is not adequate, I don't want do get into the details.
I am currently working on a project that inserts a lot of data into some tables. To ensure that my system is fast enough, I want to fragment my huge table into smaller tables, each representing one month's data. I have an idea of how it will work, but I still need some more information.
The primary keys of my tables must be continuous so I thought of an architecture that would look like this:
CREATE TABLE `foo` (
`id` bigint(11) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
);
CREATE TABLE `foo012014` (
`id` bigint(11),
`description` varchar(255)
);
CREATE TABLE `foo022014` (
`id` bigint(11),
`description` varchar(255)
);
On every insertion, the PHP page will check whether a table already exists for the month and, if not, will create it.
The thing is, how do I bind the "foo" child table primary key to the "foo" mother table? Plus, is this design a bad practice or is it good?
It's not good practice, and it makes your queries more difficult.
With just the id you already have an index, which allows for good indexing of your data.
If your queries are also nicely written and organized, the time to execute a query in your database will be relatively small, whether the table has 1 million rows or 20 million.
Solutions
First
For better maintenance I recommend the following:
Add a new field in your table foo: created datetime DEFAULT CURRENT_TIMESTAMP (works in MySQL 5.6+; for other versions, either set it manually on every insert or change it to a timestamp).
Then just use this field to group your data based on datetime values, like 2014-01-24 13:18.
It's easy to select and manipulate.
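A rough sketch of that (MySQL 5.6+ for the DATETIME default, as noted above):
ALTER TABLE foo
  ADD COLUMN created DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP;

-- Group rows per month using the new column.
SELECT DATE_FORMAT(created, '%Y-%m') AS period, COUNT(*) AS rows_in_period
FROM foo
GROUP BY period;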
Second
Create an external table with month and year, like this:
drop table if exists foo_periods;
create table foo_periods (
id int not null auto_increment primary key,
month smallint(4) not null,
year smallint(4) not null,
created datetime,
modified datetime,
active boolean not null default 1,
index foo_periods_month (month),
index foo_periods_year (year)
);
You can change the smallint in month to varchar if you prefer.
Then, just create a FK, and done!
ALTER TABLE foo
ADD COLUMN foo_period_id int not null;
ALTER TABLE foo
ADD CONSTRAINT foo_foo_period_id
FOREIGN KEY (foo_period_id)
REFERENCES foo_periods (id);
References
If you want to read more about fragmentation/optimization in MySQL, this is a great post.
MySQL/PHP: I want to create a record of activities that people perform on the site.
Table ADDED -> EventID, UserID, Time, IP
Table DELETED -> EventID, UserID, Time, IP
Table SHARED -> EventID, UserID, Time, IP.
Is it more efficient to join these tables when querying to read, for example, the last 10 actions performed by a USERID, or would it be more efficient to structure it like this?
Table EVERYTHING -> EventID, EventType(eg ADDED, DELETED, SHARED), UserID, Time, IP
Use one table which logs all events and differentiates the event type, as in your second suggestion.
You are storing only one type of data here, and it is therefore appropriate to store it in one table. In the early stages, you ought not worry too much about the size the table will grow to over time. Having only a few columns in a table like this, it can easily grow to many millions of rows before you would even need to consider partitioning it.
If you have a limited number of event types, you might consider using the ENUM() data type for the EventType column.
Using one table is the right thing to do because it is properly normalized. Adding a new event type should not require a new table. It's also much easier to maintain referential integrity and make use of indexes for retrieving and sorting all events for a user. (If you had them in separate tables, getting all events for a user and sorting them by time could be much, much slower than using one table!)
There are ways you can make these tables smaller, though, to save space and keep your indexes small:
Use an enum() to define your event types. If you have a small number of events, you use at most one byte per row.
Use an UNSIGNED integer type to get more EventID and UserIDs out of the same number of bytes.
If you don't need the full range of dates (likely), use a TIMESTAMP type to save 4 bytes per row vs a DATETIME type.
If you are only using ipv4 addresses, store the IP as an unsigned 4-byte integer and use INET_ATON() and INET_NTOA() to convert back and forth. This is the biggest winner here: a VARCHAR type would take at least 16 bytes, and you could potentially use a fixed row length format.
I recommend a table format like this:
CREATE TABLE Events (
`EventID` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
`UserID` MEDIUMINT UNSIGNED NOT NULL COMMENT 'this allows a bit more than 16 million users, and your indexes will be smaller',
`EventType` ENUM('add','delete','share') NOT NULL,
`Time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`IP` INTEGER UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (`EventID`),
FOREIGN KEY (`UserID`) REFERENCES `Users` (`UserId`) ON UPDATE CASCADE ON DELETE CASCADE,
KEY (UserID)
);
If you store this using MyISAM, your row length will be 16 bytes, using a fixed format. This means every million rows requires 16MB of space for the data, and probably half that for indexes (depending on what indexes you use). This is so compact that mysql can probably keep the entire working portion of the table in memory most of the time.
Then it's an issue of creating the indexes you need for the operations that are most common. For example, if you always show all a user's events in a certain time range, replace KEY (UserID) with INDEX userbytime (UserID, Time). Then queries which are like SELECT * FROM Events WHERE UserID=? AND Time BETWEEN ? AND ? will be very fast.
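For illustration, writing and reading events with that layout might look like this (the user id and IP literal are placeholders):
-- Store the IP as an unsigned integer on the way in ...
INSERT INTO Events (UserID, EventType, IP)
VALUES (42, 'share', INET_ATON('203.0.113.7'));

-- ... and convert it back when listing the user's 10 most recent actions,
-- which the suggested (UserID, Time) index can serve directly.
SELECT EventID, EventType, Time, INET_NTOA(IP) AS IP
FROM Events
WHERE UserID = 42
ORDER BY Time DESC
LIMIT 10;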
I have the following schema with the following attributes:
USER(TABLE_NAME)
USER_ID|USERNAME|PASSWORD|TOPIC_NAME|FLAG1|FLAG2
I have 2 questions basically:
How can I make the attribute USER_ID a primary key that automatically increments its value each time I insert a row into the database? It shouldn't be under my control.
How can I retrieve a record from the database based on the latest time at which it was updated? (For example, if I updated a record at 2pm and the same record at 3pm, and I retrieve it now at 4pm, I should get the record that was updated at 3pm, i.e. the latest updated one.)
Please help.
I'm assuming that question one is in the context of MySQL. So, you can use the ALTER TABLE statement to mark a field as PRIMARY KEY and to mark it AUTO_INCREMENT:
ALTER TABLE User
ADD PRIMARY KEY (USER_ID);
ALTER TABLE User
MODIFY COLUMN USER_ID INT(4) AUTO_INCREMENT; -- of course, set the type appropriately
For the second question I'm not sure I understand correctly so I'm just going to go ahead and give you some basic information before giving an answer that may confuse you.
When you update the same record multiple times, only the most recent update is persisted. Basically, once you update a record, its previous values are not kept. So, if you update a record at 2pm and then update the same record at 3pm, when you query for the record you will automatically receive the most recent values.
Now, if by updating you mean you would insert new values for the same USER_ID multiple times and want to retrieve the most recent, then you would need to use a field in the table to store a timestamp of when each record is created/updated. Then you can query for the most recent value based on the timestamp.
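A minimal sketch of that timestamp approach (the column name updated_at is an assumption):
-- MySQL maintains this column automatically on insert and update.
ALTER TABLE USER
  ADD COLUMN updated_at TIMESTAMP NOT NULL
      DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

-- Fetch the most recently written record.
SELECT * FROM USER ORDER BY updated_at DESC LIMIT 1;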
I assume you're talking about Oracle since you tagged it as Oracle. You also tagged the question as MySQL where the approach will be different.
You can make the USER_ID column a primary key
ALTER TABLE <<table_name>>
ADD CONSTRAINT pk_user_id PRIMARY KEY( user_id );
If you want the value to increment automatically, you'd need to create a sequence
CREATE SEQUENCE user_id_seq
START WITH 1
INCREMENT BY 1
CACHE 20;
and then create a trigger on the table that uses the sequence
CREATE OR REPLACE TRIGGER trg_assign_user_id
BEFORE INSERT ON <<table name>>
FOR EACH ROW
BEGIN
:new.user_id := user_id_seq.nextval;
END;
As for your second question, I'm not sure that I understand. If you update a row and then commit that change, all subsequent queries are going to read the updated data (barring exceptionally unlikely cases where you've set a serializable transaction isolation level and you've got transactions that run for multiple hours and you're running the query in that transaction). You don't need to do anything to see the current data.
(Answer based on MySQL; conceptually similar answer if using Oracle, but the SQL will probably be different.)
If USER_ID was not defined as a primary key or automatically incrementing at the time of table creation, then you can use:
ALTER TABLE tablename MODIFY USER_ID INT NOT NULL PRIMARY KEY AUTO_INCREMENT;
To issue queries based on record dates, you have to have a field defined to hold date-related data types. The date and time of record modifications would be something you would manage (e.g. add/change) based on the way in which you are accessing the records (some PHP-related way? It's unclear what scripts you have in play, based on your question). Once you have dates in your records, you can ORDER BY the date field in your SELECT query.
Check this out
For your AUTO_INCREMENT, it's a question already asked here.
For your PRIMARY KEY, use this:
ALTER TABLE USER ADD PRIMARY KEY (USER_ID)
Can you provide more information? If the value gets updated, you definitely do NOT have the old value that you entered at 2pm present in the DB, so querying for it will be fine.
You can use something like this:
CREATE TABLE IF NOT EXISTS user (
USER_ID int(8) unsigned NOT NULL AUTO_INCREMENT,
username varchar(25) NOT NULL,
password varchar(25) NOT NULL,
topic_name varchar(100) NOT NULL,
flag1 smallint(1) NOT NULL DEFAULT 0,
flag2 smallint(1) NOT NULL DEFAULT 0,
update_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (USER_ID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
For selection, use this query:
SELECT * from user ORDER BY update_time DESC
I have this query:
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC
The purpose is to select data for drawing a graph; I use the average over a pixel's worth of data to make the graph faithful to the data.
So far optimization has included adding indexes and switching between MyISAM and InnoDB, but no luck.
Since the time interval changes with graph zoom and the period of data collection, I cannot make a separate column for the GROUP BY statement; the query, however, is slow. Does anyone have ideas for optimizing this query or the table to make this grouping faster? I currently have an index on the timestamp, sensor_id and project_id columns, but the timestamp index is not used.
When running explain extended with the query I get the following:
1 SIMPLE sensor_locass ref sensor_id_lookup,project_id_lookup sensor_id_lookup 4 const 2 100.00 Using where; Using temporary; Using filesort
1 SIMPLE sensor_data ref idsensor_lookup idsensor_lookup 4 webstech.sensor_locass.sensor_id 66857 100.00
The sensor_data table currently contains 2.7 million datapoints, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments or solutions would be most welcome.
EDIT table definitions:
CREATE TABLE `sensor_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`gateway_id` int(11) NOT NULL,
`timestamp` int(10) NOT NULL,
`v1` int(11) NOT NULL,
`v2` int(11) NOT NULL,
`v3` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`temp` decimal(5,3) NOT NULL,
`oxygen` decimal(5,3) NOT NULL,
`batVol` decimal(4,3) NOT NULL,
PRIMARY KEY (`id`),
KEY `gateway_id` (`gateway_id`),
KEY `time_lookup` (`timestamp`),
KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1
CREATE TABLE `sensor_locass` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`project_id` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`start` date NOT NULL,
`end` date NOT NULL,
`multT` decimal(6,3) NOT NULL,
`conT` decimal(6,3) NOT NULL,
`multO` decimal(6,3) NOT NULL,
`conO` decimal(6,3) NOT NULL,
`xpos` decimal(4,2) NOT NULL,
`ypos` decimal(4,2) NOT NULL,
`lat` decimal(9,6) NOT NULL,
`lon` decimal(9,6) NOT NULL,
`isRef` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `sensor_id_lookup` (`sensor_id`),
KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1
Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't gonna change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either since nothing is querying the timestamp, just calculating on it (unless you can restrict on the timestamp values as galymzhan suggests).
Oh, and you can drop the LEFT in your query, since matching on project_id makes it irrelevant anyway (but doesn't slow anything down). And please don't interpolate variables directly into a query if those variables come from customer input, to avoid $project_id = "'; DROP TABLES; --" type SQL injection exploits.
Adjusting your heap sizes could work for a while but you'll have to continue adjusting it if you need to scale.
The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value for $secondInterval, which I'm assuming isn't very plausible given the flexibility that you said you needed. In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods of time, but if you look back further, the data has been effectively lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc). You could customize those averages initially, but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for an hour of time occurring six months ago).
So you'll have to decide whether such restrictions are reasonable given your requirements.
If not, I would then argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP, rather than in your query. As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause is forcing MySQL to crunch all the data in memory, but since it's too much data it's actually writing that data temporarily to disk (hence the "Using filesort"). However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you could pull the data out row by row, combining it on the fly and thereby never needing to keep all the rows in memory in your PHP process. You could then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, make sure you use FORCE INDEX FOR ORDER BY (timestamp).
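A sketch of the raw-row query that approach would run (placeholders bound from PHP; the per-pixel averaging loop itself is not shown):
-- Pull the raw, time-ordered rows and average them in application code.
SELECT d.`timestamp`, d.temp, l.multT, l.conT
FROM sensor_locass AS l
JOIN sensor_data AS d USING (sensor_id)
WHERE l.sensor_id = ? AND l.project_id = ?
ORDER BY d.`timestamp` ASC;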
I'd suggest that you find a natural primary key for your tables and switch to InnoDB. This is a guess at what your data looks like:
sensor_data:
PRIMARY KEY (sensor_id, timestamp)
sensor_locass:
PRIMARY KEY (sensor_id, project_id)
InnoDB will order all the data in this way, so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble. If you can keep it below the size where it switches over to a file sort (tmp_table_size and max_heap_table_size), it'll be much faster.
How many rows are you generally returning? How long is it taking now?
As Joshua suggested, you should define (sensor_id, project_id) as the primary key for the sensor_locass table, because at the moment the table has 2 separate indexes on each of the columns. According to the MySQL docs, SELECT will choose only one index from them (the most restrictive one, which finds fewer rows), while a primary key allows both columns to be used for indexing the data.
However, EXPLAIN shows that MySQL examined 66857 rows on a joined table, so you should somehow optimize that too. Maybe you could query sensor data for a given interval of time, like timestamp BETWEEN (begin, end) ?
I agree that the first step should be to define sensor_id, project_id as the primary key for sensor_locass.
If that is not enough and your data is relatively static, you can create an aggregated table that you refresh, for example, every day, and then query from there.
What you still have to do is define a range of values for secondInterval, store those in a new table, and add that field to the primary key of your aggregated table.
The query to populate the aggregated table will be something like this:
INSERT INTO aggregated_sensor_data (sensor_id,project_id,secondInterval,timestamp,temp,meh)
SELECT
sensor_locass.sensor_id,
sensor_locass.project_id,
secondInterval,
timestamp,
ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/secondInterval) as meh
FROM
sensor_locass
LEFT JOIN sensor_data
USING(sensor_id)
LEFT JOIN secondIntervalRange
ON 1 = 1
WHERE
sensor_id = '$id'
AND
project_id = '$project'
GROUP BY
sensor_locass.sensor_id,
sensor_locass.project_id,
meh
ORDER BY
timestamp ASC
And you can use this query to extract the aggregated data:
SELECT
temp,
meh
FROM
aggregated_sensor_data
WHERE
sensor_id = '$id'
AND project_id = '$project'
AND secondInterval = $secondInterval
ORDER BY
timestamp ASC
If you want to use the timestamp index, you will have to tell MySQL explicitly to use that index. MySQL 5.1 supports USE INDEX FOR ORDER BY / FORCE INDEX FOR ORDER BY. Have a look at it here: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
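For example, using the index names from the posted schema (the sensor id literal is a placeholder):
-- Hint the optimizer to use the timestamp index for the ORDER BY.
SELECT `timestamp`, temp
FROM sensor_data FORCE INDEX FOR ORDER BY (time_lookup)
WHERE sensor_id = 123
ORDER BY `timestamp` ASC;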