MYSQL/PHP, I want to create a record of activities that people perform on the site.
Table ADDED -> EventID, UserID, Time, IP
Table DELETED -> EventID, UserID, Time, IP
Table SHARED -> EventID, UserID, Time, IP.
Is it more efficient to join these tables when querying (for example, to read the last 10 actions performed by a UserID), or would it be more efficient to structure it like this?
Table EVERYTHING -> EventID, EventType(eg ADDED, DELETED, SHARED), UserID, Time, IP
Use one table which logs all events and differentiates the event type, as in your second suggestion.
You are storing only one type of data here, and it is therefore appropriate to store it in one table. In the early stages, you ought not worry too much about the size the table will grow to over time. Having only a few columns in a table like this, it can easily grow to many millions of rows before you would even need to consider partitioning it.
If you have a limited number of event types, you might consider using the ENUM() data type for the EventType column.
Using one table is the right thing to do because it is properly normalized. Adding a new event type should not require a new table. It's also much easier to maintain referential integrity and make use of indexes for retrieving and sorting all events for a user. (If you had them in separate tables, getting all events for a user and sorting them by time could be much, much slower than using one table!)
There are ways you can make these tables smaller, though, to save space and keep your indexes small:
Use an ENUM() to define your event types. With a small number of event types, it costs at most one byte per row.
Use UNSIGNED integer types to get more EventIDs and UserIDs out of the same number of bytes.
If you don't need the full range of dates (likely), use a TIMESTAMP type to save 4 bytes per row compared to a DATETIME type.
If you are only using IPv4 addresses, store the IP as an unsigned 4-byte integer and use INET_ATON() and INET_NTOA() to convert back and forth. This is the biggest winner here: a VARCHAR column would take up to 16 bytes, and a fixed-width integer also lets you use a fixed row length format.
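For instance, the round trip through those functions looks like this (the address is just an example):
SELECT INET_ATON('192.168.1.10');   -- 3232235786, fits in INT UNSIGNED
SELECT INET_NTOA(3232235786);       -- '192.168.1.10'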
I recommend a table format like this:
CREATE TABLE Events (
`EventID` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
`UserID` MEDIUMINT UNSIGNED NOT NULL COMMENT 'this allows a bit more than 16 million users, and your indexes will be smaller',
`EventType` ENUM('add','delete','share') NOT NULL,
`Time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`IP` INTEGER UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (`EventID`),
FOREIGN KEY (`UserID`) REFERENCES `Users` (`UserId`) ON UPDATE CASCADE ON DELETE CASCADE,
KEY (UserID)
);
If you store this using MyISAM, your row length will be 16 bytes, using a fixed format. This means every million rows requires 16MB of space for the data, and probably half that for indexes (depending on what indexes you use). This is so compact that MySQL can probably keep the entire working portion of the table in memory most of the time.
Then it's a matter of creating the indexes you need for the operations that are most common. For example, if you always show all of a user's events in a certain time range, replace KEY (UserID) with INDEX userbytime (UserID, Time). Then queries like SELECT * FROM Events WHERE UserID=? AND Time BETWEEN ? AND ? will be very fast.
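A sketch of that change and the query it serves, assuming MySQL auto-named the original key UserID and using placeholder values (if the FOREIGN KEY complains about the drop, add the new index in one ALTER and drop the old key in a second):
ALTER TABLE Events
  DROP KEY UserID,
  ADD INDEX userbytime (UserID, Time);

-- Last 10 events for one user in a time window, newest first
SELECT EventID, EventType, Time, INET_NTOA(IP) AS IP
FROM Events
WHERE UserID = 42
  AND Time BETWEEN '2024-01-01 00:00:00' AND '2024-01-31 23:59:59'
ORDER BY Time DESC
LIMIT 10;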
We have a MySQL (MariaDB/Galera) cluster containing several billion unique data points in one table.
We need to migrate that table to a new one, sorting out duplicate entries, which takes a very long time, and we are constrained in that regard. The next step would be to generate reports for a given time window and the UUID of the corresponding NAS (a router in the real world / a location), as well as unique IDs (MACs) of users that are recurrent or switch NASes.
The MySQL (MariaDB/Galera) DB right now is about 25 GB in size, which should not be an issue. But the queries for reports on UIDs/MACs of users in combination with UUIDs of NASes/locations take a very long time.
The table structure is laid out as depicted here. One is the actual table and two would be a possible optimization, but I really don't know if that would do anything.
Is our DB approach the right one, or should we use a different one (DB, table structure, stack, whatever; open to suggestions)?
The query for the migration (which is very slow) is the following:
INSERT INTO `metric_macs`
    (`uuid`, `shortname`, `mac`, `start`, `stop`, `duration`)
SELECT uuid, shortname, mac, a, b, duration
FROM import
ON DUPLICATE KEY UPDATE id = id;
Query for unique users:
SELECT DISTINCT mac FROM `metric_macs` WHERE uuid in ('xxxx','yyyyy') and ( start BETWEEN '2020-01-01' and '2020-02-01' or stop BETWEEN '2020-01-01' and '2020-02-01') ;
Count of all datasets
Query for recurrent users:
SELECT id FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
HAVING COUNT(*) > 1
Count of all datasets
Query for unique location switching users:
SELECT uuid,mac FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
After that, PHP is used to count all users with more than two distinct UUIDs.
The table is updated every 15 minutes with a list of UIDs (MACs) that are connected to a NAS; that list is checked for activity of a given UID (MAC) in the last 20 minutes. If there was activity, we update the stop value of the last entries, add 15 minutes, and start the calculation again.
Sorry for the mess. We are fairly new to this kind of report generation. What are the possible ways to optimize the database or queries for near instant reporting?
Thanks!
Edit:
CREATE TABLE `metric_macs` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`uuid` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`shortname` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`mac` varchar(80) COLLATE utf8mb4_unicode_ci NOT NULL,
`start` datetime NOT NULL,
`stop` datetime NOT NULL,
`duration` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `metric_macs_uuid_index` (`uuid`),
KEY `metric_macs_mac_index` (`mac`),
KEY `metric_macs_start_stop_index` (`start`,`stop`),
KEY `metric_macs_uuid_start_stop_index` (`uuid`,`start`,`stop`),
KEY `metric_macs_uuid_stop_index` (`uuid`,`stop`)
) ENGINE=InnoDB AUTO_INCREMENT=357850432 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
It is good to move away from 36-byte ids. However, don't stop at 8 bytes; you probably don't need more than 4 bytes (INT UNSIGNED, max of about 4 billion) or 3 (MEDIUMINT UNSIGNED, max of about 16M).
While you are at it, remove the dashes and unhex the uuids so they can fit in BINARY(16) (16 bytes).
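A minimal sketch of that conversion in plain SQL (a dedicated function such as MySQL 8's UUID_TO_BIN() can additionally reorder the time bits of type-1 uuids, which is skipped here; the uuid value is just an example):
-- 36-character dashed uuid -> 16-byte binary
SET @u = '3f06af63-a93c-11e4-9797-00505690773f';
SET @b = UNHEX(REPLACE(@u, '-', ''));   -- fits in BINARY(16)

-- ...and back to text form for display
SELECT LOWER(CONCAT_WS('-',
         SUBSTR(HEX(@b),  1, 8),
         SUBSTR(HEX(@b),  9, 4),
         SUBSTR(HEX(@b), 13, 4),
         SUBSTR(HEX(@b), 17, 4),
         SUBSTR(HEX(@b), 21))) AS uuid_text;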
I think you have 3 problems to tackle in the conversion:
Efficiently changing the current schema to a better one. Assuming this has old, unchanging, rows, you can do this in the background.
Quickly finishing the final step. (We will actually do this last.)
Changing the ingestion to the new format.
Step 0: Grab the latest timestamp so you know where to do steps 2 and 3 after spending time doing step 1.
Step 1: To build users and stations, it might be simply
INSERT INTO users (user_id)
SELECT UUID2BIN(userID)
FROM ( SELECT DISTINCT userID FROM log ) AS t;
(and similarly for stations)
See this for converting uuids: http://mysql.rjweb.org/doc.php/uuid
That may take some time, but it does the de-dup efficiently.
Let me discuss step 3 before filling in step 2.
Step 3: If the ingestion rate is "high", see this for details on a ping-ponging staging table and for bulk normalization, etc:
http://mysql.rjweb.org/doc.php/staging_table
However, your ingestion rate might not be that fast. Do not use IODKU (INSERT ... ON DUPLICATE KEY UPDATE) with the LAST_INSERT_ID() trick to get the id from users and stations; it will "burn" ids and threaten to overflow your INT/MEDIUMINT id. Instead, see the link above.
Inserting into time_table, if there are no more than 100 rows per second (HDD) or 1000 per second (SSD), can be a simple INSERT that looks up the necessary ids as it goes:
INSERT INTO time_table (user_id, station_id, start_time, stop_time)
VALUES (
( SELECT id FROM users WHERE userID = uuid2bin('...') ),
( SELECT id FROM stations WHERE stationID = uuid2bin('...') ),
'...', '...'
);
Back to step 2. You have saved a bunch of rows in the old table. And you saved the starting date for those. So do the bulk normalization and mass insert from log as if it were a "staging table" as discussed in my link.
That should allow you to convert with zero downtime and only a small amount of time when the new table is "incomplete".
I have not covered why the "reports take a long time". I need to see the SELECTs. Meanwhile, here are two thoughts:
If you build the new INT-like ids, assign them in date order so that the rows are at least somewhat chronologically ordered, and therefore better clustered for some types of queries.
In general, building and maintaining a "summary table" allows reports to be run much faster. See http://mysql.rjweb.org/doc.php/summarytables
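As one possible shape (the table and column names here are made up, built on the users/stations/time_table layout above), a daily rollup per NAS could be maintained and queried instead of the raw log:
-- Hypothetical daily summary: one row per (station, user, day)
CREATE TABLE daily_station_users (
  station_id MEDIUMINT UNSIGNED NOT NULL,
  user_id    INT UNSIGNED NOT NULL,
  `day`      DATE NOT NULL,
  sessions   INT UNSIGNED NOT NULL,
  PRIMARY KEY (station_id, `day`, user_id)
) ENGINE=InnoDB;

-- Refresh one day's worth (from cron, or folded into the ingestion step)
INSERT INTO daily_station_users (station_id, user_id, `day`, sessions)
SELECT station_id, user_id, DATE(start_time), COUNT(*)
FROM time_table
WHERE start_time >= '2020-01-01' AND start_time < '2020-01-02'
GROUP BY station_id, user_id, DATE(start_time)
ON DUPLICATE KEY UPDATE sessions = VALUES(sessions);

-- "Unique users in January at these stations" then reads the small table
SELECT COUNT(DISTINCT user_id)
FROM daily_station_users
WHERE station_id IN (1, 2)
  AND `day` >= '2020-01-01' AND `day` < '2020-02-01';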
"Query for recurrent users:" has multiple query performance problems. Unless my approach is not adequate, I don't want do get into the details.
Let's say we have a web forum application with a MySQL 5.6 database that is accessed 24/7 by many, many users. Now there is a table like this for the metadata of notifications sent to users.
CREATE TABLE `notifications` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` bigint(20) unsigned NOT NULL,
  `message_store_id` bigint(20) unsigned NOT NULL,
  `status` varchar(10) COLLATE ascii_bin NOT NULL,
  `sent_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `user_id` (`user_id`,`sent_date`)
) ENGINE=InnoDB AUTO_INCREMENT=736601 DEFAULT CHARSET=ascii COLLATE=ascii_bin
This table has 1 million rows. At some point a certain message_store_id suddenly becomes ineffective for some reason, and I'm planning to remove all records with that message_store_id with a single delete statement like
DELETE FROM notifications WHERE message_store_id = 12345;
This single statement affects 10% of the table, since this message was sent to so many users. Meanwhile the notifications table is accessed all the time by thousands of users, so the index must be present. Apparently index recreation is very costly when deleting records, so I'm afraid this will cause downtime by maxing out the server resources. However, if I drop the index, delete the records, and then add the index again, I have to shut down the database for some time, which unfortunately is not possible for our service.
I wish MySQL 5.6 were not so stupid that this single statement could kill the database, but I guess it's very likely. My question is: is the index recreation really fatal for a case like this? If so, is there any good strategy for this operation that doesn't require me to halt the database for maintenance?
There are a lot of tricks/strategies you could employ, depending on the details of your application.
If you plan to do these operations on a regular basis (i.e. it's not a one-time thing), AND you have few distinct values in message_store_id, you can use partitions. Partition by the value of message_store_id, create X partitions beforehand (where X is some reasonable cap on the number of values for the id), and then you can delete all the records in that partition in an instant by truncating that partition. A matter of milliseconds. Downside: message_store_id will have to be part of the primary key. Note: you'll have to create the partitions beforehand, because the last time I worked with them, ALTER TABLE ... ADD PARTITION re-created the entire table, which is a disaster on large tables.
Even if ALTER TABLE ... TRUNCATE PARTITION does not work for you, you can still benefit from partitioning: if you issue a DELETE on the partition, by supplying the corresponding WHERE condition, the rest of the table will not be affected/locked by that DELETE.
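A minimal sketch of the LIST-partitioned variant, with one message_store_id value per partition (names and values are illustrative, and every id that will ever be inserted needs a partition created for it beforehand):
CREATE TABLE notifications_part (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` bigint(20) unsigned NOT NULL,
  `message_store_id` bigint(20) unsigned NOT NULL,
  `status` varchar(10) COLLATE ascii_bin NOT NULL,
  `sent_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`, `message_store_id`),   -- partition column must be in every unique key
  KEY `user_id` (`user_id`, `sent_date`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii COLLATE=ascii_bin
PARTITION BY LIST (message_store_id) (
  PARTITION p12344 VALUES IN (12344),
  PARTITION p12345 VALUES IN (12345),
  PARTITION p12346 VALUES IN (12346)
);

-- Removing every row for message_store_id = 12345 becomes a fast metadata operation:
ALTER TABLE notifications_part TRUNCATE PARTITION p12345;

-- Even a plain DELETE benefits from partition pruning:
DELETE FROM notifications_part WHERE message_store_id = 12345;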
Alternative way of deleting records without locking the DB for too long:
while (true) {
// assuming autocommit mode
delete from table where {your condition} limit 10000;
// at this moment locks are released and other transactions have a chance
// to do some stuff.
if (affected rows == 0) {
break;
}
// This is a good place to insert sleep(5) to give other transactions
// more time to do their stuff before the next chunk gets deleted.
}
One option is to perform the delete as several smaller operations, rather than one huge operation.
MySQL provides a LIMIT clause, which will limit the number of rows matched by the query.
For example, you could delete just 1000 rows:
DELETE FROM notifications WHERE message_store_id = 12345 LIMIT 1000;
You could repeat that, leaving a suitable window of time for other operations (competing for
locks on the same table) to complete. To handle this in pure SQL, we can use the MySQL SLEEP() function, to pause for 2 seconds, for example:
SELECT SLEEP(2);
And obviously, this can be incorporated into a loop, in a MySQL procedure, for example, continuing to loop until the DELETE statement affects zero rows.
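A rough sketch of such a procedure (untested; the batch size and sleep interval are arbitrary, and the procedure name is made up):
DELIMITER //
CREATE PROCEDURE purge_message_store(IN p_msg_id BIGINT UNSIGNED)
BEGIN
  DECLARE affected INT DEFAULT 1;
  WHILE affected > 0 DO
    DELETE FROM notifications
     WHERE message_store_id = p_msg_id
     LIMIT 1000;                 -- small chunks keep lock times short
    SET affected = ROW_COUNT();  -- rows removed by the DELETE above
    DO SLEEP(2);                 -- give competing transactions a chance
  END WHILE;
END //
DELIMITER ;

CALL purge_message_store(12345);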
I have come up with a total of three different, equally viable methods for saving data for a graph.
The graph in question is "player's score in various categories over time". Categories include "buildings", "items", "quest completion", "achievements" and so on.
Method 1:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`date` DATE NOT NULL,
`category` ENUM('buildings','items',...) NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`userid`, `date`, `category`),
INDEX `userid` (`userid`),
INDEX `date` (`date`)
) ENGINE=InnoDB
This table contains one row for each user/date/category combination. To show a user's data, select by userid. Old entries are cleared out by:
DELETE FROM `graphdata` WHERE `date` < DATE_ADD(NOW(),INTERVAL -1 WEEK)
Method 2:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`buildings-1day` FLOAT UNSIGNED NOT NULL,
`buildings-2day` FLOAT UNSIGNED NOT NULL,
... (and so on for each category, up to `-7day`),
PRIMARY KEY (`userid`)
)
Selecting by user id is faster due to being a primary key. Every day scores are shifted down the fields, as in:
... SET `buildings-3day`=`buildings-2day`, `buildings-2day`=`buildings-1day`...
Entries are not deleted (unless a user deletes their account). Rows can be added/updated with an INSERT...ON DUPLICATE KEY UPDATE query.
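For what it's worth, that upsert might look roughly like this for one day's buildings score (values are placeholders, the hyphenated column names need backticks, and the other day-columns are assumed to have defaults or be listed as well):
INSERT INTO graphdata (userid, `buildings-1day`)
VALUES (42, 117.5)
ON DUPLICATE KEY UPDATE `buildings-1day` = VALUES(`buildings-1day`);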
Method 3:
Use one file for each user, containing a JSON-encoded array of their score data. Since the data is being fetched by an AJAX JSON call anyway, this means the file can be fetched statically (and even cached until the following midnight) without any stress on the server. Every day the server runs through each file, shift()s the oldest score off each array and push()es the new one on the end.
Personally I think Method 3 is by far the best, however I've heard bad things about using files instead of databases - for instance if I wanted to be able to rank users by their scores in different categories, this solution would be very bad.
Out of the two database solutions, I've implemented Method 2 on one of my older projects, and that seems to work quite well. Method 1 seems "better" in that it makes better use of relational databases and all that stuff, but I'm a little concerned in that it will contain (number of users) * (number of categories) * 7 rows, which could turn out to be a big number.
Is there anything I'm missing that could help me make a final decision on which method to use? 1, 2, 3 or none of the above?
If you're going to use a relational db, method 1 is much better than method 2. It's normalized, so it's easy to maintain and search. I'd change the date field to a timestamp and call it added_on (or something that's not a reserved word like 'date' is). And I'd add an auto_increment primary key score_id so that user_id/date/category doesn't have to be unique. That way, if a user managed to increment his building score twice in the same second, both would still be recorded.
The second method requires you to update all the records every day. The first method only does inserts, no updates, so each record is only written to once.
... SET buildings-3day=buildings-2day, buildings-2day=buildings-1day...
You really want to update every single record in the table every day until the end of time?!
Selecting by user id is faster due to being a primary key
Since user_id is the first field in your Method 1 primary key, it will be similarly fast for lookups. As first field in a regular index (which is what I've suggested above), it will still be very fast.
The idea with a relational db is that each row represents a single instance/action/occurrence. So when a user does something to affect his score, do an INSERT that records what he did. You can always create a summary from data like this. But you can't get this kind of data from a summary.
Secondly, you seem unduly concerned about getting rid of old data. Why? Your select queries would have a date range on them that would exclude old data automatically. And if you're concerned about performance, you can partition your tables based on row age or set up a cronjob to delete old records periodically.
ETA: Regarding JSON stored in files
This seems to me to combine the drawbacks of Method 2 (difficult to search, every file must be updated every day) with the additional drawbacks of file access. File accesses are expensive. File writes are even more so. If you really want to store summary data, I'd run a query only when the data is requested and I'd store the results in a summary table by user_id. The table could hold a JSON string:
CREATE TABLE score_summaries(
user_id INT unsigned NOT NULL PRIMARY KEY,
gen_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
json_data TEXT NOT NULL  -- TEXT columns can't take a literal DEFAULT in MySQL, so '{}' is supplied by the INSERT
);
For example:
Bob (user_id=7) logs into the game for the first time. He's on his profile page which displays his weekly stats. These queries ran:
SELECT json_data FROM score_summaries
WHERE user_id=7
AND gen_date > DATE_SUB(CURDATE(), INTERVAL 1 DAY);
//returns nothing so generate summary record
SELECT DATE(added_on), category, SUM(score)
FROM scores WHERE user_id=7 AND added_on < CURDATE() AND added_on >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY DATE(added_on), category; //never include today's data, encode as json with php
INSERT INTO score_summaries(user_id, json_data)
VALUES(7, '$json') //from PHP, in this case $json == NULL
ON DUPLICATE KEY UPDATE json_data=VALUES(json_data)
//use $json for presentation too
Today's scores are generated as needed and not stored in the summary. If Bob views his scores again today, the historical ones can come from the summary table or could be stored in a session after the first request. If Bob doesn't visit for a week, no summary needs to be generated.
Method 1 seems like a clear winner to me. If you are concerned about the size of the single table (graphdata) being too big, you could reduce it by creating
CREATE TABLE `graphdata` (
`graphDataId` INT UNSIGNED NOT NULL,
`categoryId` INT NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`graphDataId`)
) ENGINE=InnoDB
Then create two more tables, because you obviously need to have info connecting graphDataId with userId:
CREATE TABLE `graphDataUser` (
  `graphDataId` INT UNSIGNED NOT NULL,
  `userId` INT NOT NULL
) ENGINE=InnoDB
and a table connecting graphDataId with its date:
CREATE TABLE `graphDataDate` (
  `graphDataId` INT UNSIGNED NOT NULL,
  `graphDataDate` DATE NOT NULL
) ENGINE=InnoDB
I think you don't really need to worry about the number of rows a table contains, because most DBMSs do a good job of handling large row counts. Your job is only to store the data in a way that makes it easy to retrieve, no matter what task the data is retrieved for. Following that advice should pay off in the long run.
I have this query:
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC
The purpose is to select data for drawing a graph; I use the average over a pixel's worth of data to make the graph faithful to the data.
So far optimization has included adding indexes and switching between MyISAM and InnoDB, but no luck.
Since the time interval changes with graph zoom and the period of data collection, I cannot make a separate column for the GROUP BY statement; the query, however, is slow. Does anyone have ideas for optimizing this query or the table to make this grouping faster? I currently have indexes on the timestamp, sensor_id and project_id columns, but the timestamp index is not used.
When running explain extended with the query I get the following:
id  select_type  table          type  possible_keys                       key               key_len  ref                               rows   filtered  Extra
1   SIMPLE       sensor_locass  ref   sensor_id_lookup,project_id_lookup  sensor_id_lookup  4        const                             2      100.00    Using where; Using temporary; Using filesort
1   SIMPLE       sensor_data    ref   idsensor_lookup                     idsensor_lookup   4        webstech.sensor_locass.sensor_id  66857  100.00
The sensor_data table currently contains 2.7 million data points, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments or solutions would be most welcome.
EDIT: table definitions:
CREATE TABLE `sensor_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`gateway_id` int(11) NOT NULL,
`timestamp` int(10) NOT NULL,
`v1` int(11) NOT NULL,
`v2` int(11) NOT NULL,
`v3` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`temp` decimal(5,3) NOT NULL,
`oxygen` decimal(5,3) NOT NULL,
`batVol` decimal(4,3) NOT NULL,
PRIMARY KEY (`id`),
KEY `gateway_id` (`gateway_id`),
KEY `time_lookup` (`timestamp`),
KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1
CREATE TABLE `sensor_locass` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`project_id` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`start` date NOT NULL,
`end` date NOT NULL,
`multT` decimal(6,3) NOT NULL,
`conT` decimal(6,3) NOT NULL,
`multO` decimal(6,3) NOT NULL,
`conO` decimal(6,3) NOT NULL,
`xpos` decimal(4,2) NOT NULL,
`ypos` decimal(4,2) NOT NULL,
`lat` decimal(9,6) NOT NULL,
`lon` decimal(9,6) NOT NULL,
`isRef` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `sensor_id_lookup` (`sensor_id`),
KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1
Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't going to change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either, since nothing is querying on the timestamp, just calculating with it (unless you can restrict on the timestamp values, as galymzhan suggests).
Oh, and you can drop the LEFT in your query, since matching on project_id makes it irrelevant anyway (but it doesn't slow anything down). And please don't interpolate variables directly into a query if those variables come from customer input, to avoid $project_id = "'; DROP TABLES; --" type SQL injection exploits.
Adjusting your heap sizes could work for a while but you'll have to continue adjusting it if you need to scale.
The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value for $secondInterval, which I'm assuming isn't very plausible given the flexibility that you said you needed. In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods of time, but if you look back further, the data has been effectively lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc.). You could customize those averages initially, but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for an hour of time occurring six months ago).
So you'll have to decide whether such restrictions are reasonable given your requirements.
If not, I would then argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP, rather than in your query. As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause is forcing MySQL to crunch all the data in memory, but since it's too much data it's actually writing that data temporarily to disk (hence the "Using filesort"). However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you could pull the data out row by row, combining it on the fly and thereby never needing to keep all the rows in memory in your PHP process. You could then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, then make sure you use FORCE INDEX FOR ORDER BY (timestamp).
I'd suggest that you find a natural primary key for your tables and switch to InnoDB. This is a guess at what your data looks like:
sensor_data:
PRIMARY KEY (sensor_id, timestamp)
sensor_locass:
PRIMARY KEY (sensor_id, project_id)
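If those guesses fit the data (in particular, no two readings share the same sensor_id and timestamp), the switch could be expressed roughly as follows; it rebuilds both tables, so it will take a while on 2.7 million rows:
-- Keep the old id column (AUTO_INCREMENT still needs some index on it),
-- but cluster the rows on the natural key and switch engines.
ALTER TABLE sensor_data
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (sensor_id, `timestamp`),
  ADD KEY (id),
  ENGINE = InnoDB;   -- the old idsensor_lookup key becomes redundant and could be dropped

ALTER TABLE sensor_locass
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (sensor_id, project_id),
  ADD KEY (id),
  ENGINE = InnoDB;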
InnoDB will order all the data this way, so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble. If you can keep it below the size where it switches over to a file sort (tmp_table_size and max_heap_table_size), it'll be much faster.
How many rows are you generally returning? How long is it taking now?
As Joshua suggested, you should define (sensor_id, project_id) as the primary key for the sensor_locass table, because at the moment the table has two separate indexes, one on each column. According to the MySQL docs, the SELECT will choose only one of them (the most restrictive one, which finds fewer rows), while a composite primary key lets both columns be used for indexing data.
However, EXPLAIN shows that MySQL examined 66857 rows of the joined table, so you should somehow optimize that too. Maybe you could query sensor data for a given interval of time, like timestamp BETWEEN begin AND end?
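Applied to the original query, that restriction might look like this ($begin and $end would come from the application, matching the zoom window; a composite (sensor_id, timestamp) index on sensor_data would serve it best):
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
       FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
  AND project_id = '$project'
  AND timestamp BETWEEN $begin AND $end
GROUP BY meh
ORDER BY timestamp ASC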
I agree that the first step should be to define sensor_id, project_id as primary key for sensor_locass.
If that is not enough and your data is relatively static, you can create an aggregated table that you refresh, for example, every day, and then query from there.
What you still have to do is define a range of values for secondInterval, store them in a new table, and add that field to the primary key of your aggregated table.
The query to populate the aggregated table will be something like this:
INSERT INTO aggregated_sensor_data (sensor_id,project_id,secondInterval,timestamp,temp,meh)
SELECT
sensor_locass.sensor_id,
sensor_locass.project_id,
secondInterval,
timestamp,
ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/secondInterval) as meh
FROM
sensor_locass
LEFT JOIN sensor_data
USING(sensor_id)
LEFT JOIN secondIntervalRange
ON 1 = 1
WHERE
sensor_id = '$id'
AND
project_id = '$project'
GROUP BY
sensor_locass.sensor_id,
sensor_locass.project_id,
meh
ORDER BY
timestamp ASC
And you can use this query to extract the aggregated data:
SELECT
temp,
meh
FROM
aggregated_sensor_data
WHERE
sensor_id = '$id'
AND project_id = '$project'
AND secondInterval = $secondInterval
ORDER BY
timestamp ASC
If you want to use the timestamp index, you will have to tell MySQL explicitly to use it. MySQL 5.1 supports USE INDEX FOR ORDER BY / FORCE INDEX FOR ORDER BY. Have a look at it here: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
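With the index names from the table definitions above, the hint goes right after the table name, something like the following (whether it actually helps is something EXPLAIN would have to confirm):
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
       FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data FORCE INDEX FOR ORDER BY (time_lookup)
  USING(sensor_id)
WHERE sensor_id = '$id'
  AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC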
I have a table like this:
CREATE TABLE UserTrans (
`id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`transaction_id` varchar(255) NOT NULL default '0',
`source` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
)
with the InnoDB engine.
The transaction_id is a varchar because sometimes it can be alphanumeric.
The id is the primary key.
So, here is the thing: I have over 1M records. However, there is a query to check for a duplicate transaction_id on the specified source. Here is my query:
SELECT *
FROM UserTrans
WHERE transaction_id = '212398043'
AND source = 'COMPANY_A';
This query is getting very slow, taking about 2 seconds to run now. Should I index the transaction_id and the source?
e.g. KEY join_id (transaction_id, source)
What is the drawback if I do that?
Obviously the benefit is that it will improve the performance of certain queries.
The drawback is that it will take a bit of space to store the index and a bit of work for the RDBMS to maintain the index. The index is especially prone to consume space because your transaction_id is such a wide string.
You might consider whether transaction_id really needs to be up to 255 characters long, or if you could declare its max length to be something shorter.
Or you could use a prefix index to index only the first n characters:
CREATE INDEX join_id ON UserTrans (transaction_id(16), source(16));
#Daniel has a good point that you might get the same benefit and save even more space by indexing only one column. Since you're doing SELECT * you've ruled out the benefit of a covering index.
Also if you intend transaction_id to be unique, why not constrain it to be unique?
CREATE UNIQUE INDEX uq_transaction_id ON UserTrans (transaction_id(16));
The main drawback is that the new index will take up space on your disks. It will also make inserts and updates a little bit slower (but this is often negligible in most situations).
On the other hand, your query will probably run in just a few milliseconds instead of 2 seconds.
The drawbacks to adding indices are space (since storing indexes does take up space) and insert time (since when you insert new records, they have to be added to the indices).
That said, you may not need to index both fields - just indexing one of them may be enough.
I would think about ditching your id column and using transaction_id as your primary key.
I am assuming that transaction_id is unique.
This will mean that your schema prevents you from inserting a transaction_id that is already there.
This reduces the amount of data being stored, and also reduces the number of columns that need to be indexed.
If source (the company) and transaction_id are in fact a composite key, I would make the two columns the primary key, as sketched below.
Your current schema allows you to put in duplicates, which is an unnecessary evil.
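If that assumption holds, and nothing else references UserTrans.id, the change might look like this (it will fail if duplicate pairs already exist, and with a multi-byte charset the combined key length may need trimming):
ALTER TABLE UserTrans
  DROP COLUMN id,
  ADD PRIMARY KEY (transaction_id, source);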