I have a MySQL table with the following configuration:
CREATE TABLE `MONITORING` (
`REC_ID` int(20) NOT NULL AUTO_INCREMENT,
`TIME` int(11) NOT NULL,
`DEVICE_ID` varchar(30) COLLATE utf8_unicode_ci NOT NULL,
`MON_ID` varchar(10) COLLATE utf8_unicode_ci NOT NULL,
`TEMPERATURE` float NOT NULL,
`HUMIDITY` float NOT NULL,
PRIMARY KEY (`REC_ID`),
KEY `SelectQueryIndex` (`TIME`,`MON_ID`))
ENGINE=MyISAM AUTO_INCREMENT=102069 DEFAULT CHARSET=utf8
COLLATE=utf8_unicode_ci
Multiple Monitoring Devices send data, always exactly on the minute, but not all monitors are always online. I am using PHP to query the database and format the data to put into a Google Line Chart.
To get the data into the Google Chart I am running a SELECT Query which is giving the results with all of the MON_ID's on a single line.
The Query I am currently using is:
SELECT `TIME`, `H5-C-T`, `P-C-T`, `H5-C-H`, `P-C-H`, `A-T`, `A-H` FROM
(SELECT `TIME`, `TEMPERATURE` as 'H5-C-T', `HUMIDITY` as 'H5-C-H' FROM `MONITORING` where `MON_ID` = 'H5-C') AS TAB_1,
(SELECT `TIME` as `TIME2`, `TEMPERATURE` as 'P-C-T', `HUMIDITY` as 'P-C-H' FROM `MONITORING` where `MON_ID` = 'P-C') AS TAB_2,
(SELECT `TIME` as `TIME3`, `TEMPERATURE` as 'A-T', `HUMIDITY` as 'A-H' FROM `MONITORING` where `MON_ID` = 'Ambient') AS TAB_3
WHERE TAB_1.TIME = TAB_2.TIME2 AND TAB_1.TIME = TAB_3.TIME3
The results are exactly what I want (a table with TIME and then a Temp and RH column for each of the three monitors), but it seems like the query is taking a lot longer than it should to give the results.
Opening the full table, or selecting all rows of just one monitoring device takes about 0.0006 seconds (can't ask for much better than that).
If I do the query with 2 of the monitoring devices it takes about 0.09 seconds (still not bad, but a pretty big percentage increase).
When I put in the third monitoring device the query goes up to about 2.5 seconds (this is okay now, but as more data is collected and more of the devices end up needing to be in charts at one time, it is going to get excessive pretty quick).
I have looked at a lot of posts where people were trying to optimize their queries, but could not find any which were doing the query the same way as me (maybe I am doing it a bad way...). From the other things people have done to improve performance I have tried multiple indexing methods, made sure to check, analyze, and optimize the table in phpMyAdmin, tried several other querying methods, changed the sort field / order of the table, etc., but have not been able to find another way to get the results I need that was any faster.
My table has a total of a little under 100,000 total rows, and it seems like my query speeds are WAY longer than should be expected based off of the many people I saw doing queries on tables with tens of millions of records.
Any recommendations on a way to optimize my query?
Maybe the answer is something like multiple MySQL queries and then somehow merge them together in PHP (I tried to figure out a way to do this, but could not get it to work)?
Flip things inside out; performance will be a lot better:
SELECT h.`TIME`,
h.TEMPERATURE AS 'H5-C-T',
p.TEMPERATURE AS 'P-C-T',
h.HUMIDITY AS 'H5-C-H',
p.HUMIDITY AS 'P-C-H',
a.TEMPERATURE AS 'A-T',
a.HUMIDITY AS 'A-H'
FROM MONITORING AS h
JOIN MONITORING AS p ON h.TIME = p.TIME
JOIN MONITORING AS a ON a.TIME = h.TIME
WHERE h.`MON_ID` = 'H5-C'
AND p.`MON_ID` = 'P-C'
AND a.`MON_ID` = 'Ambient'
And use JOIN...ON syntax.
Is the combination of TIME and MON_ID unique? If so, performance would be even better if you switched to InnoDB, got rid of REC_ID and changed KEY SelectQueryIndex (TIME,MON_ID) into PRIMARY KEY(TIME, MON_ID).
You should also consider switching to InnoDB.
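The question also asks whether the per-monitor results could be fetched separately and merged in application code. A minimal sketch of that merge, written in Python rather than PHP and assuming the rows arrive as (TIME, MON_ID, TEMPERATURE, HUMIDITY) tuples; the function name pivot_by_time and the sample values are hypothetical. Like the join above, it keeps only the minutes for which every requested monitor reported.

```python
from collections import defaultdict

def pivot_by_time(rows, monitors):
    """Pivot (time, mon_id, temperature, humidity) rows into one line per time.

    Only times for which every requested monitor has a reading are kept,
    which mirrors the inner-join behaviour of the SQL query.
    """
    by_time = defaultdict(dict)
    for time, mon_id, temperature, humidity in rows:
        if mon_id in monitors:
            by_time[time][mon_id] = (temperature, humidity)

    table = []
    for time in sorted(by_time):
        readings = by_time[time]
        if len(readings) == len(monitors):
            line = [time]
            for mon_id in monitors:
                line.extend(readings[mon_id])  # temperature, then humidity
            table.append(line)
    return table

# Hypothetical sample data: monitor P-C is offline at time 120.
rows = [
    (60, "H5-C", 20.5, 40.0),
    (60, "P-C", 21.0, 42.0),
    (60, "Ambient", 18.0, 55.0),
    (120, "H5-C", 20.6, 40.5),
    (120, "Ambient", 18.1, 54.0),
]
result = pivot_by_time(rows, ["H5-C", "P-C", "Ambient"])
print(result)
```

A single SELECT with WHERE MON_ID IN (...) would be enough to feed this, so the database does one index-friendly scan and the pivot happens in memory.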
We have a MySQL (MariaDB/Galera) cluster containing several billion unique data points in one table.
We need to migrate that table to a new one, sorting out duplicate entries, which takes a very long time, and we are constrained in that regard. The next step would be to generate reports for a given time window and UUID of the corresponding NAS (a router in the real world/a location) as well as unique IDs (MACs) of users that are recurrent or switch NASes.
The MySQL (MariaDB/Galera) DB right now is about 25GB in size, which should not be an issue. But the queries for reports on UIDs/MACs of users in combination with UUIDs of NASes/locations take a very long time.
The table structure is laid out as depicted here. One is the actual table and two would be a possible optimization. But I really don't know if that would do anything.
Is our DB approach the right one, or should we use a different one (DB, table structure, stack, whatever)? We are open to suggestions.
The query for the migration (which is very slow) is the following:
INSERT INTO `metric_macs`
(`uuid`,`shortname`,`mac`,`start`,`stop`,`duration`)
SELECT uuid, shortname, mac, a, b, duration
FROM import i
ON DUPLICATE KEY UPDATE `metric_macs`.id = `metric_macs`.id
Query for unique users:
SELECT DISTINCT mac FROM `metric_macs` WHERE uuid in ('xxxx','yyyyy') and ( start BETWEEN '2020-01-01' and '2020-02-01' or stop BETWEEN '2020-01-01' and '2020-02-01') ;
Count of all datasets
Query for recurrent users:
SELECT id FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
HAVING COUNT(*) > 1
Count of all datasets
Query for unique location switching users:
SELECT uuid,mac FROM `metric_macs`
WHERE uuid in ('xxxx','yyyyy')
and ( start BETWEEN '2020-01-01' and '2020-02-01'
or stop BETWEEN '2020-01-01' and '2020-02-01')
GROUP BY `mac`, `uuid`
After that PHP is used to count all users with more than two distinct UUIDs.
The list is updated every 15 minutes with a list of UIDs (MACs) that are connected to a NAS; that list is checked for activity of a given UID (MAC) in the last 20 minutes. If there was activity, we update the stop count of the last entries, add 15 minutes, and start the calculation again.
Sorry for the mess. We are fairly new to this kind of report generation. What are the possible ways to optimize the database or queries for near instant reporting?
Thanks!
Edit:
CREATE TABLE `metric_macs` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`uuid` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`shortname` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`mac` varchar(80) COLLATE utf8mb4_unicode_ci NOT NULL,
`start` datetime NOT NULL,
`stop` datetime NOT NULL,
`duration` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `metric_macs_uuid_index` (`uuid`),
KEY `metric_macs_mac_index` (`mac`),
KEY `metric_macs_start_stop_index` (`start`,`stop`),
KEY `metric_macs_uuid_start_stop_index` (`uuid`,`start`,`stop`),
KEY `metric_macs_uuid_stop_index` (`uuid`,`stop`)
) ENGINE=InnoDB AUTO_INCREMENT=357850432 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
It is good to move away from 36-byte ids. However, don't stop at 8 bytes; you probably don't need more than 4 bytes (INT UNSIGNED, max of 4 billion) or 3 (MEDIUMINT UNSIGNED, max of 16M).
While you are at it, remove the dashes and unhex the uuids so they can fit in BINARY(16) (16 bytes).
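As a rough illustration of what "remove the dashes and unhex" means, here is a small Python sketch using the standard uuid module. The helper names uuid_to_bin and bin_to_uuid and the sample value are made up for this example; it does only the plain dash removal and hex decoding, not any byte reordering.

```python
import uuid

def uuid_to_bin(text):
    """Turn a 36-character UUID string into the 16 raw bytes for BINARY(16)."""
    return uuid.UUID(text).bytes

def bin_to_uuid(raw):
    """Turn 16 raw bytes back into the dashed, lowercase text form."""
    return str(uuid.UUID(bytes=raw))

sample = "6ccd780c-baba-1026-9564-5b8c656024db"  # hypothetical value
packed = uuid_to_bin(sample)

print(len(sample), "characters ->", len(packed), "bytes")
print(packed.hex())
```

The same value shrinks from 36 characters to 16 bytes, and every secondary index that carries the column shrinks along with it.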
I think you have 3 problems to tackle in the conversion:
Efficiently changing the current schema to a better one. Assuming this has old, unchanging, rows, you can do this in the background.
Quickly finishing the final step. (We will actually do this last.)
Changing the ingestion to the new format.
Step 0: Grab the latest timestamp so you know where to do steps 2 and 3 after spending time doing step 1.
Step 1: To build users and stations, it might be simply
INSERT INTO users (user_id)
SELECT UUID2BIN(userID)
FROM ( SELECT DISTINCT userID FROM log ) AS d;
(and similarly for stations)
See this for converting uuids: http://mysql.rjweb.org/doc.php/uuid
That may take some time, but it does the de-dup efficiently.
Let me discuss step 3 before filling in step 2.
Step 3: If the ingestion rate is "high", see this for details on a ping-ponging staging table and for bulk normalization, etc:
http://mysql.rjweb.org/doc.php/staging_table
However, your ingestion rate might be not that fast. Do not use IODKU with a trick of using LAST_INSERT_ID to get the id from users and stations. It will "burn" ids and threaten to overflow your INT/MEDIUMINT id. Instead, see the link above.
Inserting into time_table, if no more than 100 per second (HDD) or 1000 per second (SSD), can be a simple INSERT while you get the necessary ids
INSERT INTO time_table (user_id, station_id, start_time, stop_time)
VALUES (
( SELECT id FROM users WHERE userID = uuid2bin('...') ),
( SELECT id FROM stations WHERE userID = uuid2bin('...') ),
'...', '...'
);
Back to step 2. You have saved a bunch of rows in the old table. And you saved the starting date for those. So do the bulk normalization and mass insert from log as if it were a "staging table" as discussed in my link.
That should allow you to convert with zero downtime and only a small amount of time when the new table is "incomplete".
I have not covered why the "reports take a long time". I need to see the SELECTs. Meanwhile, here are two thoughts:
If you build the new INT-like ids, sort them by date so that they are at least somewhat chronologically ordered, and therefore better clustered for some types of queries.
In general, building and maintaining a "summary table" allows reports to be run much faster. See http://mysql.rjweb.org/doc.php/summarytables
"Query for recurrent users:" has multiple query performance problems. Unless my approach is not adequate, I don't want to get into the details.
I have 2 tables: music and listenTrack. listenTrack tracks the unique plays of each song. I am trying to get results for popular songs of the month. I'm getting my results but they are just taking too long. Below are my tables and query.
430,000 rows
CREATE TABLE `listentrack` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sessionId` varchar(50) NOT NULL,
`url` varchar(50) NOT NULL,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`ip` varchar(150) NOT NULL,
`user_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=731306 DEFAULT CHARSET=utf8
12500 rows
CREATE TABLE `music` (
`music_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`title` varchar(50) DEFAULT NULL,
`artist` varchar(50) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`genre` int(4) DEFAULT NULL,
`file` varchar(255) NOT NULL,
`url` varchar(50) NOT NULL,
`allow_download` int(2) NOT NULL DEFAULT '1',
`plays` bigint(20) NOT NULL,
`downloads` bigint(20) NOT NULL,
`faved` bigint(20) NOT NULL,
`dateadded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`music_id`)
) ENGINE=MyISAM AUTO_INCREMENT=15146 DEFAULT CHARSET=utf8
SELECT COUNT(listenTrack.url) AS total, listenTrack.url
FROM listenTrack
LEFT JOIN music ON music.url = listenTrack.url
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY listenTrack.url
ORDER BY total DESC
LIMIT 0,10
This query isn't very complex and the rows aren't too large, I don't think.
Is there any way to speed this up? Or can you suggest a better solution? This is going to be a cron job at the beginning of every month, but I would also like to do by-the-day results as well.
Oh, by the way, I am running this locally and it takes over 4 min to run, but on prod it takes about 45 secs.
I'm more of a SQL Server guy but these concepts should apply.
I'd add indexes:
On ListenTrack, add an index with url, and date_created
On Music, add an index with url
These indexes should speed the query up tremendously (I originally had the table names mixed up - fixed in the latest edit).
For the most part you should also index any column that is used in a JOIN. In your case, you should index both listentrack.url and music.url
@jeff s - An index on listentrack.date_created wouldn't help because you are running that column through a function first, so MySQL cannot use an index on it. Often, you can rewrite a query so that the indexed column is compared directly against constants, like:
DATEDIFF(DATE(date_created),'2009-08-15') = 0
becomes
date_created >= '2009-08-15' and date_created < '2009-08-16'
This will filter down records that are from 2009-08-15 and allow any indexes on that column to be candidates. Note that MySQL might NOT use that index, it depends on other factors.
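A half-open range for one calendar day runs from midnight of that day up to, but not including, midnight of the next day. The sketch below computes those bounds and checks them against a tiny table; it uses Python's sqlite3 purely as a stand-in for MySQL, and the helper day_bounds, the index name and the sample rows are assumptions for illustration.

```python
import sqlite3
from datetime import date, timedelta

def day_bounds(day):
    """Return (start, end) strings for the half-open range covering one day."""
    start = day.isoformat() + " 00:00:00"
    end = (day + timedelta(days=1)).isoformat() + " 00:00:00"
    return start, end

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listentrack (url TEXT, date_created TEXT)")
conn.execute("CREATE INDEX idx_date_url ON listentrack (date_created, url)")
conn.executemany(
    "INSERT INTO listentrack VALUES (?, ?)",
    [
        ("song-a", "2009-08-14 23:59:59"),  # day before: excluded
        ("song-a", "2009-08-15 00:00:00"),  # first instant of the day: included
        ("song-b", "2009-08-15 12:30:00"),
        ("song-a", "2009-08-15 23:59:59"),  # last second of the day: included
        ("song-b", "2009-08-16 00:00:00"),  # next day: excluded
    ],
)

start, end = day_bounds(date(2009, 8, 15))
plays = conn.execute(
    "SELECT url, COUNT(*) AS total FROM listentrack "
    "WHERE date_created >= ? AND date_created < ? "
    "GROUP BY url ORDER BY total DESC, url",
    (start, end),
).fetchall()
print(start, end, plays)
```

Because the column itself is compared against two constants, the optimizer is free to do a range scan on an index that starts with date_created; whether it actually does so is still worth confirming with EXPLAIN.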
Your best bet is to make a dual index on listentrack(url, date_created)
and then another index on music.url
These 2 indexes will cover this particular query.
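Note that if you run EXPLAIN on this query you will probably still see "Using temporary; Using filesort", because the rows have to be grouped and then sorted by the computed total, and no index can supply that order. "Using filesort" only means an extra sort pass is needed; it may or may not spill to disk.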
In general you should always run your query under EXPLAIN to get an idea on how MySQL will execute the query and then go from there. See the EXPLAIN documentation:
http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
Try creating an index that will help with the join:
CREATE INDEX idx_url ON music (url);
I think I might have missed the obvious before. Why are you joining the music table at all? You do not appear to be using the data in that table, and you are performing a left join which is not required, right? I think having this table in the query will make it much slower and will not add any value. Take all references to music out, unless only urls that exist in music should be counted, in which case you need an inner join so that rows without a matching value are excluded.
I would add new indexes, as the others mention. Specifically I would add:
music url
listentrack date_created,url
This will improve your join a ton.
Then I would look at the query: you are forcing the system to perform work on each row of the table. It would be better to rephrase the date restriction as a range.
Something along these lines:
where date_created >= '2009-08-15 00:00:00' and date_created < '2009-08-16 00:00:00'
That should allow it to rapidly use the index to locate the appropriate records. The combined two-column index on listentrack should allow it to find the records based on the date and URL. You should experiment; the columns might be better off going in the other direction (url, date_created) in the index.
The explain plan for this query should say "using index" on the right hand column for both. That means that it will not have to hit the data in the table to calculate your sums.
I would also check the memory settings that you have configured for MySQL. It sounds like you do not have enough memory allocated. Be very careful on the differences between server based settings and thread based settings. The server with a 10MB cache is pretty small, a thread with a 10MB cache can use a lot of memory quickly.
Jacob
Pre-grouping and then joining makes things a lot faster with MySQL/MyISAM. (I suspect less of this is needed with other databases.)
This should perform about as fast as the non-joined version:
SELECT
total, a.url, title
FROM
(
SELECT COUNT(*) as total, url
from listenTrack
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY url
ORDER BY total DESC
LIMIT 0,10
) as a
LEFT JOIN music ON music.url = a.url
;
P.S. - Mapping between the two tables with an id instead of a url is sound advice.
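To see the pre-group-then-join shape on a toy dataset, here is a hedged sketch using Python's sqlite3 as a stand-in for MySQL. The sample rows and the LIMIT of 2 are invented, and the date filter is written as a range instead of DATEDIFF. The inner query reduces listentrack to the top urls first, so the join only has to look up that handful of rows in music.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE music (music_id INTEGER PRIMARY KEY, url TEXT, title TEXT);
CREATE TABLE listentrack (id INTEGER PRIMARY KEY, url TEXT, date_created TEXT);
CREATE INDEX idx_music_url ON music (url);

INSERT INTO music (url, title) VALUES
  ('a', 'Song A'), ('b', 'Song B'), ('c', 'Song C');

INSERT INTO listentrack (url, date_created) VALUES
  ('a', '2009-08-15 10:00:00'),
  ('a', '2009-08-15 11:00:00'),
  ('b', '2009-08-15 12:00:00'),
  ('c', '2009-08-14 09:00:00'),
  ('x', '2009-08-15 13:00:00'),
  ('x', '2009-08-15 14:00:00'),
  ('x', '2009-08-15 15:00:00');
""")

# Group and limit first, then join the few surviving rows to music.
top = conn.execute("""
SELECT a.total, a.url, music.title
FROM (
    SELECT COUNT(*) AS total, url
    FROM listentrack
    WHERE date_created >= '2009-08-15 00:00:00'
      AND date_created <  '2009-08-16 00:00:00'
    GROUP BY url
    ORDER BY total DESC
    LIMIT 2
) AS a
LEFT JOIN music ON music.url = a.url
ORDER BY a.total DESC
""").fetchall()
print(top)
```

The url 'x' has no row in music, so the LEFT JOIN returns it with a NULL title; an inner join would drop it instead.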
Why are you repeating the url in both tables?
Have listentrack hold a music_id instead, and join on that. Gets rid of the text search as well as the extra index.
Besides, it's arguably more correct. You're tracking the times that a particular track was listened to, not the url. What if the url changes?
After you add indexes then you may want to explore adding a new column for the date_created to be a unix_timestamp, which will make math operations quicker.
I am not certain why you have the diff function though, as it appears you are looking for all rows that were updated on a particular date.
You may want to look at your query as it seems to have an error.
If you use unit tests then you can compare the results of your query and a query using a unix timestamp instead.
You might want to add an index to the url field of both tables.
Having said that, when I converted from MySQL to SQL Server 2008, with the same queries and same database structures, the queries ran 1-3 orders of magnitude faster.
I think some of it had to do with the RDBMS (MySQL optimizers are not so good...) and some of it might have had to do with how the RDBMS reserves system resources, although the comparisons were made on production systems where only the DB would run.
This below would probably work to speed up the query.
CREATE INDEX music_url_index ON music (url) USING BTREE;
CREATE INDEX listenTrack_url_index ON listenTrack (url) USING BTREE;
You really need to know the total number of comparisons and row scans that are happening. To get that answer look at the code here of how to do that using explain http://www.siteconsortium.com/h/p1.php?id=mysql002.
I have a simple database to store some emails. Is there any way to get the emails from the database in groups of 100 emails?
Here is the database structure
-- Table structure for table `emails`
CREATE TABLE IF NOT EXISTS `emails` (
`email` varchar(100) NOT NULL,
`ip` varchar(15) NOT NULL,
`timestamp` datetime NOT NULL,
PRIMARY KEY (`email`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
The ideal outcome would be to get the first 100 emails, do something and later get the next 100 and do something else.
Thanks, and sorry for my English.
LIMIT 100
http://dev.mysql.com/doc/refman/5.1/en/select.html#id827984
Then for the next one, LIMIT 100, 100, and then LIMIT 200, 100, and so on.
The first number (when there are two) is the offset.
Use the SQL LIMIT <offset>, <row_count> syntax to paginate via SQL.
Assuming that you are going to be displaying these results on a webpage, I'd also recommend that you use a PHP library like PEAR::Pager to help you out here as well. There is a good tutorial here.
One thing to note, for performance reasons, you should ensure that you are using ORDER BY in your paginated query and that you are using an index. The MySQL Performance blog has a good explanation.
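A minimal sketch of batch-wise reading with ORDER BY plus LIMIT, using Python's sqlite3 as a stand-in for MySQL. The generator name iter_batches and the 250 sample addresses are invented for illustration; LIMIT ? OFFSET ? is the equivalent of the LIMIT <offset>, <row_count> form and is accepted by MySQL as well.

```python
import sqlite3

def iter_batches(conn, batch_size=100):
    """Yield lists of emails, batch_size at a time, in primary-key order."""
    offset = 0
    while True:
        batch = conn.execute(
            "SELECT email FROM emails ORDER BY email LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not batch:
            return
        yield [row[0] for row in batch]
        offset += batch_size

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (email TEXT PRIMARY KEY, ip TEXT, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?)",
    [
        (f"user{i:03d}@example.com", "127.0.0.1", "2010-01-01 00:00:00")
        for i in range(250)
    ],
)

batches = list(iter_batches(conn))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

Ordering by the primary key keeps the batches stable between calls; without an ORDER BY the database is free to return rows in any order, so batches could overlap or skip rows.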
I have this query:
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC
The purpose is to select data for drawing a graph. I use the average over a pixel's worth of data to make the graph faithful to the data.
So far optimization has included adding indexes and switching between MyISAM and InnoDB, but with no luck.
Since the time interval changes with graph zoom and period of data collection, I cannot make a separate column for the GROUP BY statement, and the query is slow. Does anyone have ideas for optimizing this query or the table to make this grouping faster? I currently have an index on the timestamp, sensor_id and project_id columns; the timestamp index is not used, however.
When running explain extended with the query I get the following:
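id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | sensor_locass | ref | sensor_id_lookup,project_id_lookup | sensor_id_lookup | 4 | const | 2 | 100.00 | Using where; Using temporary; Using filesort
1 | SIMPLE | sensor_data | ref | idsensor_lookup | idsensor_lookup | 4 | webstech.sensor_locass.sensor_id | 66857 | 100.00 |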
The sensor_data table contains at the moment 2.7 million data points, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments or solutions would be most welcome.
EDIT table definitions:
CREATE TABLE `sensor_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`gateway_id` int(11) NOT NULL,
`timestamp` int(10) NOT NULL,
`v1` int(11) NOT NULL,
`v2` int(11) NOT NULL,
`v3` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`temp` decimal(5,3) NOT NULL,
`oxygen` decimal(5,3) NOT NULL,
`batVol` decimal(4,3) NOT NULL,
PRIMARY KEY (`id`),
KEY `gateway_id` (`gateway_id`),
KEY `time_lookup` (`timestamp`),
KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1
CREATE TABLE `sensor_locass` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`project_id` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`start` date NOT NULL,
`end` date NOT NULL,
`multT` decimal(6,3) NOT NULL,
`conT` decimal(6,3) NOT NULL,
`multO` decimal(6,3) NOT NULL,
`conO` decimal(6,3) NOT NULL,
`xpos` decimal(4,2) NOT NULL,
`ypos` decimal(4,2) NOT NULL,
`lat` decimal(9,6) NOT NULL,
`lon` decimal(9,6) NOT NULL,
`isRef` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `sensor_id_lookup` (`sensor_id`),
KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1
Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't gonna change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either since nothing is querying the timestamp, just calculating on it (unless you can restrict on the timestamp values as galymzhan suggests).
Oh, and you can drop the LEFT in your query, since matching on project_id makes it irrelevant anyway (but doesn't slow anything down). And please don't interpolate variables directly into a query if those variables come from customer input to avoid $project_id = "'; DROP TABLES; --" type sql injection exploits.
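To illustrate the point about not interpolating variables, here is a small sketch with bound parameters. It uses Python's sqlite3 rather than PHP and MySQL, and the function name sensors_for_project and the sample rows are hypothetical; the same idea applies to prepared statements in PDO or mysqli.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_locass (sensor_id INTEGER, project_id INTEGER)")
conn.executemany(
    "INSERT INTO sensor_locass VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20)],
)

def sensors_for_project(conn, project_id):
    """Look up sensor ids with the value bound as a parameter, not pasted in."""
    rows = conn.execute(
        "SELECT sensor_id FROM sensor_locass "
        "WHERE project_id = ? ORDER BY sensor_id",
        (project_id,),
    ).fetchall()
    return [row[0] for row in rows]

safe = sensors_for_project(conn, 10)
# The hostile string is treated as a plain value, so it simply matches nothing.
hostile = sensors_for_project(conn, "'; DROP TABLE sensor_locass; --")
remaining = conn.execute("SELECT COUNT(*) FROM sensor_locass").fetchone()[0]
print(safe, hostile, remaining)
```

The driver sends the SQL text and the values separately, so nothing in the value can change the structure of the statement.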
Adjusting your heap sizes could work for a while but you'll have to continue adjusting it if you need to scale.
The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value for $secondInterval, which I'm assuming isn't very plausible given the flexibility that you said you needed. In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods of time, but if you look back further, the data has been effectively lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc). You could customize those averages initially, but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for one hour of time occurring six months ago).
So you'll have to decide whether such restrictions are reasonable given your requirements.
If not, I would then argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP, rather than in your query. As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause is forcing MySQL to crunch all the data in memory, but since it's too much data, it's actually writing that data temporarily to disk (hence the "Using temporary; Using filesort"). However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you could pull the data out row by row, combining it on the fly and thereby never needing to keep all the rows in memory in your PHP process. You could then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, then make sure you use FORCE INDEX FOR ORDER BY (timestamp).
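The row-by-row combining described above could look roughly like this. The sketch is in Python rather than PHP, and the function name bucket_averages, its mult and const parameters (standing in for multT and conT) and the sample readings are assumptions; it expects rows already ordered by timestamp and holds only one bucket's running sum at a time.

```python
from itertools import groupby

def bucket_averages(rows, second_interval, mult=1.0, const=0.0):
    """Average (timestamp, temp) rows per time bucket, streaming.

    rows must be ordered by timestamp ascending, e.g. from
    SELECT timestamp, temp ... ORDER BY timestamp.
    Yields (bucket, value) with value = AVG(temp) * mult + const.
    """
    def bucket_of(row):
        return row[0] // second_interval

    for bucket, group in groupby(rows, key=bucket_of):
        total = 0.0
        count = 0
        for _, temp in group:
            total += float(temp)
            count += 1
        yield bucket, round(total / count * mult + const, 2)

# Hypothetical readings: (unix timestamp, temperature)
readings = [(0, 10.0), (30, 12.0), (60, 20.0), (90, 22.0), (150, 30.0)]
averaged = list(bucket_averages(readings, 60))
print(averaged)
```

Since groupby consumes its input lazily, the rows can come straight from an unbuffered database cursor and the full result set never has to sit in memory.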
I'd suggest that you find a natural primary key to your tables and switch to InnoDB. This a guess at what your data looks like:
sensor_data:
PRIMARY KEY (sensor_id, timestamp)
sensor_locass:
PRIMARY KEY (sensor_id, project_id)
InnoDB will order all the data in this way, so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble. If you can keep it below the size where it switches over to an on-disk temporary table (tmp_table_size and max_heap_table_size), it'll be much faster.
How many rows are you generally returning? How long is it taking now?
As Joshua suggested, you should define (sensor_id, project_id) as a primary key for the sensor_locass table, because at the moment the table has 2 separate indexes, one on each of the columns. According to the MySQL docs, SELECT will choose only one of them (the most restrictive, which finds fewer rows), while a composite primary key allows both columns to be used for the lookup.
However, EXPLAIN shows that MySQL examined 66857 rows on the joined table, so you should somehow optimize that too. Maybe you could query sensor data for a given interval of time, like timestamp BETWEEN begin AND end?
I agree that the first step should be to define sensor_id, project_id as primary key for sensor_locass.
If that is not enough and your data is relatively static, you can create an aggregated table that you refresh, for example, every day and then query from there.
What you still have to do is to define a range for secondInterval, store that in new table and add that field to the primary key of your aggregated table.
The query to populate the aggregated table will be something like this:
INSERT INTO aggregated_sensor_data (sensor_id,project_id,secondInterval,timestamp,temp,meh)
SELECT
sensor_locass.sensor_id,
sensor_locass.project_id,
secondInterval,
timestamp,
ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/secondInterval) as meh
FROM
sensor_locass
LEFT JOIN sensor_data
USING(sensor_id)
LEFT JOIN secondIntervalRange
ON 1 = 1
WHERE
sensor_id = '$id'
AND
project_id = '$project'
GROUP BY
sensor_locass.sensor_id,
sensor_locass.project_id,
meh
ORDER BY
timestamp ASC
And you can use this query to extract the aggregated data:
SELECT
temp,
meh
FROM
aggregated_sensor_data
WHERE
sensor_id = '$id'
AND project_id = '$project'
AND secondInterval = $secondInterval
ORDER BY
timestamp ASC
If you want to use timestamp index, you will have to tell explicitly to use that index. MySQL 5.1 supports USE INDEX FOR ORDER BY/FORCE INDEX FOR ORDER BY. Have a look at it here http://dev.mysql.com/doc/refman/5.1/en/index-hints.html