Optimizing an SQL query with generated GROUP BY statement - php

I have this query:
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC
The purpose is to select data for drawing a graph. I average over one pixel's worth of data so that the graph stays faithful to the data.
So far, optimization has included adding indexes and switching between MyISAM and InnoDB, but with no luck.
Since the time interval changes with the graph zoom and the period of data collection, I cannot make a separate column for the GROUP BY statement; the query, however, is slow. Does anyone have ideas for optimizing this query or the table to make this grouping faster? I currently have an index on each of the timestamp, sensor_id and project_id columns; the timestamp index is not used, however.
When running explain extended with the query I get the following:
id  select_type  table          type  possible_keys                       key               key_len  ref                               rows   filtered  Extra
1   SIMPLE       sensor_locass  ref   sensor_id_lookup,project_id_lookup  sensor_id_lookup  4        const                             2      100.00    Using where; Using temporary; Using filesort
1   SIMPLE       sensor_data    ref   idsensor_lookup                     idsensor_lookup   4        webstech.sensor_locass.sensor_id  66857  100.00
The sensor_data table currently contains 2.7 million data points, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments or solutions would be most welcome.
EDIT table definitions:
CREATE TABLE `sensor_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`gateway_id` int(11) NOT NULL,
`timestamp` int(10) NOT NULL,
`v1` int(11) NOT NULL,
`v2` int(11) NOT NULL,
`v3` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`temp` decimal(5,3) NOT NULL,
`oxygen` decimal(5,3) NOT NULL,
`batVol` decimal(4,3) NOT NULL,
PRIMARY KEY (`id`),
KEY `gateway_id` (`gateway_id`),
KEY `time_lookup` (`timestamp`),
KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1
CREATE TABLE `sensor_locass` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`project_id` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`start` date NOT NULL,
`end` date NOT NULL,
`multT` decimal(6,3) NOT NULL,
`conT` decimal(6,3) NOT NULL,
`multO` decimal(6,3) NOT NULL,
`conO` decimal(6,3) NOT NULL,
`xpos` decimal(4,2) NOT NULL,
`ypos` decimal(4,2) NOT NULL,
`lat` decimal(9,6) NOT NULL,
`lon` decimal(9,6) NOT NULL,
`isRef` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `sensor_id_lookup` (`sensor_id`),
KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1

Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't gonna change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either since nothing is querying the timestamp, just calculating on it (unless you can restrict on the timestamp values as galymzhan suggests).
Oh, and you can drop the LEFT in your query, since matching on project_id makes it irrelevant anyway (but doesn't slow anything down). And please don't interpolate variables directly into a query if those variables come from customer input to avoid $project_id = "'; DROP TABLES; --" type sql injection exploits.
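To make the injection point concrete, here is a minimal sketch of a parameterized lookup. It is written in Python with an in-memory SQLite table standing in for MySQL, purely so that it is self-contained; the function name find_sensor and the cut-down two-column table are assumptions for illustration. In PHP the same idea is available through PDO or mysqli prepared statements.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; only the placeholder idea carries over.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_locass (sensor_id INTEGER, project_id INTEGER)")
conn.execute("INSERT INTO sensor_locass VALUES (1, 10)")


def find_sensor(conn, sensor_id, project_id):
    """Look up a sensor with bound parameters instead of string interpolation."""
    sql = (
        "SELECT sensor_id, project_id FROM sensor_locass "
        "WHERE sensor_id = ? AND project_id = ?"
    )
    # The driver sends the values separately from the SQL text,
    # so they are only ever treated as data.
    return conn.execute(sql, (sensor_id, project_id)).fetchall()


hostile = "'; DROP TABLE sensor_locass; --"
print(find_sensor(conn, 1, 10))       # ordinary lookup
print(find_sensor(conn, 1, hostile))  # hostile text is compared as a value, not executed
```

The hostile string simply fails to match any project_id, and the table is still there afterwards.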
Adjusting your heap sizes could work for a while but you'll have to continue adjusting it if you need to scale.
The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value for $secondInterval, which I'm assuming isn't very plausible given the flexibility that you said you needed. In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods of time, but if you look back further, the data has been effectively lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc). You could customize those averages initially, but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for one hour of time occurring six months ago).
So you'll have to decide whether such restrictions are reasonable given your requirements.
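As a rough illustration of that trade-off, the sketch below maps the age of a data point to the finest resolution still stored, using the example tiers mentioned above. The names TIERS and step_for_age are hypothetical, a month is assumed to be 30 days, and this is not how rrdtool itself is configured; it only shows the shape of the restriction.

```python
DAY = 86400  # seconds in a day

# (maximum age in seconds, step in seconds), taken from the example tiers:
# per second for a day, per minute for a week, per hour for a month.
TIERS = [
    (1 * DAY, 1),
    (7 * DAY, 60),
    (30 * DAY, 3600),  # assumption: a "month" is treated as 30 days
]


def step_for_age(age_seconds):
    """Return the finest step still stored for data this old, or None if expired."""
    for max_age, step in TIERS:
        if age_seconds <= max_age:
            return step
    return None


print(step_for_age(3600))       # data from an hour ago
print(step_for_age(20 * DAY))   # data from three weeks ago
print(step_for_age(180 * DAY))  # data from six months ago
```

Anything older than the last tier comes back as None, which is exactly the "cannot zoom in on six months ago" case.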
If not, I would then argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP, rather than in your query. As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause forces MySQL to crunch all the data in memory, but since it's too much data, it's actually writing that data temporarily to disk (hence the "Using temporary; Using filesort" in the EXPLAIN output). However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you could pull the data out row by row, combining it on the fly and thereby never needing to keep all the rows in memory in your PHP process. You could then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, make sure you use FORCE INDEX FOR ORDER BY with the timestamp index, i.e. FORCE INDEX FOR ORDER BY (time_lookup).
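The row-by-row combining could look roughly like the sketch below. It is in Python rather than PHP so that it can be run on its own; the function name bucket_averages and the (timestamp, temp) row shape are assumptions, and the rows are assumed to arrive already ordered by timestamp (for example from an unbuffered cursor). The multT/conT scaling from the original query would be applied to each yielded average afterwards.

```python
def bucket_averages(rows, second_interval):
    """Yield (bucket, average_temp) from rows already ordered by timestamp.

    rows: iterable of (timestamp, temp) pairs, e.g. a database cursor.
    Only the running sum for the current bucket is kept in memory.
    """
    current, total, count = None, 0.0, 0
    for timestamp, temp in rows:
        # Same bucketing idea as FLOOR(timestamp / $secondInterval) in the query.
        bucket = timestamp // second_interval
        if bucket != current:
            if count:
                yield current, total / count
            current, total, count = bucket, 0.0, 0
        total += temp
        count += 1
    if count:
        yield current, total / count


sample = [(0, 10.0), (5, 20.0), (10, 30.0), (19, 50.0), (40, 7.0)]
for bucket, avg in bucket_averages(sample, 10):
    print(bucket, avg)
```

Because it is a generator, the caller can draw each point as it arrives instead of collecting the whole result first.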

I'd suggest that you find a natural primary key for your tables and switch to InnoDB. This is a guess at what your data looks like:
sensor_data:
PRIMARY KEY (sensor_id, timestamp)
sensor_locass:
PRIMARY KEY (sensor_id, project_id)
InnoDB will order all the data in this way, so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble. If you can keep it below the size where it switches over to an on-disk temporary table (tmp_table_size and max_heap_table_size), it'll be much faster.
How many rows are you generally returning? How long is it taking now?

As Joshua suggested, you should define (sensor_id, project_id) as the primary key for the sensor_locass table, because at the moment the table has two separate indexes, one on each of the columns. According to the MySQL docs, a SELECT will choose only one of them (the most restrictive, i.e. the one that finds the fewest rows), while a composite primary key allows both columns to be used for the lookup.
However, EXPLAIN shows that MySQL examined 66857 rows on the joined table, so you should somehow optimize that too. Maybe you could query the sensor data for a given interval of time, like timestamp BETWEEN begin AND end?

I agree that the first step should be to define sensor_id, project_id as primary key for sensor_locass.
If that is not enough and your data is relatively static, you can create an aggregated table that you refresh, for example, every day, and then query from there.
What you still have to do is to define a range for secondInterval, store that in new table and add that field to the primary key of your aggregated table.
The query to populate the aggregated table will be something like this:
INSERT INTO aggregated_sensor_data (sensor_id, project_id, secondInterval, timestamp, temp, meh)
SELECT
    sensor_locass.sensor_id,
    sensor_locass.project_id,
    secondInterval,
    timestamp,
    ROUND(AVG(temp)*multT + conT,2) as temp,
    FLOOR(timestamp/secondInterval) as meh
FROM
    sensor_locass
    LEFT JOIN sensor_data
        USING(sensor_id)
    LEFT JOIN secondIntervalRange
        ON 1 = 1  -- pairs every row with every stored interval
WHERE
    sensor_id = '$id'
    AND project_id = '$project'
GROUP BY
    sensor_locass.sensor_id,
    sensor_locass.project_id,
    secondInterval,  -- keep buckets of different intervals apart
    meh
ORDER BY
    timestamp ASC
And you can use this query to extract the aggregated data:
SELECT
temp,
meh
FROM
aggregated_sensor_data
WHERE
sensor_id = '$id'
AND project_id = '$project'
AND secondInterval = $secondInterval
ORDER BY
timestamp ASC

If you want to use the timestamp index, you will have to tell MySQL explicitly to use it. MySQL 5.1 supports USE INDEX FOR ORDER BY/FORCE INDEX FOR ORDER BY. Have a look at it here: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Related

Slow GroupBy query for filter results Laravel [duplicate]

I have 2 tables: one is music and the other is listenTrack. listenTrack tracks the unique plays of each song. I am trying to get results for popular songs of the month. I'm getting my results, but they are just taking too long. Below are my tables and query:
430,000 rows
CREATE TABLE `listentrack` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sessionId` varchar(50) NOT NULL,
`url` varchar(50) NOT NULL,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`ip` varchar(150) NOT NULL,
`user_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=731306 DEFAULT CHARSET=utf8
12500 rows
CREATE TABLE `music` (
`music_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`title` varchar(50) DEFAULT NULL,
`artist` varchar(50) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`genre` int(4) DEFAULT NULL,
`file` varchar(255) NOT NULL,
`url` varchar(50) NOT NULL,
`allow_download` int(2) NOT NULL DEFAULT '1',
`plays` bigint(20) NOT NULL,
`downloads` bigint(20) NOT NULL,
`faved` bigint(20) NOT NULL,
`dateadded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`music_id`)
) ENGINE=MyISAM AUTO_INCREMENT=15146 DEFAULT CHARSET=utf8
SELECT COUNT(listenTrack.url) AS total, listenTrack.url
FROM listenTrack
LEFT JOIN music ON music.url = listenTrack.url
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY listenTrack.url
ORDER BY total DESC
LIMIT 0,10
This query isn't very complex and the rows aren't too large, I don't think.
Is there any way to speed this up? Or can you suggest a better solution? This is going to be a cron job at the beginning of every month, but I would also like to get results by the day as well.
Oh, by the way, I am running this locally, where it takes over 4 minutes to run, but on prod it takes about 45 seconds.
I'm more of a SQL Server guy but these concepts should apply.
I'd add indexes:
On ListenTrack, add an index with url, and date_created
On Music, add an index with url
These indexes should speed the query up tremendously (I originally had the table names mixed up - fixed in the latest edit).
For the most part you should also index any column that is used in a JOIN. In your case, you should index both listentrack.url and music.url
@jeff s - An index on music.date_created wouldn't help because you are running that column through a function first, so MySQL cannot use an index on it. Often, you can rewrite a query so that the indexed column is referenced directly, like:
DATEDIFF(DATE(date_created),'2009-08-15') = 0
becomes
date_created >= '2009-08-15' and date_created < '2009-08-16'
This will filter down records that are from 2009-08-15 and allow any indexes on that column to be candidates. Note that MySQL might NOT use that index, it depends on other factors.
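If the date comes from application code, the two bounds for a single day can be computed as a half-open interval (start inclusive, end exclusive, where the end is midnight of the following day). A minimal sketch in Python, standing in for the application language; the helper name day_range is hypothetical:

```python
from datetime import datetime, timedelta


def day_range(day):
    """Return (start, end) strings bounding one calendar day, end exclusive."""
    start = datetime.strptime(day, "%Y-%m-%d")
    end = start + timedelta(days=1)  # midnight of the following day
    fmt = "%Y-%m-%d %H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)


start, end = day_range("2009-08-15")
# Intended use: WHERE date_created >= <start> AND date_created < <end>,
# with both values bound as query parameters.
print(start, end)
```

Letting the date library add the day avoids hand-written edge cases at month and year boundaries.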
Your best bet is to make a dual index on listentrack(url, date_created)
and then another index on music.url
These 2 indexes will cover this particular query.
Note that if you run EXPLAIN on this query you are still going to get a using filesort because it has to write the records to a temporary table on disk to do the ORDER BY.
In general you should always run your query under EXPLAIN to get an idea on how MySQL will execute the query and then go from there. See the EXPLAIN documentation:
http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
Try creating an index that will help with the join:
CREATE INDEX idx_url ON music (url);
I think I might have missed the obvious before. Why are you joining the music table at all? You do not appear to be using the data in that table, and you are performing a left join which is not required, right? I think having this table in the query will make it much slower and will not add any value. Take all references to music out, unless the url inclusion is required, in which case you need an inner join so that rows without a matching value are excluded.
I would add new indexes, as the others mention. Specifically I would add:
music url
listentrack date_created,url
This will improve your join a ton.
Then I would look at the query, you are forcing the system to perform work on each row of the table. It would be better to rephrase the date restriction as a range.
Not sure of the syntax off the top of my head:
where date_created >= '2009-08-15 00:00:00' and date_created < '2009-08-16 00:00:00'
That should allow it to rapidly use the index to locate the appropriate records. The combined two-key index on listentrack should allow it to find the records based on the date and URL. You should experiment; the columns might be better off in the other order (url, date_created) in the index.
The explain plan for this query should say "using index" on the right hand column for both. That means that it will not have to hit the data in the table to calculate your sums.
I would also check the memory settings that you have configured for MySQL. It sounds like you do not have enough memory allocated. Be very careful on the differences between server based settings and thread based settings. The server with a 10MB cache is pretty small, a thread with a 10MB cache can use a lot of memory quickly.
Jacob
Pre-grouping and then joining makes things a lot faster with MySQL/MyISAM. (I suspect less of this is needed with other databases.)
This should perform about as fast as the non-joined version:
SELECT
total, a.url, title
FROM
(
SELECT COUNT(*) as total, url
from listenTrack
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY url
ORDER BY total DESC
LIMIT 0,10
) as a
LEFT JOIN music ON music.url = a.url
;
P.S. - Mapping between the two tables with an id instead of a url is sound advice.
Why are you repeating the url in both tables?
Have listentrack hold a music_id instead, and join on that. Gets rid of the text search as well as the extra index.
Besides, it's arguably more correct. You're tracking the times that a particular track was listened to, not the url. What if the url changes?
After you add indexes then you may want to explore adding a new column for the date_created to be a unix_timestamp, which will make math operations quicker.
I am not certain why you have the diff function though, as it appears you are looking for all rows that were updated on a particular date.
You may want to look at your query as it seems to have an error.
If you use unit tests then you can compare the results of your query and a query using a unix timestamp instead.
You might want to add an index to the url field of both tables.
Having said that, when I converted from MySQL to SQL Server 2008, with the same queries and same database structures, the queries ran 1-3 orders of magnitude faster.
I think some of it had to do with the RDBMS (MySQL's optimizer is not so good...) and some of it might have had to do with how each RDBMS reserves system resources. That said, the comparisons were made on production systems where only the database was running.
This below would probably work to speed up the query.
CREATE INDEX music_url_index ON music (url) USING BTREE;
CREATE INDEX listenTrack_url_index ON listenTrack (url) USING BTREE;
You really need to know the total number of comparisons and row scans that are happening. To get that answer look at the code here of how to do that using explain http://www.siteconsortium.com/h/p1.php?id=mysql002.

index chat mysql table

I have this chat table:
CREATE TABLE IF NOT EXISTS `support_chat` (
`id` int(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`from` varchar(255) NOT NULL DEFAULT '',
`to` varchar(255) NOT NULL DEFAULT '',
`message` text NOT NULL,
`sent` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`seen` varchar(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `from` (`from`),
KEY `to` (`to`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=1 ;
Basically, I need to run a select all the time (every 3 seconds per user) to check for new messages:
select id, `from`, message, sent from support_chat where `to` = ? and seen = 0
I have 5 million rows and usually 100 users online at the same time. Can I change something to make this table faster? Are the keys on from and to a good option?
There isn't much you can do by way of indexes to speed up that particular query. You could have a composite index on the to and seen fields, but the improvement will be minimal, if any. Why? Because the seen field has very poor cardinality. You only seem to be storing 0 or 1 in it, and indexes on such columns are not very useful. Often it would be faster for the query optimizer to read the data directly.
But here's what you can do: partition the table. Partitioning
... enables you to distribute portions of individual tables across a
file system according to rules which you can set largely as needed. In
effect, different portions of a table are stored as separate tables in
different locations. The user-selected rule by which the division of
data is accomplished is known as a partitioning function,
You can partition your data in such a way that very old data is separated from the new. This will probably give you a big boost. However be aware that if you have a query that fetches old data as well as new data that will be a lot slower.
Here is another thing you can do: Add a limit clause.
You are probably only showing a limited number of messages at any given time. Putting a limit clause will help. Then mysql knows that it doesn't need to look anymore after it has found N rows.
Add a multiple column index on to and seen columns in this particular order (to column should be the 1st column in the index). Then run explain select... on your query to see if the new index is used.
Assuming that the seen column stores 2 values only ('0' and '1') and that to column stores the recipient of the chat message (email, username), so it can have many more values, I'd use a composite index with seen first and to second:
ALTER TABLE support_chat
ADD INDEX seen_to_ix
(seen, `to`) ;
A composite index with reversed order (`to`, seen) would be a good choice, too. It might even be better depending on server load and how often the table is updated. An advantage (if you decide to use the second index), is that you can remove the (`to`) index.
Pick and add one of the two indexes and check the performance of your queries again.
Additional notes:
Using a varchar(1) for what is essentially a boolean value is not optimal, even more so with the utf8mb4 charset: it can take up to 5 bytes (1 for the length and up to 4 for the single character)!
I'd change the type of that column to tinyint (and store 0 and 1) or bit.
Please avoid using reserved words (eg, from, to) for table and column names.

Same MySql Query Long execution time but short on archive table with 6million more records

I am a bit stumped by this weirdness.
I have a gps tracking app that logs gps points into a track_log table.
When I do a basic query on the running log table it takes about 50 seconds to complete:
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1
I copied most of the logs to an archive table to reduce the running table to about 1.2 million records, and then ran the exact same query on the archive table.
The archive table is 7.5 million records big.
The exact same query on the archive table runs for 0.1 seconds on the same server even though it's six times bigger!
What's going on?
Here's the full Create Table schema:
CREATE TABLE `track_log` (
`id_track_log` INT(11) NOT NULL AUTO_INCREMENT,
`node_id` INT(11) DEFAULT NULL,
`client_id` INT(11) DEFAULT NULL,
`time_stamp` DATETIME NOT NULL,
`latitude` DOUBLE DEFAULT NULL,
`longitude` DOUBLE DEFAULT NULL,
`altitude` DOUBLE DEFAULT NULL,
`direction` DOUBLE DEFAULT NULL,
`speed` DOUBLE DEFAULT NULL,
`event_code` INT(11) DEFAULT NULL,
`event_description` VARCHAR(255) DEFAULT NULL,
`street_address` VARCHAR(255) DEFAULT NULL,
`mileage` INT(11) DEFAULT NULL,
`run_time` INT(11) DEFAULT NULL,
`satellites` INT(11) DEFAULT NULL,
`gsm_signal_status` DOUBLE DEFAULT NULL,
`hor_pos_accuracy` double DEFAULT NULL,
`positioning_status` char(1) DEFAULT NULL,
`io_port_status` char(16) DEFAULT NULL,
`AD1` decimal(10,2) DEFAULT NULL,
`AD2` decimal(10,2) DEFAULT NULL,
`AD3` decimal(10,2) DEFAULT NULL,
`battery_voltage` decimal(10,2) DEFAULT NULL,
`ext_power_voltage` decimal(10,2) DEFAULT NULL,
`rfid` char(8) DEFAULT NULL,
`pic_name` varchar(255) DEFAULT NULL,
`temp_sensor_no` char(2) DEFAULT NULL,
PRIMARY KEY (`id_track_log`),
UNIQUE KEY `id_track_log_UNIQUE` (`id_track_log`),
KEY `client_id_fk_idx` (`client_id`),
KEY `track_log_node_id_fk_idx` (`node_id`),
KEY `track_log_event_code_fk_idx` (`event_code`),
KEY `track_log_time_stamp_index` (`time_stamp`),
CONSTRAINT `track_log_client_id` FOREIGN KEY (`client_id`) REFERENCES `clients` (`client_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `track_log_event_code_fk` FOREIGN KEY (`event_code`) REFERENCES `event_codes` (`event_code`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `track_log_node_id_fk` FOREIGN KEY (`node_id`) REFERENCES `nodes` (`id_nodes`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=8632967 DEFAULT CHARSET=utf8
TL;DR
Make sure the indexes are defined in both tables; for this query, node_id and time_stamp are good candidates to index.
Defragment your table: https://dev.mysql.com/doc/refman/5.5/en/innodb-file-defragmenting.html (This could help, but should not make this much of a difference).
Make sure your query is not being blocked by other queries. If data is being inserted into the track_log table continuously, those queries might block your query. You can prevent this by changing the transaction isolation level; see https://dev.mysql.com/doc/refman/5.5/en/set-transaction.html for more information. Caution: be careful with this!
Indexes
I'm guessing this has something to do with the indexes you defined on the tables. Could you post the SHOW CREATE TABLE output for track_log and for your archive table as well? The query you are executing would require an index on node_id and time_stamp for optimal performance.
Defragmentation
Besides the indexes you defined on the table, this might have something to do with data fragmentation. I'm assuming you are using InnoDB as your table engine now. Depending on your settings, every table in a database is stored in a separate file or every table in the database is stored in a single file (innodb_file_per_table variable). Those files will never shrink in size. If your track_log table has grown to 8.7 million records, on disk it still takes up space for all those 8.7 million records.
If you have moved records from your track_log table to your archive table, the data might still be at the beginning and the end of the physical file for track_log. If no index is defined at time_stamp, a full table scan is still required to order by the timestamp. This means: reading the complete file from disk. Because the records you deleted still take up space in the file, this could make a difference.
Edit:
Transactions
Other transactions might be blocking your SELECT query. This can happen with the InnoDB engine. If you continuously insert a lot of data into your track_log table, those queries might block your query. It will have to wait until no other transactions are being performed on this table.
There is a way around this, but you should be careful with it. You are able to change the transaction isolation level of your query. By setting the transaction isolation level to READ UNCOMMITTED, you will be able to read data while the other inserts are running, but it might not always give you the latest data. Whether you want to make that sacrifice depends on your situation. If you are going to alter the data and update it later, you generally do not want to change the transaction isolation level. But when, for example, showing statistics that do not always have to be accurate and up to date, this could be something that really speeds up your query.
I use this myself sometimes when I need to show statistics from large tables which are updated regularly.
This is almost certainly because your archive table has superior indexing to your track_log table.
To satisfy this query efficiently you need a compound index on (node_id, time_stamp). Why does this work? Because InnoDB and MyISAM indexes are so-called BTREE indexes, which means our intuition about searching them in order will work. Your query looks for a specific value of node_id, which means it can jump to that value in the index efficiently. The query then calls for the highest possible value of time_stamp related to that node_id value. Now that's in the same index, and in the right order to access it quickly too. So the row you need can be random-accessed, and MySQL doesn't have to hunt for it by scanning the table row by row. That scanning is almost certainly what's taking the time in your query.
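Here is a small, self-contained way to watch a compound index being picked for this kind of "latest row per node" query. It uses Python with an in-memory SQLite database standing in for MySQL, so the plan text will not look like MySQL's EXPLAIN; the index name idx_node_time and the trimmed-down three-column table are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE track_log ("
    "id_track_log INTEGER PRIMARY KEY, node_id INTEGER, time_stamp TEXT)"
)
# Compound index: equality column first, then the column used for ordering.
conn.execute("CREATE INDEX idx_node_time ON track_log (node_id, time_stamp)")
conn.executemany(
    "INSERT INTO track_log (node_id, time_stamp) VALUES (?, ?)",
    [
        (26, "2015-01-01 10:00:00"),
        (26, "2015-01-02 09:30:00"),
        (7, "2015-01-03 08:00:00"),
    ],
)

query = (
    "SELECT node_id, time_stamp FROM track_log "
    "WHERE node_id = ? ORDER BY time_stamp DESC LIMIT 1"
)
latest = conn.execute(query, (26,)).fetchone()

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(
    str(row[-1]) for row in conn.execute("EXPLAIN QUERY PLAN " + query, (26,))
)
print(latest)
print(plan)
```

On MySQL the equivalent check would be to add the index with ALTER TABLE and compare the EXPLAIN output of the query on both tables.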
Three things to keep in mind:
One: lots of indexes on single columns can't help a query as much as well-chosen compound indexes. Read this http://use-the-index-luke.com/
Two: SELECT * is usually harmful on a table with as many columns as the one you have shown. Instead, you should enumerate the columns you actually need in your SELECT query. That way MySQL doesn't have to sling as much data.
Three: The DOUBLE datatype is overkill for commercial-grade GPS data. FLOAT is plenty of precision.
Let us analyze your query:
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1
The above mentioned query first sorts all the data present in the table based on time_stamp and then returns the top row.
But, when this query is executed on archived table, order by clause might be ignored (based on compression and system setting) and hence it returns the first row it encountered in the table.
You may verify the output of archived table by comparing the result with actual latest row.

Most efficient way to store data for a graph

I have come up with a total of three different, equally viable methods for saving data for a graph.
The graph in question is "player's score in various categories over time". Categories include "buildings", "items", "quest completion", "achievements" and so on.
Method 1:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`date` DATE NOT NULL,
`category` ENUM('buildings','items',...) NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`userid`, `date`, `category`),
INDEX `userid` (`userid`),
INDEX `date` (`date`)
) ENGINE=InnoDB
This table contains one row for each user/date/category combination. To show a user's data, select by userid. Old entries are cleared out by:
DELETE FROM `graphdata` WHERE `date` < DATE_ADD(NOW(),INTERVAL -1 WEEK)
Method 2:
CREATE TABLE `graphdata` (
`userid` INT UNSIGNED NOT NULL,
`buildings-1day` FLOAT UNSIGNED NOT NULL,
`buildings-2day` FLOAT UNSIGNED NOT NULL,
... (and so on for each category up to `-7day`
PRIMARY KEY (`userid`)
)
Selecting by user id is faster due to being a primary key. Every day scores are shifted down the fields, as in:
... SET `buildings-3day`=`buildings-2day`, `buildings-2day`=`buildings-1day`...
Entries are not deleted (unless a user deletes their account). Rows can be added/updated with an INSERT...ON DUPLICATE KEY UPDATE query.
Method 3:
Use one file for each user, containing a JSON-encoded array of their score data. Since the data is being fetched by an AJAX JSON call anyway, this means the file can be fetched statically (and even cached until the following midnight) without any stress on the server. Every day the server runs through each file, shift()s the oldest score off each array and push()es the new one on the end.
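A minimal sketch of that daily shift/push step, written in Python for the sake of a runnable example; the function name roll_scores and the file layout (one JSON object mapping each category to a list of scores) are assumptions:

```python
import json
from collections import deque


def roll_scores(json_text, new_scores, days=7):
    """Append today's score per category, keeping at most `days` entries each."""
    data = json.loads(json_text)
    for category, score in new_scores.items():
        window = deque(data.get(category, []), maxlen=days)
        window.append(score)  # a full deque drops its oldest item on append
        data[category] = list(window)
    return json.dumps(data)


before = json.dumps({"buildings": [1, 2, 3, 4, 5, 6, 7]})
after = roll_scores(before, {"buildings": 8, "items": 3})
print(after)
```

The daily job would read each user's file, pass its contents through this function and write the result back.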
Personally I think Method 3 is by far the best, however I've heard bad things about using files instead of databases - for instance if I wanted to be able to rank users by their scores in different categories, this solution would be very bad.
Out of the two database solutions, I've implemented Method 2 on one of my older projects, and that seems to work quite well. Method 1 seems "better" in that it makes better use of relational databases and all that stuff, but I'm a little concerned in that it will contain (number of users) * (number of categories) * 7 rows, which could turn out to be a big number.
Is there anything I'm missing that could help me make a final decision on which method to use? 1, 2, 3 or none of the above?
If you're going to use a relational db, method 1 is much better than method 2. It's normalized, so it's easy to maintain and search. I'd change the date field to a timestamp and call it added_on (or something that's not a reserved word like 'date' is). And I'd add an auto_increment primary key score_id so that user_id/date/category doesn't have to be unique. That way, if a user managed to increment his building score twice in the same second, both would still be recorded.
The second method requires you to update all the records every day. The first method only does inserts, no updates, so each record is only written to once.
... SET buildings-3day=buildings-2day, buildings-2day=buildings-1day...
You really want to update every single record in the table every day until the end of time?!
Selecting by user id is faster due to being a primary key
Since user_id is the first field in your Method 1 primary key, it will be similarly fast for lookups. As first field in a regular index (which is what I've suggested above), it will still be very fast.
The idea with a relational db is that each row represents a single instance/action/occurrence. So when a user does something to affect his score, do an INSERT that records what he did. You can always create a summary from data like this. But you can't get this kind of data from a summary.
Secondly, you seem unwontedly concerned about getting rid of old data. Why? Your select queries would have a date range on them that would exclude old data automatically. And if you're concerned about performance, you can partition your tables based on row age or set up a cronjob to delete old records periodically.
ETA: Regarding JSON stored in files
This seems to me to combine the drawbacks of Method 2 (difficult to search, every file must be updated every day) with the additional drawbacks of file access. File accesses are expensive. File writes are even more so. If you really want to store summary data, I'd run a query only when the data is requested and I'd store the results in a summary table by user_id. The table could hold a JSON string:
CREATE TABLE score_summaries(
user_id INT unsigned NOT NULL PRIMARY KEY,
gen_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
json_data TEXT NOT NULL DEFAULT '{}'
);
For example:
Bob (user_id=7) logs into the game for the first time. He's on his profile page which displays his weekly stats. These queries ran:
SELECT json_data FROM score_summaries
WHERE user_id=7
AND gen_date > DATE_SUB(CURDATE(), INTERVAL 1 DAY);
-- returns nothing, so generate the summary record
SELECT DATE(added_on), category, SUM(score)
FROM scores WHERE user_id=7 AND added_on < CURDATE() AND added_on > DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY DATE(added_on), category; -- never include today's data; encode as JSON with PHP
INSERT INTO score_summaries(user_id, json_data)
VALUES(7, '$json') -- from PHP, in this case $json == NULL
ON DUPLICATE KEY UPDATE json_data=VALUES(json_data);
-- use $json for presentation too
Today's scores are generated as needed and not stored in the summary. If Bob views his scores again today, the historical ones can come from the summary table or could be stored in a session after the first request. If Bob doesn't visit for a week, no summary needs to be generated.
method 1 seems like a clear winner to me . If you are concerned about size of single table (graphData) being too big you could reduce it by creating
CREATE TABLE `graphdata` (
`graphDataId` INT UNSIGNED NOT NULL,
`categoryId` INT NOT NULL,
`score` FLOAT UNSIGNED NOT NULL,
PRIMARY KEY (`graphDataId`)
) ENGINE=InnoDB;
Then create two more tables, because you obviously need information connecting graphDataId with userId:
CREATE TABLE `graphDataUser` (
`graphDataId` INT UNSIGNED NOT NULL,
`userId` INT NOT NULL
) ENGINE=InnoDB;
and one connecting graphDataId with the date:
CREATE TABLE `graphDataDate` (
`graphDataId` INT UNSIGNED NOT NULL,
`graphDataDate` DATE NOT NULL
) ENGINE=InnoDB;
I think you don't really need to worry about the number of rows a table contains, because most database systems handle large row counts well. Your job is only to store the data in a form that is easily retrieved, whatever the task it is retrieved for. Following that advice should pay off in the long run.

MySQL Slow on join. Any way to speed up

I have 2 tables: music and listenTrack. listenTrack tracks the unique plays of each song. I am trying to get results for popular songs of the month. I'm getting my results, but they are just taking too long. Below are my tables and query.
430,000 rows
CREATE TABLE `listentrack` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sessionId` varchar(50) NOT NULL,
`url` varchar(50) NOT NULL,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`ip` varchar(150) NOT NULL,
`user_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=731306 DEFAULT CHARSET=utf8
12500 rows
CREATE TABLE `music` (
`music_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`title` varchar(50) DEFAULT NULL,
`artist` varchar(50) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`genre` int(4) DEFAULT NULL,
`file` varchar(255) NOT NULL,
`url` varchar(50) NOT NULL,
`allow_download` int(2) NOT NULL DEFAULT '1',
`plays` bigint(20) NOT NULL,
`downloads` bigint(20) NOT NULL,
`faved` bigint(20) NOT NULL,
`dateadded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`music_id`)
) ENGINE=MyISAM AUTO_INCREMENT=15146 DEFAULT CHARSET=utf8
SELECT COUNT(listenTrack.url) AS total, listenTrack.url
FROM listenTrack
LEFT JOIN music ON music.url = listenTrack.url
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY listenTrack.url
ORDER BY total DESC
LIMIT 0,10
This query isn't very complex and the rows aren't too large, I don't think.
Is there any way to speed this up? Or can you suggest a better solution? This is going to be a cron job at the beginning of every month, but I would also like to produce per-day results as well.
Oh, by the way, I am running this locally and it takes over 4 minutes to run, but on prod it takes about 45 seconds.
I'm more of a SQL Server guy but these concepts should apply.
I'd add indexes:
On ListenTrack, add an index with url, and date_created
On Music, add an index with url
These indexes should speed the query up tremendously (I originally had the table names mixed up - fixed in the latest edit).
For the most part you should also index any column that is used in a JOIN. In your case, you should index both listentrack.url and music.url
@jeff s - An index on date_created alone wouldn't help, because you are running that column through a function first, so MySQL cannot use an index on it. Often, you can rewrite a query so that the indexed column is referenced directly, like:
DATEDIFF(DATE(date_created),'2009-08-15') = 0
becomes
date_created >= '2009-08-15' and date_created < '2009-08-16'
This will filter down records that are from 2009-08-15 and allow any indexes on that column to be candidates. Note that MySQL might NOT use that index, it depends on other factors.
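To make the difference concrete, here is a small sketch that runs both forms of the predicate and inspects the query plan. It uses Python's standard-library sqlite3 module rather than MySQL (an assumption made only so the example is self-contained), so the plan text differs from MySQL's EXPLAIN, but the principle of keeping the indexed column bare in the comparison is the same. The table is trimmed to the relevant columns, and the index name and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listentrack ("
    "id INTEGER PRIMARY KEY, url TEXT NOT NULL, date_created TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_date_created ON listentrack (date_created)")
conn.executemany(
    "INSERT INTO listentrack (url, date_created) VALUES (?, ?)",
    [
        ("song-a", "2009-08-14 23:59:59"),
        ("song-a", "2009-08-15 00:00:00"),
        ("song-b", "2009-08-15 12:30:00"),
        ("song-a", "2009-08-15 23:59:59"),
        ("song-b", "2009-08-16 00:00:00"),
    ],
)


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Column wrapped in a function: every row must be evaluated.
wrapped = "SELECT COUNT(*) FROM listentrack WHERE date(date_created) = ?"
# Half-open range on the bare column: the index can be searched.
ranged = (
    "SELECT COUNT(*) FROM listentrack "
    "WHERE date_created >= ? AND date_created < ?"
)

wrapped_count = conn.execute(wrapped, ("2009-08-15",)).fetchone()[0]
ranged_count = conn.execute(ranged, ("2009-08-15", "2009-08-16")).fetchone()[0]
wrapped_plan = plan(wrapped, ("2009-08-15",))
ranged_plan = plan(ranged, ("2009-08-15", "2009-08-16"))

print(wrapped_count, ranged_count)
print(wrapped_plan)
print(ranged_plan)
```

Both queries return the same count; only the range form lets the planner search the index instead of scanning every row.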
Your best bet is to make a dual index on listentrack(url, date_created)
and then another index on music.url
These 2 indexes will cover this particular query.
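Note that if you run EXPLAIN on this query you are still going to get "Using filesort", because MySQL has to sort the grouped counts for the ORDER BY and no index can supply that order. Despite the name, a filesort is just an extra sort pass and does not necessarily touch the disk.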
In general you should always run your query under EXPLAIN to get an idea on how MySQL will execute the query and then go from there. See the EXPLAIN documentation:
http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
Try creating an index that will help with the join:
CREATE INDEX idx_url ON music (url);
I think I might have missed the obvious before. Why are you joining the music table at all? You do not appear to be using any data from that table, and the left join you are performing is not required, right? I think having this table in the query will make it much slower without adding any value. Take all references to music out, unless the url inclusion is required, in which case you need an inner join so that rows without a matching value are not included.
I would add new indexes, as the others mention. Specifically I would add:
music url
listentrack date_created,url
This will improve your join a ton.
Then I would look at the query: you are forcing the system to perform work on each row of the table. It would be better to rephrase the date restriction as a range.
Not sure of the syntax off the top of my head:
WHERE date_created >= '2009-08-15 00:00:00' AND date_created < '2009-08-16 00:00:00'
That should allow it to rapidly use the index to locate the appropriate records. The combined two-key index on listentrack should allow it to find the records based on the date and URL. You should experiment; the columns might be better off in the other order (url, date_created) in the index.
The explain plan for this query should say "Using index" in the right-hand column for both tables. That means that it will not have to hit the data in the table to calculate your counts.
I would also check the memory settings that you have configured for MySQL. It sounds like you do not have enough memory allocated. Be very careful on the differences between server based settings and thread based settings. The server with a 10MB cache is pretty small, a thread with a 10MB cache can use a lot of memory quickly.
Jacob
Pre-grouping and then joining makes things a lot faster with MySQL/MyISAM. (I suspect less of this is needed with other databases.)
This should perform about as fast as the non-joined version:
SELECT
total, a.url, title
FROM
(
SELECT COUNT(*) as total, url
from listenTrack
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY url
ORDER BY total DESC
LIMIT 0,10
) as a
LEFT JOIN music ON music.url = a.url
;
P.S. - Mapping between the two tables with an id instead of a url is sound advice.
Why are you repeating the url in both tables?
Have listentrack hold a music_id instead, and join on that. Gets rid of the text search as well as the extra index.
Besides, it's arguably more correct. You're tracking the times that a particular track was listened to, not the url. What if the url changes?
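A minimal sketch of what that restructuring could look like, again using Python's standard-library sqlite3 module in place of MySQL so that it runs on its own. The trimmed-down columns, the index name, the helper top_tracks, and the sample rows are assumptions for illustration, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE music (
        music_id INTEGER PRIMARY KEY,
        title TEXT,
        url TEXT NOT NULL
    );
    CREATE TABLE listentrack (
        id INTEGER PRIMARY KEY,
        music_id INTEGER NOT NULL REFERENCES music (music_id),
        date_created TEXT NOT NULL
    );
    -- One composite index serves both the date filter and the grouping key.
    CREATE INDEX idx_listen_date_music ON listentrack (date_created, music_id);
""")
conn.executemany(
    "INSERT INTO music (music_id, title, url) VALUES (?, ?, ?)",
    [(1, "First", "first-song"), (2, "Second", "second-song")],
)
conn.executemany(
    "INSERT INTO listentrack (music_id, date_created) VALUES (?, ?)",
    [
        (1, "2009-08-15 09:00:00"),
        (1, "2009-08-15 10:00:00"),
        (2, "2009-08-15 11:00:00"),
        (2, "2009-08-16 08:00:00"),
    ],
)


def top_tracks(start, end, limit=10):
    """Most-played tracks with start <= date_created < end."""
    return conn.execute(
        """
        SELECT m.title, COUNT(*) AS total
        FROM listentrack AS l
        JOIN music AS m ON m.music_id = l.music_id
        WHERE l.date_created >= ? AND l.date_created < ?
        GROUP BY l.music_id
        ORDER BY total DESC, m.title
        LIMIT ?
        """,
        (start, end, limit),
    ).fetchall()


before = top_tracks("2009-08-15", "2009-08-16")

# Renaming a track's URL leaves its play history intact.
conn.execute("UPDATE music SET url = 'renamed-song' WHERE music_id = 1")
after = top_tracks("2009-08-15", "2009-08-16")
print(before, after)
```

Because plays reference the integer key, the join compares integers instead of strings, and a URL change does not orphan any play records.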
After you add indexes then you may want to explore adding a new column for the date_created to be a unix_timestamp, which will make math operations quicker.
I am not certain why you have the diff function though, as it appears you are looking for all rows that were updated on a particular date.
You may want to look at your query as it seems to have an error.
If you use unit tests then you can compare the results of your query and a query using a unix timestamp instead.
You might want to add an index to the url field of both tables.
Having said that, when I converted from MySQL to SQL Server 2008, with the same queries and same database structures, the queries ran 1-3 orders of magnitude faster.
I think some of it had to do with the RDBMS (MySQL's optimizer is not so good...) and some of it might have had to do with how each RDBMS reserves system resources. The comparisons were made on production systems where only the database was running, though.
The following would probably speed up the query:
CREATE INDEX music_url_index ON music (url) USING BTREE;
CREATE INDEX listenTrack_url_index ON listenTrack (url) USING BTREE;
You really need to know the total number of comparisons and row scans that are happening. To find out, see this example of how to do it using EXPLAIN: http://www.siteconsortium.com/h/p1.php?id=mysql002.
