My table
Field    Type           Null  Key  Default  Extra
id       int(11)        NO    PRI  NULL     auto_increment
userid   int(11)        NO    MUL  NULL
title    varchar(50)    YES        NULL
hosting  varchar(10)    YES        NULL
zipcode  varchar(5)     YES        NULL
lat      varchar(20)    YES        NULL
long     varchar(20)    YES        NULL
msg      varchar(1000)  YES   MUL  NULL
time     datetime       NO         NULL
That is the table. I simulated 500k rows of data and randomly deleted 270k of them, leaving only 230k rows with an auto-increment value of 500k.
Here are my indexes:
Keyname  Type   Unique  Packed  Field   Cardinality  Collation  Null
PRIMARY  BTREE  Yes     No      id      232377       A
info     BTREE  No      No      userid  2003         A
                                lat     25819        A          YES
                                long    25819        A          YES
                                title   25819        A          YES
                                time    25819        A
With that in mind, here is my query:
SELECT * FROM posts WHERE long>-118.13902802886 AND long<-118.08130797114 AND lat>33.79987197114 AND lat<33.85759202886 ORDER BY id ASC LIMIT 0, 25
Showing rows 0 - 15 (16 total, Query took 1.5655 sec) [id: 32846 - 540342]
The query only brought me 1 page, but because it had to search all 230k records it still took 1.5 seconds.
Here is the query explained:
id  select_type  table  type   possible_keys  key      key_len  ref   rows  Extra
1   SIMPLE       posts  index  NULL           PRIMARY  4        NULL  25    Using where
So even if I use WHERE clauses to get back only 16 results, the query is still slow.
Now, for example, if I do a broader search:
SELECT * FROM `posts` WHERE `long`>-118.2544681443 AND `long`<-117.9658678557 AND `lat`>33.6844318557 AND `lat`<33.9730321443 ORDER BY id ASC LIMIT 0, 25
Showing rows 0 - 24 (25 total, Query took 0.0849 sec) [id: 691 - 29818]
It is much faster when retrieving the first page: 483 rows match in total (20 pages), but I limit the result to 25.
But if I ask for the last page:
SELECT * FROM `posts` WHERE `long`>-118.2544681443 AND `long`<-117.9658678557 AND `lat`>33.6844318557 AND `lat`<33.9730321443 ORDER BY id ASC LIMIT 475, 25
Showing rows 0 - 7 (8 total, Query took 1.5874 sec) [id: 553198 - 559593]
I get a slow query.
My question is: how do I achieve good pagination? Once the website goes live and takes off, I expect hundreds of posts to be created and deleted daily.
Posts should be ordered by id or timestamp, and the ids are not sequential because some records will be deleted.
I want to have standard pagination:
1 2 3 4 5 6 7 8 ... [Last Page]
Use a WHERE clause to filter out the records that appeared on earlier pages; then you do not need to specify an offset, only a row count. For example, keep track of the last id or timestamp seen and select only those records with an id or timestamp greater than that.
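A minimal sketch of this idea, using Python's sqlite3 module as a stand-in for MySQL so that it runs on its own. The table is a cut-down, hypothetical version of posts (the `lng` column name and the sample coordinates are assumptions for illustration), and `fetch_page` is a made-up helper, not part of any library:

```python
import sqlite3

# In-memory stand-in for the `posts` table; only the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, lat REAL, lng REAL)")
# Ids with gaps simulate deleted rows: 3, 6, 9, ..., 300
conn.executemany(
    "INSERT INTO posts (id, lat, lng) VALUES (?, ?, ?)",
    [(i, 33.8, -118.1) for i in range(3, 301, 3)],
)

def fetch_page(conn, last_id, page_size=25):
    """Return the next page of matching rows whose id is greater than last_id."""
    return conn.execute(
        "SELECT id, lat, lng FROM posts "
        "WHERE lat > 33.7 AND lat < 33.9 AND lng > -118.2 AND lng < -118.0 "
        "AND id > ? ORDER BY id ASC LIMIT ?",
        (last_id, page_size),
    ).fetchall()

first_page = fetch_page(conn, last_id=0)
# The last id of one page is the cursor for the next page.
second_page = fetch_page(conn, last_id=first_page[-1][0])
```

Because each page starts from an id instead of an offset, the database can begin scanning at that id rather than reading and discarding all earlier rows. The trade-off is that this only supports stepping page by page, not jumping to an arbitrary page number.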
Unfortunately, MySQL has to read (and, before that, sort) all 20000 rows before it outputs your 30 results. If you can, try narrowing down your search by filtering on indexed columns in the WHERE clause.
Few remarks.
Given that you order by id, you know the id of the first and last record on each page, so rather than LIMIT 200000 you should use WHERE id > $last_id LIMIT 20, which would be blazingly fast.
The drawback, obviously, is that you cannot offer a "last" page or any page in between if the ids are not sequential (deleted in between). You may then use a combination of the last known id and OFFSET + LIMIT.
And obviously, having proper indexes will also help with sorting and limiting.
It looks like you only have a primary key index. You might want to define an index on the fields you use, such as:
create index idx_posts_id on posts (`id` ASC);
create index idx_posts_id_timestamp on posts (`id` ASC, `timestamp` ASC);
Having a regular index on your key field, besides your primary unique key index, usually helps speed up MySQL by a lot.
Mysql loses quite a bit of performance with a large offset: from the mysqlPerformance blog:
Beware of large LIMIT
Using index to sort is efficient if you need first few rows, even if some extra filtering takes place so you need to scan more rows by index then requested by LIMIT. However if you’re dealing with LIMIT query with large offset efficiency will suffer. LIMIT 1000,10 is likely to be way slower than LIMIT 0,10. It is true most users will not go further than 10 page in results, however Search Engine Bots may very well do so. I’ve seen bots looking at 200+ page in my projects. Also for many web sites failing to take care of this provides very easy task to launch a DOS attack – request page with some large number from few connections and it is enough. If you do not do anything else make sure you block requests with too large page numbers.
For some cases, for example if results are static it may make sense to precompute results so you can query them for positions.
So instead of query with LIMIT 1000,10 you will have WHERE position between 1000 and 1009 which has same efficiency for any position (as long as it is indexed)
If you are using AUTO INCREMENT you may use:
SELECT *
FROM `posts`
WHERE `id` >= 200000
ORDER BY `id` DESC
LIMIT 200000, 30
This way mysql will have to traverse only rows above 200000.
I figured it out. What was slowing me down was the ORDER BY: since I combined it with a LIMIT, the further down I asked to go, the more rows it had to sort. I fixed it by adding a subquery that first extracts the data I want with a WHERE clause, and then applying ORDER BY and LIMIT to that result:
SELECT * FROM
(SELECT * from `posts` as `p`
WHERE
`p`.`long`>-119.2544681443
AND `p`.`long`<-117.9658678557
AND `p`.`lat`>32.6844318557
AND `p`.`lat`<34.9730321443
) as posttable
order by id desc
limit x,n
By doing that I achieved the following:
id  select_type  table       type  possible_keys  key   key_len  ref   rows    Extra
1   PRIMARY      <derived2>  ALL   NULL           NULL  NULL     NULL  3031    Using filesort
2   DERIVED      p           ALL   NULL           NULL  NULL     NULL  232377  Using where
Now the WHERE clause filters the 232k rows first, and only the 3,031 matching rows are sorted and limited.
Showing rows 0 - 3030 (3,031 total, Query took 0.1431 sec)
Related
I am making a PHP backend API which executes a query on MySQL database. This is the query:
SELECT * FROM $TABLE_GAMES WHERE
($GAME_RECEIVERID = '$userId' OR $GAME_OTHERID = '$userId')
ORDER BY $GAME_ID LIMIT 1
Essentially, I'm passing $userId as a parameter and getting the row with the smallest $GAME_ID value. It returns a result in less than 100 ms for users that have around 30,000 matching rows in the table. However, I have since added new users that have fewer than 100 matching rows, and the query is painfully slow for them, taking around 20-30 seconds every time.
I'm puzzled as to why the query is so much slower when it has few matching rows and extremely fast when it has a huge number of matching rows, especially since I have an ORDER BY.
I have read about parameter sniffing, but as far as I know, that's the SQL Server thing, and I'm using MySQL.
EDIT
Here is the SHOW CREATE statement:
CREATE TABLE `games` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`SenderID` int(11) NOT NULL,
`ReceiverID` int(11) NOT NULL,
`OtherID` int(11) NOT NULL,
`Timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=17275279 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Here is the output of EXPLAIN
id  select_type  table  partitions  type   possible_keys  key      key_len  ref   rows  filtered  Extra
1   SIMPLE       games  NULL        index  NULL           PRIMARY  4        NULL  1     19.00     Using where
I tried prepared statement, but still getting the same result.
Sorry for poor formatting, I'm still noob at this.
You need to use EXPLAIN to analyse the performance of the query.
i.e.
EXPLAIN SELECT * FROM $TABLE_GAMES WHERE
($GAME_RECEIVERID = '$userId' OR $GAME_OTHERID = '$userId')
ORDER BY $GAME_ID LIMIT 1
EXPLAIN shows the execution plan MySQL chose for the SELECT query.
It is a great tool for identifying why a query is slow. Based on that information you can create indexes for the columns used in the WHERE clause:
CREATE INDEX index_name ON table_name (column_list)
With a suitable index in place, the query should no longer need to scan the table, which usually improves performance considerably.
Your query is being slow because it cannot find a matching record fast enough. With users where a lot of rows match, chances of finding a record to return are much higher, all other things being equal.
That behavior appears when $GAME_RECEIVERID and $GAME_OTHERID aren't part of an index, prompting MySQL to use the index on $GAME_ID because of the ordering. However, since newer players have not played the early games, there are literally millions of rows that won't match, but have to be checked nonetheless.
Unfortunately, this is bound to get worse even for old users, as your database grows. Ideally, you will add indexes on $GAME_RECEIVERID and $GAME_OTHERID - something like:
ALTER TABLE games
ADD INDEX receiver (ReceiverID),
ADD INDEX other (OtherID)
PS: Altering a 17 million rows table is going to take a while, so make sure to do it during a maintenance window or similar if this is used in production.
Is this the query after the interpolation? That is, is this what MySQL will see?
SELECT * FROM GAMES
WHERE RECEIVERID = '123'
OR OTHERID = '123'
ORDER BY ID LIMIT 1
Then this will run fast, regardless:
SELECT *
FROM GAMES
WHERE ID = LEAST(
( SELECT MIN(ID) FROM GAMES WHERE RECEIVERID = '123' ),
( SELECT MIN(ID) FROM GAMES WHERE OTHERID = '123' )
);
But, you will need both of these:
INDEX(RECEIVERID, ID),
INDEX(OTHERID, ID)
Your version of the query is scanning the table until it finds a matching row. My version will
make two indexed lookups;
fetch the other columns for the one row.
It will be the same, fast, speed regardless of how many rows there are for USERID.
(Recommend switching to InnoDB.)
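One caveat with LEAST: in MySQL it returns NULL if any argument is NULL, and MIN() over no rows is NULL, so a user who only ever appears in one of the two columns would get no row back. A possible variant, sketched here with Python's sqlite3 module standing in for MySQL (the table is trimmed to three columns, and the index names and the `first_game` helper are hypothetical), combines the two indexed lookups with UNION ALL and takes the MIN of the results, since the aggregate MIN() skips NULLs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (ID INTEGER PRIMARY KEY, ReceiverID INTEGER, OtherID INTEGER)"
)
# The two composite indexes recommended above (names are made up).
conn.execute("CREATE INDEX receiver_id ON games (ReceiverID, ID)")
conn.execute("CREATE INDEX other_id ON games (OtherID, ID)")
conn.executemany(
    "INSERT INTO games (ID, ReceiverID, OtherID) VALUES (?, ?, ?)",
    [(1, 10, 20), (2, 30, 10), (3, 40, 50), (4, 50, 60), (5, 70, 40)],
)

FIRST_GAME_SQL = """
SELECT ID, ReceiverID, OtherID
FROM games
WHERE ID = (
    SELECT MIN(ID) FROM (
        SELECT MIN(ID) AS ID FROM games WHERE ReceiverID = ?
        UNION ALL
        SELECT MIN(ID) AS ID FROM games WHERE OtherID = ?
    ) AS m
)
"""

def first_game(conn, user_id):
    """Smallest-ID game in which user_id is the receiver or the other player."""
    return conn.execute(FIRST_GAME_SQL, (user_id, user_id)).fetchone()
```

For a user with no games at all, both branches yield NULL, the outer comparison matches nothing, and `first_game` returns None.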
I have a problem: I am working on highscores, and for those highscores I need to compute a ranking based on skill experience and latest update time (to see who reached the score first in case the skill experience is the same).
The problem is that with the query I wrote, it takes 28 (skills) x 0.7 seconds to build a personal highscore page showing a player's rank on each list. That is far too long for a page requested from the browser, so I need a solution.
MySQL version: 5.5.47
The query I wrote:
SELECT rank FROM
(
SELECT highscore.playerID, (@rowID := @rowID + 1) AS rank
FROM
(
SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
) highscore,
(SELECT @rowID := 0) r
) data
WHERE data.playerID = ?
As you can see, I first have to build a whole result set containing the full ranking for that game mode and skill, and only then can I select the rank for the given playerID. I cannot let the query run only until it finds the player, because MySQL doesn't offer such a feature. If I put WHERE playerID = ? in the inner query instead, it would return a single row, so the rank would always be 1.
The highscores table has 550k rows
I have tried storing the result set for each skillID/game mode combination JSON-encoded in a temp table, and also storing it in files, but that ended up being quite slow as well, because the files are huge and take time to process.
Highscores table:
CREATE TABLE `highscores` (
`playerID` INT(11) NOT NULL,
`skillID` INT(10) NOT NULL,
`skillLevel` INT(10) NOT NULL,
`skillExperience` INT(10) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
PRIMARY KEY (`playerID`, `skillID`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
Overall table has got 351k rows
Overall table:
CREATE TABLE `overall` (
`playerID` INT(11) NOT NULL,
`playerName` VARCHAR(50) NOT NULL,
`totalLevel` INT(10) NOT NULL,
`totalExperience` BIGINT(20) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
`game_mode` ENUM('REGULAR','IRON_MAN','IRON_MAN_HARDCORE') NOT NULL DEFAULT 'REGULAR',
PRIMARY KEY (`playerID`, `playerName`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
Explain Select result from the query:
Does anybody have a solution for me?
No useful index for WHERE
The last 2 lines of the EXPLAIN (#3 DERIVED):
WHERE hs.skillID = ?
AND o.game_mode = ?
Since neither table has a suitable index to use for the WHERE clause, the optimizer decided to do a table scan of one of them (overall), then reach into the other (highscores). Having one of these indexes would help, at least some:
highscores: INDEX(skillID)
overall: INDEX(game_mode, ...) -- note that an index only on a low-cardinality ENUM is rarely useful.
(More in a minute.)
No useful index for ORDER BY
The optimizer sometimes decides to use an index for the ORDER BY instead of for the WHERE. But
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
cannot use an index, even though both are in the same table. This is because DESC and ASC are different. Changing ASC to DESC would have an impact on the resultset, but would allow
INDEX(skillExperience, updateTime)
to be used. Still, this may not be optimal. (More in a minute.)
Covering index
Another form of optimization is to build a "covering index". That is an index that has all the columns that the SELECT needs. Then the query can be performed entirely in the index, without reaching over to the data. The SELECT in question is the innermost:
( SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC, hs.updateTime ASC
) highscore,
For hs: INDEX(skillID, skillExperience, updateTime, playerID) is "covering" and has the most important item (skillID, from the WHERE) first.
For o: INDEX(game_mode, playerID) is "covering". Again, game_mode must be first.
If you change the ORDER BY to be DESC and DESC, then add another index for hs: INDEX(skillExperience, updateTime, skillID, playerID). Now the first 2 columns must be in that order.
Conclusion
It is not obvious which of those indexes the optimizer would prefer. I suggest you add both and let it choose.
I believe that (1) the innermost query is consuming the bulk of time, and (2) there is nothing to optimize in the outer SELECTs. So, I leave that as my recommendation.
Much of this is covered in my Indexing Cookbook.
An important sub-question: how frequently does the ranking of all players change? Do you really need real-time statistics? Probably not. Pick an update interval, e.g. 10 minutes, and run a cron job that inserts the freshly computed ranking into a separate table, like this:
/* lock */
TRUNCATE TABLE rank_stat; /* or mark old rows as unused to keep history, instead of truncating */
INSERT INTO rank_stat (a, b, c, d) <your query here>;
/* unlock */
Users (browsers) then read the precomputed statistics from this table, which can also be split into pages.
But if the rank statistics do not change frequently, you can instead recalculate them on the relevant game events and/or player actions and achievements.
These are only recommendations, because you have not described the full environment, but I think they will lead you to the right solution.
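A rough sketch of such a periodic rebuild, using Python's sqlite3 module in place of MySQL so it runs on its own. The rank_stat layout, the `player_rank` column, and both helper functions are assumptions made for illustration, and the game_mode join from the question is left out to keep it short:

```python
import sqlite3

# SQLite stands in for MySQL; tables are trimmed to the columns used here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE highscores (playerID INTEGER, skillID INTEGER, "
    "skillExperience INTEGER, updateTime INTEGER, "
    "PRIMARY KEY (playerID, skillID))"
)
# Hypothetical layout for the precomputed table.
conn.execute(
    "CREATE TABLE rank_stat (skillID INTEGER, playerID INTEGER, "
    "player_rank INTEGER, PRIMARY KEY (skillID, playerID))"
)
conn.executemany(
    "INSERT INTO highscores VALUES (?, ?, ?, ?)",
    [(1, 0, 500, 100), (2, 0, 900, 200), (3, 0, 500, 50),
     (1, 1, 10, 5), (2, 1, 20, 6)],
)

def rebuild_rank_stat(conn):
    """Recompute every rank; meant to be run periodically (e.g. from cron)."""
    rows = conn.execute(
        "SELECT skillID, playerID FROM highscores "
        "ORDER BY skillID, skillExperience DESC, updateTime ASC"
    ).fetchall()
    ranked, current_skill, position = [], None, 0
    for skill_id, player_id in rows:
        if skill_id != current_skill:  # ranking restarts for each skill
            current_skill, position = skill_id, 0
        position += 1
        ranked.append((skill_id, player_id, position))
    with conn:  # delete + insert are committed together as one transaction
        conn.execute("DELETE FROM rank_stat")
        conn.executemany("INSERT INTO rank_stat VALUES (?, ?, ?)", ranked)

def lookup_rank(conn, skill_id, player_id):
    """Cheap primary-key lookup that the page request would run."""
    row = conn.execute(
        "SELECT player_rank FROM rank_stat WHERE skillID = ? AND playerID = ?",
        (skill_id, player_id),
    ).fetchone()
    return row[0] if row else None

rebuild_rank_stat(conn)
```

The page request then costs one indexed lookup per skill, at the price of ranks being up to one rebuild interval out of date.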
It doesn't look like you really need to rank everyone, you just want to find out how many people are ahead of the current player. You should be able to get a simple count of how many players have better scores & dates than the current player which represents the current player's ranking.
SELECT count(highscores.id) as rank FROM highscores
join highscores playerscore
on playerscore.skillID = highscores.skillID
and playerscore.gamemode = highscores.gamemode
where highscores.skillID = ?
AND highscores.gamemode = ?
and playerscore.playerID = ?
and (highscores.skillExperience > playerscore.skillExperience
or (highscores.skillExperience = playerscore.skillExperience
and highscores.updateTime < playerscore.updateTime));
(I joined the table to itself and aliased the second instance as playerscore so it was slightly less confusing)
You could probably even simplify it to one query by grouping and parsing the results within your language of choice.
SELECT
highscores.gamemode as gamemode,
highscores.skillID as skillID,
count(highscores.id) as rank
FROM highscores
join highscores playerscore
on playerscore.skillID = highscores.skillID
and playerscore.gamemode = highscores.gamemode
where playerscore.playerID = ?
and (highscores.skillExperience > playerscore.skillExperience
or (highscores.skillExperience = playerscore.skillExperience
and highscores.updateTime < playerscore.updateTime))
group by highscores.gamemode, highscores.skillID;
Not quite sure about the grouping bit though.
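Adapted to the schema posted in the question, the counting idea might look like the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL and rests on a few assumptions: highscores has no id or gamemode column, so it counts rows with COUNT(*) and takes the game mode from overall.game_mode; an earlier updateTime wins a tie; the rank is the number of players ahead plus one; and the player is assumed to have a row for the skill. The `player_rank` helper and the sample data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE highscores (playerID INTEGER, skillID INTEGER, "
    "skillExperience INTEGER, updateTime INTEGER, "
    "PRIMARY KEY (playerID, skillID))"
)
conn.execute("CREATE TABLE overall (playerID INTEGER PRIMARY KEY, game_mode TEXT)")
conn.executemany(
    "INSERT INTO overall VALUES (?, ?)",
    [(1, "REGULAR"), (2, "REGULAR"), (3, "REGULAR"), (4, "IRON_MAN")],
)
conn.executemany(
    "INSERT INTO highscores VALUES (?, ?, ?, ?)",
    [(1, 0, 500, 100), (2, 0, 900, 200), (3, 0, 500, 50), (4, 0, 9999, 1)],
)

# `me` is the current player's row; `other` ranges over the players ahead.
RANK_SQL = """
SELECT COUNT(*) + 1
FROM highscores AS me
JOIN highscores AS other ON other.skillID = me.skillID
JOIN overall AS o ON o.playerID = other.playerID
WHERE me.playerID = ? AND me.skillID = ? AND o.game_mode = ?
  AND (other.skillExperience > me.skillExperience
       OR (other.skillExperience = me.skillExperience
           AND other.updateTime < me.updateTime))
"""

def player_rank(conn, player_id, skill_id, game_mode):
    """Rank = 1 + number of players in this game mode who are ahead."""
    return conn.execute(RANK_SQL, (player_id, skill_id, game_mode)).fetchone()[0]
```

In this sample data, players 1 and 3 tie on experience, and player 3 ranks higher because the score was reached earlier.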
I have a database for donor and ticket sale information for a small non-profit. I'm trying to get a quick mailing list export based on people who have donated, bought a season ticket, or bought a single ticket. The "entity" table is the contact info, etc, and then the other tables hold info about the donation (year, amount, check date, etc) and has a field for "entityno" which matches it up to the primary key of entity.recordno.
Here's the query I'm running:
SELECT *
FROM
entity
LEFT JOIN individual_donation ON entity.recordno = individual_donation.entityno
LEFT JOIN season_tickets ON entity.recordno = season_tickets.entityno
LEFT JOIN single_tickets ON entity.recordno = single_tickets.entityno
WHERE
entity.ind_org = 'ind' AND
entity.address1 <> "" AND
(individual_donation.year <> 'NULL'
OR season_tickets.year <> 'NULL'
OR single_tickets.year <> 'NULL')
GROUP BY entity.lastname
ORDER BY entity.lastname ASC
This database is on BlueHost, and I'm accessing it through PHPmyadmin. The strange thing is that the query runs just fine when I preview it in PHPmyadmin - it returns 216 rows, and I can view all the rows within the SQL command browser and it loads just fine.
The problem is that every time I use PHPmyadmin's "export" command under the query results operations, I get the following error:
#1104 - The SELECT would examine more than MAX_JOIN_SIZE rows; check your WHERE and use SET SQL_BIG_SELECTS=1 or SET MAX_JOIN_SIZE=# if the SELECT is okay
Each of the tables is only about 300-400 rows at most, so I'm surprised that I'm getting a MAX_JOIN_SIZE error. It's also really strange to me that the sql query works just fine as is, but won't work on the export??
I'm sure I could do better JOINs etc, but I don't understand why the query runs fine, but just won't export.
EDIT:
here's the EXPLAIN EXTENDED result
id  select_type  table                type  possible_keys  key   key_len  ref   rows  filtered  Extra
1   SIMPLE       entity               ALL   NULL           NULL  NULL     NULL  429   100.00    Using where; Using temporary; Using filesort
1   SIMPLE       individual_donation  ALL   NULL           NULL  NULL     NULL  221   100.00
1   SIMPLE       season_tickets       ALL   NULL           NULL  NULL     NULL  102   100.00
1   SIMPLE       single_tickets       ALL   NULL           NULL  NULL     NULL  217   100.00    Using where
Further Information:
Strange - my webhost doesn't allow FILE permissions for MySQL users, so I can't use SELECT ... INTO OUTFILE. I tried using ssh access, running the query and redirecting the output (>) into a file, and I still get the MAX_JOIN_SIZE error. I still don't understand why it works fine as a phpMyAdmin query in the browser, but neither the phpMyAdmin export nor the command line works.
Doing a better job with indexes seems to have solved my problem, although still not sure why.
I made sure that the "entityno" column in the three referenced tables, which is the reference to the primary key in the entity table, were set as indexes. This seems to have solved whatever was causing the large number of returned rows in some intermediate step with my query. For reference, this is now the explain extended result:
id  select_type  table                type  possible_keys  key       key_len  ref                           rows  filtered  Extra
1   SIMPLE       entity               ALL   NULL           NULL      NULL     NULL                          429   100.00    Using where; Using temporary; Using filesort
1   SIMPLE       individual_donation  ref   entityno       entityno  3        dakotask_ds1.entity.recordno  3     100.00
1   SIMPLE       season_tickets       ref   entityno       entityno  3        dakotask_ds1.entity.recordno  2     100.00
1   SIMPLE       single_tickets       ref   entityno       entityno  3        dakotask_ds1.entity.recordno  1     100.00    Using where
Try running this as a query before executing your main query:
mysql_query("SET SQL_BIG_SELECTS=1");
You can try:
$query = "SELECT ... ";
mysqli_query($databaseConnection, "SET SQL_BIG_SELECTS=1");
$results = mysqli_query($databaseConnection, $query);
There is a big table, 1,000,000,000 rows, called threads (these threads actually exist; I'm not making things harder just because I enjoy it). To keep things fast, it holds only a few columns: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)
Query:
select * from thread where forumid = 100 and replycount > 1 order by dateline desc limit 10000, 100
Since there are 1G records, it's quite a slow query. So I thought: let's split these 1G records into as many tables as I have forums (categories)! That is almost perfect. With many tables, each query has fewer records to search, and it's really faster. The query now becomes:
select * from thread_{forum_id} where replycount > 1 order by dateline desc limit 10000, 100
This is really faster for 99% of the forums (categories), since most of them have only a few topics (100k-1M). However, because some have about 10M records, some queries are still too slow (0.1-0.2 seconds, too much for my app, and I'm already using indexes!).
I don't know how to improve this using MySQL. Is there a way?
For this project I will use 10 Servers (12GB ram, 4x7200rpm hard disk on software raid 10, quad core)
The idea was to simply split the databases among the servers, but with the problem explained above that is still not enough.
If I install Cassandra on these 10 servers (supposing I find the time to make it work as it is supposed to), should I expect a performance boost?
What should I do? Keep working with MySQL with a distributed database on multiple machines, or build a Cassandra cluster?
I was asked to post what are the indexes, here they are:
mysql> show index in thread;
PRIMARY id
forumid
dateline
replycount
Select explain:
mysql> explain SELECT * FROM thread WHERE forumid = 655 AND visible = 1 AND open <> 10 ORDER BY dateline ASC LIMIT 268000, 250;
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
| 1 | SIMPLE | thread | ref | forumid | forumid | 4 | const,const | 221575 | Using where; Using filesort |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
You should read the following and learn a little bit about the advantages of a well designed innodb table and how best to use clustered indexes - only available with innodb !
http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
then design your system something along the lines of the following simplified example:
Example schema (simplified)
The important features are that the tables use the innodb engine and the primary key for the threads table is no longer a single auto_incrementing key but a composite clustered key based on a combination of forum_id and thread_id. e.g.
threads - primary key (forum_id, thread_id)
forum_id thread_id
======== =========
1 1
1 2
1 3
1 ...
1 2058300
2 1
2 2
2 3
2 ...
2 2352141
...
Each forum row includes a counter called next_thread_id (unsigned int) which is maintained by a trigger and increments every time a thread is added to a given forum. This also means we can store 4 billion threads per forum rather than 4 billion threads in total if using a single auto_increment primary key for thread_id.
forum_id title next_thread_id
======== ===== ==============
1 forum 1 2058300
2 forum 2 2352141
3 forum 3 2482805
4 forum 4 3740957
...
64 forum 64 3243097
65 forum 65 15000000 -- ooh a big one
66 forum 66 5038900
67 forum 67 4449764
...
247 forum 247 0 -- still loading data for half the forums !
248 forum 248 0
249 forum 249 0
250 forum 250 0
The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:
select * from threads where thread_id = y;
you have to do:
select * from threads where forum_id = x and thread_id = y;
However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...
Here's the simplified schema:
drop table if exists forums;
create table forums
(
forum_id smallint unsigned not null auto_increment primary key,
title varchar(255) unique not null,
next_thread_id int unsigned not null default 0 -- count of threads in each forum
)engine=innodb;
drop table if exists threads;
create table threads
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
reply_count int unsigned not null default 0,
hash char(32) not null,
created_date datetime not null,
primary key (forum_id, thread_id, reply_count) -- composite clustered index
)engine=innodb;
delimiter #
create trigger threads_before_ins_trig before insert on threads
for each row
begin
declare v_id int unsigned default 0;
select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
set new.thread_id = v_id;
update forums set next_thread_id = v_id where forum_id = new.forum_id;
end#
delimiter ;
You may have noticed I've included reply_count as part of the primary key which is a bit strange as (forum_id, thread_id) composite is unique in itself. This is just an index optimisation which saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further info on this.
Example queries
I'm still loading data into my example tables and so far I have loaded approx. 500 million rows (half as many as your system). When the load process is complete I expect to have approx.:
250 forums * 5 million threads = 1250 000 000 (1.2 billion rows)
I've deliberately made some of the forums contain more than 5 million threads for example, forum 65 has 15 million threads:
forum_id title next_thread_id
======== ===== ==============
65 forum 65 15000000 -- ooh a big one
Query runtimes
select sum(next_thread_id) from forums;
sum(next_thread_id)
===================
539,155,433 (500 million threads so far and still growing...)
under innodb summing the next_thread_ids to give a total thread count is much faster than the usual:
select count(*) from threads;
How many threads does forum 65 have:
select next_thread_id from forums where forum_id = 65
next_thread_id
==============
15,000,000 (15 million)
again this is faster than the usual:
select count(*) from threads where forum_id = 65
Ok now we know we have about 500 million threads so far and forum 65 has 15 million threads - let's see how the schema performs :)
select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;
runtime = 0.022 secs
select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;
runtime = 0.027 secs
Looks pretty performant to me - so that's a single table with 500+ million rows (and growing) with a query that covers 15 million rows in 0.02 seconds (while under load !)
Further optimisations
These would include:
partitioning by range
sharding
throwing money and hardware at it
etc...
hope you find this answer helpful :)
EDIT: Your one-column indices are not enough. You would need to, at least, cover the three involved columns.
More advanced solution: replace replycount > 1 with hasreplies = 1 by creating a new hasreplies field that equals 1 when replycount > 1. Once this is done, create an index on the three columns, in that order: INDEX(forumid, hasreplies, dateline). Make sure it's a BTREE index to support ordering.
You're selecting based on:
a given forumid
a given hasreplies
ordered by dateline
Once you do this, your query execution will involve:
moving down the BTREE to find the subtree that matches forumid = X. This is a logarithmic operation (duration : log(number of forums)).
moving further down the BTREE to find the subtree that matches hasreplies = 1 (while still matching forumid = X). This is a constant-time operation, because hasreplies is only 0 or 1.
moving through the dateline-sorted subtree in order to get the required results, without having to read and re-sort the entire list of items in the forum.
My earlier suggestion to index on replycount was incorrect, because it would have been a range query and thus prevented the use of a dateline to sort the results (so you would have selected the threads with replies very fast, but the resulting million-line list would have had to be sorted completely before looking for the 100 elements you needed).
IMPORTANT: while this improves performance in all cases, your huge OFFSET value (10000!) is going to decrease performance, because MySQL does not seem to be able to skip ahead despite reading straight through a BTREE. So, the larger your OFFSET is, the slower the request will become.
I'm afraid the OFFSET problem is not automagically solved by spreading the computation over several computers (how do you skip an offset in parallel, anyway?) or moving to NoSQL. All solutions (including NoSQL ones) will boil down to simulating OFFSET based on dateline (basically saying dateline > Y LIMIT 100 instead of LIMIT Z, 100 where Y is the date of the item at offset Z). This works, and eliminates any performance issues related to the offset, but prevents going directly to page 100 out of 200.
Part of the question is about choosing between NoSQL and MySQL, and there is one fundamental thing hidden here. SQL is easy for humans to write but somewhat harder for a computer to read. For high-volume databases I would recommend avoiding an SQL backend, because it requires an extra step: command parsing. I have done extensive benchmarking, and there are cases where the SQL parser is the slowest point. There is little you can do about it, other than possibly using pre-parsed (prepared) statements.
BTW, it is not widely known, but MySQL grew out of a NoSQL database. The company where MySQL's authors David and Monty worked was a data warehousing company, and they often had to write custom solutions for uncommon tasks. This led to a big stack of homebrew C libraries used to write database functions by hand when Oracle and others performed poorly. SQL was added to this nearly 20-year-old zoo in 1996 for fun. You know what came after.
You can actually avoid the SQL overhead with MySQL. Usually SQL parsing is not the slowest part, but it is good to know. To measure the parser overhead you can, for example, benchmark "SELECT 1" ;).
You should not be trying to fit a database architecture to hardware you're planning to buy, but instead plan to buy hardware to fit your database architecture.
Once you have enough RAM to keep the working set of indexes in memory, all your queries that can make use of indexes will be fast. Make sure your key buffer is set large enough to hold the indexes.
So if 12GB is not enough, don't use 10 servers with 12GB of RAM, use fewer with 32GB or 64GB of RAM.
Indices are a must - but remember to choose the right type of index: BTREE is more suitable when using queries with "<" or ">" in your WHERE clauses, while HASH is more suitable when you have many distinct values in one column and you are using "=" or "<=>" in your WHERE clause.
Further reading http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
I have approx. 200K rows in a table tb_post, and every 5 minutes it gets approx. 10 new inserts.
I'm using the following query to fetch the rows:
SELECT tb_post.ID, tb_post.USER_ID, tb_post.TEXT, tb_post.RATING, tb_post.CREATED_AT,
tb_user.ID, tb_user.NAME
FROM tb_post, tb_user
WHERE tb_post.USER_ID=tb_user.ID
ORDER BY tb_post.RATING DESC
LIMIT 30
It takes more than 10 seconds to fetch the rows in sorted order.
The following is the output of EXPLAIN for the query:
id  select_type  table    type  possible_keys  key           key_len  ref         rows   Extra
1   SIMPLE       tb_user  ALL   PRIMARY        NULL          NULL     NULL        20950  Using temporary; Using filesort
1   SIMPLE       tb_post  ref   tb_post_FI_1   tb_post_FI_1  4        tb_user.id  4
Few inputs:
tb_post.RATING is Float type
There is index on tb_post.USER_ID
Can anyone give me a few pointers on how to optimize this query and improve its read performance?
PS: I'm a newbie with database scaling issues, so any kind of suggestion specific to this query will be useful.
You need an index for tb_post that covers both the ORDER BY and WHERE clause.
CREATE INDEX idx2 on tb_post (rating,user_id)
=> output of EXPLAIN SELECT ...ORDER BY tb_post.RATING DESC LIMIT 30
"id";"select_type";"table";"type";"possible_keys";"key";"key_len";"ref";"rows";"Extra"
"1";"SIMPLE";"tb_post";"index";NULL;"idx2";"10";NULL;"352";""
"1";"SIMPLE";"tb_user";"eq_ref";"PRIMARY";"PRIMARY";"4";"test.tb_post.USER_ID";"1";""
You could try to index tb_post.RATING: MySQL can sometimes use indexes to optimize ORDER BY clauses : http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
If you're trying to aggregate data from different tables, you could also check which type of join ( http://en.wikipedia.org/wiki/Join_(SQL) ) you want. Some are better than others, depending on what you want.
What happens if you take the ORDER BY off, does that have a performance impact? If that has a large effect then maybe consider indexing tb_post.RATING.
Karl