I have 2 databases (users, userRankings) for a system that needs to have rankings updated every 10 minutes. I use the following code to update these rankings which works fairly well, but there is still a full table scan involved which slows things down with a few hundred thousand users.
mysql_query("TRUNCATE TABLE userRankings");
mysql_query("INSERT INTO userRankings (userid) SELECT id FROM users ORDER BY score DESC");
mysql_query("UPDATE users a, userRankings b SET a.rank = b.rank WHERE a.id = b.userid");
In the userRankings table, rank is the primary key and userid is an index. Both tables are MyISAM (I've wondered if it might be beneficial to make userRankings InnoDB).
Try something like this, it should give you the same effect without needing to build a separate table. (my MySql is a bit rusty, but this should work OK)
select #rownum:=#rownum+1 ‘rank’, a.id
from player a, (SELECT #rownum:=0) r
order by score desc
Which query is it that's actually taking most of the time? I'll wager it's the third one.
Benchmark your query by converting it to a SELECT:
SELECT a.rank, b.rank
FROM users a, userRankings b
WHERE a.id = b.userid
Explain can help you debug how indexes are being used.
The first thing you should do is drop that old table1,table2 syntax. It's difficult to read, and can be very inefficient. You should always use JOINs.
UPDATE users AS a
JOIN userRankings AS b ON a.id = b.userid
SET a.rank = b.rank
That may fix your problem right there - hopefully EXPLAIN can help you narrow in on specific issues. Make sure that both columns (users.id and userRankings.userid) are the exact same type (as in, one of them isn't signed while the other is, for example).
For more help, you would need to post more data (the SHOW CREATE TABLE results for each table, how long each of the queries take when you run them, the number of rows in the tables).
Related
I'm currently grabbing the last 5 comments posted on my website, which I am seemingly doing quite badly I think.
Here is the SQL query:
SELECT c.comment_id
, c.article_id
, c.time_posted
, a.title
, a.slug
, u.username
FROM articles_comments c
JOIN articles a
ON c.article_id = a.article_id
JOIN users u
ON u.user_id = c.author_id
WHERE a.active = 1
AND c.approved = 1
ORDER
BY c.comment_id DESC
LIMIT 5
My problem, is that has to search through a lot of rows, it seems quite wasteful. I'm curious if there's a better way to do it.
Here's the output of explain on it:
As you can see, the rows it's giving is 81,486 which seems kind of hilarious. What am I missing here?
Turns out, simply forcing articles_comments to use the PRIMARY key (comment_id) as the INDEX fixed it.
The issue is that my sorting is picking ALL approved comments, so it was using the approved column to sort resulting in it picking data from tens of thousands.
c: INDEX(approved, comment_id) -- in this order
a: I assume you have PRIMARY KEY(article_id)
u: I assume you have PRIMARY KEY(user_id)
The hope is that the c index will handle some of the WHERE, plus the ORDER BY and LIMIT. The worst case is that it must scan the entire table without finding 5 rows.
The 81,486 is bogus. Here's a precise way to get good info:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The 'reads' will indicate how many rows of data and index were touched; the writes indicate temp table(s) being used.
I'm currently running this query. However, when run outside phpMyAdmin it causes a 504 timeout error. I'm thinking it has to do with how efficient the number of rows is returned or accessed by the query.
I'm not extremely experienced with MySQL and so this was the best I could do:
SELECT
s.surveyId,
q.cat,
SUM((sac.answer_id*q.weight))/SUM(q.weight) AS score,
user.division_id,
user.unit_id,
user.department_id,
user.team_id,
division.division_name,
unit.unit_name,
dpt.department_name,
team.team_name
FROM survey_answers_cache sac
JOIN surveys s ON s.surveyId = sac.surveyid
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
JOIN cluster c ON sc.cluster_id = c.cluster_id
JOIN user ON user.user_id = sac.user_id
JOIN questions q ON q.question_id = sac.question_id
JOIN division ON division.division_id = user.division_id
LEFT JOIN unit ON unit.unit_id = user.unit_id
LEFT JOIN department dpt ON dpt.department_id = user.department_id
LEFT JOIN team ON team.team_id = user.team_id
WHERE c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
GROUP BY user.team_id, s.surveyId, q.cat
ORDER BY s.surveyId, user.team_id, q.cat ASC
The problem I get with this query is that when I get a correct result returned it runs quickly (let's say +-500ms) but when the result has twice as much rows, it takes more than 5 minutes and then causes a 504 timeout.
The other problem is that I didn't create this database myself, so I didn't set the indices myself. I'm thinking of improving these and therefore I used the explain command:
I see a lot of primary keys and a couple double indices, but I'm not sure if this would affect the performance this greatly.
EDIT: This piece of code takes up all the execution time:
$start_time = microtime(true);
$stmt = $conn->query($query); //query is simply the query above.
while ($row = $stmt->fetch_assoc()){
$resultSurveys["scores"][] = $row;
}
$stmt->close();
$end_time = microtime(true);
$duration = $end_time - $start_time; //value typically the execution time #reallyHigh...
So my question: Is it possible to (greatly?) improve the performance of the query by altering the database keys or should I divide my query into multiple smaller queries?
You can try something like this ( although its not practical for me to test this )
SELECT
sac.surveyId,
q.cat,
SUM((sac.answer_id*q.weight))/SUM(q.weight) AS score,
user.division_id,
user.unit_id,
user.department_id,
user.team_id,
division.division_name,
unit.unit_name,
dpt.department_name,
team.team_name
FROM survey_answers_cache sac
JOIN
(
SELECT
s.surveyId,
sc.subcluster_id
FROM
surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE
c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
) AS v ON v.surveyid = sac.surveyid
JOIN user ON user.user_id = sac.user_id
JOIN questions q ON q.question_id = sac.question_id
JOIN division ON division.division_id = user.division_id
LEFT JOIN unit ON unit.unit_id = user.unit_id
LEFT JOIN department dpt ON dpt.department_id = user.department_id
LEFT JOIN team ON team.team_id = user.team_id
GROUP BY user.team_id, v.surveyId, q.cat
ORDER BY v.surveyId, user.team_id, q.cat ASC
So I hope I didn't mess anything up.
Anyway, the idea is in the inner query you select only the rows you need based on your where condition. This will create a smaller tmp table as it only pulls 2 fields both ints.
Then in the outer query you join to the tables that you actually pull the rest of the data from, order and group. This way you are sorting and grouping on a smaller dataset. And your where clause can run in the most optimal way.
You may even be able to omit some of these tables as your only pulling data from a few of them, but without seeing the full schema and how it's related that's hard to say.
But just generally speaking this part (The sub-query)
SELECT
s.surveyId,
sc.subcluster_id
FROM
surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE
c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
Is what is directly affected by your WHERE clause. See so we can optimize this part then use it to join the rest of the data you need.
An example of removing tables can be easily deduced from the above, consider this
SELECT
s.surveyId,
sc.subcluster_id
FROM
surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
WHERE
sc.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
The c table cluster is never used to pull data from, only for the where. So is not
JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE
c.cluster_id=?
The same as or equivalent to
WHERE
sc.cluster_id=?
And therefore we can eliminate that join completely.
The EXPLAIN result is showing signs of problem
Using temporary;using filesort: the ORDER BY needs to create temporary tables to do the sorting.
On 3rd row for user table type is ALL, key and ref are NULL: means that it needs to scan the whole table each time to retrieve results.
Suggestions:
add indexes on user.cluster_id and all fields involved on the ORDER BY and GROUP by clauses. Keep in mind that user table seems to be under changein database (cross database query).
Add indexes on user columns involved on JOINs.
Add index to s.survey_id
If possible, keep the same sequence for GROUP BY and ORDER BY clauses
According to the accepted answer in this question move the JOIN on user table to the first position in the join queue.
Carefully read this official documentation. You may need to optimize the server configuration.
PS: query optimization is an art that requires patience and hard work. No silver bullet for that.
Welcome to the fine art of optimizing MySQL!
i think the problem happends when you add this:
JOIN user ON user.cluster_id = sc.subcluster_id
JOIN survey_answers_cache sac ON (sac.surveyId = s.surveyId AND sac.user_id = user.user_id)
the extra condition sac.user_id = user.user_id can be easily not consistent.
Can you try do a second join with user table?
pd. can you add a "SHOW CREATE TABLE"
So I have a query that works exactly as intended but needs heavy optimization. I am using software to track load times across my site and 97.8% of all load time on my site is a result of this one function. So to explain my database a little bit before getting to the query.
First I have a films table, a competitions table and a votes table. Films can be in many competitions and competitions have many films, since this is a many to many relationship we have a pivot table to show their relationship (filmCompetition) When loading the competition page however these films need to be in order of their votes (most voted at the top. least at the bottom).
Now In the query below you can see what I am doing, grabbing from the films from the filmsCompetition table that match the current competition id $competition->id, and then I order by the total number of votes for that film. Like I said this works but is super efficient but I cannot think of another way to do it.
$films = DB::select( DB::raw("SELECT f.*, COUNT(v.value) AS totalVotes
FROM filmCompetition AS fc
JOIN films AS f ON f.id = fc.filmId AND fc.competitionId = '$competition->id'
LEFT JOIN votes AS v ON f.id = v.filmId AND v.competitionId = '$competition->id'
GROUP BY f.id
ORDER BY totalVotes DESC
") );
For this query, you want indexes on filmCompetition(competitionId, filmId), films(id), and votes(filmId, competitionId)`.
However, it is probably more efficient to write the query like this:
SELECT f.*,
(SELECT COUNT(v.value)
FROM votes v
WHERE v.filmId = f.id and v.competitionId = '$competition->id'
) AS totalVotes
FROM films f
WHERE EXISTS (SELECT 1
FROM FilmCompetition fc
WHERE fc.FilmId = f.Filmid AND
fc.competitionId = '$competition->id'
)
ORDER BY TotalVotes DESC
This saves the outer aggregation, which should be a performance win. For this version, the indexes are FilmCompetition(FilmId, CompetitionId) and Votes(FilmId, CompetitionId).
The solution I actually ended up using was a little different but since Gordon answered the question I marked him as the correct answer.
My Solution
To actually fix this I did a slightly different approach, rather than trying to do all of this in SQL I did my three queries separately and then joined them together in PHP. While this can be slower in my case doing it my original way took about 15 seconds, doing it Gordons way took about 7, and doing it my new way took about 600ms.
I am using MySQL tables that have the following data:
users(ID, name, email, create_added) (about 10000 rows)
points(user_id, point) (about 15000 rows)
And my query:
SELECT u.*, SUM(p.point) point
FROM users u
LEFT JOIN points p ON p.user_id = u.ID
WHERE u.id > 0
GROUP BY u.id
ORDER BY point DESC
LIMIT 0, 10
I only get the top 10 users having best point, but then it dies. How can I improve the performance of my query?
Like #Grim said, you can use INNER JOIN instead of LEFT JOIN. However, if you truly look for optimization, I would suggest you to have an extra field at table users with a precalculate point. This solution would beat any query optimization with your current database design.
Swapping the LEFT JOIN for an INNER JOIN would help a lot. Make sure points.point and points.user_id are indexed. I assume you can get rid of the WHERE clause, as u.id will always be more than 0 (although MySQL probably does this for you at the query optimisation stage).
It doesn't really matter than you are getting only 10 rows. MySQL has to sum up the points for every user, before it can sort them ("Using filesort" operation.) That LIMIT is applied last.
A covering index ON points(user_id,point) is going to be the best bet for optimum performance. (I'm really just guessing, without any EXPLAIN output or table definitions.)
The id column in users is likely the primary key, or at least a unique index. So it's likely you already have an index with id as the leading column, or primary key cluster index if it's InnoDB.)
I'd be tempted to test a query like this:
SELECT u.*
, s.total_points
FROM ( SELECT p.user_id
, SUM(p.point) AS total_points
FROM points p
WHERE p.user_id > 0
GROUP BY p.user_id
ORDER BY total_points DESC
LIMIT 10
) s
JOIN user u
ON u.id = s.user_id
ORDER BY s.total_points DESC
That does have the overhead of creating a derived table, but with a suitable index on points, with a leading column of user_id, and including the point column, it's likely that MySQL can optimize the group by using the index, and avoiding one "Using filesort" operation (for the GROUP BY).
There will likely be a "Using filesort" operation on that resultset, to get the rows ordered by total_points. Then get the first 10 rows from that.
With those 10 rows, we can join to the user table to get the corresponding rows.
BUT.. there is one slight difference with this result, if any of the values of user_id that are in the top 10 which aren't in the user table, then this query will return less than 10 rows. (I'd expect there to be a foreign key defined, so that wouldn't happen, but I'm really just guessing without table definitions.)
An EXPLAIN would show the access plan being used by MySQL.
Ever thought about partitioning?
I'm currently working with large database and successfully improve sql query.
For example,
PARTITION BY RANGE (`ID`) (
PARTITION p1 VALUES LESS THAN (100) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (200) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (300) ENGINE = InnoDB,
... and so on..
)
It allows us to get better speed while scanning mysql table. Mysql will scan only partition p 1 that contains userid 1 to 99 even if there are million rows in table.
Check out this http://dev.mysql.com/doc/refman/5.5/en/partitioning.html
So I've got a little forum I'm trying to get data for, there are 4 tables, forum, forum_posts, forum_threads and users. What i'm trying to do is to get the latest post for each forum and giving the user a sneak peek of that post, i want to get the number of posts and number of threads in each forum aswell. Also, i want to do this in one query. So here's what i came up with:
SELECT lfx_forum_posts.*, lfx_forum.*, COUNT(lfx_forum_posts.pid) as posts_count,
lfx_users.username,
lfx_users.uid,
lfx_forum_threads.tid, lfx_forum_threads.parent_forum as t_parent,
lfx_forum_threads.text as t_text, COUNT(lfx_forum_threads.tid) as thread_count
FROM
lfx_forum
LEFT JOIN
(lfx_forum_threads
INNER JOIN
(lfx_forum_posts
INNER JOIN lfx_users
ON lfx_users.uid = lfx_forum_posts.author)
ON lfx_forum_threads.tid = lfx_forum_posts.parent_thread AND lfx_forum_posts.pid =
(SELECT MAX(lfx_forum_posts.pid)
FROM lfx_forum_posts
WHERE lfx_forum_posts.parent_forum = lfx_forum.fid
GROUP BY lfx_forum_posts.parent_forum)
)
ON lfx_forum.fid = lfx_forum_posts.parent_forum
GROUP BY lfx_forum.fid
ORDER BY lfx_forum.fid ASC
This get the latest post in each forum and gives me a sneakpeek of it, the problem is that
lfx_forum_posts.pid =
(SELECT MAX(lfx_forum_posts.pid)
FROM lfx_forum_posts
WHERE lfx_forum_posts.parent_forum = lfx_forum.fid
GROUP BY lfx_forum_posts.parent_forum)
Makes my COUNT(lfx_forum_posts.pid) go to one (aswell as the COUNT(lfx_forum_threads.tid) which isn't how i would like it to work. My question is: is there some somewhat easy way to make it show the correct number and at the same time fetch the correct post info (the latest one that is)?
If something is unclear please tell and i'll try to explain my issue further, it's my first time posting something here.
Hard to get an overview of the structure of your tables with only one big query like that.
Have you considered making a view to make it easier and faster to run the query?
Why do you have to keep it in one query? Personally I find that you can often gain both performance and code-readability by splitting overly complicated queries into more parts.
But hard to get an overview so can't really give a good answer to your question:)
Just add num_posts column to your table. Don't count posts with COUNT().
Can we get some...
Show Tables;
Desc Table lfx_forum_posts;
Desc Table lfx_forum_threads;
Desc Table lfx_forum_users;
Desc Table lfx_forum;
Here's some pseudo code
select f.*, (select count(*) from forum_posts fp where fp.forum_id = f.id) as num_posts,
(select count(*) from forum_threads ft where ft.forum_id = f.id) as num_threads,
(select max(fp.id) from forum_posts fp
where fp.id = f.id
) as latest_post_id,
from forums f;
Then go on to use latest_post_id in a seperate query to get it's information.
if it doesn't like using f before it's declared then make a temporary table for this then you update every time the query is ran.