I have a query over two tables:
matchoverview (id, home_id, away_id, date, season, result)
matchattributes (id, match_id, attribute_id, attribute_value)
My query
select m.id from matchOverview m
join matchAttributes ma on ma.match_id=m.id and ma.attribute_id in (3,4,5,6)
group by m.id
having sum(case when ma.attribute_id in (3,4)
then ma.attribute_value end) > 3
or sum(case when ma.attribute_id in (5,6)
then ma.attribute_value end) > 3;
This returns all match ids where the sum of attributes 3 and 4, or the sum of attributes 5 and 6, is greater than 3.
This particular query returns roughly 900K rows. Unsurprisingly, phpMyAdmin takes a while with it, as I imagine it needs to format the results into a table, but it clocks the query itself at 0.0113 seconds.
Yet when I run the same query from PHP it takes 15 seconds. If I alter the query to LIMIT it to only 100 results, it runs almost instantly, which leaves me believing that the amount of data being transferred is what is slowing it down.
But would it really take 15 seconds to transfer 1M 4-byte ints over the network?
Is the only solution to restrict the query further so that it returns fewer results?
EDIT
Results of an EXPLAIN on my query
id  select_type  table  type   possible_keys    key      key_len  ref                 rows     Extra
1   SIMPLE       m      index  PRIMARY          PRIMARY  4        NULL                2790717  Using index
1   SIMPLE       ma     ref    match,attribute  match    4        opta_matches2.m.id  2        Using where
How I am timing my SQL query
$time_pre = microtime(true);
$quer = $db->query($sql);
$time_post = microtime(true);
$exec_time = $time_post - $time_pre;
Data from slow query log
# Thread_id: 15 Schema: opta_matches2 QC_hit: No
# Query_time: 15.594386 Lock_time: 0.000089 Rows_sent: 923962 Rows_examined: 15688514
# Rows_affected: 0 Bytes_sent: 10726615
I am OK with dealing with a 15-second query if that is simply how long it takes the data to move over the network, but if the query or my table can be optimized, that is the better solution.
The row count is not the issue; the following query
select m.id from matchOverview m
join matchAttributes ma on ma.match_id=m.id and ma.attribute_id in (1,2,3,4)
group by m.id
having sum(case when ma.attribute_id in (3,4)
then ma.attribute_value end) > 8
and sum(case when ma.attribute_id in (1,2)
then ma.attribute_value end) = 0;
returns only 24 rows, yet it also takes ~15 seconds.
phpMyAdmin doesn't fetch all the results; it applies a default LIMIT of 25 rows.
If you raise that limit, either via the "Number of rows" select box or by typing a LIMIT into the query yourself, the query will take correspondingly longer to run.
I think if you rewrote the conditions, at a minimum you might learn something. For instance, I think this does the same as the second example (the 24-result one):
SELECT
  m.id
  , at.total_12
  , at.total_34
FROM matchOverview AS m
JOIN (
  SELECT
    ma.match_id
    , SUM(IF(ma.attribute_id IN (1,2), ma.attribute_value, 0)) AS total_12
    , SUM(IF(ma.attribute_id IN (3,4), ma.attribute_value, 0)) AS total_34
  FROM matchAttributes AS ma
  WHERE ma.attribute_id IN (1,2,3,4)
  GROUP BY ma.match_id
) AS at ON at.match_id = m.id
WHERE at.total_12 = 0
  AND at.total_34 > 8
It's more verbose, but it could help triangulate where the bottleneck(s) come from more readily.
For instance, if (a working) version of the above is still slow, then run the inner query with the GROUP BY intact. Still slow? Remove the GROUP BY. Move the GROUP BY/SUM into the outer query, what happens?
That kind of thing. I can't run it myself, so I can't work out a more precise answer, much as I'd like to.
There are probably two significant parts to the timing: Locate the rows and decide which ids to send; then send them. I will address both.
Here's a way to better separate the elapsed time for just the query (and not the network): SELECT COUNT(*) FROM (...) AS x; where '...' is the 1M-row query.
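Applied to the query from the question, that check might look like this (just a sketch; it returns a single count to the client instead of streaming 900K ids):

SELECT COUNT(*)
FROM (
    SELECT m.id
    FROM matchOverview m
    JOIN matchAttributes ma ON ma.match_id = m.id AND ma.attribute_id IN (3,4,5,6)
    GROUP BY m.id
    HAVING SUM(CASE WHEN ma.attribute_id IN (3,4) THEN ma.attribute_value END) > 3
        OR SUM(CASE WHEN ma.attribute_id IN (5,6) THEN ma.attribute_value END) > 3
) AS x;

If this still takes ~15 seconds, the time is in the query itself, not in shipping rows.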
Speeding up the query
Since you aren't really using matchoverview, let's get rid of it:
select ma.match_id
from matchAttributes ma
WHERE ma.attribute_id in (3,4,5,6)
group by ma.match_id
having sum(case when ma.attribute_id in (3,4) then ma.attribute_value end) > 3
or sum(case when ma.attribute_id in (5,6) then ma.attribute_value end) > 3;
And have a composite index with the columns in this order:
INDEX(attribute_id, attribute_value, match_id)
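A sketch of adding that index (the index name idx_attr_val_match is arbitrary):

ALTER TABLE matchAttributes
    ADD INDEX idx_attr_val_match (attribute_id, attribute_value, match_id);

With that in place, the rewritten query above can be satisfied entirely from the index ("covering"), without touching the table rows.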
As for the speedy LIMIT, that is because it can stop short. But a LIMIT without an ORDER BY is rather meaningless. If you add an ORDER BY, it will have to gather all the results, sort them, and finally perform the LIMIT.
Network transfer time
Transferring millions of rows (I see 10.7MB in the slowlog) over the network is time-consuming, but takes virtually no CPU time.
One EXPLAIN implies that there might be 2.8M rows; is that about correct? The slowlog says that about 16M rows are touched -- this may be because of the two tables, join, group by, etc. My reformulation and index should decrease the 16M significantly, hence decrease the elapsed time (before the network transfer time).
923K rows "sent" -- what will the client do with that many rows? In general, I find that more than a few thousand rows "sent" indicates poor design.
"take 15 seconds to transfer 1M 4 byte ints over the network" -- That is elapsed time, and cannot be sped up except by sending fewer rows. (BTW, they are probably sent as strings of several digits, plus per-row overhead; I don't know whether the 10726615 is actual network bytes or counts only the ints.)
"the ids are used in an internal calculation" -- How do you calculate with ids? If you are looking up the ids in some other place, perhaps you can add complexity to the query, thereby doing more work before hitting the network; then shipping less data?
If you want to discuss further, please provide SHOW CREATE TABLE. (It may have some details that don't show up in your simplified table definition.)
Related
Context:
We have a website where users (merchants) can add their apps/websites into the system and pay their users via API. The problem comes when we have to show the list of those transactions to the merchant on their dashboard. Each merchant generates hundreds of transactions per second, averaging around 2 million transactions per day, and on the dashboard we have to show today's stats to the merchant.
Main Problem:
We have to show today's transactions to the merchant, which is around 2 million records for a single merchant.
So a query like this,
SELECT * FROM transactions WHERE user_id = 123 LIMIT 0,15
Rows examined are 2 million in our example, and that cannot be reduced in any way. The LIMIT doesn't help here, I think, because MySQL will still examine all rows and then pick the first 15 from the result set.
How can we optimize queries like this, where we have to show millions of records (with pagination, of course) to the user?
Edit:
Explain output:
Query:
explain select a.id, a.user_app_id, a.created_at, a.type, a.amount, a.currency_id, b.name, b.url
from transactions as a
left join user_apps as b on a.user_app_id = b.id
where a.sender_user_id = ?
  and a.created_at BETWEEN '2020-03-20' AND '2020-03-21'
order by a.created_at desc
limit 15 offset 0
Details:
Index sender_user_id_2 is a composite index on the sender_user_id (int) and created_at (timestamp) columns.
This query takes 5 to 15 seconds to return 15 rows.
If I run the same query for a sender_user_id that has only 24 transactions in the table, the response is instant.
First, let's fix what might be a bug: You are including two midnights in that "day". BETWEEN is "inclusive".
AND a.created_at BETWEEN '2020-03-20' AND '2020-03-21'
-->
AND a.created_at >= '2020-03-20'
AND a.created_at < '2020-03-20' + INTERVAL 1 DAY
(There is no performance change, just the elimination of tomorrow's midnight.)
In your simple query, only 15 rows will be touched due to the LIMIT. However, for more complex queries it may need to gather all rows, sort them, and only then peel off 15 rows. The technique for preventing that inefficiency goes something like this: Devise, if possible, an INDEX that handles all of the WHERE and the ORDER BY.
where a.sender_user_id = ?
AND a.created_at >= '2020-03-20'
AND a.created_at < '2020-03-20' + INTERVAL 1 DAY
order by a.created_at desc
needs INDEX(sender_user_id, created_at) -- in that order. (And, in your query, nothing else encroaches on that.)
Pagination via OFFSET introduces another performance problem -- it must step over all OFFSET rows before getting the ones you want. This is solvable by remembering where you left off.
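A sketch of "remembering where you left off" (keyset pagination), using the columns from the question; the literal values stand in for the bound sender_user_id and for the created_at and id of the last row already shown:

-- Next page: start strictly after the last row of the previous page,
-- instead of using OFFSET to step over it.
SELECT id, user_app_id, created_at, type, amount, currency_id
FROM transactions
WHERE sender_user_id = 123
  AND created_at >= '2020-03-20'
  AND ( created_at < '2020-03-20 13:45:00'
        OR (created_at = '2020-03-20 13:45:00' AND id < 987654) )
ORDER BY created_at DESC, id DESC
LIMIT 15;

The same INDEX(sender_user_id, created_at) serves this query, and each page costs roughly the same no matter how deep the user pages.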
So, why does EXPLAIN think it will hit a million rows? Because EXPLAIN is dumb when it comes to handling LIMIT. There is a better way to estimate the effort (see the sketch below); it will show 15, not a million, if all is working well. For LIMIT 150, 15 it will show 165.
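That estimate can come from the session Handler counters (the Handler_reads mentioned further down). A sketch: run all three statements in the same session, and treat the increase between the two SHOW outputs as the number of rows actually touched; the 123 stands in for the bound sender_user_id:

SHOW SESSION STATUS LIKE 'Handler_read%';

SELECT a.id, a.user_app_id, a.created_at, a.type, a.amount, a.currency_id, b.name, b.url
FROM transactions AS a
LEFT JOIN user_apps AS b ON a.user_app_id = b.id
WHERE a.sender_user_id = 123
  AND a.created_at >= '2020-03-20'
  AND a.created_at <  '2020-03-20' + INTERVAL 1 DAY
ORDER BY a.created_at DESC
LIMIT 15 OFFSET 0;

SHOW SESSION STATUS LIKE 'Handler_read%';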
You said "Index sender_user_id_2 is a composite index of sender_user_id (int) and created_at (timestamp)." Can you provide SHOW CREATE TABLE so we can check for anything else subtle going on?
Hmmm... I wonder if
order by a.created_at desc
should be changed to match the index:
order by a.sender_user_id DESC, a.created_at desc
(What version of MySQL are you using? I did some experimenting and found no difference from having (or not having) sender_user_id in the ORDER BY.)
(Trouble -- It seems that the JOIN prevents the effective use of LIMIT. Still digging...)
New suggestion:
select a.id, a.user_app_id, a.created_at, a.type, a.amount, a.currency_id,
       b.name, b.url
from
(
    SELECT a1.id
    FROM transactions as a1
    where a1.sender_user_id = ?
      AND a1.created_at >= '2020-03-20'
      AND a1.created_at <  '2020-03-20' + INTERVAL 1 DAY
    order by a1.created_at desc
    limit 15 offset 0
) AS x
JOIN transactions AS a USING(id)
left join user_apps as b ON a.user_app_id = b.id
This uses a generic 'trick' to move the LIMIT into a derived table, with minimal other stuff. Then, with only 15 ids, the JOINs to other tables goes 'fast'.
In my experiment (with a different pair of tables), it touched only 5*15 rows. I checked multiple versions; all seem to need this technique. I used the Handler_read counters to verify the results.
When I tried with a JOIN but not a derived table, it was touching 2*N rows, where N was the number of rows without the LIMIT.
I'm working on a project which has a large data set, stored in a MySQL DB.
I want to fetch records with pagination from a few tables. One of the tables has over 2 million records, which causes the page to freeze for a very long time; sometimes the page fails to load at all.
The query I'm using to fetch the records:
SELECT
EN.`MRN`,
P.`FNAME`,
P.`LNAME`,
P.`MI`,
P.`SSC`,
sum(EN.`AMOUNT`) AS `TOTAL_AMOUNT`
FROM `table_1` AS EN
INNER JOIN `table_2` AS P ON EN.`MRN` = P.`MRN`
GROUP BY EN.`MRN`,P.`FNAME`, P.`LNAME`,P.`MI`,P.`SSC`
HAVING sum(EN.`AMOUNT`) > 0
ORDER BY P.`LNAME`
I use this query to get the total number of records so the pagination can work. Then I run the query again to get the actual records:
SELECT
EN.`MRN`,
P.`FNAME`,
P.`LNAME`,
P.`MI`,
P.`SSC`,
sum(EN.`AMOUNT`) AS `TOTAL_AMOUNT`
FROM `table_1` AS EN
INNER JOIN `table_2` AS P ON EN.`MRN` = P.`MRN`
GROUP BY EN.`MRN`,P.`FNAME`, P.`LNAME`,P.`MI`,P.`SSC`
HAVING sum(EN.`AMOUNT`) > 0
ORDER BY P.`LNAME`
LIMIT 0, 100
How can I make this query work faster? It takes a very long time to execute the first query that gets the total number of records.
It is better to separate TOTAL_AMOUNT from the query, because that is the only value that involves all the records. You can run it once when the page loads. I assume that all the records in table_1 are valid.
SELECT sum(EN.`AMOUNT`) AS `TOTAL_AMOUNT` FROM `table_1` AS EN
HAVING sum(EN.`AMOUNT`) > 0
Then get the page of results each time the user flips the page. This should only return 10 records, starting from record 0.
SELECT
EN.`MRN`,
P.`FNAME`,
P.`LNAME`,
P.`MI`,
P.`SSC`
FROM `table_1` AS EN
INNER JOIN `table_2` AS P ON EN.`MRN` = P.`MRN`
GROUP BY EN.`MRN`,P.`FNAME`, P.`LNAME`,P.`MI`,P.`SSC`
HAVING sum(EN.`AMOUNT`) > 0
ORDER BY P.`LNAME`
LIMIT 10 OFFSET 0
Hope this helps.
There are a few things you can do to make this faster:
Use EXPLAIN to make sure your indices are set correctly / set the correct indices.
Only execute the complicated query once using SQL_CALC_FOUND_ROWS.
Use the TOTAL_AMOUNT alias in your HAVING clause to avoid doing the calculation twice (see the sketch after this list).
MySQL will do some optimizations and caching itself so they might not all have the same impact.
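Points 2 and 3 might combine into something like this (a sketch; note that SQL_CALC_FOUND_ROWS and FOUND_ROWS() still work but are deprecated from MySQL 8.0.17 onward):

SELECT SQL_CALC_FOUND_ROWS
    EN.`MRN`, P.`FNAME`, P.`LNAME`, P.`MI`, P.`SSC`,
    SUM(EN.`AMOUNT`) AS `TOTAL_AMOUNT`
FROM `table_1` AS EN
INNER JOIN `table_2` AS P ON EN.`MRN` = P.`MRN`
GROUP BY EN.`MRN`, P.`FNAME`, P.`LNAME`, P.`MI`, P.`SSC`
HAVING `TOTAL_AMOUNT` > 0
ORDER BY P.`LNAME`
LIMIT 0, 100;

-- Immediately afterwards, on the same connection, get the total without re-running the heavy query:
SELECT FOUND_ROWS() AS total_rows;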
I'm building a small messaging system for my app; the primary idea is to have 2 tables.
Table 1: messages
Id,sender_id,title,body,file,parent_id
This is where messages are stored, decoupled from who will receive them, to allow for multiple recipients.
parent_id links to the parent message if it's a reply, and file is a BLOB storing a single file attached to the message.
Table 2: message_users
Id,thread_id,user_id,is_read,stared,deleted
This links the parent thread to the target users.
Now, to get the count of unread messages for a single user I can do:
Select count(*) from message_users where user_id = 1 and is_read is null
To get the count of all messages in their inbox I can do:
Select count(*) from message_users where user_id = 1;
The question is: how do I combine both into a single optimized query?
So you're trying to produce a total of the rows that meet one condition alongside a total of the rows that meet an extra condition:
|---------|---------|
| total | unread |
|---------|---------|
| 20 | 12 |
|---------|---------|
As such you will need something with a form along the lines of:
SELECT A total, B unread FROM message_users WHERE user_id=1
A is fairly straightforward, you already more-or-less have it: COUNT(Id).
B is marginally more complicated and might take the form SUM( IF(is_read IS NULL,1,0) ) -- add 1 each time is_read is null; the condition will depend on your database specifics.
Or B might look like: COUNT(CASE WHEN is_read IS NULL THEN 1 END) unread -- this is saying 'when is_read is null, count another 1'; the condition will depend on your database specifics.
In total:
SELECT COUNT(Id) total, COUNT(CASE WHEN is_read IS NULL THEN 1 END) unread FROM message_users WHERE user_id=1
Or:
SELECT COUNT(Id) total, SUM( IF(is_read IS NULL,1,0) ) unread FROM message_users WHERE user_id=1
In terms of optimisation, I'm not aware of a query that will necessarily go quicker than this. (Would love to know of it if it does exist!) There may be ways to speed things up if you have a problem with performance:
Examine your indexes: use the built-in tools (EXPLAIN), plus some reading around, etc.
Use caches and/or pre-compute the value and store it elsewhere -- e.g. have a field unread_messages against the user and grab this value directly (see the sketch below). Obviously there will need to be some on-write invalidation, or indeed some service running to keep these values up to date. There are many ways of achieving this: tools in MySQL, hand-roll your own, etc.
In short, start optimising from a requirement and some real data. (My query takes 0.8s, I need the results in 0.1s and they need to be consistent 100% of the time -- how can I achieve this?) Then you can tweak and experiment with the SQL, hardware that the server runs on (maybe?), caching/pre-calculate at different points etc.
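A sketch of that pre-computed counter, assuming a users table exists (the column name and ids are illustrative, not from the question):

-- A denormalised counter on the user record.
ALTER TABLE users ADD COLUMN unread_messages INT NOT NULL DEFAULT 0;

-- Keep it in sync on write (application code or triggers both work; shown as plain updates here).
-- When a message is delivered to user 5:
UPDATE users SET unread_messages = unread_messages + 1 WHERE Id = 5;

-- When user 5 reads a message:
UPDATE users SET unread_messages = unread_messages - 1
WHERE Id = 5 AND unread_messages > 0;

-- The dashboard then reads one row instead of counting message_users:
SELECT unread_messages FROM users WHERE Id = 5;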
In MySQL, when you count a field, it only counts non null occurrences of that field, so you should be able to do something like this:
SELECT COUNT(user_id), COUNT(user_id) - COUNT(is_read) AS unread
FROM message_users
WHERE user_id = 1;
Untested, but it should point you in the right direction.
You can use SUM with a CASE WHEN clause. If is_read is null then +1 is added to the sum, else +0.
SELECT count(*),
SUM(CASE WHEN is_read IS NULL THEN 1 ELSE 0 END) AS count_unread
FROM message_users WHERE user_id = 1;
I am using MySQL tables that have the following data:
users(ID, name, email, create_added) (about 10000 rows)
points(user_id, point) (about 15000 rows)
And my query:
SELECT u.*, SUM(p.point) point
FROM users u
LEFT JOIN points p ON p.user_id = u.ID
WHERE u.id > 0
GROUP BY u.id
ORDER BY point DESC
LIMIT 0, 10
I only need the top 10 users with the best point totals, but the query dies. How can I improve its performance?
Like @Grim said, you can use an INNER JOIN instead of a LEFT JOIN. However, if you are truly looking for optimization, I would suggest adding an extra field to the users table holding a precalculated point total. That would beat any query optimization possible with your current database design.
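A sketch of that precalculated field (the column and index names are illustrative):

ALTER TABLE users ADD COLUMN total_point INT NOT NULL DEFAULT 0;
ALTER TABLE users ADD INDEX idx_total_point (total_point);

-- One-off backfill from the points table (afterwards, keep it updated whenever points change):
UPDATE users u
LEFT JOIN (
    SELECT user_id, SUM(point) AS pts
    FROM points
    GROUP BY user_id
) p ON p.user_id = u.ID
SET u.total_point = COALESCE(p.pts, 0);

-- The top-10 query then becomes a simple indexed read:
SELECT * FROM users ORDER BY total_point DESC LIMIT 10;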
Swapping the LEFT JOIN for an INNER JOIN would help a lot. Make sure points.point and points.user_id are indexed. I assume you can get rid of the WHERE clause, as u.id will always be more than 0 (although MySQL probably does this for you at the query optimisation stage).
It doesn't really matter that you are getting only 10 rows. MySQL has to sum up the points for every user before it can sort them (a "Using filesort" operation). That LIMIT is applied last.
A covering index ON points(user_id,point) is going to be the best bet for optimum performance. (I'm really just guessing, without any EXPLAIN output or table definitions.)
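A sketch of that covering index (the index name is arbitrary):

ALTER TABLE points ADD INDEX idx_user_point (user_id, point);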
The id column in users is likely the primary key, or at least a unique index, so you probably already have an index with id as the leading column (or the primary-key clustered index, if it's InnoDB).
I'd be tempted to test a query like this:
SELECT u.*
, s.total_points
FROM ( SELECT p.user_id
, SUM(p.point) AS total_points
FROM points p
WHERE p.user_id > 0
GROUP BY p.user_id
ORDER BY total_points DESC
LIMIT 10
) s
JOIN users u
ON u.id = s.user_id
ORDER BY s.total_points DESC
That does have the overhead of creating a derived table, but with a suitable index on points, with user_id as the leading column and including the point column, it's likely that MySQL can optimize the GROUP BY using the index, avoiding one "Using filesort" operation (for the GROUP BY).
There will likely be a "Using filesort" operation on that resultset, to get the rows ordered by total_points. Then get the first 10 rows from that.
With those 10 rows, we can join to the user table to get the corresponding rows.
BUT.. there is one slight difference in this result: if any of the user_id values in the top 10 aren't in the users table, this query will return fewer than 10 rows. (I'd expect there to be a foreign key defined, so that wouldn't happen, but I'm really just guessing without table definitions.)
An EXPLAIN would show the access plan being used by MySQL.
Ever thought about partitioning?
I'm currently working with a large database and have successfully improved my SQL queries with it.
For example,
PARTITION BY RANGE (`ID`) (
PARTITION p1 VALUES LESS THAN (100) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (200) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (300) ENGINE = InnoDB,
... and so on..
)
It gives much better speed when scanning the table: MySQL will scan only partition p1, which contains ids 1 to 99, even if there are millions of rows in the table.
Check out this http://dev.mysql.com/doc/refman/5.5/en/partitioning.html
Info: I have this table (PERSONS):
PERSON_ID int(10)
POINTS int(6)
4 OTHER COLUMNS which are of type int(5 or 6)
The table consists of 25M rows and grows by 0.25M a day. The distribution of points is around 0 to 300 points, and 85% of the table has 0 points.
Question: I would like to return to users the rank they have, provided they have at least 1 point. What would be the fastest way to do it: in SQL, in PHP, or some combination?
Extra Info: These lookups can happen 100 times per second. The solutions I have seen so far are not fast enough; if more info is needed, please ask.
Any advice is welcome; as you can tell, I am new to PHP and MySQL :)
Create an index on persons(points) and another on persons(person_id, points). Then run the following query:
select count(*)
from persons p
where p.points >= (select p2.points from persons p2 where p2.person_id = <particular person>)
The subquery should use the second index as a lookup. The first should be an index scan on the first index.
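A sketch of those two indexes (the names are arbitrary; skip any that already exist):

CREATE INDEX idx_points ON persons (points);
CREATE INDEX idx_person_points ON persons (person_id, points);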
Sometimes MySQL can be a little strange about optimization. So, this might actually be better:
select count(*)
from persons p cross join
(select p2.points from persons p2 where p2.person_id = <particular person>) const
where p.points >= const.points;
This just ensures that the lookup for the points for the given person happens once, rather than for each row.
Partition your table into two partitions - one for people with 0 points and one for people with one or more points.
Add one index on points to your table and another on person_id (if these indexes don't already exist).
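A rough sketch of that layout (note that MySQL requires the partitioning column to appear in every unique key, so this only works if points is included in the primary key or the table has no conflicting unique keys):

ALTER TABLE persons
    PARTITION BY RANGE (points) (
        PARTITION p_zero    VALUES LESS THAN (1),
        PARTITION p_nonzero VALUES LESS THAN MAXVALUE
    );

ALTER TABLE persons ADD INDEX idx_points (points);
ALTER TABLE persons ADD INDEX idx_person_id (person_id);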
To find the dense rank of a specific person, run the query:
select count(distinct p2.points)+1
from person p1
join person p2 on p2.points > p1.points
where p1.person_id = ?
To find the non-dense rank of a specific person, run the query:
select count(*)
from person p1
join person p2 on p2.points >= p1.points
where p1.person_id = ?
(I would expect the dense rank query to run significantly faster.)