How to improve the performance of MYSQL query with large data?


I am using MySQL tables that have the following data:
users(ID, name, email, create_added) (about 10000 rows)
points(user_id, point) (about 15000 rows)
And my query:
SELECT u.*, SUM(p.point) point
FROM users u
LEFT JOIN points p ON p.user_id = u.ID
WHERE u.id > 0
GROUP BY u.id
ORDER BY point DESC
LIMIT 0, 10
I only need the top 10 users with the most points, but the query dies. How can I improve the performance of my query?

As @Grim said, you can use INNER JOIN instead of LEFT JOIN. However, if you are really after performance, I would suggest adding an extra column to the users table that holds a precalculated point total. Reading that column avoids the join and the aggregation entirely, so it should beat any query optimization of your current database design.
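A minimal sketch of the precalculated-column idea, assuming a hypothetical total_points column on users that is updated in the same transaction as each insert into points. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the helper names add_points and top_users are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; total_points is a hypothetical column name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        total_points INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE points (user_id INTEGER NOT NULL, point INTEGER NOT NULL);
    CREATE INDEX idx_users_total_points ON users (total_points);
""")


def add_points(conn, user_id, point):
    """Record a points row and bump the cached total in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO points (user_id, point) VALUES (?, ?)", (user_id, point)
        )
        conn.execute(
            "UPDATE users SET total_points = total_points + ? WHERE id = ?",
            (point, user_id),
        )


def top_users(conn, limit=10):
    """Read the leaderboard from the cached column: no join, no GROUP BY."""
    return conn.execute(
        "SELECT id, name, total_points FROM users "
        "ORDER BY total_points DESC, id LIMIT ?",
        (limit,),
    ).fetchall()


conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "ann"), (2, "bob"), (3, "cy")],
)
conn.commit()
add_points(conn, 1, 5)
add_points(conn, 2, 7)
add_points(conn, 1, 4)
leaders = top_users(conn, 2)
```

The trade-off is that every write to points now also touches users, and the cached total has to be rebuilt if the two ever drift apart.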

Swapping the LEFT JOIN for an INNER JOIN would help a lot. Make sure points.point and points.user_id are indexed. I assume you can get rid of the WHERE clause, as u.id will always be more than 0 (although MySQL probably does this for you at the query optimisation stage).

It doesn't really matter that you are getting only 10 rows. MySQL has to sum up the points for every user before it can sort them (a "Using filesort" operation). The LIMIT is applied last.
A covering index ON points(user_id,point) is going to be the best bet for optimum performance. (I'm really just guessing, without any EXPLAIN output or table definitions.)
The id column in users is likely the primary key, or at least a unique index. So it's likely you already have an index with id as the leading column (or the clustered primary key index, if it's InnoDB).
I'd be tempted to test a query like this:
SELECT u.*
, s.total_points
FROM ( SELECT p.user_id
, SUM(p.point) AS total_points
FROM points p
WHERE p.user_id > 0
GROUP BY p.user_id
ORDER BY total_points DESC
LIMIT 10
) s
JOIN users u
ON u.id = s.user_id
ORDER BY s.total_points DESC
That does have the overhead of creating a derived table, but with a suitable index on points, with a leading column of user_id, and including the point column, it's likely that MySQL can optimize the group by using the index, and avoiding one "Using filesort" operation (for the GROUP BY).
There will likely be a "Using filesort" operation on that resultset, to get the rows ordered by total_points. Then get the first 10 rows from that.
With those 10 rows, we can join to the users table to get the corresponding rows.
BUT... there is one slight difference with this result: if any of the user_id values in the top 10 are not in the users table, this query will return fewer than 10 rows. (I'd expect there to be a foreign key defined, so that wouldn't happen, but I'm really just guessing without table definitions.)
An EXPLAIN would show the access plan being used by MySQL.
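Here is a small runnable sketch of the "aggregate first, join afterwards" rewrite, checked against the original LEFT JOIN form. It is only an illustration: SQLite (via Python's sqlite3) stands in for MySQL, the data is made up, and the index name idx_points_user_point is an assumption. It shows that both forms return the same rows on this data; it says nothing about MySQL's actual execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE points (user_id INTEGER NOT NULL, point INTEGER NOT NULL);
    -- covering index: user_id leads, point is included
    CREATE INDEX idx_points_user_point ON points (user_id, point);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(1, 21)]
)
# Users 1..19 each get two rows (i and 2*i), so their totals are 3*i.
# User 20 has no points at all.
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(i, v) for i in range(1, 20) for v in (i, 2 * i)],
)

# Original shape: join everything, aggregate, sort, then limit.
original = conn.execute("""
    SELECT u.id, u.name, SUM(p.point) AS total_points
    FROM users u
    LEFT JOIN points p ON p.user_id = u.id
    GROUP BY u.id
    ORDER BY total_points DESC
    LIMIT 3
""").fetchall()

# Rewritten shape: find the top user_ids from points alone, then join.
rewritten = conn.execute("""
    SELECT u.id, u.name, s.total_points
    FROM ( SELECT p.user_id, SUM(p.point) AS total_points
           FROM points p
           GROUP BY p.user_id
           ORDER BY total_points DESC
           LIMIT 3
         ) s
    JOIN users u ON u.id = s.user_id
    ORDER BY s.total_points DESC
""").fetchall()
```

Note that the two forms only agree when at least as many users have points as the LIMIT asks for; users without any points never appear in the rewritten form.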

Ever thought about partitioning?
I'm currently working with a large database and have successfully improved SQL query performance this way.
For example,
PARTITION BY RANGE (`ID`) (
PARTITION p1 VALUES LESS THAN (100) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (200) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (300) ENGINE = InnoDB,
... and so on..
)
Partitioning lets MySQL scan less of the table. When the WHERE clause restricts the partitioning column, MySQL scans only the matching partitions: a lookup for IDs 1 to 99 touches only partition p1, even if there are millions of rows in the table.
Check out this http://dev.mysql.com/doc/refman/5.5/en/partitioning.html

Related

Optimal joins in MySQL or offloading to application layer

I have 3 tables in a MySQL database: courses, users and participants, which contain about 30mil, 30k and 3k entries respectively.
My goal is to (efficiently) figure out the number of users that have been assigned to courses that match our criteria. The criteria are a little more complex, but for this example we only care about users where deleted_at is null and courses where deleted_at is null and active is 1.
Simplified these are the columns:
users:
id  deleted_at
1   null
2   2022-01-01

courses:
id  active  deleted_at
1   1       null
1   1       2020-01-01
2   0       2020-01-01

participants:
id  participant_id  course_id
1   1               1
2   1               2
3   2               2
Based on the data above, the number we would get is 1, as only user 1 is not deleted and that user is assigned to a course (id 1) that is active and not deleted.
Here is a list of what I've tried.
Joining all the tables and do simple where's.
Joining using subqueries.
Pulling the correct courses and users out to the application layer (PHP), and querying participants using WHERE IN.
Pulling everything out and doing the filtering in the application layer.
Calling using EXPLAIN to add better indexes - I, admittedly, do not do this often and may not have done this well enough.
A combination of all the above.
An example of a query would be:
SELECT COUNT(DISTINCT participant_id)
FROM `participants`
INNER JOIN
(SELECT `courses`.`id`
FROM `courses`
WHERE (`active` = '1')
AND `deleted_at` IS NULL) AS `tempCourses` ON `tempCourses`.`id` = `participants`.`course_id`
WHERE `participant_type` = 'Eloomi\\Models\\User'
AND `participant_id` in
(SELECT `users`.`id`
FROM `users`
WHERE `users`.`deleted_at` IS NULL)
From what I can gather doing this will create a massive table, which only then will start applying where's. In my mind it should be possible to short circuit a lot of that because once we get a match for a user, we can disregard that going forward. That would be how to handle it, in my mind, in the application layer.
We could do this on a per-user basis in the application layer, but the number of requests to the database would make this a bad solution.
I have tagged it as PHP as well as MySQL, not because it has to be PHP but because I do not mind offloading some parts to the application layer if that is required. It's my experience that joins do not always use indexes optimally.
Edit:
To be specific: can someone help me find an efficient way to pull out the number of non-deleted users that have been assigned to active, non-deleted courses?

I would write it this way:
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
MySQL may join the tables in another order, not the order you list the tables in the query.
I hope that courses is the first table MySQL accesses, because it's probably the smallest table. Especially after filtering by active and deleted_at. The following index will help to narrow down that filtering, so only matching rows are examined:
ALTER TABLE courses ADD KEY (active, deleted_at);
Every index implicitly has the table's primary key (e.g. id) appended as the last column. That column being part of the index, it is used in the join to participants. So you need an index in participants that the join uses to find the corresponding rows in that table. The order of columns in the index is important.
ALTER TABLE participants ADD KEY (course_id, participant_type, participant_id);
The participant_id is used to join to the users table. MySQL's optimizer will probably prefer to join to users by its primary key, but you also want to restrict that by deleted_at, so you might need this index:
ALTER TABLE users ADD KEY id_deleted_at (id, deleted_at);
And you might need to use an index hint to coax the optimizer to prefer this secondary index over the primary key index. The hint refers to the index by name, so the key above is given an explicit name.
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u USE INDEX (id_deleted_at)
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
MySQL knows how to use compound indexes even if some conditions are in join clauses and other conditions are in the WHERE clause.
Caveat: I have not tested this. Choosing indexes may take several tries, and testing the EXPLAIN after each try.
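As a sanity check of the join logic (not of MySQL's index choices), here is a sketch that loads the question's sample rows and runs the three-table join. Assumptions: SQLite via Python's sqlite3 stands in for MySQL, the index names are made up, and every sample participant is given the participant_type from the query, since the sample table does not show that column. The courses rows are copied as given, including the repeated id 1, so no primary key is declared.

```python
import sqlite3

USER_TYPE = "Eloomi\\Models\\User"  # one backslash between the segments

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, deleted_at TEXT);
    CREATE TABLE courses (id INTEGER, active INTEGER, deleted_at TEXT);
    CREATE TABLE participants (
        id INTEGER,
        participant_id INTEGER,
        course_id INTEGER,
        participant_type TEXT
    );
    -- same column orders as the indexes suggested above
    CREATE INDEX idx_courses_filter ON courses (active, deleted_at);
    CREATE INDEX idx_participants_join
        ON participants (course_id, participant_type, participant_id);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(1, None), (2, "2022-01-01")]
)
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?)",
    [(1, 1, None), (1, 1, "2020-01-01"), (2, 0, "2020-01-01")],
)
conn.executemany(
    "INSERT INTO participants VALUES (?, ?, ?, ?)",
    [(1, 1, 1, USER_TYPE), (2, 1, 2, USER_TYPE), (3, 2, 2, USER_TYPE)],
)

(active_users,) = conn.execute(
    """
    SELECT COUNT(DISTINCT p.participant_id)
    FROM courses AS c
    INNER JOIN participants AS p ON c.id = p.course_id
    INNER JOIN users AS u ON p.participant_id = u.id
    WHERE u.deleted_at IS NULL
      AND c.active = 1 AND c.deleted_at IS NULL
      AND p.participant_type = ?
    """,
    (USER_TYPE,),
).fetchone()
```

On the sample data this yields the count of 1 that the question expects. Index selection still has to be verified with EXPLAIN on the real MySQL tables.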

optimise a mysql query to reduce rows searched

I'm currently grabbing the last 5 comments posted on my website, which I am seemingly doing quite badly I think.
Here is the SQL query:
SELECT c.comment_id
, c.article_id
, c.time_posted
, a.title
, a.slug
, u.username
FROM articles_comments c
JOIN articles a
ON c.article_id = a.article_id
JOIN users u
ON u.user_id = c.author_id
WHERE a.active = 1
AND c.approved = 1
ORDER
BY c.comment_id DESC
LIMIT 5
My problem is that it has to search through a lot of rows, which seems quite wasteful. I'm curious if there's a better way to do it.
Here's the output of explain on it:
As you can see, the rows it's giving is 81,486 which seems kind of hilarious. What am I missing here?

Turns out, simply forcing articles_comments to use the PRIMARY key (comment_id) as the INDEX fixed it.
The issue was that the query was selecting ALL approved comments first: MySQL used the index on the approved column and then had to sort the result, which meant reading tens of thousands of rows.

c: INDEX(approved, comment_id) -- in this order
a: I assume you have PRIMARY KEY(article_id)
u: I assume you have PRIMARY KEY(user_id)
The hope is that the c index will handle some of the WHERE, plus the ORDER BY and LIMIT. The worst case is that it must scan the entire table without finding 5 rows.
The 81,486 is bogus. Here's a precise way to get good info:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The 'reads' will indicate how many rows of data and index were touched; the writes indicate temp table(s) being used.

yii2 data provider query takes very long time

I am using yii2 data Provider to extract data from database. Raw query looks like this
SELECT `client_money_operation`.* FROM `client_money_operation`
LEFT JOIN `user` ON `client_money_operation`.`user_id` = `user`.`id`
LEFT JOIN `client` ON `client_money_operation`.`client_id` = `client`.`id`
LEFT JOIN `client_bonus_operation` ON `client_money_operation`.`id` = `client_bonus_operation`.`money_operation_id`
WHERE (`client_money_operation`.`status`=0) AND (`client_money_operation`.`created_at` BETWEEN 1 AND 1539723600)
GROUP BY `operation_code` ORDER BY `created_at` DESC LIMIT 10
This query takes 107 seconds to execute.
The client_money_operation table contains 132000 rows. What do I need to do to optimise this query, or set up my database properly?

Try pagination. But if you really have to show a large set of records in one go, remove as many LEFT JOINs as you can. You can duplicate some data in the client_money_operation table if it really has to appear in that single result set.

SELECT mo.*
FROM `client_money_operation` AS mo
LEFT JOIN `user` AS u ON mo.`user_id` = u.`id`
LEFT JOIN `client` AS c ON mo.`client_id` = c.`id`
LEFT JOIN `client_bonus_operation` AS bo ON mo.`id` = bo.`money_operation_id`
WHERE (mo.`status`=0)
AND (mo.`created_at` BETWEEN 1 AND 1539723600)
GROUP BY `operation_code`
ORDER BY `created_at` DESC
LIMIT 10
is a rather confusing use of GROUP BY. First, it is improper to group by one column while having lots of non-aggregated columns in the SELECT list. And the use of created_at in the ORDER BY does not make sense since it is unclear which date will be associated with each operation_code. Perhaps you want MIN(created_at)?
Optimization...
There will be a full scan of mo and (hopefully) PRIMARY KEY lookups into the other tables. Please provide EXPLAIN SELECT ... so we can check this.
The only useful index on mo is INDEX(status, created_at), and it may or may not be useful, depending on how big that date range is.
bo needs some index starting with money_operation_id.
What table(s) are operation_code and created_at in? It makes a big difference to the Optimizer.
But there is a pattern that can probably be used to greatly speed up the query. (I can't give you details without knowing what table those columns are in, nor whether it can be made to work.)
SELECT mo.*
FROM ( SELECT mo.id FROM .. WHERE .. GROUP BY .. ORDER BY .. LIMIT .. ) AS x
JOIN mo ON x.id = mo.id
ORDER BY .. -- yes, repeated
That is, first do (in a derived table) the minimal work to find the ids of the 10 rows desired, then use JOIN(s) to fetch the other columns needed.
(If yii2 cannot be made to generate such, then it is in the way.)
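A minimal sketch of that "find the ids first, then join" pattern, under several assumptions: SQLite via Python's sqlite3 stands in for MySQL, status and created_at are taken to live in client_money_operation, the GROUP BY is left out because its intent is unclear, and the note column is a made-up stand-in for the wide columns behind mo.*.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client_money_operation (
        id INTEGER PRIMARY KEY,
        status INTEGER NOT NULL,
        created_at INTEGER NOT NULL,
        note TEXT  -- hypothetical: stands in for the wide columns of mo.*
    );
    CREATE INDEX idx_mo_status_created
        ON client_money_operation (status, created_at);
""")
# 100 made-up rows; even ids have status 0, created_at grows with id.
conn.executemany(
    "INSERT INTO client_money_operation (id, status, created_at, note) "
    "VALUES (?, ?, ?, ?)",
    [(i, i % 2, 1000 + i, f"op {i}") for i in range(1, 101)],
)

rows = conn.execute("""
    SELECT mo.id, mo.created_at, mo.note
    FROM ( SELECT id
           FROM client_money_operation
           WHERE status = 0
             AND created_at BETWEEN 1 AND 1539723600
           ORDER BY created_at DESC
           LIMIT 10
         ) AS x
    JOIN client_money_operation AS mo ON x.id = mo.id
    ORDER BY mo.created_at DESC  -- yes, repeated
""").fetchall()
ids = [row[0] for row in rows]
```

The inner query only needs the id and the indexed columns, so the wide rows are fetched for just the 10 ids that survive the LIMIT.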

Displaying a large amount of data in paging table without heavily impacting DB

The current implementation is a single complex query with multiple joins and temporary tables, but it is putting too much stress on my MySQL server and takes upwards of 30 seconds to load the table. The data is retrieved by PHP via a JavaScript Ajax call and displayed on a webpage. Here are the tables involved:
Table: table_companies
Columns: company_id, ...
Table: table_manufacture_line
Columns: line_id, line_name, ...
Table: table_product_stereo
Columns: product_id, line_id, company_id, assembly_datetime, serial_number, ...
Table: table_product_television
Columns: product_id, line_id, company_id, assembly_datetime, serial_number, warranty_expiry, ...
A single company can have 100k+ items split between the two product tables. The product tables are unioned and filtered by the line_name, then ordered by assembly_datetime and limited depending on the paging. The datetime value is also reliant on timezone and this is applied as part of the query (another JOIN + temp table). line_name is also one of the returned columns.
I was thinking of splitting the line_name filter out from the product union query. Essentially I'd determine the ids of the lines that correspond to the filter, then do a UNION query with a WHERE condition WHERE line_id IN (<results from previous query>). This would cut out the need for joins and temp tables, and I can apply the line_name to line_id and timezone modification in PHP, but I'm not sure this is the best way to go about things.
I have also looked at potentially using Redis, but the large number of individual products is leading to a similarly long wait time when pushing all of the data to Redis via PHP (20-30 seconds), even if it is just pulled in directly from the product tables.
Is it possible to tweak the existing queries to increase the efficiency?
Can I push some of the handling to PHP to decrease the load on the SQL server? What about Redis?
Is there a way to architect the tables better?
What other solution(s) would you suggest?
I appreciate any input you can provide.
Edit:
Existing query:
SELECT line_name,CONVERT_TZ(datetime,'UTC',timezone) datetime,... FROM (SELECT line_name,datetime,... FROM ((SELECT line_id,assembly_datetime datetime,... FROM table_product_stereos WHERE company_id=# ) UNION (SELECT line_id,assembly_datetime datetime,... FROM table_product_televisions WHERE company_id=# )) AS union_products INNER JOIN table_manufacture_line USING (line_id)) AS products INNER JOIN (SELECT timezone FROM table_companies WHERE company_id=# ) AS tz ORDER BY datetime DESC LIMIT 0,100
Here it is formatted for some readability.
SELECT line_name,CONVERT_TZ(datetime,'UTC',tz.timezone) datetime,...
FROM (SELECT line_name,datetime,...
FROM (SELECT line_id,assembly_datetime datetime,...
FROM table_product_stereos WHERE company_id=#
UNION
SELECT line_id,assembly_datetime datetime,...
FROM table_product_televisions
WHERE company_id=#
) AS union_products
INNER JOIN table_manufacture_line USING (line_id)
) AS products
INNER JOIN (SELECT timezone
FROM table_companies
WHERE company_id=#
) AS tz
ORDER BY datetime DESC LIMIT 0,100
IDs are indexed; Primary keys are the first key for each column.

Let's build this query up from its component parts to see what we can optimize.
Observation: you're fetching the 100 most recent rows from the union of two large product tables.
So, let's start by trying to optimize the subqueries fetching stuff from the product tables. Here is one of them.
SELECT line_id,assembly_datetime datetime,...
FROM table_product_stereos
WHERE company_id=#
But look, you only need the 100 newest entries here. So, let's add
ORDER BY assembly_datetime DESC
LIMIT 100
to this query. Also, you should put a compound index on this table as follows. This will allow both the WHERE and ORDER BY lookups to be satisfied by the index.
CREATE INDEX id_date ON table_product_stereos (company_id, assembly_datetime)
All the same considerations apply to the query from table_product_televisions. Order it by the time, limit it to 100, and index it.
If you need to apply other selection criteria, you can put them in these inner queries. For example, in a comment you mentioned a selection based on a substring search. You could do this as follows
SELECT t.line_id,t.assembly_datetime datetime,...
FROM table_product_stereos AS t
JOIN table_manufacture_line AS m ON m.line_id = t.line_id
AND m.line_name LIKE '%test'
WHERE company_id=#
ORDER BY assembly_datetime DESC
LIMIT 100
Next, you are using UNION to combine those two query result sets into one. UNION has the function of eliminating duplicates, which is time-consuming. (You know you don't have duplicates, but MySQL doesn't.) Use UNION ALL instead.
Putting this all together, the innermost sub query becomes this. We have to wrap up the subqueries because SQL is confused by UNION and ORDER BY clauses at the same query level.
SELECT * FROM (
SELECT line_id,assembly_datetime datetime,...
FROM table_product_stereos
WHERE company_id=#
ORDER BY assembly_datetime DESC
LIMIT 100
) AS st
UNION ALL
SELECT * FROM (
SELECT line_id,assembly_datetime datetime,...
FROM table_product_televisions
WHERE company_id=#
ORDER BY assembly_datetime DESC
LIMIT 100
) AS tv
That gets you 200 rows. It should get those rows fairly quickly.
200 rows are guaranteed to be enough to give you the 100 most recent items later on after you do your outer ORDER BY ... LIMIT operation. But that operation only has to crunch 200 rows, not 100K+, so it will be far faster.
Finally wrap up this query in your outer query material. Join the table_manufacture_line information, and fix up the timezone.
If you do the indexing and the ORDER BY ... LIMIT operation earlier, this query should become very fast.
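To illustrate why limiting each product table before the UNION ALL is safe, here is a sketch that compares the "limit early" form against the plain "union everything, then limit" form. Assumptions: SQLite via Python's sqlite3 stands in for MySQL, the rows and timestamps are made up (all distinct), the page size is 5 instead of 100, and the index names are hypothetical.

```python
import sqlite3

PAGE = 5      # stands in for LIMIT 100
COMPANY = 1   # stands in for company_id=#


def ts(n):
    """Made-up, distinct timestamps within a single hour."""
    return f"2020-01-01 00:{n // 60:02d}:{n % 60:02d}"


conn = sqlite3.connect(":memory:")
for table in ("table_product_stereos", "table_product_televisions"):
    conn.execute(
        f"CREATE TABLE {table} (product_id INTEGER PRIMARY KEY, "
        "line_id INTEGER, company_id INTEGER, assembly_datetime TEXT)"
    )
    conn.execute(
        f"CREATE INDEX id_date_{table} ON {table} (company_id, assembly_datetime)"
    )
conn.executemany(
    "INSERT INTO table_product_stereos (line_id, company_id, assembly_datetime) "
    "VALUES (?, ?, ?)",
    [(1, 1 + i % 2, ts(2 * i)) for i in range(300)],
)
conn.executemany(
    "INSERT INTO table_product_televisions (line_id, company_id, assembly_datetime) "
    "VALUES (?, ?, ?)",
    [(2, 1 + i % 2, ts(2 * i + 1)) for i in range(300)],
)

# Limit each table first, then combine and limit again.
limited = conn.execute("""
    SELECT line_id, assembly_datetime FROM (
        SELECT * FROM ( SELECT line_id, assembly_datetime
                        FROM table_product_stereos
                        WHERE company_id = ?
                        ORDER BY assembly_datetime DESC LIMIT ? ) AS st
        UNION ALL
        SELECT * FROM ( SELECT line_id, assembly_datetime
                        FROM table_product_televisions
                        WHERE company_id = ?
                        ORDER BY assembly_datetime DESC LIMIT ? ) AS tv
    ) AS newest
    ORDER BY assembly_datetime DESC LIMIT ?
""", (COMPANY, PAGE, COMPANY, PAGE, PAGE)).fetchall()

# Reference: combine all rows, then sort and limit.
full = conn.execute("""
    SELECT line_id, assembly_datetime FROM (
        SELECT line_id, assembly_datetime FROM table_product_stereos
        WHERE company_id = ?
        UNION ALL
        SELECT line_id, assembly_datetime FROM table_product_televisions
        WHERE company_id = ?
    ) AS everything
    ORDER BY assembly_datetime DESC LIMIT ?
""", (COMPANY, COMPANY, PAGE)).fetchall()
```

This only covers the first page. For page k, each inner LIMIT would need to be k times the page size, with the offset applied in the outer query.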
The comment dialog in your question indicates to me that you may have multiple product types, not just two, and that you have complex selection criteria for your paged display. Using UNION ALL on large numbers of rows slams performance: it converts multiple indexed tables into an internal list of rows that simply can't be searched efficiently.
You really should consider putting your two kinds of product data in a single table instead of having to UNION ALL multiple product tables. The setup you have now is inflexible and won't scale up easily. If you structure your schema with a master product table and perhaps some attribute tables for product-specific information, you will find yourself much happier two years from now. Seriously. Please consider making the change.

Remember: Index fast, data slow. Use joins over nested queries. Nested queries return all of the data fields whereas joins just consider the filters (which should all be indexed - make sure there's a unique index on table_product_*.line_id). It's been a while but I'm pretty sure you can join "ON company_id=#" which should cut down the results early on.
In this case, all of the results refer to the same company (or a much smaller subset) so it makes sense to run that query separately (and it makes the query more maintainable).
So your data source would be:
(table_product_stereos as prod
INNER JOIN table_manufacture_line AS ml ON prod.line_id = ml.line_id and prod.company_id=#
UNION
table_product_televisions as prod
INNER JOIN table_manufacture_line as ml on prod.line_id = ml.line_id and prod.company_id=#)
From which you can select prod.* or ml.* fields as required.

PHP is not a solution at all...
Redis can be a solution.
But the main thing I would change is the indexing of the tables (add the missing indexes). If you're running into temp tables, the indexes on those tables are not well chosen. And 100k rows is not much at all.
But I can't help you without the table creation statements and the queries you run.
Make sure the columns in your WHERE clause match your B-tree index from left to right.

Fastest MySQL row rank of big table

Info: I have this table (PERSONS):
PERSON_ID int(10)
POINTS int(6)
4 OTHER COLUMNS which are of type int(5 or 6)
The table consists of 25M rows and is growing by 0.25M a day. Points range from around 0 to 300, and 85% of the rows have 0 points.
Question: I would like to return to the user which rank he/she has if they got at least 1 point. How and where would be the fastest way to do it, in SQL or PHP or combination?
Extra Info: These lookups can happen up to 100 times per second. The solutions I have seen so far are not fast enough. If more info is needed, please ask.
Any advice is welcome, as you understand I am new to PHP and MySQL :)

Create an index on persons(points) and on persons(person_id, points). Then run the following query:
select count(*)
from persons p
where p.points >= (select p2.points from persons p2 where p2.person_id = <particular person>)
The subquery should use the second index as a lookup. The first should be an index scan on the first index.
Sometimes MySQL can be a little strange about optimization. So, this might actually be better:
select count(*)
from persons p cross join
(select p2.points from persons p2 where p2.person_id = <particular person>) const
where p.points >= const.points;
This just ensures that the lookup for the points for the given person happens once, rather than for each row.

Partition your table into two partitions - one for people with 0 points and one for people with one or more points.
Add one index on points to your table and another on person_id (if these indexes don't already exist).
To find the dense rank of a specific person, run the query:
select count(distinct p2.points)+1
from persons p1
join persons p2 on p2.points > p1.points
where p1.person_id = ?
To find the non-dense rank of a specific person, run the query:
select count(*)
from persons p1
join persons p2 on p2.points >= p1.points
where p1.person_id = ?
(I would expect the dense rank query to run significantly faster.)
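A small sketch of the two rank queries on made-up data, to show how they differ when several people share the same points. SQLite via Python's sqlite3 stands in for MySQL, and the helper names dense_rank and rank_with_ties are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (person_id INTEGER PRIMARY KEY, points INTEGER NOT NULL);
    CREATE INDEX idx_persons_points ON persons (points);
""")
# Persons 2 and 3 are tied on 250 points; persons 5 and 6 have none.
conn.executemany(
    "INSERT INTO persons VALUES (?, ?)",
    [(1, 300), (2, 250), (3, 250), (4, 10), (5, 0), (6, 0)],
)


def dense_rank(conn, person_id):
    """1 + number of distinct point values strictly above this person's."""
    (rank,) = conn.execute(
        """
        SELECT COUNT(DISTINCT p2.points) + 1
        FROM persons p1
        JOIN persons p2 ON p2.points > p1.points
        WHERE p1.person_id = ?
        """,
        (person_id,),
    ).fetchone()
    return rank


def rank_with_ties(conn, person_id):
    """Number of persons with at least as many points (the person included)."""
    (rank,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM persons p1
        JOIN persons p2 ON p2.points >= p1.points
        WHERE p1.person_id = ?
        """,
        (person_id,),
    ).fetchone()
    return rank
```

With this data, dense_rank(conn, 3) gives 2 (only 300 is higher), while rank_with_ties(conn, 3) gives 3, because persons 1, 2 and 3 all have at least 250 points.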
