I have 3 tables in a MySQL database: courses, users and participants, which contain about 30 million, 30k and 3k entries respectively.
My goal is to (efficiently) figure out the number of users that have been assigned to courses that match our criteria. The real criteria are a little more complex, but for this example we only care about users where deleted_at is null, and courses where deleted_at is null and active is 1.
Simplified, these are the columns and some sample rows:

users:
id | deleted_at
1  | null
2  | 2022-01-01

courses:
id | active | deleted_at
1  | 1      | null
1  | 1      | 2020-01-01
2  | 0      | 2020-01-01

participants:
id | participant_id | course_id
1  | 1              | 1
2  | 1              | 2
3  | 2              | 2
Based on the data above, the number we would get is 1, as only user 1 is not deleted and that user is assigned to a course (id 1) that is active and not deleted.
Here is a list of what I've tried:
- Joining all the tables and doing simple WHERE clauses.
- Joining using subqueries.
- Pulling the matching courses and users out to the application layer (PHP) and querying participants using WHERE IN.
- Pulling everything out and doing the filtering in the application layer.
- Using EXPLAIN to add better indexes - I, admittedly, do not do this often and may not have done it well enough.
- A combination of all of the above.
An example of a query would be:
SELECT COUNT(DISTINCT participant_id)
FROM `participants`
INNER JOIN
(SELECT `courses`.`id`
FROM `courses`
WHERE (`active` = '1')
AND `deleted_at` IS NULL) AS `tempCourses` ON `tempCourses`.`id` = `participants`.`course_id`
WHERE `participant_type` = 'Eloomi\\Models\\User'
AND `participant_id` in
(SELECT `users`.`id`
FROM `users`
WHERE `users`.`deleted_at` IS NULL)
From what I can gather, this will build a massive intermediate table and only then start applying the WHERE filters. In my mind it should be possible to short-circuit a lot of that, because once we get a match for a user we can disregard that user going forward. That, at least, is how I would handle it in the application layer.
We could do this on a per-user basis in the application layer, but the number of requests to the database would make this a bad solution.
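That per-user short-circuit is essentially what an EXISTS semi-join expresses in SQL. As a sketch of what I mean (untested, using the columns from the example above):

SELECT COUNT(*)
FROM users u
WHERE u.deleted_at IS NULL
  AND EXISTS
    (SELECT 1
     FROM participants p
     INNER JOIN courses c ON c.id = p.course_id
     WHERE p.participant_id = u.id
       AND p.participant_type = 'Eloomi\\Models\\User'
       AND c.active = 1
       AND c.deleted_at IS NULL);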
I have tagged this as PHP as well as MySQL, not because it has to be PHP but because I do not mind offloading some parts to the application layer if that is required. It's my experience that joins do not always use indexes optimally.
Edit:
To clarify my question: can someone suggest an efficient way to pull out the number of non-deleted users that have been assigned to active, non-deleted courses?
I would write it this way:
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
MySQL may join the tables in another order, not the order you list the tables in the query.
I hope that courses is the first table MySQL accesses, because it's probably the smallest table. Especially after filtering by active and deleted_at. The following index will help to narrow down that filtering, so only matching rows are examined:
ALTER TABLE courses ADD KEY (active, deleted_at);
Every index implicitly has the table's primary key (e.g. id) appended as the last column. Since that column is part of the index, it can be used in the join to participants. You then need an index on participants that the join can use to find the corresponding rows in that table; the order of columns in the index is important.
ALTER TABLE participants ADD KEY (course_id, participant_type, participant_id);
The participant_id is used to join to the users table. MySQL's optimizer will probably prefer to join to users by its primary key, but you also want to restrict that by deleted_at, so you might need this index:
ALTER TABLE users ADD KEY deleted_at (id, deleted_at);
And you might need to use an index hint to coax the optimizer to prefer this secondary index over the primary key index.
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u USE INDEX(deleted_at)
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
MySQL knows how to use compound indexes even if some conditions are in join clauses and other conditions are in the WHERE clause.
Caveat: I have not tested this. Choosing indexes may take several tries, testing the EXPLAIN output after each try.
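For example, re-check the plan after each index change by prefixing the query with EXPLAIN (the exact output columns vary by MySQL version):

EXPLAIN
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p ON c.id = p.course_id
INNER JOIN users AS u ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
  AND c.active = 1 AND c.deleted_at IS NULL
  AND p.participant_type = 'Eloomi\\Models\\User';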
Related
I am using a Yii2 data provider to extract data from the database. The raw query looks like this:
SELECT `client_money_operation`.* FROM `client_money_operation`
LEFT JOIN `user` ON `client_money_operation`.`user_id` = `user`.`id`
LEFT JOIN `client` ON `client_money_operation`.`client_id` = `client`.`id`
LEFT JOIN `client_bonus_operation` ON `client_money_operation`.`id` = `client_bonus_operation`.`money_operation_id`
WHERE (`client_money_operation`.`status`=0) AND (`client_money_operation`.`created_at` BETWEEN 1 AND 1539723600)
GROUP BY `operation_code` ORDER BY `created_at` DESC LIMIT 10
This query takes 107 seconds to execute.
The client_money_operation table contains 132,000 rows. What do I need to do to optimise this query, or to set up my database properly?
Try pagination. But if you really must show a large set of records in one go, remove as many LEFT JOINs as you can. You can duplicate some data in the client_money_operation table if it is definitely required in the one-go result set.
SELECT mo.*
FROM `client_money_operation` AS mo
LEFT JOIN `user` AS u ON mo.`user_id` = u.`id`
LEFT JOIN `client` AS c ON mo.`client_id` = c.`id`
LEFT JOIN `client_bonus_operation` AS bo ON mo.`id` = bo.`money_operation_id`
WHERE (mo.`status`=0)
AND (mo.`created_at` BETWEEN 1 AND 1539723600)
GROUP BY `operation_code`
ORDER BY `created_at` DESC
LIMIT 10
is a rather confusing use of GROUP BY. First, it is improper to group by one column while having lots of non-aggregated columns in the SELECT list. And the use of created_at in the ORDER BY does not make sense since it is unclear which date will be associated with each operation_code. Perhaps you want MIN(created_at)?
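If the goal is one row per operation_code with a deterministic date, a sketch of what that might look like (assuming, which is not confirmed below, that operation_code lives in client_money_operation):

SELECT operation_code,
       MIN(created_at) AS first_created_at,
       COUNT(*) AS operation_count
FROM client_money_operation
WHERE status = 0
  AND created_at BETWEEN 1 AND 1539723600
GROUP BY operation_code
ORDER BY first_created_at DESC
LIMIT 10;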
Optimization...
There will be a full scan of mo and (hopefully) PRIMARY KEY lookups into the other tables. Please provide EXPLAIN SELECT ... so we can check this.
The only useful index on mo is INDEX(status, created_at), and it may or may not be useful, depending on how big that date range is.
bo needs some index starting with money_operation_id.
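Concretely, those two indexes might be created like this (the index names are illustrative):

ALTER TABLE client_money_operation ADD KEY status_created_at (status, created_at);
ALTER TABLE client_bonus_operation ADD KEY money_operation_id (money_operation_id);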
What table(s) are operation_code and created_at in? It makes a big difference to the Optimizer.
But there is a pattern that can probably be used to greatly speed up the query. (I can't give you details without knowing what table those columns are in, nor whether it can be made to work.)
SELECT mo.*
FROM ( SELECT mo.id FROM .. WHERE .. GROUP BY .. ORDER BY .. LIMIT .. ) AS x
JOIN mo ON x.id = mo.id
ORDER BY .. -- yes, repeated
That is, first do (in a derived table) the minimal work to find the ids of the 10 rows desired, then use JOIN(s) to fetch the other columns needed.
(If yii2 cannot be made to generate such, then it is in the way.)
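Applied to this query, the pattern might look like the following sketch (again assuming operation_code is in client_money_operation, which I can't verify; MIN(id) picks a deterministic row per group):

SELECT mo.*
FROM ( SELECT MIN(id) AS id
       FROM client_money_operation
       WHERE status = 0
         AND created_at BETWEEN 1 AND 1539723600
       GROUP BY operation_code
       ORDER BY MIN(created_at) DESC
       LIMIT 10 ) AS x
JOIN client_money_operation AS mo ON mo.id = x.id
ORDER BY mo.created_at DESC; -- yes, repeated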
I have a problem with my query: I have to create a report, but the results do not match the data.
here is my database https://www.db-fiddle.com/f/2bA7StrBpz18tLFgAQh2QV/1
and this is my example query, but the result is wrong:
SELECT
a.IdBukti,c.LineName,a.LineID,a.Tanggal,b.TypeProduksi AS partnamemonthly,a.PartID AS partmonthly,a.QtyPlanning AS qtymonthly,
d.partnamedaily,d.partiddaily,d.qtydaily
FROM
trans_ppicbdt_dt a
INNER JOIN ms_part b ON b.PartId = a.PartID
INNER JOIN ms_line c ON c.LineID = a.LineID
INNER JOIN(SELECT
c.LineName,
a.LineID,
a.Tanggal,
b.TypeProduksi AS partnamedaily,
a.PartID AS partiddaily,
a.QtyPlanning AS qtydaily
FROM
trans_ppich a
INNER JOIN ms_part b ON b.PartId = a.PartID
INNER JOIN ms_line c ON c.LineID = a.LineID
WHERE
a.Tanggal = '2018-04-11' AND a.DivisiId='DI070' AND a.IdLocation='1'
GROUP BY
a.LineID,
a.PartID) d on d.LineID=a.LineID AND d.Tanggal=a.Tanggal
WHERE
a.Tanggal = '2018-04-11' AND a.DivisiId='DI070' AND a.IdLocation='1'
GROUP BY
a.LineID,
a.PartID
So I have two sets of data: the first is the monthly plan and the second is the daily plan.
I want the result to show the monthly and daily plans side by side.
Can you help me create the report in one single query?
It's easier (for me) to figure out how to formulate a query if I can see the relations at a glance. Here's how yours look:
(monthly) (daily)
trans_ppicbdt_dt ms_line trans_ppich
---------------- ----------- --------------
IdUnik (primary) ----> LineID <---- IdBukti (FK)*
LineID ------------/ \--- LineID
There are some problems with your structure that need to be fixed at the earliest opportunity:
ms_line needs a primary key. lineID should work if it is unique. (no such luck, it is not unique)
trans_ppich needs a primary key. IdBukti is a foreign key to yet another table.
There needs to be an index on lineID in all tables.
Tables need to be normalized. You shouldn't have to repeat data over several lines.
Nevertheless, we can still get to where you want to go. Now that I can see the relation, the query is really rather simple in theory, but I don't have all the required tables and fields:
SELECT DISTINCT line.LineName,
monthly.{unknown field} as monthlyPartName,
monthly.PartID as monthlyPartID,
monthly.{unknown field} as monthlyProcess,
monthly.{unknown field} as monthlyQty,
daily.{unknown field} as dailyPartName,
daily.PartID as dailyPartID,
daily.{unknown field} as dailyQty,
{unknown table}.{unknown field} as remarks
FROM ms_line as line
LEFT JOIN trans_ppicbdt_dt AS monthly ON line.lineID=monthly.lineID
LEFT JOIN trans_ppich AS daily ON line.lineID=daily.lineID
WHERE
{your where clause to filter results}
The way this works is that it iterates through the (distinct) line items and looks for a match in the monthly table (LEFT JOIN monthly). If there is a match, it adds the values; if there isn't, it simply adds NULL for those values. Then it does the same for the daily table: if there is a match, it adds the values, otherwise it gives NULL for all the daily values.
This will fall apart, however, if lineID is duplicated in rows of the monthly and daily tables.
I don't know where the remarks come from, so that is left as an exercise for you. :)
I need to select some data from a MySQL DB using PHP. It can be done within one single MySQL query, which takes 5 minutes to run on a good server (multiple JOINs on tables with more than 10 million rows).
I was wondering whether it is better practice to split the query in PHP and use some loops, rather than doing it all in MySQL. Also, would it be better to read all the emails from one table with 150,000 rows into an array and then check the array, instead of doing thousands of MySQL SELECTs?
Here is the Query:
SELECT count(contacted_emails.id), contacted_emails.email
FROM contacted_emails
LEFT OUTER JOIN blacklist ON contacted_emails.email = blacklist.email
LEFT OUTER JOIN submission_authors ON contacted_emails.email = submission_authors.email
LEFT OUTER JOIN users ON contacted_emails.email = users.email
GROUP BY contacted_emails.email
HAVING count(contacted_emails.id) > 3
The EXPLAIN returns:
The indexes in the 4 tables are:
contacted_emails: id, blacklist_section_id, journal_id and mail
blacklist: id, email and name
submission_authors: id, hash_key and email
users: id, email, firstname, lastname, editor_id, title_id, country_id, workplace_id, jobtype_id
The table contacted_emails is created like:
CREATE TABLE contacted_emails (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
contacted_at datetime NOT NULL,
created_at datetime NOT NULL,
blacklist_section_id int(11) unsigned NOT NULL,
journal_id int(10) DEFAULT NULL,
PRIMARY KEY (id),
KEY blacklist_section_id (blacklist_section_id),
KEY journal_id (journal_id),
KEY email (email) )
ENGINE=InnoDB AUTO_INCREMENT=4491706 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Your indexes look fine.
The performance problems seem to come from the fact that you're JOINing all rows, then filtering using HAVING.
This would probably work better instead:
SELECT *
FROM (
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING COUNT(id) > 3
) AS ce
LEFT OUTER JOIN blacklist AS bl ON ce.email = bl.email
LEFT OUTER JOIN submission_authors AS sa ON ce.email = sa.email
LEFT OUTER JOIN users AS u ON ce.email = u.email
/* EDIT: Exclude-join clause added based on comments below */
WHERE bl.email IS NULL
AND sa.email IS NULL
AND u.email IS NULL
Here you're limiting your initial GROUPed data set before the JOINs, which is significantly more optimal.
Although given the context of your original query, the LEFT OUTER JOINed tables don't seem to be used at all, so the below would probably return the exact same results with even less overhead:
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING count(id) > 3
What exactly is the point of those JOINed tables? the LEFT JOIN prevents them from reducing the data any, and you're only looking at the aggregate data from contacted_emails. Did you mean to use INNER JOIN instead?
EDIT: You mentioned that the point of the joins is to exclude emails in your existing tables. I modified my first query to do a proper exclude join (this was a bug in your originally posted code).
Here's another possible option that may perform well for you:
SELECT contacted_emails.email, COUNT(contacted_emails.id) AS number_of_contacts
FROM contacted_emails
LEFT JOIN (
SELECT email FROM blacklist
UNION ALL SELECT email FROM submission_authors
UNION ALL SELECT email FROM users
) AS existing ON contacted_emails.email = existing.email
WHERE existing.email IS NULL
GROUP BY contacted_emails.email
HAVING COUNT(id) > 3
What I'm doing here is gathering the existing emails in a subquery and doing a single exclude join on that derived table.
Another way you may try to express this is as a non-correlated subquery in the WHERE clause:
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
WHERE email NOT IN (
SELECT email FROM blacklist
UNION ALL SELECT email FROM submission_authors
UNION ALL SELECT email FROM users
)
GROUP BY email
HAVING COUNT(id) > 3
Try them all and see which gives the best execution plan in MySQL.
A couple of thoughts. In terms of the query, you may find it faster if you use
COUNT(*) AS row_count
and change the HAVING to
row_count > 3
as this can be satisfied from the contacted_emails.email index without having to access the row to get contacted_emails.id. As both fields are NOT NULL and contacted_emails is the base table, this should be the same logic.
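That is, something like this sketch:

SELECT email, COUNT(*) AS row_count
FROM contacted_emails
GROUP BY email
HAVING row_count > 3;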
As this query will only slow down as you collect more data, I would suggest a summary table where you store the counts (possibly per some time unit). This can either be updated periodically with a cron job or on the fly with triggers and/or application logic.
If you use a per-time-unit option on created_at and/or store the time of the last update for the cron, you should be able to get live results by pulling in and appending only the latest data.
Any cache solution would otherwise have to be invalidated to stay live, with the full query re-run every time the data is cleared/updated.
As suggested in the comments, the database is built for aggregating large amounts of data; PHP isn't.
You would probably be best served by a summary table that is updated via a trigger on every insert into your contacted_emails table. This summary table should have the email address and a count column. On every insert into the contacted table, update the count. Have an index on the count column of the summary table. Then you can query directly from THAT, get the email accounts in question, THEN join to get whatever other details need to be pulled.
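A minimal sketch of such a summary table and trigger (all names are illustrative, untested):

CREATE TABLE contacted_email_counts (
  email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
  contact_count int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (email),
  KEY contact_count (contact_count)
) ENGINE=InnoDB;

CREATE TRIGGER contacted_emails_after_insert
AFTER INSERT ON contacted_emails
FOR EACH ROW
  INSERT INTO contacted_email_counts (email, contact_count)
  VALUES (NEW.email, 1)
  ON DUPLICATE KEY UPDATE contact_count = contact_count + 1;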
Following your recommendations, I chose this solution:
SELECT ce.email, ce.number_of_contacts
FROM (
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING number_of_contacts > 3
) AS ce
NATURAL LEFT JOIN blacklist AS bl
NATURAL LEFT JOIN submission_authors AS sa
NATURAL LEFT JOIN users AS u
WHERE bl.email IS NULL AND sa.email IS NULL AND u.email IS NULL
This takes 10 seconds to run, which is fine for the moment. Once I have more data in the database, I will need to think about another solution, perhaps involving a temporary table.
So, to conclude: loading an entire table into a PHP array does not perform as well as making MySQL queries.
I am using MySQL, and this is a query that I quickly wrote, but I feel it can be optimized using JOINs. This is an example, by the way.
users table:
id | user_name | first_name | last_name | email         | password
1  | bobyxl    | Bob        | Cox       | bob@gmail.com | pass

player table:
id | role | player_name | user_id | server_id | racial_name
3  | 0    | boby123     | 1       | 2         | Klingon
1  | 1    | example     | 2       | 2         | Race
2  | 0    | boby2       | 1       | 1         | Klingon
SQL
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`,`users`
WHERE `users`.`id` = 1
and `users`.`id` = `player`.`user_id`
I know I can use a LEFT JOIN, but what are the benefits?
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`
LEFT JOIN `users`
ON `users`.`id` = `player`.`user_id`
WHERE `users`.`id` = 1
What are the benefits? I get the same results either way.
Your query has a JOIN in it. It is the same as writing:
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`
INNER JOIN `users` ON `users`.`id` = `player`.`user_id`
WHERE `users`.`id` = 1
The only reason for you to use a LEFT JOIN is if you want to get rows from the player table even when there is no matching row in the users table.
A LEFT JOIN will return rows from the left table even if there is no matching row in the right-side table.
I guess that at some point the player table's data will not line up with the users table, especially if data in the users table has not been inserted into the player table.
Your first query might return nothing in cases where the second table (player) has no data corresponding to the users table.
Also, IMHO, setting up another table for servers is a good idea in terms of complying with normalization rules for the database structure. After all, what server details is the server_id column in the player table pointing to?
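For example, such a table could be as simple as this (the structure is hypothetical):

CREATE TABLE servers (
  id int unsigned NOT NULL AUTO_INCREMENT,
  server_name varchar(100) NOT NULL,
  PRIMARY KEY (id)
);

with player.server_id then acting as a foreign key to servers.id.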
The first solution makes a direct product (it pairs everything with everything) and only then drops the non-matching results. If you have a lot of rows this can be very slow!
The LEFT JOIN reads the left table first, then pulls in only the matching rows from the right (or NULL).
In your example you don't even need a join. :)
This will give you the same result, and it will be fine as long as you only filter by user id:
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`
WHERE `player`.`user_id` = 1
Another solution, if you want more conditions without a join, could be something like this:
SELECT * FROM player WHERE player.user_id IN (SELECT id FROM users WHERE ...... )
I am using MySQL tables that have the following data:
users(ID, name, email, create_added) (about 10000 rows)
points(user_id, point) (about 15000 rows)
And my query:
SELECT u.*, SUM(p.point) point
FROM users u
LEFT JOIN points p ON p.user_id = u.ID
WHERE u.id > 0
GROUP BY u.id
ORDER BY point DESC
LIMIT 0, 10
I only want the top 10 users with the best points, but the query dies. How can I improve its performance?
Like @Grim said, you can use INNER JOIN instead of LEFT JOIN. However, if you are truly looking for optimization, I would suggest an extra field on the users table holding a precalculated point total. This solution would beat any query optimization with your current database design.
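A sketch of that denormalization (the column name is illustrative; keeping it current is then up to triggers or application logic):

ALTER TABLE users ADD COLUMN total_points int NOT NULL DEFAULT 0;

UPDATE users u
LEFT JOIN ( SELECT user_id, SUM(point) AS total
            FROM points
            GROUP BY user_id ) p ON p.user_id = u.ID
SET u.total_points = COALESCE(p.total, 0);

-- the top-10 query then becomes:
SELECT * FROM users ORDER BY total_points DESC LIMIT 10;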
Swapping the LEFT JOIN for an INNER JOIN would help a lot. Make sure points.point and points.user_id are indexed. I assume you can get rid of the WHERE clause, as u.id will always be more than 0 (although MySQL probably does this for you at the query optimisation stage).
It doesn't really matter that you are getting only 10 rows. MySQL has to sum up the points for every user before it can sort them (a "Using filesort" operation); the LIMIT is applied last.
A covering index ON points(user_id,point) is going to be the best bet for optimum performance. (I'm really just guessing, without any EXPLAIN output or table definitions.)
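In MySQL syntax, that covering index might be created like this (the index name is illustrative):

ALTER TABLE points ADD KEY user_id_point (user_id, point);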
The id column in users is likely the primary key, or at least a unique index, so you likely already have an index with id as the leading column (the primary key clustered index, if it's InnoDB).
I'd be tempted to test a query like this:
SELECT u.*
, s.total_points
FROM ( SELECT p.user_id
, SUM(p.point) AS total_points
FROM points p
WHERE p.user_id > 0
GROUP BY p.user_id
ORDER BY total_points DESC
LIMIT 10
) s
JOIN users u
ON u.id = s.user_id
ORDER BY s.total_points DESC
That does have the overhead of creating a derived table, but with a suitable index on points, with a leading column of user_id and including the point column, it's likely that MySQL can optimize the GROUP BY using the index, avoiding one "Using filesort" operation (for the GROUP BY).
There will likely be a "Using filesort" operation on that resultset, to get the rows ordered by total_points. Then get the first 10 rows from that.
With those 10 rows, we can join to the user table to get the corresponding rows.
BUT... there is one slight difference with this result: if any of the user_id values in the top 10 aren't in the users table, this query will return fewer than 10 rows. (I'd expect there to be a foreign key defined so that wouldn't happen, but I'm really just guessing without table definitions.)
An EXPLAIN would show the access plan being used by MySQL.
Have you ever thought about partitioning?
I'm currently working with a large database and have successfully improved my SQL queries this way.
For example,
PARTITION BY RANGE (`ID`) (
PARTITION p1 VALUES LESS THAN (100) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (200) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (300) ENGINE = InnoDB,
... and so on..
)
It allows us to get better speed when scanning the table: MySQL will scan only partition p1, which contains IDs 1 to 99, even if there are millions of rows in the table.
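You can check that pruning actually happens with EXPLAIN PARTITIONS (MySQL 5.5/5.6 syntax; from 5.7 on, plain EXPLAIN always includes a partitions column):

EXPLAIN PARTITIONS SELECT * FROM users WHERE ID < 100;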
Check out this http://dev.mysql.com/doc/refman/5.5/en/partitioning.html