Slow MySQL Query - Cache the data in a PHP array? - php

I need to select some data from a MySQL DB using PHP. It can be done within one single MySQL query, which takes 5 minutes to run on a good server (multiple JOINs on tables with more than 10 million rows).
I was wondering whether it is better practice to split the query up in PHP and use some loops, rather than doing it all in MySQL. Also, would it be better to read all the emails from one 150,000-row table into an array and then check the array, instead of doing thousands of MySQL SELECTs?
Here is the Query:
SELECT count(contacted_emails.id), contacted_emails.email
FROM contacted_emails
LEFT OUTER JOIN blacklist ON contacted_emails.email = blacklist.email
LEFT OUTER JOIN submission_authors ON contacted_emails.email = submission_authors.email
LEFT OUTER JOIN users ON contacted_emails.email = users.email
GROUP BY contacted_emails.email
HAVING count(contacted_emails.id) > 3
The indexes in the 4 tables are:
contacted_emails: id, blacklist_section_id, journal_id and email
blacklist: id, email and name
submission_authors: id, hash_key and email
users: id, email, firstname, lastname, editor_id, title_id, country_id, workplace_id, jobtype_id
The table contacted_emails is created like:
CREATE TABLE contacted_emails (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
contacted_at datetime NOT NULL,
created_at datetime NOT NULL,
blacklist_section_id int(11) unsigned NOT NULL,
journal_id int(10) DEFAULT NULL,
PRIMARY KEY (id),
KEY blacklist_section_id (blacklist_section_id),
KEY journal_id (journal_id),
KEY email (email) )
ENGINE=InnoDB AUTO_INCREMENT=4491706 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Your indexes look fine.
The performance problems seem to come from the fact that you're JOINing all rows, then filtering using HAVING.
This would probably work better instead:
SELECT *
FROM (
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING COUNT(id) > 3
) AS ce
LEFT OUTER JOIN blacklist AS bl ON ce.email = bl.email
LEFT OUTER JOIN submission_authors AS sa ON ce.email = sa.email
LEFT OUTER JOIN users AS u ON ce.email = u.email
/* EDIT: Exclude-join clause added based on comments below */
WHERE bl.email IS NULL
AND sa.email IS NULL
AND u.email IS NULL
Here you're limiting your initial GROUPed data set before the JOINs, which is significantly more efficient.
Although given the context of your original query, the LEFT OUTER JOINed tables don't seem to be used at all, so the query below would probably return the exact same results with even less overhead:
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING count(id) > 3
What exactly is the point of those JOINed tables? The LEFT JOIN prevents them from reducing the data at all, and you're only looking at the aggregate data from contacted_emails. Did you mean to use INNER JOIN instead?
EDIT: You mentioned that the point of the joins is to exclude emails in your existing tables. I modified my first query to do a proper exclude join (this was a bug in your originally posted code).
Here's another possible option that may perform well for you:
SELECT contacted_emails.email, COUNT(id) AS number_of_contacts
FROM contacted_emails
LEFT JOIN (
SELECT email FROM blacklist
UNION ALL SELECT email FROM submission_authors
UNION ALL SELECT email FROM users
) AS existing ON contacted_emails.email = existing.email
WHERE existing.email IS NULL
GROUP BY contacted_emails.email
HAVING COUNT(id) > 3
What I'm doing here is gathering the existing emails in a subquery and doing a single exclude join on that derived table.
Another way you may try to express this is as a non-correlated subquery in the WHERE clause:
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
WHERE email NOT IN (
SELECT email FROM blacklist
UNION ALL SELECT email FROM submission_authors
UNION ALL SELECT email FROM users
)
GROUP BY email
HAVING COUNT(id) > 3
Try them all and see which gives the best execution plan in MySQL.
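As a quick sanity check that the exclude-join and NOT IN forms agree, here is a minimal sketch using SQLite via Python (the rows and email addresses are made up; MySQL syntax is essentially identical for these two queries):

```python
import sqlite3

# Toy versions of the four tables, just large enough to exercise both patterns.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE contacted_emails (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE blacklist (email TEXT);
CREATE TABLE submission_authors (email TEXT);
CREATE TABLE users (email TEXT);
INSERT INTO contacted_emails (email) VALUES
  ('a@x.com'),('a@x.com'),('a@x.com'),('a@x.com'),  -- 4 contacts, not excluded
  ('b@x.com'),('b@x.com'),('b@x.com'),('b@x.com'),  -- 4 contacts, blacklisted
  ('c@x.com');                                      -- only 1 contact
INSERT INTO blacklist VALUES ('b@x.com');
""")

exclude_join = con.execute("""
  SELECT contacted_emails.email, COUNT(id) AS n
  FROM contacted_emails
  LEFT JOIN (
    SELECT email FROM blacklist
    UNION ALL SELECT email FROM submission_authors
    UNION ALL SELECT email FROM users
  ) AS existing ON contacted_emails.email = existing.email
  WHERE existing.email IS NULL
  GROUP BY contacted_emails.email
  HAVING COUNT(id) > 3
""").fetchall()

not_in = con.execute("""
  SELECT email, COUNT(id) AS n
  FROM contacted_emails
  WHERE email NOT IN (
    SELECT email FROM blacklist
    UNION ALL SELECT email FROM submission_authors
    UNION ALL SELECT email FROM users
  )
  GROUP BY email
  HAVING COUNT(id) > 3
""").fetchall()

print(exclude_join)  # [('a@x.com', 4)]
assert exclude_join == not_in
```

One caveat that applies to real data: NOT IN behaves surprisingly if the subquery can return NULL emails, whereas the exclude join does not, so the exclude-join form is the safer of the two.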

A couple of thoughts. In terms of the query, you may find it faster if you use
COUNT(*) AS row_count
and change the HAVING to
row_count > 3
as this can be satisfied from the contacted_emails.email index without having to access the row to get contacted_emails.id. As both fields are NOT NULL and contacted_emails is the base table, this should be the same logic.
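A small sketch of that COUNT(*) vs COUNT(id) equivalence, using SQLite via Python (toy data; the point carries over to MySQL precisely because id is NOT NULL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacted_emails (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
con.executemany("INSERT INTO contacted_emails (email) VALUES (?)",
                [("a@x.com",)] * 5 + [("b@x.com",)] * 2)

rows = con.execute("""
  SELECT email, COUNT(*) AS row_count, COUNT(id) AS id_count
  FROM contacted_emails
  GROUP BY email
""").fetchall()

# Because id is NOT NULL, both counts agree for every group.
assert all(rc == ic for _, rc, ic in rows)
print(sorted(rows))  # [('a@x.com', 5, 5), ('b@x.com', 2, 2)]
```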
As this query will only get slower as you collect more data, I would suggest a summary table where you store the counts (possibly per some time unit). This can either be updated periodically with a cron job or on the fly with triggers and/or application logic.
If you use a per time unit option on created_at and/or store the last update to the cron, you should be able to get live results by pulling in and appending the latest data.
Any cache solution would have to be adjusted anyway to stay live and the full query run every time the data is cleared/updated.
As suggested in the comments, the database is built for aggregating large amounts of data; PHP isn't.

You would probably be best off with a summary table that is updated via a trigger on every insert into your contacted_emails table. This summary table should have the email address and a count column. On every insert into the contacted table, update the count. Have an index on the count column in the summary table. Then you can query directly from THAT to get the email accounts in question, THEN join to pull whatever other details are needed.
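A minimal sketch of such a trigger-maintained summary table, written for SQLite via Python (MySQL trigger syntax differs slightly, and the email_summary table and trigger names here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE contacted_emails (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE email_summary (
  email TEXT PRIMARY KEY,
  contact_count INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_summary_count ON email_summary (contact_count);

-- Keep the summary in step with every insert into the base table.
CREATE TRIGGER trg_count AFTER INSERT ON contacted_emails
BEGIN
  INSERT OR IGNORE INTO email_summary (email, contact_count) VALUES (NEW.email, 0);
  UPDATE email_summary SET contact_count = contact_count + 1 WHERE email = NEW.email;
END;
""")

for e in ["a@x.com"] * 4 + ["b@x.com"]:
    con.execute("INSERT INTO contacted_emails (email) VALUES (?)", (e,))

# The expensive GROUP BY ... HAVING is now a cheap indexed range scan.
hot = con.execute(
    "SELECT email, contact_count FROM email_summary WHERE contact_count > 3"
).fetchall()
print(hot)  # [('a@x.com', 4)]
```

The trade-off is a small constant cost on every insert in exchange for near-instant reads, which is usually the right deal for a table that is read far more often than it grows.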

Following your recommendations, I chose this solution:
SELECT ce.email, ce.number_of_contacts
FROM (
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING number_of_contacts > 3
) AS ce
NATURAL LEFT JOIN blacklist AS bl
NATURAL LEFT JOIN submission_authors AS sa
NATURAL LEFT JOIN users AS u
WHERE bl.email IS NULL AND sa.email IS NULL AND u.email IS NULL
This takes 10 seconds to run, which is fine for the moment. Once I have more data in the database, I will need to think about another solution, probably involving a temporary table.
So, to conclude: loading an entire table into a PHP array is not as good for performance as letting MySQL do the work.

Related

Optimal joins in MySQL or offloading to application layer

I have 3 tables in a MySQL database: courses, users and participants, which contains about 30mil, 30k and 3k entries respectively.
My goal is to (efficiently) figure out the number of users that have been assigned to courses that matches our criteria. The criteria is a little more complex, but for this example we only care about users where deleted_at is null and courses where deleted_at is null and active is 1.
Simplified these are the columns:
users:
id   deleted_at
1    null
2    2022-01-01

courses:
id   active   deleted_at
1    1        null
1    1        2020-01-01
2    0        2020-01-01

participants:
id   participant_id   course_id
1    1                1
2    1                2
3    2                2
Based on the data above, the result would be 1, as only user 1 is not deleted, and that user is assigned to a course (id 1) that is active and not deleted.
Here is a list of what I've tried.
Joining all the tables and do simple where's.
Joining using subqueries.
Pulling the correct courses and users out to the application layer (PHP), and querying participants using WHERE IN.
Pulling everything out and doing the filtering in the application layer.
Using EXPLAIN to add better indexes - I, admittedly, do not do this often and may not have done this well enough.
A combination of all the above.
An example of a query would be:
SELECT COUNT(DISTINCT participant_id)
FROM `participants`
INNER JOIN
(SELECT `courses`.`id`
FROM `courses`
WHERE (`active` = '1')
AND `deleted_at` IS NULL) AS `tempCourses` ON `tempCourses`.`id` = `participants`.`course_id`
WHERE `participant_type` = 'Eloomi\\Models\\User'
AND `participant_id` in
(SELECT `users`.`id`
FROM `users`
WHERE `users`.`deleted_at` IS NULL)
From what I can gather, doing this will create a massive intermediate table, and only then will the WHERE conditions start being applied. In my mind it should be possible to short-circuit a lot of that, because once we get a match for a user, we can disregard that user going forward. That would be how I would handle it in the application layer.
We could do this on a per-user basis in the application layer, but the number of requests to the database would make this a bad solution.
I have tagged it as PHP as well as MySQL, not because it has to be PHP but because I do not mind offloading some parts to the application layer if that is required. It's my experience that joins do not always use indexes optimally.
Edit:
To specify my question: can someone suggest an efficient way to pull out the number of non-deleted users that have been assigned to active, non-deleted courses?
I would write it this way:
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
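Assuming the sample rows from the question (and leaving out the participant_type predicate, since that column is not part of the simplified sample data), the rewritten query can be checked end to end in SQLite via Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER, deleted_at TEXT);
CREATE TABLE courses (id INTEGER, active INTEGER, deleted_at TEXT);
CREATE TABLE participants (id INTEGER, participant_id INTEGER, course_id INTEGER);
-- Sample data from the question.
INSERT INTO users VALUES (1, NULL), (2, '2022-01-01');
INSERT INTO courses VALUES (1, 1, NULL), (1, 1, '2020-01-01'), (2, 0, '2020-01-01');
INSERT INTO participants VALUES (1, 1, 1), (2, 1, 2), (3, 2, 2);
""")

(n,) = con.execute("""
  SELECT COUNT(DISTINCT p.participant_id)
  FROM courses AS c
  INNER JOIN participants AS p ON c.id = p.course_id
  INNER JOIN users AS u ON p.participant_id = u.id
  WHERE u.deleted_at IS NULL
    AND c.active = 1 AND c.deleted_at IS NULL
""").fetchone()

print(n)  # 1 -- only user 1 is live and enrolled in a live course
assert n == 1
```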
MySQL may join the tables in another order, not the order you list the tables in the query.
I hope that courses is the first table MySQL accesses, because it's probably the smallest table. Especially after filtering by active and deleted_at. The following index will help to narrow down that filtering, so only matching rows are examined:
ALTER TABLE courses ADD KEY (active, deleted_at);
Every index implicitly has the table's primary key (e.g. id) appended as the last column. That column being part of the index, it is used in the join to participants. So you need an index in participants that the join uses to find the corresponding rows in that table. The order of columns in the index is important.
ALTER TABLE participants ADD KEY (course_id, participant_type, participant_id);
The participant_id is used to join to the users table. MySQL's optimizer will probably prefer to join to users by its primary key, but you also want to restrict that by deleted_at, so you might need this index:
ALTER TABLE users ADD KEY idx_users_id_deleted (id, deleted_at);
And you might need to use an index hint to coax the optimizer to prefer this secondary index over the primary key index.
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u USE INDEX (idx_users_id_deleted)
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
MySQL knows how to use compound indexes even if some conditions are in join clauses and other conditions are in the WHERE clause.
Caveat: I have not tested this. Choosing indexes may take several tries, and testing the EXPLAIN after each try.

yii2 data provider query takes very long time

I am using the yii2 data provider to extract data from the database. The raw query looks like this:
SELECT `client_money_operation`.* FROM `client_money_operation`
LEFT JOIN `user` ON `client_money_operation`.`user_id` = `user`.`id`
LEFT JOIN `client` ON `client_money_operation`.`client_id` = `client`.`id`
LEFT JOIN `client_bonus_operation` ON `client_money_operation`.`id` = `client_bonus_operation`.`money_operation_id`
WHERE (`client_money_operation`.`status`=0) AND (`client_money_operation`.`created_at` BETWEEN 1 AND 1539723600)
GROUP BY `operation_code` ORDER BY `created_at` DESC LIMIT 10
this query takes 107 seconds to execute.
The table client_money_operation contains 132,000 rows. What do I need to do to optimise this query, or to set up my database properly?
Try pagination. But if you must show a large set of records in one go, remove as many LEFT JOINs as you can. You can duplicate some data in the client_money_operation table if it is absolutely required in the one-go result set.
SELECT mo.*
FROM `client_money_operation` AS mo
LEFT JOIN `user` AS u ON mo.`user_id` = u.`id`
LEFT JOIN `client` AS c ON mo.`client_id` = c.`id`
LEFT JOIN `client_bonus_operation` AS bo ON mo.`id` = bo.`money_operation_id`
WHERE (mo.`status`=0)
AND (mo.`created_at` BETWEEN 1 AND 1539723600)
GROUP BY `operation_code`
ORDER BY `created_at` DESC
LIMIT 10
This is a rather confusing use of GROUP BY. First, it is improper to group by one column while having lots of non-aggregated columns in the SELECT list. And the use of created_at in the ORDER BY does not make sense, since it is unclear which date will be associated with each operation_code. Perhaps you want MIN(created_at)?
Optimization...
There will be a full scan of mo and (hopefully) PRIMARY KEY lookups into the other tables. Please provide EXPLAIN SELECT ... so we can check this.
The only useful index on mo is INDEX(status, created_at), and it may or may not be useful, depending on how big that date range is.
bo needs some index starting with money_operation_id.
What table(s) are operation_code and created_at in? It makes a big difference to the Optimizer.
But there is a pattern that can probably be used to greatly speed up the query. (I can't give you details without knowing what table those columns are in, nor whether it can be made to work.)
SELECT mo.*
FROM ( SELECT mo.id FROM .. WHERE .. GROUP BY .. ORDER BY .. LIMIT .. ) AS x
JOIN mo ON x.id = mo.id
ORDER BY .. -- yes, repeated
That is, first do (in a derived table) the minimal work to find the ids of the 10 desired rows, then use JOIN(s) to fetch the other columns needed.
(If yii2 cannot be made to generate such, then it is in the way.)
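A sketch of that deferred-join pattern in SQLite via Python, with made-up rows. Picking MAX(id) as the representative row per operation_code is an assumption here, since the original GROUP BY leaves that ambiguous; the point is the shape of the query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE client_money_operation (
    id INTEGER PRIMARY KEY, operation_code TEXT, status INTEGER,
    created_at INTEGER, payload TEXT)""")
# 40 wide-ish rows across 4 operation codes.
con.executemany(
    "INSERT INTO client_money_operation VALUES (?,?,?,?,?)",
    [(i, f"op{i % 4}", 0, 1000 + i, "x" * 100) for i in range(40)],
)

# Step 1 (derived table): minimal work to find just the ids of the rows we want.
# Step 2 (outer join): fetch the wide columns only for those ids.
rows = con.execute("""
  SELECT mo.id, mo.operation_code, mo.created_at
  FROM (
    SELECT MAX(id) AS id
    FROM client_money_operation
    WHERE status = 0 AND created_at BETWEEN 1 AND 1539723600
    GROUP BY operation_code
    ORDER BY MAX(created_at) DESC
    LIMIT 10
  ) AS x
  JOIN client_money_operation AS mo ON x.id = mo.id
  ORDER BY mo.created_at DESC
""").fetchall()

print([r[1] for r in rows])
assert len(rows) == 4  # one row per distinct operation_code
```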

Query performance issue

I am working with MySQL, and I am getting a performance issue with the query below:
SELECT COUNT(*)
FROM
(SELECT company.ID
FROM `company`
INNER JOIN `featured_company` ON (company.ID=featured_company.COMPANY_ID)
INNER JOIN `company_portal` ON (company.ID=company_portal.COMPANY_ID)
INNER JOIN `job` ON company.ID = job.COMPANY_ID
WHERE featured_company.DATE_START<='2016-09-21'
AND featured_company.DATE_END>='2016-09-21'
AND featured_company.PORTAL_ID=16
AND company_portal.PORTAL_ID=16
AND (company.IMAGE IS NOT NULL
AND company.IMAGE<>'')
AND job.IS_ACTIVE=1
AND job.IS_DELETED=0
AND job.EXPIRATION_DATE >= '2016-09-21'
AND job.ACTIVATION_DATE <= '2016-09-21'
GROUP BY company.ID) AS c
With this query I am getting the New Relic log below (Query analysis: Table - Hint):
featured_company
- The table was retrieved with this index: portal_date_start_end
- A temporary table was created to access this part of the query, which can cause poor performance. This typically happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
- MySQL had to do an extra pass to retrieve the rows in sorted order, which is a cause of poor performance but sometimes unavoidable.
- You can speed up this query by querying only fields that are within the index. Or you can create an index that includes every field in your query, including the primary key.
Approximately 89 rows of this table were scanned.
company_portal
- The table was retrieved with this index: PRIMARY
- Approximately 1 row of this table was scanned.
job
- The table was retrieved with this index: company_expiration_date
- You can speed up this query by querying only fields that are within the index. Or you can create an index that includes every field in your query, including the primary key.
- Approximately 37 rows of this table were scanned.
company
- The table was retrieved with this index: PRIMARY
- You can speed up this query by querying only fields that are within the index. Or you can create an index that includes every field in your query, including the primary key.
- Approximately 1 row of this table was scanned.
I have no idea what more I can do to optimize this query. Please share ideas if you have any.
Be sure you have proper indexes on:
featured_company.DATE_START
featured_company.PORTAL_ID
job.IS_ACTIVE
job.IS_DELETED
job.EXPIRATION_DATE
job.ACTIVATION_DATE
and possibly
company.IMAGE
Assuming that the ids are already indexed:
company.ID
featured_company.COMPANY_ID
job.COMPANY_ID
And a suggestion: since you don't use any aggregate function, don't use GROUP BY; use DISTINCT instead:
SELECT COUNT(*) FROM (
SELECT DISTINCT company.ID
FROM `company`
INNER JOIN `featured_company` ON company.ID=featured_company.COMPANY_ID
INNER JOIN `company_portal` ON company.ID=company_portal.COMPANY_ID
INNER JOIN `job` ON company.ID = job.COMPANY_ID
WHERE featured_company.DATE_START<='2016-09-21'
AND featured_company.DATE_END>='2016-09-21'
AND featured_company.PORTAL_ID=16
AND company_portal.PORTAL_ID=16
AND (company.IMAGE IS NOT NULL AND company.IMAGE<>'')
AND job.IS_ACTIVE=1
AND job.IS_DELETED=0
AND job.EXPIRATION_DATE >= '2016-09-21'
AND job.ACTIVATION_DATE <= '2016-09-21'
) AS t
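A tiny sketch showing that GROUP BY and DISTINCT produce the same count when no aggregate is computed in the subquery, using SQLite via Python with made-up rows (only two of the four tables are needed to make the point):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (id INTEGER PRIMARY KEY, image TEXT);
CREATE TABLE job (id INTEGER PRIMARY KEY, company_id INTEGER, is_active INTEGER);
INSERT INTO company VALUES (1, 'logo.png'), (2, 'pic.png'), (3, '');
INSERT INTO job VALUES (1, 1, 1), (2, 1, 1), (3, 2, 1), (4, 3, 1);
""")

# Same inner query, once de-duplicated with GROUP BY, once with DISTINCT.
q = """
  SELECT COUNT(*) FROM (
    SELECT {} company.id
    FROM company
    INNER JOIN job ON company.id = job.company_id
    WHERE company.image <> '' AND job.is_active = 1
    {}
  ) AS t
"""
(grouped,) = con.execute(q.format("", "GROUP BY company.id")).fetchone()
(distinct,) = con.execute(q.format("DISTINCT", "")).fetchone()

print(grouped, distinct)  # 2 2 -- companies 1 and 2 qualify
assert grouped == distinct == 2
```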

Joining tables based on reference table value

Have these tables
and given post_map.tag_id='1', I would like to get:
The entity_type table determines which table we look in, i.e. which table entity_id is stored in. My main goal is to get this table as the result of either mysqli::multi_query() or mysqli::query(), i.e. without PHP going back and forth to the database multiple times; this table may have many, many rows, and getting it all at once would be much more efficient.
My attempts thus far:
I have tried the JOIN clause, but I don't know how to use the row value of a prior SELECT as the table name for the JOIN clause.
Tried prepared statements, but couldn't form anything usable.
It can be done with IF() and JOIN. I have a solution for you; run the query below:
SELECT et.type,
IF(et.type='resource',r.resource_type_id,NULL) AS resource_type_id,
IF(et.type='resource',r.value,NULL) AS value,
IF(et.type='user',u.name,NULL) AS name,
IF(et.type='link',l.source,NULL) AS source,
IF(et.type='link',l.count,NULL) AS count
FROM `post_map` as pm
JOIN `entity_type` as et ON pm.entity_id = et.id
LEFT JOIN `resource` as r ON pm.entity_type_id=r.id
LEFT JOIN `user` as u ON pm.entity_type_id=u.id
LEFT JOIN `link` as l ON pm.entity_type_id=l.id
WHERE pm.tag_id='1'
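SQLite has no IF(), but the same pattern works with CASE WHEN, so the idea can be checked with a runnable sketch via Python. The minimal schema and rows below are made up, since the question's tables were shown only as images:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entity_type (id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE post_map (tag_id INTEGER, entity_id INTEGER, entity_type_id INTEGER);
CREATE TABLE resource (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE link (id INTEGER PRIMARY KEY, source TEXT);
INSERT INTO entity_type VALUES (1, 'resource'), (2, 'user'), (3, 'link');
INSERT INTO resource VALUES (10, 'some-file');
INSERT INTO user VALUES (20, 'alice');
INSERT INTO link VALUES (30, 'example.org');
INSERT INTO post_map VALUES (1, 1, 10), (1, 2, 20), (1, 3, 30);
""")

# LEFT JOIN every candidate table; CASE keeps only the column that matches
# the row's entity type, leaving the rest NULL.
rows = con.execute("""
  SELECT et.type,
         CASE WHEN et.type = 'resource' THEN r.value  END AS value,
         CASE WHEN et.type = 'user'     THEN u.name   END AS name,
         CASE WHEN et.type = 'link'     THEN l.source END AS source
  FROM post_map AS pm
  JOIN entity_type AS et ON pm.entity_id = et.id
  LEFT JOIN resource AS r ON pm.entity_type_id = r.id
  LEFT JOIN user AS u ON pm.entity_type_id = u.id
  LEFT JOIN link AS l ON pm.entity_type_id = l.id
  WHERE pm.tag_id = 1
""").fetchall()

for row in rows:
    print(row)
assert set(rows) == {
    ('resource', 'some-file', None, None),
    ('user', None, 'alice', None),
    ('link', None, None, 'example.org'),
}
```

One LEFT JOIN per candidate table is the cost of this approach; it stays a single round trip, but every target table is probed for every row.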

SQL Optimization WHERE vs JOIN

I am using MySQL, and this is a query that I quickly wrote, but I feel it can be optimized using JOINs. This is an example, btw.
users table:
id user_name first_name last_name email password
1 bobyxl Bob Cox bob@gmail.com pass
player table
id role player_name user_id server_id racial_name
3 0 boby123 1 2 Klingon
1 1 example 2 2 Race
2 0 boby2 1 1 Klingon
SQL
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`,`users`
WHERE `users`.`id` = 1
and `users`.`id` = `player`.`user_id`
I know I can use a LEFT JOIN, but what are the benefits?
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`
LEFT JOIN `users`
ON `users`.`id` = `player`.`user_id`
WHERE `users`.`id` = 1
What are the benefits? I get the same results either way.
Your query has a JOIN in it. It is the same as writing:
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`
INNER JOIN `users` ON `users`.`id` = `player`.`user_id`
WHERE `users`.`id` = 1
The only reason for you to use left join is if you want to get data from player table even when you don't have matches in users table.
LEFT JOIN will get data from the left table even if there's no equal data from the right side table.
I guess at some point the player table's data will not line up with the users table, especially if data in the users table has not been inserted into the player table.
Your first query might return nothing in cases where the player table has no data corresponding to the users table.
Also, IMHO, setting up another table for servers is a good idea in terms of complying with database normalization rules. After all, what server details is the server_id column on the player table pointing to?
The first solution makes a direct product (it pairs everything with everything) and then drops the non-matching results. If you have a lot of rows, this can be very slow!
The LEFT JOIN reads the left table first, then pulls in only the matching rows from the right (or NULL).
In your example you don't even need a join. :)
This will give you the same result, and it will be fine as long as you only filter by user id:
SELECT `player`.`server_id`,`player`.`id`,`player`.`player_name`,`player`.`racial_name`
FROM `player`
WHERE `player`.`user_id` = 1
Another solution if you want more conditions, without join could be something like this:
SELECT * FROM player WHERE player.user_id IN (SELECT id FROM users WHERE ...... )
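All three forms (comma join, explicit INNER JOIN, and the plain WHERE) return the same rows here; a sketch in SQLite via Python using the sample data from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE player (id INTEGER PRIMARY KEY, player_name TEXT,
                     user_id INTEGER, server_id INTEGER);
INSERT INTO users VALUES (1, 'bobyxl'), (2, 'other');
INSERT INTO player VALUES
  (3, 'boby123', 1, 2), (1, 'example', 2, 2), (2, 'boby2', 1, 1);
""")

comma = con.execute("""
  SELECT player.id, player.player_name
  FROM player, users
  WHERE users.id = 1 AND users.id = player.user_id
""").fetchall()

inner = con.execute("""
  SELECT player.id, player.player_name
  FROM player
  INNER JOIN users ON users.id = player.user_id
  WHERE users.id = 1
""").fetchall()

direct = con.execute("""
  SELECT player.id, player.player_name
  FROM player
  WHERE player.user_id = 1
""").fetchall()

# Identical result sets: comma join and INNER JOIN are the same operation,
# and the join is unnecessary when only user_id is filtered.
assert set(comma) == set(inner) == set(direct) == {(3, 'boby123'), (2, 'boby2')}
```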
