Optimizing MySQL Query with "NOT IN" - php

I've seen a few questions dealing with the inefficiency of "NOT IN" in MySQL queries, but I didn't manage to reproduce the proposed solutions.
So I've got some sort of search engine. It starts with very simple queries, then tries more complicated ones if it doesn't find enough results. Here's how it works, in pseudocode:
list_of_ids = do_simple_search()
nb_results = size_of(list_of_ids)
if nb_results < max_nb_results:
    list_of_ids += do_search_where_id_not_in(list_of_ids)
    nb_results = size_of(list_of_ids)
if nb_results < max_nb_results:
    list_of_ids += do_complicated_search_where_id_not_in(list_of_ids)
Hope I'm clear.
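In PHP, each pass looks roughly like this (heavily simplified; the function name and query are placeholders for the real ones):

// Rough sketch of one search pass; function and query details are simplified placeholders.
function do_search_where_id_not_in(mysqli $db, string $term, array $exclude_ids, int $limit): array
{
    // The ids are integers, so casting them is enough to build the NOT IN list safely.
    $not_in = $exclude_ids
        ? ' AND c.id NOT IN (' . implode(',', array_map('intval', $exclude_ids)) . ')'
        : '';
    $like = '%' . $db->real_escape_string($term) . '%';
    $sql  = "SELECT DISTINCT c.id
             FROM clients c LEFT JOIN communications co ON c.id = co.client_id
             WHERE (co.titre LIKE '$like' OR co.contenu LIKE '$like')$not_in
             LIMIT " . (int) $limit;

    $ids = [];
    foreach ($db->query($sql) as $row) {   // mysqli_result is traversable
        $ids[] = (int) $row['id'];
    }
    return $ids;
}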
Anyway, here's the slow query, as it appears in the MySQL slow query log:
SELECT DISTINCT c.id
FROM clients c LEFT JOIN communications co ON c.id = co.client_id
WHERE (co.titre LIKE 'S' OR co.contenu LIKE 'S') AND c.id NOT IN(N)
LIMIT N, N
And here's an EXPLAIN on that query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE c index PRIMARY PRIMARY 2 NULL 25250 Using where; Using index; Using temporary
1 SIMPLE co ref qui_com,id_client,titre id_client 2 klients.c.id 8 Using where; Distinct
MySQL version is 5.1.63-0ubuntu0.11.04.1-log
Maybe my approach is wrong here? How would you do it? Thanks.

A couple of remarks:
1) Why do you use a LEFT JOIN instead of an (INNER) JOIN? A LEFT JOIN means you also want the clients rows that have no matching communications row. Is that the intention? If not, then JOIN instead of LEFT JOIN is quicker.
2) Why do you need a JOIN at all when you can simply do:
SELECT DISTINCT co.client_id FROM communications co
WHERE (co.titre LIKE 'S' OR co.contenu LIKE 'S') AND co.client_id NOT IN (N) LIMIT N, N;
Also, if you do a JOIN, both joined columns must be indexed, otherwise it's slow too.
More importantly, you filter on both client_id and id of the communications table, but there is no common index covering both, which means more work to execute your query (hence the "Using temporary", which is in general not a good sign).
3) You have a complex condition on both co.titre and co.contenu; you seem to have indexes on those columns, but they are not used. That means this part is potentially quite slow.
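If those LIKE patterns start with a wildcard (the slow log masks the actual literals as 'S'), B-tree indexes on titre and contenu can't be used at all, and a FULLTEXT index is the usual alternative. A rough sketch, assuming the table is MyISAM (InnoDB only supports FULLTEXT from MySQL 5.6, and the question is on 5.1):

ALTER TABLE communications ADD FULLTEXT INDEX ft_titre_contenu (titre, contenu);

SELECT DISTINCT co.client_id
FROM communications co
WHERE MATCH (co.titre, co.contenu) AGAINST ('search terms')
  AND co.client_id NOT IN (1, 2, 3)   -- placeholder for the ids already found
LIMIT 0, 20;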

Related

optimise a mysql query to reduce rows searched

I'm currently grabbing the last 5 comments posted on my website, which I seem to be doing quite badly.
Here is the SQL query:
SELECT c.comment_id
, c.article_id
, c.time_posted
, a.title
, a.slug
, u.username
FROM articles_comments c
JOIN articles a
ON c.article_id = a.article_id
JOIN users u
ON u.user_id = c.author_id
WHERE a.active = 1
AND c.approved = 1
ORDER
BY c.comment_id DESC
LIMIT 5
My problem is that it has to search through a lot of rows, which seems quite wasteful. I'm curious if there's a better way to do it.
Here's the output of explain on it:
As you can see, the row estimate it's giving is 81,486, which seems kind of hilarious. What am I missing here?
Turns out, simply forcing articles_comments to use the PRIMARY key (comment_id) as the index fixed it.
The issue was that my sort had to consider ALL approved comments: the optimizer was using the index on the approved column, so it ended up picking data from tens of thousands of rows.
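For reference, the index hint described above looks roughly like this (a sketch using the query from the question):

SELECT c.comment_id, c.article_id, c.time_posted, a.title, a.slug, u.username
FROM articles_comments c FORCE INDEX (PRIMARY)
JOIN articles a ON c.article_id = a.article_id
JOIN users u ON u.user_id = c.author_id
WHERE a.active = 1
  AND c.approved = 1
ORDER BY c.comment_id DESC
LIMIT 5;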
c: INDEX(approved, comment_id) -- in this order
a: I assume you have PRIMARY KEY(article_id)
u: I assume you have PRIMARY KEY(user_id)
The hope is that the c index will handle some of the WHERE, plus the ORDER BY and LIMIT. The worst case is that it must scan the entire table without finding 5 rows.
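A minimal sketch of creating that index (the index name is illustrative):

ALTER TABLE articles_comments ADD INDEX idx_approved_comment (approved, comment_id);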
The 81,486 is bogus. Here's a precise way to get good info:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The 'reads' will indicate how many rows of data and index were touched; the writes indicate temp table(s) being used.

PHP, MySQL, Huge Join, processing speed

This is more of a theoretical query than anything else, but I have a complex join (resulting in upwards of 1900 records in the main table, combined with all the sub-result tables in the join -- join shown below), the resulting web page is taking 5-10 minutes on my local machine to process and complete building. I realize this could easily be many factors, but am hoping to get some hints. Basically I am loading an array of names from two tables (one is cross-references, so the array is used to sort the data on the names, with links and a field noting if it is a cross reference), then if a name is not a cross reference, I issue this join:
select
n.NameCode, n.AL_NameCode, n.Name, n.Name_HTML, n.Region, n.Local, n.Deceased,
n.ArmsLink, n.RollOfArms, n.Blazon, n.PreferredTitle, n.ShortBio,
n.HeadShotPhoto, n.HeadShotPhotographer, n.HeadShotContributor,
x.NameCode, x.NameAKA, x.AlternateName,
g.NameLink, g.`Group Name`,
p.NameLink, p.`Relationship Type`, p.`Related To Link`,
p2.Position_ID, p2.NameLink, p2.`Position Held`, p2.`Times Held`,
p2.`Date Started`, p2.`Date Ended`, p2.Hyperlink as pos_Hyperlink,
p2.`Screentip Text`,
a.`Name Link`, a.Description, a.EventDate, a.Hyperlink, a.`Screentip Text`,
a.ExternalLink
from who_names as n
left outer join who_crossref as x on n.NameCode=x.NameCode
left outer join who_groups as g on n.NameCode=g.NameLink
left outer join who_personal as p on n.NameCode=p.NameLink
left outer join who_positions as p2 on n.NameCode=p2.NameLink
left outer join who_arts as a on n.NameCode=a.`Name Link`
where n.NameCode = ?
order by n.Name desc, g.`Group Name`, p2.`Date Started`, a.EventDate;
In order to output the various parts of the data, I:
1) Start a table,
2) Output the name and some other info in the first row,
3) Then in order to process, say, the groups (sub-groups someone associates themselves with within the organization), I issue:
mysqli_data_seek( $result, 0 ); // to rewind to top of data so we're at first row
and see if there's anything to process for subgroups (not everyone has anything ...),
4) I repeat for personal relationships, and other sections, going back to the top of the data and looping back through if there's anything to process.
When done with that individual, I close off the table, and loop back in the array to the next name, and repeat ...
While this works, 5-10 minutes is way too long to load a web page.
I am pondering ideas to resolve this, but I am not sure if it is any specific aspect of my code. Is it the seeks back to the top of the returned rowset? Is it the tables in the browser? Is it a combination of both (very possibly)? The program is too big to post here in its entirety. I'm feeling rather flummoxed about how to resolve this, and I'm hoping someone has some pointers to help me speed up the processing; I hope the details I've given are enough to work with.
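To illustrate, the per-name processing loop looks roughly like this (heavily condensed; build_big_join_sql and the HTML output are placeholders for the real code):

foreach ($names as $name) {
    if ($name['is_crossref']) {
        continue;                            // cross-references only get a link
    }
    $result = mysqli_query($db, build_big_join_sql($name['NameCode']));

    echo '<table>';
    $first = mysqli_fetch_assoc($result);    // the who_names columns repeat on every row
    // ... output the first row with the person's details ...

    mysqli_data_seek($result, 0);            // rewind, then scan for group rows
    while ($row = mysqli_fetch_assoc($result)) {
        if ($row['Group Name'] !== null) { /* output a group row */ }
    }

    mysqli_data_seek($result, 0);            // rewind again for relationships, positions, arts ...
    // ... the same rewind-and-scan pattern repeats for each remaining section ...
    echo '</table>';
}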
Based on comments and feedback below, in phpMyAdmin I ran the following:
explain select n.NameCode, n.AL_NameCode, n.Name, n.Name_HTML, n.Region, n.Local, n.Deceased,
n.ArmsLink, n.RollOfArms, n.Blazon, n.PreferredTitle, n.ShortBio, n.HeadShotPhoto,
n.HeadShotPhotographer, n.HeadShotContributor,
x.NameCode, x.NameAKA, x.AlternateName,
g.NameLink, g.`Group Name`,
p.NameLink, p.`Relationship Type`, p.`Related To Link`,
p2.Position_ID, p2.NameLink, p2.`Position Held`, p2.`Times Held`, p2.`Date Started`,
p2.`Date Ended`, p2.Hyperlink as pos_Hyperlink, p2.`Screentip Text`,
a.`Name Link`, a.Description, a.EventDate, a.Hyperlink, a.`Screentip Text`,
a.ExternalLink
from who_names as n
left outer join who_crossref as x on n.NameCode=x.NameCode
left outer join who_groups as g on n.NameCode=g.NameLink
left outer join who_personal as p on n.NameCode=p.NameLink
left outer join who_positions as p2 on n.NameCode=p2.NameLink
left outer join who_arts as a on n.NameCode=a.`Name Link`
where n.NameCode=638
order by n.Name desc, g.`Group Name`, p2.`Date Started`, a.EventDate
This returned:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE n const PRIMARY,ix1_names PRIMARY 4 const 1 Using temporary; Using filesort
1 SIMPLE x ref ix2_crossref ix2_crossref 4 const 1 NULL
1 SIMPLE g ref ix3_groups ix3_groups 4 const 3 NULL
1 SIMPLE p ref ix4_personal ix4_personal 4 const 1 NULL
1 SIMPLE p2 ref ix5_positions ix5_positions 4 const 13 NULL
1 SIMPLE a ref ix6_arts ix6_arts 4 const 28 NULL
Which appears to just be a list of the indexes, so it doesn't seem to be helping me.
Since you are using a SINGLE main table and the rest of the joins are all OUTER JOINs, there's a single most important index that can make your query faster:
create index ix1_names on who_names (NameCode, Name);
Also, the Nested Loop Joins (NLJ) against the related tables will benefit from the following indexes. You may already have several of these, so check whether you have them first. If you don't, then create them:
create index ix2_crossref on who_crossref (NameCode);
create index ix3_groups on who_groups (NameLink);
create index ix4_personal on who_personal (NameLink);
create index ix5_positions on who_positions (NameLink);
create index ix6_arts on who_arts (`Name Link`);
But again, the first one is the one I consider the most important.
You'll need to test for real to see if the performance improves with it/them.
If the query is still slow, please retrieve the execution plan, as #memo suggested, by using:
explain select ...
First, try removing the "order by" clause and see if that improves anything. Sometimes it can happen that the query itself is fast, but the re-ordering is slow, requiring temporary files.
Second, feed the query to an EXPLAIN statement (e.g. EXPLAIN SELECT whathaveyou FROM table...). Check out the output for bottlenecks, missing indexes etc. (https://dev.mysql.com/doc/refman/8.0/en/using-explain.html)
After a lot of work, I found a few issues that I was able to resolve: I was opening some tables (thinking it made sense at the time) just to get row counts when it wasn't necessary; I dropped the big join and just opened the sub-tables as needed; I cleaned up a few other places in the code; and I added a few more indexes on another set of tables that weren't in the original join. I was able to reduce the time from 4 minutes to 45 seconds. While 45 seconds is still a long time to load a page, I figure that since this page handles up to 1500 (sometimes more) primary records, pulling data from up to 10 different tables and formatting it (tables inside tables, etc.), 45 seconds is probably doable, with a note at the top of the page and a progress bar that displays while the page loads. Thanks, all. The indexes did help, and the other explanations also helped a lot.

How to improve the performance of MYSQL query with large data?

I am using MySQL tables that have the following data:
users(ID, name, email, create_added) (about 10000 rows)
points(user_id, point) (about 15000 rows)
And my query:
SELECT u.*, SUM(p.point) point
FROM users u
LEFT JOIN points p ON p.user_id = u.ID
WHERE u.id > 0
GROUP BY u.id
ORDER BY point DESC
LIMIT 0, 10
I only want the top 10 users with the best points, but the query dies. How can I improve its performance?
Like #Grim said, you can use an INNER JOIN instead of a LEFT JOIN. However, if you truly want to optimize this, I would suggest adding an extra field to the users table with a precalculated point total. That would beat any query optimization with your current database design.
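A sketch of that approach, assuming a new column called total_point that the application (or a trigger) keeps up to date whenever points change; the column and index names are illustrative:

ALTER TABLE users ADD COLUMN total_point INT NOT NULL DEFAULT 0;
ALTER TABLE users ADD INDEX idx_total_point (total_point);

-- one-off backfill from the existing points table
UPDATE users u
JOIN (SELECT user_id, SUM(point) AS total FROM points GROUP BY user_id) p
  ON p.user_id = u.ID
SET u.total_point = p.total;

-- the top-10 query then becomes a simple index-backed scan
SELECT * FROM users ORDER BY total_point DESC LIMIT 10;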
Swapping the LEFT JOIN for an INNER JOIN would help a lot. Make sure points.point and points.user_id are indexed. I assume you can get rid of the WHERE clause, as u.id will always be more than 0 (although MySQL probably does this for you at the query optimisation stage).
It doesn't really matter that you are getting only 10 rows. MySQL has to sum up the points for every user before it can sort them (the "Using filesort" operation). The LIMIT is applied last.
A covering index ON points(user_id,point) is going to be the best bet for optimum performance. (I'm really just guessing, without any EXPLAIN output or table definitions.)
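That covering index could be created like this (the index name is illustrative):

CREATE INDEX ix_points_user_point ON points (user_id, point);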
The id column in users is likely the primary key, or at least a unique index, so you likely already have an index with id as the leading column (or the primary-key clustered index if it's InnoDB).
I'd be tempted to test a query like this:
SELECT u.*
, s.total_points
FROM ( SELECT p.user_id
, SUM(p.point) AS total_points
FROM points p
WHERE p.user_id > 0
GROUP BY p.user_id
ORDER BY total_points DESC
LIMIT 10
) s
JOIN users u
ON u.id = s.user_id
ORDER BY s.total_points DESC
That does have the overhead of creating a derived table, but with a suitable index on points (leading column user_id and including the point column), it's likely that MySQL can optimize the GROUP BY using the index, avoiding one "Using filesort" operation (for the GROUP BY).
There will likely be a "Using filesort" operation on that resultset, to get the rows ordered by total_points. Then get the first 10 rows from that.
With those 10 rows, we can join to the user table to get the corresponding rows.
BUT... there is one slight difference in the result: if any user_id values in the top 10 aren't in the users table, this query will return fewer than 10 rows. (I'd expect there to be a foreign key defined, so that wouldn't happen, but I'm really just guessing without table definitions.)
An EXPLAIN would show the access plan being used by MySQL.
Ever thought about partitioning?
I'm currently working with a large database and have successfully improved query performance with it.
For example,
PARTITION BY RANGE (`ID`) (
    PARTITION p1 VALUES LESS THAN (100) ENGINE = InnoDB,
    PARTITION p2 VALUES LESS THAN (200) ENGINE = InnoDB,
    PARTITION p3 VALUES LESS THAN (300) ENGINE = InnoDB
    -- ... and so on
)
It allows us to get better speed when scanning the table. With the example above, MySQL will scan only partition p1, which contains userid 1 to 99, even if there are millions of rows in the table.
Check out this http://dev.mysql.com/doc/refman/5.5/en/partitioning.html

Problems with MYSQL query (efficiency)

I'm having real problems with my Mysql statement, I need to join a few tables together, query them and order by the average of values from another table. This is what I have...
SELECT
ROUND(avg(re.rating), 1)AS avg_rating,
s.staff_id, s.full_name, s.mobile, s.telephone, s.email, s.drive
FROM staff s
INNER JOIN staff_homes sh
ON s.staff_id=sh.staff_id
INNER JOIN staff_positions sp
ON s.staff_id=sp.staff_id
INNER JOIN reliability re
ON s.staff_id=re.staff_id
INNER JOIN availability ua
ON s.staff_id=ua.staff_id
GROUP BY staff_id
ORDER BY avg_rating DESC
Now I believe this to work although I am getting this error "The SELECT would examine more than MAX_JOIN_SIZE rows; check your WHERE and use SET SQL_BIG_SELECTS=1 or SET SQL_MAX_JOIN_SIZE=# if the SELECT is okay".
I think this means that I have too many joins, and because it is shared hosting it won't allow large queries to run, but I don't know.
What I would like to know is exactly what the error means (I have googled it but I don't understand the answers) and how I can work around it, maybe by making my query more efficient?
Any help would be appreciated. Thanks
EDIT:
The reason I need the joins is so I can query the tables based on a search function like so...
SELECT
ROUND(avg(re.rating), 1)AS avg_rating
, s.staff_id, s.full_name, s.mobile, s.telephone, s.email, s.drive
FROM staff s
INNER JOIN staff_homes sh
ON s.staff_id=sh.staff_id
INNER JOIN staff_positions sp
ON s.staff_id=sp.staff_id
INNER JOIN reliability re
ON s.staff_id=re.staff_id
INNER JOIN availability ua
ON s.staff_id=ua.staff_id
WHERE s.full_name LIKE '%'
AND s.drive = '1'
AND sh.home_id = '3'
AND sh.can_work = '1'
AND sp.position_id = '3'
AND sp.can_work = '1'
GROUP BY staff_id
ORDER BY avg_rating DESC
EDIT 2
This was the result of my EXPLAIN. Also, I'm not great with MySQL; how would I set up foreign keys?
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ua ALL NULL NULL NULL NULL 14 Using temporary; Using filesort
1 SIMPLE re ALL NULL NULL NULL NULL 50 Using where; Using join buffer
1 SIMPLE sp ALL NULL NULL NULL NULL 84 Using where; Using join buffer
1 SIMPLE sh ALL NULL NULL NULL NULL 126 Using where; Using join buffer
1 SIMPLE s eq_ref PRIMARY PRIMARY 4 web106-prestwick.ua.staff_id 1
EDIT 3: Thanks lc, it was my foreign keys; they were not set up correctly. Problem sorted.
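For anyone hitting the same issue: "setting up the foreign keys" here essentially means indexing staff_id in each joined table (and optionally adding real constraints, which requires InnoDB). A sketch with illustrative index/constraint names:

ALTER TABLE staff_homes     ADD INDEX idx_staff_id (staff_id);
ALTER TABLE staff_positions ADD INDEX idx_staff_id (staff_id);
ALTER TABLE reliability     ADD INDEX idx_staff_id (staff_id);
ALTER TABLE availability    ADD INDEX idx_staff_id (staff_id);

-- optionally, as a real foreign key (InnoDB only); repeat for the other tables
ALTER TABLE staff_homes
  ADD CONSTRAINT fk_staff_homes_staff
  FOREIGN KEY (staff_id) REFERENCES staff (staff_id);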
Maybe you should use more and/or better indexes on the tables.
Depending on the DB you're using, subqueries may or may not make the optimization faster.
There may be two bottlenecks in your query:
Try removing the average function from your query. If the query speeds up, try replacing it with a subquery and see what happens.
Multiple joins often reduce performance, and there's not much you can do about that except modify your DB schema. The simplest solution would be a table that precomputes the data and reduces the work for your DB engine. That table can be populated by stored procedures triggered when the underlying tables are modified, or you can update it from your PHP application.

MySQL ORDER BY Optimization on Multiple Joins

I need some help optimizing some queries for my database. I understand the use of indexes to help with joins and ORDER BY statements, but I was wondering if there are techniques to avoid "Using filesort" and "Using temporary" showing up when I run EXPLAIN. Here's an example of what I am using.
SELECT a.id, DATE_FORMAT(a.submitted_at, '%d-%b-%Y') as submitted_at, a.user_id,
data1.*,
data2.name, data2.type,
u.first_name, u.last_name
FROM applications AS a
LEFT JOIN users AS u ON u.id = a.user_id
LEFT JOIN score_table AS data1 ON data1.applications_id = a.id
LEFT JOIN sections AS data2 ON data2.id = data1.section_id
WHERE category_id = [value] && submitted_at IS NOT NULL
ORDER BY data2.type
Again, indexes are being used properly in queries like the one above. If I take out the ORDER BY clause, the query executes quickly using the proper indexes. I understand that the order of the joins can affect the performance of the query. When I test the ORDER BY on the users table, since it is the next table after the "const", EXPLAIN only shows "Using where; Using filesort". If I drop down to any of the other tables, we get into the "Using temporary" issue.
My question is: what would be the best way to optimize queries like this so they run faster and, in the best case, avoid filesort/temporary in EXPLAIN? I'm open to any possibilities :) I'm interested more in the theory of how to make queries like this perform better than in this exact query, as I'm having to write more and more of these deep ORDER BY queries in the database I'm working on.
--EDIT--
Here's the explain of the query above.....
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE a ref category_id,submitted_at category_id 4 const 49 Using where; Using temporary; Using filesort
1 SIMPLE u eq_ref PRIMARY PRIMARY 4 a.user_id 1
1 SIMPLE data1 ref app id app id 4 a.id 7
1 SIMPLE data2 eq_ref PRIMARY PRIMARY 4 data1.section_id 1
Couple of things.
Are you sure you need to use 'LEFT JOIN'? Looking at the query it looks like you can get away with 'INNER JOIN' which will reduce the number of potential rows.
You didn't post your schema, but I assume that users.id, applications.user_id, score_table.applications_id, applications.id, sections.id and score_table.section_id are all ints? If they are non-ints I would strongly urge you to convert them. And if not primary keys, be sure they are indexed.
I wouldn't do any MySQL-level data formatting (e.g. DATE_FORMAT), as it creates some overhead during the query; rather, I would format data like this at the app layer.
The ORDER BY forces MySQL to create a temp table in order to sort correctly, so be sure you absolutely need this functionality. If so, be sure that sections.type is indexed.
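A minimal sketch of that index, using the table name from the query (the index name is illustrative):

CREATE INDEX idx_sections_type ON sections (type);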
I would consider using a different alias naming convention. data1 and data2 are so abstract it's difficult to discern what they are actually referring to. I would suggest you use an abbreviated construct of the table you are aliasing, for example; applications becomes app (instead of a), score_table becomes score (instead of data1), etc.
