I programmed a filter that generates a query to show specific employees.
I have an Employees table and many 1:1, 1:n and n:m relationships, e.g. for the employees' skills and languages, like this:
Employees
id  name
1   John
2   Mike

Skills
id  skill  experience
1   PHP    3
2   SQL    1

Employee_Skills
eid  sid
1    1
1    2
Now I want to filter for employees who have at least 2 years of experience with PHP and 1 year with SQL.
My filter always generates a correct, working query for every table, relationship and field.
My problem is that filtering the same field in a related table multiple times with AND does not work.
e.g.
John PHP 3
John SQL 1
PHP and SQL are in different rows, so AND cannot work.
I tried using GROUP_CONCAT and FIND_IN_SET, but FIND_IN_SET cannot filter on experience over 2 years, and it does not know that PHP is 3 and SQL is 1.
I also tried
WHERE emp.id IN (SELECT eid FROM Employee_Skills WHERE sid IN (SELECT id FROM Skills WHERE skill = 'PHP' AND experience > 1)) AND emp.id IN (SELECT eid FROM Employee_Skills WHERE sid IN (SELECT id FROM Skills WHERE skill = 'SQL' AND experience > 0))
which works for this example, but it only works for n:m relationships, and it is too complex for the filter to know the relationship type.
I have the final Query with
ski.skill = 'PHP' AND ski.experience > 1 AND ski.skill = 'SQL' AND ski.experience > 0
and I would like to manipulate the query to make it work.
What does a query have to look like to deal with relational division?
You can try the following approach:
select * from Employees
where id in (
select eid
from Employee_Skills as a
inner join
Skills as ski
on (a.sid = ski.id)
where
(ski.skill = 'PHP' AND ski.experience > 1) OR
(ski.skill = 'SQL' AND ski.experience > 0)
group by eid
having count(*) = 2
)
So, for every filter you add another OR condition. The HAVING clause keeps only the employees that matched all of the filters; just pass the number of filters to it. This assumes each skill is linked to an employee at most once.
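Here is a minimal sketch of how a filter generator could build that query, not a definitive implementation. It uses Python with an in-memory SQLite database standing in for MySQL; the `filters` list, the `>=` comparison (matching "at least" in the question) and Mike's SQL row are my own assumptions for illustration.

```python
import sqlite3

# SQLite in memory stands in for MySQL; Mike's SQL row is made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Skills (id INTEGER PRIMARY KEY, skill TEXT, experience INTEGER);
    CREATE TABLE Employee_Skills (eid INTEGER, sid INTEGER, PRIMARY KEY (eid, sid));
    INSERT INTO Employees VALUES (1, 'John'), (2, 'Mike');
    INSERT INTO Skills VALUES (1, 'PHP', 3), (2, 'SQL', 1);
    INSERT INTO Employee_Skills VALUES (1, 1), (1, 2), (2, 2);
""")

filters = [("PHP", 2), ("SQL", 1)]  # (skill, minimum years of experience)

# One OR-ed condition per filter; HAVING demands that every filter matched.
conditions = " OR ".join("(s.skill = ? AND s.experience >= ?)" for _ in filters)
params = [value for pair in filters for value in pair] + [len(filters)]
sql = f"""
    SELECT name FROM Employees WHERE id IN (
        SELECT es.eid
        FROM Employee_Skills AS es
        JOIN Skills AS s ON es.sid = s.id
        WHERE {conditions}
        GROUP BY es.eid
        HAVING COUNT(*) = ?
    )
"""
matches = [row[0] for row in conn.execute(sql, params)]
print(matches)
```

Mike only satisfies the SQL condition, so his group has a count of 1 and is dropped by the HAVING clause.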
You could make a kind of pivot query, where you put the experience for each of the known skills in its own column. This could be a long query, but you could build it dynamically in PHP, so that it adds all skills as columns to the final query, which would look like this:
SELECT e.*, php_exp, sql_exp
FROM Employees e
INNER JOIN (
SELECT es.eid,
SUM(CASE s.skill WHEN 'PHP' THEN s.experience END) php_exp,
SUM(CASE s.skill WHEN 'SQL' THEN s.experience END) sql_exp,
SUM(CASE s.skill WHEN 'JS' THEN s.experience END) js_exp
-- do the same for other skills here --
FROM Employee_Skills es
INNER JOIN Skills s ON es.sid = s.id
GROUP BY es.eid
) pivot ON pivot.eid = e.id
WHERE php_exp > 2 AND sql_exp > 0;
The WHERE clause is then very concise and intuitive: you use the logical operators like in other circumstances.
If the set of skills is rather static, you could even create a view for the sub-query. Then the final SQL is quite concise.
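As a rough sketch of building the pivot columns dynamically (in Python here rather than PHP, with in-memory SQLite standing in for MySQL): the `exp_0`, `exp_1` aliases, the `skills` list and Mike's row are assumptions of mine, not part of the original schema.

```python
import sqlite3

# SQLite in memory stands in for MySQL; Mike's SQL row is made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Skills (id INTEGER PRIMARY KEY, skill TEXT, experience INTEGER);
    CREATE TABLE Employee_Skills (eid INTEGER, sid INTEGER);
    INSERT INTO Employees VALUES (1, 'John'), (2, 'Mike');
    INSERT INTO Skills VALUES (1, 'PHP', 3), (2, 'SQL', 1);
    INSERT INTO Employee_Skills VALUES (1, 1), (1, 2), (2, 2);
""")

skills = ["PHP", "SQL"]  # every skill that should become a column

# One SUM(CASE ...) column per skill, aliased exp_0, exp_1, ...
columns = ", ".join(
    f"SUM(CASE s.skill WHEN ? THEN s.experience END) AS exp_{i}"
    for i in range(len(skills))
)
pivot = f"""
    SELECT es.eid, {columns}
    FROM Employee_Skills es
    JOIN Skills s ON es.sid = s.id
    GROUP BY es.eid
"""

# The pivot on its own: one row per employee, one column per skill.
all_rows = conn.execute(
    f"SELECT e.name, p.exp_0, p.exp_1 FROM Employees e "
    f"JOIN ({pivot}) p ON p.eid = e.id ORDER BY e.id",
    skills,
).fetchall()

# The filter is then a plain WHERE with ordinary AND logic.
filtered = conn.execute(
    f"SELECT e.name FROM Employees e JOIN ({pivot}) p ON p.eid = e.id "
    f"WHERE p.exp_0 >= 2 AND p.exp_1 >= 1",
    skills,
).fetchall()
print(all_rows, filtered)
```

A skill the employee does not have comes out as NULL, and a comparison against NULL is never true, so such employees drop out of the filtered result.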
Alternative
Using the same principle, but putting the SUM in the HAVING clause, you can avoid gathering the experience for every skill:
SELECT e.*
FROM Employees e
INNER JOIN (
SELECT es.eid
FROM Employee_Skills es
INNER JOIN Skills s ON es.sid = s.id
GROUP BY es.eid
HAVING SUM(CASE s.skill WHEN 'PHP' THEN s.experience END) > 2
AND SUM(CASE s.skill WHEN 'SQL' THEN s.experience END) > 0
) pivot ON pivot.eid = e.id;
You can also replace the CASE construct by the IF function, like this:
HAVING SUM(IF(s.skill='PHP', s.experience, 0)) > 2
... etc.
But it comes down to the same.
The straightforward way would be to repeatedly JOIN the skills:
SELECT e.*
FROM Employees AS e
JOIN Employee_Skills AS j1 ON (e.id = j1.eid)
JOIN Skills AS s1 ON (j1.sid = s1.id AND s1.skill = 'PHP' AND s1.experience > 1)
JOIN Employee_Skills AS j2 ON (e.id = j2.eid)
JOIN Skills AS s2 ON (j2.sid = s2.id AND s2.skill = 'SQL' AND s2.experience > 0)
...
Since all the clauses are required, this translates to a straight (inner) JOIN.
You will need to add two JOINs for each clause, but they're quite fast joins.
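A sketch of how the pair of JOINs per clause could be generated, assuming Python and an in-memory SQLite database in place of MySQL. The `j0`/`s0` alias scheme, the `filters` list and Mike's row are made up for the example.

```python
import sqlite3

# SQLite in memory stands in for MySQL; Mike's SQL row is made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Skills (id INTEGER PRIMARY KEY, skill TEXT, experience INTEGER);
    CREATE TABLE Employee_Skills (eid INTEGER, sid INTEGER);
    INSERT INTO Employees VALUES (1, 'John'), (2, 'Mike');
    INSERT INTO Skills VALUES (1, 'PHP', 3), (2, 'SQL', 1);
    INSERT INTO Employee_Skills VALUES (1, 1), (1, 2), (2, 2);
""")

filters = [("PHP", 2), ("SQL", 1)]  # (skill, minimum years of experience)

joins, params = [], []
for i, (skill, min_years) in enumerate(filters):
    # Two JOINs per clause, each with its own pair of aliases.
    joins.append(
        f"JOIN Employee_Skills AS j{i} ON j{i}.eid = e.id "
        f"JOIN Skills AS s{i} ON s{i}.id = j{i}.sid "
        f"AND s{i}.skill = ? AND s{i}.experience >= ?"
    )
    params += [skill, min_years]

sql = "SELECT e.name FROM Employees AS e " + " ".join(joins)
names = [row[0] for row in conn.execute(sql, params)]
print(names)
```

Because every JOIN is an inner join, an employee who fails any single clause disappears from the result.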
A more hackish way would be to compress the skills into a code in a 1:1 relation with the employees. If experience is always below, say, 30, then you can multiply one condition's experience by 1, the next by 30, the third by 30*30, the fourth by 30*30*30... and never have one value spill over into the next.
SELECT eid, SUM(CASE skill
WHEN 'PHP' THEN 30*experience
WHEN 'SQL' THEN 1*experience
END) AS code
FROM Employee_Skills JOIN Skills ON (Skills.id = Employee_Skills.sid)
GROUP BY eid HAVING code > 0;
Actually, since you want 3 years PHP, you can use HAVING code > 91. If you had three conditions with experiences 2, 3 and 5, you would request more than x = 2*30*30 + 3*30 + 5. This only serves to whittle down the results, since 3*30*30 + 2*30 + 4 still passes the filter but is of no use to you. But since you want a restriction on code anyway, and "> x" costs the same as "> 0" and gives better results... (if you needed more complex filtering than a series of ANDs, "> 0" is safer, though).
You then join the table above with Employees, and on the result you perform the true filtering, requiring
((code DIV (30*30)) % 30) > 7 -- for instance :-)
AND
((code DIV 30) % 30) > 3 -- for PHP
AND
((code DIV 1) % 30) > 1 -- for SQL
(the *1 and DIV 1 are superfluous, and only inserted to clarify)
This solution requires a full table scan on Skills, with no real possibility of automatic optimizations. So it is slower than the other solution. On the other hand, its cost grows much more slowly, so if you have complex queries, or need OR operators or conditional expressions instead of ANDs, it may be more convenient to implement the "hackish" solution.
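To make the arithmetic of the "hackish" encoding concrete, here is a small Python sketch. The helper names `pack`, `unpack` and `passes` are hypothetical, and it assumes every experience value lies between 0 and 29 so that each one fits in a single base-30 digit.

```python
BASE = 30  # assumption: every experience value is in the range 0..29


def pack(values):
    """Pack experiences into one integer; the last value gets weight 1."""
    code = 0
    for value in values:
        if not 0 <= value < BASE:
            raise ValueError("experience must be between 0 and 29")
        code = code * BASE + value
    return code


def unpack(code, count):
    """Recover the individual experiences with integer division and modulo."""
    digits = []
    for _ in range(count):
        code, digit = divmod(code, BASE)
        digits.append(digit)
    return digits[::-1]


def passes(code, thresholds):
    """The 'true' filter: every packed value must exceed its own threshold."""
    return all(v > t for v, t in zip(unpack(code, len(thresholds)), thresholds))


thresholds = [2, 3, 5]
coarse_limit = pack(thresholds)   # 2*30*30 + 3*30 + 5
candidate = pack([3, 2, 4])       # clears the coarse limit, fails the true filter
print(coarse_limit, candidate, passes(candidate, thresholds))
```

The candidate shows why the single comparison on the code is only a pre-filter: the comparison is dominated by the most significant value, so the per-value check is still needed afterwards.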
I have 2 tables which share a 1 to many relationship. Assume the following structure:
users users_metadata
------------- -------------
id | email id | user_id | type | score
A user can have many metadata. The users table has 100k rows, the users_metadata table has 300k rows. It'll likely grow 10x so whatever I write needs to be optimal for large amounts of data.
I need to write a sql statement that returns only user emails that pass a couple of different score conditions found in the metadata table.
// if type = 1 and if score > 75 then <1 point> else <0 points>
// if type = 2 and if score > 100 then <1 point> else <0 points>
// if type = 3 and if score > 0 then [-10 points] else <0 points>
// there are other types that we want to ignore in the score calculations
If the user passes a threshold (e.g. >= 1 point) then I want that user to be in the resultset, otherwise I want the user to be ignored.
I have tried using a stored function/cursor that takes a user_id and loops over the metadata to figure out the points, but the resulting execution was very slow (although it did work).
As it stands I have this, and it takes about 1 to 3 seconds to execute.
SELECT u.id, u.email,
(
SELECT
SUM(
IF(k.type = 1, IF(k.score > 75, 1, 0), 0) +
IF(k.type = 2, IF(k.score > 100, 1, 0), 0) +
IF(k.type = 3, IF(k.score > 0, 1, -10), 0)
)
FROM user_metadata k WHERE k.user_id = u.id
) AS total
FROM users u GROUP BY u.id HAVING total IS NOT NULL;
I feel like at 10x this is going to be even slower. A 1 to 3 second query execution time is already too slow for what I need.
What would a more optimal approach be?
If I use a language like PHP for this too, would running 2 queries, one to fetch user_ids from user_metadata of only passing users, and then a second to SELECT WHERE IN on that list of ids be better?
Try using a JOIN instead of a correlated subquery.
SELECT u.id, u.email, t.total
FROM users AS u
JOIN (
SELECT user_id, SUM(CASE type
WHEN 1 THEN score > 75
WHEN 2 THEN score > 100
WHEN 3 THEN IF(score > 0, 1, -10)
END) AS total
FROM user_metadata
GROUP BY user_id
HAVING total >= 1
) AS t ON u.id = t.user_id
Doing the grouping and filtering in the subquery makes the join smaller, which can be a significant performance boost.
There's also no need for you to use GROUP BY u.id in your query, since that's the primary key of the table you're querying; hopefully MySQL will optimize that out.
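A quick way to try the idea is the sketch below, which runs the same shape of query in Python against in-memory SQLite instead of MySQL. SQLite has no IF(), so the type 3 branch is written as a nested CASE; the sample rows, email addresses and index name are invented for the example, and the scoring expression mirrors the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE user_metadata (
        id INTEGER PRIMARY KEY, user_id INTEGER, type INTEGER, score INTEGER);
    CREATE INDEX idx_meta_user ON user_metadata (user_id);
    INSERT INTO users VALUES
        (1, 'a@example.com'), (2, 'b@example.com'), (3, 'c@example.com');
    INSERT INTO user_metadata (user_id, type, score) VALUES
        (1, 1, 80), (1, 2, 150),   -- user 1: 1 + 1 points
        (2, 1, 50), (2, 3, 0),     -- user 2: 0 - 10 points
        (3, 4, 999);               -- user 3: only an ignored type
""")

sql = """
    SELECT u.id, u.email, t.total
    FROM users AS u
    JOIN (
        -- Group and filter first, so the join only sees passing users.
        SELECT user_id,
               SUM(CASE type
                       WHEN 1 THEN score > 75
                       WHEN 2 THEN score > 100
                       WHEN 3 THEN CASE WHEN score > 0 THEN 1 ELSE -10 END
                   END) AS total
        FROM user_metadata
        GROUP BY user_id
        HAVING total >= 1
    ) AS t ON u.id = t.user_id
    ORDER BY u.id
"""
passing = conn.execute(sql).fetchall()
print(passing)
```

Types without a WHEN branch yield NULL, which SUM skips, so a user with only ignored types gets a NULL total and fails the HAVING condition.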
There are two types of questions: 1. passage questions and 2. normal questions.
In a test I want to pick questions at random. Normal questions have type_id=0; when a type_id=1 (passage) question comes up, the questions that follow should belong to the same passage (comprehension questions should appear in sequence). With the query below I am able to get the questions:
SELECT *
FROM tbl_testquestion
ORDER BY
CASE
WHEN type_id=0 THEN RAND()
WHEN type_id=1 THEN qu_id
END ASC
but all the passage questions come last.
I have a limit of 40 questions per test, and the table holds 50 passage questions and 70 normal questions.
How can I write a query that places the passage questions in between the normal questions?
EXAMPLE
1. Who is the president of America? (type_id=0)
2. A, B, C are 3 students. A's name is "Arun", B's name is "Mike", C's name is "Jhon". (type_id=1)
Who is C in the above passage?
3. A, B, C are 3 students. A's name is "Arun", B's name is "Mike", C's name is "Jhon". (type_id=1)
Who is A in the above passage?
4. Who is the CEO of Facebook? (type_id=0)
From the above 4 questions we pick at random. If question 1 comes up from RAND(), there is no problem. When question 2 comes up from RAND(), the next question should be sequential, which means the next question should be 3. Once the passage questions are finished, it should switch back to the RAND() behaviour.
I think that the design of your database should be improved, but I’m going to answer your question as it stands.
I think I have a rather simple solution, which I can express in portable SQL without CTEs.
It works this way: let’s assign two numbers to each row, call them major (an integer, just to be safe let’s make it a multiple of ten) and minor (a float between 0 and 1). For type 0 questions, minor is always 0. Each type 1 question relating to the same passage gets the same major (we do this with a join with a grouped subselect). We then order the table by the sum of the two values.
It will be slow, because it joins using a text field. It would be better if each distinct passage_description had an integer id to be used for the join.
I assume that all type 0 questions have empty or null passage_description, while type 1 questions have them non-empty (it would make no sense otherwise.)
I assume you have a RAND() function which yields floating values between 0 and 1.
Here we go:
SELECT u.qu_id, u.type_id,
u.passage_description, u.passage_image,
u.cat_id, u.subcat_id,
u.question, u.q_instruction, u.qu_status
FROM (
SELECT grouped.major, RAND()+0.001 AS minor, t1.*
FROM tbl_testquestion t1
JOIN (SELECT 10*FLOOR(1000*RAND()) major, passage_description
FROM tbl_testquestion WHERE type_id = 1
GROUP BY passage_description) grouped
USING (passage_description)
-- LIMIT 39
UNION
SELECT 10*FLOOR(1000*RAND()) major, 0 minor, t0.*
FROM tbl_testquestion t0 WHERE type_id = 0
) u ORDER BY u.major+u.minor ASC LIMIT 40;
With the above query without modifications, there is still a small probability that you get questions of only one type. If you want to be sure that you have at least one type 0 question, you can uncomment the LIMIT 39 on the first part of the UNION. If you want at least two, then say LIMIT 38, and so on. All type 1 questions related to the same passage will be grouped together in one test; it is not guaranteed that all questions in the database related to that passage will be in the test, but in a comment above you mention that this can be “broke”.
Edited:
I added a small amount to minor, just to bypass the rare but possible case in which RAND() returns exactly zero. Since major goes by tens, the fact that minor might now be greater than one is immaterial.
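The major/minor idea can also be sketched outside SQL. The Python below is only an illustration under my own assumptions: the `order_questions` helper, the `passage` dictionary key and the sample rows are made up, and `rng` is any object offering `randrange()` and `random()`.

```python
import random


def order_questions(questions, limit=40, rng=random):
    """Order questions by major + minor, as in the query above.

    questions: dicts with 'qu_id', 'type_id' and 'passage' (None for type 0).
    """
    passage_major = {}  # one shared major per distinct passage
    keyed = []
    for q in questions:
        if q["type_id"] == 1:
            if q["passage"] not in passage_major:
                passage_major[q["passage"]] = 10 * rng.randrange(1000)
            major = passage_major[q["passage"]]
            minor = rng.random() + 0.001  # never exactly zero
        else:
            major = 10 * rng.randrange(1000)
            minor = 0.0
        keyed.append((major + minor, q))
    keyed.sort(key=lambda pair: pair[0])
    return [q for _, q in keyed[:limit]]


sample = [
    {"qu_id": 1, "type_id": 0, "passage": None},
    {"qu_id": 2, "type_id": 1, "passage": "A, B, C are 3 students"},
    {"qu_id": 3, "type_id": 1, "passage": "A, B, C are 3 students"},
    {"qu_id": 4, "type_id": 0, "passage": None},
]
print([q["qu_id"] for q in order_questions(sample, rng=random.Random(42))])
```

Questions 2 and 3 share a major, so their keys differ by about one at most, while any other major is at least ten away. The exception is two passages that happen to draw the same major, in which case they interleave; the same holds for the SQL version.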
Use the following. I haven't tested it, so if there are any errors please report back and I will correct them. $r is a random value produced by PHP for this query; you could do $r = rand(); before running the query.
SELECT * FROM (
(
SELECT *, RAND()*(SELECT COUNT(*) FROM tbl_testquestion) as orderid
FROM tbl_testquestion
WHERE type_id=0
ORDER BY orderid
LIMIT 20
) UNION (
SELECT *, MD5(CONCAT('$r', passage_description)) as orderid
FROM tbl_testquestion
WHERE type_id=1
ORDER BY orderid
LIMIT 20
)
) AS t1
ORDER BY orderid
Explanation: orderid will keep type_id=1 entries together as it would produce the same random sequence for the same passage questions.
Warning: Unless you add a passage_id to the table, this query will run quite slowly.
Edit: Fixed the ordering (I hope), forgot that MYSQL generates random numbers between 0 and 1.
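To see why the MD5 key keeps a passage together, here is a tiny Python analogue of MD5(CONCAT('$r', passage_description)). The function name `passage_order_key`, the seed value and the sample rows are hypothetical.

```python
import hashlib


def passage_order_key(seed, passage_description):
    """Python analogue of MD5(CONCAT('$r', passage_description))."""
    text = f"{seed}{passage_description}"
    return hashlib.md5(text.encode("utf-8")).hexdigest()


seed = 12345  # hypothetical value of $r, chosen once per query
rows = [
    (2, "A, B, C are 3 students"),
    (3, "A, B, C are 3 students"),
    (9, "Another passage"),
]
keys = {qu_id: passage_order_key(seed, passage) for qu_id, passage in rows}
print(keys)
```

Rows with the same passage text always get the same key for a given seed, and a new seed reshuffles the passages relative to each other.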
This is the solution for MySQL.
Sorry that it is not very readable; MySQL (before 8.0) does not support CTEs like SQL Server does.
You can compare it with the SQL Server CTE syntax at the bottom to better understand how it works.
select
d.*
, o.q_ix, rnd_ord -- this is only for your reference
from (
select *, floor(rand()*1000) as rnd_ord -- this is main order for questions and groups
from (
select * from (
select
(@r1 := @r1 - 1) as q_ix, -- this is row_number() (negative so we can keep group separated)
passage_description, 0 qu_id, type_id
from (
select distinct passage_description, type_id
from tbl_testquestion,
(SELECT @r1 := 0) v, -- this is the trick for row_number()
(SELECT @rnd_limit := -floor(rand()*3)) r -- this is the trick for dynamic random limit
where type_id=1
) p
order by passage_description -- order by for row_number()
) op
where q_ix < @rnd_limit
union all
select * from (
select
(@r2 := @r2 + 1) as q_ix, -- again row_number()
'' as passage_description, qu_id, type_id
from tbl_testquestion,
(SELECT @r2 := 0) v -- var for row_number
where type_id=0
order by qu_id -- order by for row_number()
) oq
) q
) o
-- look at double join for questions and groups
join tbl_testquestion d on
((d.passage_description = o.passage_description) and (d.type_id=1))
or
((d.qu_id=o.qu_id) and (d.type_id=0))
order by rnd_ord
limit 40
and this is the more readable sql-server syntax:
;with
p as (
-- select a random number of groups (0-2) and label groups (-1,-2)
select top (abs(checksum(NEWID())) % 3) -ROW_NUMBER() over (order by passage_description) p_id, passage_description
from (
select distinct passage_description
from d
where type_id=1
) x
),
q as (
-- label questions (1..n)
select ROW_NUMBER() over (order by qu_id) q_ix, qu_id
from d
where type_id=0
),
o as (
-- calculate final order
select *, ROW_NUMBER() over (order by newid()) rnd_ord
from (
select p.p_id as q_ix, passage_description, 0 qu_id from p
union all
select q.q_ix, '', qu_id from q
) x
)
select top 40
d.*
, o.rnd_ord, o.q_ix
from o
join d on
((d.passage_description = o.passage_description) and (d.type_id=1))
or
((d.qu_id = o.qu_id) and (d.type_id=0))
order by
rnd_ord
that's all
I am running this query on my website in order to find a ToDo list based on specific criteria. But it runs too slow and it is probably possible to write it in another way.
SELECT * FROM lesson WHERE
id IN
(SELECT `lesson_id` FROM `localization_logging`
WHERE `language_id` = 2 AND `action_id` = 1)
AND `id` NOT IN
(SELECT `lesson_id` FROM `localization_logging`
WHERE `language_id` = 2 AND `part_id` = 1 AND `action_id` = 6)
What the query does is look in the lesson table to find all lessons for a list and check whether a specific task is done. If the task is done in one to-do list, then the lesson is shown in the next one. In this case, action 1 is done but action 6 is not.
I hope I'm explaining this well enough. On my local machine the query takes 1.8 seconds, and sometimes I have to print multiple lists next to each other, so it takes 1.8 seconds times the number of lists, which makes the page load super slow.
Something like this marks each lesson as completed or not:
SELECT l.*, SUM(ll.action_id=6) completed FROM lesson l
INNER JOIN localization_logging ll ON ll.lesson_id = l.id
WHERE ll.language_id = 2 AND
(
ll.action_id = 1
OR
ll.action_id = 6 AND ll.part_id = 1
)
GROUP BY l.id
And now we can wrap it with:
SELECT t.* FROM (...) t WHERE t.completed = 0
You'll usually get faster queries filtering rows with INNER/LEFT JOIN, but you need to test it.
SELECT lesson.* FROM lesson
INNER JOIN localization_logging task1
ON lesson.id = task1.lesson_id
LEFT JOIN localization_logging task2
ON lesson.id = task2.lesson_id
AND task2.language_id = 2
AND task2.part_id = 1
AND task2.action_id = 6
WHERE task1.language_id = 2
AND task1.action_id = 1
AND task2.lesson_id IS NULL
The second table is joined on multiple conditions, and they have to be listed in the ON clause: a LEFT JOIN keeps every row from the left side no matter what, so lessons without a matching action 6 row come back with NULLs in the task2 columns, and those are exactly the rows the WHERE clause keeps.
By the way, you'll get multiple rows per lesson if the task1 condition does not limit the results to one row; add GROUP BY lesson.id in that case.
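To check that the LEFT JOIN ... IS NULL form returns the same lessons as the NOT IN form, here is a small Python sketch using in-memory SQLite in place of MySQL. The lesson names and log rows are invented; lesson 2 has action 1 logged twice on purpose, to show the duplicate-row effect mentioned above (handled here with DISTINCT).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lesson (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE localization_logging (
        lesson_id INTEGER, language_id INTEGER, part_id INTEGER, action_id INTEGER);
    INSERT INTO lesson VALUES (1, 'Intro'), (2, 'Verbs'), (3, 'Nouns');
    INSERT INTO localization_logging VALUES
        (1, 2, 1, 1), (1, 2, 1, 6),   -- lesson 1: action 1 and action 6 done
        (2, 2, 1, 1), (2, 2, 1, 1);   -- lesson 2: action 1 logged twice, no action 6
""")

not_in_sql = """
    SELECT id FROM lesson
    WHERE id IN (SELECT lesson_id FROM localization_logging
                 WHERE language_id = 2 AND action_id = 1)
      AND id NOT IN (SELECT lesson_id FROM localization_logging
                     WHERE language_id = 2 AND part_id = 1 AND action_id = 6)
    ORDER BY id
"""
join_sql = """
    SELECT DISTINCT lesson.id FROM lesson
    INNER JOIN localization_logging task1
        ON lesson.id = task1.lesson_id
    LEFT JOIN localization_logging task2
        ON lesson.id = task2.lesson_id
       AND task2.language_id = 2
       AND task2.part_id = 1
       AND task2.action_id = 6
    WHERE task1.language_id = 2
      AND task1.action_id = 1
      AND task2.lesson_id IS NULL
    ORDER BY lesson.id
"""
ids_not_in = [row[0] for row in conn.execute(not_in_sql)]
ids_join = [row[0] for row in conn.execute(join_sql)]
print(ids_not_in, ids_join)
```

Whether the join form is actually faster depends on the indexes on localization_logging, so it is worth timing both on real data.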
I'm working on a query that combines JOIN, UNION and GROUP BY, and I came up with something like this:
SELECT *
FROM (
(SELECT countries_listing.id,
countries_listing.country,
1 AS is_country
FROM countries_listing
LEFT JOIN product_prices ON (product_prices.country_id = countries_listing.id)
WHERE countries_listing.status = 'Yes'
AND product_prices.product_id = '3521')
UNION
(SELECT countries_listing.id,
countries_listing.country,
0 AS is_country
FROM countries_listing
WHERE countries_listing.id NOT IN
(SELECT country_id
FROM product_prices
WHERE product_id='3521')
AND countries_listing.status='Yes')) AS partss
GROUP BY id
ORDER BY country
I just realised that this query takes a long time to return results, almost 8 seconds.
Is there a way to optimize it so that it runs as fast as possible?
If I understand the logic correctly, you just want to add a flag for the country as to whether or not there is a price for a given product. I think you can use an exists clause to get what you want:
SELECT cl.id, cl.country,
(exists (SELECT 1
FROM product_prices pp
WHERE pp.country_id = cl.id AND
pp.product_id = '3521'
)
) as is_country
FROM countries_listing cl
WHERE cl.status = 'Yes'
ORDER BY country;
For performance, you want two indexes: countries_listing(status, country) and
product_prices(country_id, product_id).
Depending on how often it is executed, prepared statements could help. See PDO for more information.
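Here is a sketch of the EXISTS flag together with the two suggested indexes, run from Python against in-memory SQLite rather than MySQL. The index names, the `price` column and the sample countries are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries_listing (id INTEGER PRIMARY KEY, country TEXT, status TEXT);
    CREATE TABLE product_prices (country_id INTEGER, product_id TEXT, price REAL);
    -- The two indexes suggested above (index names are made up).
    CREATE INDEX idx_cl_status_country ON countries_listing (status, country);
    CREATE INDEX idx_pp_country_product ON product_prices (country_id, product_id);
    INSERT INTO countries_listing VALUES
        (1, 'Austria', 'Yes'), (2, 'Belgium', 'Yes'), (3, 'Chile', 'No');
    INSERT INTO product_prices VALUES
        (1, '3521', 9.5), (1, '3521', 8.0), (2, '9999', 5.0);
""")

rows = conn.execute("""
    SELECT cl.id, cl.country,
           EXISTS (SELECT 1
                   FROM product_prices pp
                   WHERE pp.country_id = cl.id AND pp.product_id = ?) AS is_country
    FROM countries_listing cl
    WHERE cl.status = 'Yes'
    ORDER BY cl.country
""", ("3521",)).fetchall()
print(rows)
```

Austria has two price rows for the product but still appears once, so the UNION and GROUP BY used for de-duplication in the original query are no longer needed.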
I am trying to use this to query 2 tables and get results based on factors from mainly one table. I would prefer doing 1 query instead of 1 query with many sub queries in a while or foreach.
SELECT a.request, a.city
FROM pages a, TNDB_CSV2 b
WHERE a.main_id = b.PerformerID
AND b.PCatID = '3'
AND a.catnum = '303'
AND a.city = b.City
AND b.TicketsYN = 'Y'
AND b.CountryID IN ('38', '217')
GROUP BY b.PerformerID, b.City HAVING COUNT(*) > 4
ORDER BY a.name ASC
So basically what this is saying is that I want to get results in 'pages' where records in 'TNDB_CSV2' have more than 4 matches of 'PerformerID' and 'City'.
The query works correctly; the issue is that it takes between 55 and 67 seconds to run, which is far too long. Similar queries should take a fraction of a second. I have never grouped by 2 columns using HAVING and COUNT before, so I am thinking there might be a much more efficient way of doing this.
The query currently returns 1,011 records and I looked to make sure that the conditions match the results and they do.
Here is your query, formatted with a proper join clause:
SELECT a.request, a.city
FROM pages a join
TNDB_CSV2 b
on a.main_id = b.PerformerID and a.city = b.City
WHERE b.PCatID = '3' AND
b.TicketsYN = 'Y' AND
b.CountryID IN ('38', '217') and
a.catnum = '303'
GROUP BY b.PerformerID, b.City
HAVING COUNT(*) > 4
ORDER BY a.name ASC;
You should be able to improve the performance of this query with indexes. Here are two that I can think of:
pages(catnum, main_id, city, name)
TNDB_CSV2(PerformerID, city, PCatID, TicketsYN, CountryID);
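For completeness, a sketch that creates those two composite indexes and runs the query, using Python with in-memory SQLite as a stand-in for MySQL. The index names and every sample row are invented, and timings on such a tiny data set say nothing about the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (main_id INTEGER, city TEXT, catnum TEXT, request TEXT, name TEXT);
    CREATE TABLE TNDB_CSV2 (
        PerformerID INTEGER, City TEXT, PCatID TEXT, TicketsYN TEXT, CountryID TEXT);
    -- The two suggested composite indexes (index names are made up).
    CREATE INDEX idx_pages ON pages (catnum, main_id, city, name);
    CREATE INDEX idx_tndb ON TNDB_CSV2 (PerformerID, City, PCatID, TicketsYN, CountryID);
    INSERT INTO pages VALUES
        (10, 'Boston', '303', '/band-a', 'Band A'),
        (11, 'Denver', '303', '/band-b', 'Band B');
""")
# Five matching events for performer 10, only four for performer 11.
conn.executemany(
    "INSERT INTO TNDB_CSV2 VALUES (?, ?, '3', 'Y', '217')",
    [(10, "Boston")] * 5 + [(11, "Denver")] * 4,
)

rows = conn.execute("""
    SELECT a.request, a.city
    FROM pages a
    JOIN TNDB_CSV2 b ON a.main_id = b.PerformerID AND a.city = b.City
    WHERE b.PCatID = '3'
      AND b.TicketsYN = 'Y'
      AND b.CountryID IN ('38', '217')
      AND a.catnum = '303'
    GROUP BY b.PerformerID, b.City
    HAVING COUNT(*) > 4
    ORDER BY a.name ASC
""").fetchall()
print(rows)
```

On the real MySQL tables, running EXPLAIN on the query before and after adding the indexes should show whether they are picked up.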