Multiple joins in a MySQL query giving incorrect results

I've got a large MySQL query with 5 joins. It may not look efficient, but I'm struggling to find a different solution that works.
The views table is the main table here, because both the clicks and conversions tables rely on it via the token column (which is indexed and set as a foreign key in all tables).
The query:
SELECT
var.id,
var.disabled,
var.name,
var.updated,
var.cid,
var.outdated,
IF(var.type <> 0,'DL','LP') AS `type`,
COUNT(DISTINCT v.id) AS `views`,
COUNT(DISTINCT c.id) AS `clicks`,
COUNT(DISTINCT co.id) AS `conversions`,
SUM(tc.cost) AS `cost`,
SUM(cp.value) AS `revenue`
FROM variants AS var
LEFT JOIN views AS v ON v.vid = var.id
LEFT JOIN traffic_cost AS tc ON tc.id = v.source
LEFT JOIN clicks AS c ON c.token = v.token
LEFT JOIN conversions AS co ON co.token = v.token
LEFT JOIN c_profiles AS cp ON cp.id = co.profile
WHERE var.cid = 28
GROUP BY var.id
The results I'm getting are:
The problem is that the revenue and cost results are too high: for views, clicks and conversions only the distinct rows are counted, but for revenue and cost, for some reason (I would really appreciate an explanation here), all rows in all tables are taken into the result set.
I know this is a large query, but both the clicks and conversions tables rely on the views table, which is used for filtering the results, e.g. views.country = 'uk'. I've tried doing 3 queries and merging them, but that didn't work (it gave me wrong results).
One more thing I find weird is that if I remove the joins with clicks, conversions and c_profiles, the cost column shows correct results.
Any help would be appreciated.

In the end I had to use 3 different queries and do a merge on them. Seemed like an overhead, but worked for me.
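
For anyone looking for the explanation the question asks for: the likely cause is join fan-out. Each view row is repeated once for every matching click/conversion combination, so SUM() adds the same cost and revenue several times, while COUNT(DISTINCT ...) hides the repetition. Below is a minimal sketch that reproduces this, assuming Python's built-in sqlite3 module as a stand-in for MySQL. The tables are cut down to the joined columns and every row is made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; schema reduced to the joined columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (id INTEGER PRIMARY KEY);
CREATE TABLE traffic_cost (id INTEGER PRIMARY KEY, cost INTEGER);
CREATE TABLE views (id INTEGER PRIMARY KEY, vid INTEGER, token TEXT, source INTEGER);
CREATE TABLE clicks (id INTEGER PRIMARY KEY, token TEXT);
CREATE TABLE conversions (id INTEGER PRIMARY KEY, token TEXT, profile INTEGER);
CREATE TABLE c_profiles (id INTEGER PRIMARY KEY, value INTEGER);

-- Invented data: one variant, two views costing 5 each, three clicks, one conversion worth 10.
INSERT INTO variants VALUES (1);
INSERT INTO traffic_cost VALUES (1, 5);
INSERT INTO views VALUES (1, 1, 'a', 1), (2, 1, 'b', 1);
INSERT INTO clicks VALUES (1, 'a'), (2, 'a'), (3, 'b');
INSERT INTO conversions VALUES (1, 'a', 1);
INSERT INTO c_profiles VALUES (1, 10);
""")

# Same join chain as in the question.
JOINS = """
FROM variants AS var
LEFT JOIN views AS v ON v.vid = var.id
LEFT JOIN traffic_cost AS tc ON tc.id = v.source
LEFT JOIN clicks AS c ON c.token = v.token
LEFT JOIN conversions AS co ON co.token = v.token
LEFT JOIN c_profiles AS cp ON cp.id = co.profile
"""

# The raw joined rows: view 1 appears twice because it has two clicks.
joined_rows = conn.execute(
    "SELECT v.id, c.id, co.id, tc.cost, cp.value" + JOINS + "ORDER BY v.id, c.id"
).fetchall()

# The aggregates from the question, computed over those repeated rows.
totals = conn.execute(
    "SELECT COUNT(DISTINCT v.id), COUNT(DISTINCT c.id), COUNT(DISTINCT co.id),"
    " SUM(tc.cost), SUM(cp.value)" + JOINS + "GROUP BY var.id"
).fetchone()

# Cost computed from views alone, without the clicks/conversions joins.
true_cost = conn.execute(
    "SELECT SUM(tc.cost) FROM views v JOIN traffic_cost tc ON tc.id = v.source"
).fetchone()[0]

for row in joined_rows:
    print(row)
print("views, clicks, conversions, cost, revenue:", totals)
print("cost from views only:", true_cost)
```

With this data the distinct counts come out right, but the summed cost is 15 instead of 10 and the revenue is 20 instead of 10. This also matches the observation that the cost is correct once the clicks and conversions joins are removed.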

Related

SQL to link three tables and SUM

I have three tables, tblPresents, tblPresentsOrdered and tblPresentsDelivered.
What I want to do is sum up all the orders and deliveries for a given present ID, so I can tally up the total ordered and delivered and check for discrepancies.
So far I have the following:
$sql ='SELECT prsName, SUM(ordQuantity) AS qtyOrdered,
SUM(delQuantity) AS qtyDelivered
FROM tblPresentOrders
LEFT JOIN tblPresentDeliveries
ON tblPresentDeliveries.delPresent = tblPresentOrders.ordPresent
RIGHT JOIN tblPresents ON tblPresents.prsID = tblPresentOrders.ordPresent
GROUP BY prsName';
The first column (Ordered) sums up correctly, but the deliveries column counts the delivery twice (there are two separate orders for that line).
What am I doing wrong?

Because you can have multiple orders per delivery (and presumably multiple presents per order) you need to perform aggregation in derived tables before JOINing to avoid duplication in counted/summed values. Note that using a mixture of LEFT JOIN and RIGHT JOIN in the same query can be a bit hard to read so I've rewritten the query using only LEFT JOINs.
SELECT p.prsName, o.qtyOrdered, d.qtyDelivered
FROM tblPresents p
LEFT JOIN (SELECT ordPresent, SUM(ordQuantity) AS qtyOrdered
FROM tblPresentOrders
GROUP BY ordPresent) o ON o.ordPresent = p.prsID
LEFT JOIN (SELECT delPresent, SUM(delQuantity) AS qtyDelivered
FROM tblPresentDeliveries
GROUP BY delPresent) d ON d.delPresent = p.prsID
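
The effect of aggregating in derived tables first can be checked on a tiny data set. This is only a sketch, assuming Python's built-in sqlite3 module in place of MySQL. The table and column names follow the queries above, the rows are invented, and the "naive" query is a LEFT JOIN rewrite of the original (older SQLite versions lack RIGHT JOIN).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblPresents (prsID INTEGER PRIMARY KEY, prsName TEXT);
CREATE TABLE tblPresentOrders (ordPresent INTEGER, ordQuantity INTEGER);
CREATE TABLE tblPresentDeliveries (delPresent INTEGER, delQuantity INTEGER);

-- Invented data: 'Teddy' has two orders (3 + 2) and one delivery (4); 'Ball' has neither.
INSERT INTO tblPresents VALUES (1, 'Teddy'), (2, 'Ball');
INSERT INTO tblPresentOrders VALUES (1, 3), (1, 2);
INSERT INTO tblPresentDeliveries VALUES (1, 4);
""")

# Join first, aggregate afterwards: the single delivery is paired with both orders.
naive = conn.execute("""
SELECT p.prsName, SUM(o.ordQuantity), SUM(d.delQuantity)
FROM tblPresents p
LEFT JOIN tblPresentOrders o ON o.ordPresent = p.prsID
LEFT JOIN tblPresentDeliveries d ON d.delPresent = o.ordPresent
GROUP BY p.prsName
ORDER BY p.prsName
""").fetchall()

# Aggregate each child table to one row per present, then join.
pre_aggregated = conn.execute("""
SELECT p.prsName, o.qtyOrdered, d.qtyDelivered
FROM tblPresents p
LEFT JOIN (SELECT ordPresent, SUM(ordQuantity) AS qtyOrdered
           FROM tblPresentOrders
           GROUP BY ordPresent) o ON o.ordPresent = p.prsID
LEFT JOIN (SELECT delPresent, SUM(delQuantity) AS qtyDelivered
           FROM tblPresentDeliveries
           GROUP BY delPresent) d ON d.delPresent = p.prsID
ORDER BY p.prsName
""").fetchall()

print("naive:         ", naive)
print("pre-aggregated:", pre_aggregated)
```

In the naive result the delivered quantity for 'Teddy' is doubled (8 instead of 4), because the delivery row is matched once per order. The derived-table version reports 5 ordered and 4 delivered.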

How to improve query performance (using explain command results f.e.)

I'm currently running this query. However, when run outside phpMyAdmin it causes a 504 timeout error. I think it has to do with how efficiently the query accesses or returns its rows.
I'm not extremely experienced with MySQL and so this was the best I could do:
SELECT
s.surveyId,
q.cat,
SUM((sac.answer_id*q.weight))/SUM(q.weight) AS score,
user.division_id,
user.unit_id,
user.department_id,
user.team_id,
division.division_name,
unit.unit_name,
dpt.department_name,
team.team_name
FROM survey_answers_cache sac
JOIN surveys s ON s.surveyId = sac.surveyid
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
JOIN cluster c ON sc.cluster_id = c.cluster_id
JOIN user ON user.user_id = sac.user_id
JOIN questions q ON q.question_id = sac.question_id
JOIN division ON division.division_id = user.division_id
LEFT JOIN unit ON unit.unit_id = user.unit_id
LEFT JOIN department dpt ON dpt.department_id = user.department_id
LEFT JOIN team ON team.team_id = user.team_id
WHERE c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
GROUP BY user.team_id, s.surveyId, q.cat
ORDER BY s.surveyId, user.team_id, q.cat ASC
The problem I get with this query is that when I get a correct result returned it runs quickly (let's say +-500ms), but when the result has twice as many rows, it takes more than 5 minutes and then causes a 504 timeout.
The other problem is that I didn't create this database myself, so I didn't set the indices myself. I'm thinking of improving these and therefore I used the explain command:
I see a lot of primary keys and a couple double indices, but I'm not sure if this would affect the performance this greatly.
EDIT: This piece of code takes up all the execution time:
$start_time = microtime(true);
$stmt = $conn->query($query); //query is simply the query above.
while ($row = $stmt->fetch_assoc()){
$resultSurveys["scores"][] = $row;
}
$stmt->close();
$end_time = microtime(true);
$duration = $end_time - $start_time; //value typically the execution time #reallyHigh...
So my question: Is it possible to (greatly?) improve the performance of the query by altering the database keys or should I divide my query into multiple smaller queries?

You can try something like this (although it's not practical for me to test it):
SELECT
sac.surveyId,
q.cat,
SUM((sac.answer_id*q.weight))/SUM(q.weight) AS score,
user.division_id,
user.unit_id,
user.department_id,
user.team_id,
division.division_name,
unit.unit_name,
dpt.department_name,
team.team_name
FROM survey_answers_cache sac
JOIN
(
SELECT
s.surveyId,
sc.subcluster_id
FROM
surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE
c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
) AS v ON v.surveyid = sac.surveyid
JOIN user ON user.user_id = sac.user_id
JOIN questions q ON q.question_id = sac.question_id
JOIN division ON division.division_id = user.division_id
LEFT JOIN unit ON unit.unit_id = user.unit_id
LEFT JOIN department dpt ON dpt.department_id = user.department_id
LEFT JOIN team ON team.team_id = user.team_id
GROUP BY user.team_id, v.surveyId, q.cat
ORDER BY v.surveyId, user.team_id, q.cat ASC
So I hope I didn't mess anything up.
Anyway, the idea is in the inner query you select only the rows you need based on your where condition. This will create a smaller tmp table as it only pulls 2 fields both ints.
Then in the outer query you join to the tables that you actually pull the rest of the data from, order and group. This way you are sorting and grouping on a smaller dataset. And your where clause can run in the most optimal way.
You may even be able to omit some of these tables, as you're only pulling data from a few of them, but without seeing the full schema and how it's related that's hard to say.
But just generally speaking this part (The sub-query)
SELECT
s.surveyId,
sc.subcluster_id
FROM
surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE
c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
is what is directly affected by your WHERE clause. So we can optimize this part first, then use it to join the rest of the data you need.
An example of removing tables can be easily deduced from the above, consider this
SELECT
s.surveyId,
sc.subcluster_id
FROM
surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
WHERE
sc.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
The cluster table (alias c) is never used to pull data from, only in the WHERE clause. So isn't
JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE
c.cluster_id=?
the same as, or equivalent to,
WHERE
sc.cluster_id=?
If so, we can eliminate that join completely.
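
Whether the two WHERE clauses really are interchangeable depends on referential integrity: dropping the join is only safe if every subcluster.cluster_id refers to an existing cluster row. Here is a small sketch of that check, assuming Python's built-in sqlite3 module as a stand-in for MySQL. The schema is reduced to the columns used in the sub-query, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cluster (cluster_id INTEGER PRIMARY KEY);
CREATE TABLE subcluster (
    subcluster_id INTEGER PRIMARY KEY,
    cluster_id INTEGER NOT NULL REFERENCES cluster(cluster_id)
);
CREATE TABLE surveys (
    surveyId INTEGER PRIMARY KEY,
    subcluster_id INTEGER NOT NULL REFERENCES subcluster(subcluster_id),
    active INTEGER,
    prepare INTEGER
);

-- Invented data; every subcluster points at an existing cluster.
INSERT INTO cluster VALUES (1), (2);
INSERT INTO subcluster VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO surveys VALUES (100, 10, 0, 0), (101, 10, 1, 0),
                           (102, 11, 0, 0), (103, 20, 0, 0);
""")

# Sub-query as written in the answer, including the join to cluster.
WITH_CLUSTER_JOIN = """
SELECT s.surveyId, sc.subcluster_id
FROM surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
ORDER BY s.surveyId
"""

# Same sub-query with the cluster join removed and the filter moved to sc.cluster_id.
WITHOUT_CLUSTER_JOIN = """
SELECT s.surveyId, sc.subcluster_id
FROM surveys s
JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
WHERE sc.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
ORDER BY s.surveyId
"""

params = (1, 10)
with_join = conn.execute(WITH_CLUSTER_JOIN, params).fetchall()
without_join = conn.execute(WITHOUT_CLUSTER_JOIN, params).fetchall()

print(with_join, without_join, with_join == without_join)
```

If subcluster.cluster_id could hold values with no matching cluster row, the version without the join could return extra rows, so it is worth confirming the foreign key exists before removing the join.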

The EXPLAIN result shows signs of a problem:
Using temporary; Using filesort: the ORDER BY needs to create temporary tables to do the sorting.
On the 3rd row, for the user table, type is ALL and key and ref are NULL: this means it needs to scan the whole table each time to retrieve results.
Suggestions:
Add indexes on user.cluster_id and on all fields involved in the ORDER BY and GROUP BY clauses. Keep in mind that the user table seems to be under the changein database (cross-database query).
Add indexes on the user columns involved in JOINs.
Add an index on s.surveyId.
If possible, keep the same sequence for the GROUP BY and ORDER BY clauses.
According to the accepted answer in this question, move the JOIN on the user table to the first position in the join queue.
Carefully read this official documentation. You may need to optimize the server configuration.
PS: query optimization is an art that requires patience and hard work. There is no silver bullet for that.
Welcome to the fine art of optimizing MySQL!

I think the problem happens when you add this:
JOIN user ON user.cluster_id = sc.subcluster_id
JOIN survey_answers_cache sac ON (sac.surveyId = s.surveyId AND sac.user_id = user.user_id)
The extra condition sac.user_id = user.user_id can easily be inconsistent.
Can you try doing a second join with the user table?
PS: can you add the output of SHOW CREATE TABLE?

MYSQL combining results into single item for display

I'm currently having an issue where I run a query on multiple tables getting the results, but they are all being considered independent. I've tried a couple ways of combining them, but because my SQL knowledge is limited I can't seem to get what I want to happen.
SELECT DISTINCT t.*, s.quantity, s.rrp, ts.thumbnail, ts.bigpic, t.rating
FROM tyres t
INNER JOIN stocklevels s
ON t.stockcode = s.stockcode
LEFT JOIN tyre_treads ts
ON t.treadid = ts.recid
LEFT JOIN reseller r
ON s.city=r.recid
WHERE s.quantity> 0 AND s.rrp > 0
I've tried adding GROUP BY t.recid and a couple other basic solutions but this doesn't seem to work. I've added a couple images which might help.
As you can see the bottom Toyo tyres are the same, just with varying cities and quantities.
Here they are on the website.
I want to combine them so that they say minimum 6 in stock and show only once on the site.

As long as there is at least one column in your SQL result that contains different values for "the same" tyre (like city in your example), GROUP BY won't work. You must adapt your SQL statement so that it only picks columns with the same values. In particular, you should remove the t.* from your SQL and name all columns (you then won't need the DISTINCT anymore).
Then, you sum over quantity to get the combined value for this column as wanted.
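-- Group by the tyre (t.recid), not the reseller, so the cities collapse into one row.
SELECT t.recid, SUM(s.quantity) AS quantity, s.rrp, ts.thumbnail, ts.bigpic, t.rating
FROM tyres t
INNER JOIN stocklevels s
ON t.stockcode = s.stockcode
LEFT JOIN tyre_treads ts
ON t.treadid = ts.recid
LEFT JOIN reseller r
ON s.city=r.recid
WHERE s.quantity> 0 AND s.rrp > 0
GROUP BY t.recid, s.rrp, ts.thumbnail, ts.bigpic, t.rating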

Multiple table JOIN Row count

Ok, I have a slightly complicated MySQL query
SELECT
childcare_attendance.supplier_id,
suppliers.name As `Childcare Provider`,
Count(families.ufi) As Attended,
families.ufi
FROM
childcare_attendance
LEFT OUTER JOIN
suppliers ON suppliers.id = childcare_attendance.supplier_id
LEFT OUTER JOIN
clients ON childcare_attendance.client_id = clients.id
LEFT OUTER JOIN
families ON clients.family_id = families.id
WHERE
childcare_attendance.trashed <> 1
GROUP BY
childcare_attendance.supplier_id, families.ufi
My issue is I want to create a report that lists the number of unique families that are attending each of the childcare places. I assumed that the above query would perform the task, although the childcare providers are showing multiple times and I am unsure why.
EDIT
Here is a SQL Fiddle without data (I will work on getting some test data in there)
http://sqlfiddle.com/#!9/7ef8e8
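
One possible reading of the symptom: because families.ufi is in the GROUP BY, the query returns one row per provider and family, and Count(families.ufi) counts attendance rows rather than families. Below is a hedged sketch of an alternative that groups by provider only and uses COUNT(DISTINCT ...). It assumes Python's built-in sqlite3 module in place of MySQL, a schema reduced to the columns in the query, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suppliers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE families (id INTEGER PRIMARY KEY, ufi TEXT);
CREATE TABLE clients (id INTEGER PRIMARY KEY, family_id INTEGER);
CREATE TABLE childcare_attendance (supplier_id INTEGER, client_id INTEGER, trashed INTEGER);

-- Invented data: clients 1 and 2 are siblings (family F1), client 3 belongs to F2.
INSERT INTO suppliers VALUES (1, 'Sunny'), (2, 'Acorn');
INSERT INTO families VALUES (1, 'F1'), (2, 'F2');
INSERT INTO clients VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO childcare_attendance VALUES
    (1, 1, 0), (1, 2, 0), (1, 3, 0), (1, 3, 1),
    (2, 3, 0), (2, 3, 0);
""")

JOINS = """
FROM childcare_attendance
LEFT OUTER JOIN suppliers ON suppliers.id = childcare_attendance.supplier_id
LEFT OUTER JOIN clients ON childcare_attendance.client_id = clients.id
LEFT OUTER JOIN families ON clients.family_id = families.id
WHERE childcare_attendance.trashed <> 1
"""

# Query from the question: grouping by ufi yields one row per provider AND family.
per_family = conn.execute(
    "SELECT childcare_attendance.supplier_id, suppliers.name,"
    " COUNT(families.ufi), families.ufi" + JOINS +
    "GROUP BY childcare_attendance.supplier_id, families.ufi"
    " ORDER BY childcare_attendance.supplier_id, families.ufi"
).fetchall()

# Alternative: one row per provider, counting each family once.
per_supplier = conn.execute(
    "SELECT childcare_attendance.supplier_id, suppliers.name,"
    " COUNT(DISTINCT families.ufi)" + JOINS +
    "GROUP BY childcare_attendance.supplier_id, suppliers.name"
    " ORDER BY childcare_attendance.supplier_id"
).fetchall()

print(per_family)
print(per_supplier)
```

With this data the original grouping lists 'Sunny' twice (once per family), while the alternative returns a single row per provider with the number of distinct families.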

Improving MySQL query performance - possible index problems

I've seen many questions like mine, but after reading them all I got rather confused.
To sum up - I've got a query that selects products from a table and adds more information about them from other tables.
Query:
SELECT
p.product_id,
p.product_name,
p.product_seo_url,
p.product_second_name,
p.product_intro_plain,
p.product_price,
p.product_price_promo,
p.product_promo_expire_date,
p.product_views,
p.product_code,
p.product_exquisite,
p.product_rating,
p.product_votes,
p.product_date_added,
p.product_returned,
p.product_price_returned,
( SELECT gal.image_filelocation
FROM 3w_products_gallery gal
WHERE gal.product_id = p.product_id
ORDER BY show_order ASC
LIMIT 1 ) image_filelocation,
m.man_image_location,
m.man_name,
m.man_seo_url
FROM
3w_products p
LEFT JOIN 3w_manufacturers m
ON p.man_id = m.man_id
LEFT JOIN 3w_products_cat_rel pcr
ON p.product_id = pcr.product_id
WHERE
pcr.ctg_id = '19'
AND p.man_id = '190'
ORDER BY
p.product_id DESC
LIMIT
0, 24
The strange thing is that the query sometimes executes in 0.001 sec. and sometimes takes 30+ seconds.
Here is what EXPLAIN shows:
http://i.stack.imgur.com/ZNtBX.png
I assume the problem lies in the indexes of the tables. Can you tell me how to set them up?
Let me know if you need any more information about the tables or whatever!
Best,
Dimitar

If your "ID" columns are actually numeric, remove the quotes around the values, which imply strings, even though MySQL would do an implicit conversion. If numeric, keep it numeric.
As stated by another user in the comments, your LEFT JOIN via the "pcr" alias, with its criteria in the WHERE clause, turns this into an inner join.
FROM
3w_products p
LEFT JOIN 3w_manufacturers m
ON p.man_id = m.man_id
LEFT JOIN 3w_products_cat_rel pcr
ON p.product_id = pcr.product_id
AND pcr.ctg_id = 19
WHERE
p.man_id = 190
Field-level queries can kill performance since the select for each field (your image location) is done once for every record. To at least help this performance, table 3w_products_gallery should have an index ON ( product_id, show_order )
Your main 3w_products table should have an index on (man_id, product_id )... The Man_ID to optimize the WHERE clause by manufacturer, but also the product ID to help optimize the ORDER BY criteria.
Your 3w_manufacturers table, I would suspect already has a valid index on (man_id), since it appears to be the primary key to the table.
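
To see what a composite index on (product_id, show_order) changes for the per-row image sub-query, here is a rough sketch using Python's built-in sqlite3 module. SQLite's EXPLAIN QUERY PLAN is not MySQL's EXPLAIN and its optimizer differs, so treat this only as an illustration of the idea. The table name products_gallery, the index name idx_gallery_product_order and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE products_gallery (
    image_id INTEGER PRIMARY KEY,
    product_id INTEGER,
    show_order INTEGER,
    image_filelocation TEXT
)
""")
# Invented data: 50 products with 5 gallery images each.
conn.executemany(
    "INSERT INTO products_gallery (product_id, show_order, image_filelocation)"
    " VALUES (?, ?, ?)",
    [(p, o, f"img/{p}_{o}.jpg") for p in range(1, 51) for o in range(1, 6)],
)

# The correlated sub-query from the question, for a single product.
SUBQUERY = """
SELECT image_filelocation
FROM products_gallery
WHERE product_id = ?
ORDER BY show_order ASC
LIMIT 1
"""

def plan(sql, params):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

plan_before = plan(SUBQUERY, (7,))

# Composite index: equality column first, then the ORDER BY column.
conn.execute(
    "CREATE INDEX idx_gallery_product_order"
    " ON products_gallery (product_id, show_order)"
)
plan_after = plan(SUBQUERY, (7,))

first_image = conn.execute(SUBQUERY, (7,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("first image:", first_image)
```

After the index is created, the plan refers to idx_gallery_product_order instead of scanning the table, so the first image per product can be read from the index order rather than sorted on every call. In MySQL the equivalent check would be to run EXPLAIN on the real query before and after adding the index.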
Additionally, since this is web-based content, you might be better off DE-NORMALIZING your product table by adding one new column for "GalleryShowOrder". Then add a trigger to your gallery table so that any insert or update pushes the first "showOrder" value back to the product table. This way, when you query, you can just add another join to that table on the product and the KNOWN show order. If your gallery is returning 1000's of records, even though you are only limiting to 24, it still needs to get all records before the ORDER BY is applied. Thus 1000's of subqueries, one for each of the gallery images.
and your field selection would just become
gal.image_filelocation,
and your JOIN would add the following
LEFT JOIN 3w_products_gallery gal
on p.product_id = gal.product_id
AND p.GalleryShowOrder = gal.show_order
