MySQL: Joining a table to itself, eliminating duplicate rows - php

I have a query that connects a table to itself. The results contain duplicate rows (sort of). The objective of this query is to produce a list of products most frequently purchased together. Consider this query:
SELECT o1.ITEM
,o2.ITEM as ITEM2
,o3.ITEM AS ITEM3
,count(DISTINCT o1.ORDERNUM) as oCount
FROM orders o1
INNER JOIN orders o2 ON o2.ORDERNUM = o1.ORDERNUM AND o2.ITEM != o1.ITEM
LEFT OUTER JOIN orders o3 ON o3.ORDERNUM = o1.ORDERNUM AND o3.ITEM != o2.ITEM AND o3.ITEM != o1.ITEM
GROUP BY o1.ITEM, o2.ITEM, o3.ITEM
ORDER BY oCount DESC
And the first 12 results:
+-------------+-------------+-------------+--------+
| ITEM | ITEM2 | ITEM3 | oCount |
+-------------+-------------+-------------+--------+
| 02B13.04.GP | 77A04.10 | 45A04.04.GP | 54 |
| 02B13.04.GP | 45A04.04.GP | 77A04.10 | 54 |
| 77A04.10 | 45A04.04.GP | 02B13.04.GP | 54 |
| 45A04.04.GP | 02B13.04.GP | 77A04.10 | 54 |
| 77A04.10 | 02B13.04.GP | 45A04.04.GP | 54 |
| 45A04.04.GP | 77A04.10 | 02B13.04.GP | 54 |
| 57B01.01.GP | 57B01.11.GP | 57B01.10.GP | 12 |
| 57B01.10.GP | 57B01.11.GP | 57B01.01.GP | 12 |
| 57B01.01.GP | 57B01.10.GP | 57B01.11.GP | 12 |
| 57B01.10.GP | 57B01.01.GP | 57B01.11.GP | 12 |
| 57B01.11.GP | 57B01.10.GP | 57B01.01.GP | 12 |
| 57B01.11.GP | 57B01.01.GP | 57B01.10.GP | 12 |
Note that the first 6 results are the same connections, in a different order. The second 6 results have the same issue (and this continues throughout the results). My goal is to have a single record for each item group, not a single row for each combination of each item group.
How can I avoid these repeated results?
Also any advice on a more efficient approach to this query would be welcome (I'd like to add an additional join, but with 1,000,000 orders the resource requirements are getting out of hand).
================================================
EDIT: To answer Darshan's questions
Can you share the table structure:
The table contains the lines for all the orders. If an order contains multiple products, there will be a line for each product (multiple lines for a given order). The only columns of concern in this query are:
ORDERNUM CHAR : Order Number
ITEM CHAR : SKU for the item
QTY INT : Quantity purchased
ORDDATE DATETIME : Order Date
Results returned: All I need is what I listed in the result sample above. The objective is to get a list of the products that are purchased together the most often.

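What you want is to eliminate rows that are duplicates regardless of item position. Because the self-join always produces every permutation of each item group, one trick is to keep only the permutation in which the items are in ascending order, using the predicate item1 < item2 < item3.
Here is a possible solution (note that the WHERE clause discards rows in which b.item or c.item is NULL, so the left joins behave as inner joins):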
SELECT a.item, b.item, c.item, count(*)
from `orders` a left join orders b
on a.ordernum = b.ordernum and a.item <> b.item
left join orders c on a.ordernum = c.ordernum
and a.item <> c.item and b.item <> c.item
where a.item < b.item and b.item < c.item
group by a.item, b.item, c.item
order by count(*) desc
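To see the ordering trick on a small scale, here is a minimal sketch using Python's built-in sqlite3 module instead of MySQL. The three sample orders are made up for illustration, and moving the `<` comparisons into the ON clauses (with plain inner joins) is my own variation on the query above, not something taken from the answer.

```python
import sqlite3

# In-memory database with a tiny, made-up orders table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ORDERNUM TEXT, ITEM TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("1001", "A"), ("1001", "B"), ("1001", "C"),
        ("1002", "C"), ("1002", "A"), ("1002", "B"),
        ("1003", "A"), ("1003", "B"),
    ],
)

# Requiring a.ITEM < b.ITEM < c.ITEM keeps exactly one permutation per item group.
triples = conn.execute(
    """
    SELECT a.ITEM, b.ITEM, c.ITEM, COUNT(DISTINCT a.ORDERNUM) AS oCount
    FROM orders a
    JOIN orders b ON b.ORDERNUM = a.ORDERNUM AND a.ITEM < b.ITEM
    JOIN orders c ON c.ORDERNUM = a.ORDERNUM AND b.ITEM < c.ITEM
    GROUP BY a.ITEM, b.ITEM, c.ITEM
    ORDER BY oCount DESC
    """
).fetchall()

print(triples)  # one row for the group (A, B, C), counted in two orders
```

Putting the comparison in the join condition also means each join only has to consider rows that sort after the previous item, which may help with the row explosion mentioned in the question, though that depends on indexing and would need measuring on the real data.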

Related

MySQL SUM of multiple rows from multiple table

I am trying to get the sum of multiple rows from 2 different tables, but somehow the result returns multiple rows.
I need to get the SUM of quotation_item_amount (group by quotation_id) and invoice_item_amount (group by invoice_id) and if I query unpaid quotation, I need to get WHERE SUM(invoice) < SUM(quotation)
So here's my sample table
table client_project_id
+-------------------+-----------+----------------------+
| client_project_id | client_id | client_project_title |
+-------------------+-----------+----------------------+
| 23 | 5 | Project 1 |
| 17 | 9 | Project 2 |
| 54 | 7 | Project 3 |
+-------------------+-----------+----------------------+
table quotation
+--------------+-------------------+------------------+
| quotation_id | client_project_id | quotation_number |
+--------------+-------------------+------------------+
| 1 | 23 | Q/01/2020/001 |
| 2 | 17 | Q/01/2020/002 |
| 3 | 54 | Q/01/2020/003 |
+--------------+-------------------+------------------+
table quotation_item
+-------------------+--------------+-----------------------+
| quotation_item_id | quotation_id | quotation_item_amount |
+-------------------+--------------+-----------------------+
| 1 | 1 | 500 |
| 2 | 1 | 700 |
| 3 | 1 | 600 |
| 4 | 2 | 200 |
| 5 | 2 | 150 |
| 6 | 3 | 900 |
+-------------------+--------------+-----------------------+
table invoice
+--------------+-------------------+------------------+
| invoice_id | client_project_id | invoice_number |
+--------------+-------------------+------------------+
| 1 | 23 | I/01/2020/001 |
| 2 | 17 | I/01/2020/002 |
| 3 | 54 | I/01/2020/003 |
+--------------+-------------------+------------------+
table invoice_item
+-------------------+--------------+-----------------------+
| invoice_item_id | invoice_id | invoice_item_amount |
+-------------------+--------------+-----------------------+
| 1 | 1 | 500 |
| 2 | 1 | 700 |
| 3 | 1 | 600 |
| 4 | 2 | 200 |
| 5 | 2 | 150 |
| 6 | 3 | 900 |
+-------------------+--------------+-----------------------+
The result that I need to obtain is:
SUM of quotation_item_amount and SUM of invoice_item_amount PER client_project_id
To query WHERE SUM(invoice) < SUM(quotation)
Here is my latest try at the query
SELECT
SUM(quotation_item.quotation_item_amount) as quot_amt,
SUM(invoice_item.invoice_item_amount) as inv_amt,
data_client_project.client_project_id,
data_client.client_name
FROM data_client_project a
LEFT JOIN quotation b ON a.client_project_id = b.client_project_id
LEFT JOIN data_client d ON a.client_id = d.client_id
LEFT JOIN invoice i ON a.client_project_id = i.client_project_id
JOIN (
SELECT quotation_id,
SUM(c.quotation_item_amount) as quot_amt
FROM quotation_item c
GROUP BY c.quotation_id
) quotitem
ON b.quotation_id = quotitem.quotation_id
JOIN (
SELECT invoice_id,
SUM(e.invoice_item_price) as inv_amt
FROM invoice_item e
GROUP BY e.invoice_id
) invitem
ON i.invoice_id = invitem.invoice_id
However, this results in multiple duplicate rows of the quotation_item_amount and invoice_item_amount.
Have tried using UNION / UNION ALL and several other queries which just do not work.
Thank you for all your suggestions.
It looks like you are trying to aggregate along two different dimensions at the same time. The solution is to pre-aggregate along each dimension:
SELECT *
FROM data_client_project cp LEFT JOIN
(SELECT q.client_project_id,
SUM(qi.quotation_item_amount * qi.quotation_item_qty) as quot_amt
FROM quotation q JOIN
quotation_item qi
ON qi.quotation_id = q.quotation_id
GROUP BY q.client_project_id
) q
USING (client_project_id) LEFT JOIN
(SELECT i.client_project_id,
SUM(invoice_item_price) as inv_amt
FROM invoice i JOIN
invoice_item ii
ON i.invoice_id = ii.invoice_id
GROUP BY i.client_project_id
) i
USING (client_project_id);
Two notes about your style.
First, you are using arbitrary letters for table aliases. This makes the query quite hard to follow and becomes quite awkward if you add new tables, remove tables, or rearrange the names. Use abbreviations for the tables. Much easier to follow.
Second, I don't really recommend SELECT * for such queries. But, you can avoid duplicated column by replacing ON with USING.
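The effect of pre-aggregating can be checked against the sample rows from the question. This is only a sketch: it uses Python's sqlite3 rather than MySQL, it leaves out data_client_project and data_client (no sample rows were posted for them), and it sums quotation_item_amount and invoice_item_amount as shown in the sample tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE quotation (quotation_id INTEGER, client_project_id INTEGER);
    CREATE TABLE quotation_item (quotation_id INTEGER, quotation_item_amount INTEGER);
    CREATE TABLE invoice (invoice_id INTEGER, client_project_id INTEGER);
    CREATE TABLE invoice_item (invoice_id INTEGER, invoice_item_amount INTEGER);
    INSERT INTO quotation VALUES (1, 23), (2, 17), (3, 54);
    INSERT INTO quotation_item VALUES (1, 500), (1, 700), (1, 600), (2, 200), (2, 150), (3, 900);
    INSERT INTO invoice VALUES (1, 23), (2, 17), (3, 54);
    INSERT INTO invoice_item VALUES (1, 500), (1, 700), (1, 600), (2, 200), (2, 150), (3, 900);
    """
)

# Naive version: joining both item tables first multiplies the rows per project,
# so every amount is summed once per matching row on the other side.
naive = conn.execute(
    """
    SELECT q.client_project_id,
           SUM(qi.quotation_item_amount), SUM(ii.invoice_item_amount)
    FROM quotation q
    JOIN quotation_item qi ON qi.quotation_id = q.quotation_id
    JOIN invoice i ON i.client_project_id = q.client_project_id
    JOIN invoice_item ii ON ii.invoice_id = i.invoice_id
    GROUP BY q.client_project_id
    ORDER BY q.client_project_id
    """
).fetchall()

# Pre-aggregated version: each side is reduced to one row per project before joining.
pre_aggregated = conn.execute(
    """
    SELECT q.client_project_id, q.quot_amt, i.inv_amt
    FROM (SELECT q.client_project_id, SUM(qi.quotation_item_amount) AS quot_amt
          FROM quotation q
          JOIN quotation_item qi ON qi.quotation_id = q.quotation_id
          GROUP BY q.client_project_id) q
    LEFT JOIN (SELECT i.client_project_id, SUM(ii.invoice_item_amount) AS inv_amt
               FROM invoice i
               JOIN invoice_item ii ON ii.invoice_id = i.invoice_id
               GROUP BY i.client_project_id) i
           ON i.client_project_id = q.client_project_id
    ORDER BY q.client_project_id
    """
).fetchall()

print(naive)           # inflated totals, e.g. project 23 shows 5400 instead of 1800
print(pre_aggregated)  # one correct total per project
```

With the totals on a single row per project, the unpaid filter from the question becomes a plain comparison such as WHERE COALESCE(inv_amt, 0) < quot_amt on the outer query.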
I may be missing something, but your table descriptions do not include an example for data_client or data_client_project. Given your example, I expect your row expansion is coming from the first 3 joins.
Make sure that the below is giving you the list of data you want first, then try joining in the calculation:
SELECT *
FROM data_client_project a
LEFT JOIN quotation b ON a.client_project_id = b.client_project_id
LEFT JOIN data_client d ON a.client_id = d.client_id
LEFT JOIN invoice i ON a.client_project_id = i.client_project_id;
#you may want to append the above with a limit 100 for testing.
If the main query returns duplicated rows, add DISTINCT to obtain only distinct rows,
and add a WHERE condition to filter the result by invitem.inv_amt < quotitem.quot_amt (invoiced less than quoted, i.e. unpaid):
SELECT distinct a.*, b.*, d.*, i.*
FROM data_client_project a
LEFT JOIN quotation b ON a.client_project_id = b.client_project_id
LEFT JOIN data_client d ON a.client_id = d.client_id
LEFT JOIN invoice i ON a.client_project_id = i.client_project_id
JOIN (
SELECT quotation_id,
SUM(c.quotation_item_amount * c.quotation_item_qty) as quot_amt
FROM quotation_item c
GROUP BY c.quotation_id
) quotitem ON b.quotation_id = quotitem.quotation_id
JOIN (
SELECT invoice_id,
SUM(e.invoice_item_price) as inv_amt
FROM invoice_item e
GROUP BY e.invoice_id
) invitem ON i.invoice_id = invitem.invoice_id
WHERE invitem.inv_amt < quotitem.quot_amt

GROUP BY multiple conditions at once

I have tables like this:
Users
+----+----------+-------------+
| id | name | other_stuff |
+----+----------+-------------+
| 1 | John Doe | x |
| 2 | Jane Doe | y |
| 3 | Burt Olm | z |
+----+----------+-------------+
Places
+----+------------+-------------+
| id | name | other_stuff |
+----+------------+-------------+
| 1 | Building A | x |
| 2 | Building B | y |
+----+------------+-------------+
Subjects
+----+------------+-------------+
| id | name | other_stuff |
+----+------------+-------------+
| 1 | Math | x |
| 2 | English | y |
+----+------------+-------------+
And a joining table:
PastLectures = lectures that took place
+----+-----------+----------+------------+---------+------------+
| id | id_users | id_place | id_subjects| length | date |
+----+-----------+----------+------------+---------+------------+
| 1 | 1 | 1 | 1 | 60 | 2015-10-25 |
| 2 | 1 | 1 | 1 | 120 | 2015-11-06 |
| 3 | 2 | 2 | 2 | 120 | 2015-11-04 |
| 4 | 2 | 2 | 1 | 60 | 2015-11-10 |
| 5 | 1 | 2 | 1 | 60 | 2015-11-10 |
| 6 | 2 | 2 | 1 | 40 | 2015-11-15 |
| 7 | 1 | 2 | 2 | 30 | 2015-11-15 |
+----+-----------+----------+------------+---------+------------+
I would like to display the SUM of all lessons for each user for a given month. The SUM should be grouped by each of Places and Subjects.
The result in final PHP output should look like this:
November 2015
+------------+-------------+---------------+-------------+
| Users.name | Places.name | Subjects.name | sum(length) |
+------------+-------------+---------------+-------------+
| Burt Olm | - | - | - |
| Jane Doe | Building B | Math | 100 |
| = | = | English | 120 |
| John Doe | Building A | Math | 120 |
| = | Building B | Math | 60 |
| = | = | English | 30 |
+------------+-------------+---------------+-------------+
I have tried creating the full output in pure SQL query using multiple GROUP BY (Group by - multiple conditions - MySQL), but when I do GROUP BY User.id,Places.id it shows each user only once (3 results) no matter the other GROUP BY conditions.
SQL:
SELECT PastLectures.id_users,Users.name AS user,Places.name AS places,Subjects.name AS subjects
FROM PastLectures
LEFT JOIN Users ON PastLectures.id_users = Users.id
LEFT JOIN Places ON PastLectures.id_Places = Places.id
LEFT JOIN Subjects ON PastLectures.id_Subjects = Subjects.id
WHERE date >= \''.$monthStart->format('Y-m-d H:i:s').'\' AND date <= \''.$monthEnd->format('Y-m-d H:i:s').'\'
GROUP BY Users.id,Places.id
ORDER BY Users.name,Places.name,Subjects.name
But I don't mind if part of the solution is done in PHP, I just don't know what to do next.
EDIT:
I also have a table Timetable, that stores who regularly teaches what and where. It stores only used combinations of the tables (each valid combination once).
Timetable = lectures that regularly take place
+----+-----------+----------+------------+-------------+
| id | id_users | id_place | id_subjects| other_stuff |
+----+-----------+----------+------------+-------------+
| 1 | 1 | 1 | 1 | x |
| 2 | 1 | 2 | 1 | y |
| 3 | 1 | 2 | 2 | z |
| 4 | 2 | 2 | 1 | a |
| 5 | 2 | 2 | 2 | b |
+----+-----------+----------+------------+-------------+
Is it possible to add only users with combinations that have a row in this table?
In this case it would mean omitting Burt Olm (no id=3 in Timetable). But if Burt has a Timetable entry and still no PastLectures entry, he would show here as in sample result (he should have had a lecture that month, because he is in Timetable, but no lectures took place).
Based on @Barmar's solution I updated the final SQL by making Timetable the primary table and adding one more LEFT JOIN to meet those needs.
Final SQL:
SELECT Users.name AS user,Places.name AS places,Subjects.name AS subjects, SUM(PastLectures.length)
FROM Timetable
LEFT JOIN PastLectures ON PastLectures.id_users = Timetable.id_users AND PastLectures.id_place = Timetable.id_place AND PastLectures.id_subjects = Timetable.id_subjects
AND date BETWEEN '2015-11-01 00:00:00' AND '2015-11-30 23:59:59'
LEFT JOIN Places ON Timetable.id_Place = Places.id
LEFT JOIN Subjects ON Timetable.id_Subjects = Subjects.id
LEFT JOIN Users ON Timetable.id_users = Users.id
GROUP BY Timetable.id,Timetable.id_users,Timetable.id_Place,Timetable.id_Subjects
ORDER BY Users.name,Places.name,Subjects.name
You need to include Subjects.id in the GROUP BY, so you get a separate result for each subject.
Also, you shouldn't use columns from tables that are joined with LEFT JOIN in the GROUP BY clause. If you do that, all the non-matching rows will be grouped together, because they all have NULL in that column. Use the columns in the main table.
GROUP BY PastLectures.id_users, PastLectures.id_Place, PastLectures.id_Subjects
DEMO
Note that there's no row for Burt Olm in the demo output, because all his rows are filtered out by the WHERE clause. If you want all users to be shown, you should make Users the main table, not PastLectures. And the date criteria needs to be moved into the ON clause when joining with PastLectures.
SELECT Users.name AS user,Places.name AS places,Subjects.name AS subjects, SUM(length)
FROM Users
LEFT JOIN PastLectures ON PastLectures.id_users = Users.id
AND date BETWEEN '2015-11-01 00:00:00' AND '2015-11-30 23:59:59'
LEFT JOIN Places ON PastLectures.id_Place = Places.id
LEFT JOIN Subjects ON PastLectures.id_Subjects = Subjects.id
GROUP BY Users.id, PastLectures.id_Place, PastLectures.id_Subjects
ORDER BY Users.name,Places.name,Subjects.name
DEMO
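The difference between filtering in the ON clause and filtering in WHERE can be reproduced with the sample rows from the question. This is a sketch using Python's sqlite3 rather than MySQL, and the half-open date range (>= the first of the month, < the first of the next month) is my own substitution for BETWEEN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Users (id INTEGER, name TEXT);
    CREATE TABLE Places (id INTEGER, name TEXT);
    CREATE TABLE Subjects (id INTEGER, name TEXT);
    CREATE TABLE PastLectures (id INTEGER, id_users INTEGER, id_place INTEGER,
                               id_subjects INTEGER, length INTEGER, date TEXT);
    INSERT INTO Users VALUES (1, 'John Doe'), (2, 'Jane Doe'), (3, 'Burt Olm');
    INSERT INTO Places VALUES (1, 'Building A'), (2, 'Building B');
    INSERT INTO Subjects VALUES (1, 'Math'), (2, 'English');
    INSERT INTO PastLectures VALUES
      (1, 1, 1, 1, 60, '2015-10-25'), (2, 1, 1, 1, 120, '2015-11-06'),
      (3, 2, 2, 2, 120, '2015-11-04'), (4, 2, 2, 1, 60, '2015-11-10'),
      (5, 1, 2, 1, 60, '2015-11-10'), (6, 2, 2, 1, 40, '2015-11-15'),
      (7, 1, 2, 2, 30, '2015-11-15');
    """
)
month = ("2015-11-01", "2015-12-01")

# Date filter inside ON: users without lectures in the month survive with NULLs.
report = conn.execute(
    """
    SELECT u.name, p.name, s.name, SUM(pl.length)
    FROM Users u
    LEFT JOIN PastLectures pl ON pl.id_users = u.id
                             AND pl.date >= ? AND pl.date < ?
    LEFT JOIN Places p ON p.id = pl.id_place
    LEFT JOIN Subjects s ON s.id = pl.id_subjects
    GROUP BY u.id, pl.id_place, pl.id_subjects
    ORDER BY u.name, p.name, s.name
    """,
    month,
).fetchall()

# Same filter in WHERE: rows with NULL dates are discarded, so Burt Olm disappears.
names_where = {
    row[0]
    for row in conn.execute(
        """
        SELECT u.name
        FROM Users u
        LEFT JOIN PastLectures pl ON pl.id_users = u.id
        WHERE pl.date >= ? AND pl.date < ?
        """,
        month,
    )
}

for row in report:
    print(row)
print(names_where)
```

The sums match the table the question asks for (Jane Doe 100 and 120, John Doe 120, 60 and 30), with Burt Olm kept as a row of NULLs that the PHP layer can print as dashes.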
According to standard SQL, you should GROUP BY all the fields you select, except for the aggregated fields (like SUM). Although MySQL allows you to do otherwise, it is better to adhere to the standard when you can (who knows when you will need to port your code to another database engine). So write your SQL like this:
SELECT PastLectures.id_users,
Users.name AS user,
Places.name AS places,
Subjects.name AS subjects,
Sum(length)
FROM PastLectures
LEFT JOIN Users ON PastLectures.id_users = Users.id
LEFT JOIN Places ON PastLectures.id_place = Places.id
LEFT JOIN Subjects ON PastLectures.id_Subjects = Subjects.id
WHERE date BETWEEN \''.$monthStart->format('Y-m-d H:i:s').'\'
AND \''.$monthEnd->format('Y-m-d H:i:s').'\'
GROUP BY PastLectures.id_users,
Users.name,
Places.name,
Subjects.name
ORDER BY Users.name,
Places.name,
Subjects.name

Select distinct and random rows from one table that match a value from another table

This topic has been much discussed but I was unable to find a solution that I can modify and make it work for my case. So maybe a more advanced expert will be able to help out.
I have a table called keywords which contains about 3000 rows with distinct keywords. Against each keyword there is a matching product_id, which are NOT unique, i.e. some of them are repeated. Table looks something like this:
+---------+------------+
| keyword | product_id |
+---------+------------+
| apple1 | 15 |
| apple2 | 15 |
| pear | 205 |
| cherry | 307 |
| melon | 5023 |
+---------+------------+
I have a second table called inventory that contains about 500K products, each with its own product ID and other product data.
Now I need to get one random product row from inventory table that matches each product_id from keywords table and insert those rows into another table.
Resulting table should be something like this:
+---------+------------+---------+---------+---------+
| keyword | product_id | product | data1 | data2 |
+---------+------------+---------+---------+---------+
| apple1 | 15 | app5 | d1 | d2 |
| apple2 | 15 | app1 | d1 | d2 |
| pear | 205 | pear53 | d1 | d2 |
| cherry | 307 | cher74 | d1 | d2 |
| melon | 5023 | melo2 | d1 | d2 |
+---------+------------+---------+---------+---------+
This is my query at the moment and the problem is how to get a random product from inventory that matches a product_id:
SELECT keywords.keyword, keywords.product_id, inventory.*
FROM keywords LEFT OUTER JOIN
inventory
ON keywords.product_id = inventory.id
ORDER BY RAND();
If you want it to only return rows when there is a match between the tables, then you want a regular (i.e. inner) join not a left outer join. You can also add the word distinct.
SELECT DISTINCT keywords.keyword, keywords.product_id, inventory.*
FROM keywords JOIN
inventory
ON keywords.product_id = inventory.id
ORDER BY RAND();
And if you only want 1 row returned, add limit 1 at the end.
SELECT keywords.keyword, keywords.product_id, inventory.*
FROM keywords JOIN
inventory
ON keywords.product_id = inventory.id
ORDER BY RAND() LIMIT 1;
Is this what you want?
SELECT *
FROM (
SELECT keywords.keyword, keywords.product_id, inventory.*
FROM keywords JOIN
inventory
ON keywords.product_id = inventory.id
ORDER BY RAND()
) tmp
GROUP BY tmp.keyword;
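I also tested it at http://sqlfiddle.com/#!2/e559a9/2/0. Run it a few times and the result will be randomized. Note that this relies on MySQL's nonstandard handling of GROUP BY: with ONLY_FULL_GROUP_BY enabled the query is rejected, and the row chosen for each group is not guaranteed to follow the subquery's ordering.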
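An alternative that does not depend on how GROUP BY picks a row is to number the candidates within each keyword in random order and keep the first one. The sketch below uses Python's sqlite3 (window functions need SQLite 3.25 or later); in MySQL 8 the same shape should work with RAND() in place of RANDOM(). The inventory layout here, a product_id column with several products per id, is an assumption based on the expected result in the question, and the sample products are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE keywords (keyword TEXT, product_id INTEGER);
    -- Assumed layout: several inventory rows can share one product_id.
    CREATE TABLE inventory (product_id INTEGER, product TEXT);
    INSERT INTO keywords VALUES ('apple1', 15), ('apple2', 15), ('pear', 205);
    INSERT INTO inventory VALUES
      (15, 'app1'), (15, 'app5'), (15, 'app9'), (205, 'pear53'), (205, 'pear7');
    """
)

# Number the matching products per keyword in random order, then keep number 1.
picks = conn.execute(
    """
    SELECT keyword, product_id, product
    FROM (
      SELECT k.keyword, k.product_id, i.product,
             ROW_NUMBER() OVER (PARTITION BY k.keyword ORDER BY RANDOM()) AS rn
      FROM keywords k
      JOIN inventory i ON i.product_id = k.product_id
    ) ranked
    WHERE rn = 1
    ORDER BY keyword
    """
).fetchall()

print(picks)  # one row per keyword; the chosen product varies between runs
```

Because the random numbering is done per keyword, apple1 and apple2 can end up with different products even though they share product_id 15, which is what the expected result table shows.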

Data from three tables duplicates when a JOIN statement is used

When I retrieve data from my tables using JOIN, the rows are duplicated. There are three tables.
Students
--------
StuID | Name |
1 | Appiah John |
2 | Minister A |
Levels
------
| LevelID | Level | Year | StuID |
| 08 | 100 | 2010 | 2 |
| 83 | 200 | 2011 | 1 |
| 45 | 200 | 2011 | 2 |
Ranks
-----
| RankID | Rank | StuID |
| 101 | 1st | 1 |
| 404 | 4th | 2 |
This is my query statement to select some data from the three tables
SELECT
m.StuID,
n.Level,
n.Year,
o.Rank
FROM
Students m
INNER JOIN
Levels n
ON
m.StuID=n.StuID
INNER JOIN
Ranks o
ON
m.StuID=o.StuID
WHERE
m.StuID=2;
OUTPUT
The query above produces a duplicate answer
| StuID | Level | Year |Rank |
| 2 | 100 | 2010 | 4th |
| 2 | 200 | 2011 | null |
| 2 | 100 | 2010 | 4th |
| 2 | 200 | 2011 | null |
DESIRED OUTPUT
I therefore wish that the output would be like below
| StuID | Level | Year |Rank |
| 2 | 100 | 2010 | 4th |
| 2 | 200 | 2011 | null |
QUESTIONS
Where am I going wrong?
Is join the best way to select data from three tables like this?
How can I make a query to get the desired output?
Believe it or not I think the comma between Students m and INNER JOIN is doing it. You're selecting from two separate tuples now, joined on any clause rather than joining the first table to the second to the third.
Try doing a left join instead of an inner join:
SELECT m.StuID,
n.Level,
n.Year,
o.Rank
FROM Students m
LEFT JOIN Levels n ON (m.StuID = n.StuID)
LEFT JOIN Ranks o ON (m.StuID = o.StuID)
WHERE m.StuID = 2
How about using select distinct m.StuID?
You can try something like this
SELECT distinct m.StuID, n.Level, n.Year, o.Rank
FROM Students m INNER JOIN Levels n ON m.StuID=n.StuID
INNER JOIN Ranks o
ON m.StuID=o.StuID WHERE m.StuID=2;
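For what it's worth, running the inner join against exactly the sample rows posted gives two rows, not four, so the duplication probably comes from rows that are not shown. The sketch below (Python's sqlite3, not MySQL) illustrates one way that could happen: a second Ranks row for the same student. That extra row is purely hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (StuID INTEGER, Name TEXT);
    CREATE TABLE Levels (LevelID TEXT, Level INTEGER, Year INTEGER, StuID INTEGER);
    CREATE TABLE Ranks (RankID INTEGER, Rank TEXT, StuID INTEGER);
    INSERT INTO Students VALUES (1, 'Appiah John'), (2, 'Minister A');
    INSERT INTO Levels VALUES ('08', 100, 2010, 2), ('83', 200, 2011, 1), ('45', 200, 2011, 2);
    INSERT INTO Ranks VALUES (101, '1st', 1), (404, '4th', 2);
    """
)

JOIN_SQL = """
    SELECT {distinct} m.StuID, n.Level, n.Year, o.Rank
    FROM Students m
    INNER JOIN Levels n ON m.StuID = n.StuID
    INNER JOIN Ranks o ON m.StuID = o.StuID
    WHERE m.StuID = 2
    ORDER BY n.Year
"""

# With the rows as posted: one result row per Levels row of student 2.
as_posted = conn.execute(JOIN_SQL.format(distinct="")).fetchall()

# Hypothetical extra Ranks row for student 2: every Levels row now matches twice.
conn.execute("INSERT INTO Ranks VALUES (405, '4th', 2)")
with_extra_rank = conn.execute(JOIN_SQL.format(distinct="")).fetchall()

# DISTINCT collapses the repeats because the selected columns are identical.
deduplicated = conn.execute(JOIN_SQL.format(distinct="DISTINCT")).fetchall()

print(len(as_posted), len(with_extra_rank), len(deduplicated))
```

DISTINCT hides the symptom here, but if a student really should have only one rank, it may be worth checking the Ranks table for repeated StuID values.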

Need help with a MySQL statement

I have a table of Products that looks like so:
| id | Description | Price |
| 1 | dinglehopper | 2.99 |
| 2 | flux capacitor | 48.99 |
| 3 | thing1 | 48.99 |
And so on...
Then I have an OrderLineItem table which, as you can guess, links each item in an order to the product:
| id | productID | OrderID |
| 43 | 1 | 12 |
| 44 | 2 | 12 |
| 52 | 3 | 15 |
So, as you can see, order #12 contains a dinglehopper and flux capacitor. How can I get this information in a single query? I just want ALL the products associated with a given OrderID in the OrderLineItem table.
Maybe with
select p.description,p.id,o.OrderID
from
`orderLineItem` o, `product` p
where
p.id = o.productId;
or
select p.description,p.id,o.OrderID
from `orderLineItem` o
join `product` p
on p.id = o.productId;
LEFT JOIN :)
http://www.w3schools.com/sql/sql_join_left.asp
@Pete: Regarding the "single query" part, you could create a VIEW from this join if you are really going to use it a lot.
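Putting the join and the view suggestion together, here is a small sketch with the sample rows from the question, run through Python's sqlite3 rather than MySQL. The view name OrderProducts is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Products (id INTEGER PRIMARY KEY, Description TEXT, Price REAL);
    CREATE TABLE OrderLineItem (id INTEGER PRIMARY KEY, productID INTEGER, OrderID INTEGER);
    INSERT INTO Products VALUES
      (1, 'dinglehopper', 2.99), (2, 'flux capacitor', 48.99), (3, 'thing1', 48.99);
    INSERT INTO OrderLineItem VALUES (43, 1, 12), (44, 2, 12), (52, 3, 15);

    -- Hypothetical view name; it wraps the join so callers only filter by OrderID.
    CREATE VIEW OrderProducts AS
    SELECT o.OrderID, p.id AS productID, p.Description, p.Price
    FROM OrderLineItem o
    JOIN Products p ON p.id = o.productID;
    """
)

# All products for one order, fetched with a single query against the view.
order_12 = conn.execute(
    "SELECT Description, Price FROM OrderProducts WHERE OrderID = ? ORDER BY productID",
    (12,),
).fetchall()

print(order_12)
```

An inner join is enough here because only line items that point at an existing product are wanted; a LEFT JOIN would matter only if line items with a missing product should still be listed.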
