Retail inventory MySQL query optimization - PHP

Given the following tables for a retail administration system:
STORES: store_id, name
PRODUCTS: product_id, name, cost
PRODUCT_ENTRIES: key, store_id, date
PRODUCT_ENTRIES_CONTENT: product_entries_key, product_id, quantity
PRODUCT_EXITS: key, store_id, product_id, quantity, status, date
SALES: key, store_id, date
SALES_CONTENT: sales_key, product_id, quantity
RETURNS: key, store_id, date
RETURNS_CONTENT: returns_key, product_id, quantity
In order to calculate stock values I run through the contents of the products table and for each product_id:
Sum the quantities from product_entries_content as well as returns_content
Subtract the quantities from product_exits (where status = 2 or 3) as well as sales_content
To calculate the cost of the inventory of each store, I'm running the following query through a PHP loop for each distinct store and outputting the result:
SELECT
SUM((((
(SELECT COALESCE(SUM(product_entries_content.quantity), 0)
FROM product_entries
INNER JOIN product_entries_content ON
product_entries_content.product_entries_key = product_entries.key
WHERE product_entries_content.product_id = products.id
AND product_entries.store_id = '.$row['id'].'
AND DATE(product_entries.date) <= DATE(NOW()))
-
(SELECT COALESCE(SUM(quantity), 0)
FROM sales_content
INNER JOIN sales ON sales.key = sales_content.sales_key
WHERE product_id = products.product_id AND sales.store_id = '.$row['id'].'
AND DATE(sales_content.date) <= DATE(NOW()))
+
(SELECT COALESCE(SUM(quantity), 0)
FROM returns_content
INNER JOIN returns ON returns.key = returns_content.returns_key
WHERE product_id = products.product_id AND returns.store_id = '.$row['id'].'
AND DATE(returns.date) <= DATE(NOW()))
-
(SELECT COALESCE(SUM(quantity), 0)
FROM product_exits
WHERE product_id = products.product_id AND (status = 2 OR status = 3)
AND product_exits.store_id = '.$row['id'].' #store_id
AND DATE(product_exits.date) <= DATE(NOW()))
) * products.cost) / 100) ) AS "'.$row['key'].'" #store_name
FROM products WHERE 1
All foreign keys and indexes are properly set. The problem is that, because of the large number of stores and the volume of movements in each store, the query is becoming increasingly heavy, and because inventory is calculated from the beginning of each store's history it only gets slower over time.
What could I do to optimize this scheme?

Ideally, SHOW CREATE TABLE tablename for each table would really help a lot in any optimization question. The data type of each column is EXTREMELY important to performance.
That said, from the information you've given the following should be helpful, assuming the column data types are all appropriate.
Add the following indexes, if they do not exist. IMPORTANT: Single column indexes are NOT valid replacements for the following composite indexes. You stated that
All foreign keys and indexes are properly set.
but that tells us nothing about what they are, or whether they are "proper" for this optimization.
New indexes
ALTER TABLE sales
ADD INDEX `aaaa` (`store_id`,`key`);
ALTER TABLE sales_content
ADD INDEX `bbbb` (`product_id`,`sales_key`,`date`,`quantity`);
ALTER TABLE returns
ADD INDEX `cccc` (`store_id`,`date`,`key`);
ALTER TABLE returns_content
ADD INDEX `dddd` (`product_id`,`returns_key`,`quantity`);
ALTER TABLE product_exits
ADD INDEX `eeee` (`product_id`,`status`,`store_id`,`date`,`quantity`);
ALTER TABLE product_entries
ADD INDEX `ffff` (`store_id`,`date`,`key`);
ALTER TABLE product_entries_content
ADD INDEX `gggg` (`product_id`,`product_entries_key`,`quantity`);
(Use more appropriate names than aaaa. I just used those to save time.)
Each of the above indexes will allow the database to read only the index for each table. Most performance issues involving joins come from what is known as a double lookup.
Understanding indexes and double lookups
An index is just a copy of the table data. Each column listed in the index is copied from the table, in the order listed in the index, and then the primary key is appended to that row in the index. When the database uses an index to look up a value, if not all the needed information is contained in the index, the primary key will be used to access the clustered index of the table to obtain the rest of the information. That second access is the double lookup, and it is VERY bad for performance.
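One quick way to see whether a query is actually covered by one of these indexes is EXPLAIN: when the index covers the query, MySQL reports "Using index" in the Extra column. A minimal sketch, using hypothetical literal values in place of the PHP-interpolated ones:
EXPLAIN
SELECT COALESCE(SUM(quantity), 0)
FROM sales_content
WHERE product_id = 123   -- hypothetical product id
  AND sales_key = 456;   -- hypothetical sales key
-- "Using index" in the Extra column means the bbbb index alone answers
-- this lookup, with no double lookup back to the clustered index.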
Example
All the above indexes are designed to avoid double lookups. Let's look at the second subquery to see how the indexes related to that query will work.
ALTER TABLE sales
ADD INDEX `aaaa` (`store_id`,`key`);
ALTER TABLE sales_content
ADD INDEX `bbbb` (`product_id`,`sales_key`,`date`,`quantity`);
Subquery (I added aliases and adjusted how the date column is accessed, but otherwise it is unchanged):
SELECT COALESCE(SUM(sc.quantity), 0)
FROM sales_content sc
INNER JOIN sales s
ON s.key = sc.sales_key
WHERE sc.product_id = p.product_id
AND s.store_id = '.$row['id'].'
AND sc.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
Using the aaaa index, the database will be able to look up only those rows in the sales table that match the store_id, since that is listed first in the index. Think of this in the same way as a phone book, where store_id is the last name, and key is the first name. If you have the last name, then it is EXTREMELY easy to flip to that point of the phone book, and quickly get all the first names that go with that last name. Likewise, the database is able to very quickly "flip" to the part of the index that contains the given store_id value, and find all the key values. In this case, we do not need the primary key at all (which would be the phone number, in the phone book example.)
So, done with the sales table, and we have all the key values we need from there.
Next, the database moves onto the bbbb index. We already have product_id from the main query, and we have the sales_key from the aaaa index. That is like having both first and last name in the phone book. The only thing left to compare is the date, which could be like the address in a phone book. The database will store all the dates in order, and so by giving it a cutoff value, it can just look at all the dates up to a certain point.
The last part of the bbbb index is the quantity, which is there so that the database can quickly sum up all those quantities. To see why this is fast, consider again the phone book. Imagine in addition to last name, first name, and address information, that there is also a quantity column (of something, it doesn't matter what). If you wanted the sum of the quantities for a specific last name, first name, and for all addresses that start with the number 5 or less, that is easy, isn't it? Just find the first one, and add them up in order until you reach the first address that starts with a number greater than 5. The database benefits the same way when using the date column in this way (date is like the address column, in this example.)
The date columns
Finally, I noted earlier, I changed how the date column was accessed. You never want to run a function on a database column that you are comparing to another value. The reason is this: What would happen if you had to convert all the addresses into roman numerals, before you did any comparison? You wouldn't be able to just go down the list like we did earlier. You'd have to convert ALL the values, and THEN check each one to make sure it was within the limit, since we no longer know if the values are sorted correctly to just be able to do the "read them all and then stop at a certain value" shortcut I described above.
You and I may know that converting a datetime value to a date isn't going to change the order, but the database will not know (it might be possible it optimizes this conversion, but that's not something I want to assume.) So, keep the columns pure. The change I made was to just take the NOW() date, and add one day, and then make it a < instead of a <=. After all, comparing two values and saying the date must be equal to or less than today's date is equivalent to saying the datetime must be less than tomorrow's date.
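To make the difference concrete, here is the same product_exits filter written both ways; this is only a sketch, with hypothetical literal ids in place of the PHP-interpolated values:
-- Non-sargable: DATE() is applied to every candidate row's date value,
-- so the date portion of the eeee index cannot be range-scanned.
SELECT COALESCE(SUM(quantity), 0)
FROM product_exits
WHERE product_id = 123
  AND status = 2
  AND store_id = 1
  AND DATE(`date`) <= DATE(NOW());

-- Sargable: the raw column is compared to an adjusted constant, so the
-- rows can be read in index order and the scan can stop early.
SELECT COALESCE(SUM(quantity), 0)
FROM product_exits
WHERE product_id = 123
  AND status = 2
  AND store_id = 1
  AND `date` < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY);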
The query
Below is my final query for you. As stated, not much has changed other than the date change and aliases. However, you had a typo in the first subquery where you accessed products.id. I corrected the id to product_id, since that matches what you stated were the columns of the products table.
SELECT
SUM(
(
(
(
(
SELECT COALESCE(SUM(pec.quantity), 0)
FROM product_entries pe
INNER JOIN product_entries_content pec
ON pec.product_entries_key = pe.key
WHERE pec.product_id = p.product_id
AND pe.store_id = '.$row['id'].'
AND pe.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
-
(
SELECT COALESCE(SUM(sc.quantity), 0)
FROM sales_content sc
INNER JOIN sales s
ON s.key = sc.sales_key
WHERE sc.product_id = p.product_id
AND s.store_id = '.$row['id'].'
AND sc.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
+
(
SELECT COALESCE(SUM(rc.quantity), 0)
FROM returns_content rc
INNER JOIN returns r
ON r.key = rc.returns_key
WHERE rc.product_id = p.product_id
AND r.store_id = '.$row['id'].'
AND r.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
-
(
SELECT COALESCE(SUM(pex.quantity), 0)
FROM product_exits pex
WHERE pex.product_id = p.product_id
AND (pex.status = 2 OR pex.status = 3)
AND pex.store_id = '.$row['id'].' #store_id
AND pex.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
)
* p.cost)
/ 100)
) AS "'.$row['key'].'" #store_name
FROM products p WHERE 1
You may be able to further optimize this by splitting the subquery on the product_exits table into two separate subqueries, rather than using an OR, which often performs poorly. Ultimately, you'll have to benchmark that to see how well the database optimizes the OR on its own.
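For reference, a sketch of that split for the product_exits portion: this fragment would replace the single subquery that uses the OR, and the literal 1 stands in for the PHP-interpolated store id.
(
  SELECT COALESCE(SUM(pex.quantity), 0)   -- exits with status = 2
  FROM product_exits pex
  WHERE pex.product_id = p.product_id
    AND pex.status = 2
    AND pex.store_id = 1
    AND pex.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
+
(
  SELECT COALESCE(SUM(pex.quantity), 0)   -- exits with status = 3
  FROM product_exits pex
  WHERE pex.product_id = p.product_id
    AND pex.status = 3
    AND pex.store_id = 1
    AND pex.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)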

Related

Pagination Offset Issues - MySQL

I have an orders grid holding 1 million records. The page has pagination, sort and search options. If the sort order is set by customer name with a search key and the page number is 1, it works fine.
SELECT * FROM orders WHERE customer_name like '%Henry%' ORDER BY
customer_name desc limit 10 offset 0
It becomes a problem when the User clicks on the last page.
SELECT * FROM orders WHERE customer_name like '%Henry%' ORDER BY
customer_name desc limit 10 offset 100000
The above query takes forever to load. Indexes are set on the order id, customer name, and date of order columns.
I could use this solution https://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/ if I didn't have a non-primary-key sort option, but in my case the sort column is user selected. It can change between order id, customer name, date of order, etc.
Any help would be appreciated. Thanks.
Problem 1:
LIKE "%..." -- The leading wildcard requires a full scan of the data, or at least until it finds the 100000+10 rows. Even
... WHERE ... LIKE '%qzx%' ... LIMIT 10
is problematic, since there are probably not even 10 such names. So, a full scan of your million names.
... WHERE name LIKE 'James%' ...
will at least start in the middle of the table-- if there is an index starting with name. But still, the LIMIT and OFFSET might conspire to require reading the rest of the table.
Problem 2: (before you edited your Question!)
If you leave out the WHERE, do you really expect the user to page through a million names looking for something?
This is a UI problem.
If you have a million rows, and the output is ordered by Customer_name, that makes it easy to see the Aarons and the Zywickis, but not anyone else. How would you get to me (James)? Either you have 100K links and I am somewhere near the middle, or the poor user would have to press [Next] 'forever'.
My point is that the database is not the place to fix this; the pagination design itself is the problem.
In some other situations, it is meaningful to go to the [Next] (or [Prev]) page. In these situations, "remember where you left off", then use that to efficiently reach into the table. OFFSET is not efficient. More on Pagination
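A rough illustration of "remember where you left off" (keyset pagination), assuming an index on (customer_name, id) and that the application remembered the last customer_name and id it displayed; the literal values are placeholders, and the search filter from the question is omitted for brevity:
-- First page
SELECT *
FROM orders
ORDER BY customer_name DESC, id DESC
LIMIT 10;

-- Next page: seek past the last row shown instead of using OFFSET
SELECT *
FROM orders
WHERE customer_name < 'Henry'                   -- last customer_name from the previous page
   OR (customer_name = 'Henry' AND id < 12345)  -- tie-break on the last id shown
ORDER BY customer_name DESC, id DESC
LIMIT 10;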
I use a special concept for this. First I have a table called pager. It contains a primary key pager_id, and some values to identify the user (user_id, session_id), so that the pager data can't be stolen.
Then I have a second table called pager_filter. It consists of 3 ids (a sketch of both tables follows the column list):
pager_id int unsigned not NULL # id of table pager
order_id int unsigned not NULL # store the order here
reference_id int unsigned not NULL # reference into the data table
primary key(pager_id,order_id);
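A minimal sketch of what these two tables might look like; the column types are assumptions based on the description above:
CREATE TABLE pager (
    pager_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL,   -- identifies the owner of this pager
    session_id VARCHAR(64)  NOT NULL,   -- so the pager data can't be reused by others
    PRIMARY KEY (pager_id)
);

CREATE TABLE pager_filter (
    pager_id     INT UNSIGNED NOT NULL, -- id of table pager
    order_id     INT UNSIGNED NOT NULL, -- position in the user's chosen ordering
    reference_id INT UNSIGNED NOT NULL, -- reference into the data table
    PRIMARY KEY (pager_id, order_id)
);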
As a first operation, I select all records matching the filter rules from the data table and insert them into pager_filter:
DELETE FROM pager_filter WHERE pager_id = $PAGER_ID;
INSERT INTO pager_filter (pager_id, order_id, reference_id)
SELECT $PAGER_ID AS pager_id,
       ROW_NUMBER() OVER (ORDER BY $ORDERING) AS order_id,
       data_id AS reference_id
FROM data_table
WHERE $CONDITIONS
ORDER BY $ORDERING;
After filling the filter table you can use an inner join for pagination:
SELECT d.*
FROM pager_filter f
INNER JOIN data_table d ON d.data_id = f.reference_id
WHERE f.pager_id = $PAGER_ID AND f.order_id BETWEEN 100000 AND 100099
ORDER BY f.order_id
or
SELECT d.*
FROM pager_filter f
INNER JOIN data_table d ON d.data_id = f.reference_id
WHERE f.pager_id = $PAGER_ID
ORDER BY f.order_id
LIMIT 100 OFFSET 100000
Hint: all of the code above is untested pseudocode.

How to reduce subquery execution time...?

I want a per-day sales item count. I have already created a query for that, but it takes too long (around 55.585 s).
Query:
SELECT
td.db_date,
(
select count(*) from `order` where DATE(`order`.created_on) = td.db_date
)as day_contribute
FROM time_dimension as td
So can anyone please let me know how I can optimize this query and reduce the execution time?
You can modify your query to use a join, like this:
SELECT
td.db_date, COUNT(`order`.id) AS day_contribute
FROM time_dimension AS td
LEFT JOIN `order` ON DATE(`order`.created_on) = td.db_date
GROUP BY td.db_date;
I do not know the primary key of your order table, so I just used order.id. Replace it with yours.
It is also very important to check that you have an index on the td.db_date field.
One more important thing: it is better to avoid using DATE(order.created_on), because it means the DATE() function is called every time the database compares the dates. If possible, convert order.created_on to the same format as td.db_date, or join on other fields. That will add speed too.
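If created_on stays a DATETIME, an alternative that keeps the join index-friendly is to join on a one-day range instead of wrapping the column in DATE(). A sketch, assuming td.db_date is a DATE and the primary key is order.id as above (the table name needs backticks because order is a reserved word):
SELECT td.db_date,
       COUNT(o.id) AS day_contribute   -- COUNT(o.id) gives 0 for days with no orders
FROM time_dimension AS td
LEFT JOIN `order` AS o
       ON o.created_on >= td.db_date
      AND o.created_on < td.db_date + INTERVAL 1 DAY
GROUP BY td.db_date;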
First, you should make sure you have an index on the created_on column in the order table.
However, if you have many records in time_dimension and many records in the order table, it might be hard to optimize the query, because for each record from time_dimension you need to search the order table.
You can also change count(*) into count(order_id) (assuming the primary key in the order table is order_id), or add an extra date-only column to the order table (created_on_date, with an index on it) so that your query could look like this:
SELECT
td.db_date,
(
select count(order_id) from order where order.created_on_date = td.db_date
)as day_contribute
FROM time_dimension as td
However, the execution time might still be too high if you have many records in both tables, so it might be necessary to create one extra table holding the number of orders for each day, and update it from cron or whenever records are added/updated/deleted in the order table.
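A rough sketch of that pre-aggregated table (the names here are made up); it can be refreshed from a cron job or kept current by the code that modifies the order table:
CREATE TABLE order_daily_counts (
    order_date  DATE NOT NULL,
    order_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (order_date)
);

-- Rebuild or upsert the per-day counts, e.g. from a nightly cron job
INSERT INTO order_daily_counts (order_date, order_count)
SELECT DATE(created_on), COUNT(*)
FROM `order`
GROUP BY DATE(created_on)
ON DUPLICATE KEY UPDATE order_count = VALUES(order_count);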

MySQL Multi Table Delete

I have 2 tables setup like this:
Items:
id (int)
name (varchar)
category (int)
last_update (timestamp)
Categories
id (int)
name (varchar)
tree_id (int)
I want to delete all the records from Items whose last_update is NOT today, and whose corresponding categories.tree_id is equal to 1. However, I don't want to delete anything from the Categories table, just the Items table. I tried this:
$query = "DELETE FROM items USING categories, items
WHERE items.category = categories.id
AND categories.tree_id = 1
AND items.last_update != '".date('Y-m-d')."'";
However this just seems to delete EVERY record whose tree_id is 1. It should keep items with a tree_id of 1, as long as their last_update field is today.
What am I missing?
If last_update is a timestamp field, and you are only passing in a date (with no time component) in your where clause, you are, in essence, actually doing this (if passing in 2012-10-24 for example):
AND items.last_update != '2012-10-24 00:00:00'
This means every row without that exact second value in the timestamp would be deleted. You are much better doing something like
AND items.last_update NOT LIKE '".date('Y-m-d')."%'";
Of course you want to make sure you have an index on last_update.
Or, if you don't care about index performance on the last_update field (i.e. you are just doing this as a one-off query and don't want to index this field), you could do this, which may make more logical sense to some:
AND DATE(items.last_update) <> '".date('Y-m-d')."'"
The bottom line is that you need to compare only the date component of the last_update field in some manner.
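If an index on last_update is supposed to help, one way to compare only the date component without wrapping the column in a function is a pair of range conditions. A sketch using the example date from above; in the real query the date string would come from PHP's date('Y-m-d'):
DELETE items
FROM items
INNER JOIN categories ON items.category = categories.id
WHERE categories.tree_id = 1
  AND (items.last_update < '2012-10-24'                        -- strictly before today
       OR items.last_update >= '2012-10-24' + INTERVAL 1 DAY); -- or on a later day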
Sounds like you need a subquery
DELETE FROM items USING categories, items
WHERE items.category = categories.id
AND items.last_update != '".date('Y-m-d')."'"
AND items.id in
(
SELECT items.id FROM items INNER JOIN categories ON items.category = categories.id
)
You say that last_update contains timestamps - I assume a UNIX timestamp. You will never find a record whose timestamp equals the exact moment the comparison is executed, and you cannot compare a timestamp with a formatted date; they will never correspond. So you would need to store the data in the last_update column in date (Y-m-d) format in order to compare for inequality.

Is my JOIN + GROUP BY ... HAVING COUNT query correct?

I'm new to SQL and I want to implement the following query:
I've got two tables, LicenseTbl and UnlockTbl:
LicenseTbl contains information about a purchased software license:
LicenseID, ProgramID, Owner, Location, OrderNo, BlockTime
UnlockTbl contains information about a specific software registration:
UnlockID, LicenseID (foreign key into LicenseTbl), Timestamp, SerialNo, Key, UninstallTime
where BlockTime and UninstallTime contain a timestamp if the license was blocked or the software uninstalled and NULL otherwise.
I want to devise a query that gives me ALL LicenseIDs for which the following conditions hold:
belongs to a given customer,
is not blocked,
is either not listed in the UnlockTbl or there are < X different SerialNo's in lines which are not marked as uninstalled.
I have written this, but I'm not sure if it is absolutely correct (it's one of my first SQL queries ever):
SELECT LicenseID FROM LicenseTbl
JOIN UnlockTbl
ON (LicenseTbl.LicenseID = UnlockTbl.LicenseID)
WHERE LicenseTbl.OrderNo = '$givenOrderNo'
AND LicenseTbl.Owner = '$givenOwner'
AND LicenseTbl.Location = '$givenLocation'
AND LicenseTbl.BlockTime IS NULL
AND UnlockTbl.UninstallTime IS NULL
GROUP BY LicenseTbl.LicenseID, UnlockTbl.Key
HAVING COUNT(*) < $X
(which is supposed to mean: list all licenses which have been used fewer than X times simultaneously. I would prefer those that have been used the least to come first, but I don't know how to sort like that.)
This is a good start, but I would change the query to the following...
SELECT
LicenseID
FROM
LicenseTbl
LEFT JOIN
UnlockTbl
ON UnlockTbl.LicenseID = LicenseTbl.LicenseID
AND UnlockTbl.UninstallTime IS NULL
WHERE
LicenseTbl.OrderNo = '$givenOrderNo'
AND LicenseTbl.Owner = '$givenOwner'
AND LicenseTbl.Location = '$givenLocation'
AND LicenseTbl.BlockTime IS NULL
GROUP BY
LicenseTbl.LicenseID
HAVING
COUNT(DISTINCT UnlockTbl.SerialNo) < $X
ORDER BY
COUNT(DISTINCT UnlockTbl.SerialNo)
1). LEFT JOIN
A LEFT JOIN ensures that all rows in LicenseTbl are returned, even if there are no matches in the UnlockTbl table. (If there are no matches, the UnlockTbl table's values are all represented as NULL.)
2). UnlockTbl.UninstallTime IS NULL in the JOIN and not the WHERE
The WHERE clause is applied after the JOIN. This means that any records in UnlockTbl where UninstallTime has a real value (NOT NULL) get joined and then filtered out. This in turn means that if all the relevant records in UnlockTbl have a non-NULL UninstallTime, all the rows for that license get filtered out, and the license disappears from the result instead of being returned with a count of zero.
3). GROUP BY on just the license, not the Key.
Simply, I don't know why you had it there, and it doesn't appear in the English description of what you want.
As you want a list of LicenseIDs, grouping by only that field ensures that you get one record per LicenseID.
4). HAVING clause modified to look at COUNT(DISTINCT SerialNo)
COUNT(*) counts all records. Even if there was no match (All the UnlockTbl values appearing as NULL), this would return 1.
COUNT(SerialNo) counts only records where SerialNo is NOT NULL. If there was no match (All the UnlockTbl values appearing as NULL), this would return 0.
COUNT(DISTINCT SerialNo) also counts only records where SerialNo is NOT NULL, but treats duplicates of the same value as just 1 entry (a small illustration follows this list).
5). ORDER BY COUNT(DISTINCT SerialNo)
Takes the same value as is being filtered in the HAVING clause, and orders by it.
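As a small illustration of the three COUNT variants: imagine one license with two UnlockTbl rows that share the same SerialNo, and a second license with no UnlockTbl rows at all. The hypothetical query below would return 2 / 2 / 1 for the first license and 1 / 0 / 0 for the second:
SELECT l.LicenseID,
       COUNT(*)                   AS all_joined_rows,  -- also counts the NULL row produced by the LEFT JOIN
       COUNT(u.SerialNo)          AS non_null_serials, -- skips NULLs
       COUNT(DISTINCT u.SerialNo) AS distinct_serials  -- skips NULLs and collapses duplicates
FROM LicenseTbl l
LEFT JOIN UnlockTbl u ON u.LicenseID = l.LicenseID
GROUP BY l.LicenseID;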

Comparing rows in table for differences between fields

I have a table (client) with 20+ columns that is mostly historical data.
Something like:
id|clientID|field1|field2|etc...|updateDate
If my data looks like this:
10|12|A|A|...|2009-03-01
11|12|A|B|...|2009-04-01
19|12|C|B|...|2009-05-01
21|14|X|Y|...|2009-06-11
27|14|X|Z|...|2009-07-01
Is there an easy way to compare each row and highlight the differences in the fields?
I need to be able to simply highlight the fields that changed between revisions (except for the key and the date of course)
There may be multiple fields updated in each new row (or just one).
This would be on a client by client basis so I could select on the clientID to filter.
It could be on the server or client side, which ever is easiest.
More details
I should expand my description a little:
I'm looking to just see if there was a difference between the fields (one is different in any way). Some of the data is numeric, some is text others are dates. A more complete example might be:
10|12|A|A|F|G|H|I|J|...|2009-03-01
11|12|A|B|F|G|H|I|J|...|2009-04-01
19|12|C|B|F|G|Z|I|J|...|2009-05-01 ***
21|14|X|Y|L|M|N|O|P|...|2009-06-11
27|14|X|Z|L|M|N|O|P|...|2009-07-01
I'd want to be able to display each row for clientID 12 and highlight B from row 11 and C & Z from row 19.
Any expression in SQL must reference columns only in one row (barring subqueries).
A JOIN can be used to make two different rows into one row of the result set.
So you can compare values on different rows by doing a self-join. Here's an example that shows joining each row to every other row associated with the same client (excluding a join of a row to itself):
SELECT c1.*, c2.*
FROM client c1
JOIN client c2 ON (c1.clientID = c2.clientID AND c1.id <> c2.id)
Now you can write expressions that compare columns. For example, to restrict the above query to those where field1 differs:
SELECT c1.*, c2.*
FROM client c1
JOIN client c2 ON (c1.clientID = c2.clientID AND c1.id <> c2.id)
WHERE c1.field1 <> c2.field1;
You don't specify what kinds of comparisons you need to make, so I'll leave that to you. The key point is that in general, you can use a self-join to compare rows in a given table.
Re your comments and clarification: Okay, so your "difference" is not simply by value but by ordinal position of the row. Remember that relational databases don't have a concept of row number; they only have an order of rows with respect to some order you must specify in an ORDER BY clause. Don't confuse the "id" pseudokey with a row number; the numbers are assigned as monotonically increasing only by coincidence of their implementation.
In MySQL, you could take advantage of user-defined variables to achieve the effect you're looking for. Order the query by clientId and then by id, and track values per column in MySQL user variables. When the value in a current row differs from the value in the variable, do whatever highlighting you were going to do. I'll show an example for one field:
SET @clientid = -1, @field1 = '';
SELECT id, clientId, field1, @clientid, @field1,
  IF(@clientid <> clientid,
     ((@clientid := clientid) AND (@field1 := field1)) = NULL,
     IF(@field1 <> field1,
        (@field1 := field1),
        NULL
     )
  ) AS field1_changed
FROM client c
ORDER BY clientId, id;
Note this solution is not really different from just selecting all rows with plain SQL, and tracking the values with application variables as you fetch rows.
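On MySQL 8.0 or later, window functions can replace the user-variable trick; this is only a sketch for a single field, not the approach above (LAG() returns NULL for the first row of each client, so the comparison yields NULL there):
SELECT id,
       clientID,
       field1,
       LAG(field1) OVER (PARTITION BY clientID ORDER BY id) AS prev_field1,       -- value from the previous row for the same client
       field1 <> LAG(field1) OVER (PARTITION BY clientID ORDER BY id) AS field1_changed
FROM client
ORDER BY clientID, id;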
