JOIN or 2 queries - 1 large table, 1 small, hardware limited - php

I have a page in which there is a <select> menu, which contains all of the values from a small table (229 rows), such that <option value='KEY'>VALUE</option>.
This select menu is a filter for a query which runs on a large table (3.5M rows).
In the large table is a foreign key which references KEY from small table.
However, in the results of the large table query, I also need to display the relative VALUE from the small table.
I could quite easily do an INNER JOIN to retrieve the results, OR I could do a separate 'pre'-query to my smaller table, fetch its values into an array, and then let the application look up the VALUE for the small-table key in each result.
The application is written in PHP.
Hardware resources ARE an issue (I cannot upgrade to a higher instance right now; my boss won't approve it) - I am running this on a t2.micro RDS instance on Amazon Web Services.
I have added both single-column and covering indexes on the columns in the WHERE & HAVING clauses, and my server reports that I have 46 MB of RAM available.
Given the above, I know that JOIN can be expensive especially on big tables. Does it just make sense here to do 2 queries, and let the application handle some of the work, until I can negotiate better resources?
EDIT:
No JOIN: 6.9 sec
SELECT nationality_id, COUNT(DISTINCT(txn_id)) as numtrans,
SUM(sales) as sales, SUM(units) as units, YrQtr
FROM 1_txns
GROUP BY nationality_id;
EXPLAIN
'1', 'SIMPLE', '1_txns', 'index', 'covering,nat', 'nat', '5', NULL, '3141206', NULL
With JOIN: 59.03 sec
SELECT 4_nationality.nationality, COUNT(DISTINCT(txn_id)) as numtrans,
SUM(sales) as sales, SUM(units) as units, YrQtr
FROM 1_txns INNER JOIN 4_nationality USING (nationality_id)
GROUP BY nationality_id
HAVING YrQtr LIKE :period;
EXPLAIN
'1', 'SIMPLE', '4_nationality', 'ALL', 'PRIMARY', NULL, NULL, NULL, '229', 'Using temporary; Using filesort'
'1', 'SIMPLE', '1_txns', 'ref', 'covering,nat', 'nat', '5', 'reports.4_nationality.nationality_id', '7932', NULL
Schema is
Table 1_txns (txn_id, nationality_id, yrqtr, sales, units)
Table 4_nationality (nationality_id, nationality)
I have separate indexes on nationality_id, txn_id, and yrqtr in my large transactions table, and just a primary key index on my small table.
Something strange also: the query WITHOUT the join is missing a row from its results!

If your lookup "menu" table is only the 229 rows as stated, has a unique key, and has an index on (key, value), the join cost would be negligible... especially if you're only querying the results based on a single key anyhow.
The bigger question to me is your table of 3.5 million records. At 229 "menu" items, it would be returning an average of over 15k records each time. And I am sure that not every category is evenly balanced... some could have a few hundred or thousand entries, others could have 30k+ entries. Are there some other criteria that would allow smaller subsets to be returned? Obviously there is not enough info here to quantify.
Now, after seeing your revised post while entering this, I see you are trying to get aggregations. The table is otherwise fixed for historical data. I would suggest a summary table maintained on a per Nationality/YrQtr basis. This way, you can query it directly if the period is PRIOR to the current period in question. If it is the current period, then sum the aggregates from production. Again, since transactions won't change historically, neither would their counts, and you would have an immediate response from the pre-summary table.
Feedback
As for how / when to implement a summary table. I would create the table with the respective columns you need... Nationality, Period (Yr/Month), and respective counts for distinct transactions, etc.
I would then pre-aggregate once for all your existing data for everything UP TO but not including the current period (Yr/Month). Now you have your baseline established in summary.
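A minimal sketch of what that summary table and the one-time baseline load could look like, using the column names from the posted schema (the txn_summary name, the column types, and the example period value are assumptions):
CREATE TABLE txn_summary (
    nationality_id INT NOT NULL,
    yrqtr VARCHAR(10) NOT NULL,            -- assumed type; match whatever 1_txns.yrqtr actually is
    numtrans INT NOT NULL DEFAULT 0,
    totsales DECIMAL(12,2) NOT NULL DEFAULT 0,
    totunits INT NOT NULL DEFAULT 0,
    PRIMARY KEY (nationality_id, yrqtr)    -- one row per Nationality / YrQtr
);

-- Baseline: everything UP TO but not including the current period
INSERT INTO txn_summary (nationality_id, yrqtr, numtrans, totsales, totunits)
SELECT nationality_id, yrqtr, COUNT(DISTINCT txn_id), SUM(sales), SUM(units)
FROM 1_txns
WHERE yrqtr < '2016-Q1'                    -- example value; use your actual current period
GROUP BY nationality_id, yrqtr;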
Then, add a trigger to your transaction table on insert, and process something like the following... (AND NOTE, THIS IS NOT AN ACTUAL TRIGGER, BUT THE CONTEXT OF WHAT TO DO)
update summaryTable
set numTrans = numTrans + 1,
TotSales = TotSales + NEWENTRY.Sales,
TotUnits = TotUnits + NEWENTRY.Units
where
Nationality = NEWENTRY.Nationality
AND YrQtr = NEWENTRY.YrQtr
if # records affected by the update = 0
Insert into SummaryTable
( Nationality,
YrQtr,
NumTrans,
TotSales,
TotUnits )
values
( NEWENTRY.Nationality,
NEWENTRY.YrQtr,
1,
NEWENTRY.Sales,
NEWENTRY.Units )
Now, your aggregates will ALWAYS be in sync in the summary table after EVERY record inserted into the transaction table. You can ALWAYS query this summary table instead of the full transaction table. If you never have activity for a given Nationality / YrQtr, no record will exist.
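For reference, here is a minimal sketch of the same logic as an actual MySQL trigger, using INSERT ... ON DUPLICATE KEY UPDATE instead of the update-then-insert check (the trigger name and the txn_summary table from the sketch above are assumptions):
CREATE TRIGGER txns_after_insert AFTER INSERT ON 1_txns
FOR EACH ROW
    -- Relies on the PRIMARY KEY (nationality_id, yrqtr) on txn_summary
    INSERT INTO txn_summary (nationality_id, yrqtr, numtrans, totsales, totunits)
    VALUES (NEW.nationality_id, NEW.yrqtr, 1, NEW.sales, NEW.units)
    ON DUPLICATE KEY UPDATE
        numtrans = numtrans + 1,
        totsales = totsales + NEW.sales,
        totunits = totunits + NEW.units;
As with the pseudocode above, numtrans counts inserted rows, so it only matches COUNT(DISTINCT txn_id) if each transaction is stored as a single row.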

First, move the HAVING to WHERE so that the rest of the query has less to do. Second, delay the lookup of nationality until after the GROUP BY:
SELECT
( SELECT nationality
FROM 4_nationality
WHERE nationality_id = t.nationality_id
) AS nationality,
COUNT(DISTINCT(txn_id)) as numtrans,
SUM(sales) as sales,
SUM(units) as units,
YrQtr
FROM 1_txns AS t
WHERE YrQtr LIKE :period
GROUP BY nationality_id;
If possible, avoid wild cards and simply do YrQtr = :period. That would allow INDEX(YrQtr, nationality_id) for even more performance.
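For completeness, a sketch of that index on the posted 1_txns table (the index name is arbitrary):
-- Useful once YrQtr is filtered with = rather than LIKE with a wildcard
ALTER TABLE 1_txns ADD INDEX yrqtr_nat (YrQtr, nationality_id);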

Related

Improving the efficiency of a query that uses COUNT, MIN and MAX

I recently posted about what essentially boils down to the same issue but using a different database technology (meaning the solution found - which involved using ROW_NUMBER() - is not applicable here).
Let's say I have a table in a MySQL database called "Customers". I also have a table called "Orders", each row of which contains a "CustomerID". What I want to do is generate a summary for each "Customer" of how many orders they have made, as well as when their first and last "Order" took place.
The query I have been using for this is as follows:
SELECT
Customer.CustomerID,
Customer.Name,
COUNT(Orders.OrderID) AS Orders,
MIN(Orders.Timestamp) AS OldestOrder,
MAX(Orders.Timestamp) AS NewestOrder
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
GROUP BY Orders.CustomerID
This query gets exactly what I want, but on a database containing several hundred thousand orders, it can take 2-3 seconds to execute.
By adding an index to the "Orders" table that includes "CustomerID" and "Timestamp", this time is brought down to around 1 second or less, but this is still unacceptable. The list of customers this query will be executed for is usually relatively small, so looping through each customer and performing individual queries to obtain the data is a quicker option, but it is much messier.
Are there further index opportunities I'm not seeing, or does this query need to function in a totally different way? If I had MSSQL's ROW_NUMBER() functionality at my disposal this query could work incredibly quickly...
Thanks in advance :)!
EDIT #1: EXPLAIN SELECT shows:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE Customers ALL PRIMARY NULL NULL NULL 10 Using temporary; Using filesort
1 SIMPLE Orders ref CustomerID_2 CustomerID_2 4 Customers.CustomerID 4038 Using where
SELECT
Customers.CustomerID,
Customers.Name,
COUNT(Orders.OrderID) AS Orders,
MIN(Orders.Timestamp) AS OldestOrder,
MAX(Orders.Timestamp) AS NewestOrder
FROM Customers
INNER JOIN Orders ON Orders.CustomerID = Customers.CustomerID
GROUP BY Customers.CustomerID
It appears the answer was right in front of me! Replacing the index I mentioned in the OP, which included "CustomerID" and "Timestamp", with one that also included "OrderID" brought the query down to around 0.07 seconds! This was then further reduced by roughly 50% by using the "Customers" table as the focal point, as outlined in Jignesh Patel's answer.
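For anyone else landing here, a sketch of the covering index described above (the index name is made up; the columns are as described):
-- Lets the GROUP BY, COUNT, MIN and MAX all be answered from the index alone
ALTER TABLE Orders ADD INDEX idx_customer_ts_order (CustomerID, Timestamp, OrderID);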

SQL INSERT INTO SELECT and Return the SELECT data to Create Row View Counts

So I'm creating a system that will be pulling 50-150 records at a time from a table and displaying them to the user, and I'm trying to keep a view count for each record.
I figured the most efficient way would be to create a MEMORY table that I pull the row IDs into with an INSERT INTO ... SELECT, and then have a cron job that runs regularly to aggregate the view counts, clear out the memory table, and update the original table with the latest counts. This avoids constantly updating the table that will likely be getting accessed the most, so I'm not locking 150 rows at a time with each query (or the whole table, if I'm using MyISAM).
Basically, the method explained here.
However, I would of course like to do this at the same time as I pull the records' information for viewing, and I'd like to avoid running a second, separate query just to get the same set of data for its counts.
Is there any way to SELECT a dataset, return that dataset, and simultaneously insert a single column from that dataset into another table?
It looks like PostgreSQL might have something similar to what I want with the RETURNING keyword, but I'm using MySQL.
First of all, I would not add a counter column to the Main table. I would create a separate Audit table that would hold the ID of the item from the Main table plus at least a timestamp of when that ID was requested. In essence, the Audit table would store a history of requests. With this approach you can easily generate much more interesting reports. You can always calculate grand totals per item, and you can also calculate summaries by day, week, month, etc. per item or across all items. Depending on the volume of data you can periodically delete Audit entries older than some threshold (a month, a year, etc.).
Also, you can easily store more information in the Audit table as needed, for example, a user ID to calculate stats per user.
To populate the Audit table "automatically" I would create a stored procedure. The client code would call this stored procedure instead of performing the original SELECT. The stored procedure would return exactly the same result as the original SELECT does, but would also add the necessary details to the Audit table, transparently to the client code.
So, let's assume that the Audit table looks like this:
CREATE TABLE AuditTable
(
ID int
IDENTITY -- SQL Server
SERIAL -- Postgres
AUTO_INCREMENT -- MySQL
NOT NULL,
ItemID int NOT NULL,
RequestDateTime datetime NOT NULL
)
and your main SELECT looks like this:
SELECT ItemID, Col1, Col2, ...
FROM MainTable
WHERE <complex criteria>
To perform both the INSERT and the SELECT in one statement I'd use the OUTPUT clause in SQL Server and the RETURNING clause in Postgres; in MySQL - ??? I don't think it has anything like this. So, the MySQL procedure would have several separate statements.
MySQL
First do your SELECT and insert the results into a temporary (possibly MEMORY) table. Then copy the item IDs from the temporary table into the Audit table. Then SELECT from the temporary table to return the result to the client.
CREATE TEMPORARY TABLE TempTable
(
ItemID int NOT NULL,
Col1 ...,
Col2 ...,
...
)
ENGINE = MEMORY
SELECT ItemID, Col1, Col2, ...
FROM MainTable
WHERE <complex criteria>
;
INSERT INTO AuditTable (ItemID, RequestDateTime)
SELECT ItemID, NOW()
FROM TempTable;
SELECT ItemID, Col1, Col2, ...
FROM TempTable
ORDER BY ...;
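A rough sketch of wrapping those three statements into a single stored procedure, as suggested earlier in this answer (the procedure name, the concrete columns, and the WHERE condition are stand-ins for the placeholders above):
DELIMITER $$
CREATE PROCEDURE FetchAndAuditItems()
BEGIN
    DROP TEMPORARY TABLE IF EXISTS TempTable;

    -- 1. Run the main SELECT into a temporary MEMORY table
    CREATE TEMPORARY TABLE TempTable ENGINE = MEMORY
        SELECT ItemID, Col1, Col2
        FROM MainTable
        WHERE Col1 > 0;                      -- stand-in for the <complex criteria>

    -- 2. Record which items were requested
    INSERT INTO AuditTable (ItemID, RequestDateTime)
    SELECT ItemID, NOW()
    FROM TempTable;

    -- 3. Return the result set to the client
    SELECT ItemID, Col1, Col2
    FROM TempTable
    ORDER BY ItemID;

    DROP TEMPORARY TABLE TempTable;
END$$
DELIMITER ;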
SQL Server (just to tease you: this single statement does both the INSERT and the SELECT)
MERGE INTO AuditTable
USING
(
SELECT ItemID, Col1, Col2, ...
FROM MainTable
WHERE <complex criteria>
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
INSERT
(ItemID, RequestDateTime)
VALUES
(Src.ItemID, GETDATE())
OUTPUT
Src.ItemID, Src.Col1, Src.Col2, ...
;
You can leave the Audit table as it is, or you can set up cron to summarize it periodically. It really depends on the volume of data. In our system we store individual rows for a week, plus we summarize stats per hour and keep them for 6 weeks, plus we keep a daily summary for 18 months. But, importantly, all these summaries are separate Audit tables; we don't keep auditing information in the Main table, so we don't need to update it.
Joe Celko explained it very well in SQL Style Habits: Attack of the Skeuomorphs:
Now go to any SQL Forum text search the postings. You will find
thousands of postings with DDL that include columns named createdby,
createddate, modifiedby and modifieddate with that particular
meta data on the end of the row declaration. It is the old mag tape
header label written in a new language! Deja Vu!
The header records appeared only once on a tape. But these meta data
values appear over and over on every row in the table. One of the main
reasons for using databases (not just SQL) was to remove redundancy
from the data; this just adds more redundancy. But now think about
what happens to the audit trail when a row is deleted? What happens to
the audit trail when a row is updated? The trail is destroyed. The
audit data should be separated from the schema. Would you put the log
file on the same disk drive as the database? Would an accountant let
the same person approve and receive a payment?
You're essentially asking if MySQL supports a SELECT trigger. It doesn't. You'll need to do this as two queries; however, you can stick those inside a stored procedure - then you can pass in the range you're fetching and have it both return the results AND do the INSERT into the other table.
Updated answer with skeleton example for stored procedure:
DELIMITER $$
CREATE PROCEDURE `FetchRows`(IN StartID INT, IN EndID INT)
BEGIN
UPDATE Blah SET ViewCount = ViewCount+1 WHERE id >= StartID AND id <= EndID;
# ^ Assumes counts are stored in the same table. If they're in a separate table, do an INSERT INTO ... ON DUPLICATE KEY UPDATE ViewCount = ViewCount+1 instead.
SELECT * FROM Blah WHERE id >= StartID AND id <= EndID;
END$$
DELIMITER ;
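If the counts live in a separate table, as the comment in the procedure suggests, the UPDATE line could be swapped for something like this (the ViewCounts table and its columns are assumptions; StartID and EndID are the procedure's parameters):
-- Assumes: CREATE TABLE ViewCounts (id INT PRIMARY KEY, ViewCount INT NOT NULL DEFAULT 0);
INSERT INTO ViewCounts (id, ViewCount)
SELECT id, 1 FROM Blah WHERE id >= StartID AND id <= EndID
ON DUPLICATE KEY UPDATE ViewCount = ViewCount + 1;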

JOIN query too slow on real database, on small one it runs fine

I need help with this MySQL query, which takes too long to execute or does not finish at all.
(What I am trying to do is part of a more complex problem, where I want to create a PHP cron script that will execute a few heavy queries, calculate data from the results returned, and then store that data in the database for further, more convenient use. Most likely I will ask a question here about that process.)
First, let's try to solve one of the problems with these heavy queries.
Here is the thing:
I have a table, users_bonitet. This table has the fields: id, user_id, bonitet, tstamp.
First important note: when I say user, please understand that users are actually companies, not people. So user.id is the id of some company, but for other reasons the table that I am using here is called "users".
The three key fields in the users_bonitet table are: user_id (referencing user.id), bonitet (representing the strength of the user; it can have 3 values, 1 - 2 - 3, where 3 is the best), and tstamp (storing the time of the bonitet insert; every time the bonitet value changes for some user, a new row is inserted with the tstamp of that insert and, of course, the new bonitet value). So basically a user can have a bonitet of 1, indicating that it is in a bad situation, but after some time it can change to 3, indicating that it is doing great, and the time of that change is stored in tstamp.
Now, I will just list the other tables that we need to use in the query, and then I will explain why. The tables are: user, club, club_offer and club_territories.
Some users (companies) are members of a club. A member of the club can have some club offers (it presents its products to the public and to other club members) and it operates in some territory.
What I need to do is get the bonitet value for every club offer (made by some user who is a member of a club), but only for a specific territory with an id of 1100000. Since bonitet values change over time for each user, I need to get only the latest one. So if some user had a bonitet of 1 on 21.01.2012, but later on 26.05.2012 it changed to 2, I need to get only 2, since that is the current value.
I made an SQL Fiddle with an example db schema and the query that I am using right now. On this small database, the query does what I want and it is fast, but on the real database it is very slow, and sometimes does not finish at all.
See it here: http://sqlfiddle.com/#!9/b0d98/2
My question is: am I using the wrong query to get all this data? I am getting the right result, but maybe my query is bad and that is why it executes so slowly? How can I speed it up? I have tried adding indexes using phpMyAdmin, but it didn't help very much.
Here is my query:
SELECT users_bonitet.user_id, users_bonitet.bonitet, users_bonitet.tstamp,
club_offer.id AS offerId, club_offer.rank
FROM users_bonitet
INNER JOIN (
SELECT max( tstamp ) AS lastDate, user_id
FROM users_bonitet
GROUP BY user_id
)lastDate ON users_bonitet.tstamp = lastDate.lastDate
AND users_bonitet.user_id = lastDate.user_id
JOIN users ON users_bonitet.user_id = users.id
JOIN club ON users.id = club.user_id
JOIN club_offer ON club.id = club_offer.club_id
JOIN club_territories ON club.id = club_territories.club_id
WHERE club_territories.territory_id = 1100000
So I am selecting the bonitet values for all club offers made by users that are members of a club and operate in the territory with an id of 1100000. The important thing is that I am selecting club_offer.id AS offerId, because I need to use that offerId in my application code so I can do some calculations based on the bonitet values returned for each offer, and insert the calculated data into the field "club_offer.rank" for each row with the id of offerId.
Your query looks fine. I suspect your query performance may be improved if you add a compound index to help the subquery that finds the latest entry from users_bonitet for each user.
The subquery is:
SELECT max( tstamp ) AS lastDate, user_id
FROM users_bonitet
GROUP BY user_id
If you add (user_id, tstamp) as an index to this table, that subquery can be satisfied with a very efficient loose index scan.
ALTER TABLE users_bonitet ADD KEY maxfinder (user_id, tstamp);
Notice that if this users_bonitet table had an auto-incrementing id number in it, your subquery could be refactored to use that instead of tstamp. That would eliminate the possibility of duplicates and be even more efficient, because there's a unique id for joining. Like so:
FROM users_bonitet
INNER JOIN (
SELECT MAX(id) AS id
FROM users_bonitet
GROUP BY user_id
) ubmax ON users_bonitet.id = ubmax.id
In this case your compound index would be (user_id, id).
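For completeness, that index would be:
-- Compound index for the MAX(id)-per-user variant above (key name is arbitrary)
ALTER TABLE users_bonitet ADD KEY maxfinder_id (user_id, id);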
Pro tip: Don't add lots of indexes unless you know you need them. It's a good idea to read up on how indexes can help you, for example: http://use-the-index-luke.com/

Slow Query Log - Rows Examined over 10 million, EXPLAIN shows under 10,000 - Why so high?

I have a database I'm working on that has some queries showing up in the slow query log.
There are 2 tables:
table1 is a table of businesses with standard info: name, phone, address, city, state, zip, etc. There is also a field for category. There are millions and millions of rows in this table.
table2 is a table of categories. There are only a couple of hundred rows.
The query in question is below:
# Query_time: 20.446852 Lock_time: 0.000044 Rows_sent: 20 Rows_examined: 11410654
use my_database;
SET timestamp=1331074576;
SELECT table1.id, name, phone, address, city, state, zip
FROM table1
INNER JOIN table2 ON table2.label=table1.category
WHERE state = 'tx' and city = 'San Antonio'
and table2.label LIKE 'Health Care & Medical%' group by table1.id limit 0,20;
An EXTENDED EXPLAIN on the query looks like this:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table1 index indx_state,indx_city,index_category,cat_keywords PRIMARY 4 NULL 5465 946.92 Using where
1 SIMPLE table2 ref category_label category_label 602 my_table.table1.category 1 100.00 Using where; Using index
Here's the problem: this query takes 20 seconds to run, shows up in the slow query log, and makes the HTML page take forever to load.
There are over 10 million records in table1, but 'San Antonio' only has 70,000 of them. The total number of records matching the query (ignoring the limit) is only a couple of thousand. Indexes are set up on everything, and the EXPLAIN appears to reflect this fact.
Why are the rows examined showing 11 million?
I feel this must be part of the reason the query is dragging so much.
Thanks as always....
I did follow some advice on this post and created an index on (city, state). It didn't really help my performance, but another thing ended up helping. Quite possibly the fix I found was also made more effective by having the index on both columns.
The solution however, was to add the USE INDEX:
http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
By defining which index to use, the query time was reduced from 30 seconds to 1.5 seconds.
I don't know why it worked, but it did.
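For illustration, the hint ends up looking roughly like this (the index name is just an example taken from the EXPLAIN output above; use whichever index actually helps):
SELECT table1.id, name, phone, address, city, state, zip
FROM table1 USE INDEX (indx_city)
INNER JOIN table2 ON table2.label = table1.category
WHERE state = 'tx' AND city = 'San Antonio'
AND table2.label LIKE 'Health Care & Medical%'
GROUP BY table1.id LIMIT 0,20;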
Seems like you need a composite index on the state and city fields.
ALTER TABLE table1 ADD INDEX(city, state);
I am using city as the first field because I assume it will provide better selectivity. Also, you could use lookup tables and foreign keys on table1 to replace the string values. Performance will benefit and table size will be reduced by not repeating the same string values over and over in the same table.
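A rough sketch of the lookup-table idea, with made-up names:
-- Hypothetical city lookup; table1 then stores a small city_id instead of
-- repeating the city/state strings on every row
CREATE TABLE cities (
    city_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    city    VARCHAR(100) NOT NULL,
    state   CHAR(2) NOT NULL,
    UNIQUE KEY city_state (city, state)
);

ALTER TABLE table1
    ADD COLUMN city_id INT UNSIGNED NULL,
    ADD INDEX idx_city_id (city_id);

-- Backfill from the existing string columns
UPDATE table1 JOIN cities USING (city, state)
SET table1.city_id = cities.city_id;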

MYSQL query optimization

I'm trying to optimize a report query run on an ecommerce site. I'm pretty sure that I'm doing something stupid, since this query shouldn't be taking nearly as long to run as it does.
The query in question is:
SELECT inventories_name, inventories_code, SUM(shop_orders_inventories_qty) AS qty,
SUM(shop_orders_inventories_price) AS tot_price, inventories_categories_name,
inventories_price_list, inventories_id
FROM shop_orders
LEFT JOIN shop_orders_inventories ON (shop_orders_id = join_shop_orders_id)
LEFT JOIN inventories ON (join_inventories_id = inventories_id)
WHERE {$date_type} BETWEEN '{$start_date}' AND '{$end_date}'
AND shop_orders_x_response_code = 1
GROUP BY join_inventories_id, join_shop_categories_id
{$order}
{$limit}
It's basically trying to get total sales per item over a period of time; values in curly brackets are filled in via a form. It works fine for a period of a couple days, but querying a time interval of a week or more can take 30 seconds+.
I feel like it's joining way too many rows in order to calculate the aggregate values and sucking up huge amounts of memory, but I'm not sure how to limit it.
Note - I realize that I'm selecting fields which aren't in the group by, but they correspond 1-1 with inventory ID, which is in the group by.
Any suggestions?
-- Edit --
The current indices are:
inventories:
join_categories - BTREE
inventories_name, inventories_code, inventories_description - FULLTEXT
shop_orders_inventories:
shop_orders_inventories_id - BTREE
shop_orders:
shop_orders_id - BTREE
Two sequential LEFT JOINs will take quite a long time on a big table. Try to use JOIN instead of LEFT JOIN (unless you have records in shop_orders with no matching records in shop_orders_inventories or inventories), or split this query into a couple of smaller ones. Also, by using SUM and GROUP BY you are forcing MySQL to create temporary tables - you might want to increase the MySQL cache so those tables fit into memory (otherwise MySQL will dump them to disk, which will also increase SQL execution time).
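If the implicit temporary tables are what is spilling to disk, the relevant limits can be raised per session (the values here are only an example and must fit in the RAM you actually have):
-- MySQL uses the smaller of these two when deciding whether an in-memory
-- temporary table must be converted to an on-disk one
SET SESSION tmp_table_size      = 64 * 1024 * 1024;
SET SESSION max_heap_table_size = 64 * 1024 * 1024;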
The first and foremost rule to indexing is... index the columns that you will search on!
For each possible value of {$date_type}, create an index for that date column.
Once you have lots of data in the table (say 2 years or 100 weeks), a single week's data is 1% of the index, so it becomes a good starting point.
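In practice that means one index per date column the form can pick as {$date_type}; for example (the column names here are assumptions, since the real ones aren't shown):
-- One single-column index per possible {$date_type} value
ALTER TABLE shop_orders
    ADD INDEX idx_order_date (shop_orders_date),
    ADD INDEX idx_paid_date (shop_orders_paid_date);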
Even though MySQL allows non-aggregates in the SELECT clause, I personally would sync the two:
SELECT inventories_name, inventories_code,
SUM(shop_orders_inventories_qty) AS qty,
SUM(shop_orders_inventories_price) AS tot_price,
inventories_categories_name, inventories_price_list, inventories_id
FROM ...
GROUP BY inventories_id, join_shop_categories_id, inventories_name,
inventories_code, inventories_categories_name, inventories_price_list
...
