In short my question is this: Why is this
SELECT r.x, r.y FROM `base` AS r
WHERE r.l=50 AND r.n<>'name' AND 6=(SELECT COUNT(*) FROM surround AS d
WHERE d.x >= r.x -1 AND d.x <= r.x +1 AND
d.y>=r.y -1 AND d.y<=r.y +1 AND d.n='name')
a lot slower than this:
$q="SELECT x,y FROM `base` WHERE l=50 AND n<>'name'";
$sr=mysql_query($q);
if(mysql_num_rows($sr)>=1){
while($row=mysql_fetch_assoc($sr)){
$q2="SELECT x,y FROM surround WHERE n='name' AND x<=".
($row["x"]+1)." AND x>=".($row["x"]-1).
" AND y<=".($row["y"]+1)." AND y>=".($row["y"]-1)." ";
$sr2=mysql_query($q2);
if(mysql_num_rows($sr2)==6){
echo $row['x'].','.$row['y']."\n";
}
}
}
The PHP version takes about 300 ms to complete. The "pure SQL" version, whether run via phpMyAdmin or via PHP, takes roughly 5 seconds (and even 13 seconds when I used BETWEEN for the x and y ranges).
I would expect the SQL version to be faster in general, or at least more efficient, so I wonder: am I doing something wrong, or does this make sense?
EDIT: I added the structure of both tables, as requested:
CREATE TABLE IF NOT EXISTS `base` (
`bid` int(12) NOT NULL COMMENT 'Base ID',
`n` varchar(25) NOT NULL COMMENT 'Name',
`l` int(3) NOT NULL,
`x` int(3) NOT NULL,
`y` int(3) NOT NULL,
`LastModified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY `coord` (`x`,`y`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `surround` (
`bid` int(12) NOT NULL COMMENT 'Base ID',
`n` varchar(25) NOT NULL COMMENT 'Name',
`l` int(3) NOT NULL,
`x` int(3) NOT NULL,
`y` int(3) NOT NULL,
`LastModified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY `coord` (`x`,`y`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
EDIT 2:
EXPLAIN SELECT for the query above: (the key coord is the combination of x and y)
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY r range coord,n coord 4 NULL 4998 Using where
2 DEPENDENT SUBQUERY d ALL coord NULL NULL NULL 57241 Range checked for each record (index map: 0x1)
You are joining the two tables yourself, so you are acting as the optimizer: you chose `base` as the outer table of a nested-loop join. I guess MySQL's optimizer produced an execution plan that differs from yours.
That is why people ask for the EXPLAIN output: to see the join order and to check whether an index was used.
By the way, can you try this query?
SELECT r.x, r.y
FROM `base` AS r, surround AS d
WHERE r.l=50
AND r.n<>'name'
AND d.x >= r.x -1
AND d.x <= r.x +1
AND d.y>=r.y -1
AND d.y<=r.y +1
AND d.n='name'
GROUP BY r.x, r.y
HAVING COUNT(*) = 6
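To check that the join with GROUP BY ... HAVING returns the same rows as the correlated subquery, here is a minimal sketch in Python. It assumes SQLite as a stand-in for MySQL and uses a few made-up rows, so it only demonstrates that the two forms agree logically; it says nothing about which one MySQL executes faster.

```python
import sqlite3

# SQLite is used only as a stand-in for MySQL; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base     (n TEXT, l INTEGER, x INTEGER, y INTEGER, UNIQUE (x, y));
    CREATE TABLE surround (n TEXT, l INTEGER, x INTEGER, y INTEGER, UNIQUE (x, y));
    INSERT INTO base VALUES ('other', 50, 5, 5), ('other', 50, 20, 20);
""")
# Six 'name' cells around (5, 5), plus one neighbour owned by someone else.
cells = [("name", 4, 4), ("name", 4, 5), ("name", 4, 6),
         ("name", 5, 4), ("name", 5, 6), ("name", 6, 4), ("foo", 6, 5)]
conn.executemany("INSERT INTO surround VALUES (?, 1, ?, ?)", cells)

# The 3x3 neighbourhood condition shared by both queries.
NEAR = ("d.x >= r.x - 1 AND d.x <= r.x + 1 AND "
        "d.y >= r.y - 1 AND d.y <= r.y + 1 AND d.n = 'name'")

subquery_sql = f"""SELECT r.x, r.y FROM base AS r
WHERE r.l = 50 AND r.n <> 'name'
  AND 6 = (SELECT COUNT(*) FROM surround AS d WHERE {NEAR})"""

join_sql = f"""SELECT r.x, r.y FROM base AS r, surround AS d
WHERE r.l = 50 AND r.n <> 'name' AND {NEAR}
GROUP BY r.x, r.y HAVING COUNT(*) = 6"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows, join_rows)
```

Both queries should return only the base at (5, 5). The join form can never return a base with zero matching neighbours, which does not matter here because the count must equal 6.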
UPDATED
How your original query works
This is the first time I have seen "Range checked for each record (index map: 0x1)", so I can't say for certain how your query is executed. The MySQL manual gives some information about it. It seems that every row in surround (which has about 57k rows?) is compared against base's x and y. If so, your query is evaluated as a three-level nested-loop join (base => surround => base), and on top of that every row in surround is compared, which is inefficient.
I will look into how it works in more detail later; I have to get to work now.
This is how my code looks like:
foreach ($instruments as $instrument) {
$stmt = $pdo->prepare("SELECT date, adjusted_close, close FROM ehd_historical_data WHERE exchange = ? AND symbol = ? AND date >= ? ORDER BY date asc LIMIT 1");
$stmt->execute([xyzToExchange($instrument), xyzToSymbol($instrument), $startDate]);
$data1 = $stmt->fetch(PDO::FETCH_ASSOC);
$stmt = $pdo->prepare("SELECT date, adjusted_close, close FROM ehd_historical_data WHERE exchange = ? AND symbol = ? ORDER BY date desc LIMIT 1");
$stmt->execute([xyzToExchange($instrument), xyzToSymbol($instrument)]);
$data2 = $stmt->fetch(PDO::FETCH_ASSOC);
}
There are around 2000 instruments, each a string in the format "NASDAQ:AAPL".
The loop currently takes 7 seconds to complete; the table has around 50 million rows.
So far:
I have added an INDEX on exchange, symbol and date together.
I have added another INDEX on exchange and symbol together.
What else can I do to optimize these queries?
Note:
The function this code belongs to computes the price difference and the percent change between the start date and today's date. The start date can be anything, such as 6 months ago or 3 months ago.
I tried merging the queries into one large query and executing that, with the same result.
Update:
EXPLAIN for both queries
Table Schema
CREATE TABLE `ehd_historical_data` (
`exchange` varchar(255) NOT NULL,
`symbol` varchar(255) NOT NULL,
`date` date NOT NULL,
`open` decimal(20,10) NOT NULL,
`high` decimal(20,10) NOT NULL,
`low` decimal(20,10) NOT NULL,
`close` decimal(20,10) NOT NULL,
`adjusted_close` decimal(20,10) NOT NULL,
`volume` decimal(20,0) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
ALTER TABLE `ehd_historical_data`
ADD UNIQUE KEY `exchange_2` (`exchange`,`symbol`,`date`),
ADD KEY `exchange` (`exchange`),
ADD KEY `date` (`date`),
ADD KEY `symbol` (`symbol`),
ADD KEY `exchange_3` (`exchange`,`symbol`);
COMMIT;
Try selecting both rows in a single query using row_number()
select *
from (
SELECT date, adjusted_close, close,
row_number() over(order by date desc) rn1,
row_number() over(order by date asc) rn2
FROM ehd_historical_data
WHERE exchange = ? AND symbol = ? AND date >= ?
) t
where rn1 = 1 or rn2 = 1
You can also request all symbols at once. Note the PARTITION BY clause:
select *
from (
SELECT exchange, symbol, date, adjusted_close, close,
row_number() over(partition by exchange, symbol order by date desc) rn1,
row_number() over(partition by exchange, symbol order by date asc) rn2
FROM ehd_historical_data
WHERE ((exchange = 'NASDAQ' AND symbol = 'AAPL') OR (exchange = 'NASDAQ' AND symbol = 'MSFT') OR (exchange = 'NASDAQ' AND symbol = 'TSLA')) AND date >= ?
) t
where rn1 = 1 or rn2 = 1
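Here is a small runnable sketch of the partitioned row_number() approach, followed by the percent-change calculation the question is after. It assumes SQLite (3.25 or newer, for window functions) as a stand-in for MySQL 8, a cut-down table with only the columns needed, and made-up prices.

```python
import sqlite3

# SQLite stands in for MySQL 8 here; the table is cut down and the prices are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ehd_historical_data "
             "(exchange TEXT, symbol TEXT, date TEXT, close REAL)")
conn.executemany("INSERT INTO ehd_historical_data VALUES (?, ?, ?, ?)", [
    ("NASDAQ", "AAPL", "2023-01-02", 10.0),  # before the start date
    ("NASDAQ", "AAPL", "2023-03-01", 12.0),
    ("NASDAQ", "AAPL", "2023-04-01", 13.0),  # neither first nor last
    ("NASDAQ", "AAPL", "2023-06-01", 15.0),
    ("NASDAQ", "MSFT", "2023-02-01", 20.0),
    ("NASDAQ", "MSFT", "2023-06-01", 30.0),
])

sql = """
SELECT exchange, symbol, date, close
FROM (
    SELECT exchange, symbol, date, close,
           row_number() OVER (PARTITION BY exchange, symbol ORDER BY date DESC) AS rn1,
           row_number() OVER (PARTITION BY exchange, symbol ORDER BY date ASC)  AS rn2
    FROM ehd_historical_data
    WHERE date >= ?
) t
WHERE rn1 = 1 OR rn2 = 1
ORDER BY exchange, symbol, date
"""
first_last = conn.execute(sql, ("2023-02-01",)).fetchall()

# Group the first and last close per instrument and compute the percent change.
closes = {}
for exchange, symbol, date, close in first_last:
    closes.setdefault((exchange, symbol), []).append(close)
pct_change = {key: round((c[-1] - c[0]) / c[0] * 100, 2) for key, c in closes.items()}
print(pct_change)
```

One round trip replaces the 2 x 2000 prepared-statement executions of the original loop; whether that is faster on 50 million rows depends on how well the (exchange, symbol, date) index serves the partitioned scan.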
Your index (exchange,symbol,date) is optimal for both of those SELECTs, so let's dig into other causes for sluggishness.
CREATE TABLE `ehd_historical_data` (
`exchange` varchar(255) NOT NULL, -- Don't use 255 if you don't need it
`symbol` varchar(255) NOT NULL, -- ditto
`date` date NOT NULL,
`open` decimal(20,10) NOT NULL, -- overkill
`high` decimal(20,10) NOT NULL,
`low` decimal(20,10) NOT NULL,
`close` decimal(20,10) NOT NULL,
`adjusted_close` decimal(20,10) NOT NULL,
`volume` decimal(20,0) NOT NULL
) ENGINE=InnoDB DEFAULT
CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci; -- probably all are ascii
ALTER TABLE `ehd_historical_data`
ADD UNIQUE KEY `exchange_2` (`exchange`,`symbol`,`date`), -- Change to PRIMARY KEY
ADD KEY `exchange` (`exchange`), -- redundant, DROP
ADD KEY `date` (`date`),
ADD KEY `symbol` (`symbol`),
ADD KEY `exchange_3` (`exchange`,`symbol`); -- redundant, DROP
COMMIT;
Move the symbols into a separate table, and refer to them from this table with a MEDIUMINT UNSIGNED id.
decimal(20,10) takes 10 bytes; I know of no symbol that needs that much precision or range.
The above comments are aimed at making the table smaller. If the table is currently bigger than will fit in cache, I/O will be the cause of sluggishness.
How much RAM do you have? What is the value of innodb_buffer_pool_size?
How fast does this run? (I'm thinking there might be a way to get all 2000 results in a single SQL. This might be a component of it.)
SELECT exchange, symbol,
MIN(date) AS date1,
MAX(date) AS date2
FROM ehd_historical_data
WHERE date > ?
GROUP BY exchange, symbol
That would be JOINed back to the table twice, once for each date.
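A sketch of that join-back, assuming SQLite as a stand-in for MySQL, a cut-down table, and made-up prices. The derived table finds the first and last date per instrument, and the two joins fetch the row for each of those dates.

```python
import sqlite3

# SQLite stands in for MySQL; the table is cut down and the prices are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ehd_historical_data "
             "(exchange TEXT, symbol TEXT, date TEXT, close REAL, "
             "UNIQUE (exchange, symbol, date))")
conn.executemany("INSERT INTO ehd_historical_data VALUES (?, ?, ?, ?)", [
    ("NASDAQ", "AAPL", "2023-01-02", 10.0),  # before the start date
    ("NASDAQ", "AAPL", "2023-03-01", 12.0),
    ("NASDAQ", "AAPL", "2023-06-01", 15.0),
    ("NASDAQ", "MSFT", "2023-02-01", 20.0),
    ("NASDAQ", "MSFT", "2023-06-01", 30.0),
])

sql = """
SELECT b.exchange, b.symbol, d1.date, d1.close, d2.date, d2.close
FROM (
    SELECT exchange, symbol, MIN(date) AS date1, MAX(date) AS date2
    FROM ehd_historical_data
    WHERE date > ?
    GROUP BY exchange, symbol
) AS b
JOIN ehd_historical_data AS d1
  ON d1.exchange = b.exchange AND d1.symbol = b.symbol AND d1.date = b.date1
JOIN ehd_historical_data AS d2
  ON d2.exchange = b.exchange AND d2.symbol = b.symbol AND d2.date = b.date2
ORDER BY b.exchange, b.symbol
"""
joined = conn.execute(sql, ("2023-01-31",)).fetchall()
for row in joined:
    print(row)
```

Each join is a point lookup on (exchange, symbol, date), so the unique key on those three columns is what would make this cheap in MySQL as well.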
I would like to compute some statistics: for each tool, the percentage of all the registrations in my database that chose it, so I can see which tools have been chosen the most overall.
These are my two tables:
$table_registration = $wpdb->prefix . 'registration';
$table_tools = $wpdb->prefix . 'tools';
wp_registration table:
CREATE TABLE $table_registration
(
reg_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
dato date,
billedeURL VARCHAR(80) NOT NULL,
fiske_vaegt DECIMAL( 2,1 ) NOT NULL,
fiske_laengde INT NOT NULL,
reg_user_id BIGINT UNSIGNED NOT NULL,
reg_tools_id INT UNSIGNED NOT NULL,
PRIMARY KEY (reg_id),
FOREIGN KEY (reg_user_id) REFERENCES wp_users(id),
FOREIGN KEY (reg_tools_id) REFERENCES $table_tools(tools_id)
)
wp_tools table:
CREATE TABLE $table_tools
(
tools_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
tools_navn CHAR (20),
PRIMARY KEY (tools_id)
)
I have been trying to write the correct MySQL query, but with no luck. This is what I have so far:
select l.*, concat(round(100 * count(t.reg_tools_id) / t2.cnt,0),'%')
from wp_registration l
left join wp_tools t on l.toolss_id = t.reg_id
cross join
(select count(*) cnt
from wp_registration
where reg_tools_id = 1) t2
group by l.reg_id;
But it tells me that every tool has been used 50% of the time, which is obviously wrong. Users can choose from three tools, and right now tool 1 has two votes, tool 2 has nine votes and tool 3 has two votes; there are 13 registrations in total.
Hopefully I understand what you need:
SELECT
tools.tools_id,
((COUNT(*) / (SELECT COUNT(*) FROM registration)) * 100) AS percent
FROM
registration
JOIN
tools ON registration.reg_tools_id = tools.tools_id
GROUP BY
tools.tools_id
ORDER BY
percent DESC
LIMIT 1
Some remarks:
Try to write the question in pure SQL.
You do not need a php tag for this question.
Trim unnecessary parts from your code.
Apply the rounding and concatenation in the programming language you are using rather than in SQL (I think you are using PHP here, so run the query, fetch the result, and apply round and concat in PHP).
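The last remark can be sketched as follows. This is an illustration only: it uses Python and SQLite instead of PHP and MySQL, unprefixed table names, made-up tool names, and the vote counts from the question (2, 9 and 2 out of 13). The query returns plain counts, and the rounding and the '%' suffix are applied in application code.

```python
import sqlite3

# SQLite and Python stand in for MySQL and PHP; tool names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tools (tools_id INTEGER PRIMARY KEY, tools_navn TEXT);
    CREATE TABLE registration (reg_id INTEGER PRIMARY KEY,
                               reg_tools_id INTEGER NOT NULL);
    INSERT INTO tools VALUES (1, 'tool A'), (2, 'tool B'), (3, 'tool C');
""")
# Vote counts from the question: tool 1 twice, tool 2 nine times, tool 3 twice.
votes = [1] * 2 + [2] * 9 + [3] * 2
conn.executemany("INSERT INTO registration (reg_tools_id) VALUES (?)",
                 [(tool_id,) for tool_id in votes])

# The query only counts; no LIMIT, so every tool is returned.
counts = conn.execute("""
    SELECT tools.tools_id, COUNT(*) AS cnt
    FROM registration
    JOIN tools ON registration.reg_tools_id = tools.tools_id
    GROUP BY tools.tools_id
    ORDER BY cnt DESC, tools.tools_id
""").fetchall()

# Rounding and formatting happen in the application, not in SQL.
total = sum(cnt for _, cnt in counts)
percentages = {tool_id: f"{round(100 * cnt / total)}%" for tool_id, cnt in counts}
print(percentages)
```

Note that the rounded values need not add up to exactly 100%.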
I am building a color search function utilizing php and mysql. The requirement of the search is that it needs to be fast, not use joins, and allow for 1-5 hex color inputs that query the database and return the "most accurate" results. By "most accurate" I mean that results will be reflective of the search input. I have a few pieces of data to help that such as the distance between the mapped color value (mapped against an array of pre-defined colors) and the original search input hex value (eg. ff0000).
The way the color search engine works is that you input 1-5 hex values (e.g. #ff0000, #000000, #9ef855), click search, and it searches the database to find images that contain the highest percentage of those colors. See this color search for reference to how a color search engine works. Note: I built this one, but it has a completely different schema, which has scaling problems and can't take indexes, because the number of colors is directly tied to the number of table columns, which is 120. Suggesting I use what I have already built is out of the question for right now.
The data in the database comes from measurements taken on images. Up to 5 colors are extracted from an image, and then each hex color value (hex) is mapped to the closest predefined hex value (map_hex). Both of these pieces of data as well as the following are stored in the database:
media_id
hex (actual true value from image measurement)
map_hex (mapped value of the previous hex value)
percentage (the amount of this color found in the image)
distance (the distance between the true hex value and the mapped hex value)
sequence (unix timestamp, for ordering)
Before a color search query gets sent to the database, it is mapped to a set of colors so we can use the mapping to do a direct lookup on map_hex. This to me seemed like a faster way than trying to do a range type of query.
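To make the mapping step concrete, here is a minimal sketch in Python. The palette is hypothetical, and Euclidean distance in RGB space is an assumption; the post does not say which metric or which predefined colors are actually used.

```python
import math

# Hypothetical palette of predefined colors; the real list is not given in the post.
PALETTE = ["ff0000", "00ff00", "0000ff", "000000", "ffffff"]

def hex_to_rgb(hex_value):
    """Convert 'ff0000' or '#ff0000' to an (r, g, b) tuple of ints."""
    h = hex_value.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def map_to_palette(hex_value, palette=PALETTE):
    """Return (map_hex, distance) for the closest palette color.

    Distance is plain Euclidean distance in RGB space (an assumption).
    """
    rgb = hex_to_rgb(hex_value)
    map_hex = min(palette, key=lambda p: math.dist(rgb, hex_to_rgb(p)))
    return map_hex, math.dist(rgb, hex_to_rgb(map_hex))

map_hex, distance = map_to_palette("#fe0101")
print(map_hex, distance)
```

The same function would serve both sides: when storing a measured color (hex gives map_hex and distance) and when translating a search input into the map_hex value to look up.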
As of right now I am experimenting with two database design schemas but both seem to have their own problems.
Schema 1
CREATE TABLE `media_has_colors` (
`media_id` int(9) unsigned NOT NULL,
`hex` varchar(6) NOT NULL DEFAULT '',
`map_hex` varchar(6) NOT NULL,
`percentage` double unsigned NOT NULL,
`distance` double unsigned NOT NULL,
`sequence` int(11) unsigned NOT NULL,
PRIMARY KEY (`media_id`,`hex`),
KEY `index_on_hex` (`hex`),
KEY `index_on_percentage` (`percentage`),
KEY `index_on_timestamp` (`sequence`),
KEY `index_on_media_id` (`media_id`),
KEY `index_on_mapping_distance` (`distance`),
KEY `index_on_mapping_hex` (`map_hex`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Sample query:
SELECT sql_no_cache media_id, hex, map_hex, distance,
avg(percentage) as percentage,
SUM((IF(map_hex = '61615a',1,0)) + (IF(map_hex = '34362d',1,0)) + (IF(map_hex = 'dbd5dd',1,0))) as matchCount
FROM media_has_colors
WHERE map_hex = '61615a' or map_hex = '34362d' or map_hex = 'dbd5dd'
GROUP BY media_id
ORDER BY matchCount DESC, distance, percentage DESC
LIMIT 100;
The first problem I see with schema 1 is that I am forced to use GROUP BY and SUM. I'll admit I have not tested with a ton of records yet, but it seems like it could get slow. On top of that, I can't tell which map_hex values are matching (which is what I'm trying to get at with matchCount).
Schema 2
CREATE TABLE `media_has_colors` (
`media_id` int(9) unsigned NOT NULL,
`color_1_hex` varchar(6) NOT NULL DEFAULT '',
`color_2_hex` varchar(6) NOT NULL DEFAULT '',
`color_3_hex` varchar(6) NOT NULL DEFAULT '',
`color_4_hex` varchar(6) NOT NULL DEFAULT '',
`color_5_hex` varchar(6) NOT NULL DEFAULT '',
`color_1_map_hex` varchar(6) NOT NULL DEFAULT '',
`color_2_map_hex` varchar(6) NOT NULL DEFAULT '',
`color_3_map_hex` varchar(6) NOT NULL DEFAULT '',
`color_4_map_hex` varchar(6) NOT NULL DEFAULT '',
`color_5_map_hex` varchar(6) NOT NULL DEFAULT '',
`color_1_percent` double unsigned NOT NULL DEFAULT '0',
`color_2_percent` double unsigned NOT NULL DEFAULT '0',
`color_3_percent` double unsigned NOT NULL DEFAULT '0',
`color_4_percent` double unsigned NOT NULL DEFAULT '0',
`color_5_percent` double unsigned NOT NULL DEFAULT '0',
`color_1_distance` double unsigned NOT NULL DEFAULT '0',
`color_2_distance` double unsigned NOT NULL DEFAULT '0',
`color_3_distance` double unsigned NOT NULL DEFAULT '0',
`color_4_distance` double unsigned NOT NULL DEFAULT '0',
`color_5_distance` double unsigned NOT NULL DEFAULT '0',
`sequence` int(11) unsigned NOT NULL,
PRIMARY KEY (`media_id`),
KEY `index_on_timestamp` (`sequence`),
KEY `index_on_map_hex` (`color_1_map_hex`,`color_2_map_hex`,`color_3_map_hex`,`color_4_map_hex`,`color_5_map_hex`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This second schema is not as simple, but it avoids GROUP BY by allowing only 1 row per media. However, it seems to have the same problem of figuring out which map_hex values are matching. Here is a sample query:
SELECT sql_no_cache media_id,
(IF(color_1_percent = '61615a',color_1_percent,1)) *
(IF(color_2_percent = '34362d',color_2_percent,1)) *
(IF(color_3_percent = 'dbd5dd',color_3_percent,1)) as percentage,
(IF(color_1_distance = '61615a',color_1_distance,1)) +
(IF(color_2_distance = '34362d',color_2_distance,1)) +
(IF(color_3_distance = 'dbd5dd',color_3_distance,1)) as distance,
color_1_map_hex, color_2_map_hex, color_3_map_hex, color_4_map_hex, color_5_map_hex,
(IF(color_1_map_hex = '61615a',1,1)) +
(IF(color_2_map_hex = '34362d',1,1)) +
(IF(color_3_map_hex = 'dbd5dd',1,1)) as matchCount
FROM media_has_colors
WHERE color_1_map_hex IN ('61615a','34362d','dbd5dd') OR
color_2_map_hex IN ('61615a','34362d','dbd5dd') OR
color_3_map_hex IN ('61615a','34362d','dbd5dd')
ORDER BY matchCount DESC, distance, percentage DESC
LIMIT 100;
You can see that there is a problem with calculating percentage and distance because the actual map_hex value may not appear in those specific columns.
Update:
I don't need to know specifically what colors matched in the query but I do need to sort by which has the highest matches.
So my question is, How can the schema or queries be fixed? If not, is there a better solution?
I have a big table with about 20 million rows, and it grows every day. I also have a form that runs a query against this table. Unfortunately, the query returns hundreds of thousands of rows.
The query is based on time, and I need all the records so that I can classify them by 'clid' according to some rules and build a result table.
This is my table :
CREATE TABLE IF NOT EXISTS `cdr` (
`gid` bigint(20) NOT NULL AUTO_INCREMENT,
`prefix` varchar(20) NOT NULL DEFAULT '',
`id` bigint(20) NOT NULL,
`start` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`clid` varchar(80) NOT NULL DEFAULT '',
`duration` int(11) NOT NULL DEFAULT '0',
`service` varchar(20) NOT NULL DEFAULT '',
PRIMARY KEY (`gid`),
UNIQUE KEY `id` (`id`,`prefix`),
KEY `start` (`start`),
KEY `clid` (`clid`),
KEY `service` (`service`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
and this is my query :
SELECT * FROM `cdr`
WHERE
service = 'test' AND
`start` >= '2014-02-09 00:00:00' AND
`start` < '2014-02-10 00:00:00' AND
`duration` >= 10
The date period can vary from 1 hour to maybe 60 days or even more, for example:
DATE(start) BETWEEN '2013-02-02 00:00:00' AND '2014-02-03 00:00:00'
The result set has about 150,000 rows per day. When I try to get the result for a longer period, or even for one day, the database crashes.
Does anybody have any idea?
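One detail in the question is worth checking: the half-open range on `start` and the DATE(start) BETWEEN form select the same rows for a single day, but wrapping the column in a function generally stops an ordinary index on it from being used for a range scan. The sketch below illustrates this with SQLite as a stand-in (its EXPLAIN QUERY PLAN output differs from MySQL's EXPLAIN), a cut-down table, and made-up rows; run EXPLAIN on the real MySQL table to confirm.

```python
import sqlite3

# SQLite stands in for MySQL; the table is cut down and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cdr (gid INTEGER PRIMARY KEY, "start" TEXT,
                      clid TEXT, duration INTEGER);
    CREATE INDEX idx_start ON cdr ("start");
    INSERT INTO cdr ("start", clid, duration) VALUES
        ('2014-02-08 23:59:59', 'a', 20),
        ('2014-02-09 00:00:00', 'b', 30),
        ('2014-02-09 23:59:59', 'c', 40),
        ('2014-02-10 00:00:00', 'd', 50);
""")

def plan(sql, params):
    """Return SQLite's query plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)

# Half-open range on the bare column versus a function wrapped around it.
range_sql = 'SELECT * FROM cdr WHERE "start" >= ? AND "start" < ?'
wrapped_sql = 'SELECT * FROM cdr WHERE date("start") BETWEEN ? AND ?'
range_params = ("2014-02-09 00:00:00", "2014-02-10 00:00:00")
wrapped_params = ("2014-02-09", "2014-02-09")

range_ids = sorted(r[0] for r in conn.execute(range_sql, range_params))
wrapped_ids = sorted(r[0] for r in conn.execute(wrapped_sql, wrapped_params))
range_plan = plan(range_sql, range_params)
wrapped_plan = plan(wrapped_sql, wrapped_params)

print(range_ids, range_plan)
print(wrapped_ids, wrapped_plan)
```

Both queries return the same two rows, but only the first plan mentions the index. With 20 million rows, that is the difference between reading one day's worth of index entries and reading the whole table.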
I don't know how to prevent it from crashing, but one thing that I did with my large tables was partition them by date.
Here, I partition the rows by date, twice a month. As long as your query uses the partitioned column, it will only search the partitions containing the key. It will not do a full table scan.
CREATE TABLE `identity` (
`Reference` int(9) unsigned NOT NULL AUTO_INCREMENT,
...
`Reg_Date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`Reference`,`Reg_Date`), -- the partitioning column must be part of every unique key
KEY `Reg_Date` (`Reg_Date`)
) ENGINE=InnoDB AUTO_INCREMENT=28424336 DEFAULT CHARSET=latin1
PARTITION BY RANGE COLUMNS (Reg_Date) (
PARTITION p20140201 VALUES LESS THAN ('2014-02-01'),
PARTITION p20140214 VALUES LESS THAN ('2014-02-14'),
PARTITION p20140301 VALUES LESS THAN ('2014-03-01'),
PARTITION p20140315 VALUES LESS THAN ('2014-03-15'),
PARTITION p20140715 VALUES LESS THAN (MAXVALUE)
);
So basically, you just do a dump of the table, create it with partitions and then import the data into it.
Optimizing MySQL queries isn't my expertise, so I was wondering if someone could help me formulate the optimal query (and indices) here.
As background, I'm trying to find a distinct visitor id within a table of transactions with certain where criteria (date range, not a certain product, etc. as you see in the query below). Transactions and visitors have a one to many relationship, so there can be many transactions to a single visitor.
Another requirement for the results is that if a visitor_id is found in the result, it must be the first instance of a visitor_id (by date_time) in the entire table. In other words, the visitor_id should only exist in the date range set in the primary query and at no time beforehand.
Here's what I've put together so far. It uses NOT IN and a subquery, but this doesn't seem ideal because the query takes between 2-3 seconds being that the table has over 500k records. I've tried a few variations of indices, but nothing seems to really work.
Here's the query.
SELECT DISTINCT visitor_id, date_time
FROM pt_transactions
WHERE visitor_id NOT IN (SELECT visitor_id FROM pt_transactions WHERE date_time < '$this->_date_time_start')
AND campaign_id = $this->_campaign_id
AND a_aid = '$a_aid'
AND date_time >= '$this->_date_time_start'
AND date_time <= '$this->_date_time_end'
AND product_id != 65
And here's the complete table structure.
CREATE TABLE IF NOT EXISTS `pt_transactions` (
`id` int(32) NOT NULL AUTO_INCREMENT,
`type` varchar(2) NOT NULL COMMENT 'New Lead (NL), Raw Optin (RO), Base Sale (BS), Upsell Sale (US), Recurring Sale (RS), Base Refund (BR), Upsell Refund (UR), Recurring Refund (RR), Unknown Refund (XR), or Chargeback (C)',
`date_time` datetime NOT NULL,
`amount` varchar(255) NOT NULL,
`a_aid` varchar(255) NOT NULL,
`subid1` varchar(255) NOT NULL,
`subid2` varchar(255) NOT NULL,
`subid3` varchar(255) NOT NULL,
`product_id` int(16) NOT NULL,
`visitor_id` int(32) NOT NULL,
`campaign_id` int(16) NOT NULL,
`last_click_id` int(16) NOT NULL,
`trackback_type` varchar(255) NOT NULL COMMENT 'Shows if the transaction is tracked back to the original visitor via cookie or via IP. Usually only applies to sales via pixel.',
`original_transaction_id` int(32) NOT NULL COMMENT 'Reference to original transaction id, in this table, if type is RS, R, or C',
`recurring_transaction_id` varchar(32) NOT NULL COMMENT 'Reference to existing RecurringTransaction if type is RS',
PRIMARY KEY (`id`),
KEY `visitor_id` (`visitor_id`),
KEY `campaign_id` (`visitor_id`,`campaign_id`,`amount`,`product_id`),
KEY `transaction_retrieval_group` (`campaign_id`,`date_time`,`a_aid`),
KEY `type` (`type`),
KEY `date_time` (`date_time`),
KEY `original_source` (`campaign_id`,`a_aid`,`date_time`,`product_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=574636
You can try NOT EXISTS
SELECT DISTINCT visitor_id, date_time
FROM pt_transactions t
WHERE campaign_id = $this->_campaign_id
AND a_aid = '$a_aid'
AND date_time >= '$this->_date_time_start'
AND date_time <= '$this->_date_time_end'
AND product_id != 65
AND NOT EXISTS
(
SELECT *
FROM pt_transactions
WHERE visitor_id = t.visitor_id
AND date_time < '$this->_date_time_start'
)
Do EXPLAIN <query> and see how your indices are used. If you want you can post results in your question in a textual form.
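To confirm that the NOT EXISTS rewrite returns the same rows as the original NOT IN query, here is a small sketch. It assumes SQLite as a stand-in for MySQL, a cut-down table with only the columns the query touches, made-up rows, and bound parameters in place of the interpolated PHP variables.

```python
import sqlite3

# SQLite stands in for MySQL; the table is cut down and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pt_transactions (
    id INTEGER PRIMARY KEY, visitor_id INTEGER NOT NULL, date_time TEXT NOT NULL,
    campaign_id INTEGER NOT NULL, a_aid TEXT NOT NULL, product_id INTEGER NOT NULL)""")
rows = [
    (1, "2013-12-15 10:00:00", 7, "aff", 1),   # visitor 1 was seen before the range
    (1, "2014-01-05 10:00:00", 7, "aff", 1),
    (2, "2014-01-10 10:00:00", 7, "aff", 1),   # visitor 2 is new within the range
    (3, "2014-01-12 10:00:00", 7, "aff", 65),  # excluded product
]
conn.executemany(
    "INSERT INTO pt_transactions "
    "(visitor_id, date_time, campaign_id, a_aid, product_id) VALUES (?, ?, ?, ?, ?)",
    rows)

# Conditions shared by both queries, with named parameters instead of PHP variables.
FILTER = """campaign_id = :campaign AND a_aid = :aid
    AND date_time >= :start AND date_time <= :end AND product_id != 65"""

not_in_sql = f"""SELECT DISTINCT visitor_id, date_time FROM pt_transactions
WHERE {FILTER} AND visitor_id NOT IN
    (SELECT visitor_id FROM pt_transactions WHERE date_time < :start)"""

not_exists_sql = f"""SELECT DISTINCT visitor_id, date_time FROM pt_transactions t
WHERE {FILTER} AND NOT EXISTS
    (SELECT 1 FROM pt_transactions p
     WHERE p.visitor_id = t.visitor_id AND p.date_time < :start)"""

params = {"campaign": 7, "aid": "aff",
          "start": "2014-01-01 00:00:00", "end": "2014-01-31 23:59:59"}
not_in_rows = conn.execute(not_in_sql, params).fetchall()
not_exists_rows = conn.execute(not_exists_sql, params).fetchall()
print(not_in_rows, not_exists_rows)
```

Only visitor 2 comes back from either query: visitor 1 has an earlier transaction and visitor 3 bought the excluded product. The NOT EXISTS subquery filters on visitor_id and date_time together, so an index covering both columns is the natural candidate to try in MySQL.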
From your query, what I understand is that
there is no need for the NOT IN clause,
because you are already checking
date_time >= '$this->_date_time_start'
so there is no need to check date_time < '$this->_date_time_start' in the NOT IN subquery.
The query below alone should work fine :)
SELECT DISTINCT visitor_id, date_time
FROM pt_transactions
WHERE
campaign_id = $this->_campaign_id
AND a_aid = '$a_aid'
AND date_time >= '$this->_date_time_start'
AND date_time <= '$this->_date_time_end'
AND product_id != 65