MySQL WHERE BETWEEN query optimization - PHP

Below is the format of the database of Autonomous System Numbers (downloaded and parsed from this site).
range_start range_end number cc provider
----------- --------- ------ -- -------------------------------------
16778240 16778495 56203 AU AS56203 - BIGRED-NET-AU Big Red Group
16793600 16809983 18144 AS18144
745465 total rows
A Normal query looks like this:
select * from table where 3232235520 BETWEEN range_start AND range_end
It works properly, but I query a huge number of IPs to check their AS information, which ends up taking too many calls and too much time.
Profiler snapshot: [Blackfire profiler snapshot]
I have two indexes:
the id column
a combined index on the range_start and range_end columns, as together they make a row unique.
Questions:
Is there a way to query a huge number of IPs in a single query?
Multiple (IP BETWEEN range_start AND range_end) OR (IP BETWEEN range_start AND range_end) OR ... predicates work, but I can't get the IP -> row mapping, i.e. which rows were retrieved for which IP.
Any suggestions to change the database structure to optimize the query speed and decrease the time?
Any help will be appreciated! Thanks!

It is possible to query more than one IP address; there are several approaches we could take. Assume range_start and range_end are defined as integer types.
For a reasonable number of ip addresses, we could use an inline view:
SELECT i.ip, a.*
FROM ( SELECT 3232235520 AS ip
UNION ALL SELECT 3232235521
UNION ALL SELECT 3232235522
UNION ALL SELECT 3232235523
UNION ALL SELECT 3232235524
UNION ALL SELECT 3232235525
) i
LEFT
JOIN ip_to_asn a
ON a.range_start <= i.ip
AND a.range_end >= i.ip
ORDER BY i.ip
This approach will work for a reasonable number of IP addresses. The inline view could be extended with more UNION ALL SELECT to add additional IP addresses. But that's not necessarily going to work for a "huge" number.
When we get "huge", we're going to run into limitations in MySQL... maximum size of a SQL statement limited by max_allowed_packet, there may be a limit on the number of SELECT that can appear.
The inline view could be replaced with a temporary table, built first.
DROP TEMPORARY TABLE IF EXISTS _ip_list_;
CREATE TEMPORARY TABLE _ip_list_ (ip BIGINT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO _ip_list_ (ip) VALUES (3232235520),(3232235521),(3232235522),...;
...
INSERT INTO _ip_list_ (ip) VALUES (3232237989),(3232237990);
Then reference the temporary table in place of the inline view:
SELECT i.ip, a.*
FROM _ip_list_ i
LEFT
JOIN ip_to_asn a
ON a.range_start <= i.ip
AND a.range_end >= i.ip
ORDER BY i.ip ;
And then drop the temporary table:
DROP TEMPORARY TABLE IF EXISTS _ip_list_ ;
Some other notes:
Churning database connections is going to degrade performance. There's a significant amount of overhead in establishing and tearing down a connection. That overhead gets noticeable if the application is repeatedly connecting and disconnecting, especially if it's doing that for every SQL statement issued.
And running an individual SQL statement also has overhead: the statement has to be sent to the server, parsed for syntax, evaluated for semantics, an execution plan chosen, the plan executed, a resultset prepared, and the resultset returned to the client. This is why it's more efficient to process set-wise rather than row-wise. Processing RBAR (row by agonizing row) can be very slow compared to sending a statement to the database and letting it process a set in one fell swoop.
But there's a tradeoff there. With ginormous sets, things can start to get slow again.
Even if you can process two IP addresses in each statement, that halves the number of statements that need to be executed. If you do 20 IP addresses in each statement, that cuts down the number of statements to 5% of the number that would be required a row at a time.
And the composite index already defined on (range_start,range_end) is appropriate for this query.
FOLLOWUP
As Rick James points out in a comment, the index I earlier said was "appropriate" is less than ideal.
We could write the query a little differently, that might make more effective use of that index.
If (range_start,range_end) is UNIQUE (or PRIMARY) KEY, then this will return one row per IP address, even when there are "overlapping" ranges. (The previous query would return all of the rows that had a range_start and range_end that overlapped with the IP address.)
SELECT t.ip, a.*
FROM ( SELECT s.ip
, s.range_start
, MIN(e.range_end) AS range_end
FROM ( SELECT i.ip
, MAX(r.range_start) AS range_start
FROM _ip_list_ i
LEFT
JOIN ip_to_asn r
ON r.range_start <= i.ip
GROUP BY i.ip
) s
LEFT
JOIN ip_to_asn e
ON e.range_start = s.range_start
AND e.range_end >= s.ip
GROUP BY s.ip, s.range_start
) t
LEFT
JOIN ip_to_asn a
ON a.range_start = t.range_start
AND a.range_end = t.range_end
ORDER BY t.ip ;
With this query, for the innermost inline view query s, the optimizer might be able to make effective use of an index with a leading column of range_start, to quickly identify the "highest" value of range_start that is less than or equal to the IP address. But with that outer join, and with the GROUP BY on i.ip, I'd really need to look at the EXPLAIN output; it's only conjecture what the optimizer might do, and what is important is what the optimizer actually does.
Then, for inline view query e, MySQL might be able to make more effective use of the composite index on (range_start,range_end), because of the equality predicate on the first column and the inequality condition feeding the MIN aggregate on the second column.
For the outermost query, MySQL will surely be able to make effective use of the composite index, due to the equality predicates on both columns.
A query of this form might show improved performance, or performance might go to hell in a handbasket. The output of EXPLAIN should give a good indication of what's going on. We'd like to see "Using index for group-by" in the Extra column, and we only want to see a "Using filesort" for the ORDER BY on the outermost query. (If we remove the ORDER BY clause, we want to not see "Using filesort" in the Extra column.)
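As a minimal sketch of that check, prefix the query with EXPLAIN (shown here on the simpler join form; the same applies to the rewritten query):
EXPLAIN
SELECT i.ip, a.*
  FROM _ip_list_ i
  LEFT
  JOIN ip_to_asn a
    ON a.range_start <= i.ip
   AND a.range_end >= i.ip
 ORDER BY i.ip ;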
Another approach is to make use of correlated subqueries in the SELECT list. The execution of correlated subqueries can get expensive when the resultset contains a large number of rows. But this approach can give satisfactory performance for some use cases.
This query depends on no overlapping ranges in the ip_to_asn table, and this query will not produce the expected results when overlapping ranges exist.
SELECT t.ip, a.*
FROM ( SELECT i.ip
, ( SELECT MAX(s.range_start)
FROM ip_to_asn s
WHERE s.range_start <= i.ip
) AS range_start
, ( SELECT MIN(e.range_end)
FROM ip_to_asn e
WHERE e.range_end >= i.ip
) AS range_end
FROM _ip_list_ i
) r
LEFT
JOIN ip_to_asn a
ON a.range_start = r.range_start
AND a.range_end = r.range_end
As a demonstration of why overlapping ranges will be a problem for this query, consider a totally goofy, made-up example:
range_start range_end
----------- ---------
.101 .160
.128 .244
Given an IP address of .140, the MAX(range_start) subquery will find .128, the MIN(range_end) subquery will find .160, and then the outer query will attempt to find a matching row with range_start = .128 AND range_end = .160. And that row just doesn't exist.
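A minimal sketch reproducing that mismatch (demo_ranges is a hypothetical scratch table; the dotted values are scaled to bare integers for illustration):
CREATE TEMPORARY TABLE demo_ranges (range_start INT, range_end INT);
INSERT INTO demo_ranges VALUES (101, 160), (128, 244);

-- the two correlated subqueries, evaluated for ip = 140
SELECT MAX(range_start) FROM demo_ranges WHERE range_start <= 140;  -- 128
SELECT MIN(range_end) FROM demo_ranges WHERE range_end >= 140;      -- 160

-- the outer join then looks for the pair (128, 160), which doesn't exist
SELECT * FROM demo_ranges WHERE range_start = 128 AND range_end = 160;  -- empty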

This is a duplicate of the question here; however, I'm not voting to close it, as the accepted answer on that question is not very helpful. The answer by Quassnoi is much better (but it only links to the solution).
A linear index is not going to help resolve a database of ranges. The solution is to use geospatial indexing (available in MySQL and other DBMSs). An added complication is that MySQL geospatial indexing only works in 2 dimensions (while you have a 1-D dataset), so you need to map this to 2 dimensions.
Hence:
CREATE TABLE IF NOT EXISTS `inetnum` (
`from_ip` int(11) unsigned NOT NULL,
`to_ip` int(11) unsigned NOT NULL,
`netname` varchar(40) default NULL,
`ip_txt` varchar(60) default NULL,
`descr` varchar(60) default NULL,
`country` varchar(2) default NULL,
`rir` enum('APNIC','AFRINIC','ARIN','RIPE','LACNIC') NOT NULL default 'RIPE',
`netrange` linestring NOT NULL,
PRIMARY KEY (`from_ip`,`to_ip`),
SPATIAL KEY `rangelookup` (`netrange`)
) ENGINE=MyISAM DEFAULT CHARSET=ascii;
Which might be populated with....
INSERT INTO inetnum
(from_ip, to_ip
, netname, ip_txt, descr, country
, netrange)
VALUES
(INET_ATON('127.0.0.0'), INET_ATON('127.0.0.2')
, 'localhost','127.0.0.0-127.0.0.2', 'Local Machine', '.',
GEOMFROMWKB(POLYGON(LINESTRING(
POINT(INET_ATON('127.0.0.0'), -1),
POINT(INET_ATON('127.0.0.2'), -1),
POINT(INET_ATON('127.0.0.2'), 1),
POINT(INET_ATON('127.0.0.0'), 1),
POINT(INET_ATON('127.0.0.0'), -1))))
);
Then you might want to create a function to wrap the rather verbose SQL....
DROP FUNCTION `netname2`//
CREATE DEFINER=`root`@`localhost` FUNCTION `netname2`(p_ip VARCHAR(20) CHARACTER SET ascii) RETURNS varchar(80) CHARSET ascii
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE l_netname varchar(80);
SELECT CONCAT(country, '/',netname)
INTO l_netname
FROM inetnum
WHERE MBRCONTAINS(netrange, GEOMFROMTEXT(CONCAT('POINT(', INET_ATON(p_ip), ' 0)')))
ORDER BY (to_ip-from_ip)
LIMIT 0,1;
RETURN l_netname;
END//
And therefore:
SELECT netname2('127.0.0.1');
./localhost
Which uses the index:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE inetnum range rangelookup rangelookup 34 NULL 1 Using where; Using filesort
(and takes around 10msec to find a record from the combined APNIC,AFRINIC,ARIN,RIPE and LACNIC datasets on the very low spec VM I'm using here)

You can compare IP ranges using MySQL. This question might contain an answer you're looking for: MySQL check if an IP-address is in range?
SELECT * FROM TABLE_NAME WHERE (INET_ATON("193.235.19.255") BETWEEN INET_ATON(ipStart) AND INET_ATON(ipEnd));
You will likely want to index your database. This optimizes the time it takes to search your database, similar to the index you will find in the back of a textbook, but for databases:
ALTER TABLE `table` ADD INDEX `name` (`column_id`)
EDIT: Apparently INET_ATON cannot be used on indexed columns, so you would have to pick one of these!
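If the table stores dotted-quad strings, one way around that limitation (a sketch, not part of the original answer; assumes MySQL 5.7+ and the ipStart/ipEnd columns used above) is to index generated columns holding the integer form:
ALTER TABLE TABLE_NAME
  ADD COLUMN ipStartNum INT UNSIGNED AS (INET_ATON(ipStart)) STORED,
  ADD COLUMN ipEndNum INT UNSIGNED AS (INET_ATON(ipEnd)) STORED,
  ADD INDEX ipRange (ipStartNum, ipEndNum);

SELECT * FROM TABLE_NAME
WHERE INET_ATON('193.235.19.255') BETWEEN ipStartNum AND ipEndNum;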

Related

MySQL query taking a long time on Join

I have the following MySQL query which takes a long time:
SELECT `A`.*, max(B.timestamp) as timestamp2
FROM (`A`)
JOIN `B` ON `A`.`column1` = `B`.`column1`
WHERE `column2` = 'Player'
GROUP BY `column1`
ORDER BY `timestamp2` desc
I have an index on table A on column1; the indexes on table B are (column1, timestamp, column2), timestamp, and column1.
When I use EXPLAIN, it does not use the timestamp index.
Try adding an index...
... ON `B` (`column2`,`column1`,`timestamp`)
with the columns in that order.
Without any information about datatypes, we're going to guess that column2 is a character type (and we're going to assume that the column is in table B, given the information about the current indexes).
Absent any information about cardinality, we're going to guess that the number of rows that satisfy the equality predicate on column2 (in the WHERE clause) is a small subset of the total rows in B.
We expect that MySQL will make use of a "range" scan operation, using an index that has column2 as a leading column.
Given that the new index is a "covering" index for the query, we also expect the EXPLAIN output to show "Using index" in the Extra column.
We also expect that MySQL can use the index to satisfy the GROUP BY operation and the MAX aggregate, without requiring a filesort operation.
But we are still going to see a filesort operation, used to satisfy the ORDER BY.
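As a concrete sketch of that suggestion (the index name is made up):
ALTER TABLE `B` ADD INDEX `ix_col2_col1_ts` (`column2`, `column1`, `timestamp`);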

Caching big data, alternative query or other indexes?

I have a problem. I am working on highscores, and for those highscores you need to make a ranking based on skill experience and latest update time (to see who got the highest score first, in case skill experience is the same).
The problem is that with the query I wrote, it takes 28 (skills) x 0.7 seconds to create a personal highscore page showing a player's rank on the list. Requesting this in the browser is just not doable; it takes way too long for the page to load, and I need a solution for this issue.
MySQL version: 5.5.47
The query I wrote:
SELECT rank FROM
(
SELECT highscore.playerID, (@rowID := @rowID + 1) AS rank
FROM
(
SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
) highscore,
(SELECT @rowID := 0) r
) data
WHERE data.playerID = ?
As you can see, I first have to create a whole result set that gives me a full ranking for that game mode and skill, and then select the rank based on the playerID. The problem is that I cannot make the query stop as soon as it finds the result, because MySQL doesn't offer such a function; if I specified data.playerID = ? in the inner query, it would give back 1 result, meaning the ranking would be 1 as well.
The highscores table has 550k rows
What I have tried: storing the result set for each skillID/game-mode combination JSON-encoded in a temp table, and storing it in files, but it ended up being quite slow as well, because the files are really huge and take time to process.
Highscores table:
CREATE TABLE `highscores` (
`playerID` INT(11) NOT NULL,
`skillID` INT(10) NOT NULL,
`skillLevel` INT(10) NOT NULL,
`skillExperience` INT(10) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
PRIMARY KEY (`playerID`, `skillID`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
Overall table has got 351k rows
Overall table:
CREATE TABLE `overall` (
`playerID` INT(11) NOT NULL,
`playerName` VARCHAR(50) NOT NULL,
`totalLevel` INT(10) NOT NULL,
`totalExperience` BIGINT(20) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
`game_mode` ENUM('REGULAR','IRON_MAN','IRON_MAN_HARDCORE') NOT NULL DEFAULT 'REGULAR',
PRIMARY KEY (`playerID`, `playerName`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
EXPLAIN output from the query: [screenshot]
Does anybody have a solution for me?
No useful index for WHERE
The last 2 lines of the EXPLAIN (#3 DERIVED):
WHERE hs.skillID = ?
AND o.game_mode = ?
Since neither table has a suitable index to use for the WHERE clause, the optimizer decided to do a table scan of one of them (overall), then reach into the other (highscores). Having one of these indexes would help, at least some:
highscores: INDEX(skillID)
overall: INDEX(game_mode, ...) -- note that an index only on a low-cardinality ENUM is rarely useful.
(More in a minute.)
No useful index for ORDER BY
The optimizer sometimes decides to use an index for the ORDER BY instead of for the WHERE. But
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
cannot use an index, even though both are in the same table. This is because DESC and ASC are different. Changing ASC to DESC would have an impact on the resultset, but would allow
INDEX(skillExperience, updateTime)
to be used. Still, this may not be optimal. (More in a minute.)
Covering index
Another form of optimization is to build a "covering index". That is an index that has all the columns that the SELECT needs. Then the query can be performed entirely in the index, without reaching over to the data. The SELECT in question is the innermost:
( SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC, hs.updateTime ASC
) highscore,
For hs: INDEX(skillID, skillExperience, updateTime, playerID) is "covering" and has the most important item (skillID, from the WHERE) first.
For o: INDEX(game_mode, playerID) is "covering". Again, game_mode must be first.
If you change the ORDER BY to be DESC and DESC, then add another index for hs: INDEX(skillExperience, updateTime, skillID, playerID). Now the first 2 columns must be in that order.
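A sketch of the two main covering indexes suggested above as DDL (index names here are made up):
ALTER TABLE highscores
  ADD INDEX hs_covering (skillID, skillExperience, updateTime, playerID);
ALTER TABLE overall
  ADD INDEX o_covering (game_mode, playerID);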
Conclusion
It is not obvious which of those indexes the optimizer would prefer. I suggest you add both and let it choose.
I believe that (1) the innermost query is consuming the bulk of time, and (2) there is nothing to optimize in the outer SELECTs. So, I leave that as my recommendation.
Much of this is covered in my Indexing Cookbook.
Important sub-answer: how frequently does the rank of all players change? You probably don't need real-time statistics; instead, select a time interval for updating the statistics, e.g. 10 minutes. In that case you can run a cron job that inserts fresh rank statistics into a separate table, like this:
/* lock */
TRUNCATE TABLE rank_stat; /* or mark rows as unused/old (for history) instead of truncating */
INSERT INTO rank_stat (a, b, c, d) <your query here>;
/* unlock */
and users (browsers) will select read-only statistics from this table (which can be split into pages).
And if the rank statistics don't change frequently, you could instead recalculate them on all relevant game events and/or player acts/achievements.
These are recommendations only, since the full environment isn't explained here, but I think you can find the right solution with them.
It doesn't look like you really need to rank everyone; you just want to find out how many people are ahead of the current player. You should be able to get a simple count of how many players have better scores and dates than the current player, which represents the current player's ranking.
SELECT COUNT(*) AS rank FROM highscores
join highscores playerscore
on playerscore.skillID = highscores.skillID
and playerscore.gamemode = highscores.gamemode
where highscores.skillID = ?
AND highscores.gamemode = ?
and playerscore.playerID = ?
and (highscores.skillExperience > playerscore.skillExperience
or (highscores.skillExperience = playerscore.skillExperience
and highscores.updateTime > playerscore.updateTime));
(I joined the table to itself and aliased the second instance as playerscore so it was slightly less confusing)
You could probably even simplify it to one query by grouping and parsing the results within your language of choice.
SELECT
highscores.gamemode as gamemode,
highscores.skillID as skillID,
COUNT(*) AS rank
FROM highscores
join highscores playerscore
on playerscore.skillID = highscores.skillID
and playerscore.gamemode = highscores.gamemode
where playerscore.playerID = ?
and (highscores.skillExperience > playerscore.skillExperience
or (highscores.skillExperience = playerscore.skillExperience
and highscores.updateTime > playerscore.updateTime))
group by highscores.gamemode, highscores.skillID;
Not quite sure about the grouping bit though.

SQL query to make value monotonic?

I have a data table - a large one, with electricity consumption values.
Sometimes, due to a glitch, the value is smaller than the previous record, which then causes problems when processing.
monday 143 kWh
tuesday 140 kWh *glitch*
wednesday 150 kWh
I'd like to make the table monotonic. I'm interested in finding out whether there is an SQL query that will set each glitched value to the previous greatest value.
Is this possible to do without PHP?
The table is in the following format (when simplified a bit):
CREATE TABLE IF NOT EXISTS `history` (
`day` int(11) NOT NULL,
`value` float NOT NULL
)
I know how to do it in PHP, row by row, but if there's a cleaner SQL-only solution, that'd be superb!
You want the sequence to be "monotonic". "Monotonous" means boring.
If you have a lot of data, then the most efficient way is using variables:
select h.day,
       (@max := greatest(@max, h.value)) as value
from history h cross join
     (select @max := -1) params
order by h.day;
If you actually want to update the values, then you can do basically the same thing:
update history h
    set value = (@max := greatest(coalesce(@max + 0, 0), h.value))
    order by h.day;
Note that in this case, @max defaults to a string variable. (You cannot have both an ORDER BY and a JOIN in an UPDATE query.) So, either define the variable just before the UPDATE, or do a bit of string-to-number conversion, as the + 0 above does.
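As a quick demonstration on the sample data (a sketch; days encoded as integers 1-3 to fit the simplified schema):
CREATE TABLE IF NOT EXISTS history (`day` INT NOT NULL, `value` FLOAT NOT NULL);
INSERT INTO history (`day`, `value`) VALUES (1, 143), (2, 140), (3, 150);

SET @max := -1;
SELECT h.day, (@max := GREATEST(@max, h.value)) AS value
FROM history h
ORDER BY h.day;
-- day 1 -> 143, day 2 -> 143 (glitch lifted), day 3 -> 150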

MySQL query taking much time to load? How to tune the database efficiently

I have a table currently containing about 5 million rows. This is a live database where data is populated by a scraping script, which inserts data into the table continuously.
For example:
The business listing site gives me a JSON response on an API call; this is parsed and inserted into the database. A duplication check also happens in between. At a later phase I take the data obtained to generate reports.
While trying to generate reports based on the stored information, the script takes too long to execute.
The scraping script is live and will continue to update the table with records in the future.
Every month it's expected to get 0.7-1 million new records.
Following is the structure of my table,
CREATE TABLE IF NOT EXISTS `biz_listing` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lid` smallint(11) NOT NULL,
`name` varchar(300) NOT NULL,
`type` enum('cat1','cat2') NOT NULL,
`location` varchar(300) NOT NULL,
`businessID` varchar(300) NOT NULL,
`reviewcount` int(6) NOT NULL,
`city` varchar(300) NOT NULL,
`categories` varchar(300) NOT NULL,
`result_month` varchar(10) NOT NULL,
`updated_date` date NOT NULL,
PRIMARY KEY (`id`),
KEY `biz_date` (`businessID`,`updated_date`),
KEY `type_date` (`type`,`updated_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The records fall under two categories, 'cat1' and 'cat2' .
(I am planning to add a new category, say cat3.)
I need a same-station aggregate report section, which shows the business IDs that appear in every month of a selected range of months.
Here it is chosen as June-July 2014.
Report on aggregate numbers # category
SELECT COUNT(t.`businessID`) AS bizcount, SUM(t.reviewcount) AS reviewcount, t.`type`
FROM `biz_listing` t
INNER JOIN
( SELECT `businessID`,count(*) c FROM `biz_listing` WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01') GROUP
BY `businessID`,`type` HAVING c = 2 ) t2
ON t2.`businessID` = t.`businessID`
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01') GROUP BY t.`type`
EXPLAIN (done on a backup table of 4 million rows): [screenshot]
Report on aggregate numbers # based on cities
SELECT COUNT(t.`businessID`) AS bizcount, SUM(t.reviewcount) AS reviewcount, t.`type`, t.`location` as city
FROM `biz_listing` t
INNER JOIN
( SELECT `businessID`,count(*) c FROM `biz_listing` WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01') GROUP
BY `businessID`,`type` HAVING c = 2 ) t2
ON t2.`businessID` = t.`businessID`
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01') GROUP BY t.`location`, t.`result_month`
Here we are selecting a range of months (June-July), so the subquery lists all the businessIDs common to both months.
The 1st query outputs according to type of business.
The 2nd query outputs according to location.
The problem is that the query takes a very long time to execute (600 seconds and more); sometimes it dies before completion.
Please suggest optimizations for the query if you see any.
I think indexing is affecting the insertion performance of the scraping script.
How can I modify the current script considering both insertion and retrieval performance?
Thanks in advance.
EDIT
I tried the suggested covering indexes and it's taking much more time than usual :(
The EXPLAIN is as follows: [screenshot]
This is a MyISAM table, which offers less contention between inserting queries and reporting queries than InnoDB. Therefore, let's focus first on the reporting queries. It is true that indexes slow down inserts. But queries slow down a LOT because of missing or incorrect indexes.
To troubleshoot this performance problem, I believe it's helpful to consider the various subqueries separately.
So let's start with one of them.
SELECT `businessID`,
count(*) c
FROM `biz_listing`
WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01')
GROUP BY `businessID`,`type`
HAVING c = 2
This subquery is straightforward and basically well-constructed. It's capable of using an index to jump to the first record meeting the updated_date range criterion, then linearly scanning that index looking for the last such record. If the type column is in that index, the query can collect the record counts it needs as it scans. That's fast.
But, you don't have that index! So this subquery is doing a full table scan. As we say in New England, that's wicked slow.
If you took your compound covering index (type,updated_date) and exchanged the order of the two columns to give (updated_date,type), it would serve as a high-performance covering index for this query. As it stands, the order in which the columns appear in your compound index keeps it from helping this query.
Let's take a look at your first main query in the same light (omitting the subquery).
SELECT COUNT(t.`businessID`) AS bizcount,
SUM(t.reviewcount) AS reviewcount, t.`type`
FROM `biz_listing` t
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01')
GROUP BY t.`type`
(Something's not clear here. You say COUNT(t.businessID), but it's possible you want COUNT(DISTINCT t.businessID). What you have will give the same result as COUNT(*), because there are no NULL values of businessID. If you do this, you can put HAVING COUNT(DISTINCT businessID) = 2 in the query and get rid of your need for the subquery.)
This query works similarly to the previous one. It scans an index over the updated_date range, then by type, then picks up values of businessID and reviewcount. So a compound index in this order will allow this query to be satisfied by a pure index scan, which will be fast.
(updated_date, type, businessID,reviewcount)
Notice that any query that can be satisfied from the (updated_date, type) index can also be satisfied from this one, so you don't need them both.
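A sketch of that covering index as DDL (the index name is made up):
ALTER TABLE biz_listing
  ADD INDEX date_type_biz_rev (updated_date, type, businessID, reviewcount);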
Go read about compound covering indexes, tight range scans, and loose range scans.
Your other query will probably be greatly improved by this same index. Give it a try.
You have a backup table it seems. You can experiment with various compound indexes in that table until you get good results.
I'm reluctant to give this sort of advice:
TL;DR: change your indexes from this to that
because then you may just come back to SO with the next question and be tempted to become a support leech (see "Can I avoid being a leech when I am a beginner in a topic and only ask questions?").
You know... teach a person to fish, etc.

MySQL: find the smallest unique ID available

I have a column ID and something like 1000 items; some of them were removed, like id=90, id=127, id=326.
How can I make a query to look for those available IDs, so I can reuse them for another item?
It's like MIN(ID), but I want to find only the IDs that are NOT in my database, so if I remove an item with ID = 90, next time I click on ADD ITEM it would be inserted with id = 90.
You can get the minimum available ID using this query:
SELECT MIN(t1.ID + 1) AS nextID
FROM tablename t1
LEFT JOIN tablename t2
ON t1.ID + 1 = t2.ID
WHERE t2.ID IS NULL
What it does is join the table with itself and check, for each ID, whether ID + 1 exists. If the join finds nothing (NULL), then that ID + 1 is available. Suppose you have a table where the IDs are:
1
2
5
6
Then, this query will give you result as 3 which is what you want.
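A quick way to try it out (a sketch using a hypothetical scratch table):
CREATE TABLE gap_demo (ID INT PRIMARY KEY);
INSERT INTO gap_demo (ID) VALUES (1), (2), (5), (6);

SELECT MIN(t1.ID + 1) AS nextID
FROM gap_demo t1
LEFT JOIN gap_demo t2 ON t1.ID + 1 = t2.ID
WHERE t2.ID IS NULL;  -- returns 3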
Do not reuse IDs. You usually have more than enough available IDs, so you don't have to care about fragmentation.
For example, if you re-use IDs, links from search engines might point to something completely unrelated from whatever is in the search index - showing a "not found" error is much better in such a case.
It's against the concept of surrogate keys to try to reuse IDs.
A surrogate key is good because it identifies the record itself, not some object in real life. If the record is gone, the ID is gone too.
Experienced DB developers are not afraid of running out of numbers, because they know how many centuries it would take to deplete, say, long integer numbers.
BTW, you may experience locking problems or uniqueness violations in a multithreaded environment, with simultaneous transactions trying to find a gap in the ID sequence. The auto-increment ID generators provided by DB servers usually work outside the transaction scope and thus generate good surrogate keys.
Further reading: Surrogate keys
The query is like:
SELECT MIN(tableFoo.uniqueid + 1) AS nextID
FROM tableFoo
LEFT JOIN tableFoo tf1
ON tableFoo.uniqueid + 1 = tf1.uniqueid
WHERE tf1.uniqueid IS NULL
Note that the answers by shamittomar and Haim Evgi don't work if the lowest ID is free. To allow for refilling the lowest ID, pre-check whether it is available:
SELECT TRUE FROM tablename WHERE ID = 1;
If this returns anything, then the ID of 1 is not free and you should use their answer. But if the ID of 1 is free, just use that.
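If you'd rather fold that pre-check into a single statement, one way (a sketch, not from the original answers) is to seed the join with a zero row, so a free ID of 1 is also found:
SELECT MIN(t1.ID + 1) AS nextID
FROM (SELECT 0 AS ID UNION ALL SELECT ID FROM tablename) t1
LEFT JOIN tablename t2 ON t1.ID + 1 = t2.ID
WHERE t2.ID IS NULL;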
In my personal opinion, instead of removing the row, it would be light years less expensive to have a Boolean column for "Removed"/"Deleted", and for extra security overwrite the row with blanks when you set the removed flag:
UPDATE table SET data = ' ', removed = TRUE WHERE id = ##
(## is the actual id, btw)
Then you can
SELECT * FROM table WHERE removed = TRUE ORDER BY id ASC
This will make your database perform better and save you dough on servers, not to mention ensure no nasty errors occur.
Given that your database is small enough, the correct answer is to not reuse your IDs at all and just ensure the column is an auto-incremented primary key. The table is a thousand records, so you can do this without any cost.
However, if you have a table of a few million records (or longer IDs), you will find that the accepted answer won't finish in sensible time.
The accepted answer will give you the smallest of these values, correctly so; however, you are paying the price of not using an auto-increment column, or, if you have one, of not using it as the actual ID as intended. (Like me, else I wouldn't be here: I'm at the mercy of a legacy application where the ID being used isn't the actual primary key, and is randomly generated with a lolgorithm for no good reason, so I needed a means to replace it, since widening the column range is now an extremely costly change.)
Here, the accepted answer is figuring out the entire join between t1 and t2 before reporting the minimum of those joins. In essence, you only care about the first row where t2.ID is NULL, regardless of whether it actually is the smallest or not.
So take the MIN out and add a LIMIT 1 instead.
Edit: since it's not a primary key, you will also need to check for NOT NULL, since a primary key field can't be NULL but this one can:
SELECT t1.ID + 1 AS nextID
FROM tablename t1
LEFT JOIN tablename t2
ON t1.ID + 1 = t2.ID
WHERE t2.ID IS NULL
AND t1.ID IS NOT NULL
LIMIT 1
This will always give you an id that you can use, its just not guaranteed to always be the smallest one.
