I have a dataset of rows, each with an 'odds' number between 1 and 100. I want to select one row at random, weighted by those odds, in the most efficient way possible. The odds do not necessarily add up to 100.
I have had a few ideas.
a)
Select the whole dataset, add up all the odds, and generate a random number between 1 and that total. Then loop through the dataset, deducting each row's odds from the number until it reaches 0 or below; the row that gets it there is the pick.
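Idea a) can be sketched outside the database like this. It is only an illustration in Python rather than PHP, and the row layout (a dict with an "odds" key) and the function name pick_weighted are assumptions, not part of the original setup.

```python
import random

def pick_weighted(rows, rng=random):
    """Return one row, chosen with probability proportional to row['odds']."""
    total = sum(row["odds"] for row in rows)
    # Draw an integer in 1..total, as described in idea a).
    remaining = rng.randint(1, total)
    for row in rows:
        # Deduct each row's odds until the number is used up.
        remaining -= row["odds"]
        if remaining <= 0:
            return row
    raise ValueError("no row could be selected")
```

This still reads every row once, so it does nothing to reduce load on the database; it only shows that the deduction loop gives each row a chance proportional to its odds.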
I was hoping to minimize the impact on the database so I considered if I could only select the rows I needed.
b)
SELECT * FROM table WHERE (100*RAND()) < odds
I considered LIMIT 0,1
But then if items have the same probability, only one of them will be returned.
Alternatively take the whole dataset and pick a random one from there... but then the odds are affected as it becomes a random with odds and then a random without odds thus the odds become tilted in favour of the higher odds (even more so).
I guess I could order by odds ASC then take the whole dataset and then with PHP take a random out of the rows with the same odds as the first record (the lowest).
Seems like a clumsy solution.
Does anyone have a superior solution? If not which one of the above is best?
Do some up-front work, add some columns to your table that help the selection. For example suppose you have these rows
X 2
Y 3
Z 1
We add some cumulative values
Key  Odds  Start  End
X    2     0      1    // range 0->1, 2 values == odds
Y    3     2      4    // range 2->4, 3 values == odds
Z    1     5      5    // range 5->5, 1 value == odds
Start and End are chosen as follows. The first row has a start of zero. Subsequent rows have a start one more than previous end. End is the (Start + Odds - 1).
Now pick a random number R in the range 0 to Max(End)
Select * from T where R >= T.Start and R <= T.End
If the database is sufficiently clever, we may be able to use
Select * from T where R >= T.Start and R <= (T.Start + T.Odds - 1)
I'm speculating that having an End column with an index may give better performance. Also, Max(End) could perhaps be stashed somewhere and updated by a trigger when necessary.
Clearly there's some hassle in updating the Start/End. This may not be too bad if either
The table contents are stable
or insertions are in some way naturally ordered, so that each new row just continues from the old highest.
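The Start/End scheme can be checked with a small sketch. This is Python standing in for the table and the range query; build_ranges, lookup and pick are hypothetical names, and the binary search plays the role an index on End might play.

```python
import bisect
import random

def build_ranges(rows):
    """Return (key, start, end) tuples with inclusive ranges, one per row."""
    ranges = []
    start = 0  # the first row starts at zero
    for key, odds in rows:
        end = start + odds - 1
        ranges.append((key, start, end))
        start = end + 1  # the next row starts one past the previous end
    return ranges

def lookup(ranges, r):
    """Find the key whose inclusive [start, end] range contains r."""
    ends = [end for _, _, end in ranges]
    # First range whose end is >= r; ranges are contiguous, so it contains r.
    return ranges[bisect.bisect_left(ends, r)][0]

def pick(ranges, rng=random):
    """Draw R in 0..Max(End) and return the matching key."""
    max_end = ranges[-1][2]
    return lookup(ranges, rng.randint(0, max_end))
```

For the X/Y/Z example, the values 0 to 5 map to X, X, Y, Y, Y, Z, so each key is hit in proportion to its odds.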
What if you took your code, and added an ORDER BY RAND() and LIMIT 1?
SELECT * FROM table WHERE (100*RAND()) < odds ORDER BY RAND() LIMIT 1
This way, even if you have multiples of the same probability, it will always come back randomly ordered, then you just take the first entry.
select * from table
where id between 1 and 100 and ((id % 2) <> 0)
order by NewId()
Hmm. Not entirely clear what result you want, so bear with me if this is a bit crazy. That being said, how about:
Make a new table. The table is a fixed data table, and looks like this:
Odds
====
1
2
2
3
3
3
4
4
4
4
etc,
etc.
Then join from your dataset to that table on the odds column. You'll get as many rows back for each row in your table as the given odds of that row.
Then just pick one of that set at random.
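As a rough sketch of what that join produces (in Python rather than SQL, with hypothetical names), each row is repeated once per unit of odds and a uniform pick is then made from the expanded set:

```python
import random

def expand_by_odds(rows):
    """Repeat each key once per unit of odds, like joining to the fixed table."""
    return [key for key, odds in rows for _ in range(odds)]

def pick(rows, rng=random):
    """Pick uniformly from the expanded set, which weights keys by odds."""
    return rng.choice(expand_by_odds(rows))
```

The expanded set has as many entries as the sum of all odds, so this trades memory (or join output size) for simplicity.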
If you have an index on the odds column, and a primary key, this would be very efficient:
SELECT id, odds FROM table WHERE odds > 0
The database wouldn't even have to read from the table, it would get everything it needed from the odds index.
Then, you'll select a random value between 1 and the number of rows returned.
Then select that row from the array of rows returned.
Then, finally, select the whole target row:
SELECT * FROM table WHERE id = ?
This assures an even distribution between all rows with an odds value.
Alternatively, put the odds in a different table, with an autoincrement primary key.
Odds
ID odds
1 4
2 9
3 56
4 12
Store the ID foreign key in the main table instead of the odds value, and index it.
First, get the max value. This never touches the database. It uses the index:
SELECT MAX(ID) FROM Odds
Get a random value between 1 and the max.
Then select the record.
SELECT * FROM table
JOIN Odds ON Odds.ID = table.ID
WHERE Odds.ID >= ?
LIMIT 1
This will require some maintenance if you tend to delete Odds value or roll back inserts to keep the distribution even.
There is a whole chapter on random selection in the book SQL Antipatterns.
I didn't try it, but maybe something like this (with ? a random number from 0 to SUM(odds) - 1)?
SET @prob := 0;
SELECT
T.*,
(@prob := @prob + T.odds) AS prob
FROM table T
HAVING prob > ?
LIMIT 1
This is basically the same as your idea a), but entirely within one (well, technically two if you count the variable set-up) SQL commands.
A general solution, suitable for O(log(n)) updates, is something like this:
Store objects as leaves of a (balanced) tree.
At each branch node, store the weights of all objects under it.
When adding, removing, or modifying nodes, update weights of their parents.
Then pick a number between 0 and (total weight - 1) and navigate down the tree until you find the right object.
Since you don't care about the order of things in the tree, you can store them as an array of N pointers and N-1 numbers.
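One possible reading of that array layout, as a minimal Python sketch: the class name WeightTree and the exact indexing are assumptions. Leaves sit in the second half of the array and every branch node stores the sum of its two children.

```python
import random

class WeightTree:
    """Binary tree in a flat array: leaves hold weights, branches hold subtree sums."""

    def __init__(self, weights):
        self.n = len(weights)
        # Slot 0 is unused; 1..n-1 are branch nodes, n..2n-1 are leaves.
        self.sums = [0] * self.n + list(weights)
        for i in range(self.n - 1, 0, -1):
            self.sums[i] = self.sums[2 * i] + self.sums[2 * i + 1]

    def total(self):
        return self.sums[1]

    def update(self, index, weight):
        """Change one weight and refresh the sums of its ancestors: O(log n)."""
        i = index + self.n
        self.sums[i] = weight
        i //= 2
        while i >= 1:
            self.sums[i] = self.sums[2 * i] + self.sums[2 * i + 1]
            i //= 2

    def find(self, r):
        """Walk down from the root to the leaf that owns r, for 0 <= r < total."""
        i = 1
        while i < self.n:
            left = 2 * i
            if r < self.sums[left]:
                i = left
            else:
                r -= self.sums[left]
                i = left + 1
        return i - self.n

    def pick(self, rng=random):
        return self.find(rng.randrange(self.total()))
```

The leaves are not visited in array order when N is not a power of two, which is fine here because only the proportions matter.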
We have records with a count field on an unique id.
The columns are:
mainId = unique
mainIdCount = 1320 (this 'views' field gets a + 1 when the page is visited)
How can you insert all these mainIdCount values as separate records in another table, in another database, in one query?
Yes, I do mean 1320 times an insert with the same mainId! :-)
We actually have records that go over 10,000 times an id. It just has to be like this.
This is a weird one, but we do need the copies of all these (just) counts like this.
The most straightforward way to do this is with a JOIN operation between your table and another row source that provides a set of integers. We'd match each row from our original table to as many rows from the set of integers as needed to satisfy the desired result.
As a brief example of the pattern:
INSERT INTO newtable (mainId,n)
SELECT t.mainId
, r.n
FROM mytable t
JOIN ( SELECT 1 AS n
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
) r
WHERE r.n <= t.mainIdCount
If mytable contains row mainId=5 mainIdCount=4, we'd get back rows (5,1),(5,2),(5,3),(5,4)
Obviously, the rowsource r needs to be of sufficient size. The inline view I've demonstrated here would return a maximum of five rows. For larger sets, it would be beneficial to use a table rather than an inline view.
This leads to the followup question, "How do I generate a set of integers in MySQL",
e.g. Generating a range of numbers in MySQL
And getting that done is a bit tedious. We're looking forward to an eventual feature in MySQL that will make it much easier to return a bounded set of integer values; until then, having a pre-populated table is the most efficient approach.
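To see the pattern run end to end with a pre-populated integer table, here is a small sketch using Python's sqlite3 module. SQLite stands in for MySQL purely for illustration, and the numbers table and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (mainId INTEGER PRIMARY KEY, mainIdCount INTEGER);
    CREATE TABLE newtable (mainId INTEGER, n INTEGER);
    CREATE TABLE numbers (n INTEGER PRIMARY KEY);
    INSERT INTO mytable VALUES (5, 4), (7, 2);
""")
# Pre-populate the integer table; it must reach the largest mainIdCount.
conn.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(1, 11)])
# One INSERT ... SELECT creates mainIdCount copies of every mainId.
conn.execute("""
    INSERT INTO newtable (mainId, n)
    SELECT t.mainId, r.n
    FROM mytable t
    JOIN numbers r ON r.n <= t.mainIdCount
""")
rows = conn.execute("SELECT mainId, n FROM newtable ORDER BY mainId, n").fetchall()
```

With counts above 10,000 the numbers table would need at least that many rows.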
I have this table (PERSONS) with 25M rows:
ID int(10) PK
points int(6) INDEX
some other columns
I want to show the user 4 random rows which are somewhat close to each other in points. After some searching and tuning I found this query to generate random rows, which is impressively fast:
SELECT person_id, points
FROM persons AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(person_id)
FROM persons)) AS id)
AS r2
WHERE r1.person_id>= r2.id and points > 0
ORDER BY r1.person_id ASC
LIMIT 4
So I run this query from PHP, which gives me great and fast results (below 0.05 seconds when warmed up). But these rows are really just random (with at least 1 point, because of points > 0). I would like to show rows that are somewhat close to each other. It doesn't have to be every time, but let's say I run this query with LIMIT 50 and then select a random row in PHP and the 3 closest rows (based on points) next to it. I think you would need to sort the result, pick a random row, and show the rows after/before it. But I have no idea how to do this, since I am quite new to PHP.
Any suggestions? All feedback is welcome :)
Build an index on your points column (if it does not already exist), then perform your randomisation logic on that:
ALTER TABLE persons ADD INDEX (points);
SELECT person_id, points
FROM persons JOIN (
SELECT RAND() * MAX(points) AS pivot
FROM persons
WHERE points > 0
) t ON t.pivot <= points
ORDER BY points
LIMIT 4
Note that this approach will select the pivot using a uniform probability distribution over the range of points values; if points are very non-uniform, you can end up pivoting on some values a lot more often than others (thereby resulting in seemingly "non-random" outcomes).
To resolve that, you can select a random record by a more uniformly distributed column (maybe person_id?) and then use the points value of that random record as the pivot; that is, substitute the following for the subquery in the above statement:
SELECT points AS pivot
FROM persons JOIN (
SELECT FLOOR(
MIN(person_id)
+ RAND() * (MAX(person_id)-MIN(person_id))
) AS random
FROM persons
WHERE points > 0
) r ON r.random <= person_id
WHERE points > 0
ORDER BY person_id
LIMIT 1
Removing the subquery will drastically improve performance and caching. You could, for example, get the list of your IDs, put it in a file, and then pick at random from it (for example by reading random lines from the file). This will improve things a lot, as you can see if you run EXPLAIN on this query and compare it with a query that loads just the data for the 4 (still random) IDs.
I would suggest doing two separate SQL queries in PHP rather than joining or nesting them. In many cases the optimizer cannot simplify your query and has to perform each part separately. So in your case, if you have 1000 persons, the optimizer will do the following queries in the worst case:
Get 1000 persons rows
Do a sub-select for each person, which gets 1000 persons rows
Join 1000 persons with joined rows resulting in 1.000.000 rows
Filter all of them
In short:
1001 queries with 1.000.000 rows
My advice?
Perform two queries and NO joins or sub-selects, as both (especially in combination) cause dramatic performance drops in most cases.
SELECT person_id, points
FROM persons
ORDER BY RAND() LIMIT 1
Now use the found points for your second query
SELECT person_id, points, ABS(points - <POINTS FROM ABOVE>) AS distance
FROM persons
ORDER BY distance ASC LIMIT 4
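The logic of those two queries, sketched in Python over an in-memory list instead of the persons table (the function name and the tuple layout are assumptions):

```python
import random

def random_with_neighbours(people, count=4, rng=random):
    """people is a list of (person_id, points) tuples.

    Picks a random pivot row, then returns the `count` rows whose points
    are closest to the pivot's points. The pivot itself has distance 0,
    so it is always part of the result.
    """
    _pivot_id, pivot_points = rng.choice(people)
    by_distance = sorted(people, key=lambda p: abs(p[1] - pivot_points))
    return by_distance[:count]
```

Note that the SQL version computes ABS() for every row, so it most likely cannot use the index on points; on 25M rows it would be worth checking with EXPLAIN before relying on it.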
I'm facing a unique challenge.
I got a table with 100 numbers called HUNDREDNUMBERS.
I want to select the best quarter (75 to 100 numbers),
and place them into another table called BESTQUARTER.
I also want to select the worst quarter (1 to 25 numbers)
I want to place these into another table called WORSTQUARTER.
Here's my MySQL code so far:
$Extract_Data = "
CREATE TABLE $BESTQUARTER
SELECT
HUNDREDNUMBERS.number
FROM
HUNDREDNUMBERS order by
HUNDREDNUMBERS.number desc LIMIT 25 ";
$QuerySuccess = mysql_query($Extract_Data, $connection);
and for the other table....
$Extract_Data = "
CREATE TABLE $WORSTQUARTER
SELECT
HUNDREDNUMBERS.number
FROM
HUNDREDNUMBERS order by
HUNDREDNUMBERS.number asc LIMIT 25 ";
$QuerySuccess = mysql_query($Extract_Data, $connection);
The problem is that this script is not 100% correct every time.
Notice the ASC and the DESC in the two queries.
It's an ingenious way of trying to sort the numbers.
BTW, some of the numbers in the HUNDREDNUMBERS table have decimal points.
I need the data in the two new tables BESTQUARTER and WORSTQUARTER for further processing.
Any help is greatly appreciated
You're doing string comparisons and those follow different rules than numeric data types; I would suggest to change your sort expressions:
ORDER BY CAST(HUNDREDNUMBERS.number AS UNSIGNED) DESC|ASC
Instead of UNSIGNED you could also use SIGNED or DECIMAL(M, N) if you need to support negative numbers or floating points respectively.
Alternatively (and preferably), you could change the number column to a type that sorts properly by itself; VARCHAR should mostly be used for text.
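The effect is easy to reproduce outside MySQL. This Python sketch only illustrates text ordering versus numeric ordering (MySQL's collation rules differ in detail), using a few sample values:

```python
values = ["100", "99.009987655", "95.98765432", "11.98765432", "25.12"]

# Compared as text, "100" sorts below "11.98765432": the strings are
# compared character by character, and "0" comes before "1".
as_text = sorted(values, reverse=True)

# Compared as numbers, 100 is the largest value, as intended.
as_numbers = sorted(values, key=float, reverse=True)
```

Here as_text ends with "100" while as_numbers starts with it, which is the kind of misordering a VARCHAR column can produce.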
You should check the data types. Make sure the numbers are stored as at least a decimal. Other data types can cause the sorting to be off (a quite common mistake). It seems simple, but from my understanding of the question, your code actually looks correct.
If you have only 100 numbers, I would suggest that you create a view with a rank, and use that for subsequent processing. Using intermediate tables seems like overkill:
select hn.*,
(select count(*) from hundrednumbers hn2 where hn2.number <= hn.number
) as rank
from HundredNumbers hn
With an index on hundrednumbers(number), this will even have decent performance.
It is possible that the problem you are encountering is duplicates in the original data. If so, looking at the ranks can help you figure out what to do in this situation.
After long hours of thinking and testing, I believe I finally cracked it.
1) I changed the type of the field "numbers" to DOUBLE UNSIGNED.
(Initially I was using VARCHAR(50).)
2) Whenever you are using two or more tables that have the same field names, prefix EVERY fieldname with its tablename.
I did that and it worked, as you shall see in the full query below.
3) the original data had multiple occurrences of the same numbers,
ie there were several instances of rows with the value 100.
MySQL transferred only a single row with the value 100, into the table BESTQUARTER. (i don't know why).
uniqueid | id  | numbers
1        | 200 | 100
2        | 6   | 100
3        | 76  | 100
4        | 64  | 99.009987655
5        | 10  | 95.98765432
6        | 11  | 11.98765432
7        | 12  | 25.12
8        | 13  | 53.173543
9        | 153 | 72.87676
10       | 32  | 99
So i added "GROUP By" and used the ID field.
(nb: "uniqueid" column is the primary key, "id" is a unique key that uniquely identifies each number)
Here's the new code
create table BESTQUARTER
select
HUNDREDNUMBERS.uniqueid ,
HUNDREDNUMBERS.id,
HUNDREDNUMBERS.numbers
FROM
HUNDREDNUMBERS
group by HUNDREDNUMBERS.id
ORDER BY HUNDREDNUMBERS.numbers DESC LIMIT 25
I was using ORDER BY RAND() to fetch random rows from the database without any issue, but I realised that as the database grows, RAND() causes heavy load on the server. So I looked for an alternative: I generated one random number with the PHP rand() function and used it as the id in the MySQL query. That was very fast, since MySQL knew the row id.
But the issue is that not all numbers exist in my table; the ids are, for example, 1, 2, 5, 9, 12.
If PHP rand() generates 3 or 4, the query returns nothing, as there is no row with id 3 or 4.
What is the best way to generate random numbers, preferably from PHP, that are guaranteed to be ids that exist in the table? Please advise.
$id23=rand(1,100000000);
SELECT items FROM tablea where status='0' and id='$id23' LIMIT 1
The above query is fast but sometimes generates a number that does not exist in the database.
SELECT items FROM tablea where status=0 order by rand() LIMIT 1
the above query is too slow and causes heavy load on server
First of all, generate a random value from 1 to MAX(id), not 100000000.
Then there are at least a couple of good solutions:
Use > not =
SELECT items FROM tablea where status='0' and id>'$id23' LIMIT 1
Create an index on (status,id,items) to make this an index-only query.
Use =, but just try again with a different random value if you don't find a hit. Sometimes it will take several tries, but often it will take only one try. The = should be faster since it can use the primary key. And if it's faster and gets it in one try 90% of the time, that could make up for the other 10% of the time when it takes more than one try. Depends on how many gaps you have in your id values.
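The two options behave differently when the ids have gaps, which a small Python sketch makes visible. The function names are hypothetical, and the first one uses >= rather than > so that a guess equal to MAX(id) still finds a row.

```python
import bisect
import random

def pick_next_at_or_after(sorted_ids, max_id, rng=random):
    """Take the first existing id >= the guess.

    Always succeeds in one step, but an id that follows a gap is chosen
    more often, because every guess inside the gap lands on it.
    """
    guess = rng.randint(1, max_id)
    return sorted_ids[bisect.bisect_left(sorted_ids, guess)]

def pick_with_retry(existing_ids, max_id, rng=random, max_tries=1000):
    """Guess an id and retry on a miss; uniform over the existing ids."""
    for _ in range(max_tries):
        guess = rng.randint(1, max_id)
        if guess in existing_ids:
            return guess
    return None  # gave up; only likely when almost all ids are missing
```

With ids 1, 2, 5, 9, 12 the first function returns 5 for guesses 3, 4 and 5, so the choice depends on whether speed or an even distribution matters more.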
Use your DB to find the max value from the table, generate a random number less than or equal to that value, grab the first row in which the id is greater than or equal to your random number. No PHP necessary.
SELECT items
FROM tablea
WHERE status = '0' and
id >= FLOOR(1 + RAND() * (SELECT MAX(id) FROM tablea))
LIMIT 1
You are correct, ORDER BY RAND() is not good solution if you are dealing with large datasets. Depending how often it needs to be randomized, what you can do is generate a column with a random number and then update that number at some predefined interval.
You would take that column and use it as your sort index. This works well for a read-heavy environment and produces a predictable random order for a certain period of time.
A possible solution is to use limit:
$id23=rand(0,$numberOfRows-1);
SELECT items FROM tablea where status='0' LIMIT $id23, 1
This won't produce any misses, but (as hek2mgl mentioned) it requires knowing the number of rows matched by the select.
I have a table with 300 000 rows. There is a dedicated field (called order_number) in this table to store a number, which is later used to present the data from this table ordered by the order_number field. What is the best and easiest way to assign a random number/hash to each of the records, in order to select the records ordered by these numbers? The number of rows in the table is not stable and may grow to 1 000 000, so the method should take that into account.
Look at this tutorial on selecting random rows from a table.
If you don't want to use MySQL's built in RAND() function you could use something like this:
select max(id) from table;
$random_number = ...
select * from table where id > $random_number;
That should be a lot quicker.
UPDATE table SET order_number = sha2(id, 256)
or
UPDATE table SET order_number = RAND(id)
sha2() is more random than RAND().
I know you've got enough answers but I would tell you how we did it in our company.
The first approach we use is with additional column for storing random number generated for each record/row. We have INDEX on this column, allowing us to order records by it.
id, name , ordering
1 , zlovic , 14
2 , silvas , 8
3 , jouzel , 59
SELECT * FROM table ORDER BY ordering ASC/DESC
PROS: you have an index and ordering is very fast
CONS: you will depend on new records to keep the randomization of the records
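A minimal sketch of the stored-random-key idea, in Python rather than SQL; the function names are assumptions, and a dict stands in for the indexed ordering column.

```python
import random

def assign_order_numbers(ids, rng=random):
    """Give every id a random sort key, like filling the ordering column."""
    return {row_id: rng.random() for row_id in ids}

def in_random_order(order_numbers):
    """Return the ids sorted by their stored key (ORDER BY ordering)."""
    return sorted(order_numbers, key=order_numbers.get)
```

The order stays the same until the keys are regenerated; new rows can be given a key on insert without touching the existing ones.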
The second approach we have used is what Karl Roos gave in his answer. We retrieve the number of records in our database and, using > (greater than) and some math, we retrieve randomized rows. We are working with binary ids, so we need to keep an autoincrement column to avoid random writes in InnoDB. Sometimes we perform two or more queries to retrieve all of the data, keeping it randomized enough. (If you need 30 random items from 1,000,000 records, you can run 3 simple SELECTs, each for 10 items with a different offset.)
Hope this helps you. :)