How to sort selected rows (truly) randomly? - php

Currently, I am using this query in my PHP script:
SELECT * FROM `ebooks` WHERE `id`!=$ebook[id] ORDER BY RAND() LIMIT 125;
The database will hold about 2500 rows at most, but I've read that ORDER BY RAND() will eventually slow down the processing time as the data in the database grows.
So I am looking for an alternative method to keep my query running smoothly.
Also, I noticed that ORDER BY RAND() does not truly randomize the rows; I often see it follow some kind of pattern that repeats over and over again.
Is there any method to truly randomize the rows?

The RAND() function is a pseudo-random number generator; if you do not initialize it with different values, it will give you the same sequence of numbers. So what you should do is:
SELECT * FROM `ebooks` WHERE `id`!=$ebook[id] ORDER BY RAND(UNIX_TIMESTAMP()) LIMIT 125;
which will seed the random number generator from the current time and will give you a different sequence of numbers.
RAND() will slow down the SELECT's ORDER BY clause, since it has to generate a random number for every row and then sort by those values. I would suggest having the data returned to the calling program and randomizing it there, using something like array_rand.
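For illustration, a minimal sketch of that approach using PDO and shuffle() (a close cousin of the array_rand suggestion; the $db connection handle is an assumption):
// Fetch the candidate rows once, excluding the current ebook.
$stmt = $db->prepare("SELECT * FROM `ebooks` WHERE `id` != ?");
$stmt->execute([$ebook['id']]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Randomize in PHP rather than in MySQL, then keep the first 125 rows.
shuffle($rows);
$rows = array_slice($rows, 0, 125);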

This question has already been answered:
quick selection of a random row from a large table in mysql
Here too:
http://snippetsofcode.wordpress.com/2011/08/01/fast-php-mysql-random-rows/

Related

paginating random list sql

I'm working on a basic paginated list where I need all the results to be retrieved randomly,
which is why I use:
SELECT *
FROM table
WHERE 1
ORDER BY rand()
It works well until I need to paginate it...
SELECT *
FROM table
WHERE 1
ORDER BY rand()
LIMIT $offset, $recordsperPage
How do I retrieve the whole list in random order, but make sure that when it is paginated, no page repeats results from the prior pages?
If you provide a consistent seed to the rand() call, you'll get the same sequence. You might use something derived from the date, for example, to give the same results for the day, or generate some other random seed from PHP and save it in the session, to give the same results only for a given visitor.
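A rough sketch of the per-visitor seed idea (the session key name is invented; $offset and $recordsperPage come from the question):
session_start();

// One seed per visitor, reused across page loads, so RAND($seed)
// produces the same ordering on every page.
if (!isset($_SESSION['rand_seed'])) {
    $_SESSION['rand_seed'] = mt_rand();
}
$seed = (int) $_SESSION['rand_seed'];

$offset = (int) $offset;
$perPage = (int) $recordsperPage;

// Same seed => same shuffled order, so consecutive pages never overlap.
$sql = "SELECT * FROM `table` ORDER BY RAND($seed) LIMIT $offset, $perPage";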

Change the seed of RAND() function in PHP?

I access my database table from a PHP script, and sometimes I get consecutive repeated results.
I ran this query:
$query ="SELECT Question,Answer1,Answer2 FROM MyTable ORDER BY RAND(UNIX_TIMESTAMP(NOW())) LIMIT 1";
Before this query, I tried just ORDER BY RAND(), but it gave me a lot of consecutive repeated results; that's why I decided to use ORDER BY RAND(UNIX_TIMESTAMP(NOW())).
But this last one still gives me consecutive repeats (though fewer).
Here is an example to explain what I mean by "continuous repeat results":
Imagine that I have 100 rows in my table: ROW1, ROW2, ROW3, ROW4, ROW5...
Well, when I call my PHP script 5 times in a row, I get 5 results:
-ROW2,ROW20,ROW20,ROW50,ROW66
I don't want the same row to appear twice in a row.
I would like, for example: -ROW2,ROW20,ROW50,ROW66,ROW20
I just want to fix it some easy way.
https://dev.mysql.com/doc/refman/5.7/en/mathematical-functions.html#function_rand
RAND() is not meant to be a perfect random generator. It is a fast way to generate random numbers on demand that is portable between platforms for the same MySQL version.
If you want 5 results, why not change the limit to 5? This will ensure that there are no duplicates.
The other option is to read all of the data out and then use shuffle in PHP:
http://php.net/manual/en/function.shuffle.php
Or select the max and use a random number generated from PHP
http://php.net/manual/en/function.mt-rand.php
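For instance, a minimal sketch of the shuffle option (assuming a PDO connection in $db; MyTable comes from the question):
// Read the rows once, shuffle them in PHP, then walk the shuffled list;
// each row appears exactly once, so consecutive repeats are impossible.
$rows = $db->query("SELECT Question, Answer1, Answer2 FROM MyTable")
           ->fetchAll(PDO::FETCH_ASSOC);
shuffle($rows);
foreach ($rows as $row) {
    // ... present $row ...
}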
This is not doable by just redefining the query. You need to change the logic of your PHP script.
If you want the PHP script (and the query) to return exactly ONE row per execution, and you need a guarantee that repeated executions of the PHP script yield different rows, then you need to store the previous result somewhere and use it in the WHERE condition of the query.
So your PHP script becomes something like (pseudocode):
$previousId = ...; // Load the ID of the row fetched by the previous execution
$query = "SELECT Question,Answer1,Answer2
FROM MyTable
WHERE id <> ?
ORDER BY RAND(UNIX_TIMESTAMP(NOW()))
LIMIT 1";
// Execute $query, using the $previousId bound parameter value
$newId = ...; // get the ID of the fetched row.
// Save $newId for the next execution.
You may use all kinds of storage for saving/loading the ID of the fetched row. The easiest is probably a special table with a single row in the same database.
Note that you may still get repeated sequential rows if you call your PHP script many times in parallel. Not sure if it matters in your case.
If it does, you may use locks or database transactions to fix this as well.
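To make the pseudocode concrete, here is a hedged sketch using PDO and a hypothetical one-row state table called last_pick (the table and its id column are invented for this example; $db is an assumed PDO handle):
// Assumed state table: CREATE TABLE last_pick (id INT NOT NULL);
$previousId = (int) $db->query("SELECT id FROM last_pick")->fetchColumn();

$stmt = $db->prepare("SELECT id, Question, Answer1, Answer2
                      FROM MyTable
                      WHERE id <> ?
                      ORDER BY RAND()
                      LIMIT 1");
$stmt->execute([$previousId]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

// Remember this row so the next execution cannot return it again.
$db->prepare("UPDATE last_pick SET id = ?")->execute([(int) $row['id']]);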

Select a random chunk of 30 numbers from 200 numbers

Currently what I'm trying to do is get a chunk of 30 numbers from a set of 200. For example, as this will be used with MySQL, I want to select 30 random images from a database of 200 images. I'd like to have two numbers (a high and a low) that I could use in the LIMIT clause to return 30 rows of data, like this: SELECT * FROM `images` LIMIT 20,50 or SELECT * FROM `images` LIMIT 10,40. I know this probably sounds like a stupid question, but my brain is just kinda stumped right now. All help is greatly appreciated! Thanks :)
Simply add ORDER BY RAND() to your query. It is "sufficiently" random.
SELECT FOO FROM BAR ORDER BY RAND() LIMIT 30
Using ORDER BY RAND() is considered an antipattern because you're forcing the database to perform a full table scan and an expensive sort operation.
To benefit from query caching, you could make a hybrid solution like:
// take all image ids
$keys = $db->query('SELECT image_id FROM images')->fetchAll(PDO::FETCH_COLUMN, 0);
// pick 30
$random_keys = join(',', array_rand(array_flip($keys), 30));
// query those images only
$values = $db->query("SELECT * FROM images WHERE image_id IN ($random_keys)")->fetchAll(PDO::FETCH_ASSOC);
The above query for the keys can be cached in PHP, so it can be used for more frequent requests. However, when your table grows into the range of 100k rows or more, I would suggest creating a separate table with the image ids stored in a randomized order that you can join against the images table. You can repopulate it once or a few times per day using ORDER BY RAND().
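A hedged sketch of that pre-randomized table (the images_random name and its pos/image_id columns are assumptions; the rebuild would typically run from a cron job):
// Assumed schema: CREATE TABLE images_random
//   (pos INT NOT NULL AUTO_INCREMENT PRIMARY KEY, image_id INT NOT NULL);
// Rebuild the randomized ordering once or a few times per day.
$db->exec("TRUNCATE TABLE images_random");
$db->exec("INSERT INTO images_random (image_id)
           SELECT image_id FROM images ORDER BY RAND()");

// At request time: read 30 consecutive positions from a random offset.
$count = (int) $db->query("SELECT COUNT(*) FROM images_random")->fetchColumn();
$offset = mt_rand(0, max(0, $count - 30));
$stmt = $db->prepare("SELECT i.*
                      FROM images_random r
                      JOIN images i ON i.image_id = r.image_id
                      ORDER BY r.pos
                      LIMIT 30 OFFSET ?");
$stmt->bindValue(1, $offset, PDO::PARAM_INT);
$stmt->execute();
$images = $stmt->fetchAll(PDO::FETCH_ASSOC);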
I would suggest using PHP's array_rand().
http://php.net/manual/en/function.array-rand.php
Put whatever you want to choose from in an array, and let it pick 30 entries for you. Then you can use whatever file names you want, without having to rely on them being in numerical order.
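For example (the file names and array contents are invented; this assumes the array has at least 30 entries):
// Candidate images, keyed 0..199.
$images = ['img001.jpg', 'img002.jpg', /* ... */ 'img200.jpg'];

// array_rand() returns 30 distinct random keys; map them back to names.
$chosen = [];
foreach (array_rand($images, 30) as $key) {
    $chosen[] = $images[$key];
}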

My(SQL) selecting random rows, novel way, help to evaluate how good is it?

I have again run into the problem of selecting a random subset of rows. As many probably know, ORDER BY RAND() is quite inefficient, or at least that's the consensus. I have read that MySQL generates a random value for every row in the table, then filters, orders by these random values, and returns the set. The biggest performance impact seems to come from the fact that as many random numbers are generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of results for a query such as:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course, the simplest and easiest way to do what I want would be the one which I am trying to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
After some thinking, I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here x is taken from the range 0.0 - 1.0. Now, I'm not exactly sure how MySQL handles this, but my guess is that it first selects the rows matching <some conditions> (using indexes?), then generates a random value for each and decides whether to return or discard the row. But what do I know :) that's why I'm asking here. Some observations about this method:
first, it does not guarantee that the requested number of rows will be returned, even if there are many more matching rows than needed; this is especially true for x values close to 1.0 (or close to 0.0 if we use <)
the returned rows don't really have a random ordering; they are just rows selected at random, ordered by whatever index was used or by the way they are stored (of course this might not matter at all in some cases)
we probably need to choose x according to the size of the result set: if the result set is large and x is, say, 0.1, it is very likely that only some of the first matching rows will be returned most of the time; on the other hand, with a small result set and a large x, we might well get fewer rows than we want, even though there are enough of them
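To make that last point concrete, here is a back-of-the-envelope way to pick x (the numbers and the 2x safety factor are arbitrary illustrations, and my_table stands in for <table>):
// With RAND() > $x each row passes with probability 1 - $x, so aim for
// (1 - $x) * matching to be about safety * needed surviving rows.
$matching = 2500; // estimated rows matching <some conditions>
$needed   = 10;   // rows we actually want (the LIMIT)
$safety   = 2;    // head room against unlucky draws
$x = 1 - min(1, ($needed * $safety) / $matching); // here: 1 - 20/2500 = 0.992

$sql = "SELECT id FROM my_table WHERE /* some conditions */ AND RAND() > $x LIMIT 10";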
Now some words about performance. I did a little testing using JMeter. <table> has about 20k rows, and about 2-3k rows match <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a JMeter test that starts 200 threads, i.e. 200 requests per second, against that PHP script. I ran it until about 3k requests were done (the average response time stabilizes well before this). I also executed all queries with SQL_NO_CACHE to prevent the query cache from having an effect. The average response times were:
~30ms for query (a)
13-15ms for query (b) with x = 0.1
17-20ms for query (b) with x = 0.9; as expected, a larger x is slower, since more rows have to be discarded
So my questions are: What do you think about this method of selecting random rows? Have you used or tried it and found that it did not work out? Can you explain better how MySQL handles such a query? What caveats might I not be aware of?
EDIT: I probably was not clear enough: the point is that I need random rows of a query, not simply of a table, which is why I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id (not that it matters much, since these are random rows from a query, not from the table), and there will be quite a lot of such queries, so the suggestions involving querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that there will be requests with different conditions.
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if a table is too big to fit in memory, MySQL will create a temporary table on disk and then perform a file sort to randomize the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will obviously reduce the amount of data sent back to the client.
Basically, the limit and order by happen after the query has been executed (full table scan to find matching records, then it is ordered and then it is limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (given the random nature of this query, I don't think it would have much of an effect on your execution time anyway).
Hopefully someone can correct me here if I have made any mistakes but I believe that is the general idea.
An alternative approach, which probably would not be faster, but who knows :)
Either use a table status query to determine the next auto_increment value or the row count, or use SELECT COUNT(*). Then you can decide ahead of time the auto_increment value of a random item and select that unique item.
This will fail if you have gaps in the auto_increment field, but if it is faster than your other methods, you could loop a few times or fall back to a failsafe method if zero rows are returned. The best case might be a big saving; the worst case would be comparable to your current method.
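A sketch of that idea with a simple fallback (my_table stands in for <table>; an integer primary key id and a PDO handle $db are assumptions):
// Take the first row at or above a random id; ">=" papers over gaps
// (at the price of a slight bias toward ids that follow large gaps).
$maxId = (int) $db->query("SELECT MAX(id) FROM my_table")->fetchColumn();

$stmt = $db->prepare("SELECT id FROM my_table WHERE id >= ? ORDER BY id LIMIT 1");
$stmt->execute([mt_rand(1, $maxId)]);
$id = $stmt->fetchColumn();

if ($id === false) {
    // Failsafe for the empty-table edge case: fall back to ORDER BY RAND().
    $id = $db->query("SELECT id FROM my_table ORDER BY RAND() LIMIT 1")->fetchColumn();
}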
You're using the wrong data structure.
The usual method is something like this:
Find out the number of elements n: something like SELECT count(id) FROM tablename.
Choose r distinct randomish numbers in the interval [0,n). I usually recommend an LCG with suitably-chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
Return those elements. The hard bit.
MySQL appears to support indexed lookups with something like SELECT id from tablename ORDER BY id LIMIT :i,1 where :i is a bound-parameter (I forget what syntax mysqli uses); alternative syntax LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and how efficiently it can do OFFSET).
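A sketch of those r queries with PDO (the $db handle and the $indices array of chosen positions in [0,n) are assumptions; tablename comes from the answer):
// One indexed lookup per chosen position; cheap if OFFSET is efficient.
$stmt = $db->prepare("SELECT id FROM tablename ORDER BY id LIMIT 1 OFFSET ?");
$ids = [];
foreach ($indices as $i) {
    $stmt->bindValue(1, $i, PDO::PARAM_INT);
    $stmt->execute();
    $ids[] = $stmt->fetchColumn();
}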
An alternative method, which should work for most databases, is a bit like interval-bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max].
So rows [a,b] have ids [min,max]. You want row i. If i == a, return min. If i == b, return max. Otherwise, bisect:
Choose split = min+(max-min)/2 (avoiding integer overflow).
SELECT count(id),max(id) FROM tablename WHERE :min < id AND id < :split and SELECT count(id),min(id) FROM tablename WHERE :split <= id AND id < :max. The two counts, plus the two endpoint rows, should add up to b-a+1 if the table hasn't been modified...
Figure out which range i is in, and update a, b, min, and max appropriately. Repeat.
There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b implies id = max). It's not really worth doing if SELECT ... OFFSET is even vaguely efficient.

MySQL vs PHP when retrieving a random item

Which is more efficient (when managing over 100K records):
A. MySQL
SELECT * FROM user ORDER BY RAND();
Of course, after that I would already have all the fields from that record.
B. PHP
Use memcached to have $cache_array hold all the data from "SELECT id_user FROM user ORDER BY id_user" for an hour or so... and then:
$id = array_rand($cache_array);
Of course, after that I have to make a MySQL call with:
SELECT * FROM user WHERE id_user = $id;
So... which is more efficient, A or B?
The proper way to answer this kind of question is to do a benchmark. Do a quick and dirty implementation each way and then run benchmark tests to determine which one performs better.
Having said that, ORDER BY RAND() is known to be slow because it's impossible for MySQL to use an index. MySQL will basically run the RAND() function once for each row in the table and then sort the rows based on what came back from RAND().
Your other idea of storing all user_ids in memcached and then selecting a random element from the array might perform better, if the overhead of memcached proves to be less than the cost of a full table scan. If your dataset is large or staleness is a problem, you may run into issues though. Also, you're adding some complexity to your application. I would try to look for another way.
I'll give you a third option which might outperform both your suggestions: select the COUNT(id_user) of the rows in your user table, then have PHP generate a random number between 0 and that count minus 1, inclusive. Then do SELECT * FROM user LIMIT 1 OFFSET random-number-generated-by-php;.
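That third option in sketch form (assuming a PDO handle $db and a non-empty user table; not benchmarked):
// One cheap COUNT, then fetch a single row at a random offset.
$count = (int) $db->query("SELECT COUNT(id_user) FROM user")->fetchColumn();
$offset = mt_rand(0, $count - 1);

$stmt = $db->prepare("SELECT * FROM user LIMIT 1 OFFSET ?");
$stmt->bindValue(1, $offset, PDO::PARAM_INT);
$stmt->execute();
$user = $stmt->fetch(PDO::FETCH_ASSOC);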
Again, the proper way to answer these types of questions is to benchmark. Anything else is speculation.
The first one is incredibly slow because MySQL creates a temporary table with all the result rows and assigns each one of them a random sorting index. The results are then sorted and returned.
It's elaborated more on this blog post.
$random_no = mt_rand(0, $total_record_count - 1); // valid offsets run from 0 to count - 1
$query = "SELECT * FROM user ORDER BY __KEY__ LIMIT {$random_no}, 1";
