Change the seed of RAND() function in PHP? - php

I query a table in my database from a PHP script, and sometimes I get the same result several times in a row.
I ran this query:
$query ="SELECT Question,Answer1,Answer2 FROM MyTable ORDER BY RAND(UNIX_TIMESTAMP(NOW())) LIMIT 1";
Before this query I tried plain ORDER BY RAND(), but it gave me many consecutive repeated results, which is why I switched to ORDER BY RAND(UNIX_TIMESTAMP(NOW())).
This version still gives me consecutive repeated results (though fewer).
Here is an example of what I mean by "consecutive repeated results":
Imagine that I have 100 rows in my table: ROW1, ROW2, ROW3, ROW4, ROW5...
When I call my PHP script 5 times in succession, I get 5 results:
-ROW2,ROW20,ROW20,ROW50,ROW66
I don't want the same row twice in a row.
I would like, for example: -ROW2,ROW20,ROW50,ROW66,ROW20
I am looking for an easy way to fix this.

https://dev.mysql.com/doc/refman/5.7/en/mathematical-functions.html#function_rand
RAND() is not meant to be a perfect random generator. It is a fast way
to generate random numbers on demand that is portable between
platforms for the same MySQL version.
If you want 5 results, why not change the limit to 5? That ensures there are no duplicates.
The other option is to read all of the data out and then use shuffle in PHP:
http://php.net/manual/en/function.shuffle.php
Or select the maximum ID and pick a random number up to it in PHP:
http://php.net/manual/en/function.mt-rand.php
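The shuffle option can be sketched as follows. This is only an illustration: pickRandomRows() is a hypothetical helper, and the $rows array stands in for the rows read from MyTable.

```php
<?php
declare(strict_types=1);

/**
 * Return $count rows in random order without duplicates.
 * Hypothetical helper; $rows stands in for the rows read from the table.
 */
function pickRandomRows(array $rows, int $count): array
{
    // shuffle() reorders the local copy in place; the caller's array is untouched.
    shuffle($rows);
    // Taking the first $count entries of a shuffled list cannot repeat a row.
    return array_slice($rows, 0, $count);
}

// Placeholder data: ROW1 .. ROW100.
$rows = array_map(fn (int $i): string => "ROW$i", range(1, 100));
$picked = pickRandomRows($rows, 5);
```

Reading the whole table is only reasonable while it stays small, as with the 100 rows in the question.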

This is not doable by just redefining the query. You need to change the logic of your PHP script.
If you want the PHP script (and the query) to return exactly ONE row per execution, and you need a guarantee that repeated executions of the PHP script yield different rows, then you need to store the previous result somewhere and use it in the WHERE condition of the query.
So your PHP script becomes something like (pseudocode):
$previousId = ...; // Load the ID of the row fetched by the previous execution
$query = "SELECT Question,Answer1,Answer2
FROM MyTable
WHERE id <> ?
ORDER BY RAND(UNIX_TIMESTAMP(NOW()))
LIMIT 1";
// Execute $query, using the $previousId bound parameter value
$newId = ...; // get the ID of the fetched row.
// Save $newId for the next execution.
You may use any kind of storage for saving and loading the ID of the fetched row. The easiest is probably a dedicated single-row table in the same database.
Note that you may still get repeated sequential rows if you call your PHP script many times in parallel. Not sure if it matters in your case.
If it does, you may use locks or database transactions to fix this as well.

Related

How to stop mysqli duplicate on random select [duplicate]

This is a problem with ordering search results on my website.
When a search is made, random results appear on the content page, which is also paginated. I use the following SQL query:
SELECT * FROM table ORDER BY RAND() LIMIT 0,10;
so my questions are
I need to make sure that every time a user visits the next page, results they have already seen do not appear again (exclude them in the next query, in a memory efficient way, but still order by rand()).
Every time the visitor goes to the 1st page there is a different set of results. Is it possible to use pagination with this, or will the ordering always be random?
I can use a seed in MySQL, but I am not sure how to use it in practice.
Use RAND(SEED). Quoting docs: "If a constant integer argument N is specified, it is used as the seed value." (http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_rand).
With a fixed seed, as in the example below, the order is random but the same on every run. Change the seed to get a new order.
SELECT * FROM your_table ORDER BY RAND(351);
You can change the seed every time the user hits the first results page and store it in the user session.
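A minimal sketch of that idea, assuming the table contents do not change between page views. The helper names and the 'rand_seed' key are made up for illustration, and a plain array stands in for $_SESSION so the code runs outside a web server.

```php
<?php
declare(strict_types=1);

/**
 * Build a paginated query whose random order is fixed by $seed.
 * your_table and the page size are placeholders.
 */
function seededPageQuery(int $seed, int $page, int $perPage = 10): string
{
    $offset = max(0, $page - 1) * $perPage;
    // All three values are integers, so formatting them with %d is safe.
    return sprintf(
        'SELECT * FROM your_table ORDER BY RAND(%d) LIMIT %d, %d',
        $seed,
        $offset,
        $perPage
    );
}

/**
 * Return the seed kept in $session, drawing a new one on page 1.
 * $session stands in for $_SESSION.
 */
function seedForPage(array &$session, int $page): int
{
    if ($page === 1 || !isset($session['rand_seed'])) {
        $session['rand_seed'] = random_int(1, 1000000);
    }
    return $session['rand_seed'];
}

$session = [];
$seed  = seedForPage($session, 1);                        // new order on page 1
$page1 = seededPageQuery($seed, 1);
$page2 = seededPageQuery(seedForPage($session, 2), 2);    // same seed, next slice
```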
Random ordering in MySQL is as sticky a problem as they come. In the past, I've usually chosen to go around the problem whenever possible. Typically, a user won't ever come back to a set of pages like this more than once or twice. So this gives you the opportunity to avoid all of the various disgusting implementations of random order in favor of a couple simple, but not quite 100% random solutions.
Solution 1
Pick from a number of existing columns that are already indexed for sorting. This can include created-on or modified timestamps, or any other column you may sort by. When a user first comes to the site, have these handy in an array, pick one at random, and then randomly pick ASC or DESC.
In your case, every time a user comes back to page 1, pick something new, store it in session. Every subsequent page, you can use that sort to generate a consistent set of paging.
Solution 2
You could have an additional column that stores a random number for sorting. It should be indexed, obviously. Periodically, run the following query:
UPDATE table SET rand_col = RAND();
This may not work for your specs, as you seem to require every user to see something different every time they hit page 1.
First, you should stop using the ORDER BY RAND() syntax. It is bad for performance on a large set of rows.
You need to determine the LIMIT constraints manually. If you still want random results and you don't want users to see the same results on the next page, the only way is to save all the results for this search session in the database and work from that saved set when the user navigates to the next page.
The next thing you should understand about web design: using random data blocks on your site is very, very bad for users' visual perception.
You have several problems to deal with! I recommend that you go step by step.
First issue: results they have already seen should not appear again
Store the ID of every item returned in an array (assuming the index column is id, as in the example).
When the user goes to the next page, pass those IDs to the query with NOT IN:
MySQL Query
SELECT * FROM table WHERE id NOT IN (1, 14, 25, 645) ORDER BY RAND() LIMIT 0,10;
This matches all rows whose id is not 1, 14, 25 or 645.
As for the performance issue (in a memory efficient way):
SELECT RAND( )
FROM table
WHERE id NOT
IN ( 1, 14, 25, 645 )
LIMIT 0 , 10
Showing rows 0 - 9 (10 total, Query took 0.0004 sec)
AND
SELECT *
FROM table
WHERE id NOT
IN ( 1, 14, 25, 645 )
ORDER BY RAND( )
LIMIT 0 , 10
Showing rows 0 - 9 (10 total, Query took 0.0609 sec)
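So ORDER BY RAND() is the expensive part: the sort costs far more than generating the random numbers. Note that SELECT RAND() on its own does not return rows in random order, so it is not a drop-in replacement.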
I would have your PHP generate your random record numbers or rows to retrieve, pass those to your query, and save a cookie on the user's client indicating what records they've already seen.
There's no reason for that user specific data to live on the server (unless you're tracking it, but it's random anyway so who cares).
The combination of
1. random ordering
2. pagination
3. HTTP (stateless)
is as ugly as it comes: 1. and 2. together need some sort of "persistent randomness", while 3. makes this harder to achieve. On top of this, 1. is not a job an RDBMS is optimized to do.
My suggestion depends on how big your dataset is:
Few rows (ca. <1K):
select all PK values in first query (first page)
shuffle these in PHP
store shuffled list in session
for each page call select the data according to the stored PKs
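The few-rows steps above might look roughly like this. It is a sketch under assumptions: idsForPage() and the 'shuffled_ids' key are illustrative names, a plain array stands in for $_SESSION, and $allIds stands in for the result of the primary-key query.

```php
<?php
declare(strict_types=1);

/**
 * Return the primary keys for one page from a shuffled list kept in $session.
 * $session stands in for $_SESSION; $allIds for the result of the PK query.
 */
function idsForPage(array &$session, array $allIds, int $page, int $perPage): array
{
    if ($page === 1 || !isset($session['shuffled_ids'])) {
        // A fresh random order on every visit to page 1.
        shuffle($allIds);
        $session['shuffled_ids'] = $allIds;
    }
    // Later pages only slice the stored order, so nothing repeats.
    return array_slice($session['shuffled_ids'], ($page - 1) * $perPage, $perPage);
}

$session = [];
$allIds  = range(1, 25);   // placeholder primary keys
$page1 = idsForPage($session, $allIds, 1, 10);
$page2 = idsForPage($session, $allIds, 2, 10);
$page3 = idsForPage($session, $allIds, 3, 10);

// Each page would then be loaded with a query of the form:
// SELECT ... WHERE ID IN (<ids>) ORDER BY FIELD(ID, <ids>)
```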
Many rows (10K+):
This assumes you have an AUTO_INCREMENT unique key called ID with a manageable number of holes. Use a maintenance script if needed (high delete ratio).
Use a shuffling function that is parameterized with e.g. the session ID to create a function rand_id(continuous_id)
If you need e.g. the records 100,000 to 100,009, calculate $a=array(rand_id(100000), rand_id(100001), ... rand_id(100009));
$a=implode(',',$a);
$sql="SELECT foo FROM bar WHERE ID IN($a) ORDER BY FIELD(ID,$a)";
To take care of the holes in your ID, select a few records too many (and throw away the excess), looping when too few records are selected.
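One possible shape for rand_id() is an affine map x -> (a*x + b) mod n, which is a permutation of 0..n-1 whenever gcd(a, n) = 1. The answer does not name a particular shuffling function, so this is a stand-in, and a fairly weak shuffle (neighbouring inputs land a fixed distance apart). makeRandId() and the use of crc32() to derive the parameters are assumptions for illustration; 64-bit PHP is assumed.

```php
<?php
declare(strict_types=1);

function gcd(int $a, int $b): int
{
    while ($b !== 0) {
        [$a, $b] = [$b, $a % $b];
    }
    return $a;
}

/**
 * Build rand_id(): a permutation of 0..$n-1 derived from the session ID.
 * Assumes 64-bit PHP, where crc32() is non-negative and $a * $x cannot overflow
 * for realistic table sizes.
 */
function makeRandId(string $sessionId, int $n): callable
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1.');
    }
    $a = crc32($sessionId . ':a') % $n;
    $b = crc32($sessionId . ':b') % $n;
    // Advance $a until it is coprime with $n; a = 1 always qualifies, so this ends.
    while (gcd($a, $n) !== 1) {
        $a = ($a + 1) % $n;
    }
    return fn (int $x): int => ($a * $x + $b) % $n;
}

$randId = makeRandId('abc123', 1000);          // 'abc123' is a placeholder session ID
$ids    = array_map($randId, range(100, 109)); // positions 100..109 of the shuffled order
$a      = implode(',', $ids);
$sql    = "SELECT foo FROM bar WHERE ID IN($a) ORDER BY FIELD(ID,$a)";
```

The map yields values in 0..n-1, so with an AUTO_INCREMENT key that starts at 1 you would add 1 before querying.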

using count instead of mysql_num_rows?

What is the difference (performance-wise) between
$var = mysql_query("select * from tbl where id='something'");
$count = mysql_num_rows($var);
if ($count > 1) {
    // do something
}
and
$var = mysql_query("select count(*) from tbl where id='something'");
P.S: I know mysql_* are deprecated.
The first version returns the entire result set. This can be a large data volume, if your table is wide. If there is an index on id, it still needs to read the original data to fill in the values.
The second version returns only the count. If there is an index on id, then it will use the index and not need to read in any data from the data pages.
If you only want the count and not the data, the second version is clearer and should perform better.
select * is asking mysql to fetch all data from that table (given the conditions) and give it to you, this is not a very optimizable operation and will result in a lot of data being organised and sent over the socket to PHP.
Since you then do nothing with this data, you have asked mysql to do a whole lot of data processing for nothing.
Instead, just asking mysql to count() the number of rows that fit the conditions will not result in it trying to send you all that data, and will result in a faster query, especially if the id field is indexed.
Overall though, if your php application is still simple, while still being good practice, this might be regarded as a micro-optimization.
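The two variants can be compared side by side with PDO. In this sketch an in-memory SQLite database stands in for MySQL so that it is self-contained (it assumes the pdo_sqlite extension is available), and the payload column is invented to give the rows some width.

```php
<?php
declare(strict_types=1);

// In-memory SQLite stands in for MySQL; table and data are placeholders.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE tbl (id TEXT, payload TEXT)');

$insert = $pdo->prepare('INSERT INTO tbl (id, payload) VALUES (?, ?)');
foreach (['something', 'something', 'other'] as $id) {
    $insert->execute([$id, str_repeat('x', 1000)]);
}

// Variant 1: fetch every matching row, then count them in PHP.
$stmt = $pdo->prepare('SELECT * FROM tbl WHERE id = ?');
$stmt->execute(['something']);
$countFromRows = count($stmt->fetchAll(PDO::FETCH_ASSOC));

// Variant 2: let the database count; only a single value is returned.
$stmt = $pdo->prepare('SELECT COUNT(*) FROM tbl WHERE id = ?');
$stmt->execute(['something']);
$countFromSql = (int) $stmt->fetchColumn();
```

Both variants give the same number; the difference is how much data has to be produced and transferred to get it.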
I would use the second for 2 reasons :
As you stated, mysql_* are deprecated
if your table is huge, you're putting quite a big amount of data in $var only to count it.
SELECT * FROM tbl_university_master;
2785 row(s) returned
Execution Time : 0.071 sec
Transfer Time : 7.032 sec
Total Time : 8.004 sec
SELECT COUNT(*) FROM tbl_university;
1 row(s) returned
Execution Time : 0.038 sec
Transfer Time : 0 sec
Total Time : 0.039 sec
The first collects all data and counts the number of rows in the resultset, which is performance-intensive. The latter just does a quick count which is way faster.
However, if you need both the data and the count, it's more sensible to execute the first query and use mysql_num_rows (or something similar in PDO) than to execute yet another query to do the counting.
And indeed, mysql_* is to be deprecated. But the same applies when using MySQLi or PDO.
I think using
$var = mysql_query("select count(*) from tbl where id='something'");
Is more efficient because you aren't allocating memory based on the number of rows that gets returned from MySQL.
select * from tbl where id='something' selects all the data from table with ID condition.
The COUNT() function returns the number of rows that matches a specified criteria.
For further reading, practice and demonstrations, see w3schools.

How to sort selected rows (truly) randomly?

Currently, I am using this query in my PHP script:
SELECT * FROM `ebooks` WHERE `id`!=$ebook[id] ORDER BY RAND() LIMIT 125;
The database will be about 2500 rows big at max, but I've read that ORDER BY RAND() eventually will slow down the processing time as the data in the database grows.
So I am looking for an alternate method for my query to make things still run smoothly.
Also, I noticed that ORDER BY RAND() is not truly randomizing the rows, because often I see that it follows some kind of pattern that sometimes repeats over and over again.
Is there any method to truly randomize the rows?
The RAND() function is a pseudo-random number generator, and for a given seed it produces the same sequence of numbers. To vary the order explicitly you can seed it from the clock:
SELECT * FROM `ebooks` WHERE `id`!=$ebook[id] ORDER BY RAND(UNIX_TIMESTAMP()) LIMIT 125;
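which seeds the random number generator from the current time and gives you a different sequence of numbers for calls made in different seconds. Calls within the same second share a seed and therefore get the same order.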
RAND() will slow down the SELECT's ORDER BY clause since it has to generate a random number every time and then sort by it. I would suggest you have the data returned to the calling program and randomize it there using something like array_rand.
This question has already been answered:
quick selection of a random row from a large table in mysql
Here too:
http://snippetsofcode.wordpress.com/2011/08/01/fast-php-mysql-random-rows/

My(SQL) selecting random rows, novel way, help to evaluate how good is it?

I have again run into the problem of selecting a random subset of rows. As many probably know, ORDER BY RAND() is quite inefficient, or at least that is the consensus. I have read that MySQL generates a random value for every row in the table, then filters, then orders by these random values, and then returns the set. The biggest performance impact seems to come from the fact that as many random numbers are generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of results for a query such as:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course, the simplest and easiest way to do what I want is the one which I am trying to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
After some thinking I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here we take x from the range 0.0 - 1.0. I'm not exactly sure how MySQL handles this, but my guess is that it first selects rows matching <some conditions> (using index[es]?) and then generates a random value and decides whether to return or discard the row. But what do I know :) That's why I ask here. Some observations about this method:
First, it does not guarantee that the requested number of rows will be returned, even if there are many more matching rows than needed. This is especially true for x values close to 1.0 (or close to 0.0 if we use <).
The returned objects don't really have a random ordering; they are just randomly selected objects, ordered by the index used or by the way they are stored(?) (of course this might not matter at all in some cases).
We probably need to choose x according to the size of the result set. If we have a large result set and x is, let's say, 0.1, it is very likely that only some random rows from the beginning will be returned most of the time. On the other hand, if we have a small result set and choose a large x, we will likely get fewer objects than we want, although there are enough of them.
Now some words about performance. I did a little testing using jmeter. <table> has about 20k rows, and about 2-3k rows match <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a jmeter test that starts 200 threads, so that is 200 requests per second, and requests said PHP script. I ran it until about 3k requests were done (the average response time stabilizes well before this). I also executed all queries with SQL_NO_CACHE to prevent the query cache from having an effect. Average response times were:
~30ms for query (a)
13-15ms for query (b) with x = 0.1
17-20ms for query (b) with x = 0.9, as expected larger x is slower since it has to discard more rows
So my questions are: what do you think about this method of selecting random rows? Maybe you have used it or tried it and see that it did not work out? Maybe you can better explain how MySQL handles such query? What could be some caveats that I'm not aware of?
EDIT: I probably was not clear enough. The point is that I need random rows from a query, not simply from a table, which is why I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id, not that it matters much since this is about random rows from a query rather than from the table. There will be quite a lot of such queries, so suggestions that involve querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that there will be requests with different conditions.
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if you had a table that was too big to fit in memory, MySQL would create a temporary table on disk and then perform a file sort to randomise the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will obviously reduce the size of the data sent back to the client.
Basically, the limit and order by happen after the query has been executed (full table scan to find matching records, then it is ordered and then it is limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (due to the random nature of this query I don't think it would have much of an effect on your execution time anyway).
Hopefully someone can correct me here if I have made any mistakes but I believe that is the general idea.
An alternative that probably would not be faster, but who knows :)
Either use a table status query to determine the next auto_increment, or the row count, or use (select count(*)). Then you can decide ahead of time the auto_increment value of a random item and then select that unique item.
This will fail if you have gaps in the auto_increment field, but if it is faster than your other methods, you could loop a few times or fall back to a failsafe method in the case of zero rows returned. Best case might be a big savings, worst case would be comparable to your current method.
You're using the wrong data structure.
The usual method is something like this:
Find out the number of elements n — something like SELECT count(id) FROM tablename.
Choose r distinct randomish numbers in the interval [0,n). I usually recommend a LCG with suitably-chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
Return those elements. The hard bit.
MySQL appears to support indexed lookups with something like SELECT id from tablename ORDER BY id LIMIT :i,1 where :i is a bound-parameter (I forget what syntax mysqli uses); alternative syntax LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and how efficiently it can do OFFSET).
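The "pick r randomish numbers and discard repeats" step can be sketched like this. pickDistinctOffsets() is an illustrative name, and 2500 and 10 are placeholder values for n and r.

```php
<?php
declare(strict_types=1);

/**
 * Choose $r distinct integers in [0, $n) by drawing and discarding repeats.
 */
function pickDistinctOffsets(int $n, int $r): array
{
    if ($r < 0 || $r > $n) {
        throw new InvalidArgumentException('Need 0 <= r <= n.');
    }
    $chosen = [];
    while (count($chosen) < $r) {
        // Using the value as the array key makes a repeated draw a no-op.
        $chosen[random_int(0, $n - 1)] = true;
    }
    return array_keys($chosen);
}

$offsets = pickDistinctOffsets(2500, 10);
// Each offset would then be bound as :i in
// SELECT id FROM tablename ORDER BY id LIMIT :i, 1
```

Discarding repeats is cheap while r is small compared to n; as r approaches n, shuffling the full range becomes the better choice.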
An alternative method, which should work for most databases, is a bit like interval-bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max].
So rows [a,b] have ids [min,max]. You want row i. If i == a, return min. If i == b, return max. Otherwise, bisect:
Choose split = min+(max-min)/2 (avoiding integer overflow).
SELECT count(id),max(id) FROM tablename WHERE :min < id AND id < :split and SELECT count(id),min(id) FROM tablename WHERE :split <= id AND id < :max. The two counts should equal b-a+1 if the table hasn't been modified...
Figure out which range i is in, and update a, b, min, and max appropriately. Repeat.
There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b implies id = max). It's not really worth doing if SELECT ... OFFSET is even vaguely efficient.

Best way to get result count before LIMIT was applied

When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.
Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
Sequence of events in a SELECT query
( 0. CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
1. WHERE clause (and JOIN conditions, though none in your example) filter qualifying rows from the base table(s). The rest is based on the filtered subset.
( 2. GROUP BY and aggregate functions would go here.) Not here.
( 3. Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
4. Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
5. ORDER BY
( 6. DISTINCT or DISTINCT ON would go here.) Not here.
7. LIMIT / OFFSET are applied based on the established order to select rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres keeps internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
Related:
Calculate number of rows affected by batch query in PostgreSQL
As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to do the query in its entirety, even if the limit clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing for which LIMIT is used IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way is to remember the value of the column you are ordering by for the previous row (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).
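The "remember the last value" rewrite might be sketched as below. The helper keysetPageQuery() and the table name items are invented for illustration; order_row is the column name used above and is assumed to be a unique, indexed integer.

```php
<?php
declare(strict_types=1);

/**
 * Build a query that continues after the last order_row value already shown.
 * Returns the SQL text and the values to bind to its placeholders.
 */
function keysetPageQuery(?int $lastSeen, int $pageSize): array
{
    if ($lastSeen === null) {
        // First page: there is no lower bound yet.
        return ['SELECT * FROM items ORDER BY order_row LIMIT ?', [$pageSize]];
    }
    return [
        'SELECT * FROM items WHERE order_row > ? ORDER BY order_row LIMIT ?',
        [$lastSeen, $pageSize],
    ];
}

// The page after the row whose order_row was 1000:
[$sql, $params] = keysetPageQuery(1000, 10);
// With PDO, bind both values as integers (PDO::PARAM_INT) before executing.
```

The explicit ORDER BY matters here: without it, "the next 10 rows after the remembered value" is not well defined.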
You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
Since Postgres already does a certain amount of caching things, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.
Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.
