which is more efficient (when managing over 100K records):
A. Mysql
SELECT * FROM user ORDER BY RAND();
of course, after that i would already have all the fields from that record.
B. PHP
use memcached to have $cache_array hold all the data from "SELECT id_user FROM user ORDER BY id_user" for 1 hour or so... and then:
$id = array_rand($cache_array);
of course, after that i have to make a MYSQL call with:
SELECT * FROM user WHERE id_user = $id;
so... which is more efficient? A or B?
The proper way to answer this kind of question is to do a benchmark. Do a quick and dirty implementation each way and then run benchmark tests to determine which one performs better.
Having said that, ORDER BY RAND() is known to be slow because it's impossible for MySQL to use an index. MySQL will basically run the RAND() function once for each row in the table and then sort the rows based on what came back from RAND().
Your other idea of storing all user_ids in memcached and then selecting a random element form the array might perform better if the overhead of memcached proves to be less than the cost of a full table scan. If your dataset is large or staleness is a problem, you may run into issues though. Also you're adding some complexity to your application. I would try to look for another way.
I'll give you a third option which might outperform both your suggestions: Select a count(user_id) of the rows in your user table and then have php generate a random number between 0 and the result of count(user_id) minus 1, inclusive. Then do a SELECT * FROM user LIMIT 1 OFFSET random-number-generated-by-php;.
Again, the proper way to answer these types of questions is to benchmark. Anything else is speculation.
The first one is incredibly slow because
MySQL creates a temporary table with
all the result rows and assigns each
one of them a random sorting index.
The results are then sorted and
returned.
It's elaborated more on this blog post.
$random_no = mt_rand(0, $total_record_count);
$query = "SELECT * FROM user ORDER BY __KEY__ LIMIT {$random_no}, 1";
Related
This is a problem with a ordering search results on my website,
When a search is made, random results appear on the content page, this page includes pagination too. I user following as my SQL query.
SELECT * FROM table ORDER BY RAND() LIMIT 0,10;
so my questions are
I need to make sure that everytime user visits the next page, results they already seen not to appear again (exclude them in the next query, in a memory efficient way but still order by rand() )
everytime the visitor goes to the 1st page there is a different sets of results, Is it possible to use pagination with this, or will the ordering always be random.
I can use seed in the MYSQL, however i am not sure how to use that practically ..
Use RAND(SEED). Quoting docs: "If a constant integer argument N is specified, it is used as the seed value." (http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_rand).
In the example above the result order is rand, but it is always the same. You can just change the seed to get a new order.
SELECT * FROM your_table ORDER BY RAND(351);
You can change the seed every time the user hits the first results page and store it in the user session.
Random ordering in MySQL is as sticky a problem as they come. In the past, I've usually chosen to go around the problem whenever possible. Typically, a user won't ever come back to a set of pages like this more than once or twice. So this gives you the opportunity to avoid all of the various disgusting implementations of random order in favor of a couple simple, but not quite 100% random solutions.
Solution 1
Pick from a number of existing columns that already indexed for being sorted on. This can include created on, modified timestamps, or any other column you may sort by. When a user first comes to the site, have these handy in an array, pick one at random, and then randomly pick ASC or DESC.
In your case, every time a user comes back to page 1, pick something new, store it in session. Every subsequent page, you can use that sort to generate a consistent set of paging.
Solution 2
You could have an additional column that stores a random number for sorting. It should be indexed, obviously. Periodically, run the following query;
UPDATE table SET rand_col = RAND();
This may not work for your specs, as you seem to require every user to see something different every time they hit page 1.
First you should stop using the ORDER BY RAND syntax. This will bad for performance in large set of rows.
You need to manually determine the LIMIT constraints. If you still want to use the random results and you don't want users to see the same results on next page the only way is to save all the result for this search session in database and manipulate this information when user navigate to next page.
The next thing in web design you should understand - using any random data blocks on your site is very, very, very bad for users visual perception.
You have several problems to deal with! I recommend that you go step by step.
First issue: results they already seen not to appear again
Every item returned, store it in one array. (assuming the index id on the example)
When the user goes to the next page, pass to the query the NOT IN:
MySQL Query
SELECT * FROM table WHERE id NOT IN (1, 14, 25, 645) ORDER BY RAND() LIMIT 0,10;
What this does is to match all id that are not 1, 14, 25 or 645.
As far as the performance issue goes: in a memory efficient way
SELECT RAND( )
FROM table
WHERE id NOT
IN ( 1, 14, 25, 645 )
LIMIT 0 , 10
Showing rows 0 - 9 (10 total, Query took 0.0004 sec)
AND
SELECT *
FROM table
WHERE id NOT
IN ( 1, 14, 25, 645 )
ORDER BY RAND( )
LIMIT 0 , 10
Showing rows 0 - 9 (10 total, Query took 0.0609 sec)
So, don't use ORDER BY RAND(), preferably use SELECT RAND().
I would have your PHP generate your random record numbers or rows to retrieve, pass those to your query, and save a cookie on the user's client indicating what records they've already seen.
There's no reason for that user specific data to live on the server (unless you're tracking it, but it's random anyway so who cares).
The combination of
random ordering
pagination
HTTP (stateless)
is as ugly as it comes: 1. and 2. together need some sort of "persistent randomness", while 3. makes this harder to achieve. On top of this 1. is not a job a RDBMS is optimized to do.
My suggestion depends on how big your dataset is:
Few rows (ca. <1K):
select all PK values in first query (first page)
shuffle these in PHP
store shuffled list in session
for each page call select the data according to the stored PKs
Many rows (10K+):
This assumes, you have an AUTO_INCREMENT unique key called ID with a manageable number of holes. Use a amintenace script if needed (high delete ratio)
Use a shuffling function that is parameterized with e.g. the session ID to create a function rand_id(continuous_id)
If you need e.g. the records 100,000 to 100,009 calculate $a=array(rand_id(100,000), rand_id(100,001), ... rand_id(100,009));
$a=implode(',',$a);
$sql="SELECT foo FROM bar WHERE ID IN($a) ORDER BY FIELD(ID,$a)";
To take care of the holes in your ID select a few records too many (and throw away the exess), looping on too few records selected.
I have a MySQL database that contains over 400,000 rows. For my web based script, I have a page function. One of the steps to determine how many pages there should be is returning the number of rows in the table.
Let's pretend the table name is data.
I'm wondering what is the most efficient method to ONLY return the number of rows in the database.
I could obviously do something like:
$getRows = mysql_query("SELECT id FROM `data`") or die(mysql_error());
$rows = mysql_num_rows($getRows);
So that it only selects the id. But still, that will be selecting 400,000 + ID's worth of data and storing it on the stack (i think?) and seems less efficient as using a method such as finding the table status. I'm just not 100% sure how to use the table status method.
Feedback & opinions would be awesome. Thanks guys!
use count
SELECT count(id) FROM data
See this question for more info on getting counts. Make sure your id has an index in your table.
Now, to find the number of unique rows, you can do
SELECT count(distinct(id)) FROM data
alternatively, if you want to find the highest ID number (if you ID are autoincremental and unique) you can try SELECT max(id) FROM data to return the highest ID number present.
I'd highly recommend this site to learn these basic functions:
http://sqlzoo.net/
400,000 rows is not a lot at all. Keep it simple and just do:
select count(*)
from `data`
I'm making a micro-blogging website. The users can follow each other. I've to make stream of posts (activity stream) for the current user ( $userid ) based on the users the current user is following, like in Twitter. I know two ways of implementing this. Which one is better?
Tables:
Table: posts
Columns: PostID, AuthorID, TimeStamp, Content
Table: follow
Columns: poster, follower
The first way, by joining these two tables:
select `posts`.* from `posts`,`follow` where `follow`.`follower`='$userid' and
`posts`.`AuthorID`=`follow`.`poster` order by `posts`.`postid` desc
The second way is by making an array of users the $userid is following (posters), then doing php implode on this array, and then doing where in:
One thing I'll like to tell here that I'm storing the the number of users a user is following in the `following` record of the `user` table, so here I'll use this number as a limit when extracting the list of posters - the 'followingList':
function followingList($userid){
$listArray=array();
$limit="select `following` from `users` where `userid`='$userid' limit 1";
$limit=mysql_query($limit);
$limit=mysql_fetch_row($limit);
$limit= (int) $limit[0];
$sql="select `poster` from `follow` where `follower`='$userid' limit $limit";
$result=mysql_query($sql);
while($data = mysql_fetch_row($result)){
$listArray[] = $data[0];
}
$posters=implode("','",$listArray);
return $posters;
}
Now I've a comma separated list of user IDs the current $userid is following.And now selecting the posts to make the activity stream:
$posters=followingList($userid);
$sql = "select * from `posts` where (`AuthorID` in ('$posters'))
order by `postid` desc";
Which of the two methods is better?
And can knowing the total number of following (number of users the current user is following), make things faster in the first method as it's doing in the second method?
Any other better method?
You should go all the way with the first option. Always try as much as possible to process the data on the mysql server instead of in your PHP code. PHP will not implicitly cache the results of the operations while MySQL will do it.
The most important thing is to make sure you index your data correctly. Try using "EXPLAIN" statements to make sure you have optimized your database as much as possible and use #1 to link your data together.
http://dev.mysql.com/doc/refman/5.0/en/explain.html
This will allow you later to compute statistics also, while the second method requires you to process a part of the statistics.
The first important point is that PHP is good at building pages but very bad are managing data, everything manipulated by PHP will fill the memory and no special behavior can be applied in PHP to prevent using to much memory, except crashing.
On the other side the datatase job is to analyse relation between the tables, real number used by the query (cardinality of indexes and statictics on rows and index usage in fact), and a lot of different mechanism can be choosen by the engine depending on the size of data (merge joins, temporary tables, etc). That means you could have 256.278.242 posts and 145.268 users, with 5.684 average followers the datatabase job would be to find the fastest way to give you an answer. Well, when you hit really big numbers you'll see that all databases are not equal, but that's another problem.
On the PHP side Retrieving the list of users from the fisrt query coudl became very long (with a big number of followed users, let's say 15.000. Simply building the query string with 15 000 identifiers inside would take a quite big amount a memory. Trasnferring this new query to the SQL server would also be slow. It's definitively the wrong way.
Now be careful of the way you build your SQL request. A request is something you should be able to read from the top to the end, explaining what you really want. This will help the SQL (good) engine in choosing the right solution.
select `posts`.*
from `posts`
INNER JOIN `follow` ON posts`.`AuthorID`=`follow`.`poster`
where `follow`.`follower`='#userid'
order by `posts`.`postid` desc
LIMIT 15
Several remarks:
I have used an INNER JOIN.I want an INNER JOIN, let's write it, it will be easier to read for me later and it should be the same for the query analyser.
if #userid is an int do not use quotes. Please use ints for identifiers (this is really faster than strings). And on the PHP side cast the int "SELECT ..." . (int) $user_id ." ORDER ... or use query with parameters (This is for security).
I have used a LIMIT 15, maybe an offset could be used as well, if you want to show some pagination control around the posts. Let's say this query will retrieve 15.263 documents from my 5.642 folowwed users, you do not want, and the user do not want, to show theses 15.263 documents on a web page. And knowing with $limit that the number is 15.263 is a good thing but certainly not for a request limit. You know this number, but the database may know it as well if it has a good query analyser and some good internal statistics.
The request limit has several goals
1. Limit the size of data transfered from the database to your PHP script
2. Limit the memory usage of your PHP script (an array with 15.263 documents containg some HTMl stuff... ouch)
3. Limit the size of the final user output (and get a faster response)
This is a problem with a ordering search results on my website,
When a search is made, random results appear on the content page, this page includes pagination too. I user following as my SQL query.
SELECT * FROM table ORDER BY RAND() LIMIT 0,10;
so my questions are
I need to make sure that everytime user visits the next page, results they already seen not to appear again (exclude them in the next query, in a memory efficient way but still order by rand() )
everytime the visitor goes to the 1st page there is a different sets of results, Is it possible to use pagination with this, or will the ordering always be random.
I can use seed in the MYSQL, however i am not sure how to use that practically ..
Use RAND(SEED). Quoting docs: "If a constant integer argument N is specified, it is used as the seed value." (http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_rand).
In the example above the result order is rand, but it is always the same. You can just change the seed to get a new order.
SELECT * FROM your_table ORDER BY RAND(351);
You can change the seed every time the user hits the first results page and store it in the user session.
Random ordering in MySQL is as sticky a problem as they come. In the past, I've usually chosen to go around the problem whenever possible. Typically, a user won't ever come back to a set of pages like this more than once or twice. So this gives you the opportunity to avoid all of the various disgusting implementations of random order in favor of a couple simple, but not quite 100% random solutions.
Solution 1
Pick from a number of existing columns that already indexed for being sorted on. This can include created on, modified timestamps, or any other column you may sort by. When a user first comes to the site, have these handy in an array, pick one at random, and then randomly pick ASC or DESC.
In your case, every time a user comes back to page 1, pick something new, store it in session. Every subsequent page, you can use that sort to generate a consistent set of paging.
Solution 2
You could have an additional column that stores a random number for sorting. It should be indexed, obviously. Periodically, run the following query;
UPDATE table SET rand_col = RAND();
This may not work for your specs, as you seem to require every user to see something different every time they hit page 1.
First you should stop using the ORDER BY RAND syntax. This will bad for performance in large set of rows.
You need to manually determine the LIMIT constraints. If you still want to use the random results and you don't want users to see the same results on next page the only way is to save all the result for this search session in database and manipulate this information when user navigate to next page.
The next thing in web design you should understand - using any random data blocks on your site is very, very, very bad for users visual perception.
You have several problems to deal with! I recommend that you go step by step.
First issue: results they already seen not to appear again
Every item returned, store it in one array. (assuming the index id on the example)
When the user goes to the next page, pass to the query the NOT IN:
MySQL Query
SELECT * FROM table WHERE id NOT IN (1, 14, 25, 645) ORDER BY RAND() LIMIT 0,10;
What this does is to match all id that are not 1, 14, 25 or 645.
As far as the performance issue goes: in a memory efficient way
SELECT RAND( )
FROM table
WHERE id NOT
IN ( 1, 14, 25, 645 )
LIMIT 0 , 10
Showing rows 0 - 9 (10 total, Query took 0.0004 sec)
AND
SELECT *
FROM table
WHERE id NOT
IN ( 1, 14, 25, 645 )
ORDER BY RAND( )
LIMIT 0 , 10
Showing rows 0 - 9 (10 total, Query took 0.0609 sec)
So, don't use ORDER BY RAND(), preferably use SELECT RAND().
I would have your PHP generate your random record numbers or rows to retrieve, pass those to your query, and save a cookie on the user's client indicating what records they've already seen.
There's no reason for that user specific data to live on the server (unless you're tracking it, but it's random anyway so who cares).
The combination of
random ordering
pagination
HTTP (stateless)
is as ugly as it comes: 1. and 2. together need some sort of "persistent randomness", while 3. makes this harder to achieve. On top of this 1. is not a job a RDBMS is optimized to do.
My suggestion depends on how big your dataset is:
Few rows (ca. <1K):
select all PK values in first query (first page)
shuffle these in PHP
store shuffled list in session
for each page call select the data according to the stored PKs
Many rows (10K+):
This assumes, you have an AUTO_INCREMENT unique key called ID with a manageable number of holes. Use a amintenace script if needed (high delete ratio)
Use a shuffling function that is parameterized with e.g. the session ID to create a function rand_id(continuous_id)
If you need e.g. the records 100,000 to 100,009 calculate $a=array(rand_id(100,000), rand_id(100,001), ... rand_id(100,009));
$a=implode(',',$a);
$sql="SELECT foo FROM bar WHERE ID IN($a) ORDER BY FIELD(ID,$a)";
To take care of the holes in your ID select a few records too many (and throw away the exess), looping on too few records selected.
Please note I am a beginner to this.
I have two questions:
1) How can I order the results of a query randomly.
example query:
$get_questions = mysql_query("SELECT * FROM item_bank_tb WHERE item_type=1 OR item_type=3 OR item_type=4");
2) The best method to select random rows from a table. So lets say I want to grab 10 rows at random from a table.
Many thanks,
If you don't mind sacrificing complexity on the insert/update/delete operations for speed on the select, you can always add a sequence number and make sure it's maintained on insert/update/delete, then whenever you do a select, simply select on one or more random numbers from within this range. If the "sequence" column is indexed, I think that's about as fast as you'll get.
An alternative is "shuffling". Add a sequence column, insert random values into this column, and whenever you select records, order by the sequence column and update the selected record sequences to new random values. The update should only affect the records you've retrieved, so shouldn't be too costly ... but it may be worth running some tests against your dataset.
This may be a fairly evil thing to say, but I'll say it anyway ... is there ever a need to display 'random' data? if you're trying to display random records, you may be doing something wrong.
Think about Amazon ... do they display random products, or do they display popular ones, and 'things other people bought when they looked at this'. Does SO give you a list of random questions to the right of here, or a list of related ones? Just some food for thought.
SELECT * FROM item_bank_tb WHERE item_type in(1,3,4) order by rand() limit 10
Beware that order by rand() is very slow on large recordset.
EDIT. Take a look at this very interesting article that presents a different approach.
http://explainextended.com/2009/03/01/selecting-random-rows/