This question already has answers here:
Doing a while / loop to get 10 random results
(3 answers)
Closed 9 years ago.
I have this table (PERSONS) with 25M rows:
ID int(10) PK
points int(6) INDEX
some other columns
I want to show the user 4 random rows which are somewhat close to each other in points. I found this query after some searching and tuning to generate random rows which is impressive fast:
SELECT person_id, points
FROM persons AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(person_id)
FROM persons)) AS id)
AS r2
WHERE r1.person_id>= r2.id and points > 0
ORDER BY r1.person_id ASC
LIMIT 4
So I query this in the PHP. Which gives me great and fast results (below 0.05 seconds when warmed up). But these rows are really just random (with at least 1 point since the points > 0). I would like to show some rows which are a little bit close, doesn't have to be every time, but let's say I do this query with limit 50 and than select a random row in PHP and the 3 closest rows (based on points) next to it. I would think you would need to sort the result, pick a random row and show the rows after/before it. But i have no idea how I can make this, since I am quite new to PHP.
Anyone suggestions, all feedback is welcome :)
Build an index on your points column (if it does not already exist), then perform your randomisation logic on that:
ALTER TABLE persons ADD INDEX (points);
SELECT person_id, points
FROM persons JOIN (
SELECT RAND() * MAX(points) AS pivot
FROM persons
WHERE points > 0
) t ON t.pivot <= points
ORDER BY points
LIMIT 4
Note that this approach will select the pivot using a uniform probability distribution over the range of points values; if points are very non-uniform, you can end up pivoting on some values a lot more often than others (thereby resulting in seemingly "non-random" outcomes).
To resolve that, you can select a random record by a more uniformly distributed column (maybe person_id?) and then use the points value of that random record as the pivot; that is, substitute the following for the subquery in the above statement:
SELECT points AS pivot
FROM persons JOIN (
SELECT FLOOR(
MIN(person_id)
+ RAND() * (MAX(person_id)-MIN(person_id))
) AS random
FROM persons
WHERE points > 0
) r ON r.random <= person_id
WHERE points > 0
ORDER BY person_id
LIMIT 1
Removing a subquery from it will drasticly improve the performance and caching so you could for example get list your IDs, put it in a file and then random from it (for example by reading random lines from file). This will improve it by a whole lot, as you can see if you will run EXPLAIN on this query and compare it by changing the query to load just data for the 4 (still random) ids.
I would suggest doing two separate sql queries in PHP and not join/subquery them. In many cases the optimizer can not simplify your query and has to perform each one separatly. So, in your case. if you have 1000 persons the optimizer will do the following wueries at worst case:
Get 1000 persons rows
Do Sub Select for each person which get's 1000 persons rows
Join 1000 persons with joined rows resulting in 1.000.000 rows
Filter all of them
In short:
1001 queries with 1.000.000 rows
My advice?
Perform two queries and NO joins or sub-selects as both (especially in combination have dramatic performance drops in most cases)
SELECT person_id, points
FROM persons
ORDER BY RAND() LIMIT 1
Now use the found points for your second query
SELECT person_id, points, ABS(points - <POINTS FROM ABOVE>) AS distance
FROM persons
ORDER BY distance ASC LIMIT 4
Related
Lets start by saying that I cant use INDEXING as I need the INSERT, DELETE and UPDATE for this table to be super fast, which they are.
I have a page that displays a summary of order units collected in a database table. To populate the table an order number is created and then individual units associated with that order are scanned into the table to recored which units are associated with each order.
For the purposes of this example the table has the following columns.
id, UID, order, originator, receiver, datetime
The individual unit quantities can be in the 1000's per order and the entire table is growing to hundreds of thousands of units.
The summary page displays the number of units per order and the first and last unit number for each order. I limit the number of orders to be displayed to the last 30 order numbers.
For example:
Order 10 has 200 units. first UID 1510 last UID 1756
Order 11 has 300 units. first UID 1922 last UID 2831
..........
..........
Currently the response time for the query is about 3 seconds as the code performs the following:
Look up the last 30 orders by by id and sort by order number
While looking at each order number in the array
-- Count the number of database rows that have that order number
-- Select the first UID from all the rows as first
-- Select the last UID from all the rows as last
Display the result
I've determined the majority of the time is taken by the Count of the number of units in each order ~1.8 seconds and then determining the first and last numbers in each order ~1 second.
I am really interested in if there is a way to speed up these queries without INDEXING. Here is the code with the queries.
First request selects the last 30 orders processed selected by id and grouped by order number. This gives the last 30 unique order numbers.
$result = mysqli_query($con, "SELECT order, ANY_VALUE(receiver) AS receiver, ANY_VALUE(originator) AS originator, ANY_VALUE(id) AS id
FROM scandb
GROUP BY order
ORDER BY id
DESC LIMIT 30");
While fetching the last 30 order numbers count the number of units and the first and last UID for each order.
while($row=mysqli_fetch_array($result)){
$count = mysqli_fetch_array(mysqli_query($con, "SELECT order, COUNT(*) as count FROM scandb WHERE order ='".$row['order']."' "));
$firstLast = mysqli_fetch_array(mysqli_query($con, "SELECT (SELECT UID FROM scandb WHERE orderNumber ='".$row['order']."' ORDER BY UID LIMIT 1) as 'first', (SELECT UID FROM barcode WHERE order ='".$row['order']."' ORDER BY UID DESC LIMIT 1) as 'last'"));
echo "<td align= center>".$count['count']."</td>";
echo "<td align= center>".$firstLast['first']."</td>";
echo "<td align= center>".$firstLast['last']."</td>";
}
With 100K lines in the database this whole query is taking about 3 seconds. The majority of the time is in the $count and $firstlast queries. I'd like to know if there is a more efficient way to get this same data in a faster time without Indexing the table. Any special tricks that anyone has would be greatly appreciated.
Design your database with caution
This first tip may seems obvious, but the fact is that most database problems come from badly-designed table structure.
For example, I have seen people storing information such as client info and payment info in the same database column. For both the database system and developers who will have to work on it, this is not a good thing.
When creating a database, always put information on various tables, use clear naming standards and make use of primary keys.
Know what you should optimize
If you want to optimize a specific query, it is extremely useful to be able to get an in-depth look at the result of a query. Using the EXPLAIN statement, you will get lots of useful info on the result produced by a specific query, as shown in the example below:
EXPLAIN SELECT * FROM ref_table,other_table WHERE ref_table.key_column=other_table.column;
Don’t select what you don’t need
A very common way to get the desired data is to use the * symbol, which will get all fields from the desired table:
SELECT * FROM wp_posts;
Instead, you should definitely select only the desired fields as shown in the example below. On a very small site with, let’s say, one visitor per minute, that wouldn’t make a difference. But on a site such as Cats Who Code, it saves a lot of work for the database.
SELECT title, excerpt, author FROM wp_posts;
Avoid queries in loops
When using SQL along with a programming language such as PHP, it can be tempting to use SQL queries inside a loop. But doing so is like hammering your database with queries.
This example illustrates the whole “queries in loops” problem:
foreach ($display_order as $id => $ordinal) {
$sql = "UPDATE categories SET display_order = $ordinal WHERE id = $id";
mysql_query($sql);
}
Here is what you should do instead:
UPDATE categories
SET display_order = CASE id
WHEN 1 THEN 3
WHEN 2 THEN 4
WHEN 3 THEN 5
END
WHERE id IN (1,2,3)
Use join instead of subqueries
As a programmer, subqueries are something that you can be tempted to use and abuse. Subqueries, as show below, can be very useful:
SELECT a.id,
(SELECT MAX(created)
FROM posts
WHERE author_id = a.id)
AS latest_post FROM authors a
Although subqueries are useful, they often can be replaced by a join, which is definitely faster to execute.
SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
INNER JOIN posts p
ON (a.id = p.author_id)
GROUP BY a.id
Source: http://20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
We have records with a count field on an unique id.
The columns are:
mainId = unique
mainIdCount = 1320 (this 'views' field gets a + 1 when the page is visited)
How can you insert all these mainIdCount's as seperate records in another table IN ANOTHER DBASE in one query?
Yes, I do mean 1320 times an insert with the same mainId! :-)
We actually have records that go over 10,000 times an id. It just has to be like this.
This is a weird one, but we do need the copies of all these (just) counts like this.
The most straightforward way to this is with a JOIN operation between your table, and another row source that provides a set of integers. We'd match each row from our original table to as many rows from the set of integer as needed to satisfy the desired result.
As a brief example of the pattern:
INSERT INTO newtable (mainId,n)
SELECT t.mainId
, r.n
FROM mytable t
JOIN ( SELECT 1 AS n
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
) r
WHERE r.n <= t.mainIdCount
If mytable contains row mainId=5 mainIdCount=4, we'd get back rows (5,1),(5,2),(5,3),(5,4)
Obviously, the rowsource r needs to be of sufficient size. The inline view I've demonstrated here would return a maximum of five rows. For larger sets, it would be beneficial to use a table rather than an inline view.
This leads to the followup question, "How do I generate a set of integers in MySQL",
e.g. Generating a range of numbers in MySQL
And getting that done is a bit tedious. We're looking forward to an eventual feature in MySQL that will make it much easier to return a bounded set of integer values; until then, having a pre-populated table is the most efficient approach.
I'm trying to get 4 random results from a table that holds approx 7 million records. Additionally, I also want to get 4 random records from the same table that are filtered by category.
Now, as you would imagine doing random sorting on a table this large causes the queries to take a few seconds, which is not ideal.
One other method I thought of for the non-filtered result set would be to just get PHP to select some random numbers between 1 - 7,000,000 or so and then do an IN(...) with the query to only grab those rows - and yes, I know that this method has a caveat in that you may get less than 4 if a record with that id no longer exists.
However, the above method obviously will not work with the category filtering as PHP doesn't know which record numbers belong to which category and hence cannot select the record numbers to select from.
Are there any better ways I can do this? Only way I can think of would be to store the record id's for each category in another table and then select random results from that and then select only those record ID's from the main table in a secondary query; but I'm sure there is a better way!?
You could of course use the RAND() function on a query using a LIMIT and WHERE (for the category). That however as you pointed out, entails a scan of the database which takes time, especially in your case due to the volume of data.
Your other alternative, again as you pointed out, to store id/category_id in another table might prove a bit faster but again there has to be a LIMIT and WHERE on that table which will also contain the same amount of records as the master table.
A different approach (if applicable) would be to have a table per category and store in that the IDs. If your categories are fixed or do not change that often, then you should be able to use that approach. In that case you will effectively remove the WHERE from the clause and getting a RAND() with a LIMIT on each category table would be faster since each category table will contain a subset of records from your main table.
Some other alternatives would be to use a key/value pair database just for that operation. MongoDb or Google AppEngine can help with that and are really fast.
You could also go towards the approach of a Master/Slave in your MySQL. The slave replicates content in real time but when you need to perform the expensive query you query the slave instead of the master, thus passing the load to a different machine.
Finally you could go with Sphinx which is a lot easier to install and maintain. You can then treat each of those category queries as a document search and let Sphinx randomize the results. This way you offset this expensive operation to a different layer and let MySQL continue with other operations.
Just some issues to consider.
Working off your random number approach
Get the max id in the database.
Create a temp table to store your matches.
Loop n times doing the following
Generate a random number between 1 and maxId
Get the first record with a record Id greater than the random number and insert it into your temp table
Your temp table now contains your random results.
Or you could dynamically generate sql with a union to do the query in one step.
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
UNION
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
UNION
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
UNION
SELECT * FROM myTable WHERE ID > RAND() AND Category = zzz LIMIT 1
Note: my sql may not be valid, as I'm not a mySql guy, but the theory should be sound
First you need to get number of rows ... something like this
select count(1) from tbl where category = ?
then select a random number
$offset = rand(1,$rowsNum);
and select a row with offset
select * FROM tbl LIMIT $offset, 1
in this way you avoid missing ids. The only problem is you need to run second query several times. Union may help in this case.
For MySQl you can use
RAND()
SELECT column FROM table
ORDER BY RAND()
LIMIT 4
A table with about 70K records is displayed on a site, showing 50 records per page.
Pagination is done with limit offset,50 on the query, and the records can be ordered on different columns.
Browsing the latest pages (so the offset is around 60,000) makes the queries much slower than when browsing the first pages (about 10x)
Is this an issue of using the limit command?
Are there other ways to get the same results?
With large offsets, MySQL needs to browse more records.
Even if the plan uses filesort (which means that all records should be browsed), MySQL optimizes it so that only $offset + $limit top records are sorted, which makes it much more efficient for lower values of $offset.
The typical solution is to index the columns you are ordering on, record the last value of the columns and reuse it in the subsequent queries, like this:
SELECT *
FROM mytable
ORDER BY
value, id
LIMIT 0, 10
which outputs:
value id
1 234
3 57
4 186
5 457
6 367
8 681
10 366
13 26
15 765
17 345 -- this is the last one
To get to the next page, you would use:
SELECT *
FROM mytable
WHERE (value, id) > (17, 345)
ORDER BY
value, id
LIMIT 0, 10
, which uses the index on (value, id).
Of course this won't help with arbitrary access pages, but helps with sequential browsing.
Also, MySQL has certain issues with late row lookup. If the columns are indexed, it may be worth trying to rewrite your query like this:
SELECT *
FROM (
SELECT id
FROM mytable
ORDER BY
value, id
LIMIT $offset, $limit
) q
JOIN mytable m
ON m.id = q.id
See this article for more detailed explanations:
MySQL ORDER BY / LIMIT performance: late row lookups
It's how MySQL deals with limits. If it can sort on an index (and the query is simple enough) it can stop searching after finding the first offset + limit rows. So LIMIT 0,10 means that if the query is simple enough, it may only need to scan 10 rows. But LIMIT 1000,10 means that at minimum it needs to scan 1010 rows. Of course, the actual number of rows that need to be scanned depend on a host of other factors. But the point here is that the lower the limit + offset, the lower that the lower-bound on the number of rows that need to be scanned is...
As for workarounds, I would optimize your queries so that the query itself without the LIMIT clause is as efficient as possible. EXPLAIN is you friend in this case...
I would like to be able to pull back 15 or so records from a database. I've seen that using WHERE id = rand() can cause performance issues as my database gets larger. All solutions I've seen are geared towards selecting a single random record. I would like to get multiples.
Does anyone know of an efficient way to do this for large databases?
edit:
Further Edit and Testing:
I made a fairly simple table, on a new database using MyISAM. I gave this 3 fields: autokey (unsigned auto number key) bigdata (a large blob) and somemore (a medium int). I then applied random data to the table and ran a series of queries using Navicat. Here are the results:
Query 1: select * from test order by rand() limit 15
Query 2: select *
from
test
join
(select round(rand()*(select max(autokey) from test)) as val from test limit 15) as rnd
on
rnd.val=test.autokey;`
(I tried both select and select distinct and it made no discernible difference)
and:
Query 3 (I only ran this on the second test):
SELECT *
FROM (
SELECT #cnt := COUNT(*) + 1,
#lim := 10
FROM test
) vars
STRAIGHT_JOIN
(
SELECT r.*,
#lim := #lim - 1
FROM test r
WHERE (#cnt := #cnt - 1)
AND RAND(20090301) < #lim / #cnt
) i
ROWS: QUERY 1: QUERY 2: QUERY 3:
2,060,922 2.977s 0.002s N/A
3,043,406 5.334s 0.001s 1.260
I would like to do more rows so I can see how query 3 scales, but at the moment, it seems as though the clear winner is query 2.
Before I wrap up this testing and declare an answer, and while I have all this data and the test environment set up, can anyone recommend any further testing?
Try:
select * from table order by rand() limit 15
Another (and possibly more efficient way) would be to join against a set of random values. This should work, if there's some contiguous integer key in the table. Here is how I would do it in postgres (My MySQL is a bit rusty)
select * from table join
(select (random()*maxid)::integer as val from generate_series(1,15)) as rnd
on rand.val=table.id;
where maxid is the highest id in table. If id has an index, then this would mean only 15 index lookup, so its very fast.
UPDATE:
Looks like there no such thing as generate_series in MySQL. My fault. We don't need it actually:
select *
from
table
join
-- this just returns 15 random numbers.
-- I need `table` here only to produce rows for rand()
(select round(rand()*(select max(id) from table)) as val from table limit 15) as rnd
on
rnd.val=table.id;
P.S. If I don't want duplicates returned, I can use (select distinct [...]) in the random generator expression.
Update: Check out the accepted answer in this question. It's pure mySQL and even deals with even distribution.
The problem with id = rand() or anything comparable in PHP is that you can't be sure whether that particular ID still exists. Therefore, you need to work with LIMIT, and that can become slow for large amounts of data.
As an alternative to that, you could try using a loop in PHP.
What the loop does is
Create a random integer number using rand(), with a scope between 0 and the number of records in the database
Query the database whether a record with that ID exists
If it exists, add the number to an array
If it doesn't, go back to step 1
End the loop when the array of random numbers contains the desired number of elements
this method could cause a lot of queries in a fragmented table, but they should be pretty fast to execute. It may be faster than LIMIT rand() in certain situations.
The LIMIT method, as outlined by #Luther, is certainly the simplest code-wise.
You could do a query with all the results or however many limited, then use mysqli_fetch_all followed by:
shuffle($a);
$a = array_slice($a, 0, 15);
For a large dataset doing
select * from table order by rand() limit 15
can be quite time and memory consuming.
If your data records happen to be numbered you can put and index on the numbering colum and do a
select * from table where no >= rand() limit 15
Or even better do the random number generation in your application and do
select * from table where no >= $rand and no <= $rand+15
If your data doesn't change too often, it might be worth to add such a numbering a column to make the selection efficient.
Assuming MySQL supports nested queries and that operations on the primary key are fast, I'd try something like
select * from table where id in (select id from table order by rand() limit 15)