I want to display 10 rows from the "questions" table, where questions with type_id = 1 are picked with probability 0.2 and questions with type_id = 2 with probability 0.8.
Below is my query. How do I add the probability?
$query = "select * from questions ORDER BY RAND() LIMIT 10";
I want to display 10 questions, of which 20% have type_id = 2 and 80% have type_id = 1.
Can someone help me, please?
As I noted in the comments, you won't be able to use anything as obvious as ORDER BY RAND() if you want to include probabilities or anything like that. ORDER BY RAND() simply doesn't support that kind of thing. ORDER BY RAND() is also very slow, and not really suitable for use on a database of any significant size anyway.
There are a whole bunch of approaches you can use to do a random sort order with weighting or probabilities; I'm not going to try to discuss them all; I'll just give you a relatively simple one, but please be aware that the best technique for you will depend on your specific use case.
A simple approach would be something like this:
Create a new integer field on your table called weight or something similar.
Add a DB index for this field to enable you to query it quickly.
Set the first record to a value equal to its weighting as a whole number, i.e. a probability of 0.2 could be a weight of 20.
Set each subsequent record to the max value of this field plus the weight for that record. So if the second record is also 0.2, it would get a value of 40; if the one after that is only 0.1, it would be 50; and so on.
Do likewise for any new records that get added.
Now you can select a random record, with different weights for each record, as follows:
-- Pick the random threshold once; RAND() inside WHERE is re-evaluated for every row.
SET @r = FLOOR(RAND() * (SELECT MAX(weight) FROM questions));
SELECT * FROM questions
WHERE weight > @r
ORDER BY weight
LIMIT 1;
(Note: I'm writing this answer in a hurry and without the resources to test it; I haven't run this query, so I may have got the syntax wrong, but the basic technique is sound.)
This will pick a random number between zero and the largest weight value, and then find the question record with the smallest weight value above that random number.
Also, because the weight field is indexed, this query will be quick and efficient.
Downsides of this technique: It assumes that the weights for any given record won't change. If the weight of a record does need to change, then you would have to update the weight value for every record after it in the index.
[EDIT]
Let's imagine a table like this:
id Name
1 Question One
2 Question Two
3 Question Three
4 Question Four
5 Question Five
In this example, we want Questions 1 and 2 to have a probability of 0.2, question 3 to have a probability of 0.1 and questions 4 and 5 to have a probability of 0.3. Those probabilities can be expressed as integers by multiplying them by 100. (multiply by 10 also works, but 100 means we can have probabilities like 0.15 as well)
We add the weight column and the index for it, and set the weight values as follows:
id  Name            Weight
1   Question One    20
2   Question Two    40   (i.e. previous value + 20)
3   Question Three  50   (i.e. previous value + 10)
4   Question Four   80   (i.e. previous value + 30)
5   Question Five   110  (i.e. previous value + 30)
Now we can run our query.
The random part of the query FLOOR(RAND() * (SELECT MAX(weight) FROM questions)) will select a whole number from 0 to 109 (RAND() is always below 1, so the result never reaches 110). Let's imagine it gives 68.
Now the rest of our query says to pick the first record where the weight is greater than 68. In this case, that means that the record we get is record #4.
This gives us our probability because the random number could be anything, but is more likely to select a given record if the gap between its weight and the one before it is larger. You'll get record #4 three times as often as record #3.
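The lookup in this worked example can be sketched outside the database as well. The following is a minimal Python illustration (not the PHP/MySQL code itself) that assumes the cumulative weights from the table above; the names CUMULATIVE, pick and pick_random are hypothetical.

```python
import bisect
import random

# Cumulative weights from the example table: (id, running total).
# The individual weights behind them are 20, 20, 10, 30 and 30.
CUMULATIVE = [(1, 20), (2, 40), (3, 50), (4, 80), (5, 110)]


def pick(r):
    """Return the id of the first record whose cumulative weight exceeds r."""
    totals = [total for _, total in CUMULATIVE]
    # bisect_right returns the first position whose total is strictly above r.
    index = bisect.bisect_right(totals, r)
    return CUMULATIVE[index][0]


def pick_random():
    """Draw a whole number from 0 to 109, like FLOOR(RAND() * 110), and look it up."""
    return pick(random.randrange(CUMULATIVE[-1][1]))
```

Calling pick(68) returns 4, matching the example. Counting over every possible value from 0 to 109, record #4 is returned for 30 values and record #3 for 10, which is the three-to-one ratio described above.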
Related
I have a list of questions in a category, and want to choose a subset of them to ask the user based on which ones they answered right/wrong previously.
I want to make it random, but in a way that the ones they have more trouble with are asked more frequently.
EDIT: I'm trying to figure out how to calculate the weight/bias/score for each question based on the number of times they've answered it right/wrong.
I came up with the following, but it seems odd to me:
I assign a score to each question based on how many times they answered it right/wrong
Obviously, if they've never been asked that question I need to assign an arbitrary score (I chose 5)
For all other question, I use the formula
score = wrong*2-right
so if I had the following 10 questions, the "score" would be calculated for each of them (R=# of times they got it right, W=# of times they got it wrong and S=score). From there, I take the lowest score and assign that a probability of 1 (in this case it was id=5 with a score of -7). I then take the difference between the lowest score and the second lowest score (id=1 with -5, a difference of 2) and assign it a probability of 1 + the difference = 3.
I continue this for every question, and then at the end I can just choose a random number between Min(1) and Max(82) and select the question that has the lowest P where random <= P. So if my random # was 79 I would choose id=2.
But this seems long and convoluted. Is there an easier way to do this (I'm using PHP and mysql, But I plan to do this within an app with a local datastore as well)
id  R  W  S   P
1   5  0  -5  3
2   3  5  7   82
3   6  2  -2  8
4   2  2  2   23
5   9  1  -7  1
6   3  1  -1  14
7   0  0  5   68
8   7  5  3   33
9   6  5  4   44
10  3  4  5   56
EDIT: to clarify, I'm stuck on the issue of "weight" (P value in my example)...I'm trying to find a good (and fast) way of calculating the "weight" for each problem, given the number of right and wrong answers they've given for the question
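To make the scheme described above concrete, here is a rough Python sketch (an illustration only, not the asker's PHP) that recomputes S and P from the R and W columns. The helper names are hypothetical, and the order of the two questions that tie on a score of 5 is an arbitrary choice, so their P values may be swapped relative to the table.

```python
from itertools import accumulate

# (id, right, wrong) copied from the table above.
HISTORY = [(1, 5, 0), (2, 3, 5), (3, 6, 2), (4, 2, 2), (5, 9, 1),
           (6, 3, 1), (7, 0, 0), (8, 7, 5), (9, 6, 5), (10, 3, 4)]


def score(right, wrong):
    """score = wrong*2 - right, with an arbitrary 5 for never-asked questions."""
    if right == 0 and wrong == 0:
        return 5
    return wrong * 2 - right


def build_p(history):
    """Return [(id, P)] ordered by score, accumulating P as in the table."""
    scored = sorted((score(r, w), qid) for qid, r, w in history)
    lowest = scored[0][0]
    # The lowest score gets 1; each later row adds its distance from the lowest score.
    steps = [1] + [s - lowest for s, _ in scored[1:]]
    return [(qid, p) for (_, qid), p in zip(scored, accumulate(steps))]


def choose(table, r):
    """Return the id with the lowest P that is still >= r."""
    return next(qid for qid, p in table if r <= p)
```

With this data, build_p(HISTORY) yields the P values 1, 3, 8, 14, 23, 33, 44, 56, 68 and 82, and choose(build_p(HISTORY), 79) returns 2.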
I am not sure if I understand your question correctly, but it seems you are looking for a sort of "weighted" random number generator. In essence, what you want to do is give more weight to the problems they are having issues with. Perhaps create a class called questions with a weight property that holds how much weight you give each question. Then, when you generate the random selection, use something like this.
http://codetheory.in/weighted-biased-random-number-generation-with-javascript-based-on-probability/
After doing some research, I realize that my initial method of calculating a weight is a bit slow. After using the formula, I end up with some negative weights. I then have to go through each one and add ABS(MIN(S)) to each weight, which is unnecessary.
My new formula would be S = CEILING(Wrong * 5 / Right)
Obviously I'd need to account for 0 values, so the code would be:
if (R == 0 AND W == 0) S = 10
else if (R == 0) S = W*5
else if (W == 0) S = CEILING(5/R)
else S = CEILING(W * 5 / R)
I've worked out the numbers for a few sample sets and this gives me fairly good results. It also allows me to keep the SCORE value updated in the database, so it doesn't need to be recalculated every time (just updated whenever that question is answered)
Once I have a set of 60 or so questions and I want to choose 5 or 10 of them, I can just create a random # between 1-SUM(SCORE) and then use a binary search to figure out which question that represents.
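As a rough illustration of the revised formula and the binary-search lookup, here is a short Python sketch (not the PHP implementation). The function names are hypothetical, and drawing several distinct questions by removing each pick before the next draw is an assumption, since the post does not say how repeats are avoided.

```python
import bisect
import random
from itertools import accumulate


def score(right, wrong):
    """S = CEILING(wrong * 5 / right), with the special cases for zero counts."""
    if right == 0 and wrong == 0:
        return 10
    if right == 0:
        return wrong * 5
    if wrong == 0:
        return -(-5 // right)          # integer ceiling of 5 / right
    return -(-(wrong * 5) // right)    # integer ceiling of wrong * 5 / right


def pick_index(scores, r):
    """Map r in 1..sum(scores) to a position using binary search on running totals."""
    totals = list(accumulate(scores))
    return bisect.bisect_left(totals, r)


def pick_distinct(scores, count, rng=random):
    """Draw up to `count` different positions, weighted by score (assumed approach)."""
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(count, len(remaining))):
        weights = [scores[i] for i in remaining]
        r = rng.randint(1, sum(weights))
        chosen.append(remaining.pop(pick_index(weights, r)))
    return chosen
```

Every score is at least 1 under this formula, so the running totals are strictly increasing and each question owns a non-empty slice of the range 1 to SUM(SCORE).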
If anyone has a better suggestion for calculating the score/weight/bias or whatever it's called, I'd appreciate it.
I have this table:
person_id int(10) pk
fid bigint(20) unique
points int(6) index
birthday date index
4 FK columns int(6)
ENGINE = MyISAM
Important info: the table contains over 8 million rows and is fast growing (1.5M a day at the moment)
What I want: to select 4 random rows in a certain range when I order the table on points
How I do it now: In PHP I pick a random range, say 20% as the low bound and 30% as the high bound. Next I count(*) the rows in the table. Then I determine the lowest row number: table count / 100 * low bound, and likewise for the high bound. After that I pick a random row number with rand(lowest_row, highest_row), which falls within the range. Finally I select that row by doing:
SELECT * FROM `persons` WHERE points > 0 ORDER BY points desc LIMIT $random_offset, 1;
The points > 0 is in the query since I only want randoms with at least 1 point.
Above query takes about 1.5 seconds to run, but since I need 4 rows it takes over 6 seconds, which is too slow for me. I figured the order by points takes the most time, so I was thinking about making a VIEW of the table, but I have really no experience with views, so what do you think? Is a view a good option or are there better solutions?
ADDED:
I forgot to say that it is important that all rows have the same chance of being selected.
Thanks, I appreciate all the help! :)
Kevin
Your query is slow, and will keep getting slower as the table grows, because LIMIT with a large offset forces MySQL to order the rows and then step through every row before the offset to get the result. Instead you should do this on the PHP end of things as well. (LIMIT is also non-standard SQL; MSSQL and Oracle, for example, use different syntax for it.)
First ensure there's an index on points. This will make select max(points), min(points) from persons a query that'll return instantly. Next you can determine from those 2 results the points range, and use rand() to determine 4 points in the requested range. Then repeat for each result:
SELECT * FROM persons WHERE points < $myValue ORDER BY points DESC LIMIT 1
Since it only has to retrieve one row, and can determine which one via the index, this'll be in the milliseconds execution time as well.
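The idea in this answer can be tried out on a toy table. The sketch below uses Python with an in-memory SQLite database purely for illustration (the real table is MySQL and far larger); the sample data and helper names are made up, and it uses <= rather than < so that the lowest points value can still match a row.

```python
import random
import sqlite3


def make_demo_db():
    """Build a tiny stand-in for the persons table with an index on points."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE persons (person_id INTEGER PRIMARY KEY, points INTEGER)")
    conn.execute("CREATE INDEX idx_points ON persons (points)")
    conn.executemany("INSERT INTO persons (points) VALUES (?)",
                     [(p,) for p in (0, 3, 7, 7, 12, 20, 35, 50)])
    return conn


def row_at_or_below(conn, value):
    """Return the row with the highest points that does not exceed value."""
    return conn.execute(
        "SELECT person_id, points FROM persons "
        "WHERE points > 0 AND points <= ? ORDER BY points DESC LIMIT 1",
        (value,),
    ).fetchone()


def random_rows(conn, count, rng=random):
    """Pick `count` random points values in the min..max range and look each one up."""
    low, high = conn.execute(
        "SELECT MIN(points), MAX(points) FROM persons WHERE points > 0").fetchone()
    return [row_at_or_below(conn, rng.randint(low, high)) for _ in range(count)]
```

One thing to keep in mind with this approach: a random points value is not the same as a random row position, so rows that follow a large gap in points are returned more often than rows in a dense cluster.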
Views aren't going to do anything to help your performance here. My suggestion would be to simply run:
SELECT * FROM `persons` WHERE points BETWEEN ? AND ?
Make sure you have an index on points. Also, you SHOULD replace * with only the fields you are concerned about, if applicable. Here, of course, each ? represents the lower and upper bound for your search.
You can then determine the number of rows returned in the result set using mysqli_num_rows() (or similar based on your DB library of choice).
You now have the total number of rows that meet your criteria. You can easily then calculate 4 random numbers within the range of results and use mysqli_data_seek() or similar to go directly to the record at the random offset and get the values you want from it.
Putting it all together:
$result = mysqli_query($db_conn, $sql); // here $sql is your SQL query
$num_records = 4; // your number of records to return
$num_rows = mysqli_num_rows($result); // rows that matched the range
$rows = array();
// Fetch $num_records rows at random offsets (the same row may be picked twice).
for ($i = 0; $i < $num_records; $i++) {
    $random_offset = rand(0, $num_rows - 1);
    mysqli_data_seek($result, $random_offset); // jump straight to that row
    $rows[] = mysqli_fetch_object($result);
}
mysqli_free_result($result);
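Independent random offsets can land on the same row more than once. If four different rows are wanted, the offsets can be drawn without replacement. The following Python sketch shows only that offset-selection step, under the assumption that duplicates are unwanted; distinct_offsets is a hypothetical helper, and the same idea carries over to PHP.

```python
import random


def distinct_offsets(num_rows, num_records, rng=random):
    """Return up to num_records different offsets in the range 0..num_rows-1."""
    # Never ask for more offsets than there are rows in the result set.
    count = min(num_records, num_rows)
    # sample() draws without replacement, so no offset repeats.
    return sorted(rng.sample(range(num_rows), count))
```

Each offset returned would then be passed to the seek call in turn; sorting them is optional and only keeps the reads moving forward through the result set.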
In PHP, I've a contest question such as "How many people will participate ?". I need to select the 10 closest answers near this total participants.
I've a table called answers with an ID and number field.
Let's say the total participants are 100 and I want 10 results.
I need to select the 10 results where number is closest to 100. It should be above and below 100.
How could I do that ?
Thanks,
Select the (abs(delta))...
select id, number, abs(100 - number) as delta
from mytable
order by delta
limit 0, 10
Something like this.
You can calculate the proximity as the absolute value of the subtraction:
$proximity=abs($answer - 100);
the smaller, the closer!
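The same ordering can be checked in application code. This is a small Python sketch (rather than PHP) that assumes rows arrive as (id, number) pairs; closest_answers is a hypothetical name.

```python
def closest_answers(rows, target, limit):
    """Return the `limit` rows whose number is nearest to target, above or below."""
    # Sort by distance from the target; ties keep their original order.
    return sorted(rows, key=lambda row: abs(row[1] - target))[:limit]
```

For example, with rows [(1, 90), (2, 130), (3, 101), (4, 100), (5, 97), (6, 250)] and a target of 100, asking for 3 results gives the rows with numbers 100, 101 and 97.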
I have a poll on my website with a 5-star rating:
1 star - 1 (worst)
2 stars - 2
3 stars - 3
4 stars - 4
5 stars - 5 (best)
Now, how should I store the poll records in MySQL? How to calculate them?
The default rating value is 5, but if a user rates it 1 star, it should change this value to 1 instead and then start calculating it somehow... First I need an idea of how to store the votes in my database. You probably have more experience with that.
Store votes in a separate table, this way you will have record on who has voted.
user_id, topic_id, vote, date will be enough for now. Calculating is easy: sum all the votes and divide by the total number of votes related to the topic. This will give you the average. In case you want to show it as 1-5, you can round() it. In order not to do this calculation every time you load a topic, you can store it in a field in the topics table and update that field each time you add/remove a record from the votes table.
Just store the votes in an integer field (1 to 5) in the table, combined with other info (eg to make sure the user can vote only once).
When you want to show the result, you use the cast votes, eg to calculate an average, or other statistics.
Recalculating (and storing) the statistics after each vote is cast is also possible but not really required, unless you have many more page views than votes cast, in which case it might result in less resource usage. (This also depends on the complexity of your statistical calculations, of course.)
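If the cached-statistics route is taken, one simple option is to keep a running sum and count per topic and derive the average from them. The sketch below is a Python illustration under that assumption; add_vote and average are hypothetical names, and the default of 5 for a topic with no votes comes from the question.

```python
def add_vote(vote_sum, vote_count, vote):
    """Return the updated (sum, count) after one new vote of 1 to 5 stars."""
    if not 1 <= vote <= 5:
        raise ValueError("vote must be between 1 and 5")
    return vote_sum + vote, vote_count + 1


def average(vote_sum, vote_count, default=5.0):
    """Average rating, falling back to the default while nobody has voted."""
    if vote_count == 0:
        return default
    return vote_sum / vote_count
```

Storing the sum and count rather than the average itself avoids rounding drift and makes removing a vote just as cheap as adding one.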
I'm retrieving 4 random rows from a table. However, I'd like it such that more weight is given to rows that had just been inserted into the table, without penalizing older rows much.
Is there a way to do this in PHP / SQL?
SELECT *, (RAND() / id) AS o FROM your_table ORDER BY o LIMIT 4
This will order by o, where o is a random number between 0 and 1 / id. This means that the newer your row (the higher its id), the lower its o value tends to be, so newer rows are more likely to land among the first 4 (but still in random order).
I think an agreeable solution would be to use an asymptotic function (1/x) in combination with weighting.
The following has been tested:
SELECT *, (Rand()*10 + (1/(max_id - id + 1))) AS weighted_random
FROM tbl1
ORDER BY weighted_random DESC
LIMIT 4
If you want to get the max_id within the query above, just replace max_id with:
(SELECT id FROM tbl1 ORDER BY id DESC LIMIT 1)
Examples:
Let's say your max_id is 1000 ...
For each of several ids I will calculate out the value:
1/(1000 - id + 1) , which simplifies out to 1/(1001 - id):
id: 1000
1/(1001-1000) = 1/1 = 1
id: 999
1/(1001-999) = 1/2 = .5
id: 998
1/(1001-998) = 1/3 = .333
id: 991
1/(1001-991) = 1/10 = .1
id: 901
1/(1001-901) = 1/100 = .01
The nature of this 1/x makes it so that only the numbers close to max have any significant weighting.
You can see a graph of asymptotic functions, and read more about them, here:
http://zonalandeducation.com/mmts/functionInstitute/rationalFunctions/oneOverX/oneOverX.html
Note that the right side of the graph with positive numbers is the only part relevant to this specific problem.
Manipulating our equation to do different things:
(Rand()*a + (1/(b*(max_id - id + 1/b))))
I have added two values, "a", and "b"... each one will do different things:
The larger "a" gets, the less influence order has on selection. It is important to have a relatively large "a", or pretty much only recent ids will be selected.
The larger "b" gets, the more quickly the asymptotic curve will decay to insignificant weighting. If you want more of the recent rows to be weighted, I would suggest experimenting with values of "b" such as: .5, .25, or .1.
The 1/b at the end of the equation offsets problems you have with smaller values of b that are less than one.
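To get a feel for how "a" and "b" behave before running the query, the expression can be evaluated by hand. This is a small Python sketch of the same arithmetic (not SQL); recency_bonus and weighted_random are hypothetical names, and u stands in for the value Rand() would return.

```python
def recency_bonus(row_id, max_id, b=1.0):
    """The 1/(b*(max_id - id + 1/b)) term: 1 for the newest row, decaying for older ones."""
    return 1.0 / (b * (max_id - row_id + 1.0 / b))


def weighted_random(row_id, max_id, u, a=10.0, b=1.0):
    """Rand()*a plus the recency bonus, with u playing the role of Rand()."""
    return u * a + recency_bonus(row_id, max_id, b)
```

With b = 1 this reproduces the values listed in the examples (1, .5, .333, .1, .01). With b = 0.5, id 998 out of 1000 gets a bonus of 0.5 instead of .333, which illustrates how smaller values of "b" spread the weighting over more of the recent rows.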
Note:
This is not a very efficient solution when you have a large number of ids (just like the other solutions presented so far), since it calculates a value for each separate id.
... ORDER BY (RAND() + 0.5 * id/maxId)
This will add half of the id/maxId ratio to the random value, i.e. for the newest entry 0.5 is added (as id/maxId = 1) and for the oldest entry almost nothing is added.
Similarly you can also implement other weighting functions. This depends on how exactly you want to weight the values.
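A quick simulation can show how strongly a given weighting function favours newer rows before it goes into a query. The Python sketch below is only an illustration: it assumes ids 1 to 100, takes the 4 rows with the largest sort value, and uses hypothetical helper names.

```python
import random


def sort_key(row_id, max_id, u):
    """RAND() + 0.5 * id/maxId, with u standing in for RAND()."""
    return u + 0.5 * row_id / max_id


def pick_newest_biased(ids, count, rng):
    """Rank ids by the weighted random value, largest first, and keep `count` of them."""
    max_id = max(ids)
    ranked = sorted(ids, key=lambda i: sort_key(i, max_id, rng.random()), reverse=True)
    return ranked[:count]


def share_of_newer_half(trials=2000, seed=1):
    """Fraction of picked rows whose id falls in the newer half of a 100-row table."""
    rng = random.Random(seed)
    ids = list(range(1, 101))
    picked = [i for _ in range(trials) for i in pick_newest_biased(ids, 4, rng)]
    return sum(1 for i in picked if i > 50) / len(picked)
```

If the share comes out higher than wanted, lowering the 0.5 factor reduces the bias towards newer rows; raising it makes the selection lean even more on recency.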