Picking Random Elements Based on Weight

Say we have an array of names and weights, something like:
Jane 5
John 3
Dane 0
Doe 1
A weight of 0 is a special case: Dane should show up one tenth as often as he would with a weight of 1.
The rest are proportional.
So Jane will show up 5 times as often as Doe.
The maximum weight is 10.
I am thinking of an efficient algorithm to pick names based on their weight.
What I currently do is translate the weights into one very big array.
So Dane gets 1 entry and Jane gets 50 entries. Then I pick an entry at random, with every entry having an equal chance.
I am using PHP.
I wonder if there is a more efficient way.

You can use the following steps:
First, sum all the weights.
Second, generate a random number between 1 and that sum.
Now we have a random number that is less than or equal to the sum of all weights, but we still need to map it to one of the elements, and elements with a higher weight should have a higher chance of being chosen.
We can do this by subtracting the weights one by one from the generated random number. As long as the result is still positive, move on to the next element. The first element that brings the result to zero or below is the one to pick. An element with a higher weight covers more of the range, so it is hit more often.
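The steps above can be sketched in PHP roughly as follows. The function name pickWeighted is made up for illustration, and the handling of weight 0 (scale every weight by 10 and count 0 as 1) is an assumption taken from the question's "Dane gets 1 entry, Jane gets 50" description, not something the answer specifies.

```php
<?php
// Hypothetical helper: pick one name from a name => weight map.
// Assumption: weights are scaled by 10 and a weight of 0 counts as 1,
// so a zero-weight name is one tenth as likely as a weight-1 name.
function pickWeighted(array $weights): string
{
    if ($weights === []) {
        throw new InvalidArgumentException('weights must not be empty');
    }
    $scaled = [];
    foreach ($weights as $name => $w) {
        $scaled[$name] = $w > 0 ? $w * 10 : 1;
    }
    // Random number between 1 and the sum of all (scaled) weights.
    $r = random_int(1, array_sum($scaled));
    // Subtract weights one by one; the first name that brings the
    // remainder to zero or below is the pick.
    foreach ($scaled as $name => $w) {
        $r -= $w;
        if ($r <= 0) {
            return $name;
        }
    }
    throw new LogicException('not reachable for positive weights');
}

echo pickWeighted(['Jane' => 5, 'John' => 3, 'Dane' => 0, 'Doe' => 1]), PHP_EOL;
```

This needs only one pass over the weights and no expanded array, so memory use stays proportional to the number of names rather than to the sum of the weights.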

Related

select RAND() with probability

I want to display 10 rows from the "questions" table, with a probability of 0.2 for questions that have type_id = 1 and a probability of 0.8 for questions that have type_id = 2.
My query is below. How do I add the probability?
$query = "select * from questions ORDER BY RAND() LIMIT 10";
I want to display 10 questions, of which 20% have type_id = 2 and 80% have type_id = 1.
Can someone help me, please?

As I noted in the comments, you won't be able to use anything as obvious as ORDER BY RAND() if you want to include probabilities or anything like that. ORDER BY RAND() simply doesn't support that kind of thing. ORDER BY RAND() is also very slow, and not really suitable for use on a database of any significant size anyway.
There are a whole bunch of approaches you can use to do a random sort order with weighting or probabilities; I'm not going to try to discuss them all; I'll just give you a relatively simple one, but please be aware that the best technique for you will depend on your specific use case.
A simple approach would be something like this:
Create a new integer field on your table called weight or something similar.
Add a DB index for this field to enable you to query it quickly.
Set the first record to a value equal to its weighting as a whole number. ie a probability of 0.2 could be a weight of 20.
Set each subsequent record to the max value of this field plus the weight for that record. So if the second record is also 0.2, it would get a value of 40; if the one after that is only 0.1, it would be 50; and so on.
Do likewise for any new records that get added.
Now you can select a random record, with different weights for each record, as follows:
SET @r = FLOOR(RAND() * (SELECT MAX(weight) FROM questions));
SELECT * FROM questions
WHERE weight > @r
ORDER BY weight
LIMIT 1;
(Note: I'm writing this answer in a hurry and without the resources to test it; I haven't run this query, so I may have got the syntax wrong, but the basic technique is sound. The random number is stored in a variable first because RAND() inside a WHERE clause is re-evaluated for every row.)
This will pick a random number between zero and the largest weight value, and then find the first question record whose weight value is greater than that random number.
Also, because the weight field is indexed, this query will be quick and efficient.
Downsides of this technique: It assumes that the weights for any given record won't change. If the weight of a record does need to change, then you would have to update the weight value for every record after it in the index.
[EDIT]
Let's imagine a table like this:
id | Name
-------------------
1  | Question One
2  | Question Two
3  | Question Three
4  | Question Four
5  | Question Five
In this example, we want Questions 1 and 2 to have a probability of 0.2, question 3 to have a probability of 0.1 and questions 4 and 5 to have a probability of 0.3. Those probabilities can be expressed as integers by multiplying them by 100. (multiply by 10 also works, but 100 means we can have probabilities like 0.15 as well)
We add the weight column and the index for it, and set the weight values as follows:
id | Name           | Weight
---------------------------------------------------
1  | Question One   | 20
2  | Question Two   | 40  (i.e. previous value + 20)
3  | Question Three | 50  (i.e. previous value + 10)
4  | Question Four  | 80  (i.e. previous value + 30)
5  | Question Five  | 110 (i.e. previous value + 30)
Now we can run our query.
The random part of the query FLOOR(RAND() * (SELECT MAX(weight) FROM questions)) will select a whole number from 0 to 109 (110 itself is never reached, because RAND() is always below 1). Let's imagine it gives 68.
Now the rest of our query says to pick the first record where the weight is greater than 68. In this case, that means that the record we get is record #4.
This gives us our probability because the random number could be anything, but is more likely to select a given record if the gap between its weight and the one before it is larger. You'll get record #4 three times as often as record #3.
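To make the walkthrough concrete, here is a small in-memory PHP sketch of the same running-total idea. It does not touch a database; the function names cumulativeWeights and pickByCumulative are hypothetical, and the weights are the ones from the example table.

```php
<?php
// Turn per-record weights into running totals, like the Weight column above.
function cumulativeWeights(array $weights): array
{
    $total = 0;
    $cumulative = [];
    foreach ($weights as $id => $w) {
        $total += $w;
        $cumulative[$id] = $total;
    }
    return $cumulative;
}

// Return the id of the first record whose running total is greater than $r.
function pickByCumulative(array $cumulative, int $r): int
{
    foreach ($cumulative as $id => $total) {
        if ($total > $r) {
            return $id;
        }
    }
    throw new OutOfRangeException('r must be below the largest running total');
}

$cumulative = cumulativeWeights([1 => 20, 2 => 20, 3 => 10, 4 => 30, 5 => 30]);
$picked = pickByCumulative($cumulative, 68); // the "68" case from the text
$random = pickByCumulative($cumulative, random_int(0, max($cumulative) - 1));
echo "68 selects record #$picked; a random draw selected record #$random", PHP_EOL;
```

The linear scan here stands in for the indexed lookup the database would do; on a large table the index is what keeps the real query fast.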

Algorithm to detect numbers that do not make sense

I am using an application that collects price data and calculates sensible buying and selling prices each time data is retrieved. Because of how the system works, some of the numbers can be way too high or way too low. I can't do anything about this.
Now my question is, if I have an array of numbers like:
$prices = ['300','312','293','298','1025','12'];
What would be a good algorithm to get rid of the 12 and the 1025? Note that a too-high number appears far more often than a really low number, so simply taking an average doesn't work.
I thought about taking an average of the whole array, looping through the array, calculating the percentage difference for each item and checking whether it is under a threshold, but I thought that this wouldn't be as accurate as I would like.

Have you thought about absolute numbers?
If I understood you correctly, there are multiple price lists, so the average valid price can differ: in some lists it could be around 1000 and in others around 300, as in your example. My suggested algorithm works with both. You did not say whether the prices will always be as close together as in your example, or whether the differences could be larger when the price is higher.
I will split my answer into four parts. The first part applies to both situations (whether the price difference is small at low values or large at high values). The second part is useful if the price difference increases as the average valid price increases. The third part is the whole algorithm, showing how to wrap it all together. The last part covers what to do on the first run.
Part 1: Finding a value for validation processing
You say that you have a list of these numbers and that new data is retrieved all the time. What I would suggest is to subtract two numbers from each other and take the absolute value of the result.
Example:
|300-312| = 12
A difference of 12 lets us conclude that both of these prices are in the valid price range. Now let's take two other examples, one where both values are invalid and one where only one is invalid.
Example:
|1025-12| = 1013
A difference of 1013 is far outside the normal spread in this list. The difference alone does not tell us which value is wrong, so both have to be tested against a valid price. Here both fail, and the algorithm removes them both.
Example:
|300-12| = 288
A difference of 288 is not acceptable either. Tested against a valid price, 300 passes and 12 fails, so the algorithm removes 12.
Part 2: validating a price with varying price differences
If you have lists where valid prices can legitimately differ by as much as 400, a fixed tolerance of -50 to +50 will cause bugs in your algorithm. You therefore need a scalable way to set the tolerance, one that allows higher prices to have larger differences.
If the absolute difference is higher than 20% (or another percentage) of the average of the two numbers, they need further validation.
Example:
(300+312)/2 = 306 is the average.
306*0.2 = 61.2
If you have stored the highest and lowest valid numbers, you could use 20% of their average as the threshold:
(293+312)/2 = 302.5
302.5*0.2 = 60.5
Part 3: wrapping it all up and making an algorithm
The first thing you should do is determine the amount of data in each list, the number of lists, and how often you receive data. The more data you have and the more often you receive it, the more reasonable it is to index your data. What I would suggest is that for each list you save the highest and lowest valid number. If you do not need this, you can skip this part and look at part 4, as you can basically run the algorithm against the whole list each time you receive new data.
First store 4 values for each list: min price, max price, average price and threshold. The average price is (max price+min price)/2. After this you can use a percentage of the average price as the threshold for your prices. I suggest 20%, since it results in a number close to 50, the number you use; find the threshold by multiplying the average by 0.2.
Depending on your data, you can also choose a threshold based on 20% of the average of the min value, the max value and the new number ((min+max+new)/3*0.2); you can change this calculation if the difference should ever change.
When you receive new numbers, your algorithm should check the absolute difference against the threshold.
If new numbers arrive at a low frequency, I would suggest processing them one at a time:
ProcessNumber(var value)
{
    // Depending on how many numbers you want to be valid, you can change the threshold.
    // Doing it this way allows the maximum value to change if the new number is valid but higher than the max value.
    if (absoluteValue(MinValue - value) <= MaxValue * 0.2)
    {
        addNumber(value);
    }
    else
    {
        deleteNumber(value);
    }
}
If new numbers are retrieved very often, you can process two numbers at once. If invalid numbers occur about 1/3 of the time, I'd suggest the method above instead.
ProcessNumbers(var value1, var value2)
{
    // use "less than or equal to" if a difference equal to the threshold should count as valid too
    if (absoluteValue(value1 - value2) <= threshold)
    {
        addNumber(value1);
        addNumber(value2);
        return true;
    }
    else if (checkNumber(value1)) // checkNumber returns true if the value is valid
    {
        // value1 is valid and the pair check failed, so value2 must be the invalid one
        deleteNumber(value2);
        addNumber(value1);
    }
    else if (checkNumber(value2))
    {
        // value2 is valid, so value1 is the invalid one
        deleteNumber(value1);
        addNumber(value2);
    }
    else
    {
        // both values are invalid
        deleteNumber(value1);
        deleteNumber(value2);
    }
}
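As a rough illustration, the pairwise check could look like this in PHP. This is only a sketch under assumptions: the state array and the makeState helper are invented here, checkNumber reuses the rule from ProcessNumber, the threshold is 20% of the midpoint of the stored min and max as in Part 2, and "deleting" a number simply means not keeping it.

```php
<?php
// Hypothetical state: stored min/max valid prices plus the accepted values.
function makeState(float $min, float $max): array
{
    return ['min' => $min, 'max' => $max, 'valid' => []];
}

// Same rule as ProcessNumber: distance from min within 20% of max.
function checkNumber(array $state, float $value): bool
{
    return abs($state['min'] - $value) <= $state['max'] * 0.2;
}

// Accept a value and let it widen the stored min/max range.
function addNumber(array &$state, float $value): void
{
    $state['valid'][] = $value;
    $state['min'] = min($state['min'], $value);
    $state['max'] = max($state['max'], $value);
}

function processNumbers(array &$state, float $v1, float $v2): void
{
    // Threshold from Part 2: 20% of the midpoint of min and max.
    $threshold = ($state['min'] + $state['max']) / 2 * 0.2;
    if (abs($v1 - $v2) <= $threshold) {
        addNumber($state, $v1);
        addNumber($state, $v2);
    } elseif (checkNumber($state, $v1)) {
        addNumber($state, $v1); // $v2 is dropped
    } elseif (checkNumber($state, $v2)) {
        addNumber($state, $v2); // $v1 is dropped
    }
    // otherwise both values are dropped
}

$state = makeState(293, 312);
processNumbers($state, 300, 12);
processNumbers($state, 1025, 298);
processNumbers($state, 1025, 12);
echo implode(', ', $state['valid']), PHP_EOL;
```

With the question's numbers and a stored range of 293 to 312, only 300 and 298 are kept.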
Part 4: first run
You will need an algorithm for the first run. If the list currently contains no invalid numbers and you did not skip part 3, you can ignore this part.
For the first run, you should group the numbers into sorted lists according to the threshold they fall under.
Take two numbers at a time and check whether their absolute difference is below the threshold.
absolute = absoluteValue(value1-value2);
threshold = (value1+value2)/2*0.2;
if(absolute<threshold)
    AddToThreshold(threshold,value1,value2);
else
    AddToLater(value1,value2);
AddToLater keeps a list of values you have to double-check, since you don't know whether value1, value2 or both caused the pair to land in this list.
AddToThreshold makes sure that if a threshold group already exists with a threshold higher than the one submitted, the values are added to that group.
You should now have a few groups with thresholds. Next, take the lowest value of the lowest group and the lowest value of the highest group, and check whether their absolute difference is below their threshold. You can then use this threshold to decide whether other absolute differences fall below it, and so separate the values from each other. Let's take your list and compare the lowest threshold with the highest absolute difference between two valid numbers.
Threshold:
(293+298)/2 = 295.5
295.5*0.2 = 59.1 (this is the threshold)
Highest possible absolute difference between 2 valid numbers:
|293-312| = 19
This became a really long post, and I hope it gives you at least some inspiration. If you do not have that many lists, this much processing may be overkill, unless you are planning something scalable.
Best of luck!

What you are describing is called outlier detection. There are statistical tests for this purpose. Be aware, though, that nothing can guarantee 100% reliability.
http://en.wikipedia.org/wiki/Outlier#Identifying_outliers
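One simple way to try this on the question's array is a median-based filter, sketched below in PHP. This is not one of the formal tests from the linked article; it is an illustrative stand-in. The function name removeOutliers and the 20% tolerance are assumptions (the tolerance is borrowed from the other answer). The median is used instead of the mean because a single 1025 barely moves it.

```php
<?php
// Hypothetical helper: keep only prices within $tolerance (as a fraction)
// of the median. A simple heuristic, not a formal statistical test.
function removeOutliers(array $prices, float $tolerance = 0.2): array
{
    $values = array_map('floatval', $prices);
    if ($values === []) {
        return [];
    }
    $sorted = $values;
    sort($sorted);
    $n = count($sorted);
    $mid = intdiv($n, 2);
    // Median: middle element, or the mean of the two middle elements.
    $median = $n % 2 === 1 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

    $kept = array_filter(
        $values,
        fn (float $p): bool => abs($p - $median) <= $tolerance * $median
    );
    return array_values($kept); // reindex, preserving the original order
}

$prices = ['300', '312', '293', '298', '1025', '12'];
$clean = removeOutliers($prices);
echo implode(', ', $clean), PHP_EOL;
```

For this input the median is 299, so the accepted band is roughly 239 to 359, and both 1025 and 12 fall outside it.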

Scratchcard PHP script

I'm working on a scratchcard script and I was wondering if someone could help me out. If you don't understand odds, this may melt your brain a little!
The odds can vary: 1/$x. Let's say for now that $x = 36.
So here's what I am trying to achieve:
I want to generate 9 random numbers between 1 and 5.
I want the odds of 3 numbers matching to be 1/36.
It must be impossible to generate more than 3 duplicates of a number at a time.
I imagine a loop over an array of some kind would probably be the right way to go?

Sometimes - and this is one of those times - cheating is the best way to do what you want to do.
a) Set up an array for your 9 numbers, and a second frequency array (5 elements) that counts how often each number occurs.
b) Generate a random number from 1 to 5. Set the 1st and 2nd cards to this number, and record a count of 2 for it in your frequency array.
c) With probability 1/36 (for example, if a random integer from 0 to 35 comes out as 0), set your 3rd card to the same number and record a count of 3 for it in your frequency array.
d) Generate the rest of the cards: pick a random number, repeat while the frequency of the picked number is >= 2, then set the next card to that number and increase its frequency.
e) When finished, shuffle the cards (generate 2 random numbers between 1 and 9, swap those 2 cards, repeat 20-30 times).
Part d) is what I call cheating: you've put your 1/36 probability in step c), and in d) you just make sure you don't generate another match. Step e) is used to hide that from the user.
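Steps a) to e) might look like this in PHP. Treat it as a sketch: the function name generateCard and its $odds parameter are made up for illustration, and PHP's built-in shuffle() is used in place of the manual swap loop from step e).

```php
<?php
// Hypothetical helper following steps a) to e): returns 9 numbers from 1 to 5
// with a three-of-a-kind appearing with probability 1/$odds.
function generateCard(int $odds = 36): array
{
    // a) frequency counter for the numbers 1..5
    $freq = array_fill(1, 5, 0);

    // b) the first two cards always share one number
    $n = random_int(1, 5);
    $cards = [$n, $n];
    $freq[$n] = 2;

    // c) with probability 1/$odds, add the winning third match
    if (random_int(1, $odds) === 1) {
        $cards[] = $n;
        $freq[$n] = 3;
    }

    // d) fill the remaining cards without creating another triple
    while (count($cards) < 9) {
        do {
            $candidate = random_int(1, 5);
        } while ($freq[$candidate] >= 2);
        $cards[] = $candidate;
        $freq[$candidate]++;
    }

    // e) hide the construction order from the user
    shuffle($cards);
    return $cards;
}

echo implode(' ', generateCard()), PHP_EOL;
```

The loop in d) always terminates: even after a winning triple, the other four numbers can still take up to two cards each, which is more than the six cards left to fill.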

Lottery number analysis

I'm trying to perform some basic analysis on Lotto results :)
I have a database that looks something like:
id|no|day|dd|mmm|yyyy|n1|n2|n3|n4|n5|n6|bb|jackpot|wins|machine|set
--------------------------------------------------------------------
1 |22|mon|22|aug|1999|01|05|11|29|38|39|04|2003202| 1 | Topaz | 3
2 |23|tue|24|aug|1999|01|06|16|21|25|39|03|2003202| 2 | Pearl | 1
That's just an example. So, n1 to n6 are standard balls in the lottery and bb stands for the bonus ball.
I want to write PHP/SQL code that will display just one random sequence of numbers that has yet to come out. However, if the numbers 01, 04, 05, 11, 29, 38 and 39 have come out, I don't want the code to print those same numbers in a different order, as in theory that set of numbers has already won.
I just can't get my head around the logic of this. I'd appreciate any help.
Thanks in advance

Assuming that the balls are stored in ascending order in your database like the examples you've given, you could just generate a random sequence of 6 numbers, sort them and then generate 1 random bonus number. Once you've done that it would just be a matter of doing a simple SQL query into your database and seeing if it comes back with a result:
$nums=...//generate your 6 main numbers here
sort($nums);
$nums[]=...//then append your bonus number, so it is not sorted in with the others
$mysqli=new mysqli('...','...','...','...');
$stmt=$mysqli->prepare("SELECT * FROM table
WHERE n1=? AND n2=? AND n3=? AND n4=? AND n5=? AND n6=? AND bb=?");
$stmt->bind_param('iiiiiii', $nums[0], $nums[1], $nums[2], $nums[3], $nums[4], $nums[5], $nums[6]);
$stmt->execute();
$stmt->store_result();
if($stmt->num_rows==0){
//your numbers have not been drawn before - return them
}else{
//otherwise loop round and try again
}
As long as both lists of numbers (but not the bonus ball) are sorted, you won't have any problems with a different ordering of an already drawn set of numbers.
This will become less efficient as your database of previous draws gets fuller, but I don't think you'll have to worry about that for a few decades. :-)
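The "generate your numbers" step is left open above. One possible sketch in PHP follows. The function name drawNumbers is hypothetical, and the ball range of 1 to 49 is an assumption, since the question does not state the range. It also assumes the bonus ball must differ from the six main balls.

```php
<?php
// Hypothetical generator: 6 distinct main balls in ascending order plus a
// bonus ball that differs from all of them. Requires $max >= 7.
function drawNumbers(int $max = 49): array
{
    $main = [];
    while (count($main) < 6) {
        // Using the number as the array key keeps the picks unique.
        $main[random_int(1, $max)] = true;
    }
    $main = array_keys($main);
    sort($main);

    do {
        $bonus = random_int(1, $max);
    } while (in_array($bonus, $main, true));

    return ['main' => $main, 'bonus' => $bonus];
}

$draw = drawNumbers();
echo implode(' ', $draw['main']), ' + bonus ', $draw['bonus'], PHP_EOL;
```

The six values in $draw['main'] followed by $draw['bonus'] can then be bound to the seven placeholders of the query, in that order.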

What about sorting each already drawn result (each row) in some order, ascending maybe, and then sorting the set of already drawn results (all rows)? Then you will have a list that is easy to look things up in, and in which you can see what is left to be drawn.
Say, for example, you want a set that has never been drawn before. You would just have to loop through the list until you spot a "hole", which would be a never before drawn set. If you would like to optimise further, you could also store the index at which you last found a "hole". Then you would never need to loop through the same part of the list twice, and you could even abandon "completed" parts of the list to save disk space. Or, if you would like the new numbers you come up with to seem random, you could start at a random offset in the list.
To do this effectively you should add an extra column to store the pre-sorted set. For example, if you have (5, 3, 6, 4, 1, 2), that column could contain 010203040506. Pad with enough zeros that every number starts at a fixed offset.
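Building that pre-sorted, zero-padded value is short in PHP. The function name drawKey below is made up for illustration, and two digits per ball is an assumption that holds as long as no ball number exceeds 99.

```php
<?php
// Hypothetical helper: sort the balls and join them as fixed-width,
// zero-padded two-digit strings.
function drawKey(array $balls): string
{
    sort($balls);
    return implode('', array_map(
        fn (int $n): string => sprintf('%02d', $n),
        $balls
    ));
}

echo drawKey([5, 3, 6, 4, 1, 2]), PHP_EOL;
```

Because every key has the same width, comparing two keys as strings gives the same order as comparing the sorted sets number by number, which is what makes the sorted list easy to scan for "holes".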

Search over multiple tables, display even number of results or increase number of one table's result if others have less hits

I have multiple tables/content types searched for a keyword and a fixed number of "result slots" for the autocomplete in the UI.
Let's assume there are 4 tables (persons,pages,articles,places) and 12 result slots. When a search returns 3 or more hits in each table, 3 results are displayed for each table.
I need an algorithm (preferably PHP) that increases the number of slots for a table when there are fewer than three results in the others. It should "fill up" the slots with results from the other tables as long as there are slots (and of course results) left.
e.g.
person: 6
pages: 3
articles: 2
places: 1
thanks!

Interesting question.
Let's say you have 4 categories A, B, C, D in order of priority.
Fetch the number of rows for A, B, C and D.
The function min(3,X) returns the smaller of 3 and X. Now do your initial allocation of slots by
Alloc_A=min(3,A)
Alloc_B=min(3,B)
Alloc_C=min(3,C)
Alloc_D=min(3,D)
The remaining results in each category are then:
Rem_A=A-Alloc_A
and so on.
The number of free slots is then:
free_slots=12-Alloc_A-Alloc_B-Alloc_C-Alloc_D
As for filling the free slots, you can do it in proportion to the number of remaining results. We can allocate in proportion by
Alloc_A+=round(free_slots*Rem_A/(Rem_A+Rem_B+Rem_C+Rem_D))
Alloc_B+=round(free_slots*Rem_B/(Rem_A+Rem_B+Rem_C+Rem_D))
and so on. For example, if there are 4 free slots and there are 9 remaining in B and 3 in D, this will allocate 3 of the 4 slots to B and 1 to D. But this can get unfair if, say, B is 10 times as large as D. You can cap the others at, say, 3x the smallest one.
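Here is one way the allocation could be sketched in PHP. Note the swap: instead of the proportional rounding described above, this version hands out the free slots one at a time in priority order (round-robin), which avoids rounding ever giving out more or fewer than the free slots. The function name allocateSlots and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: $hits maps table name => number of results, listed in
// priority order. Returns table name => number of slots to display.
function allocateSlots(array $hits, int $totalSlots = 12): array
{
    // Initial even share per table (3 for 4 tables and 12 slots).
    $base = intdiv($totalSlots, max(1, count($hits)));
    $alloc = [];
    foreach ($hits as $table => $n) {
        $alloc[$table] = min($base, $n);
    }

    // Hand out free slots one at a time to tables that still have results.
    $free = $totalSlots - array_sum($alloc);
    while ($free > 0) {
        $gave = false;
        foreach ($hits as $table => $n) {
            if ($free > 0 && $alloc[$table] < $n) {
                $alloc[$table]++;
                $free--;
                $gave = true;
            }
        }
        if (!$gave) {
            break; // no table has results left to show
        }
    }
    return $alloc;
}

print_r(allocateSlots(['persons' => 10, 'pages' => 3, 'articles' => 2, 'places' => 1]));
```

With 10 person hits, 3 pages, 2 articles and 1 place, this yields 6, 3, 2 and 1 slots, matching the example in the question.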
