Get only random 20% of consecutive inputs

I have a system that constantly gathers items from a rss feed.
I want to take only a certain percentage, say 20%, of those items, randomly.
My approach is to "roll a die" for each item using rand(0,100) and accept the item only if the result is < 20.
Is this a good approach?

If you are sure your random source is good enough, then yes, that is a perfectly fine approach.
Note that it is probably simpler to draw rand(1, 5) and accept the item only when the result is 1 (same effect, since 20% is 1/5). Watch the bounds: rand() is inclusive at both ends, so rand(0, 5) has six possible outcomes and rand(0, 100) has 101, which means rand(0,100) < 20 accepts 20 values out of 101, slightly under 20%.

Your approach is correct. However, the standard way of selecting values at random is just to simulate from a uniform(0,1) and accept/reject as appropriate. Your pseudo-code is then:
if (unif(0,1) < 0.2)
    ## Do something
If you apply this to a total of N entries, the number n of selected items follows a Binomial distribution with parameters N and p=0.2. For example, if N=10000, then you would select (on average) N*p=10000*0.2=2000 items. The variance is N*p*(1-p) = 1600, so the standard deviation is sqrt(1600) = 40. So selecting anywhere between
(2000 - 2*sqrt(1600), 2000 + 2*sqrt(1600)) = (1920, 2080)
would be reasonable.
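To make the accept/reject idea and the "reasonable range" concrete, here is a minimal Python sketch. The function names keep_item and expected_band are made up for this illustration, and the fixed seed is only there to make the demo repeatable.

```python
import math
import random

def keep_item(p, rng):
    """Return True with probability p (one independent trial per item)."""
    return rng.random() < p

def expected_band(n_items, p, k=2.0):
    """Mean +/- k standard deviations of a Binomial(n_items, p) count."""
    mean = n_items * p
    sd = math.sqrt(n_items * p * (1.0 - p))
    return mean - k * sd, mean + k * sd

rng = random.Random(42)  # fixed seed: an assumption, only for a repeatable demo
kept = sum(keep_item(0.2, rng) for _ in range(10000))
low, high = expected_band(10000, 0.2)
print(kept, "kept; expected roughly between", low, "and", high)
```

Because every item is decided independently, nothing has to be stored about earlier items, which suits a feed that never ends.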

Related

Subset Sum floats Eliminations

I will be happy to get some help. I have the following problem:
I'm given a list of numbers and a target number.
subset_sum([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20)
I need an algorithm that finds all groups of numbers that sum to the target number, e.g. 20.
First find every single value equal to 20.
Next, the best combinations here are, for example:
11.96 + 8.04
1 + 10 + 9
11.13 + 7.8 + 1.07
9 + 11
Remaining value 15.04.
I need an algorithm that uses each value only once and can use from 1 to n values to reach the target number.
I tried some recursion in PHP but it runs out of memory really fast (50k values), so a solution in Python would help (time/memory wise).
I'd be glad for some guidance here.
One possible solution is this: Finding all possible combinations of numbers to reach a given sum
The only difference is that I need to put a flag on elements already used so that they won't be used twice and I can reduce the number of possible combinations.
Thanks to anyone willing to help.

There are many ways to think about this problem.
If you do recursion make sure to identify your end cases first, then proceed with the rest of the program.
This is the first thing that comes to mind.
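<?php
// Print every subset of $a whose elements sum to $s (each element used at most once).
function subset_sum($a, $s, $c = array())
{
    $eps = 1e-9; // tolerance, because the inputs are floats
    if ($s < -$eps) {
        return; // end case: overshot the target
    }
    if (abs($s) <= $eps) {
        print_r($c); // end case: found a combination
        return;
    }
    foreach ($a as $xd => $xdd) {
        unset($a[$xd]); // only later elements remain, so nothing is reused
        subset_sum($a, $s - $xdd, array_merge($c, array($xdd)));
    }
}

subset_sum([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20);
?>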

This is a possible solution, but it's not pretty:
import itertools
import operator
from functools import reduce

def subset_num(array, num):
    subsets = reduce(operator.add, [list(itertools.combinations(array, r)) for r in range(1, 1 + len(array))])
    return [subset for subset in subsets if sum(subset) == num]

print(subset_num([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20))
Output:
[(20,), (11.96, 8.04), (9, 11), (11, 9), (1, 10, 9), (1, 10, 9), (7.8, 11.13, 1.07)]

DISCLAIMER: this is not a full solution, it is a way to just help you build the possible subsets. It does not help you to pick which ones go together (without using the same item more than once and getting the lowest remainder).
Using dynamic programming you can build all the subsets that add up to the given sum, then you will need to go through them and find which combination of subsets is best for you.
To build this archive you can (I'm assuming we're dealing with non-negative numbers only) put the items in a column, go from top to bottom and for each element compute all the subsets that add up to the sum or a lower number than it and that include only items from the column that are in the place you are looking at or higher. When you build a subset you put in its node both the sum of the subset (which may be the given sum or smaller) and the items that are included in the subset. So in order to compute the subsets for an item [i] you need only look at the subsets you've created for item [i-1]. For each of them there are 3 options:
1) the subset's sum is the given sum ---> Keep the subset as it is and move to the next one.
2) the subset's sum is smaller than the given sum but larger than it if item [i] is added to it ---> Keep the subset as it is and move on to the next one.
3) the subset's sum is smaller than the given sum and it will still be smaller or equal to it if item [i] is added to it ---> Keep one copy of the subset as it is and create another one with item [i] added to it (both as a member and added to the sum of the subset).
When you're done with the last item (item [n]), look at the subsets you've created - each one has its sum in its node and you can see which ones are equal to the given sum (and which ones are smaller - you don't need those anymore).
As I wrote at the beginning - now you need to figure out how to take the best combination of subsets that do not have a shared member between any of them.
Basically you're left with a problem that resembles the classic knapsack problem but with another limitation (not every stone can be taken with every other stone). Maybe the limitation actually helps, I'm not sure.
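A rough Python sketch of the subset-building step described above might look like this. The names subsets_up_to and subsets_hitting are invented here, non-negative numbers are assumed (as in the answer), and the amounts are converted to integer cents, which is my own assumption to sidestep float rounding. Like the answer, it only builds the candidate subsets; it does not pick which ones go together.

```python
def subsets_up_to(items, target):
    """Build every subset whose sum does not exceed target, one item at a time.

    Each entry is (subset_sum, tuple_of_items). Items must be non-negative.
    """
    partial = [(0, ())]  # start with the empty subset
    for item in items:
        extended = []
        for total, members in partial:
            # Options 1 and 2: the subset is already complete, or adding the
            # item would overshoot. Keep it as it is and move on.
            if total == target or total + item > target:
                continue
            # Option 3: keep the old subset and also create a copy with the item.
            extended.append((total + item, members + (item,)))
        partial.extend(extended)
    return partial

def subsets_hitting(items, target):
    """Only the subsets whose stored sum equals the target."""
    return [members for total, members in subsets_up_to(items, target)
            if total == target]

# The question's numbers in integer cents (assumption: two decimals are enough).
cents = [1196, 100, 1504, 780, 2000, 1000, 1113, 900, 1100, 107, 804, 900]
for group in subsets_hitting(cents, 2000):
    print(group)
```

Note that the list of partial subsets can still grow exponentially in the worst case, so for 50k values some pruning or chunking would be needed on top of this.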
A bit more about the advantage of dynamic programming in this case
The basic idea of dynamic programming instead of recursion is to trade redundancy of operations with occupation of memory space. By that I mean to say that recursion with a complex problem (normally a backtrack knapsack-like problem, as we have here) normally ends up calculating the same thing a fair amount of times because the different branches of calculation have no concept of each other's operations and results. Dynamic programming saves the results and uses them along the way to build "bigger" results, relying on the previous/"smaller" ones. Because the use of the stack is much more straightforward than in recursion, you don't get the memory problem you get with recursion regarding the maintenance of the function's state, but you do need to handle a great deal of memory that you store (sometimes you can optimise that).
So for example in our problem, trying to combine a subset that would add up to the required sum, the branch that starts with item A and the branch that starts with item B do not know of each other's operations. Let's assume item C and item D together add up to the sum, that either of them added alone to A or B would not exceed the sum, and that A doesn't go with B in the solution (we can have sum=10, A=B=4, C=D=5, and there is no subset that sums up to 2, so A and B can't be in the same group). The branch trying to figure out A's group would (after trying and rejecting having B in its group) add C (A+C=9) and then add D, at which point it would reject this group and backtrack (A+C+D=14 > sum=10). The same would happen to B of course (A=B), because the branch figuring out B's group has no information regarding what just happened to the branch dealing with A. So in fact we've calculated C+D twice and haven't even used it yet (and we're about to calculate it a third time to realise that C and D belong in a group of their own).
NOTE:
Looking around while writing this answer I came across a technique I was not familiar with and might be a better solution for you: memoization. Taken from wikipedia:
memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.
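As a small illustration of memoization on this kind of problem, the sketch below counts how many subsets reach a target, caching each (index, remaining) state with functools.lru_cache. The function name count_subsets is made up, and positive integer items are assumed. It only counts subsets rather than listing them, so it shows the caching idea and is not a full solution.

```python
from functools import lru_cache

def count_subsets(items, target):
    """Count the subsets of items (each used at most once) that sum to target.

    Assumes positive integers, so a remaining amount of 0 cannot be
    reached again by adding further items.
    """
    items = tuple(items)

    @lru_cache(maxsize=None)
    def count_from(index, remaining):
        if remaining == 0:
            return 1  # the choices made so far hit the target exactly
        if remaining < 0 or index == len(items):
            return 0  # overshot, or nothing left to add
        # Either skip items[index] or take it. A repeated (index, remaining)
        # state is answered from the cache instead of being recomputed.
        return (count_from(index + 1, remaining)
                + count_from(index + 1, remaining - items[index]))

    return count_from(0, target)

# The A=B=4, C=D=5 example with sum=10: only C+D works.
print(count_subsets([4, 4, 5, 5], 10))
```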

So I have a possible solution:
from collections import Counter

# compute the difference between two lists but keep duplicates
def list_difference(a, b):
    count = Counter(a)  # count items in a
    count.subtract(b)   # subtract items that are in b
    diff = []
    for x in a:
        if count[x] > 0:
            count[x] -= 1
            diff.append(x)
    return diff

# return the first combination of numbers that matches target
def subset_sum(numbers, target, partial=[]):
    s = sum(partial)
    # check if the partial sum equals the target
    if s == target:
        print("--------------------------------------------sum_is(%s)=%s" % (partial, target))
        return partial
    if s >= target:
        return None  # if we have passed the target, why bother to continue
    for i in range(len(numbers)):
        n = numbers[i]
        remaining = numbers[i+1:]
        rest = subset_sum(remaining, target, partial + [n])
        if type(rest) is list:
            return rest  # stop at the first match
    return None

# repeat until the rest no longer exceeds target or stops changing
def repeatUntil(subset, target):
    currSubset = []
    while sum(subset) > target and currSubset != subset:
        diff = subset_sum(subset, target)
        currSubset = subset
        subset = list_difference(subset, diff)
    return subset

print(repeatUntil([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20))
Output:
--------------------------------------------sum_is([11.96, 8.04])=20
--------------------------------------------sum_is([1, 10, 9])=20
--------------------------------------------sum_is([7.8, 11.13, 1.07])=20
--------------------------------------------sum_is([20])=20
--------------------------------------------sum_is([9, 11])=20
[15.04]
Unfortunately this solution only works for a small list. For a big list I am still trying to break the list into small chunks and calculate, but the answer is not quite correct. You can see it in a new thread here:
Finding unique combinations of numbers to reach a given sum

Algorithm to detect numbers that do not make sense

I am using an application that collects price data and makes sensible buying and selling prices each time data is retrieved. Now it can happen that the numbers are way too high or way too low because of how the system works. I can't do anything about this.
Now my question is, if I have an array of numbers like:
$prices = ['300','312','293','298','1025','12'];
What would be a good algorithm to get rid of the 12 and 1025? Note that a too-high number appears far more often than a really low number, so simply taking an average doesn't work.
I thought about taking an average of the whole array, looping through the array, calculating the percentage difference for each item and checking whether it is under a threshold, but I thought that this wouldn't be as accurate as I would like.

Have you thought about absolute numbers?
If I understood you correctly, there are multiple price lists, so the average valid price could differ: it could be around 1000 in one list and around 300 in another, like in your example. My suggested algorithm works with both. You did not say whether the price spread would always be as tight as in the examples, or whether it could be wider when the price is higher.
I will split my answer into four parts. The first part applies to both situations (the price difference is small at low values and large at high values). The second part is useful if the price difference increases as the average valid price increases. The third part is the whole algorithm, showing how to wrap it all together. The last part covers what to do on the first run.
Part 1: Finding a value for validation processing
You say that you have a list of these numbers and that new data is retrieved all the time. What I would suggest is to subtract two numbers from each other and take the absolute value.
Example:
|300-312| = 12
With the number 12 we can conclude that both these prices are in the valid price range. Now let's take two other examples, one where both values are invalid and one where only one is invalid.
Example:
|1025-12| = 1013
We can see that 1013 is in no way a plausible price difference in this list. Since both values are invalid, we have to test each of them against a valid price. The algorithm will then remove them both.
Example:
|300-12| = 288
We can see that 288 isn't a plausible difference either; the algorithm will remove 12.
Part 2: validating a price with varying price differences
If you have lists where the valid prices can differ by as much as 400, a fixed tolerance of -50 to +50 will give you bugs in your algorithm. You therefore need a scalable way to determine the tolerance, so that higher prices are allowed larger differences.
If the absolute difference is higher than 20% (or another percentage) of the average of the two numbers, they need further validation.
Example:
(300+312)/2 = 306 is the average number.
306*0.2 = 61.2
If you have a stored value of the highest and lowest valid number you could use 20% of their average to determine the threshold.
(293+312)/2 = 302.5
302.5*0.2 = 60.5
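The relative threshold from Part 2 can be sketched in a few lines of Python. The helper names pair_threshold and pair_looks_valid are hypothetical, and the 20% ratio is just the example value used above.

```python
def pair_threshold(a, b, ratio=0.2):
    """Threshold as a fraction (default 20%) of the average of the two prices."""
    return (a + b) / 2 * ratio

def pair_looks_valid(a, b, ratio=0.2):
    """True when the absolute difference stays within the relative threshold."""
    return abs(a - b) <= pair_threshold(a, b, ratio)

print(pair_threshold(300, 312))     # 20% of the average 306
print(pair_looks_valid(300, 312))   # difference 12: within the threshold
print(pair_looks_valid(300, 12))    # difference 288: needs further validation
print(pair_looks_valid(1025, 12))   # difference 1013: needs further validation
```

A failed pair check only says that at least one of the two values is suspicious; it does not say which one, which is why the later parts compare each value against known valid prices.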
Part 3: wrapping it all up and making an algorithm
So the first thing you should do is determine the amount of data in each list, the number of lists, and how often you receive data. The bigger the amount of data and the more often you receive it, the more reasonable it is to index your data. What I would suggest is that for each list you save the highest and lowest valid number. If this is not the case you can skip this part and look at part 4, as you can basically run the algorithm against the whole list each time you receive new data.
First add 4 values to a list: min price, max price, average price and threshold. The average price is (max price+min price)/2. After this you can use a percentage of the average price to determine a threshold for your prices. I suggest 20%, since it results in a number close to the number you use, which is 50; find the threshold by multiplying the average by 0.2.
Depending on your data you can always choose to base the threshold on 20% of the average of the min value, the max value and the new number ((min+max+new)/3*0.2). You can change this calculation if the difference should ever change.
When you receive new numbers your algorithm should check the absolute difference against the threshold.
If new numbers arrive at a low frequency, I would suggest this:
ProcessNumber(var value)
{
    // Depending on how many numbers you want to be valid you can change the threshold.
    // Doing it this way allows the maximum value to change if the new number is valid but higher than the max value.
    if (absoluteValue(MinValue - value) <= MaxValue * 0.2)
    {
        addNumber(value);
    }
    else
    {
        deleteNumber(value);
    }
}
If new numbers arrive very often you can process two numbers at once. If invalid numbers occur about 1/3 of the time, I'd suggest the above method instead.
ProcessNumbers(var value1, var value2)
{
    if (absoluteValue(value1 - value2) <= threshold) // if you want the threshold number to be valid too, use less than or equal to
    {
        addNumber(value1); // if you have a method to add them
        addNumber(value2);
        return true;
    }
    else if (checkNumber(value1)) // returns true if valid
    { // we now know value1 is valid
        deleteNumber(value2); // because the pair check was false and we know value1 is valid, value2 must be the invalid one
        addNumber(value1);
    }
    else if (checkNumber(value2))
    { // we now know value2 is valid
        deleteNumber(value1);
        addNumber(value2);
    }
    else
    { // we now know both values are invalid
        deleteNumber(value1);
        deleteNumber(value2);
    }
    return false;
}
Part 4: first run
You will need an algorithm for the first run, if there currently are no invalid numbers and you didn't skip you can ignore this part.
For the first run you should group the numbers into sorted lists according to the threshold they fall under.
You take two numbers at a time and see if the absolute difference is below the threshold.
absolute = absoluteValue(value1 - value2);
threshold = (value1 + value2) / 2 * 0.2;
if (absolute < threshold)
    AddToThreshold(threshold, value1, value2);
else
    AddToLater(value1, value2);
AddToLater fills a list of values you have to double-check, since you don't know whether value1, value2 or both values sent them into this list.
AddToThreshold makes sure that if there is a threshold group with a value higher than the threshold submitted, the values are added to that group.
Now you should have a few groups with thresholds. Take the lowest value of the lowest group and the lowest value of the highest group and check whether their absolute difference is below their threshold. You can then use this threshold to decide whether other absolute differences are below it and to separate the values from each other. Let's take your list and use the lowest threshold together with the highest absolute difference between two valid numbers.
Threshold:
(293+298)/2 = 295.5, and 295.5*0.2 = 59.1 (this is the threshold)
Highest possible absolute difference between 2 valid numbers:
|293-312| = 19
This became a really long post and I hope it gives you at least some inspiration. If you do not have that many lists, this much processing may be overkill, unless you are planning something scalable.
Best of luck!

What you are describing is called outlier detection. There are statistical tests for this purpose. Beware anyway that nothing can guarantee 100% reliability.
http://en.wikipedia.org/wiki/Outlier#Identifying_outliers
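One simple test of this kind is based on the median and the median absolute deviation (MAD), which are not pulled around by a few extreme values the way a plain average is. The sketch below is only one possible choice, not the definitive method. The function name drop_outliers and the cutoff k=3 are assumptions; 1.4826 is the usual constant that makes the MAD comparable to a standard deviation for normally distributed data.

```python
import statistics

def drop_outliers(values, k=3.0):
    """Keep the values that lie within k scaled MADs of the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread information, so leave the data alone
    scale = 1.4826 * mad
    return [v for v in values if abs(v - med) <= k * scale]

prices = [300, 312, 293, 298, 1025, 12]
print(drop_outliers(prices))
```

On the example list the median is 299 and the MAD is 9.5, so 1025 and 12 fall far outside the band while the four plausible prices are kept.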

Create fixed length non-repeating permutation within certain ranges in PHP

I've got a table with 1000 recipes in it, each recipe has calories, protein, carbs and fat values associated with it.
I need to figure out an algorithm in PHP that will allow me to specify value ranges for calories, protein, carbs and fat as well as dictating the number of recipes in each permutation. Something like:
getPermutations($recipes, $lowCal, $highCal, $lowProt, $highProt, $lowCarb, $highCarb, $lowFat, $highFat, $countRecipes)
The end goal is allowing a user to input their calorie/protein/carb/fat goals for the day (as a range, 1500-1600 calories for example), as well as how many meals they would like to eat (count of recipes in each set) and returning all the different meal combinations that fit their goals.
I've tried this previously by populating a table with every possible combination (see: Best way to create Combination of records (Order does not matter, no repetition allowed) in mySQL tables ) and querying it with the range limits, however that proved not to be efficient as I end up with billions of records to scan through and it takes an indefinite amount of time.
I've found some permutation algorithms that are close to what I need, but don't have the value range restraint for calories/protein/carbs/fat that I'm looking for (see: Create fixed length non-repeating permutation of larger set) I'm at a loss at this point when it comes to this type of logic/math, so any help is MUCH appreciated.

Based on some comment clarification, I can suggest one way to go about it. Specifically, this is my "try the simplest thing that could possibly work" approach to a problem that is potentially quite tricky.
First, the tricky part is that the sum of all meals has to be in a certain range, but SQL does not have a built-in feature that I'm aware of that does specifically what you want in one pass; that's ok, though, as we can just implement this functionality in PHP instead.
So let's say you request 5 meals that will total 2000 calories - we leave the other variables aside for simplicity, but they will work the same way. We then calculate that the 'average' meal is 2000/5=400 calories, but obviously any one meal could be over or under that amount. I'm no dietician, but I assume you'll want no meal that takes up more than 1.25x-2x the average meal size, so we can restrict our initial query to this amount.
$maxCalPerMeal = ($highCal / $countRecipes) * 1.5;
$mealPlanCaloriesRemaining = $highCal; # more on this one in a minute
We then request 1 random meal which is less than $maxCalPerMeal, and 'save' it as our first meal. We then subtract its actual calorie count from $mealPlanCaloriesRemaining. We now recalculate:
$maxCalPerMeal = ($mealPlanCaloriesRemaining / $countRecipesRemaining) * 1.5; # 1.5 being a maximum deviation from average multiple
Now the next query asks for a random meal that is less than both $maxCalPerMeal and $mealPlanCaloriesRemaining, and that is NOT one of the meals you have already saved in this particular meal plan option (thus ensuring unique meals - no mac'n'cheese for breakfast, lunch, and dinner!). We update the variables as after the first query, and repeat until we reach the end. For the last meal requested we don't care about the average and its associated multiple: thanks to the compound query you'll get what you want anyway, and you don't need to complicate your control loops.
Assuming the worst case with the 5 meal 2000 calorie max diet:
Meal 1: 600 calories
Meal 2: 437
Meal 3: 381
Meal 4: 301
Meal 5: 281
Or something like that, and in most cases you'll get something a bit nicer and more random. But in the worst case it still works! Now this actually just plain works for the usual case. Adding more maximums, like for fat and protein, is easy, so let's deal with the lows next.
All we need to do to support "minimum calories per day" is add another set of averages, as such:
$minCalPerMeal = ($lowCal / $countRecipes) * .5; # this time our multiplier is less than one: as we allow meals to be bigger than average, we must allow them to be smaller as well
And you restrict the query to being greater than this calculated minimum, recalculating with each loop, and happiness naturally ensues.
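The loop described so far can be sketched in Python. This is only an illustration under several assumptions of mine: the recipes sit in an in-memory list of (name, calories) tuples instead of a SQL table, only calories are constrained, the per-meal maximum is recomputed from the calories still remaining, and a failed attempt is simply thrown away and retried. The function name random_meal_plan and its parameters are hypothetical.

```python
import random

def random_meal_plan(recipes, low_cal, high_cal, count, rng,
                     spread=1.5, shrink=0.5, max_attempts=1000):
    """Return `count` distinct (name, calories) recipes whose total lies in
    low_cal..high_cal, or None if no plan was found within max_attempts."""
    for _ in range(max_attempts):
        plan = []
        remaining = high_cal  # calories still available under the maximum
        for slots_left in range(count, 0, -1):
            needed = max(0, low_cal - (high_cal - remaining))  # to reach the minimum
            if slots_left == 1:
                # Last meal: ignore the averages, just land inside the range.
                lowest, highest = needed, remaining
            else:
                lowest = needed / slots_left * shrink
                highest = min(remaining, remaining / slots_left * spread)
            candidates = [r for r in recipes
                          if lowest <= r[1] <= highest and r not in plan]
            if not candidates:
                break  # dead end: throw the plan away and start over
            meal = rng.choice(candidates)
            plan.append(meal)
            remaining -= meal[1]
        else:
            return plan  # every slot was filled
    return None

recipes = [("recipe%d" % i, 300 + 10 * i) for i in range(21)]  # made-up data
print(random_meal_plan(recipes, 1900, 2000, 5, random.Random(7)))
```

The max_attempts cap stops the retries from running forever when the ranges cannot be satisfied at all.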
Finally we must deal with the degenerate case - what if using this method you end up needing a meal that is too small or too big to fill the last slot? Well, you can handle this in a number of ways. Here's what I'd recommend.
The easiest is just returning less than the desired amount of meals, but this might be unacceptable. You could also have special low calorie meals that, due to the minimum average dietary content, would only be likely to be returned if someone really had to squeeze in a light meal to make the plan work. I rather like this solution.
The second easiest is throw out the meal plan you have so far and regenerate from scratch; it might work this time, or it just might not, so you'll need a control loop to make sure you don't get into an infinite work-intensive loop.
The least easy option again requires a control loop with a maximum iteration count, but here you use a specific strategy to try to get a more acceptable meal plan. You take the chosen meal with the highest value that is exceeding your dietary limits and throw it out, then try pulling a smaller meal - perhaps one that is no greater than the newly calculated average. It might make the plan as a whole work, or you might go over on another value, forcing you back into a loop that could be unresolvable - or it might just take a few dozen iterations to get one that works.
Though this sounds like a lot when writing it out, even a very slow computer should be able to churn out hundreds of thousands of suggested meal plans every few seconds without pausing. Your database will be under very little strain even if you have millions of recipes to choose from, and the meal plans you return will be as random as it gets. It would also be easy to make certain multiple suggested meal plans are not duplicates with a simple comparison and another call or two for an extra meal plan to be generated - without fear of noticeable delay!
By breaking things down to small steps with minimal mathematical overhead a daunting task becomes manageable - and you don't even need a degree in mathematics to figure it out :)
(As an aside, I think you have a very nice website built there, so no worries!)

random function: higher values appear less often than lower

I have a tricky question that I've looked into a couple of times without figuring it out.
Some backstory: I am making a textbased RPG-game where players fight against animals/monsters etc. It works like any other game where you hit a number of hitpoints on each other every round.
The problem: I am using the random-function in php to generate the final value of the hit, depending on levels, armor and such. But I'd like the higher values (like the max hit) to appear less often than the lower values.
This is an example-graph:
How can I reproduce something like this using PHP and the rand-function? When typing rand(1,100) every number has an equal chance of being picked.
My idea is this: Make a 2nd degree (or quadratic function) and use the random number (x) to do the calculation.
Would this work like I want?
The question is a bit tricky, please let me know if you'd like more information and details.

Please look at this beautiful article:
http://www.redblobgames.com/articles/probability/damage-rolls.html
It has interactive diagrams showing dice rolls and the percentage of each result.
It should be very useful for you.
Pay attention to this way of rolling a random number:
roll1 = rollDice(2, 12);
roll2 = rollDice(2, 12);
damage = min(roll1, roll2);
This should give you what you are looking for.
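To see why taking the minimum of two rolls makes high hits rarer, here is a small Python sketch. It uses a single uniform roll from 1 to max_hit for each of the two rolls, which is a simplification of mine and not the article's rollDice helper. The names skewed_hit and exact_distribution are made up.

```python
import random
from collections import Counter
from fractions import Fraction

def skewed_hit(max_hit, rng):
    """Roll twice and keep the lower value, so high hits become rarer."""
    return min(rng.randint(1, max_hit), rng.randint(1, max_hit))

def exact_distribution(max_hit):
    """Exact probability of each result, found by enumerating both rolls."""
    counts = Counter(min(a, b)
                     for a in range(1, max_hit + 1)
                     for b in range(1, max_hit + 1))
    return {hit: Fraction(c, max_hit ** 2) for hit, c in sorted(counts.items())}

for hit, prob in exact_distribution(6).items():
    print(hit, prob)
```

With a maximum of 6, the lowest hit comes up 11 times out of 36 and the maximum only once out of 36. Taking max() instead of min() would flip the skew.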

OK, here's my idea :
Let's say you've got an array of elements (a,b,c,d) and you want to randomly pick one of them. Doing a rand(1,4) to get the random element index would mean that all elements have an equal chance of appearing (25%).
Now, let's say we take this array : (a,b,c,d,d).
Here we still have 4 distinct elements, but they no longer have equal chances of appearing.
a,b,c : 20%
d : 40%
Or, let's take this array :
(1,2,3,...,97,97,97,98,98,98,99,99,99,100,100,100,100)
Hint: this way you don't just bias the random number generation, you actually set the desired probability of appearance of each number (or of a range of numbers).
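The duplicated-array trick is the same thing as a weighted pick, which avoids physically repeating elements. Below is a small Python sketch: probabilities shows what a duplicated pool implies, and weighted_pick uses the standard random.choices function with explicit weights. Both helper names are invented for this example.

```python
import random
from collections import Counter
from fractions import Fraction

def probabilities(pool):
    """Probability of each distinct element when picking uniformly from pool."""
    counts = Counter(pool)
    return {item: Fraction(n, len(pool)) for item, n in counts.items()}

def weighted_pick(weights, rng):
    """Pick a key of `weights` with probability proportional to its weight."""
    items = list(weights)
    return rng.choices(items, weights=[weights[i] for i in items], k=1)[0]

print(probabilities(["a", "b", "c", "d", "d"]))  # a, b, c: 1/5 each; d: 2/5
print(weighted_pick({"a": 1, "b": 1, "c": 1, "d": 2}, random.Random(0)))
```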
So, that's how I would go about that :
If you want numbers from 1 to 100 (with higher numbers appearing more frequently), get a random number from 1 to 1000 and associate it with a wider range. E.g.
rand = 800-1000 => rand/10 (80->100)
rand = 600-800 => rand/9 (66->88)
...
Or something like that. (You could use any math operation you imagine, modulo or whatever... and play with your algorithm). I hope you get my idea.
Good luck! :-)

generate a random number between 1 and x where a lower number is more likely than a higher one

This is more of a maths/general programming question, but I am programming with PHP if that makes a difference.
I think the easiest way to explain is with an example.
Say the range is between 1 and 10.
I want to generate a number between 1 and 10 that is more likely to be low than high.
The only way I can think is generate an array with 10 elements equal to 1, 9 elements equal to 2, 8 elements equal to 3.....1 element equal to 10. Then generate a random number based on the number of elements.
The trouble is I am potentially dealing with 1 - 100000 and that array would be ridiculously big.
So how best to do it?

Generate a random number between 0 and a random number!

Generate a number between 1 and foo(n), where foo runs an algorithm over n (e.g. a logarithmic function). Then reverse foo() on the result.

Generate a number n with 0 <= n < 1, multiply it by itself, then multiply by x, run floor on it and add 1. Sorry, it has been too long since I used PHP to write the code in it.
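Since the answer gives no code, here is a sketch of those steps in Python rather than PHP. The function names are hypothetical. Squaring a number in [0, 1) pushes it towards 0, which is what makes low results more likely.

```python
import math
import random

def skew_low(u, x):
    """Map u in [0, 1) to an integer in 1..x; squaring u favours small results."""
    return math.floor(u * u * x) + 1

def random_skew_low(x, rng):
    """Draw u uniformly from [0, 1) and map it with skew_low."""
    return skew_low(rng.random(), x)

print([skew_low(u, 10) for u in (0.0, 0.25, 0.5, 0.75, 0.99)])  # [1, 1, 3, 6, 10]
```

Half of all draws (u below 0.5) land in the lowest quarter of the range, so no big array is needed even for x = 100000.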

You could do
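$rand = floor(100000 * (mt_rand() / mt_getrandmax()) * (mt_rand() / mt_getrandmax())); // rand(0, 1) would only return the integers 0 or 1, so scale mt_rand() to a float in [0, 1] instead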
Or something along these lines

There are basically two (or more?) ways to map uniform density to any distribution function: Inverse transformation sampling and Rejection sampling. I think in your case you should use the former.

Quick and simple:
rand(1, rand(1, n))
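To check how strongly this one-liner skews the results, the exact probabilities can be computed. Below is a Python sketch with hypothetical helper names, where rng.randint plays the role of PHP's inclusive rand().

```python
import random
from fractions import Fraction

def nested_rand(n, rng):
    """Python equivalent of rand(1, rand(1, n)); both bounds are inclusive."""
    return rng.randint(1, rng.randint(1, n))

def nested_rand_probability(k, n):
    """Exact P(result == k): the inner draw m must be >= k, then k is 1 of m."""
    return sum(Fraction(1, n) * Fraction(1, m) for m in range(k, n + 1))

for k in range(1, 4):
    print(k, nested_rand_probability(k, 3))  # 11/18, 5/18, 1/9
```

The drop-off is steeper than linear near 1, so whether this fits depends on how much more likely the low numbers should be.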

What you need to do is generate a random number over a greater interval (preferably floating point), and map that into [1,10] in a nonuniform way. Exactly what way depends on how much more likely you want a 1 to be than a 9 or 10.
For C language solutions, see these libraries. You may find use for this in PHP.

Generally speaking, it looks like you want to draw a random number from a Poisson distribution rather than the [uniform distribution](http://en.wikipedia.org/wiki/Uniform_distribution_(continuous)). On the wiki page cited above there is a section which specifically states how you can use the continuous distribution to generate a pseudo-Poisson distribution... check it out. Note that you may want to test different values of λ to ensure the distribution works as you want it to.

It depends on what distribution you want to have exactly, i.e., what number should appear with what probability.
For instance, for even n you could do the following: generate one random integer x between 1 and n/2 and a second random integer y between 1 and n+1. If y > x, return x; otherwise return n-x+1. This should give you the distribution in your example.

I think this should give the requested distribution:
Generate a random number in the range 1 .. x. Generate another one in the range 1 .. x+1.
Return the minimum of the two.
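This can be checked by brute force. The Python sketch below (helper names are mine) enumerates every pair of draws and compares the result with the "x copies of 1, x-1 copies of 2, ..., 1 copy of x" array from the question.

```python
import random
from collections import Counter
from fractions import Fraction

def triangular_pick(x, rng):
    """Minimum of one draw from 1..x and one draw from 1..x+1."""
    return min(rng.randint(1, x), rng.randint(1, x + 1))

def exact_pmf(x):
    """Exact probability of each result, by enumerating all pairs of draws."""
    counts = Counter(min(a, b)
                     for a in range(1, x + 1)
                     for b in range(1, x + 2))
    total = x * (x + 1)
    return {k: Fraction(counts[k], total) for k in range(1, x + 1)}

pmf = exact_pmf(10)
array_size = 10 * 11 // 2  # size of the array from the question
print(all(pmf[k] == Fraction(10 - k + 1, array_size) for k in range(1, 11)))
```

For x = 10 every probability matches (x-k+1) / (x(x+1)/2), so the result has the same distribution as the big array without having to build it.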

Let's think about how your array idea changes the probabilities. Normally every element from 1 to n has a probability of 1/n and is thus equally likely.
Since you have n entries for 1, n-1 entries for 2...1 entry for n, then the total number of entries you have is an arithmetic series. The sum of an arithmetic series counting from 1 to n is n(1+n)/2. So now we know every element's probability should use that as the denominator.
Element 1 has n entries, so its probability is n / (n(n+1)/2). Element 2 has probability (n-1) / (n(n+1)/2), and so on down to element n with 1 / (n(n+1)/2). That gives a general numerator of n-i+1, where i is the number you are checking, so the probability of any element i is (n-i+1) / (n(n+1)/2). All probabilities are between 0 and 1 and sum to 1 by definition, and that is key to the next step.
How can we use this function to skew the number of times an element appears? It's easier with continuous distributions (i.e. doubles instead of ints) but we can do it. First let's make an array of our probabilities, call it c, and make a running sum of them (cumsum) and store it back in c. If that doesn't make sense, it's just a loop like
for(j=0; j < n; j++)
    if(j) c[j]+=c[j-1]
Now that we have this cumulative distribution, generate a number u from 0 to 1 (a double, not an int). If u is between 0 and c[0], return 1; if u is between c[0] and c[1], return 2; and so on all the way up to n, e.g.
for(j=0; j < n; j++)
    if(u <= c[j]) return j+1
This will distribute the integers according to the probabilities you have calculated.
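The cumulative-sum lookup can be written compactly in Python. In this sketch (function names are mine) the running sum is kept as integer weights n, n-1, ..., 1 instead of probabilities, which is equivalent once the random number is scaled by the total. A binary search from the standard bisect module replaces the linear scan.

```python
import bisect
import itertools
import random

def make_cumulative(n):
    """Running sum of the weights n, n-1, ..., 1 for the values 1..n."""
    return list(itertools.accumulate(range(n, 0, -1)))

def pick_from_u(u, cumulative):
    """Map u in [0, 1) to a value in 1..n via the cumulative weights."""
    point = u * cumulative[-1]
    index = bisect.bisect_right(cumulative, point)
    return min(index, len(cumulative) - 1) + 1  # clamp guards against float rounding

def pick(cumulative, rng):
    """Draw one value with the skewed distribution."""
    return pick_from_u(rng.random(), cumulative)

cumulative = make_cumulative(4)  # [4, 7, 9, 10]
print([pick_from_u(u, cumulative) for u in (0.0, 0.35, 0.5, 0.85, 0.95)])  # [1, 1, 2, 3, 4]
```

For n = 100000 the cumulative list has only n entries, compared with the roughly n(n+1)/2 entries of the array in the question.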

<?php
//get random number between 1 and 10,000
$random = mt_rand(1, 10000);
?>
