I am trying to calculate an average without being thrown off by a small set of far off numbers (ie, 1,2,1,2,3,4,50) the single 50 will throw off the entire average.
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out) and if you get the average of just the major range it's 20.6 (a big difference than any of the numbers above)
I am thinking that you can get this like so:
c+d-r
Where c is the count of a numbers, d is the distinct values, and r is the range. Then you can apply this to all the possble ranges, and the highest score is the omptimal range to get an average from.
For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6
If you applied this to the entire number list it would be 8+6-41=-27
I think this works pretty good, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or if someone has a better algorithm all together?
You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number-summary often used to figure these things out.
function get_median($arr) {
sort($arr);
$c = count($arr) - 1;
if ($c%2) {
$b = round($c/2);
$a = $b-1;
return ($arr[$b] + $arr[$a]) / 2 ;
} else {
return $arr[($c/2)];
}
}
function get_five_number_summary($arr) {
sort($arr);
$c = count($arr) - 1;
$fns = array();
if ($c%2) {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a-1);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c-1]);
return $fns;
}
else {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c-1]);
return $fns;
}
}
function find_outliers($arr) {
$fns = get_five_number_summary($arr);
$interquartile_range = $fns[3] - $fns[1];
$low = $fns[1] - $interquartile_range;
$high = $fns[3] + $interquartile_range;
foreach ($arr as $v) {
if ($v > $high || $v < $low)
echo "$v is an outlier<br>";
}
}
//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
Why don't you use the median? It's not 30, it's 21.5.
You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.
You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why may statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.
Related
I am attempting to find the cartesian product and append specific criteria.
I have four pools of 25 people each. Each person has a score and a price. Each person in each pool looks as such.
[0] => array(
"name" => "jacob",
"price" => 15,
"score" => 100
),
[1] => array(
"name" => "daniel",
"price" => 22,
"score" => 200
)
I want to find the best combination of people, with one person being picked from each pool. However, there is a ceiling price where no grouping can exceed a certain price.
I have been messing with cartesians and permutation functions and cannot seem to figure out how to do this. The only way I know how to code it is to have nested foreach loops, but that is incredibly taxing.
This code below, as you can see, is incredibly inefficient. Especially if the pools increase!
foreach($poolA as $vA) {
foreach($poolb as $vB) {
foreach($poolC as $vC) {
foreach($poolD as $vD) {
// calculate total price and check if valid
// calculate total score and check if greatest
// if so, add to $greatest array
}
}
}
}
I also thought I could find a way to calculate the total price/score ratio and use that to my advantage, but I don't know what I'm missing.
As pointed out by Barmar, sorting the people in each pool allows you to halt the loops early when the total price exceeds the limit and hence reduces the number of cases you need to check. However, the asymptotic complexity for applying this improvement is still O(n4) (where n is the number of people in a pool).
I will outline an alternative approach with better asymptotic complexity as follow:
Construct a pool X that contains all pairs of people with one from pool A and the other from pool B.
Construct a pool Y that contains all pairs of people with one from pool C and the other from pool D.
Sort the pairs in pool X by total price. Then for any pairs with the same price, retain the one with the highest score and discard the remaining pairs.
Sort the pairs in pool Y by total price. Then for any pairs with the same price, retain the one with the highest score and discard the remaining pairs.
Do a loop with two pointers to check over all possible combinations that satisfy the price constraint, where the head pointer starts at the first item in pool X, and the tail pointer starts at the last item in pool Y. Sample code is given below to illustrate how this loop works:
==========================================================================
$head = 0;
$tail = sizeof($poolY) - 1;
while ($head < sizeof($poolX) && $tail >= 0) {
$total_price = $poolX[$head].price + $poolY[$tail].price;
// Your logic goes here...
if ($total_price > $price_limit) {
$tail--;
} else if ($total_price < $price_limit) {
$head++;
} else {
$head++;
$tail--;
}
}
for ($i = $head; $i < sizeof($poolX); $i++) {
// Your logic goes here...
}
for ($i = $tail; $i >= 0; $i--) {
// Your logic goes here...
}
==========================================================================
The complexity of steps 1 and 2 are O(n2), and the complexity of steps 3 and 4 can be done in O(n2 log(n)) using balanced binary tree. And step 5 is essentially a linear scan over n2 items, so the complexity is also O(n2). Therefore the overall complexity of this approach is O(n2 log(n)).
A couple of things to note about your approach here. Speaking strictly from a mathematics perspective, you're calculating way more permutations than is actually necessary to arrive at a definitive answer.
In combinatorics, there are two important questions to ask in order to arrive at the exact number of permutations necessary to yield all possible combinations.
Does order matter? (for your case, it does not)
Is repetition allowed? (for your case, it is not necessary to repeat)
Since the answer to both of these question is no, you need only a fraction of the iterations you're currently doing with your nested loop. Currently you are doing, pow(25, 4) permutations, which is 390625. You only actually need n! / r! (n-r)! or gmp_fact(25) / (gmp_fact(4) * gmp_fact(25 - 4)) which is only 12650 total permutations needed.
Here's a simple example of a function that produces combinations without repetition (and where order does not matter), using a generator in PHP (taken from this SO answer).
function comb($m, $a) {
if (!$m) {
yield [];
return;
}
if (!$a) {
return;
}
$h = $a[0];
$t = array_slice($a, 1);
foreach(comb($m - 1, $t) as $c)
yield array_merge([$h], $c);
foreach(comb($m, $t) as $c)
yield $c;
}
$a = range(1,25); // 25 people in each pool
$n = 4; // 4 pools
foreach(comb($n, $a) as $i => $c) {
echo $i, ": ", array_sum($c), "\n";
}
It would be pretty easy to modify the generator function to check whether the sum of prices meets/exceeds the desired threshhold and only return valid results from there (i.e. abandoning early where needed).
The reason repetition and order are not important here for your use case, is because it doesn't matter whether you add $price1 + $price2 or $price2 + $price1, the result will undoubtedly be the same in both permutations. So you only need to add up each unique set once to ascertain all possible sums.
Similar to chiwangs solutions, you may eliminate up front every group member, where another group member in that group exists, with same or higher score for a lower price.
Maybe you can eliminate many members in each group with this approach.
You may then either use this technique, to build two pairs and repeat the filtering (eliminate pairs, where anothr pair exists, with higher score for the same or lower costs) and then combine the pairs the same way, or add a member step by step (one pair, a triple, a quartett).
If there exists some member, who exceed the allowed sum price on their own, they can be eliminated up front.
If you order the 4 groups by score descending, and you find a solution abcd, where the sum price is legal, you found the optimal solution for a given set of abc.
The reponses here helped me figure out the best way for me to do this.
I haven't optimized the function yet, but essentially I looped through each results two at a time to find the combined salaries / scores for each combination in the two pools.
I stored the combined salary -> score combination in a new array, and if the salary already existed, I'd compare scores and remove the lower one.
$results = array();
foreach($poolA as $A) {
foreach($poolB as $B) {
$total_salary = $A['Salary'] + $B['Salary'];
$total_score = $A['Score'] + $B['Score'];
$pids = array($A['pid'], $B['pid']);
if(isset($results[$total_salary]) {
if($total_score > $results[$total_salary]['Score']) {
$results[$total_salary]['Score'] => $total_score;
$results[$total_salary]['pid'] => $pids;
} else {
$results[$total_salary]['Score'] = $total_score;
$results[$total_salary]['pid'] = $pids;
}
}
}
After this loop, I have another one that is identical, except my foreach loops are between $results and $poolC.
foreach($results as $R) {
foreach($poolC as $C) {
and finally, I do it one last time for $poolD.
I am working on optimizing the code by putting all four foreach loops into one.
Thank you everyone for your help, I was able to loop through 9 lists with 25+ people in each and find the best result in an incredibly quick processing time!
I'm doing a calculation in PHP using bcmath, and need to raise e by a fractional exponent. Unfortunately, bcpow() only accepts integer exponents. The exponent is typically higher precision than a float will allow, so normal arithmetic functions won't cut it.
For example:
$e = exp(1);
$pow = "0.000000000000000000108420217248550443400745280086994171142578125";
$result = bcpow($e, $pow);
Result is "1" with the error, "bc math warning: non-zero scale in exponent".
Is there another function I can use instead of bcpow()?
Your best bet is probably to use the Taylor series expansion. As you noted, PHP's bcpow is limited to raising to integer exponentiation.
So what you can do is roll your own bc factorial function and use the wiki page to implement a Taylor series expansion of the exponential function.
function bcfac($num) {
if ($num==0) return 1;
$result = '1';
for ( ; $num > 0; $num--)
$result = bcmul($result,$num);
return $result;
}
$mysum = '0';
for ($i=0; $i<300; $i++) {
$mysum = bcadd($mysum, bcdiv(bcpow($pow,$i), bcfac($i)) );
}
print $mysum;
Obviously, the $i<300 is an approximation for infinity... You can change it to suit your performance needs.
With $i=20, I got
1.00000000000000000010842021724855044340662275184110560868263421994092888869270293594926619547803962155136242752708629105688492780863293090291376157887898519458498571566021915144483905034693109606778068801680332504212458366799913406541920812216634834265692913062346724688397654924947370526356787052264726969653983148004800229537555582281617497990286595977830803702329470381960270717424849203303593850108090101578510305396615293917807977774686848422213799049363135722460179809890014584148659937665374616
This is comforting since that small of an exponent should yield something really close to 1.0.
Old question, but people might still be interested nonetheless.
So Kevin got the right idea with the Taylor-polynomial, but when you derive your algorithm from it directly, you can get into trouble, mainly your code gets slow for long input-strings when using large cut-off values for $i.
Here is why:
At every step, by which I mean with each new $i, the code calls bcfac($i). Everytime bcfac is called it performs $i-1 calculations. And $i goes all the way up to 299... that's almost 45000 operations! Not your quick'n'easy floating point operations, but slow BC-string-operations - if you set bcscale(100) your bcmul has to handle up to 10000 pairs of chars!
Also bcpow slows down with increasing $i, too. Not as much as bcfac, because it propably uses something akin to the square-and-multiply method, but it still adds something.
Overall the time required grows quadraticly with the number of polynomial terms computed.
So... what to do?
Here's a tip:
Whenever you handle polynomials, especially Taylor-polynomials, use the Horner method.
It converts this: exp(x) = x^0/0! + x^1/1! + x^2/2! + x^3/3! + ...
...into that: exp(x) = ((( ... )*x/3+1 )*x/2+1 )*x/1+1
And suddenly you don't need any powers or factorials at all!
function bc_exp($number) {
$result = 1;
for ($i=299; $i>0; $i--)
$result = bcadd(bcmul(bcdiv($result, $i), $number), 1);
return $result;
}
This needs only 3 bc-operations for each step, no matter what $i is.
With a starting value of $i=299 (to calculate exp with the same precision as kevin's code does) we now only need 897 bc-operations, compared to more than 45000.
Even using 30 as cut-off instead of 300, we now only need 87 bc-operations while the other code still needs 822 for the factorials alone.
Horner's Method saving the day again!
Some other thoughts:
1) Kevin's code would propably crash with input="0", depending on how bcmath handles errors, because the code trys bcpow(0,0) at the first step ($i=0).
2) Larger exponents require longer polynomials and therefore more iterations, e.g. bc_exp(300) will give a wrong answer, even with $i=299, whyle something like bc_exp(3) will work fine and dandy.
Each term adds x^n/n! to the result, so this term has to get small before the polynomial can start to converge. Now compare two consecutive terms:
( x^(n+1)/(n+1)! ) / ( x^n/n! ) = x/n
Each summand is larger than the one before by a factor of x/n (which we used via the Horner method), so in order for x^(n+1)/(n+1)! to get small x/n has to get small as well, which is only the case when n>x.
Inconclusio: As long as the number of iterations is smaller than the input value, the result will diverge. Only when you add steps until your number of iterations gets larger than the input, the algorithm starts to slowly converge.
In order to reach results that can satisfie someone who is willing to use bcmath, your $i needs to be significantly larger then your $number. And that's a huge proplem when you try to calculate stuff like e^346674567801
A solution is to divide the input into its integer part and its fraction part.
Than use bcpow on the integer part and bc_exp on the fraction part, which now converges from the get-go since the fraction part is smaller than 1. In the end multiply the results.
e^x = e^(intpart+fracpart) = e^intpart * e^fracpart = bcpow(e,intpart) * bc_exp(fracpart)
You could even implement it directly into the code above:
function bc_exp2($number) {
$parts = explode (".", $number);
$fracpart = "0.".$parts[1];
$result = 1;
for ($i=299; $i>0; $i--)
$result = bcadd(bcmul(bcdiv($result, $i), $fracpart), 1);
$result = bcmul(bcpow(exp(1), $parts[0]), $result);
return $result;
}
Note that exp(1) gives you a floating-point number which propably won't satisfy your needs as a bcmath user. You might want to use a value for e that is more accurate, in accordance with your bcscale setting.
3) Talking about numbers of iterations: 300 will be overkill in most situations while in some others it might not even be enough. An algorithm that takes your bcscale and $number and calculates the number of required iterations would be nice. Alraedy got some ideas involving log(n!), but nothing concrete yet.
4) To use this method with an arbitrary base you can use a^x = e^(x*ln(a)).
You might want to divide x into its intpart and fracpart before using bc_exp (instead of doing that within bc_exp2) to avoid unneccessary function calls.
function bc_pow2($base,$exponent) {
$parts = explode (".", $exponent);
if ($parts[1] == 0){
$result = bcpow($base,$parts[0]);
else $result = bcmul(bc_exp(bcmul(bc_ln($base), "0.".$parts[1]), bcpow($base,$parts[0]);
return result;
}
Now we only need to program bc_ln. We can use the same strategy as above:
Take the Taylor-polynomial of the natural logarithm function. (since ln(0) isn't defined, take 1 as developement point instead)
Use Horner's method to drasticly improve performance.
Turn the result into a loop of bc-operations.
Also make use of ln(x) = -ln(1/x) when handling x > 1, to guarantee convergence.
usefull functions(don't forget to set bcscale() before using them)
function bc_fact($f){return $f==1?1:bcmul($f,bc_fact(bcsub($f, '1')));}
function bc_exp($x,$L=50){$r=bcadd('1.0',$x);for($i=0;$i<$L;$i++){$r=bcadd($r,bcdiv(bcpow($x,$i+2),bc_fact($i+2)));}return $r;}#e^x
function bc_ln($x,$L=50){$r=0;for($i=0;$i<$L;$i++){$p=1+$i*2;$r = bcadd(bcmul(bcdiv("1.0",$p),bcpow(bcdiv(bcsub($x,"1.0"),bcadd($x,"1.0")),$p)),$r);}return bcmul("2.0", $r);}#2*Sum((1/(2i+1))*(((x-1)/x+1)^(2i+1)))
function bc_pow($x,$p){return bc_exp(bcmul((bc_ln(($x))), $p));}
I am coding cosine similarity in PHP. Sometimes the formula gives a result above one. In order to derive a degree from this number using inverse cos, it needs to be between 1 and 0.
I know that I don't need a degree, as the closer it is to 1, the more similar they are, and the closer to 0 the less similar.
However, I don't know what to make of a number above 1. Does it just mean it is totally dissimilar? Is 2 less similar than 0?
Could you say that the order of similarity kind of goes:
Closest to 1 from below down to 0 - most similar as it moves from 0 to one.
Closest to 1 from above - less and less similar the further away it gets.
Thank you!
My code, as requested is:
$norm1 = 0;
foreach ($dict1 as $value) {
$valuesq = $value * $value;
$norm1 = $norm1 + $valuesq;
}
$norm1 = sqrt($norm1);
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product)/($norm1*$norm2);
To give you an idea of the kinds of values I'm getting:
0.9076645291077
2.0680991116095
1.4015600717928
1.0377360186767
1.8563586243689
1.0349674872379
1.2083865384822
2.3000034036913
0.84280491429133
Your math is good but I'm thinking you're missing something calculating the norms. It works great if you move that math to its own function as follows:
<?php
function calc_norm($arr) {
$norm = 0;
foreach ($arr as $value) {
$valuesq = $value * $value;
$norm = $norm + $valuesq;
}
return(sqrt($norm));
}
$dict1 = array(5,0,97);
$dict2 = array(300,2,124);
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product)/(calc_norm($dict1)*calc_norm($dict2));
print_r($cospheta);
?>
I don't know if I'm missing something but I think you are not applying the sum and the square root to the values in the dict2 (the query I assume).
If you do not normalised per query you can get results greater than one. However, this is done some times as it is ranking equivalent (proportional) to the correct result and it is quicker to compute.
I hope this helps.
Due to the vagaries of floating point arithmetic, you could have calculations which, when represented in the binary form that computers use, are not exact. Probably you can just round down. Likewise for numbers slightly less than zero.
What would be a good way to generate 7 unique random numbers between 1 and 10.
I can't have any duplicates.
I could write a chunk of PHP to do this (using rand() and pushing used numbers onto an array) but there must be a quick way to do it.
any advice would be great.
Create an array from 1 to 10 (range).
Put it in random order
(shuffle).
Select 7 items from the array (array_slice)
Populate an array with ten elements (the numbers one through ten), shuffle the array, and remove the first (or last) three elements.
Simple one-liner:
print_r(array_rand(array_fill(1, 10, true), 7));
Check out the comments in the php manual, there are several solutions for this.
An easy one is this one:
$min = 1;
$max = 10;
$total = 7;
$rand = array();
while (count($rand) < $total ) {
$r = mt_rand($min,$max);
if (!in_array($r,$rand)) $rand[] = $r;
}
Whole numbers? Well, if you want 7 out of 10 then you more efficiently DON'T want 3 out of 10.
Feel free to use any of the other responses but instead of creating 7 numbers start with 10 and eliminate 3. That will tend to speed things up by more than double.
The "shuffle" method has a MAJOR FALW. When the numbers are big, shuffle 3 billion indexs will instantly CAUSE 500 error. Here comes a best solution for really big numbers.
function getRandomNumbers($min, $max, $total) {
$temp_arr = array();
while(sizeof($temp_arr) < $total) $temp_arr[rand($min, $max)] = true;
return $temp_arr;
}
Say I want to get 10 unique random numbers from 1 billion to 4 billion.
$random_numbers = getRandomNumbers(1000000000,4000000000,10);
PS: Execution time: 0.027 microseconds
rand(1,N) but excluding array(a,b,c,..),
is there already a built-in function that I don't know or do I have to implement it myself(how?) ?
UPDATE
The qualified solution should have gold performance whether the size of the excluded array is big or not.
No built-in function, but you could do this:
function randWithout($from, $to, array $exceptions) {
sort($exceptions); // lets us use break; in the foreach reliably
$number = rand($from, $to - count($exceptions)); // or mt_rand()
foreach ($exceptions as $exception) {
if ($number >= $exception) {
$number++; // make up for the gap
} else /*if ($number < $exception)*/ {
break;
}
}
return $number;
}
That's off the top of my head, so it could use polishing - but at least you can't end up in an infinite-loop scenario, even hypothetically.
Note: The function breaks if $exceptions exhausts your range - e.g. calling randWithout(1, 2, array(1,2)) or randWithout(1, 2, array(0,1,2,3)) will not yield anything sensible (obviously), but in that case, the returned number will be outside the $from-$to range, so it's easy to catch.
If $exceptions is guaranteed to be sorted already, sort($exceptions); can be removed.
Eye-candy: Somewhat minimalistic visualisation of the algorithm.
I don't think there's such a function built-in ; you'll probably have to code it yourself.
To code this, you have two solutions :
Use a loop, to call rand() or mt_rand() until it returns a correct value
which means calling rand() several times, in the worst case
but this should work OK if N is big, and you don't have many forbidden values.
Build an array that contains only legal values
And use array_rand to pick one value from it
which will work fine if N is small
Depending on exactly what you need, and why, this approach might be an interesting alternative.
$numbers = array_diff(range(1, N), array(a, b, c));
// Either (not a real answer, but could be useful, depending on your circumstances)
shuffle($numbers); // $numbers is now a randomly-sorted array containing all the numbers that interest you
// Or:
$x = $numbers[array_rand($numbers)]; // $x is now a random number selected from the set of numbers you're interested in
So, if you don't need to generate the set of potential numbers each time, but are generating the set once and then picking a bunch of random number from the same set, this could be a good way to go.
The simplest way...
<?php
function rand_except($min, $max, $excepting = array()) {
$num = mt_rand($min, $max);
return in_array($num, $excepting) ? rand_except($min, $max, $excepting) : $num;
}
?>
What you need to do is calculate an array of skipped locations so you can pick a random position in a continuous array of length M = N - #of exceptions and easily map it back to the original array with holes. This will require time and space equal to the skipped array. I don't know php from a hole in the ground so forgive the textual semi-psudo code example.
Make a new array Offset[] the same length as the Exceptions array.
in Offset[i] store the first index in the imagined non-holey array that would have skipped i elements in the original array.
Now to pick a random element. Select a random number, r, in 0..M the number of remaining elements.
Find i such that Offset[i] <= r < Offest[i+i] this is easy with a binary search
Return r + i
Now, that is just a sketch you will need to deal with the ends of the arrays and if things are indexed form 0 or 1 and all that jazz. If you are clever you can actually compute the Offset array on the fly from the original, it is a bit less clear that way though.
Maybe its too late for answer, but I found this piece of code somewhere in my mind when trying to get random data from Database based on random ID excluding some number.
$excludedData = array(); // This is your excluded number
$maxVal = $this->db->count_all_results("game_pertanyaan"); // Get the maximum number based on my database
$randomNum = rand(1, $maxVal); // Make first initiation, I think you can put this directly in the while > in_array paramater, seems working as well, it's up to you
while (in_array($randomNum, $excludedData)) {
$randomNum = rand(1, $maxVal);
}
$randomNum; //Your random number excluding some number you choose
This is the fastest & best performance way to do it :
$all = range($Min,$Max);
$diff = array_diff($all,$Exclude);
shuffle($diff );
$data = array_slice($diff,0,$quantity);