I'm writing a PHP script to use to explore the Collatz conjecture.
Long before I get to orders of magnitude such as 10^18, PHP switches to scientific notation and drops precision. I currently limit it to 10^12 because these values don't suffer from loss of precision. How should I go about handling larger integers without triggering this rounding effect?
The BC Math functions would work for this: https://www.php.net/manual/en/ref.bc.php
For example:
<?php
$number = "931386509544713451";
echo $number;
$steps = 0;
// Compare with bccomp() so the string is never coerced to a float
while (bccomp($number, '1') > 0) {
    if (bcmod($number, '2') == '0') {
        $number = bcdiv($number, '2');
    } else {
        $number = bcadd(bcmul($number, '3'), '1');
    }
    echo ', ' . $number;
    $steps++;
}
echo ' [Steps=' . $steps . ']';
?>
You can run that at http://sandbox.onlinephpfunctions.com/code/e55e3c94d0920ff036f1c7feb8ce839f75e9df43
That will output the entire sequence for that starting number and give the correct number of steps, which is 2283.
(Disclaimer: I haven't written PHP for years, so this is almost certainly not a good example of PHP code. It's just for demonstrating the BC Maths functions.)
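If the BC Math string operations turn out to be too slow for long exploration runs, the GMP extension handles integers of this size as well. A minimal sketch of the same loop, assuming ext-gmp is installed (my variant, not part of the answer above):
<?php
// Same Collatz walk using GMP objects instead of BC Math strings
$n = gmp_init("931386509544713451");
$steps = 0;
while (gmp_cmp($n, 1) > 0) {
    if (gmp_cmp(gmp_mod($n, 2), 0) == 0) {
        $n = gmp_div_q($n, 2);             // n / 2, exact integer division
    } else {
        $n = gmp_add(gmp_mul($n, 3), 1);   // 3n + 1
    }
    $steps++;
}
echo $steps; // 2283, matching the BC Math version
?>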
I'm doing a calculation in PHP using bcmath and need to raise e to a fractional exponent. Unfortunately, bcpow() only accepts integer exponents. The exponent typically has more precision than a float allows, so normal arithmetic functions won't cut it.
For example:
$e = exp(1);
$pow = "0.000000000000000000108420217248550443400745280086994171142578125";
$result = bcpow($e, $pow);
The result is "1", with the warning "bc math warning: non-zero scale in exponent".
Is there another function I can use instead of bcpow()?
Your best bet is probably to use a Taylor series expansion. As you noted, PHP's bcpow() is limited to integer exponents.
So what you can do is roll your own bc factorial function and use the wiki page to implement a Taylor series expansion of the exponential function.
function bcfac($num) {
    if ($num == 0) return '1';
    $result = '1';
    for (; $num > 0; $num--)
        $result = bcmul($result, $num);
    return $result;
}

bcscale(500); // important: with the default scale of 0, bcdiv() would truncate everything to integers
$mysum = '0';
for ($i = 0; $i < 300; $i++) {
    $mysum = bcadd($mysum, bcdiv(bcpow($pow, $i), bcfac($i)));
}
print $mysum;
Obviously, the $i<300 is an approximation for infinity... You can change it to suit your performance needs.
With $i=20, I got
1.00000000000000000010842021724855044340662275184110560868263421994092888869270293594926619547803962155136242752708629105688492780863293090291376157887898519458498571566021915144483905034693109606778068801680332504212458366799913406541920812216634834265692913062346724688397654924947370526356787052264726969653983148004800229537555582281617497990286595977830803702329470381960270717424849203303593850108090101578510305396615293917807977774686848422213799049363135722460179809890014584148659937665374616
This is comforting since that small of an exponent should yield something really close to 1.0.
Old question, but people might still be interested nonetheless.
So Kevin got the right idea with the Taylor polynomial, but when you derive your algorithm from it directly, you can get into trouble: the code becomes slow for long input strings when you use large cut-off values for $i.
Here is why:
At every step, by which I mean with each new $i, the code calls bcfac($i). Every time bcfac() is called it performs $i-1 multiplications, and $i goes all the way up to 299... that's almost 45000 operations! And these aren't your quick'n'easy floating-point operations but slow BC string operations: if you set bcscale(100), each bcmul() has to handle up to 10000 pairs of characters!
bcpow() also slows down with increasing $i. Not as much as bcfac(), because it probably uses something akin to the square-and-multiply method, but it still adds something.
Overall, the time required grows quadratically with the number of polynomial terms computed.
So... what to do?
Here's a tip:
Whenever you handle polynomials, especially Taylor polynomials, use Horner's method.
It converts this: exp(x) = x^0/0! + x^1/1! + x^2/2! + x^3/3! + ...
...into this: exp(x) = ((( ... )*x/3 + 1)*x/2 + 1)*x/1 + 1
And suddenly you don't need any powers or factorials at all!
function bc_exp($number) {
    // bcscale() must be set beforehand, otherwise bcdiv() truncates to integers
    $result = '1';
    for ($i = 299; $i > 0; $i--)
        $result = bcadd(bcmul(bcdiv($result, $i), $number), 1);
    return $result;
}
This needs only 3 bc-operations for each step, no matter what $i is.
With a starting value of $i=299 (to calculate exp() with the same precision as Kevin's code does), we now need only 897 bc operations, compared to more than 45000.
Even using 30 as the cut-off instead of 300, we need only 87 bc operations, while the other code still needs 822 for the factorials alone.
Horner's Method saving the day again!
Some other thoughts:
1) Kevin's code would probably crash with input "0", depending on how bcmath handles errors, because it tries bcpow(0,0) at the first step ($i=0).
2) Larger exponents require longer polynomials and therefore more iterations: e.g. bc_exp(300) will give a wrong answer even with $i=299, while something like bc_exp(3) will work fine and dandy.
Each term adds x^n/n! to the result, so the terms have to get small before the polynomial can start to converge. Now compare two consecutive terms:
( x^(n+1)/(n+1)! ) / ( x^n/n! ) = x/(n+1)
Each term differs from the one before by a factor of x/(n+1) (which is exactly the factor the Horner loop applies), so for x^(n+1)/(n+1)! to get small, x/(n+1) has to drop below 1, which is only the case when n+1 > x.
In conclusion: as long as the number of iterations is smaller than the input value, the terms keep growing. Only when the number of iterations gets larger than the input does the algorithm slowly start to converge.
To reach results that can satisfy someone who is willing to use bcmath, your $i needs to be significantly larger than your $number. And that's a huge problem when you try to calculate something like e^346674567801.
A solution is to divide the input into its integer part and its fraction part.
Then use bcpow() on the integer part and bc_exp() on the fractional part, which now converges from the get-go since the fractional part is smaller than 1. In the end, multiply the results:
e^x = e^(intpart+fracpart) = e^intpart * e^fracpart = bcpow(e,intpart) * bc_exp(fracpart)
You could even implement it directly into the code above:
function bc_exp2($number) {
    $parts = explode(".", $number);
    $fracpart = "0." . $parts[1];
    $result = '1';
    for ($i = 299; $i > 0; $i--)
        $result = bcadd(bcmul(bcdiv($result, $i), $fracpart), 1);
    $result = bcmul(bcpow(exp(1), $parts[0]), $result);
    return $result;
}
Note that exp(1) gives you a floating-point number, which probably won't satisfy your needs as a bcmath user. You might want to use a value for e that is more accurate, in accordance with your bcscale setting.
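One option (my suggestion, not part of the original answer): since the series converges quickly for an argument of 1, you can compute e itself with the bc_exp() above and feed that to bcpow():
bcscale(100);
$e = bc_exp('1'); // 2.71828182845904523536... to the configured scale (up to rounding in the last digits)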
3) Talking about the number of iterations: 300 will be overkill in most situations, while in some others it might not even be enough. An algorithm that takes your bcscale and $number and calculates the required number of iterations would be nice. I already have some ideas involving log(n!), but nothing concrete yet.
4) To use this method with an arbitrary base, you can use a^x = e^(x*ln(a)).
You might want to divide x into its integer and fractional parts before calling bc_exp (instead of doing that within bc_exp2) to avoid unnecessary function calls:
function bc_pow2($base, $exponent) {
    $parts = explode(".", $exponent);
    if (!isset($parts[1]) || $parts[1] == 0) {
        $result = bcpow($base, $parts[0]);
    } else {
        $result = bcmul(bc_exp(bcmul(bc_ln($base), "0." . $parts[1])), bcpow($base, $parts[0]));
    }
    return $result;
}
Now we only need to implement bc_ln. We can use the same strategy as above:
Take the Taylor polynomial of the natural logarithm. (Since ln(0) isn't defined, expand around 1 instead.)
Use Horner's method to drastically improve performance.
Turn the result into a loop of bc-operations.
Also make use of ln(x) = -ln(1/x) when handling x > 1, to guarantee convergence.
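Here is a minimal sketch of that recipe (my own code, not part of the original answer): the Taylor series around 1 is ln(1+y) = y - y^2/2 + y^3/3 - ..., which in Horner form becomes y*(1/1 - y*(1/2 - y*(1/3 - ...))). As before, bcscale() must be set, and the cut-off of 300 is as arbitrary as above:
function bc_ln($x, $steps = 300) {
    // ln(x) = -ln(1/x) keeps the series argument in (0, 1], guaranteeing convergence
    if (bccomp($x, '1') === 1) {
        return bcsub('0', bc_ln(bcdiv('1', $x), $steps));
    }
    $y = bcsub($x, '1');   // expand around 1, so -1 < y <= 0
    $result = '0';
    for ($i = $steps; $i > 0; $i--) {
        // Horner step: result = 1/i - y*result
        $result = bcsub(bcdiv('1', (string)$i), bcmul($y, $result));
    }
    return bcmul($y, $result);
}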
Useful functions (don't forget to set bcscale() before using them):
function bc_fact($f){return bccomp($f,'1')<=0?'1':bcmul($f,bc_fact(bcsub($f,'1')));}
function bc_exp($x,$L=50){$r=bcadd('1.0',$x);for($i=0;$i<$L;$i++){$r=bcadd($r,bcdiv(bcpow($x,$i+2),bc_fact($i+2)));}return $r;}#e^x
function bc_ln($x,$L=50){$r=0;for($i=0;$i<$L;$i++){$p=1+$i*2;$r=bcadd(bcmul(bcdiv("1.0",$p),bcpow(bcdiv(bcsub($x,"1.0"),bcadd($x,"1.0")),$p)),$r);}return bcmul("2.0",$r);}#2*Sum((1/(2i+1))*(((x-1)/(x+1))^(2i+1)))
function bc_pow($x,$p){return bc_exp(bcmul(bc_ln($x),$p));}
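For instance (my own quick check; all of these need bcscale() set first):
bcscale(50);
echo bc_pow('2', '0.5'); // 1.4142135623..., i.e. sqrt(2)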
I've looked at php-bignumbers, BC Math, and GMP for dealing with very big numbers in PHP, but none of them seems to have a function equivalent to PHP's log(). For example, I want to do this:
$result = log($bigNumber, 2);
Would anyone know of an alternate way to get the log base 2 of an arbitrary-precision number in PHP? Maybe I've missed a function, a library, or a formula.
Edit: php-bignumbers seems to have a log base 10 function only, log10().
In general, if you want to implement your own high-precision log calculation, I'd suggest first using the basic properties of the logarithm:
log_a(x) = log_b(x) / log_b(a) => thus you can convert a logarithm to any base
log(x*y) = log(x) + log(y)
log(a**n) = n*log(a)
where log_a(x) means the logarithm of x to the base a, and log means the natural logarithm.
So log(1000000000000000000000.123) = 21*log(10) + log(1.000000000000000000000123),
and for a high-precision log(1+x) use the algorithm referenced at
http://en.wikipedia.org/wiki/Natural_logarithm#High_precision
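A hypothetical sketch of that decomposition for log base 2, assuming a high-precision bc_ln() helper such as one of those shown earlier, and assuming the input is a plain positive decimal string with no sign or leading zeros (the function name and the parsing are my own, and the series helpers converge slowly for arguments far from 1):
function bc_log2($x) {
    $dot = strpos($x, '.');
    $k = ($dot === false ? strlen($x) : $dot) - 1;             // x = m * 10^k
    $m = substr_replace(str_replace('.', '', $x), '.', 1, 0);  // mantissa m, 1 <= m < 10
    $ln = bcadd(bcmul((string)$k, bc_ln('10')), bc_ln($m));    // ln(x) = k*ln(10) + ln(m)
    return bcdiv($ln, bc_ln('2'));                             // change of base to 2
}

bcscale(50);
echo bc_log2("1000000000000000000000.123"); // about 69.76, since log2(10^21) = 21/log10(2)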
One solution combining the suggestions so far would be to use this formula:
log2($num) = log10($num) / log10(2)
in conjunction with php-bignumbers, since it has a pre-made log10 function.
E.g., after installing the php-bignumbers library, use:
$log2 = log10($bigNum) / log10(2);
Personally I've decided to use different math/logic so as to not need the log function, and just using bcmath for the big numbers.
One of the great things about base 2 is that counting and shifting become part of the tool set.
So one way to get a 'log2' of a number is to convert it to a binary string and count the bits.
You can accomplish this equivalently by dividing by 2 in a loop. But it seems to me that counting would be more efficient.
gmp_scan0() and gmp_scan1() can be used if you are counting from the right, but you'd have to somehow convert the mixed bits to all ones or zeroes first.
Using gmp_strval($num, 2), however, you can produce a binary string and do a strpos() on it.
If the whole value is being converted, you can simply do a (strlen() - 1) on it.
Obviously this only works when you want an integer log.
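For example, a quick sketch of that integer log (assuming ext-gmp; the value is the one used in the answer below):
$n = gmp_init("11002930366353704069");
$bits = gmp_strval($n, 2);  // binary string, no leading zeros
echo strlen($bits) - 1;     // 63: the integer part of log2($n)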
I've had a very similar problem just recently, so I simply scaled the number considerably in order to use the built-in log to find the fractional part. (I prefer log10 for some reason... don't ask, people are strange, me too.)
I hope this is self-explanatory enough.
It returns a float value (since that's what I needed):
function gmp_log($num, $base = 10, $full = true)
{
    if ($base == 10)
        $string = gmp_strval($num);
    else
        $string = gmp_strval($num, $base);
    $intpart = strlen($string) - 1;   // integer part of the log: digit count - 1
    if (!$full)
        return $intpart;
    if ($base == 10) {
        $string = substr_replace($string, ".", 1, 0);   // e.g. "1234" -> "1.234"
        $number = floatval($string);
        $lg = $intpart + log10($number);
        return $lg;
    } else {
        // the fractional part is always derived from the base-10 representation
        $string = gmp_strval($num);
        $intpart = strlen($string) - 1;
        $string = substr_replace($string, ".", 1, 0);
        $number = floatval($string);
        $lg = $intpart + log10($number);
        $lb = $lg / log10($base);   // change of base
        return $lb;
    }
}
It's quick, it's dirty... but it works well enough to get the log of some RSA-sized integers ;)
Usage is straightforward as well:
$N = gmp_init("11002930366353704069");
echo gmp_log($N,10)."\n";
echo gmp_log($N,10, false)."\n";
echo gmp_log($N,2)."\n";
echo gmp_log($N,16)."\n";
returns
19.041508364472
19
63.254521604973
15.813630401243
I am coding cosine similarity in PHP. Sometimes the formula gives a result above one. In order to derive an angle in degrees from this number using the inverse cosine, it needs to be between 0 and 1.
I know that I don't need a degree, as the closer it is to 1, the more similar they are, and the closer to 0 the less similar.
However, I don't know what to make of a number above 1. Does it just mean it is totally dissimilar? Is 2 less similar than 0?
Could you say that the order of similarity kind of goes:
Values below 1: most similar close to 1, less and less similar as they move down towards 0.
Values above 1: less and less similar the further away they get from 1.
Thank you!
My code, as requested, is:
$norm1 = 0;
foreach ($dict1 as $value) {
    $valuesq = $value * $value;
    $norm1 = $norm1 + $valuesq;
}
$norm1 = sqrt($norm1);

$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product) / ($norm1 * $norm2);
To give you an idea of the kinds of values I'm getting:
0.9076645291077
2.0680991116095
1.4015600717928
1.0377360186767
1.8563586243689
1.0349674872379
1.2083865384822
2.3000034036913
0.84280491429133
Your math is good, but I think you're missing something when calculating the norms. It works great if you move that math into its own function, as follows:
<?php
function calc_norm($arr) {
    $norm = 0;
    foreach ($arr as $value) {
        $valuesq = $value * $value;
        $norm = $norm + $valuesq;
    }
    return sqrt($norm);
}

$dict1 = array(5, 0, 97);
$dict2 = array(300, 2, 124);

$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product) / (calc_norm($dict1) * calc_norm($dict2));
print_r($cospheta);
?>
I don't know if I'm missing something, but I think you are not applying the sum and the square root to the values in $dict2 (the query, I assume).
If you do not normalise per query, you can get results greater than one. However, this is sometimes done anyway, as the ranking is equivalent (proportional) to the correct result and it is quicker to compute.
I hope this helps.
Due to the vagaries of floating point arithmetic, you could have calculations which, when represented in the binary form that computers use, are not exact. Probably you can just round down. Likewise for numbers slightly less than zero.
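Where the overshoot really is just rounding noise, one way to guard the call to acos() is to clamp first (a minimal sketch):
$cospheta = max(-1.0, min(1.0, $cospheta)); // clamp into acos()'s domain [-1, 1]
$degrees = rad2deg(acos($cospheta));        // now safe from NAN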
I am trying to calculate an average without being thrown off by a small set of far-off numbers (e.g. in 1,2,1,2,3,4,50, the single 50 will throw off the entire average).
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that major range you get 20.6 (a big difference from any of the numbers above).
I am thinking that you can get this like so:
c + d - r
where c is the count of numbers, d is the number of distinct values, and r is the range. Then you can apply this to all the possible ranges, and the highest score is the optimal range to get an average from.
For example, 19,20,21,21,22 would be 5 numbers, 4 distinct values, and a range of 3 (22 - 19). If you plug this into my equation you get 5+4-3=6.
If you applied it to the entire number list it would be 8+6-41=-27.
I think this works pretty well, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or if someone has a better algorithm all together?
You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
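A minimal sketch of that suggestion (the helper name and the one-standard-deviation cut-off are illustrative choices):
function filtered_mean($nums, $k = 1.0) {
    $mean = array_sum($nums) / count($nums);
    $var = 0.0;
    foreach ($nums as $n) {
        $var += ($n - $mean) * ($n - $mean);
    }
    $sd = sqrt($var / count($nums)); // population standard deviation
    $kept = array();
    foreach ($nums as $n) {
        if (abs($n - $mean) <= $k * $sd) // keep anything within k standard deviations
            $kept[] = $n;
    }
    return array_sum($kept) / count($kept);
}

echo filtered_mean(array(19, 20, 21, 21, 22, 30, 60, 60)); // ~22.17: both 60s are dropped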
Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number summary often used to figure these things out.
function get_median($arr) {
    sort($arr);
    $c = count($arr) - 1;
    if ($c % 2) {
        $b = round($c / 2);
        $a = $b - 1;
        return ($arr[$b] + $arr[$a]) / 2;
    } else {
        return $arr[$c / 2];
    }
}

function get_five_number_summary($arr) {
    sort($arr);
    $c = count($arr) - 1;
    $b = round($c / 2);
    $a = $b - 1;
    if ($c % 2) {
        $lower_quartile = array_slice($arr, 1, $a - 1);
    } else {
        $lower_quartile = array_slice($arr, 1, $a);
    }
    $upper_quartile = array_slice($arr, $b + 1, count($lower_quartile));
    // min, Q1, median, Q3, max -- note the maximum is $arr[$c], not $arr[$c - 1]
    return array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c]);
}

function find_outliers($arr) {
    $fns = get_five_number_summary($arr);
    $interquartile_range = $fns[3] - $fns[1];
    $low = $fns[1] - $interquartile_range;
    $high = $fns[3] + $interquartile_range;
    foreach ($arr as $v) {
        if ($v > $high || $v < $low)
            echo "$v is an outlier<br>";
    }
}

//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array(1, 230, 239, 331, 340, 800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
Why don't you use the median? It's not 30, it's 21.5.
You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.
You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
There is no one true answer to your question, because there will always be distributions that give you a funny answer (e.g. consider a biased bimodal distribution). This is why statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.
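For completeness, a minimal sketch of the middle-subrange idea mentioned above (a trimmed mean; the helper name and the trim fraction are my own choices):
function trimmed_mean($nums, $trim = 0.05) {
    sort($nums);
    $drop = (int) floor(count($nums) * $trim); // how many values to drop at each end
    $kept = array_slice($nums, $drop, count($nums) - 2 * $drop);
    return array_sum($kept) / count($kept);
}

echo trimmed_mean(array(19, 20, 21, 21, 22, 30, 60, 60), 0.125); // drops 19 and one 60 -> 29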