Cosine similarity result above one

Cosine similarity result above one - php

I am coding cosine similarity in PHP. Sometimes the formula gives a result above one. In order to derive a degree from this number using inverse cos, it needs to be between 1 and 0.
I know that I don't need a degree, as the closer it is to 1, the more similar they are, and the closer to 0 the less similar.
However, I don't know what to make of a number above 1. Does it just mean it is totally dissimilar? Is 2 less similar than 0?
Could you say that the order of similarity kind of goes:
Closest to 1 from below down to 0 - most similar as it moves from 0 to one.
Closest to 1 from above - less and less similar the further away it gets.
Thank you!
My code, as requested is:
$norm1 = 0;
foreach ($dict1 as $value) {
$valuesq = $value * $value;
$norm1 = $norm1 + $valuesq;
}
$norm1 = sqrt($norm1);
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product)/($norm1*$norm2);
To give you an idea of the kinds of values I'm getting:
0.9076645291077
2.0680991116095
1.4015600717928
1.0377360186767
1.8563586243689
1.0349674872379
1.2083865384822
2.3000034036913
0.84280491429133

Your math is good but I'm thinking you're missing something calculating the norms. It works great if you move that math to its own function as follows:
<?php
function calc_norm($arr) {
$norm = 0;
foreach ($arr as $value) {
$valuesq = $value * $value;
$norm = $norm + $valuesq;
}
return(sqrt($norm));
}
$dict1 = array(5,0,97);
$dict2 = array(300,2,124);
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product)/(calc_norm($dict1)*calc_norm($dict2));
print_r($cospheta);
?>

I don't know if I'm missing something but I think you are not applying the sum and the square root to the values in the dict2 (the query I assume).
If you do not normalised per query you can get results greater than one. However, this is done some times as it is ranking equivalent (proportional) to the correct result and it is quicker to compute.
I hope this helps.

Due to the vagaries of floating point arithmetic, you could have calculations which, when represented in the binary form that computers use, are not exact. Probably you can just round down. Likewise for numbers slightly less than zero.

Related

PHP: How to raise number to (tiny) fractional exponent?

I'm doing a calculation in PHP using bcmath, and need to raise e by a fractional exponent. Unfortunately, bcpow() only accepts integer exponents. The exponent is typically higher precision than a float will allow, so normal arithmetic functions won't cut it.
For example:
$e = exp(1);
$pow = "0.000000000000000000108420217248550443400745280086994171142578125";
$result = bcpow($e, $pow);
Result is "1" with the error, "bc math warning: non-zero scale in exponent".
Is there another function I can use instead of bcpow()?

Your best bet is probably to use the Taylor series expansion. As you noted, PHP's bcpow is limited to raising to integer exponentiation.
So what you can do is roll your own bc factorial function and use the wiki page to implement a Taylor series expansion of the exponential function.
function bcfac($num) {
if ($num==0) return 1;
$result = '1';
for ( ; $num > 0; $num--)
$result = bcmul($result,$num);
return $result;
}
$mysum = '0';
for ($i=0; $i<300; $i++) {
$mysum = bcadd($mysum, bcdiv(bcpow($pow,$i), bcfac($i)) );
}
print $mysum;
Obviously, the $i<300 is an approximation for infinity... You can change it to suit your performance needs.
With $i=20, I got
1.00000000000000000010842021724855044340662275184110560868263421994092888869270293594926619547803962155136242752708629105688492780863293090291376157887898519458498571566021915144483905034693109606778068801680332504212458366799913406541920812216634834265692913062346724688397654924947370526356787052264726969653983148004800229537555582281617497990286595977830803702329470381960270717424849203303593850108090101578510305396615293917807977774686848422213799049363135722460179809890014584148659937665374616
This is comforting since that small of an exponent should yield something really close to 1.0.

Old question, but people might still be interested nonetheless.
So Kevin got the right idea with the Taylor-polynomial, but when you derive your algorithm from it directly, you can get into trouble, mainly your code gets slow for long input-strings when using large cut-off values for $i.
Here is why:
At every step, by which I mean with each new $i, the code calls bcfac($i). Everytime bcfac is called it performs $i-1 calculations. And $i goes all the way up to 299... that's almost 45000 operations! Not your quick'n'easy floating point operations, but slow BC-string-operations - if you set bcscale(100) your bcmul has to handle up to 10000 pairs of chars!
Also bcpow slows down with increasing $i, too. Not as much as bcfac, because it propably uses something akin to the square-and-multiply method, but it still adds something.
Overall the time required grows quadraticly with the number of polynomial terms computed.
So... what to do?
Here's a tip:
Whenever you handle polynomials, especially Taylor-polynomials, use the Horner method.
It converts this: exp(x) = x^0/0! + x^1/1! + x^2/2! + x^3/3! + ...
...into that: exp(x) = ((( ... )*x/3+1 )*x/2+1 )*x/1+1
And suddenly you don't need any powers or factorials at all!
function bc_exp($number) {
$result = 1;
for ($i=299; $i>0; $i--)
$result = bcadd(bcmul(bcdiv($result, $i), $number), 1);
return $result;
}
This needs only 3 bc-operations for each step, no matter what $i is.
With a starting value of $i=299 (to calculate exp with the same precision as kevin's code does) we now only need 897 bc-operations, compared to more than 45000.
Even using 30 as cut-off instead of 300, we now only need 87 bc-operations while the other code still needs 822 for the factorials alone.
Horner's Method saving the day again!
Some other thoughts:
1) Kevin's code would propably crash with input="0", depending on how bcmath handles errors, because the code trys bcpow(0,0) at the first step ($i=0).
2) Larger exponents require longer polynomials and therefore more iterations, e.g. bc_exp(300) will give a wrong answer, even with $i=299, whyle something like bc_exp(3) will work fine and dandy.
Each term adds x^n/n! to the result, so this term has to get small before the polynomial can start to converge. Now compare two consecutive terms:
( x^(n+1)/(n+1)! ) / ( x^n/n! ) = x/n
Each summand is larger than the one before by a factor of x/n (which we used via the Horner method), so in order for x^(n+1)/(n+1)! to get small x/n has to get small as well, which is only the case when n>x.
Inconclusio: As long as the number of iterations is smaller than the input value, the result will diverge. Only when you add steps until your number of iterations gets larger than the input, the algorithm starts to slowly converge.
In order to reach results that can satisfie someone who is willing to use bcmath, your $i needs to be significantly larger then your $number. And that's a huge proplem when you try to calculate stuff like e^346674567801
A solution is to divide the input into its integer part and its fraction part.
Than use bcpow on the integer part and bc_exp on the fraction part, which now converges from the get-go since the fraction part is smaller than 1. In the end multiply the results.
e^x = e^(intpart+fracpart) = e^intpart * e^fracpart = bcpow(e,intpart) * bc_exp(fracpart)
You could even implement it directly into the code above:
function bc_exp2($number) {
$parts = explode (".", $number);
$fracpart = "0.".$parts[1];
$result = 1;
for ($i=299; $i>0; $i--)
$result = bcadd(bcmul(bcdiv($result, $i), $fracpart), 1);
$result = bcmul(bcpow(exp(1), $parts[0]), $result);
return $result;
}
Note that exp(1) gives you a floating-point number which propably won't satisfy your needs as a bcmath user. You might want to use a value for e that is more accurate, in accordance with your bcscale setting.
3) Talking about numbers of iterations: 300 will be overkill in most situations while in some others it might not even be enough. An algorithm that takes your bcscale and $number and calculates the number of required iterations would be nice. Alraedy got some ideas involving log(n!), but nothing concrete yet.
4) To use this method with an arbitrary base you can use a^x = e^(x*ln(a)).
You might want to divide x into its intpart and fracpart before using bc_exp (instead of doing that within bc_exp2) to avoid unneccessary function calls.
function bc_pow2($base,$exponent) {
$parts = explode (".", $exponent);
if ($parts[1] == 0){
$result = bcpow($base,$parts[0]);
else $result = bcmul(bc_exp(bcmul(bc_ln($base), "0.".$parts[1]), bcpow($base,$parts[0]);
return result;
}
Now we only need to program bc_ln. We can use the same strategy as above:
Take the Taylor-polynomial of the natural logarithm function. (since ln(0) isn't defined, take 1 as developement point instead)
Use Horner's method to drasticly improve performance.
Turn the result into a loop of bc-operations.
Also make use of ln(x) = -ln(1/x) when handling x > 1, to guarantee convergence.

usefull functions(don't forget to set bcscale() before using them)
function bc_fact($f){return $f==1?1:bcmul($f,bc_fact(bcsub($f, '1')));}
function bc_exp($x,$L=50){$r=bcadd('1.0',$x);for($i=0;$i<$L;$i++){$r=bcadd($r,bcdiv(bcpow($x,$i+2),bc_fact($i+2)));}return $r;}#e^x
function bc_ln($x,$L=50){$r=0;for($i=0;$i<$L;$i++){$p=1+$i*2;$r = bcadd(bcmul(bcdiv("1.0",$p),bcpow(bcdiv(bcsub($x,"1.0"),bcadd($x,"1.0")),$p)),$r);}return bcmul("2.0", $r);}#2*Sum((1/(2i+1))*(((x-1)/x+1)^(2i+1)))
function bc_pow($x,$p){return bc_exp(bcmul((bc_ln(($x))), $p));}

How to get log() of a very big number (PHP)?

I've looked at php-big numbers, BC Math, and GMP for dealing with very big numbers in php. But none seem to have a function equivilent to php's log(). For example I want to do this:
$result = log($bigNumber, 2);
Would anyone know of an alternate way to get the log base 2 of a arbitray precision point number in php? Maybe Ive missed a function, or library, or formula.
edit: php-bignumbers seems to have a log base 10 function only log10()

In general if you want to implement your high precision log own calculation, I'd suggest 1st use the basic features of logarithm:
log_a(x) = log_b(x) / log_b(a) |=> thus you can recalulate logarith to any base
log(x*y) = log(x) + log(y)
log(a**n) = n*log(a)
where log_a(x) - meaning logarithm to the base a of x; log means natural logarithm
So log(1000000000000000000000.123) = 21*log(1.000000000000000000000123)
and for high precision of log(1+x)
use algorithm referenced at
http://en.wikipedia.org/wiki/Natural_logarithm#High_precision

One solution combining the suggestions so far would be to use this formula:
log2($num) = log10($num) / log10(2)
in conjunction with php-big numbers since it has a pre-made log10 function.
eg, after installing the php-big numbers library, use:
$log2 = log10($bigNum) / log10(2);
Personally I've decided to use different math/logic so as to not need the log function, and just using bcmath for the big numbers.

One of the great things about base 2 is that counting and shifting become part of the tool set.
So one way to get a 'log2' of a number is to convert it to a binary string and count the bits.
You can accomplish this equivalently by dividing by 2 in a loop. But it seems to me that counting would be more efficient.
gmp_scan0 and gmp_scan1 can be used if you are counting from the right. But you'd have to somehow convert the mixed bits to all ones and zeroes.
But using gmp_strval(num, 2), you can produce a string and do a strpos on it.
if the whole value is being converted, you can do a (strlen - 1) on it.
Obviously this only works when you want an integer log.

I've had a very similar problem just recently.. and so I just scaled the number considerably in order to use the inbuild log to find the fractional part.. (I prefere the log10 for some reason.. don't ask... people are strange, me too)
I hope this is selfexplanatory enough..
it returns a float value (since that's what I needed)
function gmp_log($num, $base=10, $full=true)
{
if($base == 10)
$string = gmp_strval($num);
else
$string = gmp_strval($num,$base);
$intpart = strlen($string)-1;
if(!$full)
return $intpart;
if($base ==10)
{
$string = substr_replace($string, ".", 1, 0);
$number = floatval($string);
$lg = $intpart + log10($number);
return $lg;
}
else
{
$string = gmp_strval($num);
$intpart = strlen($string)-1;
$string = substr_replace($string, ".", 1, 0);
$number = floatval($string);
$lg = $intpart + log10($number);
$lb = $lg / log10($base);
return $lb;
}
}
it's quick, it's dirty... but it works well enough to get the log of some RSA sized integers ;)
usage is straight forward as well
$N = gmp_init("11002930366353704069");
echo gmp_log($N,10)."\n";
echo gmp_log($N,10, false)."\n";
echo gmp_log($N,2)."\n";
echo gmp_log($N,16)."\n";
returns
19.041508364472
19
63.254521604973
15.813630401243

PHP how to show all decimal places?

this might be a stupid question but I have searched again and again without finding any results.
So, what I want is to show all the decimal places of a number without knowing how many decimal places it will have. Take a look at this small code:
$arrayTest = array(0.123456789, 0.0123456789);
foreach($arrayTest as $output){
$newNumber = $output/1000;
echo $newNumber;
echo "<br>";
}
It gives this output:
0.000123456789
1.23456789E-5
Now, I tried using 'number_format', but I don't think that is a good solution. It determines an exact amount of decimal places, and I do not know the amount of decimal places for every number. Take a look at the below code:
$arrayTest = array(0.123456789, 0.0123456789);
foreach($arrayTest as $output){
$newNumber = $output/1000;
echo number_format($newNumber,13);
echo "<br>";
}
It gives this output:
0.0001234567890
0.0000123456789
Now, as you can see there is an excess 0 in the first number, because number_format forces it to have 13 decimal places.
I would really love some guidance on how to get around this problem. Is there a setting in PHP.ini which determines the amount of decimals?
Thank you very much in advance!
(and feel free to ask if you have any further questions)

It is "impossible" to answer this question properly - because a binary float representation of a decimal number is approximate: "What every computer scientist should know about floating point"
The closest you can come is write yourself a routine that looks at a decimal representation of a number, and compares it to the "exact" value; once the difference becomes "small enough for your purpose", you stop adding more digits.
This routine could then return the "correct number of digits" as a string.
Example:
<?php
$a = 1.234567890;
$b = 0.123456789;
echo returnString($a)."\n";
echo returnString($b)."\n";
function returnString($a) {
// return the value $a as a string
// with enough digits to be "accurate" - that is, the value returned
// matches the value given to 1E-10
// there is a limit of 10 digits to cope with unexpected inputs
// and prevent an infinite loop
$conv_a = 0;
$digits=0;
while(abs($a - $conv_a) > 1e-10) {
$digits = $digits + 1;
$conv_a = 0 + number_format($a, $digits);
if($digits > 10) $conv_a = $a;
}
return $conv_a;
}
?>
Which produces
1.23456789
0.123456789
In the above code I arbitrarily assumed that being right to within 1E-10 was good enough. Obviously you can change this condition to whatever is appropriate for the numbers you encounter - and you could even make it an optional argument of your function.
Play with it - ask questions if this is not clear.

Calculate average without being thrown by strays

I am trying to calculate an average without being thrown off by a small set of far off numbers (ie, 1,2,1,2,3,4,50) the single 50 will throw off the entire average.
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out) and if you get the average of just the major range it's 20.6 (a big difference than any of the numbers above)
I am thinking that you can get this like so:
c+d-r
Where c is the count of a numbers, d is the distinct values, and r is the range. Then you can apply this to all the possble ranges, and the highest score is the omptimal range to get an average from.
For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6
If you applied this to the entire number list it would be 8+6-41=-27
I think this works pretty good, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or if someone has a better algorithm all together?

You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.

Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number-summary often used to figure these things out.
function get_median($arr) {
sort($arr);
$c = count($arr) - 1;
if ($c%2) {
$b = round($c/2);
$a = $b-1;
return ($arr[$b] + $arr[$a]) / 2 ;
} else {
return $arr[($c/2)];
}
}
function get_five_number_summary($arr) {
sort($arr);
$c = count($arr) - 1;
$fns = array();
if ($c%2) {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a-1);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c-1]);
return $fns;
}
else {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c-1]);
return $fns;
}
}
function find_outliers($arr) {
$fns = get_five_number_summary($arr);
$interquartile_range = $fns[3] - $fns[1];
$low = $fns[1] - $interquartile_range;
$high = $fns[3] + $interquartile_range;
foreach ($arr as $v) {
if ($v > $high || $v < $low)
echo "$v is an outlier<br>";
}
}
//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P

Why don't you use the median? It's not 30, it's 21.5.

You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.

You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why may statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.

How to get a random value from 1~N but excluding several specific values in PHP?

rand(1,N) but excluding array(a,b,c,..),
is there already a built-in function that I don't know or do I have to implement it myself(how?) ?
UPDATE
The qualified solution should have gold performance whether the size of the excluded array is big or not.

No built-in function, but you could do this:
function randWithout($from, $to, array $exceptions) {
sort($exceptions); // lets us use break; in the foreach reliably
$number = rand($from, $to - count($exceptions)); // or mt_rand()
foreach ($exceptions as $exception) {
if ($number >= $exception) {
$number++; // make up for the gap
} else /*if ($number < $exception)*/ {
break;
}
}
return $number;
}
That's off the top of my head, so it could use polishing - but at least you can't end up in an infinite-loop scenario, even hypothetically.
Note: The function breaks if $exceptions exhausts your range - e.g. calling randWithout(1, 2, array(1,2)) or randWithout(1, 2, array(0,1,2,3)) will not yield anything sensible (obviously), but in that case, the returned number will be outside the $from-$to range, so it's easy to catch.
If $exceptions is guaranteed to be sorted already, sort($exceptions); can be removed.
Eye-candy: Somewhat minimalistic visualisation of the algorithm.

I don't think there's such a function built-in ; you'll probably have to code it yourself.
To code this, you have two solutions :
Use a loop, to call rand() or mt_rand() until it returns a correct value
which means calling rand() several times, in the worst case
but this should work OK if N is big, and you don't have many forbidden values.
Build an array that contains only legal values
And use array_rand to pick one value from it
which will work fine if N is small

Depending on exactly what you need, and why, this approach might be an interesting alternative.
$numbers = array_diff(range(1, N), array(a, b, c));
// Either (not a real answer, but could be useful, depending on your circumstances)
shuffle($numbers); // $numbers is now a randomly-sorted array containing all the numbers that interest you
// Or:
$x = $numbers[array_rand($numbers)]; // $x is now a random number selected from the set of numbers you're interested in
So, if you don't need to generate the set of potential numbers each time, but are generating the set once and then picking a bunch of random number from the same set, this could be a good way to go.

The simplest way...
<?php
function rand_except($min, $max, $excepting = array()) {
$num = mt_rand($min, $max);
return in_array($num, $excepting) ? rand_except($min, $max, $excepting) : $num;
}
?>

What you need to do is calculate an array of skipped locations so you can pick a random position in a continuous array of length M = N - #of exceptions and easily map it back to the original array with holes. This will require time and space equal to the skipped array. I don't know php from a hole in the ground so forgive the textual semi-psudo code example.
Make a new array Offset[] the same length as the Exceptions array.
in Offset[i] store the first index in the imagined non-holey array that would have skipped i elements in the original array.
Now to pick a random element. Select a random number, r, in 0..M the number of remaining elements.
Find i such that Offset[i] <= r < Offest[i+i] this is easy with a binary search
Return r + i
Now, that is just a sketch you will need to deal with the ends of the arrays and if things are indexed form 0 or 1 and all that jazz. If you are clever you can actually compute the Offset array on the fly from the original, it is a bit less clear that way though.

Maybe its too late for answer, but I found this piece of code somewhere in my mind when trying to get random data from Database based on random ID excluding some number.
$excludedData = array(); // This is your excluded number
$maxVal = $this->db->count_all_results("game_pertanyaan"); // Get the maximum number based on my database
$randomNum = rand(1, $maxVal); // Make first initiation, I think you can put this directly in the while > in_array paramater, seems working as well, it's up to you
while (in_array($randomNum, $excludedData)) {
$randomNum = rand(1, $maxVal);
}
$randomNum; //Your random number excluding some number you choose

This is the fastest & best performance way to do it :
$all = range($Min,$Max);
$diff = array_diff($all,$Exclude);
shuffle($diff );
$data = array_slice($diff,0,$quantity);

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Cosine similarity result above one - php

Due to the vagaries of floating point arithmetic, you could have calculations which, when represented in the binary form that computers use, are not exact. Probably you can just round down. Likewise for numbers slightly less than zero.

Related

PHP: How to raise number to (tiny) fractional exponent?

How to get log() of a very big number (PHP)?

PHP how to show all decimal places?

Calculate average without being thrown by strays

How to get a random value from 1~N but excluding several specific values in PHP?

Categories

Resources