How to optimise an Exponential Moving Average algorithm in PHP? - php

I'm trying to retrieve the last EMA of a large dataset (15000+ values). It is a very resource-hungry algorithm, since each value depends on the previous one. Here is my code:
$k = 2/($range+1);
for ($i = 0; $i < $size_data; ++$i) {
$lastEMA = $lastEMA + $k * ($data[$i]-$lastEMA);
}
What I already did:
Isolate $k so it is not computed 10000+ times
Keep only the latest computed EMA, and not keep all of them in an array
Use for() instead of foreach()
The $data[] array doesn't have keys; it's a basic array
This allowed me to reduce execution time from 2000ms to about 500ms for 15000 values!
What didn't work:
Use SplFixedArray(); this shaved only ~10ms when executing 1,000,000 values
Use PHP_Trader extension, this returns an array containing all the EMAs instead of just the latest, and it's slower
Writing the same algorithm in C# and running it over 2,000,000 values takes only 13ms! So obviously, using a compiled, lower-level language seems to help ;P
Where should I go from here? The code will ultimately run on Ubuntu, so which language should I choose? Will PHP be able to call and pass such a huge argument to the script?

Clearly, implementing this with an extension gives you a significant boost.
Additionally, the calculation itself can be improved, and that gain carries over to whichever language you choose.
It is easy to see that lastEMA can be calculated as follows:
$lastEMA = 0;
$k = 2/($range+1);
for ($i = 0; $i < $size_data; ++$i) {
$lastEMA = (1-$k) * $lastEMA + $k * $data[$i];
}
This can be rewritten as follows in order to move as much work as possible out of the loop:
$lastEMA = 0;
$k = 2/($range+1);
$k1m = 1 - $k;
for ($i = 0; $i < $size_data; ++$i) {
$lastEMA = $k1m * $lastEMA + $data[$i];
}
$lastEMA = $lastEMA * $k;
To explain why $k can be factored out: in the previous formulation it is as if every original raw value were multiplied by $k, so you can instead multiply the end result once; expanded, $lastEMA = $k * ($data[N-1] + (1-$k)*$data[N-2] + (1-$k)^2*$data[N-3] + ...).
Note that, rewritten this way, you have 2 operations inside the loop instead of 3 (to be precise, the loop also involves the $i increment, the $i comparison with $size_data and the $lastEMA assignment), so you can expect an additional speedup in the range between 16% and 33%.
Further there are other improvements that can be considered at least in some circumstances:
Consider only last values
The first values are multiplied many times by $k1m = 1 - $k, so their contribution may be tiny or even fall below the floating point precision (or below the acceptable error).
This idea is particularly helpful if you can assume that the older data are of the same order of magnitude as the newer, because if you consider only the last $n values the error you make is
$err = $EMA_of_discarded_data * (1-$k) ^ $n.
So if the orders of magnitude are broadly the same, the relative error is
$rel_err = $err / $lastEMA = $EMA_of_discarded_data * (1-$k) ^ $n / $lastEMA
which is almost equal to simply (1-$k) ^ $n.
Under the assumption that "$lastEMA almost equal to $EMA_of_discarded_data":
Let's say that you can accept a relative error $rel_err
you can safely consider only the last $n values where (1 - $k)^$n < $rel_err.
This means that you can pre-calculate (before the loop) $n = log($rel_err) / log(1-$k) and compute the EMA considering only the last $n values (a short sketch follows the example numbers below).
If the dataset is very big, this can give a noticeable speedup.
Consider that for 64-bit floating point numbers you have a relative precision (related to the mantissa) of 2^-53 (about 1.1e-16; for 32-bit floating point numbers it is only 2^-24 = 5.96e-8), so you cannot obtain a relative error better than that.
So basically you should never have an advantage in calculating more than $n = log(1.1e-16) / log(1-$k) values.
To give an example, if $range = 2000 then $n = log(1.1e-16) / log(1-2/2001) = 36'746.
It is interesting to know that any extra calculations would simply get lost in the roundings, so they are useless and better left out.
Now an example for the case where you can accept a relative error larger than the floating point precision: with $rel_err = 1ppm = 1e-6 = 0.0001% = 6 significant decimal digits, you have $n = log(1e-6) / log(1-2/2001) = 13'815.
I think that is quite a small number compared to your sample counts, so in those cases the speedup could be evident (I'm assuming that $range = 2000 is meaningful or high for your application, but that I cannot know).
Just a few other numbers, because I do not know what your typical figures are:
$rel_err = 1e-3; $range = 2000 => $n = 6'907
$rel_err = 1e-3; $range = 200 => $n = 691
$rel_err = 1e-3; $range = 20 => $n = 69
$rel_err = 1e-6; $range = 2000 => $n = 13'815
$rel_err = 1e-6; $range = 200 => $n = 1'381
$rel_err = 1e-6; $range = 20 => $n = 138
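To make that recipe concrete, here is a minimal sketch of the truncated loop (the variable names follow the question's code; $rel_err is whatever relative error you accept):
$rel_err = 1e-6;                                  // accepted relative error
$k     = 2 / ($range + 1);
$k1m   = 1 - $k;
$n     = (int) ceil(log($rel_err) / log($k1m));   // about 13'815 for $range = 2000, matching the numbers above
$start = max(0, $size_data - $n);                 // everything older than that is ignored
$lastEMA = 0;
for ($i = $start; $i < $size_data; ++$i) {
    $lastEMA = $k1m * $lastEMA + $data[$i];
}
$lastEMA *= $k;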
If the assumption "$lastEMA almost equal to $EMA_of_discarded_data" cannot be made, things are less easy, but since the advantage can be significant it can still be worth going on:
we need to reconsider the full formula: $rel_err = $EMA_of_discarded_data * (1-$k) ^ $n / $lastEMA
so $n = log($rel_err * $lastEMA / $EMA_of_discarded_data) / log (1-$k) = (log($rel_err) + log($lastEMA / $EMA_of_discarded_data)) / log (1-$k)
the central point is to estimate $lastEMA / $EMA_of_discarded_data (without actually calculating $lastEMA nor $EMA_of_discarded_data, of course)
one case is when we know a priori that, for example, $EMA_of_discarded_data / $lastEMA < M (for example M = 1000 or M = 1e6)
in that case it is enough to take $n = log($rel_err / M) / log(1-$k)
if you cannot give any such bound M
you have to find a good way to over-estimate $EMA_of_discarded_data / $lastEMA
one quick way could be to take M = max(data) / min(data), as sketched below
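A sketch of that over-estimation, continuing the truncated loop from the previous sketch (it assumes the data are strictly positive, otherwise max/min is not a meaningful bound):
$M = max($data) / min($data);                      // crude bound for $EMA_of_discarded_data / $lastEMA
$n = (int) ceil(log($rel_err / $M) / log(1 - $k)); // enough samples for the accepted relative error
$start = max(0, $size_data - $n);                  // then run the usual loop from $start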
Parallelization
The calculation can be re-written in a form where it is a simple addition of independent terms:
$lastEMA = 0;
$k = 2/($range+1);
$k1m = 1 - $k;
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA += pow($k1m, $size_data - 1 - $i) * $data[$i];
}
$lastEMA = $lastEMA * $k;
So, if the implementing language supports parallelization, the dataset can be divided into 4 (or 8, or n... basically the number of CPU cores available) chunks, the sum of terms on each chunk can be computed in parallel, and the individual results can be summed up at the end.
I will not go into more detail on this since this reply is already terribly long, but I think the concept is clear.
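For illustration, here is that decomposition written sequentially in plain PHP (real parallelism would need something like the pcntl or parallel extensions; the point is only that each chunk's partial sum is independent of the others):
$k   = 2 / ($range + 1);
$k1m = 1 - $k;
$chunks = array_chunk($data, (int) ceil($size_data / 4));  // e.g. one chunk per core
$lastEMA   = 0;
$remaining = $size_data;              // number of samples that come after the current chunk
foreach ($chunks as $chunk) {
    $remaining -= count($chunk);
    $partial = 0;
    foreach ($chunk as $v) {          // same recurrence as before, but local to the chunk
        $partial = $k1m * $partial + $v;
    }
    // age the whole chunk by the number of newer samples, then accumulate
    $lastEMA += pow($k1m, $remaining) * $partial;
}
$lastEMA *= $k;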

Building your own extension definitely improves performance. Here's a good tutorial from the Zend website.
Some performance figures. Hardware: Ubuntu 14.04, PHP 5.5.9, 1-core Intel CPU @ 3.3GHz, 128MB RAM (it's a VPS).
Before (PHP only, 16,000 values): 500ms
C extension, 16,000 values: 0.3ms
C extension, 100,000 values: 3.7ms
C extension, 500,000 values: 28.0ms
But I'm memory limited at this point, using 70MB. I will fix that and update the numbers accordingly.

Related

Why is my Python code 100 times slower than the same code in PHP?

I have two points (x1 and x2) and want to generate a normal distribution in a given step count. The sum of the y values for the x values between x1 and x2 is 1. Now to the actual problem:
I'm fairly new to Python and wonder why the following code produces the desired result, but about 100x slower than the same program in PHP. There are about 2000 x1-x2 pairs and about 5 step values per pair.
I tried compiling with Cython and used multiprocessing, but that only improved things 2x, which is still 50x slower than PHP. Any suggestions on how to improve speed to at least match PHP's performance?
from scipy.stats import norm
import numpy as np
import time
# Calculates normal distribution
def calculate_dist(x1, x2, steps, slope):
    points = []
    range = np.linspace(x1, x2, steps+2)
    for x in range:
        y = norm.pdf(x, x1+((x2-x1)/2), slope)
        points.append([x, y])
    sum = np.array(points).sum(axis=0)[1]
    norm_points = []
    for point in points:
        norm_points.append([point[0], point[1]/sum])
    return norm_points

start = time.time()
for i in range(0, 2000):
    for j in range(10, 15):
        calculate_dist(0, 1, j, 0.15)
print(time.time() - start) # Around 15 seconds or so
Edit, PHP Code:
$start = microtime(true);
for ($i = 0; $i<2000; $i++) {
for ($j = 10; $j<15; $j++) {
$x1 = 0; $x2 = 1; $steps = $j; $slope = 0.15;
$step = abs($x2-$x1) / ($steps + 1);
$points = [];
for ($x = $x1; $x <= $x2 + 0.000001; $x += $step) {
$y = stats_dens_normal($x, $x1 + (($x2 - $x1) / 2), $slope);
$points[] = [$x, $y];
}
$sum = 0;
foreach ($points as $point) {
$sum += $point[1];
}
$norm_points = [];
foreach ($points as &$point) {
array_push($norm_points, [$point[0], $point[1] / $sum]);
}
}
}
return microtime(true) - $start; # Around 0.1 seconds or so
Edit 2: I profiled each line and found that norm.pdf() was taking 98% of the time, so I found a custom normpdf function and defined it; now the time is around 0.67s, which is considerably faster but still around 10x slower than PHP. Also, I think redefining common functions goes against the idea of Python's simplicity?!
The custom function (source is some other Stackoverflow answer):
from math import sqrt, pi, exp
def normpdf(x, mu, sigma):
    u = (x-mu)/abs(sigma)
    y = (1/(sqrt(2*pi)*abs(sigma)))*exp(-u*u/2)
    return y
The answer is: you aren't using the right tools/data structures for the task in Python.
Calling numpy functionality has quite an overhead (scipy.stats.norm.pdf uses numpy under the hood), so one would never call these functions for a single element but for the whole array (so-called vectorized computation). That means instead of
for x in range:
    y = norm.pdf(x, x1+((x2-x1)/2), slope)
    ys.append(y)
one would rather use:
ys = norm.pdf(x,x1+((x2-x1)/2), slope)
calculating pdf for all elements in x and paying the overhead only once rather than len(x) times.
For example to calculate pdf for 10^4 elements takes less than 10 times more time than for one element:
%timeit norm.pdf(0) # 68.4 µs ± 1.62 µs
%timeit norm.pdf(np.zeros(10**4)) # 415 µs ± 12.4 µs
Using vectorized computation will not only make your program faster but often also shorter/easier to understand, for example:
def calculate_dist_vec(x1, x2, steps, slope):
    x = np.linspace(x1, x2, steps+2)
    y = norm.pdf(x, x1+((x2-x1)/2), slope)
    ys = y/np.sum(y)
    return x,ys
Using this vectorized version gives you a speed-up of around 10x.
The problem: norm.pdf is optimized for long vectors (nobody really cares how fast/slow it is for 10 elements if it is very fast for one million elements), but your test is biased against numpy, because it uses/creates only short arrays and thus norm.pdf cannot shine.
So if it is really about small arrays and you are serious about speeding it up, you will have to roll your own version of norm.pdf. Using Cython to create this fast and specialized function might be worth a try.

Store a playing cards deck in MySQL (single column)

I'm doing a game with playing cards and have to store shuffled decks in MySQL.
What is the most efficient way to store a deck of 52 cards in a single column? And save/retrieve those using PHP.
I need 6 bits to represent a number from 0 to 52 and thus thought of saving the deck as binary data but I've tried using PHP's pack function without much luck. My best shot is saving a string of 104 characters (52 zero-padded integers) but that's far from optimal.
Thanks.
I do agree that it's not necessary or rather impractical to do any of this, but for fun, why not ;)
From your question I'm not sure if you aim to have all cards encoded as one value and stored accordingly or whether you want to encode cards individually. So I assume the former.
Further, I assume you have a set of 52 cards (or items) that you represent with an integer value between 1 and 52.
I see some approaches, outlined as follows, not all actually for the better of using less space, but included for the sake of being complete:
using a comma separated list (CSV), total length of 9 + 2*43 + 51 = 146 characters
turning each card into a character, i.e. a card is represented with 8 bits, total length of 52 characters
encoding each card with those necessary 6 bits and concatenating just the bits, without the otherwise lost 2 bits (as in the 2nd approach), total length of 39 characters
treat the card ids as coefficients in a polynomial of the form p(cards) = cards(1)*52^51 + cards(2)*52^50 + cards(3)*52^49 + ... + cards(52)*52^0, which we use to identify the card set. Roughly speaking p(cards) must lie in the value range [0, 52^52], which means the value can be represented with log(52^52)/log(2) = 296.42.. bits, or with a byte sequence of length 37.05, i.e. 38 bytes.
There naturally are further approaches, taking into account mere practical, technical or theoretical aspects, as is also visible through the listed approaches.
For the theoretical approaches (which I consider the most interesting) it is helpful to know a bit about information theory and entropy. Essentially, depending on what is known about a problem, no further information is required, respectively only the information to clarify all remaining uncertainty is needed.
As we are working with bits and bytes, it mostly is interesting for us in terms of memory usage which practically speaking is bit- or byte-based (if you consider only bit-sequences without the underlying technologies and hardware); that is, bits represent one of two states, ergo allow the differentiation of two states. This is trivial but important, actually :/
Then, if you want to represent N states in base B, you will need log(N) / log(B) digits in that base, or in your example log(52) / log(2) = 5.70.. -> 6 bits. You will notice that actually only 5.70.. bits would be required, which means that with 6 bits we actually have a loss.
This is where the problem transformation comes in: instead of representing 52 states individually, the card set as a whole can be represented. The polynomial approach is one way to do this. Essentially it assumes a base of 52, i.e. the card set is represented as 1'4'29'31.... or, mathematically speaking: 52 + 1 = 1'1 == 53 decimal, 1'1 + 1 = 1'2 == 54 decimal, 1'52'52 + 1 = 2'1'1 == 5461 decimal.
But if you further look at the polynomial-approach you will notice that there is a total of 52^52 possible values whereas we would only ever use 52! = 1*2*3*...*52 because once a card is fixed the remaining possibilities decrease, respectively the uncertainty or entropy decreases. (please note that 52! / 52^52 = 4.7257911e-22 ! which means the polynomial is a total waste of space).
If we now were to use a value in [1,52!] which is pretty much the theoretical minimum, we could represent the card set with log(52!) / log(2) = 225.581003124.. bits = 28.1976.. bytes. Problem with that is, that any of the values represented as such does not contain any structure from which we can derive its semantics, which means that for each of the 52! possible values (well 52! - 1, if you consider the principle of exclusion) we need a reference of its meaning, ie a lookup table of 52! values and that would certainly be a memory overkill.
Although we can make a compromise with the knowledge of the decreasing entropy of an encoded ordered set. As an example: we sequentially encode each card with the minimum number of bits required at that point in the sequence. So assume N<=52 cards remain, then in each step a card can be represented in log(N)/log(2) bits, meaning that the number of required bits decreases, until for the last card, you don't need a bit in the first place. This would give about (please correct)..
20 * 6 bits + 16 * 5 bits + 8 * 4 bits + 4 * 3 bits + 2 * 2 bits + 1 bit = 249 bits = 31.125.. bytes
Still, there would be a small loss because of the partially used bits, but the structure in the data largely makes up for that.
So a question might be: hey, can we combine the polynomial with this??!?11?! Actually, I have to think about that, I'm getting tired.
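For what it's worth, one way that combination could look is the factorial number system (sometimes called a Lehmer code): one big integer in [0, 52!-1], i.e. essentially the theoretical minimum of about 226 bits, but with enough structure that no lookup table is needed. A rough sketch, assuming the GMP extension is available (the names encodeDeck/decodeDeck are made up for illustration):
function encodeDeck(array $cards)    // $cards is a permutation of 1..52
{
    $remaining = range(1, 52);
    $code = gmp_init(0);
    foreach ($cards as $card) {
        $pos = array_search($card, $remaining);          // position among the cards still unused
        $code = gmp_add(gmp_mul($code, count($remaining)), $pos);
        array_splice($remaining, $pos, 1);               // one card fixed, the entropy decreases
    }
    return $code;                                        // store e.g. gmp_export($code), at most 29 bytes
}
function decodeDeck($code)
{
    // recover the position digits, least significant radix (1, 2, 3, ...) first
    $positions = array();
    for ($n = 1; $n <= 52; $n++) {
        $positions[] = gmp_intval(gmp_mod($code, $n));
        $code = gmp_div_q($code, $n);
    }
    $positions = array_reverse($positions);              // back to deal order
    $remaining = range(1, 52);
    $cards = array();
    foreach ($positions as $pos) {
        $cards[] = $remaining[$pos];
        array_splice($remaining, $pos, 1);
    }
    return $cards;
}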
Generally speaking, knowing about the structure of a problem drastically helps in decreasing the necessary memory space. Practically speaking, in this day and age, for your average high-level developer such low-level considerations are not so important (hey, 100 kByte of wasted space, so what!) and other considerations are weighted higher; also because the underlying technologies often reduce memory usage by themselves, be it your filesystem or gzip-ed web-server responses, etc. The general knowledge of these kinds of things though is still helpful in creating your services and data structures.
But these latter approaches are very problem-specific "compression procedures"; general compression works differently. As a sketch of one example approach, the procedure sequentially runs through the bytes of the data, adds any unseen bit sequence to a lookup table, and represents the actual sequence with an index into that table.
Well enough of funny talk, let's get technical!
1st approach "csv"
// your unpacked card set
$cards = range(1,52);
$coded = implode(',',$cards);
$decoded = explode(',',$coded);
2nd approach: 1 card = 1 character
// just a helper
// (not really necessary, but using this we can pretty print the resulting string)
function chr2($i)
{
return chr($i + 32);
}
function myPack($cards)
{
$ar = array_map('chr2',$cards);
return implode('',$ar);
}
function myUnpack($str)
{
$set = array();
$len = strlen($str);
for($i=0; $i<$len; $i++)
$set[] = ord($str[$i]) - 32; // adjust this shift along with the helper
return $set;
}
$str = myPack($cards);
$other_cards = myUnpack($str);
3rd approach, 1 card = 6 bits
$set = ''; // target string
$offset = 0;
$carry = 0;
for($i=0; $i < 52; $i++)
{
$c = $cards[$i];
switch($offset)
{
case 0:
$carry = ($c << 2);
$next = null;
break;
case 2:
$next = $carry + $c;
$carry = 0;
break;
case 4:
$next = $carry + ($c>>2);
$carry = ($c << 6) & 0xff;
break;
case 6:
$next = $carry + ($c>>4);
$carry = ($c << 4) & 0xff;
break;
}
if ($next !== null)
{
$set .= chr($next);
}
$offset = ($offset + 6) % 8;
}
// and $set it is!
$new_cards = array(); // the target array for cards to be unpacked
$offset = 0;
$carry = 0;
for($i=0; $i < 39; $i++)
{
$o = ord(substr($set,$i,1));
$new = array();
switch($offset)
{
case 0:
$new[] = ($o >> 2) & 0x3f;
$carry = ($o << 4) & 0x3f;
break;
case 4:
$new[] = (($o >> 6) & 3) + $carry;
$new[] = $o & 0x3f;
$carry = 0;
$offset += 6;
break;
case 6:
$new[] = (($o >> 4) & 0xf) + $carry;
$carry = ($o & 0xf) << 2;
break;
}
$new_cards = array_merge($new_cards,$new);
$offset = ($offset + 6) % 8;
}
4th approach, the polynomial, just outlined (please consider using bigints because of the integer overflow)
$encoded = 0;
$base = 52;
foreach($cards as $c)
{
    // use digits 0..51 (hence the -1): a card valued 52 would otherwise collide with 0 when decoding
    $encoded = $encoded*$base + ($c - 1);
}
// and now save the binary representation
$decoded = array();
for($i=0; $i < 52; $i++)
{
    $v = $encoded % $base;
    $encoded = ($encoded - $v) / $base;
    array_unshift($decoded, $v + 1); // prepend, because the digits come out last card first
}
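For completeness, here is how the same polynomial idea might look with the GMP extension (assumed to be installed) so the 52^52-sized value does not overflow a native integer; this is only a sketch:
$base = 52;
// encode (digits 0..51, hence the -1/+1 shifts)
$encoded = gmp_init(0);
foreach ($cards as $c) {
    $encoded = gmp_add(gmp_mul($encoded, $base), $c - 1);
}
$blob = gmp_export($encoded);        // binary string, at most 38 bytes, e.g. for a BINARY(38) column
// decode
$encoded = gmp_import($blob);
$decoded = array();
for ($i = 0; $i < 52; $i++) {
    $decoded[] = gmp_intval(gmp_mod($encoded, $base)) + 1;
    $encoded = gmp_div_q($encoded, $base);
}
$decoded = array_reverse($decoded);  // digits come out least significant (last card) first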

PHP algorithm to solve a system of linear equations of grade 1

I have a system of equations of grade 1 (linear equations) to solve in PHP.
There may be more equations than variables, but never fewer equations than variables.
The system would look like below: n equations, m variables, variables are x[i] where 'i' takes values from 1 to m. The system may have a solution or not.
m may be at most 100 and n at most ~5000 (thousands).
I will have to solve a few thousand of these systems of equations. Speed may be a problem, but I'm looking for an algorithm written in PHP for now.
a[1][1] * x[1] + a[1][2] * x[2] + ... + a[1][m] * x[m] = number 1
a[2][1] * x[1] + a[2][2] * x[2] + ... + a[2][m] * x[m] = number 2
...
a[n][1] * x[1] + a[n][2] * x[2] + ... + a[n][m] * x[m] = number n
There is Cramer's Rule, which may do it. I could make a square matrix of coefficients, solve the system with Cramer's Rule (by calculating matrix determinants) and then check the values in the unused equations.
I believe I could implement Cramer's Rule by myself, but I'm looking for a better solution.
This is a problem of Computational Science,
http://en.wikipedia.org/wiki/Computational_science#Numerical_simulations
I know there are some complex algorithms that solve my problem, but I can't tell which one would do it and which is best for my case. A concrete algorithm would serve me better than just the theory with a demonstration.
My question is: does anybody know a class, script, or code of some sort written in PHP to solve a system of linear equations of grade 1?
Alternatively I could try an API or a Web Service, preferably a free one, though a paid one would do too.
Thank you
I needed exactly this, but I couldn't find a determinant function, so I made one myself. And the Cramer's Rule function too. Maybe it'll help someone.
/**
* $matrix must be 2-dimensional n x n array in following format
* $matrix = array(array(1,2,3),array(1,2,3),array(1,2,3))
*/
function determinant($matrix = array()) {
// dimension control - n x n
foreach ($matrix as $row) {
if (sizeof($matrix) != sizeof($row)) {
return false;
}
}
// count 1x1 and 2x2 manually - rest by recursive function
$dimension = sizeof($matrix);
if ($dimension == 1) {
return $matrix[0][0];
}
if ($dimension == 2) {
return ($matrix[0][0] * $matrix[1][1] - $matrix[0][1] * $matrix[1][0]);
}
// cycles for submatrixes calculations
$sum = 0;
for ($i = 0; $i < $dimension; $i++) {
// for each "$i", you will create a smaller matrix based on the original matrix
// by removing the first row and the "i"th column.
$smallMatrix = array();
for ($j = 0; $j < $dimension - 1; $j++) {
$smallMatrix[$j] = array();
for ($k = 0; $k < $dimension; $k++) {
if ($k < $i) $smallMatrix[$j][$k] = $matrix[$j + 1][$k];
if ($k > $i) $smallMatrix[$j][$k - 1] = $matrix[$j + 1][$k];
}
}
// after creating the smaller matrix, multiply the "i"th element in the first
// row by the determinant of the smaller matrix.
// the cofactor sign alternates: odd positions (1-based) are plus, even are minus;
// since $i is 0-based here, the test below is inverted
if ($i % 2 == 0){
$sum += $matrix[0][$i] * determinant($smallMatrix);
} else {
$sum -= $matrix[0][$i] * determinant($smallMatrix);
}
}
return $sum;
}
/**
* left side of equations - parameters:
* $leftMatrix must be 2-dimensional n x n array in following format
* $leftMatrix = array(array(1,2,3),array(1,2,3),array(1,2,3))
* right side of equations - results:
* $rightMatrix must be in format
* $rightMatrix = array(1,2,3);
*/
function equationSystem($leftMatrix = array(), $rightMatrix = array()) {
// matrixes and dimension check
if (!is_array($leftMatrix) || !is_array($rightMatrix)) {
return false;
}
if (sizeof($leftMatrix) != sizeof($rightMatrix)) {
return false;
}
$M = determinant($leftMatrix);
if (!$M) {
return false;
}
$x = array();
foreach ($rightMatrix as $rk => $rv) {
$xMatrix = $leftMatrix;
foreach ($rightMatrix as $rMk => $rMv) {
$xMatrix[$rMk][$rk] = $rMv;
}
$x[$rk] = determinant($xMatrix) / $M;
}
return $x;
}
Wikipedia should have pseudocode for reducing the matrix representing your equations to reduced row echelon form. Once the matrix is in that form, you can walk through the rows to find a solution.
There's an unmaintained PEAR package which may save you the effort of writing the code.
Another question is whether you are looking mostly at "wide" systems (more variables than equations, which usually have many possible solutions) or "narrow" systems (more equations than variables, which usually have no solutions), since the best strategy depends on which case you are in — and narrow systems may benefit from using a linear regression technique such as least squares instead.
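To make the row-reduction idea above concrete, here is a minimal, unoptimised sketch (the function name and the $eps tolerance are made up for illustration); it row-reduces the augmented matrix [A | b] with partial pivoting and returns either a solution array or false if the system is inconsistent:
function solveByElimination(array $A, array $b, $eps = 1e-9)
{
    $n = count($A);        // number of equations
    $m = count($A[0]);     // number of variables
    for ($i = 0; $i < $n; $i++) {
        $A[$i][] = $b[$i]; // build the augmented matrix [A | b]
    }
    $pivotCols = array();
    $row = 0;
    for ($col = 0; $col < $m && $row < $n; $col++) {
        // partial pivoting: pick the row with the largest entry in this column
        $best = $row;
        for ($r = $row + 1; $r < $n; $r++) {
            if (abs($A[$r][$col]) > abs($A[$best][$col])) {
                $best = $r;
            }
        }
        if (abs($A[$best][$col]) < $eps) {
            continue;      // no usable pivot in this column
        }
        $tmp = $A[$row]; $A[$row] = $A[$best]; $A[$best] = $tmp;
        // normalise the pivot row, then eliminate this column from every other row
        $p = $A[$row][$col];
        for ($c = $col; $c <= $m; $c++) {
            $A[$row][$c] /= $p;
        }
        for ($r = 0; $r < $n; $r++) {
            if ($r == $row || abs($A[$r][$col]) < $eps) {
                continue;
            }
            $f = $A[$r][$col];
            for ($c = $col; $c <= $m; $c++) {
                $A[$r][$c] -= $f * $A[$row][$c];
            }
        }
        $pivotCols[$row] = $col;
        $row++;
    }
    // every remaining row must now read 0 = 0, otherwise there is no solution
    for ($r = $row; $r < $n; $r++) {
        if (abs($A[$r][$m]) > $eps) {
            return false;
        }
    }
    $x = array_fill(0, $m, 0.0);
    foreach ($pivotCols as $r => $c) {
        $x[$c] = $A[$r][$m];   // free variables, if any, are left at 0
    }
    return $x;
}
// e.g. solveByElimination(array(array(1, 1), array(1, -1), array(2, 0)), array(3, 1, 4)) returns array(2, 1)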
This package uses Gaussian Elimination. I found that it executes fast for larger matrices (i.e. more variables/equations).
There is a truly excellent package based on JAMA here: http://www.phpmath.com/build02/JAMA/docs/index.php
I've used it for everything from simple linear regression right the way through to highly complex Multiple Linear Regression (writing my own Backwards Stepwise MLR functions on top of that). Very comprehensive, and it will hopefully do what you need.
Speed could be considered an issue, for sure. But it works a treat and matched SPSS when I cross-referenced results on the BSMLR calculations.

Calculating Floating Point Powers (PHP/BCMath)

I'm writing a wrapper for the bcmath extension, and bug #10116 regarding bcpow() is particularly annoying -- it casts the $right_operand ($exp) to an (native PHP, not arbitrary length) integer, so when you try to calculate the square root (or any other root higher than 1) of a number you always end up with 1 instead of the correct result.
I started searching for algorithms that would allow me to calculate the nth root of a number and I found this answer, which looks pretty solid. I actually expanded the formula using WolframAlpha and was able to improve its speed by about 5% while keeping the accuracy of the results.
Here is a pure PHP implementation mimicking my BCMath implementation and its limitations:
function _pow($n, $exp)
{
$result = pow($n, intval($exp)); // bcmath casts $exp to (int)
if (fmod($exp, 1) > 0) // does $exp have a fractional part higher than 0?
{
$exp = 1 / fmod($exp, 1); // convert the modulo into a root (2.5 -> 1 / 0.5 = 2)
$x = 1;
$y = (($n * _pow($x, 1 - $exp)) / $exp) - ($x / $exp) + $x;
do
{
$x = $y;
$y = (($n * _pow($x, 1 - $exp)) / $exp) - ($x / $exp) + $x;
} while ($x > $y);
return $result * $x; // 4^2.5 = 4^2 * 4^0.5 = 16 * 2 = 32
}
return $result;
}
The above seems to work great except when 1 / fmod($exp, 1) doesn't yield an integer. For example, if $exp is 0.123456, its inverse will be 8.10005 and the outcomes of pow() and _pow() will be a bit different (demo):
pow(2, 0.123456) = 1.0893412745953
_pow(2, 0.123456) = 1.0905077326653
_pow(2, 1 / 8) = _pow(2, 0.125) = 1.0905077326653
How can I achieve the same level of accuracy using "manual" exponential calculations?
The employed algorithm to find the nth root of a (positive) number a is the Newton algorithm for finding the zero of
f(x) = x^n - a.
That involves only powers with natural numbers as exponents, hence is straightforward to implement.
Calculating a power with an exponent 0 < y < 1 where y is not of the form 1/n with an integer n is more complicated. Doing the analogue, solving
x^(1/y) - a == 0
would again involve calculating a power with non-integral exponent, the very problem we're trying to solve.
If y = n/d is rational with small denominator d, the problem is easily solved by calculating
x^(n/d) = (x^n)^(1/d),
but for most rational 0 < y < 1, numerator and denominator are rather large, and the intermediate x^n would be huge, so the computation would use a lot of memory and take a (relatively) long time.
(For the example exponent of 0.123456 = 1929/15625, it's not too bad, but 0.1234567 would be rather taxing.)
One way to calculate the power for general rational 0 < y < 1 is to write
y = 1/a ± 1/b ± 1/c ± ... ± 1/q
with integers a < b < c < ... < q and to multiply/divide the individual x^(1/k). (Every rational 0 < y < 1 has such representations, and the shortest such representations generally don't involve many terms, e.g.
1929/15625 = 1/8 - 1/648 - 1/1265625;
using only additions in the decomposition leads to longer representations with larger denominators, e.g.
1929/15625 = 1/9 + 1/82 + 1/6678 + 1/46501020 + 1/2210396922562500,
so that would involve more work.)
Some improvement is possible by mixing the approaches, first find a close rational approximation to y with small denominator via the continued fraction expansion of y - for the example exponent 1929/15625 = [0;8,9,1,192] and using the first four partial quotients yields the approximation 10/81 = 0.123456790123... [note that 10/81 = 1/8 - 1/648, the partial sums of the shortest decomposition into pure fractions are convergents] - and then decompose the remainder into pure fractions.
However, in general that approach leads to calculating nth roots for large n, which also is slow and memory-intensive if the desired accuracy of the final result is high.
All in all, it is probably simpler and faster to implement exp and log and use
x^y = exp(y*log(x))
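For illustration, here is a rough BCMath sketch of that route (the helper names bcexp/bcln/bcpowf are made up; a fixed number of terms and iterations is used and no range reduction is done, so this is only reasonable for moderate arguments, roughly |x| < 10 for bcexp and x > 0 for bcln; real code would range-reduce first):
function bcexp($x, $scale = 40)
{
    $work = $scale + 10;                       // guard digits
    $sum  = '1';
    $term = '1';
    for ($n = 1; $n <= 200; $n++) {            // e^x = sum of x^n / n!
        $term = bcdiv(bcmul($term, $x, $work), (string) $n, $work);
        $sum  = bcadd($sum, $term, $work);
    }
    return bcadd($sum, '0', $scale);           // truncate to the requested scale
}
function bcln($x, $scale = 40)
{
    $work = $scale + 10;
    $y = sprintf('%.15F', log((float) $x));    // float seed, ~15 correct digits
    for ($i = 0; $i < 6; $i++) {               // each Newton step roughly doubles the correct digits
        $ey = bcexp($y, $work);
        $y  = bcsub(bcadd($y, bcdiv($x, $ey, $work), $work), '1', $work);
    }
    return bcadd($y, '0', $scale);
}
function bcpowf($x, $y, $scale = 40)
{
    // x^y = exp(y * ln(x)) for x > 0
    return bcexp(bcmul($y, bcln($x, $scale + 10), $scale + 10), $scale);
}
// e.g. bcpowf('2', '0.123456', 30) should agree with pow(2, 0.123456) = 1.0893412745953...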

Calculate average without being thrown by strays

I am trying to calculate an average without being thrown off by a small set of far-off numbers (i.e., 1,2,1,2,3,4,50); the single 50 will throw off the entire average.
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that major range it's 20.6 (a big difference from any of the numbers above).
I am thinking that you can get this like so:
c+d-r
Where c is the count of the numbers, d is the number of distinct values, and r is the range. Then you can apply this to all the possible ranges, and the highest score is the optimal range to get an average from.
For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6
If you applied this to the entire number list it would be 8+6-41=-27
I think this works pretty well, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or if someone has a better algorithm all together?
You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
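For example, a quick sketch of that filter (the 1-standard-deviation cutoff and the function name are arbitrary; tune the cutoff to your data):
function average_without_outliers(array $values, $cutoff = 1.0)
{
    $n = count($values);
    $mean = array_sum($values) / $n;
    $variance = 0.0;
    foreach ($values as $v) {
        $variance += ($v - $mean) * ($v - $mean);
    }
    $stddev = sqrt($variance / $n);
    // keep only the values within $cutoff standard deviations of the mean
    $kept = array_filter($values, function ($v) use ($mean, $stddev, $cutoff) {
        return abs($v - $mean) <= $cutoff * $stddev;
    });
    return $kept ? array_sum($kept) / count($kept) : $mean;
}
echo average_without_outliers(array(19,20,21,21,22,30,60,60)); // drops the two 60s: about 22.17 instead of 31.6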
Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number-summary often used to figure these things out.
function get_median($arr) {
sort($arr);
$c = count($arr) - 1;
if ($c%2) {
$b = round($c/2);
$a = $b-1;
return ($arr[$b] + $arr[$a]) / 2 ;
} else {
return $arr[($c/2)];
}
}
function get_five_number_summary($arr) {
sort($arr);
$c = count($arr) - 1;
$fns = array();
if ($c%2) {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a-1);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c]); // last element = maximum
return $fns;
}
else {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c]); // last element = maximum
return $fns;
}
}
function find_outliers($arr) {
$fns = get_five_number_summary($arr);
$interquartile_range = $fns[3] - $fns[1];
$low = $fns[1] - $interquartile_range;
$high = $fns[3] + $interquartile_range;
foreach ($arr as $v) {
if ($v > $high || $v < $low)
echo "$v is an outlier<br>";
}
}
//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
Why don't you use the median? It's not 30, it's 21.5.
You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.
You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why many statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.
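A tiny sketch of that sort-and-take-the-middle suggestion (the function name and the 5% default are arbitrary):
function trimmed_mean(array $values, $trim = 0.05)
{
    sort($values);
    $drop = (int) floor(count($values) * $trim);                // how many values to cut from each end
    $middle = array_slice($values, $drop, count($values) - 2 * $drop);
    return array_sum($middle) / count($middle);
}
// for very short lists you would pick a larger fraction, e.g.:
echo trimmed_mean(array(19,20,21,21,22,30,60,60), 0.25);        // drops 19,20 and 60,60 -> 23.5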
