I have a task that requires computing the slope and intercept of a linear regression between two sets of data. According to the following link, this is easy to do in R:
https://www.datacamp.com/community/tutorials/linear-regression-R
The code is simply:
model <- lm(sales ~ youtube, data = marketing)
However, I need to implement it in PHP. Is that possible?
This kind of operation is normally easy in R.
But since you have to do it in PHP this time, you can use the following function:
<?php
function linear_regression($x, $y) {
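// number of data points
$n = count($x);
// ensure both arrays of points are the same size
// (an exception is used because passing E_USER_ERROR to trigger_error() is deprecated as of PHP 8.4)
if ($n != count($y)) {
throw new InvalidArgumentException("linear_regression(): Numbers of elements in the coordinate arrays do not match.");
}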
// calculate sums
$x_sum = array_sum($x);
$y_sum = array_sum($y);
$xx_sum = 0;
$xy_sum = 0;
for($i = 0; $i < $n; $i++) {
$xy_sum+=($x[$i]*$y[$i]);
$xx_sum+=($x[$i]*$x[$i]);
}
// calculate slope
$m = (($n * $xy_sum) - ($x_sum * $y_sum)) / (($n * $xx_sum) - ($x_sum * $x_sum));
// calculate intercept
$b = ($y_sum - ($m * $x_sum)) / $n;
// return result
return array("m"=>$m, "b"=>$b);
}
?>
As an example, you can use the following code to get the slope and intercept of the two data sets:
$a=array();
$b=array();
array_push($a,$adata1);
array_push($a,$adata2);
array_push($a,$adata3);
array_push($a,$adata4);
array_push($b,$bdata1);
array_push($b,$bdata2);
array_push($b,$bdata3);
array_push($b,$bdata4);
$aa= linear_regression($a, $b);
$slope1= $aa["m"];
$intercept1= $aa["b"];
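As a quick sanity check of the formulas, here is a minimal self-contained sketch run on made-up data that lies exactly on y = 2x + 1. The name fit_line and the extra guard against identical x values are assumptions added for illustration; they are not part of the function above.

```php
<?php
// Hypothetical helper: same least-squares formulas as linear_regression(),
// plus a guard for the case where the slope is undefined.
function fit_line(array $x, array $y): array
{
    $n = count($x);
    if ($n !== count($y) || $n < 2) {
        throw new InvalidArgumentException("Need two arrays of equal length with at least 2 points.");
    }

    $xSum = array_sum($x);
    $ySum = array_sum($y);
    $xxSum = 0.0;
    $xySum = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $xySum += $x[$i] * $y[$i];
        $xxSum += $x[$i] * $x[$i];
    }

    // The denominator is zero when all x values are identical.
    $denominator = $n * $xxSum - $xSum * $xSum;
    if ($denominator == 0) {
        throw new InvalidArgumentException("All x values are identical; the slope is undefined.");
    }

    $m = ($n * $xySum - $xSum * $ySum) / $denominator; // slope
    $b = ($ySum - $m * $xSum) / $n;                    // intercept
    return ["m" => $m, "b" => $b];
}

// Made-up sample data on the line y = 2x + 1.
$fit = fit_line([1, 2, 3, 4], [3, 5, 7, 9]);
printf("slope=%.4f intercept=%.4f\n", $fit["m"], $fit["b"]);
```

Both arrays are assumed to be plain lists indexed from 0, as in the example with array_push above.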
After stumbling on this SO thread, I decided to write a similar test in PHP.
My test code is this:
// Slow version
$t1 = microtime(true);
for ($n = 0, $i = 0; $i < 20000000; $i++) {
$n += 2 * ($i * $i);
}
$t2 = microtime(true);
echo "n={$n}\n";
// Optimized version
$t3 = microtime(true);
for ($n = 0, $i = 0; $i < 20000000; $i++) {
$n += $i * $i;
}
$n *= 2;
$t4 = microtime(true);
echo "n={$n}\n";
$speedup = round(100 * (($t2 - $t1) - ($t4 - $t3)) / ($t2 - $t1), 0);
echo "speedup: {$speedup}%\n";
Results
In PHP, the 2 * ($i * $i) version runs about as fast as 2 * $i * $i,
so the PHP interpreter does not optimize bytecode the way the JVM does in Java.
Even when I optimized the code manually, I got only a ~8% speedup, whereas
the Java version gets a ~16% speedup. So the PHP version gains about half the speedup of the Java code.
Rationale for optimization
I will not go into much detail, but the ratio of multiplications in the optimized code to those in the unoptimized code is:
1 summation : 3/4
2 summations: 4/6
3 summations: 5/8
4 summations: 6/10
...
And in general:
\[ \frac{n + 2}{2n + 2} \]
where n is the number of summations in a loop. For this formula to be useful, we need its limit as n approaches infinity (to model doing A LOT of summations in a loop). So:
\[ \lim_{n \to \infty} \frac{n + 2}{2n + 2} = \frac{1}{2} \]
So we conclude that the optimized code needs 50% fewer multiplications.
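The ratios listed above (3/4, 4/6, 5/8, 6/10) follow the pattern (n + 2) / (2n + 2). That closed form is inferred from the listed values, and the function name below is made up for illustration. A small sketch shows how quickly the ratio approaches 1/2:

```php
<?php
// Ratio of multiplications (optimized / unoptimized) for n summations per loop pass.
// The closed form (n + 2) / (2n + 2) is inferred from the ratios listed in the text.
function multiplication_ratio(int $n): float
{
    return ($n + 2) / (2 * $n + 2);
}

// Print the ratio for a growing number of summations.
foreach ([1, 2, 3, 4, 10, 100, 10000] as $n) {
    printf("%5d summations: %.4f\n", $n, multiplication_ratio($n));
}
```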
Questions
Why doesn't the PHP interpreter apply this optimization?
Why is the PHP speedup only half of Java's?
It's time to analyze the opcodes generated by the PHP interpreter. For that you need to install the VLD extension and run it from the command line to dump the opcodes of the script at hand.
Opcode analysis
It seems that $i++ is not the same as ++$i in terms of opcodes and memory usage. The statement $i++; generates these opcodes:
POST_INC ~4 !1
FREE ~4
This increments the counter by 1 and saves the previous value into temporary slot #4. Then, because that value is never used, it is freed. Question: why store the value at all if it is never used?
It also seems that there is indeed a loop penalty, so we can gain additional performance by unrolling the loop.
Optimized test code
Replacing POST_INC with ASSIGN_ADD (which doesn't save additional info in memory) and unrolling the loop gives this test code:
while (true) {
// Slow version
$t1 = microtime(true);
for ($n = 0, $i = 0; $i < 2000; $i+=10) {
// loop unrolling
$n += 2 * (($i+0) * ($i+0));
$n += 2 * (($i+1) * ($i+1));
$n += 2 * (($i+2) * ($i+2));
$n += 2 * (($i+3) * ($i+3));
$n += 2 * (($i+4) * ($i+4));
$n += 2 * (($i+5) * ($i+5));
$n += 2 * (($i+6) * ($i+6));
$n += 2 * (($i+7) * ($i+7));
$n += 2 * (($i+8) * ($i+8));
$n += 2 * (($i+9) * ($i+9));
}
$t2 = microtime(true);
echo "{$n}\n";
// Optimized version
$t3 = microtime(true);
for ($n = 0, $i = 0; $i < 2000; $i+=10) {
// loop unrolling
$n += ($i+0) * ($i+0);
$n += ($i+1) * ($i+1);
$n += ($i+2) * ($i+2);
$n += ($i+3) * ($i+3);
$n += ($i+4) * ($i+4);
$n += ($i+5) * ($i+5);
$n += ($i+6) * ($i+6);
$n += ($i+7) * ($i+7);
$n += ($i+8) * ($i+8);
$n += ($i+9) * ($i+9);
}
$n *= 2;
$t4 = microtime(true);
echo "{$n}\n";
$speedup = round(100 * (($t2 - $t1) - ($t4 - $t3)) / ($t2 - $t1), 0);
$table[$speedup] = ($table[$speedup] ?? 0) + 1; // count this speedup value without undefined variable/key warnings
echo "****************\n";
foreach ($table as $s => $c) {
if ($s >= 0 && $s <= 20)
echo "$s,$c\n";
}
}
Results
The script counts how many times the CPU hit each speedup value.
When CPU hits vs. speedup is drawn as a graph, we get this picture:
[Graph: CPU hits vs. speedup]
So the script will most likely get a 10% speedup. This means that our optimizations added about 2 percentage points (compared to the original script's 8%).
Expectations
I'm pretty sure that everything I've done here could be done automatically by a PHP JIT'er. I don't think it's hard to automatically turn a POST_INC/FREE opcode pair into a single PRE_INC opcode when generating the binary executable. Nor would it take a miracle for a PHP JIT'er to apply loop unrolling. And this is just the start of the possible optimizations!
Hopefully there will be a JIT'er in PHP 8.0.
I tried to do something like that:
$total = 0;
for ($j = 0; $j < 1000; $j++) {
$x = $j / 1000;
$total += pow($x, 1500) * pow((1 - $x), 500);
}
$total is 0.
PHP can't work with float values this small. What can I do? Which libraries can I use?
The function
f(x) = x^1500 * (1-x)^500
has (logarithmic) derivative
f'(x)/f(x)=d/dx log(f(x))
= 1500/x - 500/(1-x)
which is zero for
x0 = 3/4
having the maximum value of
f(3/4) = 3^1500/2^4000 = exp(-1124.6702892376163)
= 10^(-488.4381005764309)
= 3.646694848749686e-489
Using that as reference value, one can now sum up
f(i/1000)/f(3/4)=exp(1500*log(i/1000)+500*log(1-i/1000)+1124.6702892376163)
giving a sum of 24.26257515625789 so that the desired result is
24.26257515625789*f(3/4)=8.847820783972776e-488
A practical way to compute such a sum is to first build the list of logarithms (the notation is closer to Python than PHP; look up the corresponding array operations)
logf = [ log(f(i/1000.0)) for i in range(1, 1000) ]
using the transformed logarithm of f, log(f(x))=1500*log(x)+500*log(1-x).
Then compute maxlogf = max(logf), extract the number N=floor(maxlogf/log(10)) of the decimal power and compute the sum as
sumfred = sum([ exp( logfx - N*log(10) ) for logfx in logf ])
so that the final result is sumfred*10^N.
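In PHP, the recipe above can be sketched as follows. The function name, parameter names and return shape are assumptions made for illustration. The result is returned as a mantissa and a decimal exponent, because the sum itself (about 8.85e-488 according to the calculation above) is smaller than the smallest representable double.

```php
<?php
// Sum f(i/steps) = (i/steps)^p * (1 - i/steps)^q for i = 1..steps-1 without underflow.
// Returns the sum as mantissa * 10^exponent.
function scaled_sum(int $steps = 1000, int $p = 1500, int $q = 500): array
{
    // Work with log(f(x)) = p*log(x) + q*log(1 - x) instead of f(x) itself.
    $logf = [];
    for ($i = 1; $i < $steps; $i++) {
        $x = $i / $steps;
        $logf[] = $p * log($x) + $q * log(1 - $x);
    }

    // Decimal exponent N of the largest term.
    $exponent = (int) floor(max($logf) / log(10));

    // Sum the terms rescaled by 10^(-N); negligible terms simply underflow to 0.
    $mantissa = 0.0;
    foreach ($logf as $value) {
        $mantissa += exp($value - $exponent * log(10));
    }

    return ["mantissa" => $mantissa, "exponent" => $exponent];
}

$result = scaled_sum();
printf("%.6fe%d\n", $result["mantissa"], $result["exponent"]);
```

The mantissa is not normalized to the range 1..10; it is only scaled so that the largest term is representable.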
I'm building a little app that analyzes historical eBay prices of sold items.
For some keywords/items the range is very wide because the search is too broad or simply wrong, polluted by items that are not properly related.
E.g.
a price search for iphone returns the phone itself, but
also chargers, accessories and unrelated items, which skew the price data...
so I get a range that goes from $5 for a charger to $500 for an
iphone.
So, given that I will try to improve the search on my side, I'm wondering whether there is a mathematical way to exclude the outliers.
say I have
$1200
$549
$399
$519
$9
$599
$549
$9
$499
$399
$519
$99
$5
$5
How do I get the price range to be $300-$600 instead of $10-$800 or so?
Below is the PHP I'm currently using; I'm not sure it is the best approach.
function remove_outliers($dataset, $magnitude = 1)
{
$count = count($dataset);
$mean = array_sum($dataset) / $count; // Calculate the mean
$deviation = sqrt(array_sum(array_map("sd_square", $dataset, array_fill(0, $count, $mean))) / $count) * $magnitude; // Calculate standard deviation and times by magnitude
return array_filter($dataset, function ($x) use ($mean, $deviation) {return ($x <= $mean + $deviation && $x >= $mean - $deviation);}); // Return filtered array of values that lie within $mean +- $deviation.
}
function sd_square($x, $mean)
{
return pow($x - $mean, 2);
}
function calculate_median($arr)
{
sort($arr);
$count = count($arr);
$middleval = floor(($count - 1) / 2);
if ($count % 2) {
$median = $arr[$middleval];
} else {
$low = $arr[$middleval];
$high = $arr[$middleval + 1];
$median = (($low + $high) / 2);
}
return $median;
}
$prices = remove_outliers($prices); //$prices is the array with all the prices stored
$trend = calculate_median($prices);
$trend = round(($trend));
$min = round(min($prices));
$max = round(max($prices));
I find this function useful. The $cleaness variable controls the granularity.
/**
* Returns an average value from a dirty list of numbers.
*
* #require
*
* $numbers = an array of numbers
* $cleaness = a percentage value
*
* #return integer
* an average value from a cleaner list.
*/
public function CleanAverage ( $numbers, $cleaness ) {
// A
$extremes_to_remove = floor(count($numbers)/100*$cleaness);
if ($extremes_to_remove < 2) {$extremes_to_remove = 2;}
// B
sort ($numbers) ;
// C
//remove $extremes from top
for ($i = 0; $i < ($extremes_to_remove/2); $i++) {
array_pop($numbers);
}
// D
// reverse order
rsort($numbers);
// E
// remove $extremes from the bottom (now at the end of the array)
for ( $i = 0; $i < ($extremes_to_remove/2); $i++ ) {
array_pop($numbers);
}
// F
// average
$average = array_sum($numbers)/count($numbers);
return $average;
}
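The same trimming idea can also be sketched as a standalone function that uses array_slice instead of repeated array_pop calls. The name trimmed_mean and the sample prices are made up for illustration. The rule of dropping at least one value from each end mirrors the minimum of 2 extremes in the method above.

```php
<?php
// Hypothetical standalone variant: drop a share of the smallest and largest
// values, then average what is left.
function trimmed_mean(array $numbers, float $percent): float
{
    sort($numbers);
    $count = count($numbers);

    // Number of values to drop from EACH end; at least one.
    $drop = max(1, (int) floor($count * $percent / 100 / 2));
    if (2 * $drop >= $count) {
        throw new InvalidArgumentException("Not enough values left after trimming.");
    }

    $kept = array_slice($numbers, $drop, $count - 2 * $drop);
    return array_sum($kept) / count($kept);
}

// Made-up prices: the $5 and $1200 entries are dropped before averaging.
echo trimmed_mean([5, 399, 499, 519, 549, 1200], 34), "\n";
```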
I was looking over some code that I had written to generate an A-Z navigation on a product page. It used a for loop over the ASCII codes 65 to 90 and PHP's chr() function. I wondered if there was a simpler and/or more efficient way of doing this, and I discovered that PHP's range() function supports alphabetical ranges.
After I wrote my test code to compare the different methods, a few questions came to mind:
Does PHP store a static array of the alphabet?
How can I profile more deeply to look below the PHP layer to see
what's happening?
I have a cachegrind of the PHP script that can be attached if necessary, in addition to environment config. For those who might want to know the machine specs in which it was executed, here are some links:
root#workbox:~$ lshw
http://pastebin.com/cZZRjJcR
root#workbox:~$ sysinfo
http://pastebin.com/ihQkkPAJ
<?php
/*
* determine which method out of 3 for returning
* an array of uppercase alphabetic characters
* has the highest performance
*
* +++++++++++++++++++++++++++++++++++++++++++++
*
* 1) Array $alpha = for($x = 65; $x < 91; $x++) { $upperChr[] = chr($x); }
* 2) Array $alpha = range(chr(65), chr(90));
* 3) Array $alpha = range('A', 'Z');
*
* +++++++++++++++++++++++++++++++++++++++++++++
*
* test runs with iterations:
*
* 10,000:
* - 1) upperChrElapsed: 0.453785s
* - 2) upperRangeChrElapsed: 0.069262s
* - 3) upperRangeAZElapsed: 0.046110s
*
* 100,000:
* - 1) upperChrElapsed: 0.729015s
* - 2) upperRangeChrElapsed: 0.078652s
* - 3) upperRangeAZElapsed: 0.052071s
*
* 1,000,000:
* - 1) upperChrElapsed: 50.942950s
* - 2) upperRangeChrElapsed: 10.091785s
* - 3) upperRangeAZElapsed: 8.073058s
*/
ini_set('max_execution_time', 0);
ini_set('memory_limit', -1); // -1 is the documented value for "no memory limit"
define('ITERATIONS', 1000000); // 1m loops x3
$upperChrStart = microtime(true);
for($i = 0; $i <= ITERATIONS; $i++) {
$upperChr = array();
for($x = 65; $x < 91; $x++) {
$upperChr[] = chr($x);
}
}
$upperChrElapsed = microtime(true) - $upperChrStart;
// +++++++++++++++++++++++++++++++++++++++++++++
$upperRangeChrStart = microtime(true);
for($i = 0; $i <= ITERATIONS; $i++) {
$upperRangeChr = range(chr(65), chr(90));
}
$upperRangeChrElapsed = microtime(true) - $upperRangeChrStart;
// +++++++++++++++++++++++++++++++++++++++++++++
$upperRangeAZStart = microtime(true);
for($i = 0; $i <= ITERATIONS; $i++) {
$upperRangeAZ = range('A', 'Z');
}
$upperRangeAZElapsed = microtime(true) - $upperRangeAZStart;
printf("upperChrElapsed: %f\n", $upperChrElapsed);
printf("upperRangeChrElapsed: %f\n", $upperRangeChrElapsed);
printf("upperRangeAZElapsed: %f\n", $upperRangeAZElapsed);
?>
Does PHP waste memory keeping an array of letters? I doubt it. range() works on a wide variety of values, too.
If performance is an issue in such a case, you might want to declare the array outside of the loop so that it can be reused. However, large gains rarely come from micro-optimizations. Profile larger applications to find significant gains.
As for profiling at a lower level, you can simply use valgrind on PHP CLI. I've also seen it used on an apache process.
Related: How to profile my C++ application on linux
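The advice about declaring the array outside of the loop can be illustrated with a minimal sketch. The page loop and the link format are made up for this example.

```php
<?php
// Build the alphabet once, outside the hot loop, and reuse it.
$alphabet = range('A', 'Z');

$links = [];
for ($page = 0; $page < 3; $page++) {           // hypothetical repeated work
    foreach ($alphabet as $letter) {
        $links[$page][] = "?letter=" . $letter; // hypothetical link format
    }
}

echo count($alphabet), " letters, first link: ", $links[0][0], "\n";
```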
So I've read the two related questions for calculating a trend line for a graph, but I'm still lost.
I have an array of xy coordinates, and I want to come up with another array of xy coordinates (can be fewer coordinates) that represent a logarithmic trend line using PHP.
I'm passing these arrays to javascript to plot graphs on the client side.
Logarithmic Least Squares
Since we can convert a logarithmic function into a line by taking the log of the x values, we can perform a linear least squares curve fitting. In fact, the work has been done for us and a solution is presented at Math World.
In brief, we're given $X and $Y values that are from a distribution like y = a + b * log(x). The least squares method will give some values aFit and bFit that minimize the distance from the parametric curve to the data points given.
Here is an example implementation in PHP:
First I'll generate some random data with known underlying distribution given by $a and $b
// True parameter values
$a = 10;
$b = 5;
// Range of x values to generate
$x_min = 1;
$x_max = 10;
$nPoints = 50;
// Generate some random points on y = a + b * log(x)
$X = array();
$Y = array();
for($p = 0; $p < $nPoints; $p++){
$x = $p / $nPoints * ($x_max - $x_min) + $x_min;
$y = $a + $b * log($x);
$X[] = $x + rand(0, 200) / ($nPoints * $x_max);
$Y[] = $y + rand(0, 200) / ($nPoints * $x_max);
}
Now, here's how to use the equations given to estimate $a and $b.
// Now convert to log-scale for X
$logX = array_map('log', $X);
// Now estimate $a and $b using equations from Math World
$n = count($X);
// create_function() was removed in PHP 8.0; anonymous functions do the same job
$square = function ($x) { return pow($x, 2); };
$x_squared = array_sum(array_map($square, $logX));
$xy = array_sum(array_map(function ($x, $y) { return $x * $y; }, $logX, $Y));
$bFit = ($n * $xy - array_sum($Y) * array_sum($logX)) /
($n * $x_squared - pow(array_sum($logX), 2));
$aFit = (array_sum($Y) - $bFit * array_sum($logX)) / $n;
You may then generate points for your Javascript as densely as you like:
$Yfit = array();
foreach($X as $x) {
$Yfit[] = $aFit + $bFit * log($x);
}
In this case, the code estimates bFit = 5.17 and aFit = 9.7, which is quite close for only 50 data points.
For the example data given in the comment below, a logarithmic function does not fit well.
The least squares solution is y = -514.734835478 + 2180.51562281 * log(x) which is essentially a line in this domain.
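For reuse, the fitting steps above can be wrapped into a single function. This is a minimal sketch: the name log_fit, the return shape and the noise-free sample data are assumptions for illustration. On exact data the fit should recover the true parameters.

```php
<?php
// Fit y = a + b * log(x) by ordinary least squares on (log(x), y).
function log_fit(array $X, array $Y): array
{
    $n = count($X);
    if ($n !== count($Y) || $n < 2) {
        throw new InvalidArgumentException("Need two arrays of equal length with at least 2 points.");
    }

    $sumL = $sumY = $sumLL = $sumLY = 0.0;
    foreach ($X as $k => $x) {
        if ($x <= 0) {
            throw new InvalidArgumentException("log(x) needs x > 0.");
        }
        $l = log($x);
        $sumL  += $l;
        $sumY  += $Y[$k];
        $sumLL += $l * $l;
        $sumLY += $l * $Y[$k];
    }

    // Zero when all x values are identical.
    $denominator = $n * $sumLL - $sumL * $sumL;
    if ($denominator == 0.0) {
        throw new InvalidArgumentException("All x values are identical; the fit is undefined.");
    }

    $b = ($n * $sumLY - $sumY * $sumL) / $denominator;
    $a = ($sumY - $b * $sumL) / $n;
    return ["a" => $a, "b" => $b];
}

// Noise-free made-up data on y = 10 + 5 * log(x).
$X = [1, 2, 4, 8, 16];
$Y = array_map(function ($x) { return 10 + 5 * log($x); }, $X);
$fit = log_fit($X, $Y);
printf("a=%.4f b=%.4f\n", $fit["a"], $fit["b"]);

// A sparser trend line (fewer points than the input) to hand to JavaScript.
$trend = [];
foreach ([1, 4, 16] as $x) {
    $trend[] = [$x, $fit["a"] + $fit["b"] * log($x)];
}
echo json_encode($trend), "\n";
```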
I would recommend using library: http://www.drque.net/Projects/PolynomialRegression/
Available by Composer: https://packagist.org/packages/dr-que/polynomial-regression.
In case anyone is having problems with the create_function, here is how I edited it. (Though I wasn't using logs, so I did take those out.)
I also reduced the number of calculations and added an R2. It seems to work so far.
function lsq(){
$X = array(1,2,3,4,5);
$Y = array(.3,.2,.7,.9,.8);
// Now estimate $a and $b using equations from Math World
$n = count($X);
$mult_elem = function($x,$y){ //anon function mult array elements
$output=$x*$y; //will be called on each element
return $output;
};
$sumX2 = array_sum(array_map($mult_elem, $X, $X));
$sumXY = array_sum(array_map($mult_elem, $X, $Y));
$sumY = array_sum($Y);
$sumX = array_sum($X);
$bFit = ($n * $sumXY - $sumY * $sumX) /
($n * $sumX2 - pow($sumX, 2));
$aFit = ($sumY - $bFit * $sumX) / $n;
echo ' intercept ',$aFit,' ';
echo ' slope ',$bFit,' ' ;
//r2
$sumY2 = array_sum(array_map($mult_elem, $Y, $Y));
$top=($n*$sumXY-$sumY*$sumX);
$bottom=($n*$sumX2-$sumX*$sumX)*($n*$sumY2-$sumY*$sumY);
$r2=pow($top/sqrt($bottom),2);
echo ' r2 ',$r2;
}