Some background
I was having a go at the common "MaxProfit" programming challenge. It basically goes like this:
Given a zero-indexed array A consisting of N integers containing daily
prices of a stock share for a period of N consecutive days, returns
the maximum possible profit from one transaction during this period.
I was quite pleased with this PHP algorithm I came up with, having avoided the naive brute-force approach:
public function maxProfit($prices)
{
    $maxProfit = 0;
    $key = 0;
    $n = count($prices);
    while ($key < $n - 1) {
        $buyPrice = $prices[$key];
        $maxFuturePrice = max(array_slice($prices, $key + 1));
        $profit = $maxFuturePrice - $buyPrice;
        if ($profit > $maxProfit) $maxProfit = $profit;
        $key++;
    }
    return $maxProfit;
}
However, having tested my solution, it seems to perform badly, perhaps even in O(n^2) time.
I did a bit of reading around the subject and discovered a very similar Python solution. Python has a handy slicing ability that lets you split a list with the a[s:e] syntax, unlike PHP, where I used the array_slice function. I decided this must be the bottleneck, so I did some tests:
Tests
PHP array_slice()
$n = 10000;
$a = range(0, $n);
$start = microtime(true);

foreach ($a as $key => $elem) {
    $subArray = array_slice($a, $key);
}

$end = microtime(true);
echo sprintf("Time taken: %sms", round(1000 * ($end - $start), 4)) . PHP_EOL;
Results:
$ php phpSlice.php
Time taken: 4473.9199ms
Time taken: 4474.633ms
Time taken: 4499.434ms
Python a[s : e]
import time

n = 10000
a = range(0, n)
start = time.time()

for key, elem in enumerate(a):
    subArray = a[key:]

end = time.time()
print "Time taken: {0}ms".format(round(1000 * (end - start), 4))
Results:
$ python pySlice.py
Time taken: 213.202ms
Time taken: 212.198ms
Time taken: 215.7381ms
Time taken: 213.8121ms
Question
Why is PHP's array_slice() around 20x less efficient than Python?
Is there an equivalently efficient method in PHP that achieves the above and thus hopefully makes my maxProfit algorithm run in O(N) time?
Edit: I realise my implementation above is not actually O(N), but my question still stands regarding the efficiency of slicing arrays.
I don't really know, but PHP's arrays are messed up hybrid monsters, maybe that's why. Python's lists are really just lists, not at the same time dictionaries, so they might be simpler/faster because of that.
Yes, do an actual O(n) solution. Your solution isn't just slow because PHP's slicing is apparently slow, it's slow because you obviously have an O(n^2) algorithm. Just walk over the array once, keep track of the minimum price found so far, and compare it against the current price, instead of taking the max over half the array in every single loop iteration.
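Here is a minimal sketch of that single-pass idea, keeping the question's method signature (this is an illustration of the suggestion above, not code from the original post):
public function maxProfit($prices)
{
    $maxProfit = 0;
    $minPrice = PHP_INT_MAX;
    foreach ($prices as $price) {
        if ($price < $minPrice) {
            $minPrice = $price;                  // cheapest buy seen so far
        } elseif ($price - $minPrice > $maxProfit) {
            $maxProfit = $price - $minPrice;     // best profit selling today against that buy
        }
    }
    return $maxProfit;
}
One pass, no slicing, so it runs in O(N) regardless of how array_slice performs.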
Related
I'm doing a calculation in PHP using bcmath, and need to raise e by a fractional exponent. Unfortunately, bcpow() only accepts integer exponents. The exponent is typically higher precision than a float will allow, so normal arithmetic functions won't cut it.
For example:
$e = exp(1);
$pow = "0.000000000000000000108420217248550443400745280086994171142578125";
$result = bcpow($e, $pow);
Result is "1" with the error, "bc math warning: non-zero scale in exponent".
Is there another function I can use instead of bcpow()?
Your best bet is probably to use the Taylor series expansion. As you noted, PHP's bcpow is limited to integer exponents.
So what you can do is roll your own bc factorial function and use the wiki page to implement a Taylor series expansion of the exponential function.
function bcfac($num) {
    if ($num == 0) return 1;
    $result = '1';
    for (; $num > 0; $num--)
        $result = bcmul($result, $num);
    return $result;
}

$mysum = '0';
for ($i = 0; $i < 300; $i++) {
    $mysum = bcadd($mysum, bcdiv(bcpow($pow, $i), bcfac($i)));
}

print $mysum;
Obviously, the $i<300 is an approximation for infinity... You can change it to suit your performance needs.
With $i=20, I got
1.00000000000000000010842021724855044340662275184110560868263421994092888869270293594926619547803962155136242752708629105688492780863293090291376157887898519458498571566021915144483905034693109606778068801680332504212458366799913406541920812216634834265692913062346724688397654924947370526356787052264726969653983148004800229537555582281617497990286595977830803702329470381960270717424849203303593850108090101578510305396615293917807977774686848422213799049363135722460179809890014584148659937665374616
This is comforting since that small of an exponent should yield something really close to 1.0.
Old question, but people might still be interested nonetheless.
So Kevin got the right idea with the Taylor polynomial, but when you derive your algorithm from it directly, you can get into trouble: mainly, the code gets slow for long input strings when using large cut-off values for $i.
Here is why:
At every step, by which I mean with each new $i, the code calls bcfac($i). Every time bcfac is called it performs $i-1 calculations. And $i goes all the way up to 299... that's almost 45,000 operations! Not your quick'n'easy floating point operations, but slow BC string operations - if you set bcscale(100), your bcmul has to handle up to 10,000 pairs of chars!
Also, bcpow slows down with increasing $i, too. Not as much as bcfac, because it probably uses something akin to the square-and-multiply method, but it still adds something.
Overall the time required grows quadratically with the number of polynomial terms computed.
So... what to do?
Here's a tip:
Whenever you handle polynomials, especially Taylor-polynomials, use the Horner method.
It converts this: exp(x) = x^0/0! + x^1/1! + x^2/2! + x^3/3! + ...
...into that: exp(x) = ((( ... )*x/3+1 )*x/2+1 )*x/1+1
And suddenly you don't need any powers or factorials at all!
function bc_exp($number) {
    $result = 1;
    for ($i = 299; $i > 0; $i--)
        $result = bcadd(bcmul(bcdiv($result, $i), $number), 1);
    return $result;
}
This needs only 3 bc-operations for each step, no matter what $i is.
With a starting value of $i=299 (to calculate exp with the same precision as Kevin's code does) we now only need 897 bc-operations, compared to more than 45,000.
Even using 30 as cut-off instead of 300, we now only need 87 bc-operations while the other code still needs 822 for the factorials alone.
Horner's Method saving the day again!
Some other thoughts:
1) Kevin's code would probably crash with input="0", depending on how bcmath handles errors, because the code tries bcpow(0,0) at the first step ($i=0).
2) Larger exponents require longer polynomials and therefore more iterations, e.g. bc_exp(300) will give a wrong answer, even with $i=299, while something like bc_exp(3) will work fine and dandy.
Each term adds x^n/n! to the result, so this term has to get small before the polynomial can start to converge. Now compare two consecutive terms:
( x^(n+1)/(n+1)! ) / ( x^n/n! ) = x/(n+1)
Each summand is the previous one times x/(n+1) (which is exactly the factor the Horner loop applies), so in order for x^(n+1)/(n+1)! to get small, x/(n+1) has to get small as well, which is only the case once n > x.
In conclusion: as long as the number of iterations is smaller than the input value, the terms keep growing. Only once you add enough steps that the number of iterations exceeds the input does the algorithm slowly start to converge.
In order to reach results that can satisfy someone who is willing to use bcmath, your $i needs to be significantly larger than your $number. And that's a huge problem when you try to calculate stuff like e^346674567801
A solution is to divide the input into its integer part and its fraction part.
Then use bcpow on the integer part and bc_exp on the fraction part, which now converges from the get-go since the fraction part is smaller than 1. In the end, multiply the results.
e^x = e^(intpart+fracpart) = e^intpart * e^fracpart = bcpow(e,intpart) * bc_exp(fracpart)
You could even implement it directly into the code above:
function bc_exp2($number) {
    $parts = explode(".", $number);
    $fracpart = "0." . $parts[1];
    $result = 1;
    for ($i = 299; $i > 0; $i--)
        $result = bcadd(bcmul(bcdiv($result, $i), $fracpart), 1);
    $result = bcmul(bcpow(exp(1), $parts[0]), $result);
    return $result;
}
Note that exp(1) gives you a floating-point number which probably won't satisfy your needs as a bcmath user. You might want to use a value for e that is more accurate, in accordance with your bcscale setting.
3) Talking about numbers of iterations: 300 will be overkill in most situations, while in some others it might not even be enough. An algorithm that takes your bcscale and $number and calculates the number of required iterations would be nice. Already got some ideas involving log(n!), but nothing concrete yet.
4) To use this method with an arbitrary base you can use a^x = e^(x*ln(a)).
You might want to divide x into its intpart and fracpart before using bc_exp (instead of doing that within bc_exp2) to avoid unnecessary function calls.
function bc_pow2($base, $exponent) {
    $parts = explode(".", $exponent);
    if (!isset($parts[1]) || $parts[1] == 0) {
        $result = bcpow($base, $parts[0]);
    } else {
        $result = bcmul(bc_exp(bcmul(bc_ln($base), "0." . $parts[1])), bcpow($base, $parts[0]));
    }
    return $result;
}
Now we only need to program bc_ln. We can use the same strategy as above:
Take the Taylor polynomial of the natural logarithm function. (Since ln(0) isn't defined, take 1 as the development point instead.)
Use Horner's method to drastically improve performance.
Turn the result into a loop of bc-operations.
Also make use of ln(x) = -ln(1/x) when handling x > 1, to guarantee convergence.
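A minimal sketch of those steps (bc_ln_sketch is a hypothetical name, not code from this answer; it assumes bcscale() has been set and 0 < $x <= 2, so apply the ln(x) = -ln(1/x) trick first for larger x):
function bc_ln_sketch($x, $steps = 300) {
    $y = bcsub($x, '1');                         // develop around 1: ln(x) = ln(1 + y)
    $result = '0';
    for ($i = $steps; $i >= 1; $i--) {
        // coefficient of y^i in the series is (-1)^(i+1) / i
        $coeff = bcdiv(($i % 2 == 1) ? '1' : '-1', (string) $i);
        $result = bcadd($coeff, bcmul($y, $result));
    }
    return bcmul($y, $result);                   // the nested polynomial times y
}
As with bc_exp, each step is a fixed handful of bc-operations, with no powers or factorials.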
Useful functions (don't forget to set bcscale() before using them):
function bc_fact($f){return $f==1?1:bcmul($f,bc_fact(bcsub($f, '1')));}
function bc_exp($x,$L=50){$r=bcadd('1.0',$x);for($i=0;$i<$L;$i++){$r=bcadd($r,bcdiv(bcpow($x,$i+2),bc_fact($i+2)));}return $r;}#e^x
function bc_ln($x,$L=50){$r=0;for($i=0;$i<$L;$i++){$p=1+$i*2;$r = bcadd(bcmul(bcdiv("1.0",$p),bcpow(bcdiv(bcsub($x,"1.0"),bcadd($x,"1.0")),$p)),$r);}return bcmul("2.0", $r);}#2*Sum((1/(2i+1))*(((x-1)/(x+1))^(2i+1)))
function bc_pow($x,$p){return bc_exp(bcmul((bc_ln(($x))), $p));}
I want to generate a random string and was doing some research and found the following link:
http://golearnphp.com/php-rand-vs-mt_rand-and-openssl_random_pseudo_bytes/
function generateRandom($length) {
    $validCharacters = 'abcdefghijklmnopqrstuvwxyz0123456789';
    $myKeeper = '';
    for ($n = 1; $n < $length; $n++) {
        $whichCharacter = rand(0, strlen($validCharacters) - 1);
        $myKeeper .= $validCharacters{$whichCharacter};
    }
    return $myKeeper;
}

function generateRandomdMT($length) {
    $validCharacters = 'abcdefghijklmnopqrstuvwxyz0123456789';
    $myKeeper = '';
    for ($n = 1; $n < $length; $n++) {
        $whichCharacter = mt_rand(0, strlen($validCharacters) - 1);
        $myKeeper .= $validCharacters{$whichCharacter};
    }
    return $myKeeper;
}
$start = microtime(true);
echo htmlentities(generateRandom(100000));
var_dump(microtime(true) - $start);
$start = microtime(true);
echo htmlentities(generateRandomdMT(100000));
var_dump(microtime(true) - $start);
$start = microtime(true);
echo htmlentities(substr(base64_encode(openssl_random_pseudo_bytes(100000)), 0, 100000));
var_dump(microtime(true) - $start);
In the post the writer is saying that openssl_random_pseudo_bytes is significantly faster than the other two. Is this true? Is openssl_random_pseudo_bytes really that much faster? Is that the correct way to test the "fastness" of functions?
openssl_random_pseudo_bytes() was created to be cryptographically strong (check the second parameter). rand() is the old random function with a small repeat period. mt_rand() is better than rand() but is not supposed to be used in crypto systems.
I bet that the difference in execution time does not impact your application.
Also, those functions return different results. The first two return a string drawn from 36 possible characters, and the third one returns a string drawn from 64 possible symbols. The result of the first two functions is also shorter than the third one.
If you are making optimizations to speed up your application, the first thing you should know is how to profile your code.
In the post the writer is saying that openssl_random_pseudo_bytes is significant faster then the other two. Is this true?
In normal situations mt_rand() is significantly faster than openssl_random_pseudo_bytes().
It's only slower in the test code you've posted because you are comparing apples and oranges. For rand() and mt_rand() you are using complex functions which build up a string one byte at a time, whereas for openssl_random_pseudo_bytes() you're taking the raw binary stream it produces and running it through base64_encode(), which is going to be much faster.
If you could get a raw binary stream out of mt_rand() or rand(), or a sequence of numbers 0 to 63 from openssl_random_pseudo_bytes(), you could do an apples to apples comparison.
In my testing, I found mt_rand() about 4 times as fast as openssl_random_pseudo_bytes(4) when I used unpack('V', openssl_random_pseudo_bytes(4) & "\xff\xff\xff\x7f") in order to get an equivalent output to mt_rand(). However this is still technically an apples to oranges situation because I'm doing additional processing on one in order to match it to the other, just in the opposite direction to you.
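For illustration, a rough sketch of that kind of like-for-like timing (the byte mask is the one described above; the loop size and variable names are just assumptions):
$iterations = 100000;

$t = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $n = mt_rand();                                           // one integer in 0 .. 2^31-1
}
$mtTime = microtime(true) - $t;

$t = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $bytes = openssl_random_pseudo_bytes(4) & "\xff\xff\xff\x7f";
    $n = unpack('V', $bytes)[1];                              // one integer in 0 .. 2^31-1
}
$sslTime = microtime(true) - $t;

printf("mt_rand: %.4fs, openssl: %.4fs\n", $mtTime, $sslTime);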
At the time you asked this question, there was a bug report here: https://bugs.php.net/bug.php?id=70014 (PHP 5.6.10). It seems to be fixed in newer versions of PHP.
In my experience it has never been necessary; I prefer mt_rand(). But if you are generating random values for encryption purposes, as I am, then do not use it; you should use random_bytes() instead: https://www.php.net/manual/en/function.random-bytes.php
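For completeness, a minimal sketch of a cryptographically secure version of the question's generator on PHP 7+, using random_int(), which ships alongside random_bytes() (generateRandomSecure is just an illustrative name):
function generateRandomSecure($length) {
    $validCharacters = 'abcdefghijklmnopqrstuvwxyz0123456789';
    $myKeeper = '';
    for ($n = 0; $n < $length; $n++) {
        // random_int() draws from a CSPRNG, unlike rand()/mt_rand()
        $myKeeper .= $validCharacters[random_int(0, strlen($validCharacters) - 1)];
    }
    return $myKeeper;
}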
I need to implement the Cutting Stock Problem with a php script.
As my math skills are not that great I am just trying to brute force it.
Starting with these parameters
$inventory is an array of lengths that are available to be cut.
$requestedPieces is an array of lengths that were requested by the
customer.
$solution is an empty array
I have currently worked out this recursive function to come up with all possible solutions:
function branch($inventory, $requestedPieces, $solution) {
    // Loop through the requested pieces and find all inventory that can fulfill them
    foreach ($requestedPieces as $requestKey => $requestedPiece) {
        foreach ($inventory as $inventoryKey => $piece) {
            if ($requestedPiece <= $piece) {
                $solution2 = $solution;
                array_push($solution2, array($requestKey, $inventoryKey));
                $requestedPieces2 = $requestedPieces;
                unset($requestedPieces2[$requestKey]);
                $inventory2 = $inventory;
                $inventory2[$inventoryKey] = $piece - $requestedPiece;
                if (count($requestedPieces2) > 0) {
                    branch($inventory2, $requestedPieces2, $solution2);
                } else {
                    global $solutions;
                    array_push($solutions, $solution2);
                }
            }
        }
    }
}
The biggest inefficiency I have discovered with this is that it will find the same solution multiple times but with the steps in a different order.
For example:
$inventory = array(1.83, 20.66);
$requestedPieces = array(0.5, 0.25);
The function will come up with 8 solutions where it should come up with 4 solutions.
What is a good way to resolve this?
This does not answer your question, but I thought it could be worth being mentioned:
You have several other ways to solve your problem rather than brute forcing it. The Wikipedia page on the topic is pretty thorough, but I'll just describe two other, simpler ideas. I will use the Wikipedia terminology for certain words, namely master for an inventory piece and cut for a requested piece. I will use set to denote a set of cuts pertaining to a given master.
The first one is based on a greedy algorithm, and consists of filling a set with the largest available cuts until no more cuts fit, repeating that same process for each master and yielding one set per master.
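A rough sketch of that greedy idea (greedyCut is an illustrative name; it assumes $inventory and $requestedPieces are plain arrays of lengths, as in the question):
function greedyCut(array $inventory, array $requestedPieces) {
    rsort($requestedPieces);                      // consider the largest cuts first
    $sets = array();
    foreach ($inventory as $masterKey => $remaining) {
        $sets[$masterKey] = array();
        foreach ($requestedPieces as $cutKey => $cut) {
            if ($cut <= $remaining) {             // largest remaining cut that still fits
                $sets[$masterKey][] = $cut;
                $remaining -= $cut;
                unset($requestedPieces[$cutKey]);
            }
        }
    }
    return $sets;                                 // one set of cuts per master
}
It is fast and simple, but like most greedy approaches it is not guaranteed to find the optimal packing.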
The second one is more dynamic: it uses recursion (like yours) and looks for the best fit for the remaining length of master and cuts at each step of the recursion, the goal being to minimize the wasted length when no more cuts can fit.
function branch($master, $cuts, $set) {
    $goods = array_filter($cuts, function($v) use ($master) { return $v <= $master; });
    $res = array($master, $set, $cuts);
    if (empty($goods))
        return $res;
    $remaining = array_diff($cuts, $goods);
    foreach ($goods as $k => $g) {
        $t = $set;
        array_push($t, $g);
        $r = $remaining;
        $c = $goods;
        for ($i = 0; $i < $k; $i++)
            array_push($r, array_shift($c));
        array_shift($c);
        $t = branch($master - $g, $c, $t);
        // pass $t by reference (value comes first in the callback) so the leftover
        // cuts actually get appended to the returned cut list
        array_walk($r, function($v, $k) use (&$t) { array_push($t[2], $v); });
        if ($t[0] == 0) return $t;
        if ($t[0] < $res[0])
            $res = $t;
    }
    return $res;
}
The function above should give you the optimal set for a given master. It returns an array of 3 values:
the wasted length on master
the set
the remaining cuts
The parameters are
the master length,
the cuts to be performed (must be sorted in descending order),
the set of cuts already scheduled (a preexisting set, which would be empty for the first call for each master)
Caveats: it depends on the masters' order; you could certainly write a function which tries all the relevant possibilities to find the best order of masters.
I am trying to calculate an average without being thrown off by a small set of far-off numbers (i.e., 1, 2, 1, 2, 3, 4, 50); the single 50 will throw off the entire average.
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that major range it's 20.6 (a big difference from any of the numbers above).
I am thinking that you can get this like so:
c+d-r
Where c is the count of numbers, d is the number of distinct values, and r is the range. Then you can apply this to all the possible ranges, and the highest score is the optimal range to get an average from.
For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6
If you applied this to the entire number list it would be 8+6-41=-27
I think this works pretty well, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or if someone has a better algorithm all together?
You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
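A rough sketch of that approach (averageWithoutOutliers is an illustrative name; the one-standard-deviation cutoff is the tunable mentioned above):
function averageWithoutOutliers(array $numbers, $tolerance = 1.0) {
    $mean = array_sum($numbers) / count($numbers);
    $squaredDiffs = array_map(function ($n) use ($mean) {
        return ($n - $mean) * ($n - $mean);
    }, $numbers);
    $stdDev = sqrt(array_sum($squaredDiffs) / count($numbers));
    $kept = array_filter($numbers, function ($n) use ($mean, $stdDev, $tolerance) {
        return abs($n - $mean) <= $tolerance * $stdDev;         // drop the outliers
    });
    return array_sum($kept) / count($kept);
}
For the list in the question, averageWithoutOutliers(array(19, 20, 21, 21, 22, 30, 60, 60)) drops the two 60s and returns roughly 22.2.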
Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number-summary often used to figure these things out.
function get_median($arr) {
    sort($arr);
    $c = count($arr) - 1;
    if ($c % 2) {
        $b = round($c / 2);
        $a = $b - 1;
        return ($arr[$b] + $arr[$a]) / 2;
    } else {
        return $arr[($c / 2)];
    }
}

function get_five_number_summary($arr) {
    sort($arr);
    $c = count($arr) - 1;
    $fns = array();
    if ($c % 2) {
        $b = round($c / 2);
        $a = $b - 1;
        $lower_quartile = array_slice($arr, 1, $a - 1);
        $upper_quartile = array_slice($arr, $b + 1, count($lower_quartile));
        $fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c - 1]);
        return $fns;
    } else {
        $b = round($c / 2);
        $a = $b - 1;
        $lower_quartile = array_slice($arr, 1, $a);
        $upper_quartile = array_slice($arr, $b + 1, count($lower_quartile));
        $fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c - 1]);
        return $fns;
    }
}

function find_outliers($arr) {
    $fns = get_five_number_summary($arr);
    $interquartile_range = $fns[3] - $fns[1];
    $low = $fns[1] - $interquartile_range;
    $high = $fns[3] + $interquartile_range;
    foreach ($arr as $v) {
        if ($v > $high || $v < $low)
            echo "$v is an outlier<br>";
    }
}
//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
Why don't you use the median? It's not 30, it's 21.5.
You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.
You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
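A minimal sketch of that trimmed mean (trimmedMean is an illustrative name; keeping the middle 90% is just the example fraction mentioned above):
function trimmedMean(array $numbers, $keep = 0.9) {
    sort($numbers);
    $n = count($numbers);
    $drop = (int) floor($n * (1 - $keep) / 2);    // trim this many from each end
    $middle = array_slice($numbers, $drop, $n - 2 * $drop);
    return array_sum($middle) / count($middle);
}
For short lists like the one in the question you would need a smaller $keep (e.g. 0.5) before anything actually gets trimmed.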
There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why many statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.
PHPBench.com runs quick benchmark scripts on each pageload. On the foreach test, when I load it, foreach takes anywhere from 4 to 10 times as long to run as the third example.
Why is it that a native language construct is apparently slower than performing the logic oneself?
Maybe it has to do with the fact that foreach works on a copy of the array?
Or maybe it has to do with the fact that, when looping with foreach, on each iteration, the internal array pointer is changed to point to the next element?
Quoting the relevant portion of foreach's manual page:
Note: Unless the array is referenced, foreach operates on a copy of the specified array and not the array itself. foreach has some side effects on the array pointer.
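As a small aside, one way to make foreach work on the array itself rather than on copied values is to iterate by reference, e.g.:
$aHash = array('a' => 'x', 'b' => 'y');
foreach ($aHash as $key => &$val) {
    $val .= "a";          // with &$val the elements of $aHash are modified directly
}
unset($val);              // break the reference that is left dangling after the loop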
As far as I can tell, the third test you linked to doesn't do either of those two things -- which means both tests don't do the same thing -- which means you are not comparing two ways of writing the same code.
(I would also say that this kind of micro-optimization will not matter at all in a real application -- but I guess you already know that, and just asked out of curiosity)
There is also one thing that doesn't feel right in this test: it only runs the test one time; for a "better" test, it might be useful to run all of those more than once -- with timings on the order of 100 microseconds, not much is required to make a huge difference.
(Considering the first test varies between 300% and 500% on a few refreshes...)
For those who don't want to click, here's the first test (I've gotten 3xx%, 443%, and 529%):
foreach ($aHash as $key => $val) {
    $aHash[$key] .= "a";
}
And the third one (100%):
$key = array_keys($aHash);
$size = sizeOf($key);
for ($i = 0; $i < $size; $i++) {
    $aHash[$key[$i]] .= "a";
}
I'm sorry, but the website got it wrong. Here's my own script that shows the two are almost the same in speed, and in fact, foreach is faster!
<?php
function start() {
    global $aHash;

    // Initial Configuration
    $i = 0;
    $tmp = '';
    while ($i < 10000) {
        $tmp .= 'a';
        ++$i;
    }
    $aHash = array_fill(100000000000000000000000, 100, $tmp);
    unset($i, $tmp);
    reset($aHash);
}

/* The Test */
$t = microtime(true);
for ($x = 0; $x < 500; $x++) {
    start();
    $key = array_keys($aHash);
    $size = sizeOf($key);
    for ($i = 0; $i < $size; $i++) $aHash[$key[$i]] .= "a";
}
print (microtime(true) - $t);
print ('<br/>');

$t = microtime(true);
for ($x = 0; $x < 500; $x++) {
    start();
    foreach ($aHash as $key => $val) $aHash[$key] .= "a";
}
print (microtime(true) - $t);
?>
If you look at the source code of the tests: http://www.phpbench.com/source/test2/1/ and http://www.phpbench.com/source/test2/3/ , you can see that $aHash isn't repopulated to the initial data after each iteration. It is created once at the beginning, then each test is run X times. In this sense, you are working with an ever-growing $aHash on each iteration... in pseudocode:
iteration 1: $aHash[10000000000000] == 'aaaaaa....10000 times...a';
iteration 2: $aHash[10000000000000] == 'aaaaaa....10001 times...a';
iteration 3: $aHash[10000000000000] == 'aaaaaa....10002 times...a';
Over time, the data for all the tests is getting larger with each iteration, so of course by iteration 100 the array_keys method is faster, because it will always have the same keys, whereas the foreach loop has to contend with an ever-growing data set and store the values in arrays!
If you run my code provided above on your server, you'll see clearly that foreach is faster AND neater AND clearer.
If the author of the site intended his test to do what it does, then it certainly is not clear, and otherwise, it's an invalid test.
Benchmark results for such micro measurements, coming from a live, busy webserver that is subject to extreme amounts of varying load and other influences, should be disregarded. This is not an environment to benchmark in.