PHP - Optimize finding closest point in an Array


I have created a script that takes a big array of points and, for each one, finds the closest point in 3D space from a limited array of chosen points. It works great. However, sometimes I get over 2 million points to compare against an array of 256 items, which means over 530 million calculations. That takes a considerable amount of time and power (given that it will be running comparisons like that a few times a minute).
I have a limited group of 3D coordinates like this:
array (size=XXX)
0 => 10, 20, 30
1 => 200, 20, 13
2 => 36, 215, 150
3 => ...
4 => ...
... // this is limited to max 256 items
Then I have another very large group of, let's say, random 3D coordinates which can vary in size from 2,500 -> ~ 2,000,000+ items. Basically, what I need to do is to iterate through each of those points and find the closest point. To do that I use Euclidean distance:
sqrt((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2)
This gives me the distance, which I compare to the current closest distance: if it is smaller, it replaces the closest; otherwise I continue with the next point.
I have been looking into how to change this so I don't have to do so many calculations. I have looked at Voronoi diagrams: maybe I could place the points in the diagram and then see which cell each one belongs to. However, I have no idea how to implement such a thing in PHP.
Any idea how I can optimize it?

Just a quick shot from the hip ;-)
You should be able to gain a nice speed-up if you don't compute the full distance to every point. Many points can be skipped because they are already too far away if you just look at one of the x/y/z coordinates.
<?php
$coord = array(18, 200, 15);
$points = array(
    array(10, 20, 30),
    array(200, 20, 13),
    array(36, 215, 150)
);
$closestPoint = $closestDistance = false;
foreach ($points as $point) {
    list($x, $y, $z) = $point;
    // Not compared yet, use first point as closest
    if ($closestDistance === false) {
        $closestPoint = $point;
        $closestDistance = distance($x, $y, $z, $coord[0], $coord[1], $coord[2]);
        continue;
    }
    // If distance in any direction (x/y/z) is bigger than closest distance so far: skip point
    if (abs($coord[0] - $x) > $closestDistance) continue;
    if (abs($coord[1] - $y) > $closestDistance) continue;
    if (abs($coord[2] - $z) > $closestDistance) continue;
    $newDistance = distance($x, $y, $z, $coord[0], $coord[1], $coord[2]);
    if ($newDistance < $closestDistance) {
        $closestPoint = $point;
        // Reuse the distance that was just computed instead of calculating it again
        $closestDistance = $newDistance;
    }
}
var_dump($closestPoint);
function distance($x1, $y1, $z1, $x2, $y2, $z2) {
    return sqrt(pow($x1 - $x2, 2) + pow($y1 - $y2, 2) + pow($z1 - $z2, 2));
}
A working code example can be found at http://sandbox.onlinephpfunctions.com/code/8cfda8e7cb4d69bf66afa83b2c6168956e63b51e
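A further tweak along the same lines, as a rough sketch rather than a tuned solution: since only the ordering of distances matters, you can compare squared distances and drop the sqrt() call entirely, and you can bail out as soon as the partial sum already exceeds the best value so far. The function name nearestIndex() is made up for this illustration; it assumes each point is a plain array of three numbers, as in the question.

```php
<?php
// Hypothetical helper (not from the original answer): returns the index of the
// point in $palette that is closest to $q, or -1 if $palette is empty.
function nearestIndex(array $palette, array $q): int
{
    $bestIndex = -1;
    $bestSq = INF; // best squared distance seen so far

    foreach ($palette as $i => $p) {
        // Accumulate the squared distance one axis at a time and give up
        // as soon as the partial sum can no longer beat the current best.
        $d = $p[0] - $q[0];
        $sq = $d * $d;
        if ($sq >= $bestSq) continue;

        $d = $p[1] - $q[1];
        $sq += $d * $d;
        if ($sq >= $bestSq) continue;

        $d = $p[2] - $q[2];
        $sq += $d * $d;
        if ($sq < $bestSq) {
            $bestSq = $sq;
            $bestIndex = $i;
        }
    }
    return $bestIndex;
}

$palette = array(
    array(10, 20, 30),
    array(200, 20, 13),
    array(36, 215, 150),
);
var_dump(nearestIndex($palette, array(18, 200, 15)));
```

This still visits every one of the (at most 256) chosen points for each input point, so it only shrinks the constant factor; a spatial index such as a k-d tree would be the next step if that is not enough.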

Related

Permutations and big arrays in PHP - performance issues

I have an array of numbers (int or float) and I need to find a value by combining array values. Once the smallest possible combination is found the function returns the array values. Therefore I start with sample-size=1 and keep incrementing it.
Here's a simplified example of the given data:
$values = [10, 20, 30, 40, 50];
$lookingFor = 80;
Valid outcomes:
[30, 50] // return this
[10, 20, 50], [10, 30, 40] // just to demonstrate the possible combinations
Permutations solve this problem and I've tried many different implementations (for example: Permutations - all possible sets of numbers, Get all permutations of a PHP array?, https://github.com/drupol/phpermutations). My favourite is this one with a parameter for permutation-size using the Generator pattern: https://stackoverflow.com/a/43307800
What's my problem? Performance! My arrays have 5 to 150 numbers, and sometimes the sum of 30 array numbers is needed to find the searched value. Sometimes the value can't be found, which means I need to try all possible combinations. Basically, with permutation-size > 5 the task becomes too time-consuming.
An alternative, yet not precise way is to sort the array, take the first X and last X numbers and compare with the searched value. Like this:
sort($values, SORT_NUMERIC);
$countValues = count($values);
if ($sampleSize > $countValues)
{
$sampleSize = $countValues;
}
$minValues = array_slice($values, 0, $sampleSize);
$maxValues = array_slice($values, $countValues - $sampleSize, $sampleSize);
$possibleMin = array_sum($minValues);
$possibleMax = array_sum($maxValues);
if ($possibleMin === $lookingFor)
{
return $minValues;
}
if ($possibleMax === $lookingFor)
{
return $maxValues;
}
return [];
Hopefully somebody has dealt with a similar problem and can guide me in the right direction. Thank you!

You must use combinations instead of permutations (e.g. for 15 numbers there are 15! = 1,307,674,368,000 orderings but only 2^15 = 32,768 subsets).
If array_sum < target_number, no solution exists (assuming there are no negative numbers).
If in_array(target_number, numbers), a solution is found with 1 element.
Sort lowest to highest.
Start with C(n,2): keep the 1st element fixed and pair it with the 2nd, then the 3rd, and so on.
If that loop finds no solution, continue with the 2nd and 3rd, then the 2nd and 4th, etc.
If C(n,2) has no solution, move on to C(n,3), this time with 2 fixed numbers and 1 varying one.
If the loops end with no solution, then no solution exists.
Lastly, I would adjust this question and ask it on the statistics branch of Stack Exchange (Cross Validated), since the mean, median and cumulative distribution of the sums may hint at ways to reduce the number of iterations significantly, and this is their profession.
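To make the steps above a bit more concrete, here is a minimal sketch of a combination search that tries size 1, then size 2, and so on, and prunes branches that cannot reach the target. The function names smallestSubsetWithSum() and searchCombination() are invented for this example, and it assumes non-negative integer values (floats would need a tolerance instead of an exact comparison). It sorts descending rather than ascending, because that makes the pruning test simpler.

```php
<?php
// Hypothetical helper: returns the smallest combination of $values that sums
// to $target, or an empty array if there is none.
// Assumption: all values are non-negative integers.
function smallestSubsetWithSum(array $values, int $target): array
{
    rsort($values, SORT_NUMERIC); // largest first
    for ($size = 1; $size <= count($values); $size++) {
        $found = searchCombination($values, $target, $size, 0);
        if ($found !== null) {
            return $found;
        }
    }
    return [];
}

// Tries to pick exactly $size values from $values[$start..] summing to $remaining.
function searchCombination(array $values, int $remaining, int $size, int $start): ?array
{
    if ($size === 0) {
        return $remaining === 0 ? [] : null;
    }
    $n = count($values);
    for ($i = $start; $i <= $n - $size; $i++) {
        $v = $values[$i];
        if ($v > $remaining) {
            continue; // overshoots on its own; a smaller value may still fit
        }
        // Every later value is <= $v, so $size picks can sum to at most
        // $v * $size. If that is too small, nothing further can work.
        if ($v * $size < $remaining) {
            break;
        }
        $rest = searchCombination($values, $remaining - $v, $size - 1, $i + 1);
        if ($rest !== null) {
            array_unshift($rest, $v);
            return $rest;
        }
    }
    return null;
}

print_r(smallestSubsetWithSum([10, 20, 30, 40, 50], 80));
```

The worst case is still exponential, so this only helps when the pruning cuts off most branches; for integer values with a moderate target, a dynamic-programming subset-sum table would be worth a look as well.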

Cartesian product with specific criteria

I am attempting to find the cartesian product and append specific criteria.
I have four pools of 25 people each. Each person has a score and a price. Each person in each pool looks as such.
[0] => array(
"name" => "jacob",
"price" => 15,
"score" => 100
),
[1] => array(
"name" => "daniel",
"price" => 22,
"score" => 200
)
I want to find the best combination of people, with one person being picked from each pool. However, there is a ceiling price where no grouping can exceed a certain price.
I have been messing with cartesians and permutation functions and cannot seem to figure out how to do this. The only way I know how to code it is to have nested foreach loops, but that is incredibly taxing.
This code below, as you can see, is incredibly inefficient. Especially if the pools increase!
foreach ($poolA as $vA) {
    foreach ($poolB as $vB) {
        foreach ($poolC as $vC) {
            foreach ($poolD as $vD) {
                // calculate total price and check if valid
                // calculate total score and check if greatest
                // if so, add to $greatest array
            }
        }
    }
}
I also thought I could find a way to calculate the total price/score ratio and use that to my advantage, but I don't know what I'm missing.

As pointed out by Barmar, sorting the people in each pool allows you to halt the loops early when the total price exceeds the limit and hence reduces the number of cases you need to check. However, the asymptotic complexity with this improvement is still O(n^4) (where n is the number of people in a pool).
I will outline an alternative approach with better asymptotic complexity as follows:
1. Construct a pool X that contains all pairs of people with one from pool A and the other from pool B.
2. Construct a pool Y that contains all pairs of people with one from pool C and the other from pool D.
3. Sort the pairs in pool X by total price. Then for any pairs with the same price, retain the one with the highest score and discard the remaining pairs.
4. Sort the pairs in pool Y by total price. Then for any pairs with the same price, retain the one with the highest score and discard the remaining pairs.
5. Do a loop with two pointers to check over all possible combinations that satisfy the price constraint, where the head pointer starts at the first item in pool X, and the tail pointer starts at the last item in pool Y. Sample code is given below to illustrate how this loop works:
==========================================================================
$head = 0;
$tail = sizeof($poolY) - 1;
while ($head < sizeof($poolX) && $tail >= 0) {
    // PHP arrays are indexed with ['key'], not .key
    $total_price = $poolX[$head]['price'] + $poolY[$tail]['price'];
    // Your logic goes here...
    if ($total_price > $price_limit) {
        $tail--;
    } elseif ($total_price < $price_limit) {
        $head++;
    } else {
        $head++;
        $tail--;
    }
}
for ($i = $head; $i < sizeof($poolX); $i++) {
    // Your logic goes here...
}
for ($i = $tail; $i >= 0; $i--) {
    // Your logic goes here...
}
==========================================================================
The complexity of steps 1 and 2 is O(n^2), and steps 3 and 4 can be done in O(n^2 log(n)) using a balanced binary tree. Step 5 is essentially a linear scan over n^2 items, so its complexity is also O(n^2). Therefore the overall complexity of this approach is O(n^2 log(n)).
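For what it's worth, here is one way the whole pair-based approach could look when filled in. This is a sketch under assumptions, not the code from the answer above: the function names buildPairs() and bestTeam() are invented, the people are assumed to have the 'name', 'price' and 'score' keys from the question, and instead of discarding equal-price pairs it keeps a running "best score so far" index over pool Y, which serves the same purpose.

```php
<?php
// Hypothetical helper: every pair (one person from each pool), sorted by total price.
function buildPairs(array $poolA, array $poolB): array
{
    $pairs = [];
    foreach ($poolA as $a) {
        foreach ($poolB as $b) {
            $pairs[] = [
                'price' => $a['price'] + $b['price'],
                'score' => $a['score'] + $b['score'],
                'names' => [$a['name'], $b['name']],
            ];
        }
    }
    usort($pairs, function ($x, $y) {
        return $x['price'] <=> $y['price'];
    });
    return $pairs;
}

// Hypothetical helper: best-scoring team of four within $limit, or null if none fits.
function bestTeam(array $poolA, array $poolB, array $poolC, array $poolD, int $limit): ?array
{
    $x = buildPairs($poolA, $poolB);
    $y = buildPairs($poolC, $poolD);

    // $bestUpTo[$i] is the index of the highest-scoring pair among $y[0..$i].
    $bestUpTo = [];
    foreach ($y as $i => $pair) {
        $prev = $i > 0 ? $bestUpTo[$i - 1] : $i;
        $bestUpTo[$i] = $y[$prev]['score'] >= $pair['score'] ? $prev : $i;
    }

    $best = null;
    $tail = count($y) - 1;
    foreach ($x as $pair) { // prices ascend, so $tail only ever moves left
        while ($tail >= 0 && $pair['price'] + $y[$tail]['price'] > $limit) {
            $tail--;
        }
        if ($tail < 0) {
            break; // no pair from Y is affordable any more
        }
        $partner = $y[$bestUpTo[$tail]];
        $score = $pair['score'] + $partner['score'];
        if ($best === null || $score > $best['score']) {
            $best = [
                'price' => $pair['price'] + $partner['price'],
                'score' => $score,
                'names' => array_merge($pair['names'], $partner['names']),
            ];
        }
    }
    return $best;
}

// Made-up sample data in the shape used by the question.
$poolA = [['name' => 'jacob', 'price' => 15, 'score' => 100], ['name' => 'daniel', 'price' => 22, 'score' => 200]];
$poolB = [['name' => 'b1', 'price' => 10, 'score' => 50], ['name' => 'b2', 'price' => 30, 'score' => 300]];
$poolC = [['name' => 'c1', 'price' => 5, 'score' => 10], ['name' => 'c2', 'price' => 20, 'score' => 150]];
$poolD = [['name' => 'd1', 'price' => 8, 'score' => 20], ['name' => 'd2', 'price' => 25, 'score' => 400]];
print_r(bestTeam($poolA, $poolB, $poolC, $poolD, 70));
```

The sorting dominates the running time, which matches the O(n^2 log(n)) estimate; the final scan touches each pair in X and Y at most once.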

A couple of things to note about your approach here. Speaking strictly from a mathematics perspective, you're calculating way more permutations than is actually necessary to arrive at a definitive answer.
In combinatorics, there are two important questions to ask in order to arrive at the exact number of permutations necessary to yield all possible combinations.
Does order matter? (for your case, it does not)
Is repetition allowed? (for your case, it is not necessary to repeat)
Since the answer to both of these questions is no, you need only a fraction of the iterations you're currently doing with your nested loop. Currently you are doing pow(25, 4) iterations, which is 390625. You only actually need n! / (r! (n-r)!), or gmp_fact(25) / (gmp_fact(4) * gmp_fact(25 - 4)), which is only 12650 combinations in total.
Here's a simple example of a function that produces combinations without repetition (and where order does not matter), using a generator in PHP (taken from this SO answer).
function comb($m, $a) {
if (!$m) {
yield [];
return;
}
if (!$a) {
return;
}
$h = $a[0];
$t = array_slice($a, 1);
foreach(comb($m - 1, $t) as $c)
yield array_merge([$h], $c);
foreach(comb($m, $t) as $c)
yield $c;
}
$a = range(1,25); // 25 people in each pool
$n = 4; // 4 pools
foreach(comb($n, $a) as $i => $c) {
echo $i, ": ", array_sum($c), "\n";
}
It would be pretty easy to modify the generator function to check whether the sum of prices meets/exceeds the desired threshold and only return valid results from there (i.e. abandoning early where needed).
The reason repetition and order are not important here for your use case, is because it doesn't matter whether you add $price1 + $price2 or $price2 + $price1, the result will undoubtedly be the same in both permutations. So you only need to add up each unique set once to ascertain all possible sums.

Similar to chiwang's solution, you can eliminate up front every group member for whom another member of the same group exists with the same or a higher score for a lower price.
Maybe you can eliminate many members in each group with this approach.
You can then either use this technique to build two sets of pairs and repeat the filtering (eliminating pairs where another pair exists with a higher score for the same or a lower cost) and then combine the pairs the same way, or add a member step by step (a pair, a triple, a quartet).
If some members exceed the allowed total price on their own, they can be eliminated up front.
If you order the 4 groups by score descending and you find a solution abcd whose total price is legal, you have found the optimal solution for the given set of abc.
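The up-front elimination could look roughly like this. It is only a sketch: removeDominated() is a name made up here, and it assumes the 'name', 'price' and 'score' keys from the question. It also drops a member who has exactly the same price and score as someone already kept, since one of the two is enough.

```php
<?php
// Hypothetical helper: drops every member for whom someone else in the pool
// costs the same or less and scores at least as much.
function removeDominated(array $pool): array
{
    // Cheapest first; among equal prices, highest score first.
    usort($pool, function ($a, $b) {
        return [$a['price'], -$a['score']] <=> [$b['price'], -$b['score']];
    });

    $kept = [];
    $bestScore = -INF;
    foreach ($pool as $person) {
        // Everyone seen so far is at most as expensive, so this person is only
        // worth keeping if they beat the best score among them.
        if ($person['score'] > $bestScore) {
            $kept[] = $person;
            $bestScore = $person['score'];
        }
    }
    return $kept;
}

// Made-up sample pool.
$pool = [
    ['name' => 'jacob', 'price' => 15, 'score' => 100],
    ['name' => 'daniel', 'price' => 22, 'score' => 200],
    ['name' => 'erin', 'price' => 20, 'score' => 90],
    ['name' => 'frank', 'price' => 22, 'score' => 150],
];
print_r(array_column(removeDominated($pool), 'name'));
```

The same function can be reused on the pair pools, as long as the pairs carry 'price' and 'score' keys too.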

The responses here helped me figure out the best way for me to do this.
I haven't optimized the function yet, but essentially I looped through the pools two at a time to find the combined salaries / scores for each combination in the two pools.
I stored each combined salary -> score combination in a new array, and if the salary already existed, I'd compare scores and keep the higher one.
$results = array();
foreach ($poolA as $A) {
    foreach ($poolB as $B) {
        $total_salary = $A['Salary'] + $B['Salary'];
        $total_score = $A['Score'] + $B['Score'];
        $pids = array($A['pid'], $B['pid']);
        // Store the pair if this salary is new, or if it beats the stored score
        if (!isset($results[$total_salary]) || $total_score > $results[$total_salary]['Score']) {
            $results[$total_salary]['Score'] = $total_score;
            $results[$total_salary]['pid'] = $pids;
        }
    }
}
After this loop, I have another one that is identical, except my foreach loops are between $results and $poolC.
foreach($results as $R) {
foreach($poolC as $C) {
and finally, I do it one last time for $poolD.
I am working on optimizing the code by putting all four foreach loops into one.
Thank you everyone for your help, I was able to loop through 9 lists with 25+ people in each and find the best result in an incredibly quick processing time!

Highest multiplication with multiple and max number set

Not quite sure what to set this title as, or what to even search for. So I'll just ask the question and hope I don't get too many downvotes.
I'm trying to find the easiest way to find the highest possible number based on two fixed numbers.
For example:
The most I can multiply by is, say, 18 (first number). But not going over the resulted number, say 100 (second number).
2 x 18 = 36
5 x 18 = 90
But if the first number is a higher number, the second number would need to be less than 18, like so:
11 x 9 = 99
16 x 6 = 96
Here I would go with 11, because even though the second number is only 9, the outcome is the highest. The second number could be anything as long as it's 18 or lower. The first number can be anything, as long as the answer remains below 100. Get what I mean?
So my question is, how would I write this in PHP without having to use switches, if/then statements, or a bunch of loops? Is there some math operator I don't know about that handles this sort of thing?
Thanks.
Edit:
The code that I use now is:
function doMath($cost, $max, $multiplier) {
do {
$temp = $cost * $multiplier;
if ($temp > $max) { --$multiplier; }
} while ($temp > $max);
return array($cost, $temp, $multiplier);
}
If we look at the 11 * 9 = 99 example,
$result = doMath(11, 100, 18);
Would return,
$cost = 11, $temp = 99, $multiplier = 9
Was hoping there was an easier way so that I wouldn't need to use a loop, being as how there are a lot of numbers I need to check.

If I understood you right, you are looking for the floor function, combining it with the min function.
Both a bigger number c and a smaller number a are part of the problem, and you want to find a number b in the range [0, m] such that a * b is as large as possible without exceeding c.
In your example, 100/18 = 5.55555, so that means that 18*5 is smaller than 100, and 18*6 is bigger than 100.
Since floor gets you the integral part of a floating point number, $b = floor($c/$a) does what you want. When a divides c (that is, c/a is an integer already), you get a * b == c.
Now b may be outside of [0,m] so we want to take the smallest of b and m :
if b is bigger than m, we are limited by m,
and if m is bigger than b, we are limited by a * b <= c.
So in the end, your function should be :
function doMath($cost, $max, $multiplier)
{
$div = min($multiplier, floor($max/$cost));
return array($cost, $div * $cost, $div);
}
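One small caveat: floor() returns a float in PHP, so the function above can hand back values like 99.0 rather than 99. If all inputs are positive integers (an assumption on my part), intdiv() gives the same result while staying in integers. The name doMathInt() is made up just to keep it apart from the original:

```php
<?php
// Hypothetical integer-only variant of doMath().
// Assumption: $cost, $max and $multiplier are positive integers.
function doMathInt(int $cost, int $max, int $multiplier): array
{
    // intdiv() is integer division, i.e. floor($max / $cost) for positive ints.
    $div = min($multiplier, intdiv($max, $cost));
    return array($cost, $div * $cost, $div);
}

print_r(doMathInt(11, 100, 18)); // the 11 * 9 = 99 example from the question
```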

Controlling likelyhood of randomly generated numbers

If I wanted a random number between one and three I could do $n = mt_rand(1,3).
There is a 33% chance that $n = 1, a 33% chance it's 2, and a 33% chance that it's 3.
What if I want to make it more difficult to get a 3 than a 1?
Say I want a 50% chance that a 1 is drawn, a 30% chance that a 2 is drawn and a 20% chance that a 3 is drawn?
I need a scalable solution as the possible range will vary between 1-3 and 1-100, but in general I'd like the lower numbers to be drawn more often than the higher ones.
How can I accomplish this?

There is a simple explanation of how you can use a standard uniform random variable to produce a random variable with a distribution similar to the one you want:
https://math.stackexchange.com/a/241543

This is maths.
For your example, just choose a random number between 0 and 99.
Values returned between 0 and 49: call it 1 (50%)
Values returned between 50 and 79: call it 2 (30%)
Values returned between 80 and 99: call it 3 (20%)
A simple if statement will do this, or you can populate an array with the required distribution.
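If the range needs to scale beyond three values, the same idea can be driven by an array of weights instead of hard-coded if statements. A minimal sketch, assuming integer weights; pickWeighted() and its optional $roll parameter are invented for this example (the parameter is only there so the mapping can be checked without randomness):

```php
<?php
// Hypothetical helper: $weights maps each outcome to a positive integer weight.
// If $roll is omitted, a random roll in [0, total weight - 1] is drawn.
function pickWeighted(array $weights, ?int $roll = null)
{
    $total = array_sum($weights);
    if ($total <= 0) {
        throw new InvalidArgumentException('Weights must sum to a positive number.');
    }
    if ($roll === null) {
        $roll = mt_rand(0, $total - 1);
    }
    // Walk the outcomes, subtracting each weight until the roll falls inside one.
    foreach ($weights as $outcome => $weight) {
        if ($roll < $weight) {
            return $outcome;
        }
        $roll -= $weight;
    }
    throw new LogicException('Roll was outside the total weight.');
}

// 50% chance of 1, 30% chance of 2, 20% chance of 3
$n = pickWeighted([1 => 50, 2 => 30, 3 => 20]);
echo $n, "\n";
```

For a 1-100 range you would generate the weights with a loop (for example, a weight that decreases as the number grows) rather than writing them out by hand.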

Assuming a 1 - 10 scale, you can use a simple if statement and have the numbers represent percentages, with each branch setting $n to a specific value. The only downside is that it isn't universal.
$dummy = mt_rand(1,10);
if ($dummy <= 5) {
    // represents 50%
    $n = 1;
} elseif ($dummy <= 9) {
    // represents 40%
    $n = 2;
} else {
    // represents 10%
    $n = 3;
}

Calculate average without being thrown by strays

I am trying to calculate an average without being thrown off by a small set of far-off numbers (e.g., in 1,2,1,2,3,4,50 the single 50 will throw off the entire average).
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that major range it's 20.6 (very different from any of the numbers above).
I am thinking that you can get this like so:
c+d-r
Where c is the count of numbers, d is the number of distinct values, and r is the range. Then you can apply this to all the possible ranges, and the highest score is the optimal range to get an average from.
For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6
If you applied this to the entire number list it would be 8+6-41=-27
I think this works pretty good, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or if someone has a better algorithm all together?

You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
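As a rough illustration of that idea (the function name meanWithoutOutliers() is made up, and using the population standard deviation with a cutoff of 1 is just one possible choice):

```php
<?php
// Hypothetical helper: mean of the values that lie within $k standard
// deviations of the overall mean.
function meanWithoutOutliers(array $values, float $k = 1.0): float
{
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('No values given.');
    }
    $mean = array_sum($values) / $n;

    // Population standard deviation: sqrt of the mean squared deviation.
    $sumSq = 0.0;
    foreach ($values as $v) {
        $sumSq += ($v - $mean) ** 2;
    }
    $sd = sqrt($sumSq / $n);

    $kept = array_filter($values, function ($v) use ($mean, $sd, $k) {
        return abs($v - $mean) <= $k * $sd;
    });
    // With a very small $k nothing may survive; fall back to the plain mean.
    return count($kept) > 0 ? array_sum($kept) / count($kept) : $mean;
}

echo meanWithoutOutliers([19, 20, 21, 21, 22, 30, 60, 60]), "\n";
```

On the list from the question this throws out the two 60s but keeps the 30, so the result is about 22.2 rather than 20.6; a tighter cutoff or a second pass would be needed to drop the 30 as well.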

Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number-summary often used to figure these things out.
function get_median($arr) {
sort($arr);
$c = count($arr) - 1;
if ($c%2) {
$b = round($c/2);
$a = $b-1;
return ($arr[$b] + $arr[$a]) / 2 ;
} else {
return $arr[($c/2)];
}
}
function get_five_number_summary($arr) {
sort($arr);
$c = count($arr) - 1;
$fns = array();
if ($c%2) {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a-1);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c]); // $c is the last index, so $arr[$c] is the maximum
return $fns;
}
else {
$b = round($c/2);
$a = $b-1;
$lower_quartile = array_slice($arr, 1, $a);
$upper_quartile = array_slice($arr, $b+1, count($lower_quartile));
$fns = array($arr[0], get_median($lower_quartile), get_median($arr), get_median($upper_quartile), $arr[$c]); // $c is the last index, so $arr[$c] is the maximum
return $fns;
}
}
function find_outliers($arr) {
$fns = get_five_number_summary($arr);
$interquartile_range = $fns[3] - $fns[1];
$low = $fns[1] - $interquartile_range;
$high = $fns[3] + $interquartile_range;
foreach ($arr as $v) {
if ($v > $high || $v < $low)
echo "$v is an outlier<br>";
}
}
//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P

Why don't you use the median? It's not 30, it's 21.5.

You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.

You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why many statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.
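The "sort, keep the middle, take the mean" idea is usually called a trimmed mean. A minimal sketch, where trimmedMean() is a made-up name and $trimFraction is the share of values dropped from each end (so 0.05 keeps the middle 90%):

```php
<?php
// Hypothetical helper: mean of the values left after dropping the lowest and
// highest $trimFraction of the sorted list.
function trimmedMean(array $values, float $trimFraction = 0.05): float
{
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('No values given.');
    }
    if ($trimFraction < 0 || $trimFraction >= 0.5) {
        throw new InvalidArgumentException('Trim fraction must be in [0, 0.5).');
    }
    sort($values);
    // Number of values to drop from each end, rounded down.
    $drop = (int) floor($n * $trimFraction);
    $kept = array_slice($values, $drop, $n - 2 * $drop);
    return array_sum($kept) / count($kept);
}

// Dropping a quarter from each end of the question's list leaves 21, 21, 22, 30.
echo trimmedMean([19, 20, 21, 21, 22, 30, 60, 60], 0.25), "\n";
```

Note that with only 8 values a 5% trim drops nothing at all, so small lists need a larger fraction for the trimming to have any effect.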
