K-means clustering: What's wrong? (PHP)


I was looking for a way to calculate dynamic market values in a soccer manager game. I asked this question here and got a very good answer from Alceu Costa.
I tried to code this algorithm (90 elements, 5 clusters), but it doesn't work correctly:
In the first iteration, a high percentage of the elements change their cluster.
From the second iteration on, all elements change their cluster.
Since the algorithm normally runs until convergence (no element changes its cluster), it never finishes in my case.
So I manually capped it at 15 iterations; without the cap it would run forever.
You can see the output of my algorithm here. What's wrong with it? Can you tell me why it doesn't work correctly?
I hope you can help me. Thank you very much in advance!
Here's the code:
<?php
include 'zzserver.php';
function distance($player1, $player2) {
global $strengthMax, $maxStrengthMax, $motivationMax, $ageMax;
// $playerX = array(strength, maxStrength, motivation, age, id);
$distance = 0;
$distance += abs($player1['strength']-$player2['strength'])/$strengthMax;
$distance += abs($player1['maxStrength']-$player2['maxStrength'])/$maxStrengthMax;
$distance += abs($player1['motivation']-$player2['motivation'])/$motivationMax;
$distance += abs($player1['age']-$player2['age'])/$ageMax;
return $distance;
}
function calculateCentroids() {
global $cluster;
$clusterCentroids = array();
foreach ($cluster as $key=>$value) {
$strenthValues = array();
$maxStrenthValues = array();
$motivationValues = array();
$ageValues = array();
foreach ($value as $clusterEntries) {
$strenthValues[] = $clusterEntries['strength'];
$maxStrenthValues[] = $clusterEntries['maxStrength'];
$motivationValues[] = $clusterEntries['motivation'];
$ageValues[] = $clusterEntries['age'];
}
if (count($strenthValues) == 0) { $strenthValues[] = 0; }
if (count($maxStrenthValues) == 0) { $maxStrenthValues[] = 0; }
if (count($motivationValues) == 0) { $motivationValues[] = 0; }
if (count($ageValues) == 0) { $ageValues[] = 0; }
$clusterCentroids[$key] = array('strength'=>array_sum($strenthValues)/count($strenthValues), 'maxStrength'=>array_sum($maxStrenthValues)/count($maxStrenthValues), 'motivation'=>array_sum($motivationValues)/count($motivationValues), 'age'=>array_sum($ageValues)/count($ageValues));
}
return $clusterCentroids;
}
function assignPlayersToNearestCluster() {
global $cluster, $clusterCentroids;
$playersWhoChangedClusters = 0;
// BUILD NEW CLUSTER ARRAY WHICH ALL PLAYERS GO IN THEN START
$alte_cluster = array_keys($cluster);
$neuesClusterArray = array();
foreach ($alte_cluster as $alte_cluster_entry) {
$neuesClusterArray[$alte_cluster_entry] = array();
}
// BUILD NEW CLUSTER ARRAY WHICH ALL PLAYERS GO IN THEN END
foreach ($cluster as $oldCluster=>$clusterValues) {
// FOR EVERY SINGLE PLAYER START
foreach ($clusterValues as $player) {
// MEASURE DISTANCE TO ALL CENTROIDS START
$abstaende = array();
foreach ($clusterCentroids as $CentroidId=>$centroidValues) {
$distancePlayerCluster = distance($player, $centroidValues);
$abstaende[$CentroidId] = $distancePlayerCluster;
}
arsort($abstaende);
if ($neuesCluster = each($abstaende)) {
$neuesClusterArray[$neuesCluster['key']][] = $player; // add to new array
// player $player['id'] goes to cluster $neuesCluster['key'] since it is the nearest one
if ($neuesCluster['key'] != $oldCluster) {
$playersWhoChangedClusters++;
}
}
// MEASURE DISTANCE TO ALL CENTROIDS END
}
// FOR EVERY SINGLE PLAYER END
}
$cluster = $neuesClusterArray;
return $playersWhoChangedClusters;
}
// CREATE k CLUSTERS START
$k = 5; // Anzahl Cluster
$cluster = array();
for ($i = 0; $i < $k; $i++) {
$cluster[$i] = array();
}
// CREATE k CLUSTERS END
// PUT PLAYERS IN RANDOM CLUSTERS START
$sql1 = "SELECT ids, staerke, talent, trainingseifer, wiealt FROM ".$prefix."spieler LIMIT 0, 90";
$sql2 = mysql_abfrage($sql1);
$anzahlSpieler = mysql_num_rows($sql2);
$anzahlSpielerProCluster = $anzahlSpieler/$k;
$strengthMax = 0;
$maxStrengthMax = 0;
$motivationMax = 0;
$ageMax = 0;
$counter = 0; // for $anzahlSpielerProCluster so that all clusters get the same number of players
while ($sql3 = mysql_fetch_assoc($sql2)) {
$assignedCluster = floor($counter/$anzahlSpielerProCluster);
$cluster[$assignedCluster][] = array('strength'=>$sql3['staerke'], 'maxStrength'=>$sql3['talent'], 'motivation'=>$sql3['trainingseifer'], 'age'=>$sql3['wiealt'], 'id'=>$sql3['ids']);
if ($sql3['staerke'] > $strengthMax) { $strengthMax = $sql3['staerke']; }
if ($sql3['talent'] > $maxStrengthMax) { $maxStrengthMax = $sql3['talent']; }
if ($sql3['trainingseifer'] > $motivationMax) { $motivationMax = $sql3['trainingseifer']; }
if ($sql3['wiealt'] > $ageMax) { $ageMax = $sql3['wiealt']; }
$counter++;
}
// PUT PLAYERS IN RANDOM CLUSTERS END
$m = 1;
while ($m < 16) {
$clusterCentroids = calculateCentroids(); // calculate new centroids of the clusters
$playersWhoChangedClusters = assignPlayersToNearestCluster(); // assign each player to the nearest cluster
if ($playersWhoChangedClusters == 0) { $m = 1001; }
echo '<li>Iteration '.$m.': '.$playersWhoChangedClusters.' players have changed place</li>';
$m++;
}
print_r($cluster);
?>

It's so embarrassing :D I think the whole problem is caused by only one letter:
In assignPlayersToNearestCluster() you can find arsort($abstaende);. After that, the function each() takes the first element. But arsort sorts in descending order, so the first element is the one with the highest value. So it picks the cluster with the highest distance.
So it should be asort, of course. :) To check, I tested it with asort, and I get convergence after 7 iterations. :)
Do you think that was the mistake? If so, my problem is solved. In that case: sorry for bothering you with that stupid question. ;)
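The sort-then-take-first step can also be avoided entirely, which matters on newer PHP versions because each() was deprecated in PHP 7.2 and removed in PHP 8.0. The following is only a minimal sketch: nearestCentroid() is a hypothetical helper name, not part of the code above, and it assumes $abstaende maps centroid ids to distances as in assignPlayersToNearestCluster().

```php
<?php
// Hypothetical helper (not in the original code): return the key of the
// smallest distance without sorting the array and without each().
function nearestCentroid(array $distances)
{
    // min() finds the smallest distance; array_search() returns the first
    // key that holds exactly that value.
    return array_search(min($distances), $distances, true);
}

// Example distances from one player to three centroids (made-up numbers).
$abstaende = array(0 => 0.82, 1 => 0.15, 2 => 0.47);
echo nearestCentroid($abstaende), "\n"; // prints 1
```

With ties, the first key holding the minimum wins, which is the same centroid asort() followed by reading the first element would normally give for distinct distances.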

EDIT: disregard, I still get the same result as you, everyone winds up in cluster 4. I shall reconsider my code and try again.
I think I've realised what the problem is: k-means clustering is designed to separate a set along its differences, but because of the way you calculate the averages, we end up with no large gaps in the ranges.
Might I suggest concentrating on a single value to determine the clusters (strength appears to make the most sense to me), or abandoning this method altogether and adopting something different (not what you want to hear, I know)?
I found a rather nice site with an example of k-means clustering using integers. I'm going to try to adapt it and will get back with the results some time tomorrow.
http://code.blip.pt/2009/04/06/k-means-clustering-in-php/ <-- link I mentioned and forgot about.

Related

Is there a faster way than array_diff in PHP

I have a set of numbers from MySQL within the range 10 000 000 (8 digits) to 9 999 999 999 (10 digits). The numbers are supposed to be consecutive, but some are missing. I need to know which numbers are missing.
The range is huge. At first I was going to use PHP to do this:
//MySqli Select Query
$results = $mysqli->query("SELECT `OCLC Number` FROM `MARC Records by Number`");
$n_array = array();
while($row = $results->fetch_assoc()) {
$n_array[] = $row["OCLC Number"];
}
d($n_array);
foreach($n_array as $k => $val) {
print $val . " ";
}
/* 8 digits */
$counter = 10000000;
$master_array = array();
/* 10 digits */
while ($counter <= 9999999999 ) {
$master_array[] = $counter;
$counter++;
d($master_array);
}
d($master_array);
$missing_numbers_ar = array_diff ($master_array, $n_array);
d($missing_numbers_ar);
d() is a custom function akin to var_dump().
However, I just realized this would take a huge amount of time. At the 15 minute mark, $master_array had been populated with only 4000 numbers.
How can I do this faster? MySQL-only and MySQL-and-PHP solutions are both welcome. If the optimal solution depends on how many numbers are missing, please let me know how. Thank you.

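Your d() calls are probably the cause of the slowness; please remove them and make these small changes to your code:
while($row = $results->fetch_assoc()) {
// use the number as the array key so that lookups are fast
$n_array[$row["OCLC Number"]] = 1;
}
and
$missing_numbers_ar = [];
// a for loop checks the first number (10000000) as well and stops after 9999999999
for ($counter = 10000000; $counter <= 9999999999; $counter++) {
if (empty($n_array[$counter])) {
$missing_numbers_ar[] = $counter;
}
}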

If the following is still slow I would be surprised. I also just noticed it is similar to @Hieu Vo's answer.
// Make sure the data is returned in order by adding
// an `ORDER BY ...` clause.
$results = $mysqli->query("SELECT `OCLC Number`
FROM `MARC Records by Number`
ORDER BY `OCLC Number`");
$n_array = array();
while($row = $results->fetch_assoc()) {
// Add the "OCLC Number" as a key to the array.
$n_array[$row["OCLC Number"]] = $row["OCLC Number"];
}
// assume the first array key is in fact correct
$i = key($n_array);
// get the last key, also assume it is not missing.
end($n_array);
$max = key($n_array);
// reset the array (should not be needed)
reset($n_array);
do {
if (! isset($n_array[$i])) {
echo 'Missing key:['.$i.']<br />';
// flush the data to the page as you go.
flush();
}
} while(++$i <= $max);
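When the missing numbers come in long runs, it can be cheaper to report gaps as ranges while walking the sorted result once, instead of testing every candidate number. The sketch below illustrates that idea; findGaps() is a hypothetical helper name, it assumes the numbers arrive in ascending order (for example from the ORDER BY query above) and that PHP runs with 64-bit integers so 10-digit values fit.

```php
<?php
// Hypothetical helper: given numbers sorted in ascending order, return the
// missing ranges between neighbours as [first_missing, last_missing] pairs.
function findGaps(iterable $sortedNumbers): array
{
    $gaps = array();
    $previous = null;
    foreach ($sortedNumbers as $number) {
        $number = (int) $number; // database drivers often return strings
        if ($previous !== null && $number > $previous + 1) {
            // everything strictly between the two neighbours is missing
            $gaps[] = array($previous + 1, $number - 1);
        }
        $previous = $number;
    }
    return $gaps;
}

// Made-up sample data: 10000002-10000003 and 10000006-10000008 are missing.
print_r(findGaps(array(10000000, 10000001, 10000004, 10000005, 10000009)));
```

Note that this only finds gaps between the smallest and largest number present; anything missing before the first or after the last row would need a separate check against the expected range bounds.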

Knapsack Equation with item groups

Apparently I can't call it a problem on Stack Overflow, but I am currently trying to understand how to integrate constraints in the form of item groups into the knapsack problem. My math skills are fairly limiting here, but I am very motivated both to make this work as intended and to figure out what each part does (in that order, since things make more sense when they work).
With that said, I have found an absolutely beautiful implementation at Rosetta Code and cleaned up the variable names some to help myself better understand this from a very basic perspective.
Unfortunately I am having an incredibly difficult time figuring out how I can apply this logic to include item groups. My purpose is for building fantasy teams, supplying my own value & weight (points/salary) per player but without groups (positions in my case) I am unable to do so.
Would anyone be able to point me in the right direction for this? I'm reviewing code examples from other languages and additional descriptions of the problem as a whole, however I would like to get the groups implemented by whatever means possible.
<?php
function knapSolveFast2($itemWeight, $itemValue, $i, $availWeight, &$memoItems, &$pickedItems)
{
global $numcalls;
$numcalls++;
// Return memo if we have one
if (isset($memoItems[$i][$availWeight]))
{
return array( $memoItems[$i][$availWeight], $memoItems['picked'][$i][$availWeight] );
}
else
{
// At end of decision branch
if ($i == 0)
{
if ($itemWeight[$i] <= $availWeight)
{ // Will this item fit?
$memoItems[$i][$availWeight] = $itemValue[$i]; // Memo this item
$memoItems['picked'][$i][$availWeight] = array($i); // and the picked item
return array($itemValue[$i],array($i)); // Return the value of this item and add it to the picked list
}
else
{
// Won't fit
$memoItems[$i][$availWeight] = 0; // Memo zero
$memoItems['picked'][$i][$availWeight] = array(); // and a blank array entry...
return array(0,array()); // Return nothing
}
}
// Not at end of decision branch..
// Get the result of the next branch (without this one)
list ($without_i,$without_PI) = knapSolveFast2($itemWeight, $itemValue, $i-1, $availWeight,$memoItems,$pickedItems);
if ($itemWeight[$i] > $availWeight)
{ // Does it return too many?
$memoItems[$i][$availWeight] = $without_i; // Memo without including this one
$memoItems['picked'][$i][$availWeight] = array(); // and a blank array entry...
return array($without_i,array()); // and return it
}
else
{
// Get the result of the next branch (WITH this one picked, so available weight is reduced)
list ($with_i,$with_PI) = knapSolveFast2($itemWeight, $itemValue, ($i-1), ($availWeight - $itemWeight[$i]),$memoItems,$pickedItems);
$with_i += $itemValue[$i]; // ..and add the value of this one..
// Get the greater of WITH or WITHOUT
if ($with_i > $without_i)
{
$res = $with_i;
$picked = $with_PI;
array_push($picked,$i);
}
else
{
$res = $without_i;
$picked = $without_PI;
}
$memoItems[$i][$availWeight] = $res; // Store it in the memo
$memoItems['picked'][$i][$availWeight] = $picked; // and store the picked item
return array ($res,$picked); // and then return it
}
}
}
$items = array("map","compass","water","sandwich","glucose","tin","banana","apple","cheese","beer","suntan cream","camera","t-shirt","trousers","umbrella","waterproof trousers","waterproof overclothes","note-case","sunglasses","towel","socks","book");
$weight = array(9,13,153,50,15,68,27,39,23,52,11,32,24,48,73,42,43,22,7,18,4,30);
$value = array(150,35,200,160,60,45,60,40,30,10,70,30,15,10,40,70,75,80,20,12,50,10);
## Initialize
$numcalls = 0;
$memoItems = array();
$selectedItems = array();
## Solve
list ($m4, $selectedItems) = knapSolveFast2($weight, $value, sizeof($value)-1, 400, $memoItems, $selectedItems);
# Display Result
echo "<b>Items:</b><br>" . join(", ", $items) . "<br>";
echo "<b>Max Value Found:</b><br>$m4 (in $numcalls calls)<br>";
echo "<b>Array Indices:</b><br>". join(",", $selectedItems) . "<br>";
echo "<b>Chosen Items:</b><br>";
echo "<table border cellspacing=0>";
echo "<tr><td>Item</td><td>Value</td><td>Weight</td></tr>";
$totalValue = 0;
$totalWeight = 0;
foreach($selectedItems as $key)
{
$totalValue += $value[$key];
$totalWeight += $weight[$key];
echo "<tr><td>" . $items[$key] . "</td><td>" . $value[$key] . "</td><td>".$weight[$key] . "</td></tr>";
}
echo "<tr><td align=right><b>Totals</b></td><td>$totalValue</td><td>$totalWeight</td></tr>";
echo "</table><hr>";
?>

That knapsack program is traditional, but I think that it obscures what's going on. Let me show you how the DP can be derived more straightforwardly from a brute force solution.
In Python (sorry; this is my scripting language of choice), a brute force solution could look like this. First, there's a function for generating all subsets with breadth-first search (this is important).
def all_subsets(S): # brute force
    subsets_so_far = [()]
    for x in S:
        new_subsets = [subset + (x,) for subset in subsets_so_far]
        subsets_so_far.extend(new_subsets)
    return subsets_so_far
Then there's a function that returns True if the solution is valid (within budget and with a proper position breakdown) – call it is_valid_solution – and a function that, given a solution, returns the total player value (total_player_value). Assuming that players is the list of available players, the optimal solution is this.
max(filter(is_valid_solution, all_subsets(players)), key=total_player_value)
Now, for a DP, we add a function cull to all_subsets.
def best_subsets(S): # DP
    subsets_so_far = [()]
    for x in S:
        new_subsets = [subset + (x,) for subset in subsets_so_far]
        subsets_so_far.extend(new_subsets)
        subsets_so_far = cull(subsets_so_far) ### This is new.
    return subsets_so_far
What cull does is to throw away the partial solutions that are clearly not going to be missed in our search for an optimal solution. If the partial solution is already over budget, or if it already has too many players at one position, then it can safely be discarded. Let is_valid_partial_solution be a function that tests these conditions (it probably looks a lot like is_valid_solution). So far we have this.
def cull(subsets): # INCOMPLETE!
    # list() keeps the result reusable in Python 3, where filter() returns an iterator
    return list(filter(is_valid_partial_solution, subsets))
The other important test is that some partial solutions are just better than others. If two partial solutions have the same position breakdown (e.g., two forwards and a center) and cost the same, then we only need to keep the more valuable one. Let cost_and_position_breakdown take a solution and produce a string that encodes the specified attributes.
def cull(subsets):
    best_subset = {} # empty dictionary/map
    for subset in filter(is_valid_partial_solution, subsets):
        key = cost_and_position_breakdown(subset)
        if (key not in best_subset or
                total_player_value(subset) > total_player_value(best_subset[key])):
            best_subset[key] = subset
    # return a list so that best_subsets can extend it in Python 3
    return list(best_subset.values())
That's it. There's a lot of optimization to be done here (e.g., throw away partial solutions for which there's a cheaper and more valuable partial solution; modify the data structures so that we aren't always computing the value and position breakdown from scratch and to reduce the storage costs), but it can be tackled incrementally.
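Since the question is about PHP, here is a rough PHP rendering of the same idea. It is a sketch under assumptions, not a tuned solution: every name in it (bestSubsets, cull, bestTeam, the 'name', 'position', 'cost' and 'value' keys, $limits) is hypothetical, players are assumed to be arrays with integer cost and value, and $limits maps each position to the exact number of players required.

```php
<?php
// Sum one numeric field ('cost' or 'value') over all players in a subset.
function totalOf(array $subset, string $field): int
{
    return array_sum(array_column($subset, $field));
}

// Count players per position, sorted by position so the encoding is stable.
function positionCounts(array $subset): array
{
    $counts = array_count_values(array_column($subset, 'position'));
    ksort($counts);
    return $counts;
}

// A partial solution is valid if it is within budget and no position is overfull.
function isValidPartial(array $subset, int $budget, array $limits): bool
{
    if (totalOf($subset, 'cost') > $budget) {
        return false;
    }
    foreach (positionCounts($subset) as $position => $count) {
        if ($count > ($limits[$position] ?? 0)) {
            return false;
        }
    }
    return true;
}

// Keep only the most valuable subset per (cost, position breakdown) key.
function cull(array $subsets, int $budget, array $limits): array
{
    $best = array();
    foreach ($subsets as $subset) {
        if (!isValidPartial($subset, $budget, $limits)) {
            continue;
        }
        $key = totalOf($subset, 'cost') . '|' . json_encode(positionCounts($subset));
        if (!isset($best[$key]) || totalOf($subset, 'value') > totalOf($best[$key], 'value')) {
            $best[$key] = $subset;
        }
    }
    return array_values($best);
}

// Grow subsets one player at a time, culling after every player.
function bestSubsets(array $players, int $budget, array $limits): array
{
    $subsets = array(array());
    foreach ($players as $player) {
        $new = array();
        foreach ($subsets as $subset) {
            $subset[] = $player; // arrays are copied, so the original subset survives
            $new[] = $subset;
        }
        $subsets = cull(array_merge($subsets, $new), $budget, $limits);
    }
    return $subsets;
}

// Pick the most valuable surviving subset that fills every position exactly.
function bestTeam(array $players, int $budget, array $limits): array
{
    $best = null;
    foreach (bestSubsets($players, $budget, $limits) as $subset) {
        if (positionCounts($subset) != $limits) {
            continue;
        }
        if ($best === null || totalOf($subset, 'value') > totalOf($best, 'value')) {
            $best = $subset;
        }
    }
    return $best ?? array();
}
```

A call might look like bestTeam($players, 12, array('G' => 1, 'F' => 2)), which returns the chosen player arrays or an empty array if no full team fits the budget. Recomputing cost and value from scratch for every subset is wasteful; caching those totals alongside each subset is the obvious next step.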

One potential small advantage with regard to composing recursive functions in PHP is that variables are passed by value (meaning a copy is made) rather than reference, which can save a step or two.
Perhaps you could better clarify what you are looking for by including a sample input and output. Here's an example that makes combinations from given groups - I'm not sure if that's your intention... I made the section accessing the partial result allow combinations with less value to be considered if their weight is lower - all of this can be changed to prune in the specific ways you would like.
function make_teams($players, $position_limits, $weights, $values, $max_weight){
$player_counts = array_map(function($x){
return count($x);
}, $players);
$positions = array_map(function($x){
// start every position with an empty list of chosen players
return [];
},$position_limits);
$num_positions = count($positions);
$combinations = [];
$hash = [];
$stack = [[$positions,0,0,0,0,0]];
while (!empty($stack)){
$params = array_pop($stack);
$positions = $params[0];
$i = $params[1];
$j = $params[2];
$len = $params[3];
$weight = $params[4];
$value = $params[5];
// too heavy
if ($weight > $max_weight){
continue;
// the variable, $positions, is accumulating so you can access the partial result
} else if ($j == 0 && $i > 0){
// remember weight and value after each position is chosen
if (!isset($hash[$i])){
$hash[$i] = [$weight,$value];
// end thread if current value is lower for similar weight
} else if ($weight >= $hash[$i][0] && $value < $hash[$i][1]){
continue;
// remember better weight and value
} else if ($weight <= $hash[$i][0] && $value > $hash[$i][1]){
$hash[$i] = [$weight,$value];
}
}
// all positions have been filled
if ($i == $num_positions){
$positions[] = $weight;
$positions[] = $value;
if (!empty($combinations)){
$last = &$combinations[count($combinations) - 1];
if ($weight < $last[$num_positions] && $value > $last[$num_positions + 1]){
$last = $positions;
} else {
$combinations[] = $positions;
}
} else {
$combinations[] = $positions;
}
// current position is filled
} else if (count($positions[$i]) == $position_limits[$i]){
$stack[] = [$positions,$i + 1,0,$len,$weight,$value];
// otherwise create two new threads: one with player $j added to
// position $i, the other thread skipping player $j
} else {
if ($j < $player_counts[$i] - 1){
$stack[] = [$positions,$i,$j + 1,$len,$weight,$value];
}
if ($j < $player_counts[$i]){
$positions[$i][] = $players[$i][$j];
$stack[] = [$positions,$i,$j + 1,$len + 1
,$weight + $weights[$i][$j],$value + $values[$i][$j]];
}
}
}
return $combinations;
}
Output:
$players = [[1,2],[3,4,5],[6,7]];
$position_limits = [1,2,1];
$weights = [[2000000,1000000],[10000000,1000500,12000000],[5000000,1234567]];
$values = [[33,5],[78,23,10],[11,101]];
$max_weight = 20000000;
echo json_encode(make_teams($players, $position_limits, $weights, $values, $max_weight));
/*
[[[1],[3,4],[7],14235067,235],[[2],[3,4],[7],13235067,207]]
*/

How to define trends according to some values?

I am trying to mark some trends, where 1 is the lowest and 5 the highest value.
So for example,
I may have the following case:
5,4,5,5 (UP)
3,4, (UP)
4,3,3 (DOWN)
4,4,4,4, (FLAT - this is OK for all same numbers)
I am planning to accept an unlimited number of ordered values as input, and as output I will just show an (UP), (DOWN), or (FLAT) image.
Any ideas on how I can achieve this?
Sorry if I am not descriptive enough.
Thank you all for your time.

Use a least-squares fit to calculate the "slope" of the values.
function leastSquareFit(array $values) {
$x_sum = array_sum(array_keys($values));
$y_sum = array_sum($values);
$meanX = $x_sum / count($values);
$meanY = $y_sum / count($values);
// calculate sums
$mBase = $mDivisor = 0.0;
foreach($values as $i => $value) {
$mBase += ($i - $meanX) * ($value - $meanY);
$mDivisor += ($i - $meanX) * ($i - $meanX);
}
// calculate slope
$slope = $mBase / $mDivisor;
return $slope;
} // function leastSquareFit()
$trend = leastSquareFit(array(5,4,5,5));
(Untested)
If the slope is positive, the trend is upwards; if negative, it's downwards. Use your own judgement to decide what margin (positive or negative) is considered flat.
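To make the "margin" idea concrete, here is one possible way to turn the slope into the three labels from the question. This is only a sketch: slopeOf() and classifyTrend() are hypothetical names, the default margin of 0.05 is an arbitrary assumption, and a guard is added for inputs with fewer than two values, where the slope formula would divide by zero.

```php
<?php
// Least-squares slope of the values against their positions 0, 1, 2, ...
function slopeOf(array $values): float
{
    $values = array_values($values); // make sure keys are 0..n-1
    $n = count($values);
    if ($n < 2) {
        return 0.0; // a single point has no direction
    }
    $meanX = ($n - 1) / 2;
    $meanY = array_sum($values) / $n;
    $numerator = $denominator = 0.0;
    foreach ($values as $i => $value) {
        $numerator += ($i - $meanX) * ($value - $meanY);
        $denominator += ($i - $meanX) ** 2;
    }
    return $numerator / $denominator;
}

// $flatMargin is an assumed tolerance; slopes within +/- the margin count as FLAT.
function classifyTrend(array $values, float $flatMargin = 0.05): string
{
    $slope = slopeOf($values);
    if ($slope > $flatMargin) {
        return 'UP';
    }
    if ($slope < -$flatMargin) {
        return 'DOWN';
    }
    return 'FLAT';
}

echo classifyTrend(array(5, 4, 5, 5)), "\n"; // UP (slope 0.1)
echo classifyTrend(array(4, 3, 3)), "\n";    // DOWN (slope -0.5)
echo classifyTrend(array(4, 4, 4, 4)), "\n"; // FLAT (slope 0)
```

The first example shows why the margin matters: 5,4,5,5 has a slope of only 0.1, so a margin of 0.1 or more would already label it FLAT.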

A little bit hard to answer based on the limited info you provide, but assuming that:
if there's no movement at all the trend is FLAT,
otherwise, the trend is the last direction of movement,
then this code should work:
$input = array();
$previousValue = false;
$trend = 'FLAT';
foreach( $input as $currentValue ) {
if( $previousValue !== false ) {
if( $currentValue > $previousValue ) {
$trend = 'UP';
} elseif( $currentValue < $previousValue ) {
$trend = 'DOWN';
}
}
$previousValue = $currentValue;
}

For your examples:
Calculate the longest increasing subsequence, A.
Calculate the longest decreasing subsequence, B.
Going by your logic, if A is longer than B, it's an UP; otherwise it's a DOWN.
You will also need a boolean variable that tracks whether all values are equal, to mark a FLAT trend.
Question:
What would the trend be for:
3,4,5,4,3 ?
3,4,4,4,3 ?
1,2,3,4,4,3,2,2,1 ?
The logic might need some alterations depending on what your requirements are.

I'm not sure I fully understand your problem, but I would put the values in an array and use code like this (a PHP sketch):
$values = array(5, 4, 5, 5);
$trend = 'FLAT';
// stop one element early so that $values[$i + 1] always exists
for ($i = 0; $i < count($values) - 1; $i++) {
    if ($values[$i] < $values[$i + 1]) {
        $trend = 'UP';
    } elseif ($values[$i] > $values[$i + 1]) {
        $trend = 'DOWN';
    }
}
EDIT: this obviously only reports the direction of the most recent change.
One could also count how many times the values go up or down and derive the overall trend from those counts.

echo foo(array(5,4,5,5)); // UP
echo foo(array(3,4)); // UP
echo foo(array(4,3,3)); // DOWN
echo foo(array(4,4,4,4)); // FLAT
function foo($seq)
{
if (count(array_unique($seq)) === 1)
return 'FLAT';
$trend = NULL;
$count = count($seq);
$prev = $seq[0];
for ($i = 1; $i < $count; $i++)
{
if ($prev < $seq[$i])
{
$trend = 'UP';
}
if ($prev > $seq[$i])
{
$trend = 'DOWN';
}
$prev = $seq[$i];
}
return $trend;
}

I used the code from @liquorvicar to determine Google search page rank trends, but added some extra trend values to make it more accurate:
nochange - no change
better (higher google position = lower number)
worse (lower google position = higher number)
I also added extra checks for when the last value did not change, taking the previous changes into account, i.e.
worsenochange (no change, previous was worse - lower number)
betternochange (no change, previous was better - lower number)
I used these values to display a range of trend icons:
$_trendIndicator = '<img title="trend" width="16" src="/include/main/images/trend-' . $this->getTrend($_positions) . '-icon.png">';
private function getTrend($_positions)
{
// calculate trend based on last value
//
$_previousValue = false;
$_trend = 'nochange';
foreach( $_positions as $_currentValue ) {
if( $_previousValue !== false ) {
if( $_currentValue > $_previousValue ) {
$_trend = 'better';
} elseif( $_currentValue < $_previousValue ) {
$_trend = 'worse';
}
if ($_trend==='worse' && ($_previousValue == $_currentValue)) {$_trend = 'worsenochange';}
if ($_trend==='better' && ($_previousValue == $_currentValue)) {$_trend = 'betternochange';}
}
$_previousValue = $_currentValue;
}
return $_trend;
}

Function generating Unique Random Values Array

Because MySQL's rand() is slow, I am using an alternative approach based on MySQL's max() and PHP. I wrote this code for fetching random product_ids:
function RandomUniqueArray($min,$max,$limit){
$random_array = array();
if(isset($limit) && is_numeric($limit)){
for($i=0;$i<$limit;){
$rand_val = mt_rand($min, $max);
if(!in_array($rand_val, $random_array)){
$random_array[] = $rand_val;
$i++;
}
}
}
return $random_array;
}
This works as I want: each call gives a different result set with unique values, but it takes 6.232 microseconds.
Another approach I found via Google is:
$random_array = array_rand(array_fill($min,$max, true),$limit);
which takes only 0.101 microseconds, but its result set is repeated: the values within the array are unique, but the whole set is the same on every call. Why is that?
I called the two one after the other as
$random_array = RandomUniqueArray(1,64000,4);
And
$random_array = array_rand(array_fill(1,64000, true),4);
Thank you.

I made a script that takes only about 4.5E-6.
Try it.
function randomValue($min,$max,$limit)
{
    $array = array();
    for($i = 0;$i < $limit;$i++)
    {
        // Draw a candidate and redraw until it is not in the result yet.
        // Note: this never ends if $limit exceeds the size of the range.
        $candidate = mt_rand($min,$max);
        while( in_array($candidate,$array) )
        {
            $candidate = mt_rand($min,$max);
        }
        $array[$i] = $candidate;
    }
    return $array;
}
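The uniqueness check itself can also be made cheaper. in_array(), as used in the question's RandomUniqueArray(), scans the whole result on every draw, whereas storing the drawn numbers as array keys makes duplicates collapse automatically. The following is a small sketch of that variant; randomUniqueKeys() is a hypothetical name, and the range check is an added assumption to avoid an endless loop when more unique values are requested than the range contains.

```php
<?php
// Hypothetical variant: collect random values as array keys so that a
// duplicate draw simply overwrites the existing entry.
function randomUniqueKeys(int $min, int $max, int $limit): array
{
    if ($limit > $max - $min + 1) {
        throw new InvalidArgumentException('Range is too small for that many unique values.');
    }
    $seen = array();
    while (count($seen) < $limit) {
        $seen[mt_rand($min, $max)] = true; // same key twice => still one entry
    }
    return array_keys($seen);
}

// Four distinct values between 1 and 64000, different on each run.
print_r(randomUniqueKeys(1, 64000, 4));
```

This works best when $limit is small compared to the range, as in the 4-out-of-64000 case from the question; when $limit approaches the size of the range, most draws are duplicates and shuffling the full range becomes the better choice.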

Pearson correlation in PHP

I'm trying to implement the calculation of the Pearson correlation coefficient between two sets of data in PHP.
I'm porting the Python script that can be found at this URL:
http://answers.oreilly.com/topic/1066-how-to-find-similar-users-with-python/
My implementation is the following:
class LB_Similarity_PearsonCorrelation implements LB_Similarity_Interface{
public function similarity($user1, $user2){
$sharedItem = array();
$pref1 = array();
$pref2 = array();
$result1 = $user1->fetchAllPreferences();
$result2 = $user2->fetchAllPreferences();
foreach($result1 as $pref){
$pref1[$pref->item_id] = $pref->rate;
}
foreach($result2 as $pref){
$pref2[$pref->item_id] = $pref->rate;
}
foreach ($pref1 as $item => $preferenza){
if(key_exists($item,$pref2)){
$sharedItem[$item] = 1;
}
}
$n = count($sharedItem);
if ($n == 0) return 0;
$sum1 = 0;$sum2 = 0;$sumSq1 = 0;$sumSq2 = 0;$pSum = 0;
foreach ($sharedItem as $item_id => $pre) {
$sum1 += $pref1[$item_id];
$sum2 += $pref2[$item_id];
$sumSq1 += pow($pref1[$item_id],2);
$sumSq2 += pow($pref2[$item_id],2);
$pSum += $pref1[$item_id] * $pref2[$item_id];
}
$num = $pSum - (($sum1 * $sum2) / $n);
$den = sqrt(($sumSq1 - pow($sum1,2)/$n) * ($sumSq2 - pow($sum2,2)/$n));
if ($den == 0) return 0;
return $num/$den;
}
}
To clarify the code: the method fetchAllPreferences returns a set of objects (the rated items), which I turn into arrays for ease of handling.
I'm not sure that this implementation is correct; in particular, I have some doubts about the calculation of the denominator.
Any advice is welcome.
Thanks in advance!

This is my solution:
function php_correlation($x,$y){
if(count($x)!==count($y)){return -1;}
$x=array_values($x);
$y=array_values($y);
$xs=array_sum($x)/count($x);
$ys=array_sum($y)/count($y);
$a=0;$bx=0;$by=0;
for($i=0;$i<count($x);$i++){
$xr=$x[$i]-$xs;
$yr=$y[$i]-$ys;
$a+=$xr*$yr;
$bx+=pow($xr,2);
$by+=pow($yr,2);
}
$b = sqrt($bx*$by);
if($b==0) return 0;
return $a/$b;
}
http://profprog.ru/korrelyaciya-na-php-php-simple-pearson-correlation/

Your algorithm looks mathematically correct but numerically unstable. Finding the sum of squares explicitly is a recipe for disaster. What if you have numbers like array(10000000001, 10000000002, 10000000003)? A numerically stable one-pass algorithm for calculating the variance can be found on Wikipedia, and the same principle can be applied to computing the covariance.
Easier yet, if you don't care much about speed, you could just use two passes. Find the means in the first pass, then compute the variances and covariances using the textbook formula in the second pass.
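For the one-pass route, the running-mean update described for the variance (often credited to Welford) extends to the covariance by keeping a running co-moment next to the two running sums of squared deviations. Below is a minimal sketch of that technique; stableCorrelation() is a hypothetical name, and returning 0 when one series is constant simply mirrors the convention of the code in the question.

```php
<?php
// One-pass Pearson correlation using running means and running
// (co-)moments instead of explicit sums of squares.
function stableCorrelation(array $x, array $y): float
{
    $x = array_values($x);
    $y = array_values($y);
    $n = count($x);
    if ($n !== count($y) || $n < 2) {
        throw new InvalidArgumentException('Need two equally sized arrays with at least two values.');
    }
    $meanX = $meanY = 0.0;
    $m2x = $m2y = $coMoment = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;              // deviation from the old mean of x
        $dy = $y[$i] - $meanY;              // deviation from the old mean of y
        $meanX += $dx / ($i + 1);           // update the running means
        $meanY += $dy / ($i + 1);
        $m2x += $dx * ($x[$i] - $meanX);    // old deviation times new deviation
        $m2y += $dy * ($y[$i] - $meanY);
        $coMoment += $dx * ($y[$i] - $meanY);
    }
    $denominator = sqrt($m2x * $m2y);
    return $denominator == 0.0 ? 0.0 : $coMoment / $denominator;
}

// The large values from the example above, paired with made-up small values.
// (The integer literals assume a 64-bit PHP build.)
echo stableCorrelation(
    array(10000000001, 10000000002, 10000000003),
    array(1, 2, 3)
), "\n";
```

Because every step works with deviations from the current mean, the intermediate numbers stay small even when the raw values are around 10^10, which is exactly where squaring the raw values loses precision.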

try my package here
http://www.phpclasses.org/browse/package/5854.html
