how to rate or rank votes - php

I'm really sorry if my question is off, but I'd like some ideas: I want a ranking algorithm that takes into account the time at which the votes were submitted.

Nice question!
Okay, let's get started!
First of all, one thing you cannot ignore when calculating good ratings is the Bayesian average.
You can read up on it, but, very simplified, it takes care of the following:
Entries with few votes are not rated at the true mean of their votes; their rating also includes a component of the mean rating throughout your dataset. For example, on IMDB the default rating is somewhere around 6.4. So a film with only 2 votes of 10 stars each may still end up somewhere between 6 and 7. The more votes there are, the more meaningful they become altogether, and the rating is "pulled away" from the default. IMDB also requires a minimum number of votes for movies to show up in listings.
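As a rough illustration of that idea, here is a minimal sketch of a Bayesian average. The function name and the parameters $priorMean (the dataset-wide mean) and $priorWeight (how many "virtual votes" the default rating counts for) are assumptions made for this example, not IMDB's actual formula.

```php
<?php
// Hypothetical helper: $priorMean and $priorWeight are tuning values you choose yourself.
// $priorWeight must be greater than 0.
function bayesianAverage(array $votes, float $priorMean, float $priorWeight): float
{
    $n = count($votes);
    // The prior acts like $priorWeight extra votes at the dataset mean.
    return ($priorWeight * $priorMean + array_sum($votes)) / ($priorWeight + $n);
}

// Two 10-star votes against a default of 6.4 that counts as 25 votes
echo round(bayesianAverage([10, 10], 6.4, 25), 2), "\n"; // prints 6.67
```

With few votes the result stays close to the default; with many votes the prior's share shrinks and the real mean dominates.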
Another thing that I find confusing is: Why is the time of the vote important? Didn't you mean the time of the entry that was voted on? So in our movies example just released movies are more important?
But anyway! In both cases good results are often achieved by applying non-linear functions such as logarithms or square roots.
For our movie example, a movie's relevance could be multiplied by
1 + 1/SQRT(1 + CURRENT_YEAR - RELEASE_YEAR)
So 1 is a base rating that every movie gets.
A movie from the current year gets a boost of 100% (200% relevance), as the expression above returns 2. A movie from last year gets 170%, a 2-year-old one 157%, and so on.
But the difference between a movie from 1954 and one from 1963 is not nearly as great.
So remember:
Everything you use in your calculations. Is it really linear? May it distort your ratings? Are the relations throughout the dataset sane?
If you want recent votes to count more, you can do that in exactly the same way, but weight your votes instead. That also makes sense if you want recently voted entries to be "warmed up", for example because they are currently hot and being discussed in your community.
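A minimal sketch of such vote weighting, reusing the 1/SQRT(1 + age) shape from the movie example. The data layout (each vote as a [rating, ageInDays] pair) and the function name are assumptions for illustration only.

```php
<?php
// Each vote is [rating, ageInDays]; this layout is assumed for the example.
function timeWeightedRating(array $votes): float
{
    $weightedSum = 0.0;
    $weightTotal = 0.0;
    foreach ($votes as [$rating, $ageInDays]) {
        // A vote cast today has weight 1, older votes fade out slowly.
        $weight = 1 / sqrt(1 + $ageInDays);
        $weightedSum += $weight * $rating;
        $weightTotal += $weight;
    }
    // No votes: return 0 rather than dividing by zero.
    return $weightTotal > 0 ? $weightedSum / $weightTotal : 0.0;
}
```

Whether days, hours or years are the right unit for the age depends on how quickly your content goes stale, so expect to experiment.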
That being said, it remains hard work, with a lot of playing around.
Let me give you one last idea.
At the company I work we calculate a relevance for movies.
We have a config array where we store the "weighting" of several factors in the final relevance.
It looks like this:
$weights = array(
"year" => 2, // release year
"rating" =>13, // rating 0-100
"cover" => 4, // cover available?
"shortdescription" => 4, // short descr available?
"trailer" => 3, // trailer available?
"imdbpr" => 13, // google pagerank of imdb site
);
Then we calculate a value between 0 and 1 for every metric. There are different methods. But let me show you an example (our rating is itself an aggregated rating of several platforms that we crawl and that have different weightings themselves, etc.):
$yearDiff = date('Y') - $data["year"]; // age of the movie in years
//year
if (!$data["year"]) {
    $values['year'] = 0;
} else if ($yearDiff == 0) {
    $values['year'] = 1;
} else if ($yearDiff < 3) {
    $values['year'] = 0.8;
} else if ($yearDiff < 10) {
    $values['year'] = 0.6;
} else {
    $values['year'] = 1/sqrt(abs($yearDiff));
}
So you see, we hardcoded some "age intervals" and relied on the sqrt function only for older movies. In fact the difference there is minimal, so the SQRT example here is very poor.
But mathematical functions are very often useful!
You can, for example, also use periodic functions like sine curves to calculate seasonal relevance! For example, if your year is mapped to a range from 0 to 1, you can use a sine function to weight up summer hits / winter hits / autumn hits for the current time of the year!
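One way this could look, as a sketch only: the function name, the $peakFraction parameter and the scaling to the 0-1 range are assumptions made for this example.

```php
<?php
// $yearFraction: position in the year, 0.0 = start of January, 1.0 = end of December.
// $peakFraction: where the item's season peaks (0.5 is roughly mid-summer); hypothetical parameter.
function seasonalRelevance(float $yearFraction, float $peakFraction): float
{
    // cos() is a phase-shifted sine; the shift puts the maximum at the peak.
    // Result is 1.0 at the peak and 0.0 half a year away from it.
    return 0.5 + 0.5 * cos(2 * M_PI * ($yearFraction - $peakFraction));
}

// date('z') is the day of the year, starting at 0
$today = date('z') / 365;
echo seasonalRelevance($today, 0.5), "\n";
```

The resulting value can be used like any other 0-1 metric in the weighted sum.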
One last example, for the IMDB pagerank. It is completely hardcoded, as there are only 10 different values possible and they are not distributed in a statistically homogeneous way (pagerank 1 or 2 is even worse than none):
if($imdbpr >= 7) {
    $values['imdbpr'] = 1;
} else if($imdbpr >= 6) {
    $values['imdbpr'] = 0.9;
} else if($imdbpr >= 5) {
    $values['imdbpr'] = 0.8;
} else if($imdbpr >= 4) {
    $values['imdbpr'] = 0.6;
} else if($imdbpr >= 3) {
    $values['imdbpr'] = 0.5;
} else if($imdbpr >= 2) {
    $values['imdbpr'] = 0.3;
} else if($imdbpr >= 1) {
    $values['imdbpr'] = 0.1;
} else if($imdbpr >= 0) {
    $values['imdbpr'] = 0.0;
} else {
    $values['imdbpr'] = 0.4; // no pagerank available. probably new
}
Then we sum it up like this:
$malus = 0;
$weightSum = array_sum($weights); // computed once instead of on every iteration
foreach ($values as $field => $value) {
    $malus += ($value * $weights[$field]) / $weightSum;
}
This may not be an exact answer to your question but rather a broader one. Still, I hope I pointed you in the right direction and gave you some starting points for your own thinking!
Have fun, and good luck with your application!

Reddit's code is open source. There is a pretty good discussion of their ranking algorithm here, with code: http://amix.dk/blog/post/19588

Related

PHP - Optimize finding closest point in an Array

I have created a script which gets a big array of points and then finds the closest point in 3D space based on a limited array of chosen points. It works great. However, sometimes I get over 2 million points to compare against an array of 256 items, so it is over 530 million calculations! This takes a considerable amount of time and power (given that it will run comparisons like this a few times a minute).
I have a limited group of 3D coordinates like this:
array (size=XXX)
0 => 10, 20, 30
1 => 200, 20, 13
2 => 36, 215, 150
3 => ...
4 => ...
... // this is limited to max 256 items
Then I have another very large group of, let's say, random 3D coordinates which can vary in size from 2,500 -> ~ 2,000,000+ items. Basically, what I need to do is to iterate through each of those points and find the closest point. To do that I use Euclidean distance:
d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2}
This gives me the distance and I compare it to the current closest distance, if it is closer, replace the closest, else continue with next set.
I have been looking on how to change it so I don't have to do so many calculations. I have been looking at Voronoi Diagrams then maybe place the points in that diagram, then see which section it belongs to. However, I have no idea how I can implement such a thing in PHP.
Any idea how I can optimize it?
Just a quick shot from the hip ;-)
You should be able to gain a nice speed-up if you don't compare each point to every other point. Many points can be skipped because they are already too far away if you just look at one of the x/y/z coordinates.
<?php
$coord = array(18,200,15);
$points = array(
    array(10,20,30),
    array(200,20,13),
    array(36,215,150)
);
$closestPoint = $closestDistance = false;
foreach($points as $point) {
    list($x,$y,$z) = $point;
    // Not compared yet, use first point as closest
    if($closestDistance === false) {
        $closestPoint = $point;
        $closestDistance = distance($x,$y,$z,$coord[0],$coord[1],$coord[2]);
        continue;
    }
    // If distance in any direction (x/y/z) is bigger than closest distance so far: skip point
    if(abs($coord[0] - $x) > $closestDistance) continue;
    if(abs($coord[1] - $y) > $closestDistance) continue;
    if(abs($coord[2] - $z) > $closestDistance) continue;
    $newDistance = distance($x,$y,$z,$coord[0],$coord[1],$coord[2]);
    if($newDistance < $closestDistance) {
        $closestPoint = $point;
        $closestDistance = $newDistance; // reuse the value computed above
    }
}
var_dump($closestPoint);
function distance($x1,$y1,$z1,$x2,$y2,$z2) {
    return sqrt(pow($x1-$x2,2) + pow($y1 - $y2,2) + pow($z1 - $z2,2));
}
A working code example can be found at http://sandbox.onlinephpfunctions.com/code/8cfda8e7cb4d69bf66afa83b2c6168956e63b51e
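A further small saving, not part of the answer above: since only the ordering of the distances matters, the squared distances can be compared directly and the sqrt() call dropped. A minimal sketch, with closestPoint() as a hypothetical helper name:

```php
<?php
// Returns the entry of $points closest to $coord, or null if $points is empty.
// Both are [x, y, z] arrays; squared distances are compared, so no sqrt() is needed.
function closestPoint(array $coord, array $points): ?array
{
    $closest = null;
    $closestSq = INF;
    foreach ($points as $point) {
        $dx = $point[0] - $coord[0];
        $dy = $point[1] - $coord[1];
        $dz = $point[2] - $coord[2];
        $sq = $dx * $dx + $dy * $dy + $dz * $dz;
        if ($sq < $closestSq) {
            $closestSq = $sq;
            $closest = $point;
        }
    }
    return $closest;
}

var_dump(closestPoint([18, 200, 15], [[10, 20, 30], [200, 20, 13], [36, 215, 150]]));
```

This still visits every candidate, so it only trims the constant factor; a spatial index over the 256 reference points would be the next step if that is not enough.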

Coordinate (x,y) list to be sort with a spiral algorithm

I have a list of coordinates to be sorted with a spiral algorithm. I need to start in the middle of the area and "touch" every coordinate.
To simplify, this is a representation of the (unsorted) list of coordinates (x,y marked with a "dot" on the following image).
CSV list of coordinates is available here.
X increases from left to right
Y increases from TOP to BOTTOM
The coordinates are not adjacent to one another; they are separated by 1 or 2 cells (or more in certain cases).
Starting from the center of the area, I need to touch every coordinate with a spiral movement:
To parse each coordinate I've drafted this PHP algorithm:
//$missing is an associative array having as key the coordinate "x,y" to be touched
$direction = 'top';
$distance = 1;
$next = '128,127'; //starting coordinate
$sequence = array(
    $next
);
unset($missing[$next]);
reset($missing);
$loopcount = 0;
while ($missing) {
    for ($loop = 1; $loop <= 2; $loop++) {
        for ($d = 1; $d <= $distance; $d++) {
            list($x,$y) = explode(",", $next);
            if ($direction == 'top') $next = ($x) . "," . ($y - 1);
            elseif ($direction == 'right') $next = ($x + 1) . "," . ($y);
            elseif ($direction == 'bottom') $next = ($x) . "," . ($y + 1);
            elseif ($direction == 'left') $next = ($x - 1) . "," . ($y);
            if (!empty($missing[$next])) {
                unset($missing[$next]); //missing is reduced every time that I pass over a coordinate to be touched
                $sequence[] = $next;
            }
        }
        if ($direction == 'top') $direction = 'right';
        elseif ($direction == 'right') $direction = 'bottom';
        elseif ($direction == 'bottom') $direction = 'left';
        elseif ($direction == 'left') $direction = 'top';
    }
    $distance++;
}
But as the coordinates are not equidistant from each other, I obtain this output:
As is clearly visible, the movement in the middle is correct, but further out, depending on the coordinate positions, the jumps between coordinates are no longer coherent.
How can I modify my code to obtain an approach like this one, instead?
To simplify/reduce the problem: imagine that the dots on the image above are cities that a salesman has to visit circularly. Starting from the "city" in the middle of the area, the next cities to be visited are the ones nearest the starting point, to its North, East, South and West. The salesman cannot visit any city further out until all the adjacent cities around the starting point have been visited. Every city must be visited exactly once.
Algorithm design
First, free your mind and don't think of a spiral! :-) Then, let's formulate the algorithm's constraints (let's use the salesman's perspective):
I am currently in a city and am looking where to go next. I'll have to find a city:
where I have not been before
that is as close to the center as possible (to keep spiraling)
that is as close as possible to my current city
Now, given these three constraints you can create a deterministic algorithm that creates a spiral (well at least for the given example it should, you probably can create cases that require more effort).
Implementation
First, because we can walk in any direction, let's generally use the Euclidean distance to compute distances.
Then to find the next city to visit:
$nextCost = INF;
$nextCity = null;
foreach ($notVisited as $otherCity) {
    $cost = distance($currentCity, $otherCity) + distance($otherCity, $centerCity);
    if ($cost < $nextCost) {
        $nextCost = $cost;
        $nextCity = $otherCity;
    }
}
// goto: $nextCity
Just repeat this until there are no more cities to visit.
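Put together, the repeated selection could be sketched as follows. The names distance2d() and greedySpiralOrder() and the [x, y] array layout for cities are assumptions made for this example.

```php
<?php
// Euclidean distance between two [x, y] points.
function distance2d(array $a, array $b): float
{
    return sqrt(($a[0] - $b[0]) ** 2 + ($a[1] - $b[1]) ** 2);
}

// $cities: list of [x, y] pairs; $center: the [x, y] point where the walk starts.
// Returns the cities in visiting order.
function greedySpiralOrder(array $cities, array $center): array
{
    $notVisited = $cities;
    $current = $center;
    $order = [];
    while ($notVisited) {
        $nextCost = INF;
        $nextKey = null;
        foreach ($notVisited as $key => $other) {
            // Close to where I am now, and close to the center.
            $cost = distance2d($current, $other) + distance2d($other, $center);
            if ($cost < $nextCost) {
                $nextCost = $cost;
                $nextKey = $key;
            }
        }
        $current = $notVisited[$nextKey];
        $order[] = $current;
        unset($notVisited[$nextKey]);
    }
    return $order;
}
```

Each step scans all remaining cities, so the whole walk is quadratic in the number of cities; for a few thousand points that is usually acceptable.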
To understand how it works, consider the following picture:
I am currently at the yellow circle and we'll assume the spiral up to this point is correct. Now compare the length of the yellow, pink and blue lines. The length of those lines is basically what we compute using the distance functions. You will find that in every case, the next correct city has the smallest distance (well, at least as long as we have as many points everywhere, you probably can easily come up with a counter-example).
This should get you started to implement a solution for your problem.
(Correctness) Optimization
With the current design, you will have to compare the current city to all remaining cities in each iteration. However, some cities are not of interest and even in the wrong direction. You can further optimize the correctness of the algorithm by excluding some cities from the search space before entering the foreach loop shown above. Consider this picture:
You will not want to go to those cities now (to keep spiraling, you shouldn't go backwards), so don't even take their distance into account. Albeit this is a little more complicated to figure out, if your data points are not as evenly distributed as in your provided example, this optimization should provide you a healthy spiral for more disturbed datasets.
Update: Correctness
Today it suddenly struck me and I rethought the proposed solution. I noticed a case where relying on the two euclidean distances might yield unwanted behavior:
It is easily possible to construct a case where the blue line is definitely shorter than the yellow one and thus gets preferred. However, that would break the spiral movement. To eliminate such cases we can make use of the travel direction. Consider the following image (I apologize for the hand-drawn angles):
The key idea is to compute the angle between the previous travel direction and the new travel direction. We are currently at the yellow dot and need to decide where to go next. Knowing the previous dot, we can obtain a vector representing the previous direction of the movement (e.g. the pink line).
Next, we compute the vector to each city we consider and compute its angle to the previous movement vector. If that angle is <= 180 deg (case 1 in the image), then the direction is ok, otherwise not (case 2 in the image).
// initially, you will need to set $prevCity manually
$prevCity = null;
$nextCost = INF;
$nextCity = null;
foreach ($notVisited as $otherCity) {
    // ensure correct travel direction
    $angle = angle(vectorBetween($prevCity, $currentCity), vectorBetween($currentCity, $otherCity));
    if ($angle > 180) {
        continue;
    }
    // find closest city
    $cost = distance($currentCity, $otherCity) + distance($otherCity, $centerCity);
    if ($cost < $nextCost) {
        $nextCost = $cost;
        $nextCity = $otherCity;
    }
}
$prevCity = $currentCity;
// goto: $nextCity
Take care to compute the angle and the vectors correctly. If you need help with that, I can elaborate further, or you can simply ask a new question.
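For reference, one possible way to write the two helpers, as a sketch under assumptions: points are [x, y] arrays, and the angle is the signed turn from the first vector to the second, normalized to the range 0 to 360. Which half of that range counts as "forward" depends on the direction of your spiral and on whether Y grows upwards or downwards, so verify it against a few known points.

```php
<?php
// Vector from point $from to point $to, both [x, y].
function vectorBetween(array $from, array $to): array
{
    return [$to[0] - $from[0], $to[1] - $from[1]];
}

// Angle in degrees, in [0, 360), of the turn from vector $u to vector $v.
// Counter-clockwise when Y grows upwards; on a screen where Y grows downwards
// (as in the question) the same values correspond to clockwise turns.
function angle(array $u, array $v): float
{
    $cross = $u[0] * $v[1] - $u[1] * $v[0];
    $dot   = $u[0] * $v[0] + $u[1] * $v[1];
    $deg = rad2deg(atan2($cross, $dot));
    return $deg < 0 ? $deg + 360 : $deg;
}
```

With Y growing downwards, moving up and then turning right gives angle([0, -1], [1, 0]), which is 90 degrees and therefore passes the <= 180 check for a clockwise spiral.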
The problem seems to be in the if-conditional, when the traversal misses a coordinate, i.e. because of the rounding of the corners. An else branch that reverts to the previously calculated coordinate would fix it.

Calculating overall rating

If I have a series of 10 objects, each with a rating from 1 to 10, how can I calculate the overall rating?
For example if i have a list like this:
Entertainment - 8/10
Fun - 9/10
Comedy - 6/10
Dance - 8/10
and so on, for 10 objects like these. Tell me how to calculate the overall rating out of 10.
Overall - ?/10
I am very weak in maths. I was told by someone to add the total and if I got 83 as the answer, then the overall rating will be 8.3/10. Is this correct?
I am doing this for my PHP website. So if someone knows how to write a query for this, that would be very helpful for me.
Average the individual ratings and you will get the answer.
What you were told is correct if there are 10 criteria being scored.
SELECT avg(score) FROM tbl
There is a built-in function available for it.
Refer to
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_avg
Yes, to get the average, add them all together and divide by the number of ratings. Example:
//do a MySQL query instead of this
$result_out_of_10 = array(
    'fun' => 9,
    'comedy' => 6,
    'dance' => 8
);
$total = 0;
$total_results = 0;
foreach( $result_out_of_10 as $result )
{
    $total += $result;
    $total_results++;
}
$final_average_out_of_10 = $total / $total_results;
print "Average rating: $final_average_out_of_10 out of 10.";
EDIT: Meherzad has a better way - using the MySQL AVG() function - which I didn't know about. Use his way instead (although mine still works, it's more code than necessary).

Calculate average without being thrown by strays

I am trying to calculate an average without being thrown off by a small set of far-off numbers (i.e., in 1,2,1,2,3,4,50 the single 50 will throw off the entire average).
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that major range it's 20.6 (quite different from any of the numbers above).
I am thinking that you can get this like so:
c+d-r
Where c is the count of numbers, d is the number of distinct values, and r is the range. You can then apply this to all the possible ranges, and the highest score marks the optimal range to take the average from.
For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6
If you applied this to the entire number list it would be 8+6-41=-27
I think this works pretty well, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or does someone have a better algorithm altogether?
You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
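A minimal sketch of that idea, assuming a non-empty input array; the function name and the $k parameter (how many standard deviations still count as "close") are assumptions for this example.

```php
<?php
// Mean of the values that lie within $k standard deviations of the plain mean.
// Assumes $values is not empty.
function meanWithoutOutliers(array $values, float $k = 1.0): float
{
    $n = count($values);
    $mean = array_sum($values) / $n;
    $variance = 0.0;
    foreach ($values as $v) {
        $variance += ($v - $mean) ** 2;
    }
    $stdDev = sqrt($variance / $n); // population standard deviation
    $kept = array_filter($values, function ($v) use ($mean, $stdDev, $k) {
        return abs($v - $mean) <= $k * $stdDev;
    });
    // Fall back to the plain mean if the filter removed everything.
    return $kept ? array_sum($kept) / count($kept) : $mean;
}

echo round(meanWithoutOutliers([19, 20, 21, 21, 22, 30, 60, 60]), 2), "\n"; // prints 22.17
```

For the list from the question this drops both 60s but keeps the 30, so the result is 133/6 rather than the 20.6 of the 19-22 range.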
Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number summary often used to figure these things out.
function get_median($arr) {
    sort($arr);
    $n = count($arr);
    $mid = (int) floor($n / 2);
    if ($n % 2 == 0) {
        // even count: average the two middle values
        return ($arr[$mid - 1] + $arr[$mid]) / 2;
    }
    // odd count: the middle value
    return $arr[$mid];
}
function get_five_number_summary($arr) {
    sort($arr);
    $n = count($arr);
    $half = (int) floor($n / 2);
    // lower and upper halves; the middle value is left out when the count is odd
    $lower_half = array_slice($arr, 0, $half);
    $upper_half = array_slice($arr, $n - $half);
    // minimum, lower quartile, median, upper quartile, maximum
    return array($arr[0], get_median($lower_half), get_median($arr), get_median($upper_half), $arr[$n - 1]);
}
function find_outliers($arr) {
    $fns = get_five_number_summary($arr);
    $interquartile_range = $fns[3] - $fns[1];
    $low = $fns[1] - $interquartile_range;
    $high = $fns[3] + $interquartile_range;
    foreach ($arr as $v) {
        if ($v > $high || $v < $low)
            echo "$v is an outlier<br>";
    }
}
//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
Why don't you use the median? It's not 30, it's 21.5.
You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.
You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
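A sketch of such a trimmed mean; the function name and the $trimFraction parameter are assumptions, and the fraction is expected to be below 0.5 so that at least one value survives.

```php
<?php
// $trimFraction: share of values dropped from EACH end (0.05 keeps the middle 90%).
// Assumes a non-empty array and 0 <= $trimFraction < 0.5.
function trimmedMean(array $values, float $trimFraction = 0.05): float
{
    sort($values);
    $n = count($values);
    $drop = (int) floor($n * $trimFraction);
    $middle = array_slice($values, $drop, $n - 2 * $drop);
    return array_sum($middle) / count($middle);
}

echo trimmedMean([1, 2, 1, 2, 3, 4, 50], 0.2), "\n"; // prints 2.4
```

Note that it trims both ends equally, so it also discards the smallest values even when only the large ones are suspicious.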
There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why many statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.

How do I pick a selection of rows with the minimum date difference

The question was difficult to phrase. Hopefully this will make sense.
I have a table of items in my INVENTORY.
Let's call the items Apple, Orange, Pear, Potato. I want to pick a basket of FRUIT (1 x Apple,1 x Orange, 1 x Pear).
Each item in the INVENTORY has a different date for availability. So that...
Apple JANUARY
Apple FEBRUARY
Apple MARCH
Orange APRIL
Apple APRIL
Pear MAY
I don't want to pick the items in the order they appear in the inventory. Instead I want to pick them according to the minimum date range in which all items can be picked. ie Orange & Apple in APRIL and the pear in MAY.
I'm not sure if this is a problem for MYSQL or for some PHP arrays. I'm stumped. Thanks in advance.
If the array of fruits isn't already sorted by date, sort it first.
Now, the simple O(n^2) solution would be to check all possible ranges. Pseudo-code in no particular language:
for (int i = 0; i < inventory.length; ++i) {
    hash basket = {}
    for (int j = i; j < inventory.length; ++j) {
        basket.add(inventory[j]);
        if (basket.size == 3) { // or whatever's the number of fruits
            // found all fruits
            // compare range [i, j] with the best range
            // update best range, if necessary
            break;
        }
    }
}
You may find it's good enough.
Or you could write a slightly more complicated O(n) solution. It's just a sliding window [first, last]. On each step we move the right border (adding one fruit to the basket) and then move the left border as far as possible (dropping fruits that occur more than once in the window).
// assumes inventory contains only the fruits we want in the basket
int first = 0;
min_range = infinity;
hash count = {}; // fruit => occurrences inside the window
for (int last = 0; last < inventory.length; ++last) {
    ++count[inventory[last]]; // grow the window to the right
    while (count[inventory[first]] > 1) { // leftmost fruit is a duplicate: drop it
        --count[inventory[first]];
        ++first;
    }
    if (date[last] - date[first] < min_range
        && count.number_of_nonzero_elements == 3) {
        // found new best answer
        min_range = date[last] - date[first];
    }
}
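Since the question is about PHP, here is one way the sliding-window idea might look there. This is a sketch under assumptions: the inventory is a list of [fruit, month number] rows already sorted by month, and tightestBasket() is a hypothetical helper name.

```php
<?php
// $inventory: rows of [fruit, monthNumber], sorted by month, with keys 0..n-1.
// $wanted: fruit names that must all be in the basket.
// Returns [firstIndex, lastIndex] of the tightest range, or null if no basket exists.
function tightestBasket(array $inventory, array $wanted): ?array
{
    $need = array_flip($wanted);
    $count = [];      // wanted fruit => occurrences inside the window
    $have = 0;        // number of distinct wanted fruits inside the window
    $first = 0;
    $best = null;
    $bestRange = INF;
    foreach ($inventory as $last => [$fruit, $month]) {
        if (!isset($need[$fruit])) {
            continue; // e.g. Potato: not part of the basket
        }
        $count[$fruit] = ($count[$fruit] ?? 0) + 1;
        if ($count[$fruit] === 1) {
            $have++;
        }
        // Shrink from the left while the leftmost row is unwanted or duplicated.
        while ($first < $last) {
            $f = $inventory[$first][0];
            if (isset($need[$f]) && $count[$f] <= 1) {
                break;
            }
            if (isset($need[$f])) {
                $count[$f]--;
            }
            $first++;
        }
        if ($have === count($need)) {
            $range = $month - $inventory[$first][1];
            if ($range < $bestRange) {
                $bestRange = $range;
                $best = [$first, $last];
            }
        }
    }
    return $best;
}
```

For the inventory in the question (Apple 1, Apple 2, Apple 3, Orange 4, Apple 4, Pear 5) this returns the rows for Orange and Apple in April and Pear in May.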
Given that your inventory table is structured as:
fruit, availability
apple, 3 // apples in march
//user picks the availability month maybe?
$this_month = 5 ;
//or generate it for today
$this_month = date('n') ;
// sql
"select distinct fruit from inventory where availability = $this_month";
Sounds quite complicated. The way that I would approach the problem is to group the fruits by availability month and see how many are in each group.
JANUARY (1)
FEBRUARY (1)
MARCH (1)
APRIL (2)
MAY (1)
We can see that the most fruits fall within APRIL, so APRIL is our preferred month.
I would then remove the items from months with duplicates (Apples in your example), which would remove MARCH as an option. This step could either be done now, or after the next step depending on your data and the results you get.
I would then look at the next most popular month and calculate how far away that month is (eg. JAN is 3 away from APRIL, MARCH is 1 etc). If you then had a tie then it shouldn't matter which you choose. In this example though you would end up choosing the 2 fruits from APRIL and 1 fruit from MAY as you requested.
This approach may not work if the most popular month doesn't actually result in the "best" selection.
