I'm designing a site and don't know how to work out the logic for its rating system.
The outcome I want is for an item with 4 stars from 1000 votes to be ranked higher than an item with a single 5-star vote. However, I don't want an item with 1 star from 1000 votes to be ranked higher than an item with 4 stars from 200 votes.
Does anyone have any ideas or advice on what to do?
I found these two questions
Sorting by weighted rating in SQL?
MySQL Rating System - Find Rating
and they have their drawbacks. In the first one, I don't understand what the accepted answer means by "You may want to denormalize this rating value into event for performance reasons if you have a lot of ratings coming in." Could someone share some insight? Thank you!
Here's a quick sketch of such a system, which works by defining a bonus factor xₙ for each star value n. An item's score is then xₙ * stars * votes. According to your question you want:
x₄*4*1000 > x₅*5*1
and
x₁*1*1000 < x₄*4*200
Setting the factors to, for example, x₁=1, x₄=2 and x₅=2 will satisfy this (8000 > 10 and 1000 < 1600), but you will of course want to adjust them and add the missing factors.
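As a rough illustration of the bonus-factor idea, here is a minimal Python sketch. The factors for 1, 4 and 5 stars are the ones from the answer; the factors for 2 and 3 stars and the names `BONUS` and `score` are placeholders made up for this example.

```python
# Bonus factor per star value. Only the entries for 1, 4 and 5 stars come
# from the answer; the values for 2 and 3 stars are illustrative guesses.
BONUS = {1: 1.0, 2: 1.0, 3: 1.5, 4: 2.0, 5: 2.0}


def score(stars: int, votes: int) -> float:
    """Ranking score = bonus factor * star value * number of votes."""
    return BONUS[stars] * stars * votes


# The two requirements from the question:
print(score(4, 1000) > score(5, 1))    # 4 stars / 1000 votes beats 5 stars / 1 vote
print(score(1, 1000) < score(4, 200))  # 1 star / 1000 votes loses to 4 stars / 200 votes
```

Sorting items by `score` in descending order would then give the ranking; the factors still need tuning against real data.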
He means, you should put rating-data into the event-table (and thus have redundant data) to optimize it for performance.
See the wiki for Denormalization: http://en.wikipedia.org/wiki/Denormalization
The data you have to determine the rank of items is:
average rating
number of ratings
The hard part is probably defining the rules for the ranking. For example: if the average rating for an item is > 4 and the number of ratings is < 4, treat it as if it were rated 3.9.
For convenience, I would store this value (the rating to use for ranking purposes) in the item table.
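The example rule above could be sketched like this in Python. The function name `ranking_value` and the parameter names are hypothetical; the thresholds are the ones from the example rule and would need tuning.

```python
def ranking_value(avg_rating: float, num_ratings: int,
                  min_ratings: int = 4, capped_value: float = 3.9) -> float:
    """Return the rating to use for sorting.

    Items with a high average but fewer than `min_ratings` votes are
    treated as if they were rated `capped_value`.
    """
    if avg_rating > 4 and num_ratings < min_ratings:
        return capped_value
    return avg_rating
```

The result is the value that would be stored in the item table and recomputed whenever a new rating arrives.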
I have a PHP rating system (1-5) in which some judges rate some products. I want the results for these products to be fair. What normally happens is that some judges are very strict and rate products only in the range 1-2, while others rate products only in the range 4-5. Some use the full range 1-5 properly.
Can someone give me an idea, or help me create an algorithm, that scales each judge's ratings to account for this and then computes the product score?
I thought of taking the mean of the judges' scores over all products, but is that the way to go, or does someone have a better alternative for getting fair results?
Edit
The rating system is not for an ecommerce application. There are only a few judges, say 10, who rate all the products. The product may be a song in a contest, for example. Some of the judges may be very strict and some very liberal. There may be several contests, so I have to record the ratings of these very strict and very liberal judges across contests and set a rule for them.
Simply put, you assign a weight to a judge based on the range of their typical votes (note, they must not be aware of this weight, or they will throw the system off.) Judges who always vote a single score get the lowest weight. Judges that give things a wide range of scores are considered more accurate.
This also assumes that these judges judge products with a fair range of quality; so if you give them a bunch of good or bad products and expect a range of vote levels, it might be unrealistic.
What you're looking for is the judge with the highest standard deviation (highest variation) in votes having the highest weight, whereas the judge with the lowest would have the least.
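A minimal Python sketch of that weighting, assuming each judge's past votes are available as a list. The function names, the normalisation of weights so they sum to 1, and the fallback to equal weights are choices made for this example, not part of the answer.

```python
from statistics import pstdev


def judge_weights(votes_by_judge: dict) -> dict:
    """Weight each judge by the standard deviation of their past votes."""
    spreads = {judge: pstdev(votes) for judge, votes in votes_by_judge.items()}
    total = sum(spreads.values())
    if total == 0:
        # Every judge always gives the same score: fall back to equal weights.
        return {judge: 1 / len(spreads) for judge in spreads}
    return {judge: spread / total for judge, spread in spreads.items()}


def weighted_score(product_votes: dict, weights: dict) -> float:
    """Weighted mean of one product's votes, given {judge: vote}."""
    total_weight = sum(weights[judge] for judge in product_votes)
    if total_weight == 0:
        # Only zero-weight judges voted: use the plain mean instead.
        return sum(product_votes.values()) / len(product_votes)
    return sum(weights[j] * v for j, v in product_votes.items()) / total_weight


history = {"strict": [1, 2, 1, 2], "liberal": [5, 5, 5, 5], "wide": [1, 3, 5, 2]}
weights = judge_weights(history)
print(weights)
print(weighted_score({"strict": 2, "liberal": 5, "wide": 4}, weights))
```

Note that a judge who always gives the same score ends up with weight 0 here, which is the extreme case of "lowest weight"; a small floor could be added if that is too harsh.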
The non-algorithmic solution is (essentially) to run the algorithm on the judges, and then pick, American Idol style, judges that balance each other off to get what feels like an accurate result. In which case, you'd want to note the average vote as well as the standard deviation, and perhaps set three judges, one with the wide standard deviation, and then two narrows, one high and one low (liberal and strict) to judge it. This way they don't feel like they get 'less voice' because they are stricter or looser.
Then again, that could be an impetus for them to be less/more strict - if they are too easy or too hard on the product consistently they 'lose voice'.
It sounds like you may be trying to apply an algorithmic solution to a non-algorithmic problem. I'd think about why some "judges" vote only 1-2 and others vote only 4-5.
One possible cause could be self-selection. For example, people who bought an item online may be more likely to review the item if they were particularly disappointed or particularly pleased with their purchase. If this is your problem, you could try to encourage shoppers to vote more, so that even those who had a non-extreme experience come back to vote.
Another possible issue may be guidance. Maybe your explanation of the rating system isn't clear to the judges. You can try to add a description of what each rating means, and see if that improves the quality of data.
In summary, any kind of a solution to your rating problem will need to have a "human" component and take into account the full story of how the judges choose ratings and why. There is not a whole lot that a ranking algorithm can do if your input data is poor quality. On the other hand, if your data has decent quality, then taking a mean works quite well.
One unrelated problem with taking a mean is that an item with one 5-star rating will rank above an item with a hundred 5-star ratings plus one 4-star rating. One simple solution is Laplace smoothing, which addresses the problem by effectively starting every item with one vote of each value (1, 2, 3, 4, 5). You don't display the "smoothed" values, but you use them when sorting. See the How Not To Sort By Average Rating post for an alternate solution.
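A minimal Python sketch of that smoothing, assuming the votes for an item are stored as five counters (one per star value). The function name `smoothed_mean` is made up for this example.

```python
def smoothed_mean(counts: list) -> float:
    """Mean star rating after adding one pseudo-vote for each star value.

    counts[i] is the number of real votes with (i + 1) stars.
    """
    smoothed = [c + 1 for c in counts]
    total_votes = sum(smoothed)
    total_stars = sum(stars * c for stars, c in enumerate(smoothed, start=1))
    return total_stars / total_votes


one_five_star = [0, 0, 0, 0, 1]
hundred_fives_one_four = [0, 0, 0, 1, 100]
print(smoothed_mean(one_five_star))           # about 3.33
print(smoothed_mean(hundred_fives_one_four))  # about 4.90
```

With the plain mean the first item would score 5.0 and win; with the smoothed value the heavily rated item sorts first.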
How about truncated mean? Here is a good explanation of the idea.
EDIT
Let's say you have votes like: [1,4,3,2,5,1,1,3,2,4].
You need to sort the array in ascending order, giving you: [1,1,1,2,2,3,3,4,4,5].
Then let's say you want to get rid of 25% of the votes at each end, which is 3 votes (2.5 rounded up). You simply discard three votes from the left and three from the right, giving you [2,2,3,3].
Then, use arithmetic mean to get 2.5.
EDIT 2
Depending on your database schema, you could query the database to return the votes in ascending order. Then work out how many votes the percentage corresponds to and use array_slice() to cut them off both ends (read the documentation). After that, calculating the arithmetic mean is the easy part.
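The steps from the worked example can be sketched in Python as follows (the original suggestion is PHP with array_slice(); Python's list slicing plays the same role here). The function name and the default trim fraction are assumptions for this example.

```python
import math


def truncated_mean(votes: list, trim_fraction: float = 0.25) -> float:
    """Mean of the votes after dropping `trim_fraction` of them from each end."""
    ordered = sorted(votes)
    cut = math.ceil(len(ordered) * trim_fraction)  # votes to drop per side, rounded up
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("trim_fraction removes every vote")
    return sum(kept) / len(kept)


print(truncated_mean([1, 4, 3, 2, 5, 1, 1, 3, 2, 4]))  # 2.5
```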
Is there a way to specify a sorting procedure for ORDER BY, or some kind of custom logic? What I need is to check some other data for the column being ordered, which is also in the row. For example if one column has a higher value than another, but a certain condition isn't met, it's sorted as lower. Right now I pull all the data in the column, sort it in PHP with usort(), and then paginate it, but this is a pretty bad performance hog. I would really like to move it into MySQL, is it possible? If so, how? :P
Thanks in advance!
An example of the problem is on the website here - the records get sorted by win percentage, but players who have played only 1 game end up on top with a 100% win rate. I'd like to set a threshold on games played and sort players below it lower, even though their win percentage is higher.
You can order by multiple expressions:
ORDER BY games_played < 10, wins / losses DESC
The first expression sorts all the players who have played 10 or more games above all the players who have played fewer than 10 games, because the comparison evaluates to 0 for the former and 1 for the latter. The second expression sorts by win/loss ratio and is only used to tie-break rows that were equal for the first expression. This means that a player who has played 10 games will always appear above a player who has played only 9 games, regardless of their win/loss ratios.
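For readers who want to see the same two-level ordering outside SQL, here is a small Python sketch using a tuple sort key. The sample data and field names are invented, and it sorts on wins / games_played rather than wins / losses so that an undefeated player does not cause a division by zero; that substitution is an assumption of this example.

```python
players = [
    {"name": "A", "games_played": 1, "wins": 1},
    {"name": "B", "games_played": 20, "wins": 12},
    {"name": "C", "games_played": 15, "wins": 12},
]


def sort_key(player: dict) -> tuple:
    below_threshold = player["games_played"] < 10   # False sorts before True
    win_rate = player["wins"] / player["games_played"]
    return (below_threshold, -win_rate)             # negate for descending order


ranked = sorted(players, key=sort_key)
print([p["name"] for p in ranked])  # ['C', 'B', 'A']
```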
There are a few hundred book records in the database and each record has a publish time. For the homepage of the website, I am required to write some code to randomly pick 10 books and display them there. The requirement is that newer books need to have a higher chance of being displayed.
Since the time is an integer, I am thinking like this to calculate the probability for each book:
Probability of a book being drawn = (current time - publish time of the book) / ((current time - publish time of book 1) + (current time - publish time of book 2) + ... + (current time - publish time of book n))
After a book is drawn, the next round of the loop subtracts that book's (current time - publish time) from the denominator and recalculates the probability for each of the remaining books. The loop continues until 10 books have been drawn.
Is this algorithm a correct one?
By the way, the website is written in PHP.
Feel free to suggest some PHP codes if you have a better algorithm in your mind.
Many thanks to you all.
Here's a very similar question that may help: Random weighted choice The solution is in C# but the code is very readable and close to PHP syntax so it should be easy to adapt.
For example, here's how one could do this in MySQL:
First calculate the total age of all books and store it in a MySQL user variable:
SELECT SUM(TO_DAYS(CURDATE())-TO_DAYS(publish_date)) FROM books INTO @total;
Then choose books randomly, weighted by their age:
SELECT book_id FROM (
SELECT book_id, publish_date, TO_DAYS(CURDATE())-TO_DAYS(publish_date) AS age FROM books
) b
WHERE book_id NOT IN (...list of book_ids chosen so far...)
AND RAND()*@total < b.age AND (@total:=@total-b.age)
ORDER BY b.publish_date DESC
LIMIT 10;
Note that @total decreases only if a book has passed the random-selection test, because of short-circuiting of AND expressions.
This is not guaranteed to choose 10 books in one pass -- it's not even guaranteed to choose any books on a given pass. So you have to re-run the second step until you've found 10 books. The @total variable retains its decreased value so you don't have to recalculate it.
First off, I think your formula will favor older books, because a book's weight grows with its age. Try to set your initial probabilities based on:
Age - days since publication
Max(Age) - age of the oldest book in the sample
Age(i) - age of book i
... Prob(i) = [Max(Age) + e - Age(i)] / sum over all j of [Max(Age) + e - Age(j)]
The value e (a small positive constant) ensures that the oldest book has some probability of being selected. Once that is done, you can always recalculate the probabilities for any remaining sample.
Now you have to find an UNBIASED way of picking books. Probably the best way would be to calculate the cumulative distribution using the above, then draw a uniform (0,1) random variable. Find where that value falls in the cumulative distribution and pick the corresponding book.
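Here is a rough Python sketch of the cumulative-distribution approach using the Prob(i) formula above. The function names, the default e = 1.0, and the choice to recompute the weights over the remaining books after each draw are assumptions made for this example.

```python
import bisect
import itertools
import random


def recency_weights(ages: list, e: float = 1.0) -> list:
    """Prob(i) = (max_age + e - age_i) / sum_j (max_age + e - age_j)."""
    max_age = max(ages)
    raw = [max_age + e - age for age in ages]
    total = sum(raw)
    return [w / total for w in raw]


def draw_books(ages: list, k: int, e: float = 1.0, rng=random) -> list:
    """Draw up to k distinct book indexes, newer books being more likely."""
    remaining = list(range(len(ages)))
    chosen = []
    while remaining and len(chosen) < k:
        weights = recency_weights([ages[i] for i in remaining], e)
        cumulative = list(itertools.accumulate(weights))
        r = rng.random() * cumulative[-1]
        # First position whose cumulative weight exceeds r.
        pos = min(bisect.bisect_right(cumulative, r), len(remaining) - 1)
        chosen.append(remaining.pop(pos))
    return chosen


ages_in_days = [1, 10, 100, 400]
print(recency_weights(ages_in_days))
print(draw_books(ages_in_days, 3))
```

The returned values are positions in the `ages` list, so they can be mapped back to book ids by the caller.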
Can't help you on the coding. Make sense?
What I am hoping to achieve is the ability to generate 'teams' of users. I will have x amount of men, weighted (decimal skill weight, like 75.23) and y amount of women (also with a skill weight value).
Given that list of users, I would then take for input the number of teams to make (let us say, 6 teams). Then, I go through the list of x's and y's and organize them so that the best average possible weighted teams are created. I would like to keep the teams balanced (women and men ratio)
I don't want 'stacked' teams, (best skilled in one team). I would like an even distribution of weight.
Curious how I could achieve this in PHP? I'd be using a MySQL database to fetch users with weight values. I would know ahead of time how many users I would have, also how many teams I would want to generate.
I would appreciate any suggestions, or links to a solution if anyone has found something similar like this. I'm just not a math wiz, so I don't know what formula would apply here.
Thanks. I appreciate any input!
EDIT
After reviewing the answers, maybe I was not clear enough, so hopefully this helps a little more.
I want the teams to be roughly equally-sized
I want the average (mean) skill score for each team to be roughly equal
I want the ratio of men to women in each team to be roughly equal (that is to say, if the division gives us a distribution of 5 men and 3 women per team, I would like to keep that roughly the same). Not really an issue if I sort men first and women second (or vice versa).
I don't want a linear approach (team 1 gets highest, team 2, sec highest, team 3.. so on). Tim's method of taking (if 6 teams) 6 people and randomizing and then distributing via linear fashion seems to work out fine.
I'm not entirely clear what you're after here, so I'll recap on what I understand you to be asking. If this is not right, you can clarify your requirements by editing your question:
You have a list of a certain number of men and a certain number of women. Each person has a known skill score. You want to divide these into a certain number of teams, with the following aims:
you want the teams to be roughly equally-sized
you want the average (mean) skill score for each team to be roughly equal
you want the ratio of men to women in each team to be roughly equal
I would have thought that a simple method to achieve this would be:
Create a list of all the men in decreasing order of skill score.
Create a list of all the women in decreasing order of skill score.
Add the list of women to the end of the list of men.
Start at the beginning of the combined list, and allocate each person in turn to a team in a round-robin fashion. (That is to say, allocate the first person to team number one, the second to team number two, and so on until you have allocated one person to each of the teams you wish to create. Then start again with team one, allocating people to each team in order, and so on.)
With this approach, you will be guaranteed the following outcomes:
If possible (i.e. if the number of teams divides the total number of people), the teams will all have the same number of people.
If the teams are not all the same size, the largest team will have exactly one more person than the smallest team.
If possible the teams will all have the same number of men.
If the teams do not have the same number of men, the team with the most men will have exactly one more man than the team with the fewest men.
If possible the teams will all have the same number of women.
If the teams do not have the same number of women, the team with the most women will have exactly one more woman than the team with the fewest women.
Each team will have men with a range of skill scores, from near the top of the range to near the bottom of the range.
Each team will have women with a range of skill scores, from near the top of the range to near the bottom of the range.
With sensible data, the mean skill score for each team will be roughly equal (although team one will have a slightly higher mean score than team two, and so on - there are ways of correcting this).
If this simple approach doesn't meet your requirements, please let us know what else you had in mind.
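The round-robin allocation described above could be sketched in Python like this. People are represented as (name, skill) tuples and the function name `make_teams` is invented for the example; fetching the rows from MySQL is left out.

```python
def make_teams(men: list, women: list, num_teams: int) -> list:
    """Deal people into teams round-robin: men first, then women,
    each group sorted by decreasing skill score."""
    by_skill_desc = lambda person: -person[1]
    ordered = sorted(men, key=by_skill_desc) + sorted(women, key=by_skill_desc)
    teams = [[] for _ in range(num_teams)]
    for index, person in enumerate(ordered):
        teams[index % num_teams].append(person)
    return teams


men = [("m1", 91.5), ("m2", 75.23), ("m3", 60.0), ("m4", 55.1)]
women = [("w1", 88.0), ("w2", 70.4)]
for number, team in enumerate(make_teams(men, women, 2), start=1):
    average = sum(skill for _, skill in team) / len(team)
    print(number, [name for name, _ in team], round(average, 2))
```

As noted in the answer, team one tends to end up slightly stronger with this plain dealing order; reversing the dealing direction on every other round is one common correction.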
This is similar to "maximum/minimum weight perfect matching", just that the matching is for more than two elements (note that this is a different weight from what you have (the skill weight), namely, you would assign a weight to a matching (a matching would be a proposed 'team')).
The known algorithms for the perfect matching above (e.g., Edmonds' algorithm) might not be adaptable to the group case. I would perhaps look into some simulated annealing technique or a simple genetic algorithm.
If the number of people in each group (x, y) is relatively even, and the total number of people is relatively high, random sampling should work quite well. See here for how to select random rows from a MySQL database:
http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_rand
Slight edit: to ensure fairness, personally I'd do something like this. Say you know you want n members per team. Then create a local variable which is n*mean, where mean is the average skill level per person. Then, when you're randomly selecting your team members, do so within that limit.
E.g.
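// Sketch only: $pool is a shuffled list of people, each with a 'skill' value;
// $n is the team size and $mean is the average skill per person.
$team = [];
$teamSkill = 0;
foreach ($pool as $person) {
    if ($teamSkill + $person['skill'] > $n * $mean) {
        continue; // this person would push the team over the limit, try the next one
    }
    $team[] = $person;
    $teamSkill += $person['skill'];
    if (count($team) == $n) {
        break; // team is complete
    }
}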
I have a news system where you can rate news items with 1 to 5 stars. In the database I save the count, the sum, and the absolute rating as an int up to 100 (for HTML output, so 5 stars would be 100 percent and 1 star would be 20 percent).
Now I have three top lists:
Best Rated
Most viewed
Most commented
Last two ones are simple, but the first is kinda tricky.
Before I took the project over it was all a big mess, and they just put the 5 best-rated news items there. So if there was a news item rated 4.995 with 100k votes and another one rated 5 stars with 1 vote, the "better rated" one was on top, even though that is obviously ridiculous.
For the time being I have capped the list so that only news items with a certain number of votes (like 10 or 20) can be in the list.
But I do not really like that. Is there a nice method to give those ratings a "weight" based on the count or something like that?
Have you considered using a weighted bayesian rating system? It'll weight the results based on the number of votes and the vote values themselves.
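One common form of such a Bayesian average blends each item's own votes with a prior, as in this minimal Python sketch. The function name, the prior mean of 3.5 and the prior weight of 20 votes are assumptions for illustration; in practice the prior mean is often the average rating across all items.

```python
def bayesian_average(rating_sum: float, rating_count: int,
                     prior_mean: float = 3.5, prior_weight: float = 20) -> float:
    """Blend an item's ratings with `prior_weight` imaginary votes at `prior_mean`."""
    return (prior_weight * prior_mean + rating_sum) / (prior_weight + rating_count)


# 100k votes averaging 4.995 versus a single 5-star vote
print(bayesian_average(rating_sum=499500, rating_count=100000))  # stays close to 4.995
print(bayesian_average(rating_sum=5, rating_count=1))            # pulled towards 3.5
```

Because the count and the sum are already stored per news item, this value can be computed at query time or stored alongside them.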
You could explore the statistical confidence in the rating perhaps based around the average rating received for all entries and the standard deviation of all votes. While an entry has an average rating of 5, if you only have a few votes then you may not be able to say with more than 90% confidence that the actual rating is above 4.7 say. You can then rate the entries based upon the rating for which you have 90% confidence.
I'm not sure if this meets your requirement of being simple.
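A very rough Python sketch of the confidence idea, under the assumption that a normal approximation is acceptable: the lower bound is the item's mean minus z standard errors, with the standard error estimated from the standard deviation of all votes. The function name and the one-sided 90% value z = 1.2816 are choices made for this example.

```python
import math


def lower_confidence_bound(item_mean: float, num_votes: int,
                           global_sd: float, z: float = 1.2816) -> float:
    """Approximate rating we are about 90% confident the true rating exceeds."""
    return item_mean - z * global_sd / math.sqrt(num_votes)


print(lower_confidence_bound(5.0, 3, global_sd=1.0))     # few votes, wide margin
print(lower_confidence_bound(4.9, 1000, global_sd=1.0))  # many votes, narrow margin
```

Sorting by this bound instead of the raw mean pushes sparsely rated items down without a hard vote cut-off.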
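You could use the median of the user ratings as the overall rating.
You would have five fields with each article, each one containing how many times the article was rated with n stars. To find the median, add up the counts starting from the 1-star field until you reach half of the total number of votes; the star value at which that happens is your rating. (Simply selecting the field with the biggest count would give the mode rather than the median.) The median has the advantage of ignoring the outliers in the ratings.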
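A minimal Python sketch of reading the median from five per-star counters. The function name is made up, and for an even number of votes it returns the lower of the two middle values, which is a simplification chosen for this example.

```python
from typing import Optional


def median_stars(counts: list) -> Optional[int]:
    """Median star value, where counts[i] is the number of (i + 1)-star votes.

    Returns None when there are no votes yet.
    """
    total = sum(counts)
    if total == 0:
        return None
    running = 0
    for stars, count in enumerate(counts, start=1):
        running += count
        if running * 2 >= total:  # reached the halfway point
            return stars
    return None


print(median_stars([1, 0, 0, 0, 10]))  # 5: the single 1-star outlier is ignored
```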