Order By Two Columns - Using Highest Rating Average with Most Ratings - php

I would like to show the items with the highest average rating (rating_avg) AND the highest number of ratings (rating_count). With my current script, it shows the highest average rating (DESC) regardless of how many ratings there are, which is useless for my visitors.
For example it shows:
Item 1 - 5.0 (1 Ratings)
Item 2 - 5.0 (2 Ratings)
When it should be showing the top 10 highest-rated items by rating average and number of ratings, such as:
Item 1 - 4.5 (356 Ratings)
Item 2 - 4.3 (200 Ratings)
Item 3 - 4.0 (400 Ratings)
This is what I have right now:
$result = mysql_query("SELECT id, filename, filenamedisplay, console_dir, downloads, rating_avg, rating_count FROM files WHERE console_dir = '".$nodash."' ORDER BY rating_avg DESC LIMIT 10");
Thanks and I appreciate any help in advance!

This is a subtle problem and really a question of statistics. What I often do is downgrade each rating by one standard error of the proportion. These aren't exactly proportions, but I think the same idea can be applied.
You can calculate this using the "square root of p*q divided by n" method. If you don't understand this, google "standard error of a proportion" (or I might suggest the third chapter of "Data Analysis Using SQL and Excel", which explains this in more detail):
SELECT id, filename, filenamedisplay, console_dir, downloads, rating_avg, rating_count
FROM files cross join
(select count(*) as cnt from files where console_dir = '".$nodash."') as const
WHERE console_dir = '".$nodash."'
ORDER BY rating_avg/5 - sqrt((rating_avg/5) * (1 - rating_avg/5) / const.cnt) DESC
LIMIT 10;
In any case, see if the formula works for you.
EDIT:
Okay, let's change this to the standard error of the mean. I should have done this the first time through, but I was thinking the rating_avg was a proportion. The formula is the standard deviation divided by the square root of the sample size. We can get the population standard deviation in the const subquery:
(select count(*) as cnt, stddev(rating_avg) as std from files where console_dir = '".$nodash."') as const
This results in:
order by rating_avg - const.std / sqrt(const.cnt) desc
This might work, but I would rather have the standard deviation within each group than the overall population standard deviation. Still, it derates the rating by an amount that shrinks as the sample size grows, which should improve your results.
By the way, the idea of removing one standard deviation is rather arbitrary. I've just found that it produces reasonable results. You might prefer to take, say, 1.96 times the standard deviation to get a 95% lower bound on the confidence interval.
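For reference, here is a sketch of how the whole derated query from the EDIT could sit inside your PHP. It assumes MySQL's STDDEV() aggregate and your existing files table, and it uses PDO with bound parameters instead of the deprecated mysql_* interpolation, so treat the $pdo connection and the placeholder names as assumptions to adapt:
<?php
// Sketch only: assumes a PDO connection in $pdo and the files table described above.
$sql = "SELECT f.id, f.filename, f.filenamedisplay, f.console_dir,
               f.downloads, f.rating_avg, f.rating_count
        FROM files f
        CROSS JOIN (SELECT COUNT(*) AS cnt, STDDEV(rating_avg) AS std
                    FROM files
                    WHERE console_dir = :dir1) AS const
        WHERE f.console_dir = :dir2
        ORDER BY f.rating_avg - const.std / SQRT(const.cnt) DESC
        LIMIT 10";
$stmt = $pdo->prepare($sql);
// Two placeholder names because native prepares do not allow reusing one name.
$stmt->execute([':dir1' => $nodash, ':dir2' => $nodash]);
$top10 = $stmt->fetchAll(PDO::FETCH_ASSOC);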

Related

time decay factor for posts / updates in newsfeed using neo4j

I am using Neo4j to retrieve a news feed with this query.
MATCH (u:Users {user_id:140}),(p:Posts)-[:CREATED_BY]->(pu:Users)
WHERE (p)-[:CREATED_BY]->(u) OR (p:PUBLIC AND (u)-[:FOLLOW]->(pu)) OR
(p:PRIVATE AND (p)-[:SHARED_WITH]->(u))
OPTIONAL MATCH (p)-[:POST_MEDIA]->(f)
OPTIONAL MATCH (p)-[:COMMENT]->(c)<-[:COMMENT]-(u3) RETURN
(p.meta_score + 0.2*p.likes + 0.1*p.dislikes + 10/(((".time()." - p.created_time)/3600) + 0.1)) as score,
{user_id:pu.user_id,firstname:pu.firstname,lastname:pu.lastname,
profile_photo:pu.profile_photo,username:pu.username} as pu, p,
collect({user_id:u3.user_id,profile_photo:u3.profile_photo,text:c.text}) as comment,
collect(f) as file ORDER BY score DESC,
p.post_id DESC LIMIT 25
In the equation for the score, right now I am mainly using p.meta_score + 0.1*p.likes - 0.05*p.dislikes + 10/(((current_time - p.created_time)/3600) + 0.1). I have added the 0.1 to prevent a division-by-zero / infinity error, since current_time may be nearly equal to the post's created_time (p refers to the post node).
This works nicely for a single day, but after a day the time part no longer contributes well to the total score, because the way I am calculating the time decay factor is not consistent. I need an equation that plays its role consistently: it should decrease the score at a lower rate for the first seven days and then start decreasing its contribution at a higher rate. One option was trigonometry's tan or cot functions, but the problem is that after some intervals they change sign. I shall be thankful to everybody who gives me further suggestions.
At a basic level, it is common to use an exponential time decay function here. Something like:
score = score / elapsedTime^2
As the elapsed time since the post increases, the score falls off quickly (here by the square of the elapsed time). Sites like Reddit and Hacker News use much more complicated algorithms, but that is the basic idea.
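If it helps, here is a minimal PHP sketch of that idea, plus a variant that stays roughly flat for the first week and then drops off faster, which is closer to what you describe. The function names and constants are my own assumptions, not anything Neo4j provides:
<?php
// Sketch: the basic decay from the answer above (score divided by elapsed time squared).
function decayedScore(float $baseScore, int $createdTime): float {
    $hours = max((time() - $createdTime) / 3600, 0.1); // 0.1 avoids division by zero
    return $baseScore / ($hours * $hours);
}

// Variant: nearly flat for roughly the first 7 days, then dropping quickly.
// The 7.0-day knee and the exponent 3 are arbitrary knobs to tune.
function decayedScoreWithGrace(float $baseScore, int $createdTime): float {
    $days = max((time() - $createdTime) / 86400, 0.0);
    return $baseScore / (1 + pow($days / 7.0, 3));
}
The same arithmetic can also be moved into the score expression of your RETURN clause if you would rather compute it in the query.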

MYSQL sorting content by rating logic and opinion?

I'm designing a site and don't know how to design the logic of the rating system.
The outcome I want is that an item with 4 stars and 1000 votes ranks higher than an item with 5 stars and 1 vote. However, I don't want an item with 1 star and 1000 votes to be ranked higher than an item with 4 stars and 200 votes.
Anyone have any ideas or advice on what to do?
I found these two questions
Sorting by weighted rating in SQL?
MySQL Rating System - Find Rating
and they have their drawbacks, and in the first one I don't understand what the accepted answer means by "You may want to denormalize this rating value into event for performance reasons if you have a lot of ratings coming in." Please share some insight? Thank you!
Here's a quick sketch of such a system, which works by defining a bonus factor xₙ for each star value. According to your question you want:
x₄*4*1000 > x₅*1*5
and
x₁*1*1000 < x₄*4*200
Setting the factors to, for example, x₁=1, x₄=2 and x₅=2 will satisfy this, but you will of course want to adjust them and add the missing factors.
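As a rough PHP sketch of that idea (the factor values below are placeholders to tune until both of your constraints hold):
<?php
// Sketch: rank items by average rating * number of votes, scaled by a per-star factor.
$factors = [1 => 1.0, 2 => 1.2, 3 => 1.5, 4 => 2.0, 5 => 2.0];

function rankScore(float $avgRating, int $numVotes, array $factors): float {
    $star = max(1, min(5, (int) round($avgRating)));   // nearest whole star, clamped to 1..5
    return $factors[$star] * $avgRating * $numVotes;
}

// e.g. 4 stars / 1000 votes => 2.0*4*1000 = 8000 beats 5 stars / 1 vote => 2.0*5*1 = 10,
// while 1 star / 1000 votes => 1.0*1*1000 = 1000 stays below 4 stars / 200 votes => 2.0*4*200 = 1600.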
He means, you should put rating-data into the event-table (and thus have redundant data) to optimize it for performance.
See the wiki for Denormalization: http://en.wikipedia.org/wiki/Denormalization
The data you have to determine the rank of items is:
average rating
number of ratings
The hard part is probably making the rules for the ranking. For example: if the average rating for an item is > 4 and the number of ratings is < 4, treat it as if it were rated 3.9.
For convenience, I would put this value (how to treat the items for ranking) in the item-table.
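A small PHP sketch of such a rule, using the thresholds from the example above (tune them to taste before storing the result in the item table):
<?php
// Sketch: the rating value to rank by, derived from average rating and rating count.
function rankingRating(float $avgRating, int $numRatings): float {
    if ($avgRating > 4 && $numRatings < 4) {
        return 3.9;        // great average but too few ratings: treat it as 3.9
    }
    return $avgRating;     // otherwise rank by the plain average
}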

Weighted randomness. How could I give more weight to rows that have just been added to the database?

I'm retrieving 4 random rows from a table. However, I'd like it such that more weight is given to rows that had just been inserted into the table, without penalizing older rows much.
Is there a way to do this in PHP / SQL?
SELECT *, (RAND() / id) AS o FROM your_table ORDER BY o LIMIT 4
This will order by o, where o is a random value between 0 and 1/id, which means the newer your row (the higher its id), the lower its o value will tend to be (but still random), so newer rows are more likely to make it into the LIMIT 4.
I think an agreeable solution would be to use an asymptotic function (1/x) in combination with weighting.
The following has been tested:
SELECT *, (RAND()*10 + (1/(max_id - id + 1))) AS weighted_random
FROM tbl1
ORDER BY weighted_random DESC
LIMIT 4
If you want to get the max_id within the query above, just replace max_id with:
(SELECT id FROM tbl1 ORDER BY id DESC LIMIT 1)
Examples:
Let's say your max_id is 1000 ...
For each of several ids I will calculate out the value:
1/(1000 - id + 1), which simplifies to 1/(1001 - id):
id: 1000
1/(1001-1000) = 1/1 = 1
id: 999
1/(1001-999) = 1/2 = .5
id: 998
1/(1001-998) = 1/3 = .333
id: 991
1/(1001-991) = 1/10 = .1
id: 901
1/(1001-901) = 1/100 = .01
The nature of this 1/x makes it so that only the numbers close to max have any significant weighting.
You can see a graph of + more about asymptotic functions here:
http://zonalandeducation.com/mmts/functionInstitute/rationalFunctions/oneOverX/oneOverX.html
Note that the right side of the graph with positive numbers is the only part relevant to this specific problem.
Manipulating our equation to do different things:
(Rand()*a + (1/(b*(max_id - id + 1/b))))
I have added two values, "a", and "b"... each one will do different things:
The larger "a" gets, the less influence order has on selection. It is important to have a relatively large "a", or pretty much only recent ids will be selected.
The larger "b" gets, the more quickly the asymptotic curve will decay to insignificant weighting. If you want more of the recent rows to be weighted, I would suggest experimenting with values of "b" such as: .5, .25, or .1.
The 1/b at the end of the equation offsets problems you have with smaller values of b that are less than one.
Note:
This is not a very efficient solution when you have a large number of ids (just like the other solutions presented so far), since it calculates a value for each separate id.
... ORDER BY (RAND() + 0.5 * id/maxId) DESC
This will add half of the id/maxId ratio to the random value, i.e. for the newest entry 0.5 is added (as id/maxId = 1) and for the oldest entry almost nothing is added; ordering descending then favors the newer rows.
Similarly you can also implement other weighting functions. This depends on how exactly you want to weight the values.
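For completeness, a minimal PHP sketch of running a query like the last one (the $pdo connection and the your_table name are assumptions; the 0.5 weight is the knob to play with):
<?php
// Sketch: fetch 4 random rows, biased toward newer (higher-id) rows.
$sql = "SELECT *,
               (RAND() + 0.5 * id / (SELECT MAX(id) FROM your_table)) AS weight
        FROM your_table
        ORDER BY weight DESC
        LIMIT 4";
$rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);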

Adding an extra factor (number of clicks) to a Bayesian ranking system

I run a music website for amateur musicians where we have a rating system based on a score out of 10, which is then calculated into an overall score out of 100. We have a "credibility" points system for users which directly influences the average score at the point of rating, but the next step is to implement a chart system which uses this data effectively.
I'll try and explain exactly how it all works so you can see which data I have at my disposal.
A site member rates a track between 1 and 10.
That site member has a "credibility" score, which is just a total of points accumulated for various activities around the site. A user gains, for example, 100 points for giving a rating so the more ratings they give, the higher their "credibility" score. Only the total credibility score is saved in the database, updated each time a user performs an activity with a points reward attached. These individual activities are not stored.
Based on the credibility of this user compared to other users who have rated the track, a weighted average is calculated for the track, which is then stored as a number between 1 and 100 in the tracks table.
In the tracks table, the number of times a track is listened to (i.e. number of plays) is also stored as a total.
So the data I have to work with is:
Overall rating for the track (number between 1 and 100)
Number of ratings for the track
Number of plays for the track
In the chart system I want to create a ranking that uses the above 3 sets of data to create a fair balance between quality (overall rating, normalized with number of ratings) and popularity (number of plays). BUT the system should factor quality more heavily than popularity, so for example the quality aspect makes up 75% of the normalized ranking and popularity 25%.
After a search on this site I found the IMDB Bayesian-style system which is helpful for working out the quality aspect, but how do I add in the popularity (number of plays) and have it balanced in the way I want?
The site is written in PHP and MySQL if that helps.
EDIT: the title says "number of clicks" but this is basically the direct equivalent of "number of plays".
You may want to try the following. The IMDB equation you mentioned uses weighting to lean toward either the average rating of the movie or the average rating of all movies:
WR = (v/(v+m)) × R + (m/(v+m)) × C
So
v << m => v/(v+m) -> 0; m/(v+m) -> 1 => WR -> C
and
v >> m => v/(v+m) -> 1; m/(v+m) -> 0 => WR -> R
This should generally be fair. Calculating a popularity score between 0 and 100 based on the number of plays is pretty tricky unless you really know your data. As a first try, calculate the average number of plays avg(p) and the standard deviation std(p); you can then use these to scale the number of plays using a technique called whitening (standardization):
WHITE(P) = (p - avg(p))/std(p)
This will give you a score that is roughly between -1 and 1 for most tracks, assuming your data looks like a bell curve (clamp any outliers). You can then scale this to be in the range 0 - 100 by scaling again:
POP = 50 * (1 + WHITE(P))
To combine the score based on some weighting factor w (e.g. 0.75) you'd simply do:
RATING = w x WR + (1 - w) x POP
Play with these and let me know how you get on.
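Putting those pieces together in PHP might look roughly like this; every name and constant below is an assumption to adapt to your own schema:
<?php
// Sketch: combine an IMDB-style weighted rating with a play-count popularity score.

// v = number of ratings for this track, m = minimum ratings to count fully,
// R = this track's rating (1-100), C = mean rating across all tracks (1-100).
function weightedRating(int $v, int $m, float $R, float $C): float {
    return ($v / ($v + $m)) * $R + ($m / ($v + $m)) * $C;
}

// Popularity on a 0-100 scale via the whitening step above.
// $avgPlays / $stdPlays are the mean and standard deviation of plays over all tracks.
function popularityScore(int $plays, float $avgPlays, float $stdPlays): float {
    $white = ($plays - $avgPlays) / max($stdPlays, 1e-9);
    $white = max(-1.0, min(1.0, $white));   // clamp outliers into [-1, 1]
    return 50 * (1 + $white);
}

// Final chart score: 75% quality, 25% popularity.
function chartScore(float $wr, float $pop, float $w = 0.75): float {
    return $w * $wr + (1 - $w) * $pop;
}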
NOTE: this does not account for the fact that a user can "game" the popularity by playing a track many times. You could get around this by penalising multiple plays of a single song:
deltaP = (1 - (Puser - 1)/TPuser)
Where:
deltaP = Change in # plays
Puser = number of time this user has played this track
TPuser = total number of tracks (not unique) played by the user
So the more times a user plays just the one track, the less it counts toward the total number of plays for that track. If the user's listening habits are diverse then TPuser will be large, and so deltaP will tend back toward 1. This can still be gamed, but it is a good start.
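A short PHP sketch of that penalty (the variable names are mine):
<?php
// Sketch: how much one more play by this user should add to the track's play total.
// $userPlaysOfTrack = times this user has played this track,
// $userTotalPlays   = total (non-unique) tracks played by this user.
function playIncrement(int $userPlaysOfTrack, int $userTotalPlays): float {
    return 1 - ($userPlaysOfTrack - 1) / max($userTotalPlays, 1);
}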

Making more recent items more likely to be drawn

There are a few hundred book records in the database and each record has a publish time. On the homepage of the website, I am required to write some code to randomly pick 10 books and display them there. The requirement is that newer books need to have a higher chance of being displayed.
Since the time is an integer, I am thinking like this to calculate the probability for each book:
Probability of drawing book i = (current time - publish time of book i) / [(current time - publish time of book 1) + (current time - publish time of book 2) + ... + (current time - publish time of book n)]
After a book is drawn, the next round of the loop subtracts that book's (current time - publish time) from the denominator and recalculates the probability for each of the remaining books; the loop continues until 10 books have been drawn.
Is this algorithm a correct one?
By the way, the website is written in PHP.
Feel free to suggest some PHP codes if you have a better algorithm in your mind.
Many thanks to you all.
Here's a very similar question that may help: Random weighted choice. The solution is in C#, but the code is very readable and close to PHP syntax, so it should be easy to adapt.
For example, here's how one could do this in MySQL:
First calculate the total age of all books and store it in a MySQL user variable:
SELECT SUM(TO_DAYS(CURDATE())-TO_DAYS(publish_date)) FROM books INTO @total;
Then choose books randomly, weighted by their age:
SELECT book_id FROM (
SELECT book_id, publish_date, TO_DAYS(CURDATE())-TO_DAYS(publish_date) AS age FROM books
) b
WHERE book_id NOT IN (...list of book_ids chosen so far...)
AND RAND()*@total < b.age AND (@total:=@total-b.age)
ORDER BY b.publish_date DESC
LIMIT 10;
Note that @total decreases only if a book has passed the random-selection test, because of short-circuiting of AND expressions.
This is not guaranteed to choose 10 books in one pass -- it's not even guaranteed to choose any books on a given pass. So you have to re-run the second step until you've found 10 books. The @total variable retains its decreased value so you don't have to recalculate it.
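A rough PHP sketch of that outer loop, assuming a PDO connection in $pdo that stays on the same MySQL session (so @total keeps its value between passes) and at least 10 books in the table:
<?php
// Sketch: keep re-running the weighted-selection query until 10 books are chosen.
$pdo->exec("SELECT SUM(TO_DAYS(CURDATE()) - TO_DAYS(publish_date)) FROM books INTO @total");

$chosen = [];
while (count($chosen) < 10) {
    // Exclude the books already picked (the dummy 0 keeps the IN () clause valid).
    $in  = $chosen ? implode(',', array_fill(0, count($chosen), '?')) : '0';
    $sql = "SELECT book_id FROM (
                SELECT book_id, publish_date,
                       TO_DAYS(CURDATE()) - TO_DAYS(publish_date) AS age
                FROM books
            ) b
            WHERE book_id NOT IN ($in)
              AND RAND()*@total < b.age AND (@total := @total - b.age)
            ORDER BY b.publish_date DESC
            LIMIT " . (10 - count($chosen));
    $stmt = $pdo->prepare($sql);
    $stmt->execute($chosen);
    $chosen = array_merge($chosen, $stmt->fetchAll(PDO::FETCH_COLUMN));
}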
First off, I think your formula will favor the older books (their weight grows with age), which is the opposite of what you want. Try to set your initial probabilities based on:
Age(i) - days since publication of book i
Max(Age) - the age of the oldest book in the sample
e - a small positive constant
Prob(i) = [Max(Age) + e - Age(i)] / sum over all books j of [Max(Age) + e - Age(j)]
The value e ensures that the oldest book has some probability of being selected. Now that that is done, you can always recalc the prob of any sample.
Now you have to find an UNBIASED way of picking books. Probably the best way would be to calculate the cumulative distribution using the above, then pick a uniform (0,1) random variable, find where that value falls in the cumulative distribution, and pick the corresponding book.
Can't help you on the coding. Make sense?
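That said, a rough PHP sketch of the cumulative-distribution pick could look like this; the $books array layout and field names are assumptions:
<?php
// Sketch: weighted sampling without replacement, newer books weighted more heavily.
// $books is assumed to be an array of ['id' => ..., 'publish_time' => unix timestamp].
function pickWeighted(array $books, int $howMany = 10): array {
    $now    = time();
    $maxAge = max(array_map(fn($b) => $now - $b['publish_time'], $books));
    $e      = 1;                      // small constant so the oldest book keeps some weight
    $picked = [];

    while (count($picked) < $howMany && $books) {
        // Weight = Max(Age) + e - Age(i), so newer books get larger weights.
        $weights = [];
        foreach ($books as $k => $b) {
            $weights[$k] = $maxAge + $e - ($now - $b['publish_time']);
        }
        // Draw a uniform number in [0, total] and walk the cumulative sum.
        $r   = mt_rand() / mt_getrandmax() * array_sum($weights);
        $cum = 0;
        foreach ($weights as $k => $w) {
            $cum += $w;
            if ($r <= $cum) {
                $picked[] = $books[$k];
                unset($books[$k]);    // sample without replacement
                break;
            }
        }
    }
    return $picked;
}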
