I'm planning on integrating a reasonable ranking/voting system into an existing application.
I'm familiar with how traditional 5-star rating systems work and know the common pitfalls associated with them, so I was wondering whether there are other approaches (I've heard of Wilson score, Bayesian averages, etc., but I'm not sure how to implement them with the structure below):
I'm planning on allowing users to vote on content from 1 to 10 via the content page.
The score and total votes for that content will be displayed on the content page.
I will also be listing the Top 10 content, so I need the method to be fair and realistic: a single vote of 10 should not send an item with 1 total vote straight to number 1.
I'm using PHP and MySQL, and I have a table for the content (which has a content_id that I guess I can JOIN on).
I'm wondering if you can suggest a method that achieves the above. I'd appreciate it if you could attach some example PHP code and an example MySQL schema so I can better understand it. I've Googled and found potential solutions such as Wilson score and Bayesian averages, but the articles are lengthy, full of confusing mathematical equations, and do not show how to achieve the above (i.e. computing the score and implementing the method in PHP/MySQL). Or perhaps I'm misunderstanding them because there is no example PHP/MySQL code.
Perhaps this is easier than I think. I don't know, as I've never needed to implement this sort of "more complex" ranking/voting functionality before, so I'd appreciate your responses.
You should start by watching this video on YouTube: Building Web Reputation Systems.
To emphasize the point, let me direct you to XKCD.
As for the DB structure, you need the following parts:
a list of items (with a total_votes column)
a list of users who have voted
an intersection table for items-users (with a rating column, if you go with the 5-star approach)
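To make the "a single 10 should not jump to number 1" requirement concrete, here is a minimal sketch of a Bayesian average, one of the approaches the question mentions. It is written in Python for brevity and should port line for line to PHP. The names `bayesian_average`, `top_items`, `prior_mean` and `prior_weight` are assumptions for illustration, not something from the posts above; `rating_sum` and `vote_count` would come from aggregating the rating column of the intersection table.

```python
def bayesian_average(rating_sum, vote_count, prior_mean, prior_weight):
    """Pull an item's mean rating toward prior_mean until it has enough votes.

    prior_mean   -- e.g. the average rating across all content (assumed value)
    prior_weight -- how many "phantom" votes at prior_mean every item starts with
    """
    return (prior_weight * prior_mean + rating_sum) / (prior_weight + vote_count)


def top_items(items, prior_mean, prior_weight, limit=10):
    """items: iterable of (content_id, rating_sum, vote_count) tuples."""
    scored = [
        (content_id, bayesian_average(rating_sum, vote_count, prior_mean, prior_weight))
        for content_id, rating_sum, vote_count in items
    ]
    # Highest adjusted score first; keep only the first `limit` entries.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical data: one item with a single vote of 10, one with 50 votes averaging 9.
items = [
    ("single-10", 10, 1),
    ("fifty-votes-avg-9", 450, 50),
    ("no-votes", 0, 0),
]
ranking = top_items(items, prior_mean=6.0, prior_weight=20)
```

With these made-up numbers the item with 50 votes ranks above the single 10, because the single vote barely moves its score away from the prior. One option is to show the plain average and vote count on the content page and use the adjusted score only for the Top 10 ordering.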
Related
I'm trying to create a Like/Unlike system akin to Facebook's for an existing comments section of a website, and I need help in designing the system.
Currently, every product on the website has a comments section, and members can post and like comments. I need to know how many comments each member has posted and how many likes each of their comments has received. I also need to know who liked which comments, partly for analytical purposes and partly so that I can prevent a user from liking a comment more than once.
The naive way of adding a Like system to the current comments module is to create a new table in the database that has foreign keys to the CommentID and UserID. Then, for every "like" given to a comment by a user, I would insert a row into this new table with the target comment ID and user ID.
While this might work, the massive amount of comments and users is going to cause this table to grow quickly and retrieving records from and doing counts on this huge table will become slow and inefficient. I can index either one of the columns, but I don't know how effective it would be. The website has over a million comments.
I'm using PHP and MySQL. For a system like this with a huge database, how should I design a Like system so that it is more optimised and stable?
For scalability, do not include the count column in the same table with other things. This is a rare case where "vertical partitioning" is beneficial. Why? The LIKEs/UNLIKEs will come fast and furious. If the code to do the increment/decrement hits a table used for other things (such as the text of the Comment), there will be an unacceptable amount of contention between the two.
This tip is the first of many steps toward being able to scale to Facebook levels. The other tips will come, not from a free forum, but from the team of smart engineers you will have to hire to get to that level. (Hints: Sharding, Buffering, Showing Estimates, etc.)
Your main concern will be a lot of counts, so the easy thing to do is to keep a separate count in your comments table.
Then you can create a TRIGGER that increments/decrements the count based on a like/unlike.
That way you only use the big table to figure out if a user already voted.
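The pieces described in the answers above (a likes table keyed on comment and user, a separately stored counter, and triggers that keep the counter in step) can be sketched as follows. This is only an illustration: it uses Python with an in-memory SQLite database so it can be run as is, and MySQL's trigger and upsert syntax differs (for example `INSERT IGNORE` instead of `INSERT OR IGNORE`). All table, trigger and function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per (comment, user); the composite primary key prevents double likes.
CREATE TABLE comment_likes (
    comment_id INTEGER NOT NULL,
    user_id    INTEGER NOT NULL,
    PRIMARY KEY (comment_id, user_id)
);
-- Counter kept apart from the comment text, as the first answer suggests.
CREATE TABLE comment_like_counts (
    comment_id INTEGER PRIMARY KEY,
    like_count INTEGER NOT NULL DEFAULT 0
);
CREATE TRIGGER like_added AFTER INSERT ON comment_likes
BEGIN
    INSERT OR IGNORE INTO comment_like_counts (comment_id, like_count)
        VALUES (NEW.comment_id, 0);
    UPDATE comment_like_counts SET like_count = like_count + 1
        WHERE comment_id = NEW.comment_id;
END;
CREATE TRIGGER like_removed AFTER DELETE ON comment_likes
BEGIN
    UPDATE comment_like_counts SET like_count = like_count - 1
        WHERE comment_id = OLD.comment_id;
END;
""")


def like(comment_id, user_id):
    """Return True if the like was recorded, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO comment_likes (comment_id, user_id) VALUES (?, ?)",
        (comment_id, user_id),
    )
    return cur.rowcount == 1


def unlike(comment_id, user_id):
    """Return True if a like was removed, False if there was none."""
    cur = conn.execute(
        "DELETE FROM comment_likes WHERE comment_id = ? AND user_id = ?",
        (comment_id, user_id),
    )
    return cur.rowcount == 1


def like_count(comment_id):
    """Read the stored counter instead of running COUNT(*) on the big table."""
    row = conn.execute(
        "SELECT like_count FROM comment_like_counts WHERE comment_id = ?",
        (comment_id,),
    ).fetchone()
    return row[0] if row else 0
```

The big table is then only touched by primary-key lookups (has this user already liked this comment?), while displaying a count is a single-row read.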
I am running a website that lets users contribute by uploading files on specific subjects. Right now my rating system is the worst possible (number of downloads of the file). Not only is this highly inaccurate in terms of quality control, it also prevents new content from being listed at the top anytime soon.
This is why I want to change my rating system so that users can up-/down-vote each item. However, this should not be the only factor that determines the popularity of an item. I would like older content to decrease in rating over time. Maybe I could even factor in the number of downloads, but with a very low weight.
So, my questions are:
Which formula would you suggest under the assumption that there is 1 new upload every day?
How would you implement this in a php/mysql environment?
My problem is that right now I am simply sorting my items by the downloads column in the database. How can I sort a query by a factor that is calculated externally (in PHP), or do I have to update a column in my table with the rating factor each time someone opens the site in their browser?
(Please excuse any mistakes, I am not a native speaker)
I am not really fluent in PHP or MySQL, but as for the rating system, if you want to damp things over time, have you considered a decaying exponential? Off the top of my head, I would probably do something like
$rating = $downloads * exp(-$elapsedTime);
You can read up on it here: http://en.wikipedia.org/wiki/Exponential_decay. Maybe build in a delay of a week or a month before you start damping the results, or people are going to get their uploads downrated immediately.
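A slightly fuller sketch of that idea, with the grace period included, might look like this. It is in Python for brevity (PHP has the same `exp`, `log` and `max` functions), and the half-life and grace-period parameters and their default values are assumptions chosen for illustration; the answer above does not specify units or a decay rate.

```python
import math


def decayed_score(base_score, age_days, half_life_days=30.0, grace_days=7.0):
    """Exponentially damp base_score once an item is older than grace_days.

    half_life_days -- after this many days past the grace period the score halves
    grace_days     -- no damping at all during this initial window
    """
    effective_age = max(0.0, age_days - grace_days)
    decay_rate = math.log(2) / half_life_days
    return base_score * math.exp(-decay_rate * effective_age)
```

For example, `decayed_score(100, 3)` is still 100 because the item is inside the grace period, while `decayed_score(100, 37)` has had exactly one half-life of damping and comes out at 50. Expressing the rate as a half-life makes it easier to tune than a bare `exp(-t)`.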
First of all, in any case, you will need to add at least one column to your table. The best thing would be to have a separate table with id, upvotes, downvotes, datetime.
If you want to take into consideration the freshness of posts (or uploads or comments or...), I think the best current method is the Wilson score with a gravity parameter.
For a good start with Wilson score implementation in PHP, check this.
Then you will need to read this to understand the pros and the cons of other solutions and use SQL directly.
Remark: gravity is not explicitly detailed in the SQL code but thanks to the PHP one you should be able to make it work.
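Since the linked examples are not reproduced here, a minimal sketch of the lower bound of the Wilson score interval for up/down votes is given below, in Python for brevity. How "gravity" should be combined with it is not spelled out in the answer; dividing by `(age_hours + 2) ** gravity` is one commonly cited Hacker News-style choice and is purely an assumption here, as are the function names and default values.

```python
import math


def wilson_lower_bound(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the upvote fraction.

    z=1.96 corresponds to roughly 95% confidence.
    """
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)


def ranked_score(upvotes, downvotes, age_hours, gravity=1.8):
    """Assumed way of applying gravity: older items sink as age grows."""
    return wilson_lower_bound(upvotes, downvotes) / (age_hours + 2) ** gravity
```

With this, an item with 1 upvote and 0 downvotes scores about 0.21, while an item with 90 upvotes and 10 downvotes scores about 0.83, so a single vote cannot outrank a well-established item. To sort in MySQL, the score can be stored in a column and refreshed when votes change or from a scheduled job.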
Note that if you would like something simpler that is still reasonable, you could look at the Bayesian average. IMDb uses Bayesian estimation to calculate its Top 250.
Implementing your own statistical model will only result in drawbacks that you had not imagined at first (scores too far from the mean, downvotes weighing more than upvotes, decay that is too quick, etc.).
Finally, you are talking about rating uploads directly, not the user who uploads the files. If you would like to do the same for the user, the simplest approach would be to use a Bayesian estimate built from the ratings of that user's uploads.
You have a lot to read, just on Stack Overflow, to cover the subject thoroughly.
Your journey starts here...
I am wanting to do something similar to this:
http://www.dimarzio.com/pickup-picker
My question involves the concept rather than any specific code on how to execute this.
For example, we are selling violins and we want the user to input info about their playing style, and give them the three best violins based on their entry. This is the data I've been given:
So if the user inputs Expert, Hard, Rock, and Dark I will get data sets of violins consisting of: Cannon, Soil, Ysaye, K.Joseph, Heifetz // Cannon, Kreisler, Soil, Heifetz // Kreisler, Diable, Vieuxtemps // Cannon, Diable, Plowden
Out of those I need to output to the user the three best choices for them. Cannon is listed in 3 out of the 4, so that has to be #1. Now there are three more violins that match two of the four criteria: the Soil, Kreisler and Diable. In order to drill that down to two choices, I would think the questions would have to be ranked according to importance. For instance, tone is most important, followed by bowing style, musical genre, and skill level. Based on that ranking, the program should choose the Diable and Kreisler.
I am not entirely sure how to approach this. Since this data will not change frequently, should this even get the database involved? Should the info just be stored in a multi-dimensional array? Once the data is in an array, whether from the DB or not, how should I go about programming the logic to examine the arrays in order of importance and grab the violins that are most relevant?
Any help is much appreciated! I figured this was going to be easy, until I actually started thinking about it!
To me this sounds like a sorting problem. I don't know anything about violins so I'm unable to absorb much from your example, but anyway...
You're probably familiar with how a database sorts across multiple columns. If I said order by firstname, lastname, phone, it would compare the first names, and only if there's a tie would it then compare the last names, and again, if there's a tie, it would compare the phone numbers.
Once sorted, you pick the top N entries and display them.
You can do custom sorting like this in PHP code too. For example, you would want to order by number of occurrences in a list, then tone, bowing style, etc.
That's the gist of it. I would store it in a database merely because it's data and, for the most part, a database is a great place to keep it. Plenty of import, export, viewing, editing and other data management freebies come with using a database.
If you need some sample code that mimics the database ORDER BY clause, I can dig some up; I know I have some somewhere.
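As a rough illustration of that multi-column sort, here is a sketch using the violin lists from the question, in Python for brevity (in PHP the same comparison can be done with `usort`). The category names `skill`, `bowing`, `genre` and `tone` are assumed labels for the four inputs Expert, Hard, Rock and Dark, and `top_violins` is a hypothetical helper.

```python
# Violins matching each of the user's four answers (data from the question).
matches = {
    "skill": ["Cannon", "Soil", "Ysaye", "K.Joseph", "Heifetz"],   # Expert
    "bowing": ["Cannon", "Kreisler", "Soil", "Heifetz"],           # Hard
    "genre": ["Kreisler", "Diable", "Vieuxtemps"],                 # Rock
    "tone": ["Cannon", "Diable", "Plowden"],                       # Dark
}
# Most important criterion first, as described in the question.
priority = ["tone", "bowing", "genre", "skill"]


def top_violins(matches, priority, limit=3):
    """Sort by number of matched criteria, breaking ties by criterion priority."""
    names = sorted({name for group in matches.values() for name in group})

    def sort_key(name):
        hits = [name in matches[criterion] for criterion in priority]
        # First compare the match count, then the per-criterion hits in order.
        return (sum(hits), hits)

    return sorted(names, key=sort_key, reverse=True)[:limit]


best = top_violins(matches, priority)
```

For this data `best` is Cannon, Diable, Kreisler, which matches the outcome the question expects. Note that Heifetz also matches two criteria (Expert and Hard), but it loses the tie-break in the same way Soil does.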
We have a start-up company that solves the issue you are outlining. Basically, we have created a semantically enabled product selector which guides users through a selection process to find a product or a solution.
Although we have designed our product for a different market sector (not violins), I think it would help to solve the issue you describe.
The data is hosted on Amazon AWS and we have built an API so the product selector can be incorporated into iPhone apps, Android apps, websites etc.
If you are interested, visit our website, www.productworld.com, where you will find my contact details.
I'm really interested to find out how people approach collaborative filtering, recommendation engines, etc. I mean this more in terms of the performance of the script than anything. I have started reading Programming Collective Intelligence, which is really interesting but tends to focus more on the algorithmic side of things.
I currently only have 2k users, but my current system is proving to be totally not future proof and very taxing on the server already. The entire system is based on making recommendations of posts to users. My application is PHP/MySQL but I use some MongoDB for my collaborative filtering stuff - I'm on a large Amazon EC2 instance. My setup is really a 2 step process. First I calculate similarities between items, then I use this information to make recommendations. Here's how it works:
First, my system calculates similarities between users' posts. The script runs an algorithm which returns a similarity score for each pair. The algorithm examines information such as common tags, common commenters and common likers, and returns a similarity score. The process goes like this:
Each time a post is added, has a tag added, commented on or liked I add it to a queue.
I process this queue via cron (once a day), finding out the relevant information for each post, e.g. user_id's of the commenters and likers and tag_id's. I save this information to MongoDB in this kind of structure: {"post_id":1,"tag_ids":[12,44,67],"commenter_user_ids":[6,18,22],"liker_user_ids":[87,6]}. This allows me to eventually build up a MongoDB collection which gives me easy and quick access to all of the relevant information for when I try to calculate similarities
I then run another cron script (also once a day, but after the previous one) which goes through the queue again. This time, for each post in the queue, I grab its entry from the MongoDB collection and compare it to all of the other entries. When 2 entries have some matching information, I give them +1 in terms of similarity. In the end I have an overall score for each pair of posts. I save the scores to a different MongoDB collection with the following structure: {"post_id":1,"similar":{"23":2,"2":5,"7":2}} ('similar' is a key=>value array with the post_id as key and the similarity score as the value; I don't save a score if it is 0).
I have 5k posts, so all of the above is quite hard on the server. There's a huge number of reads and writes to be performed. Now, this is only half the issue. I then use this information to work out which posts would be interesting to a particular user. So, once an hour, a cron job runs a script that calculates 1 recommended post for each user on the site. The process goes like so:
The script first decides which type of recommendation the user will get. It's a 50-50 chance of: 1. a post similar to one of your posts, or 2. a post similar to a post you have interacted with.
If 1, then the script grabs the user's post_ids from MySQL, then uses them to grab their similar posts from MongoDB. The script takes the post that is most similar and has not yet been recommended to the user.
If 2, the script grabs all of the posts the user has commented on or liked from MySQL and uses their ids to do the same as in 1 above.
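The selection step in 1 and 2 above can be sketched roughly as follows, in Python for brevity. The function and parameter names are hypothetical, `similar_by_post` is assumed to be an in-memory copy of the `similar` documents shown earlier (with string keys, as in the stored structure), and skipping the user's own source posts is an added assumption.

```python
def pick_recommendation(source_post_ids, similar_by_post, already_recommended):
    """Return the most similar post not yet recommended, or None if there is none.

    source_post_ids     -- the user's own posts (case 1) or posts they interacted with (case 2)
    similar_by_post     -- {post_id: {"other_post_id": score, ...}}
    already_recommended -- set of post ids this user has been shown before
    """
    sources = set(source_post_ids)
    best_id, best_score = None, 0
    for source_id in source_post_ids:
        for candidate_key, score in similar_by_post.get(source_id, {}).items():
            candidate_id = int(candidate_key)
            if candidate_id in already_recommended or candidate_id in sources:
                continue
            if score > best_score:
                best_id, best_score = candidate_id, score
    return best_id


# The example document from the post: post 1 is similar to posts 23, 2 and 7.
similar_by_post = {1: {"23": 2, "2": 5, "7": 2}}
```

Here `pick_recommendation([1], similar_by_post, set())` returns 2, the highest-scoring neighbour. Since this is a single pass over precomputed scores, the hourly cost should be dominated by how the source post ids and similarity documents are fetched rather than by the selection itself.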
Unfortunately the hourly recommendation script is getting very resource intensive and is slowly taking longer and longer to complete... currently 10-15 minutes. I'm worried that at some point I won't be able to provide hourly recommendations anymore.
I'm just wondering if anyone feels I could be approaching this any better?
With 5000 posts, that's roughly 25,000,000 post-to-post comparisons (about 12.5 million distinct pairs), and the number grows as O(n^2).
Your first problem is how you can avoid examining so many relationships every time the batch runs. Using tags or keywords will help with content matching, and you could use date ranges to limit common 'likes'. Beyond that, we'd need to know a lot more about the methodology for establishing relationships.
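One way to avoid comparing every post with every other post is an inverted index: group posts by each tag, commenter or liker they have, and only score pairs that actually share something. The sketch below, in Python for brevity, is an assumption about how this could be done rather than anything from the question; it uses the field names from the MongoDB documents shown there and reproduces the "+1 per match" scoring.

```python
from collections import defaultdict
from itertools import combinations

FIELDS = ("tag_ids", "commenter_user_ids", "liker_user_ids")


def similarity_scores(posts):
    """Return {(post_a, post_b): score} for pairs sharing at least one value."""
    # Inverted index: (field, value) -> set of post ids that contain it.
    index = defaultdict(set)
    for post in posts:
        for field in FIELDS:
            for value in post.get(field, []):
                index[(field, value)].add(post["post_id"])

    scores = defaultdict(int)
    for post_ids in index.values():
        # Only posts sharing this tag/commenter/liker are ever paired up.
        for a, b in combinations(sorted(post_ids), 2):
            scores[(a, b)] += 1
    return dict(scores)


posts = [
    {"post_id": 1, "tag_ids": [12, 44, 67],
     "commenter_user_ids": [6, 18, 22], "liker_user_ids": [87, 6]},
    {"post_id": 2, "tag_ids": [44, 99],
     "commenter_user_ids": [6], "liker_user_ids": []},
    {"post_id": 3, "tag_ids": [5],
     "commenter_user_ids": [], "liker_user_ids": [87]},
]
scores = similarity_scores(posts)
```

In this made-up data, posts 1 and 2 share a tag and a commenter (score 2), posts 1 and 3 share a liker (score 1), and posts 2 and 3 are never compared. The saving depends on the data: a tag or user attached to most posts still produces a large group, so very common values may need to be capped or ignored.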
Another consideration is when you establish relationships. Why are you waiting until the batch runs to compare a new post with existing data? Certainly it makes sense to handle this asynchronously to ensure that the request is processed quickly - but (other than the restrictions imposed by your platform) why wait until the batch kicks in before establishing the relationships? Use an asynchronous message queue.
Indeed depending on how long it takes to process a message, there may even be a case for re-generating cached relationship data when an item is retrieved rather than when it is created.
And if I were writing a platform to measure relationships with data then (the clue is in the name) I'd definitely be leaning towards a relational database where joins are easy and much of the logic can be implemented on the database tier.
It's certainly possible to reduce the length of time the system takes to cross-reference the data. This is exactly the kind of problem map-reduce is intended to address, but the benefits mainly come from being able to run the algorithm in parallel across lots of machines; at the end of the day it takes just as many clock ticks.
I'm starting to plan how to do this.
First thing is to possibly get rid of your database technology or supplement it with either triplestore or graph technologies. That should provide some better performance for analyzing similar likes or topics.
Next, yes, get a subset. Take a few interests that the user has and get a small pool of users that have similar interests.
Then build indexes of likes in some sort of meaningful order and count the inversions (divide and conquer - this is pretty similar to merge sort, and you'll want to sort on your way out to count split inversions anyway).
I hope that helps - you don't want to compare everything to everything else or it's definitely O(n^2). You should be able to replace that with something somewhere between constant and linear if you take sets of people who have similar likes and use that.
For example, from a graph perspective, take something that they recently liked, look at the in-edges, then trace them out and analyze just those users. Maybe do this on a few recently liked articles, find a common set of users from that, and use that set for the collaborative filtering to find articles the user would likely enjoy. Then you're at a workable problem size - especially in a graph, where there is no index growth (although there may be more in-edges to traverse on the article - that just gives you more chance of finding usable data, though).
Even better would be to key the articles themselves, so that if an article was liked by someone you can see articles that they may like based on other users (i.e. Amazon's 'users that bought this also bought').
Hope that gives a few ideas. For graph analysis there are some frameworks that may help, like Faunus for stats and derivations.
I was thinking about an idea of auto generated answers, well the answer would actually be a url instead of an actual answer, but that's not the point.
The idea is this:
On our app we've got a reporting module which basically shows page views, clicks, conversions and details about visitors, such as where they're from - pretty much a similar thing to Google Analytics, but far more simplified.
And now I was thinking that instead of making users select things like countries, traffic sources, etc. from dropdown menus (these features would be available as well), it would be pretty cool to allow them to type in questions which would result in a link to the part of the report they expect. An example:
How many conversions I had from Japan on variant (one page can have many variants) 3.
would result in:
/campaign/report/filter/campaign/(current campaign id they're on)/country/Japan/variant/3/
It doesn't seem too hard to do it myself, but it's just that it would take quite a while to make it accurate enough.
I've tried Googling but had no luck finding an existing script, so maybe you know of something similar to my idea that's open source and reliable/flexible enough to suit my needs.
Thanks!
You are talking about natural language processing - an artificial intelligence topic. This can never be perfect, and eventually boils down to the system only responding to a finite number of permutations of one question.
That said, if that is fine with you - then you simply need to identify "tokens". For example,
how many - evaluate to count
conversions - evaluate to all "conversions"
from - apply a filter...
japan - ...using japan
etc.
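A very small sketch of that token approach, producing the URL format from the question, could look like this. It is in Python for brevity, and the `COUNTRIES` lookup, the regular expressions and the function name are all assumptions for illustration. It only handles the two filters that appear in the example URL (country and variant); "how many" and "conversions" are not mapped to anything because the example URL in the question does not encode them.

```python
import re

# Hypothetical lookup of recognised country tokens -> value used in the URL.
COUNTRIES = {"japan": "Japan", "germany": "Germany", "france": "France"}


def question_to_url(question, campaign_id):
    """Turn a typed question into a report filter URL using simple token rules."""
    text = question.lower()
    parts = [f"/campaign/report/filter/campaign/{campaign_id}"]

    # "from <word>" -> country filter, but only for countries we recognise.
    country = re.search(r"\bfrom\s+([a-z]+)", text)
    if country and country.group(1) in COUNTRIES:
        parts.append(f"country/{COUNTRIES[country.group(1)]}")

    # "variant ... <number>" -> variant filter (first number after the word).
    variant = re.search(r"\bvariant\b\D*?(\d+)", text)
    if variant:
        parts.append(f"variant/{variant.group(1)}")

    return "/".join(parts) + "/"
```

For the question in the post and a campaign id of 42, this yields `/campaign/report/filter/campaign/42/country/Japan/variant/3/`. As the answer says, this only covers the phrasings the rules anticipate; anything else falls back to the unfiltered campaign report.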