Popularity Algorithm - php

I'm making a digg-like website that is going to have a homepage with different categories. I want to display the most popular submissions.
Our rating system is simply "likes", like "I like this" and whatnot. We basically want to display the submissions with the highest number of "likes" per time. We want to have three categories: all-time popularity, last week, and last day.
Does anybody know of a way to help? I have no idea how to go about doing this and making it efficient. I thought that we could use some sort of cron-job to run every 10 minutes and pull in the number of likes per the last 10 minutes...but I've been told that's pretty inefficient?
Help?
Thanks!

Typically Digg and Reddit-like sites go by the date of the submission and not the times of the votes. This way all it takes is a simple SQL query to find the top submissions for X time period. Here's a pseudo-query to find the 10 most popular links from the past 24 hours using this method:
select * from submissions
where (current_time - post_time) < 86400
order by score desc limit 10
Basically, this query says to find all the submissions where the number of seconds between now and the time it was posted is less than 86400, which is 24 hours in UNIX time.
If you really want to measure popularity within X time interval, you'll need to store the post and time for every vote in another table:
create table votes (
post foreign key references submissions(id),
time datetime,
vote integer); -- +1 for upvote, -1 for downvote
Then you can generate a list of the most popular posts between X and Y times like so:
select sum(vote), post from votes
where X < time and time < Y
group by post
order by sum(vote) desc limit 10;
From here you're just a hop, skip, and inner join away from getting the post data tied to the returned ids.

Do you have a decent DB setup? Can we please hear about your CREATE TABLE details and indices? Assuming a sane setup, the DB should be able to pull the counts you require fast enough to suit your needs! For example (net of indices and keys, that somewhat depend on what DB engine you're using), given two tables:
CREATE TABLE submissions (subid INT, when DATETIME, etc etc)
CREATE TABLE likes (subid INT, when DATETIME, etc etc)
you can get the top 33 all-time popular submissions as
SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33
and those voted for within a certain time range as
SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
WHERE likes.when BETWEEN initial_time AND final_time
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33
If you were storing "votes" (positive or negative) in likes, instead of just counting each entry there as +1, you could simply use SUM(likes.vote) instead of the COUNTs.

For stable list like alltime, lastweek, because they are not supposed to change really fast so that I think you should save the list in your cache with expiration time is around 1 days or longer.
If you concern about correct count in real time, you can check at every page view by comparing the page with lowest page in the cache.
All you need to do is to care for synchronizing between the cache and actual database.
thethanghn

Queries where the order is some function of the current time can become real performance problems. Things get much simpler if you can bucket by calendar time and update scores for each bucket as people vote.

To complete nobody_'s answer I would suggest you read up on the documentation (if you are using MySQL of course).

Related

codeigniter active record querying for results based on hours as well as by state

I have two queries ultimately I think they will be in the same context of the other but in all. I have a user database that I want to pull out for tracking records based on hour. Example registrations per hour. But in this registrations per hour I want to have the query to dump results by hour increments (or weeks, or months) ie: 1,000 regitrations in november, 1,014 in december and so on, or similar for weeks hours.
I also have a similar query where I want to generate a list of states with the counts next to them of how many users I have per state.
My issue is, I'm thinking I think to one dimensionally currently cause the best idea I can think of at the moment is making in the case of the states 50 queries, but I know thats insane, and there has to be an easier way thats less intense. So thats what Im hoping someone from here can help me with, by giving me a general idea. Cause I don't know which is the best course of action for this currently.. be it using distinct, group_by or something else.
Experiment a bit and see if that doesn't help you focus on the question a bit more.
Try selecting from your registrations per hour table and appending the time buckets you are interested in to the select list.
like this:
select userid, regid, date_time, week(date_time), year(date_time), day(date_time)
from registraions;
you can then roll up and count things in that table by using group by and an aggregate function like this:
select count(distinct userid), year(date_time)
from registraions
group by year(date_time)
Read about about date time functions:
MySQL Date Time Functions
Read about aggregate functions"
MySQL Group By

How can I store post rankings that change with time?

I'm trying to learn how to code a website algorithm like Reddit.com where there are thousands of posts that need to be ranked. Their ranking algorithm works like this (you don't have to read it, its more of a general question that I have): http://amix.dk/blog/post/19588
Right now I have posts stored in a database, I record their dates and they each have an upvotes and downvotes field so I'm storing their records. I want to figure out how do you store their rankings? When specific posts have ranking values, but they change with time, how could you store their rankings?
If they aren't stored, do you rank every post every time a user loads the page?
When would you store the posts? Do you run a cron job to automatically give every post a new value every x minutes? Do you store their value? Which is temporary. Maybe, until that post reaches its minimum score and is forgotten?
I did actually read the explanation of the ranking system and if I'm correct they do not care about the current time, but the time of submission of the post. This means the score will change on two points; 1) when the post is submitted, 2) when someone up- or downvotes the post
So you have to (re-)calculate the score when you post something, and when someone up- or downvotes. To recalculate the score isn't that heavy for the server (not at all actually), so just recalculate on vote-changes!
You would most probably use some sort of cache. In addition to post_time, up_votes, and down_votes, you'd have a current_rank, and last_ranked. If last_ranked would be more than say, 20 minutes, you'd rank it again, and if not, display the cached rank.
A different approach would be to rank and save the rank (i.e. current_rank) every time the post is upvoted or downvoted, as well as periodically (every X minutes), then you can just ask the database for the rank, at it would be (pretty) up to date.

MySQL Queries before NOW

Now I've read on this fantabulous site about how to check if the timestamp that is in your database is before now(), and that answers part of my question.
But the remainder of my question still eludes me:
How can you not only check for timestamps BEFORE now(), but also put a cap on how many before now are queried?
I have a database that is acting as a guestbook, and the output is a flash piece that looks like a post-it board. you fill out a form, and the submission immediately "tacks" itself to the post-it board. The problem is that when you first load the page, it will load every single post starting from id-0 to id-5,000,000 and so on and so forth.
I would like to put a cap on how many are loaded so the query looks at the following things:
What's the current timestamp?
Go back in time (for example) 10 entries ago
Post from THAT point on
The fields I have in my database are: id, comments, timestamp
EDIT
I'm looking at some of the answers, and I would like to ask about the LIMIT. If I put a limit on the queries, will I still be able to query posts that are PAST now?
Example: Say there are two people viewing the site at the same time. One visitor posts a comment into the database. I want person 2 to still be able to the see that comment pop up on his end.
The flash post-it area runs off a php script that queries the database and exports it into an XML format that Flash can read. Every 5 seconds, the flash re-queries the database to check for more posts.
If I put the Limit on the query, will I still be able to grab NEW entries on the fly? I just want the LIMIT to generate a starting point offset from ground zero
I think what you are looking for is called Limit
You just put it at the end of your statement and the query will return the amount of results you wanted
YOUR QUERY LIMIT 0,10
This will return 10 first results
SELECT * FROM
(SELECT ... WHERE dt<NOW() ORDER BY dt DESC LIMIT 10) a
ORDER BY a.dt ASC
or
SELECT ... WHERE dt<NOW() ORDER BY dt DESC LIMIT 10
check which is the more suitable for you.

How to calculate percentile rank for point totals over different time spans?

On a PHP & CodeIgniter-based web site, users can earn reputation for various actions, not unlike Stack Overflow. Every time reputation is awarded, a new entry is created in a MySQL table with the user_id, action being rewarded, and value of that bunch of points (e.g. 10 reputation). At the same time, a field in a users table, reputation_total, is updated.
Since all this is sort of meaningless without a frame of reference, I want to show users their percentile rank among all users. For total reputation, that seems easy enough. Let's say my user_id is 1138. Just count the number of users in the users table with a reputation_total less than mine, count the total number of users, and divide to find the percentage of users with a lower reputation than mine. That'll be user 1138's percentile rank, right? Easy!
But I'm also displaying reputation totals over different time spans--e.g., earned in the past seven days, which involves querying the reputation table and summing all my points earned since a given date. I'd also like to show percentile rank for the different time spans--e.g., I may be 11th percentile overall, but 50th percentile this month and 97th percentile today.
It seems I would have to go through and find the reputation totals of all users for the given time span, and then see where I fall within that group, no? Is that not awfully cumbersome? What's the best way to do this?
Many thanks.
I can think of a few options off the top of my head here:
As you mentioned, total up the reputation points earned during the time range and calculate the percentile ranks based on that.
Track updates to reputation_total on a daily basis - so you have a table with user_id, date, reputation_total.
Add some new columns to the user table (reputation_total, reputation_total_today, reputation_total_last30days, etc) for each time range. You could also normalize this into a separate table (reputation_totals) to prevent you from having to add a new column for each time span you want to track.
Option #1 is the easiest, but it's probably going to get slow if you have lots of rows in your reputation transaction table - it won't scale very well, especially if you need to calculate these in real time.
Option #2 is going to require more storage over time (one row per user per day) but would probably be significantly faster than querying the transaction table directly.
Option #3 is less flexible, but would likely be the fastest option.
Both options 2 & 3 would likely require a batch process to calculate the totals on a daily basis, so that's something to consider as well.
I don't think any option is necessarily the best - they all involve different tradeoffs of speed/storage space/complexity/flexibility. What you do will ultimately depend on the requirements for your application of course.
I don't see why that would be too overly complex. Generally all you would need is to add to your WHERE clause a query that limits results like:
WHERE DatePosted between #StartOfRange and #EndOfRange

Popularity Algorithm

I'd like to populate the homepage of my user-submitted-illustrations site with the "hottest" illustrations uploaded.
Here are the measures I have available:
How many people have favourited that illustration
votes table includes date voted
When the illustration was uploaded
illustration table has date created
Number of comments (not so good as max comments total about 10 at the moment)
comments table has comment date
I have searched around, but don't want user authority to play a part, but most algorithms include that.
I also need to find out if it's better to do the calculation in the MySQL that fetches the data or if there should be a PHP/cron method every hour or so.
I only need 20 illustrations to populate the home page. I don't need any sort of paging for this data.
How do I weight age against votes? Surely a site with less submission needs less weight on date added?
Many sites that use some type of popularity ranking do so by using a standard algorithm to determine a score and then decaying eternally over time. What I've found works better for sites with less traffic is a multiplier that gives a bonus to new content/activity - it's essentially the same, but the score stops changing after a period of time of your choosing.
For instance, here's a pseudo-example of something you might want to try. Of course, you'll want to adjust how much weight you're attributing to each category based on your own experience with your site. Comments are rare, but take more effort from the user than a favorite/vote, so they probably should receive more weight.
score = (votes / 10) + comments
age = UNIX_TIMESTAMP() - UNIX_TIMESTAMP(date_created)
if(age < 86400) score = score * 1.5
This type of approach would give a bonus to new content uploaded in the past day. If you wanted to approach this in a similar way only for content that had been favorited or commented on recently, you could just add some WHERE constraints on your query that grabs the score out from the DB.
There are actually two big reasons NOT to calculate this ranking on the fly.
Requiring your DB to fetch all of that data and do a calculation on every page load just to reorder items results in an expensive query.
Probably a smaller gotcha, but if you have a relatively small amount of activity on the site, small changes in the ranking can cause content to move pretty drastically.
That leaves you with either caching the results periodically or setting up a cron job to update a new database column holding this score you're ranking by.
Obviously there is some subjectivity in this - there's no one "correct" algorithm for determining the proper balance - but I'd start out with something like votes per unit age. MySQL can do basic math so you can ask it to sort by the quotient of votes over time; however, for performance reasons, it might be a good idea to cache the result of the query. Maybe something like
SELECT images.url FROM images ORDER BY (NOW() - images.date) / COUNT((SELECT COUNT(*) FROM votes WHERE votes.image_id = images.id)) DESC LIMIT 20
but my SQL is rusty ;-)
Taking a simple average will, of course, bias in favor of new images showing up on the front page. If you want to remove that bias, you could, say, count only those votes that occurred within a certain time limit after the image being posted. For images that are more recent than that time limit, you'd have to normalize by multiplying the number of votes by the time limit then dividing by the age of the image. Or alternatively, you could give the votes a continuously varying weight, something like exp(-time(vote) + time(image)). And so on and so on... depending on how particular you are about what this algorithm will do, it could take some experimentation to figure out what formula gives the best results.
I've no useful ideas as far as the actual agorithm is concerned, but in terms of implementation, I'd suggest caching the result somewhere, with a periodic update - if the resulting computation results in an expensive query, you probably don't want to slow your response times.
Something like:
(count favorited + k) * / time since last activity
The higher k is the less weight has the number of people having it favorited.
You could also change the time to something like the time it first appeared + the time of the last activity, this would ensure that older illustrations would vanish with time.

Categories