Item rankings, order by confidence using Reddit Ranking Algorithms - php

I am interested in using this ranking class, based on an article by Evan Miller, to rank a table I have that has upvotes and downvotes. I have a system very similar to Stack Overflow's up/down voting for an events site I am working on, and by using this ranking class I feel the results will be more accurate. My question is: how do I order by the function 'hotness'?
private function _hotness($upvotes = 0, $downvotes = 0, $posted = 0) {
    $s = $this->_score($upvotes, $downvotes);
    $order = log(max(abs($s), 1), 10);
    if ($s > 0) {
        $sign = 1;
    } elseif ($s < 0) {
        $sign = -1;
    } else {
        $sign = 0;
    }
    $seconds = $posted - 1134028003;
    return round($order + (($sign * $seconds) / 45000), 7);
}
I suppose that each time a user votes I could have a column in my table in which the hotness is recalculated for the new vote, and order by that column on the main page. But I am interested in doing this more on-the-fly, incorporating the function above, and I am not sure whether that is possible.
From Evan Miller, he uses:
SELECT widget_id, ((positive + 1.9208) / (positive + negative) -
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
(positive + negative)) / (1 + 3.8416 / (positive + negative))
AS ci_lower_bound FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
But I would rather not do this calculation in the SQL, as I feel it is messy and difficult to change down the line if I use this code on multiple pages, etc.

Accessing the corresponding "Posts" table for anything (reading, writing, sorting, comparing, etc.) is extremely quick, so relying on the database is the "most on-the-fly" alternative you have for non-temporary data storage (memory/sessions are quicker still but, logically, cannot be used to store this information).
You should be more worried about building a good ranking algorithm that delivers the results you want (you are proposing two different systems, which deliver different results) and about making the whole code, and the code-database communication, as efficient as possible.
In principle, small pieces of code issuing simple queries iteratively offer the quickest and most reliable solution for this kind of situation. Example (a sketch follows after these two points):
A ranking function (like the first one you propose, or any other built on the ranking rules you want) is called every time a vote is cast. It writes to the corresponding column(s) in the "Posts" table (the simpler the query, the better: you can create a ranking system as complex as you wish, but try to rely on PHP rather than on queries).
Every time a comparison between posts is required, the "Posts" table is read with a simple SELECT ordering the records by ranking (you can have various "assessing columns" (e.g., up-votes, down-votes, further considerations), but it is better to have a single column holding the definitive ranking).
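A minimal sketch of that write-on-vote flow, assuming a hypothetical posts table with upvotes, downvotes, posted and a precomputed hotness column, plus a PDO connection; hotness() mirrors the question's _hotness(), with _score() assumed to be plain net votes:
function hotness(int $up, int $down, int $posted): float
{
    $s = $up - $down;                  // assumed _score(): net votes
    $order = log(max(abs($s), 1), 10);
    $sign = $s <=> 0;                  // 1, 0 or -1
    return round($order + ($sign * ($posted - 1134028003)) / 45000, 7);
}

function recordVote(PDO $pdo, int $postId, bool $isUpvote): void
{
    $column = $isUpvote ? 'upvotes' : 'downvotes'; // whitelisted column name
    $pdo->prepare("UPDATE posts SET $column = $column + 1 WHERE id = ?")
        ->execute([$postId]);

    // Recompute the ranking in PHP and store it once per vote...
    $stmt = $pdo->prepare("SELECT upvotes, downvotes, posted FROM posts WHERE id = ?");
    $stmt->execute([$postId]);
    $p = $stmt->fetch(PDO::FETCH_ASSOC);

    $pdo->prepare("UPDATE posts SET hotness = ? WHERE id = ?")
        ->execute([hotness((int)$p['upvotes'], (int)$p['downvotes'], (int)$p['posted']), $postId]);
}

// ...so the front page stays a trivial indexed query:
// SELECT * FROM posts ORDER BY hotness DESC LIMIT 20;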

You are right, a query like this is rather messy, and expensive as well.
Mixed PHP/MySQL on the fly is a bad idea as well, since you would have to select the values for all posts, calculate the hotness of each, and then select a list of the hottest ones. Extremely expensive.
You should consider saving at least part of your calculation to the database. The order term should definitely go into the database. It's always better to calculate something and save it just once on every save/update than to recalculate it every time it is displayed. Try a benchmark to see how much time you save by calculating the order on save/update instead of every time you compute the hotness. The good thing is that the order never changes unless someone upvotes/downvotes, which you save to the db anyway; the same goes for the sign.
Even if you save the sign to the db, you are still not able to avoid calculating on the fly entirely, because of the posted timestamp parameter.
I would measure what difference it makes, and where, and recalculate hotness with a CLI script every x amount of time for the posts where this is crucial, and every y amount of time where it makes less of a difference.
Taking this approach you will be recalculating hotness only when necessary, which will make your application much more efficient.
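To make the on-the-fly part as cheap as possible, here is a sketch of the read side, assuming hypothetical score_order and score_sign columns that are recomputed in PHP whenever a vote is saved, and a PDO connection $pdo; only the time arithmetic is left to the query:
$sql = "SELECT id, title,
               score_order + (score_sign * (posted - 1134028003)) / 45000 AS hotness
        FROM posts
        ORDER BY hotness DESC
        LIMIT 20";
foreach ($pdo->query($sql) as $row) {
    echo $row['title'], ' (', $row['hotness'], ")\n";
}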

I am not sure whether it is possible with your DB and schema, but have you considered writing a UDF for custom sorting?
A post from Stack Overflow talks about how to do this here.

Related

Full text document similarity search

I have a big database of articles and, before adding new items to the DB, I'd like to check whether similar items already exist; if so, I group them together so that later I can easily display them as a group of similar items.
Currently we use PHP's similar_text() function, which is very simple but shockingly precise and fully satisfies our needs. The problem is that before we add an item to the DB, we first need to pull X items from the DB and then loop through every single one to check whether our new item is at least 75% similar to the others, in order to group them together. This uses a lot of resources and time that we don't really have.
We use MySQL and Solr for all our queries. I've tried MySQL full-text search and Solr's MoreLikeThis. Compared to PHP's implementation they are super fast and efficient, but I just can't get the robust percentage score that similar_text() provides. It is crucial for our grouping to be accurate.
For example using this MySQL query:
SELECT id, body, ROUND(((MATCH(body) AGAINST ('ARTICLE TEXT')) / scores.max_score) * 100) as relevance
FROM natural_text_test,
(SELECT MAX(MATCH(body) AGAINST('ARTICLE TEXT')) as max_score FROM natural_text_test LIMIT 1) scores
HAVING relevance > 75
ORDER BY relevance DESC
I get that an article with 130 words is 85% similar to another article with 4700 words, whereas PHP's similar_text() returns only a 3% similarity score, which is well below our threshold and is correct in our case.
I've also looked into the Levenshtein distance algorithm, but it seems that the same problem as with MySQL and Solr arises.
There has to be a better way to handle similarity checks, maybe I'm using the algorithms incorrectly?
Based on some of the Comments, I might propose this...
It seems that 75%-similar documents would have a lot of the same sentences in the same order.
Break the doc into sentences
Take a crude hash of each sentence, map it to a visible ascii character. This gives you a string that is, perhaps, 1/100th the size of the original doc.
Store that with the doc.
When searching, use levenshtein() on this string to find 'similar' documents.
Sure, hashing is imperfect, etc. But this is fast. And you could apply some other technique to double-check the few docs that are close.
For a hash, I might do something like:
$md5 = md5($sentence);
// take 6 bits out of that hex string (here: the first two hex digits, masked)
$x = hexdec(substr($md5, 0, 2)) & 0x3F;
// map them onto a visible ASCII character ('0' + 0..63)
$hash = chr(ord('0') + $x);
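Putting the pieces together, here is a sketch of the whole pipeline (sentence split, per-sentence hash, signature comparison); note that PHP's levenshtein() is limited to strings of 255 characters, so very long documents would need chunking or another edit-distance implementation:
function signature(string $doc): string
{
    // split into sentences on ., ! or ? followed by whitespace
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($doc), -1, PREG_SPLIT_NO_EMPTY);
    $sig = '';
    foreach ($sentences as $sentence) {
        // crude 6-bit hash of the normalized sentence -> one visible char
        $x = hexdec(substr(md5(strtolower(trim($sentence))), 0, 2)) & 0x3F;
        $sig .= chr(ord('0') + $x);
    }
    return $sig;
}

// percent similarity of two documents via their stored signatures
function signatureSimilarity(string $sigA, string $sigB): float
{
    $maxLen = max(strlen($sigA), strlen($sigB));
    if ($maxLen === 0) {
        return 0.0;
    }
    // percentage of the longer signature that survives the edit distance
    return (1 - levenshtein($sigA, $sigB) / $maxLen) * 100;
}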

Event entries scheduling algorithm PHP

We have an entry portal system in which we accept entries for events.
For example, Championship Event 2017 will be held on 30th Nov.
The event got about 150 entries in different classes: Junior Class, Senior Class, Pro Class, etc.
The event venue only has a certain number of grounds on which the competition can be held, for example Ground 1, Ground 2 and Ground 3. It is a solo performance event.
Our system needs to generate a schedule in such a way that competitors who entered multiple classes, or the same class multiple times, get the maximum break between their performances.
The input data we have are the registrations under each class.
We know the starting time of each ground: for example Ground 1 will start at 8:00, Ground 2 at 8:00 and Ground 3 at 9:00.
We also know which class will be held in which arena: for example, Junior and Senior Class will be held in Ground 1 and Pro Class will be held in Ground 2.
We know the performance times as well: a Senior Class performance is 5 minutes, a Junior Class performance is 7 minutes and a Pro performance is 9 minutes.
I have written the following code to build the schedule so that competitors competing multiple times in one class, or in multiple classes, get the maximum break between their performances, but it still puts the same competitor's performances one after another.
Let me know what my mistake is.
foreach ($totalPerformanceTimeSlot as $time => $performance) {
    # $totalPerformanceTimeSlot is an array of timeslots starting from 8:00 am
    foreach ($performance as $classId) {
        # there could be 2 performances at the same time in different arenas for different classes
        $totalPerformanceLeftThisClass = count($this->classRegistrationLinks[$classId]); // total performances for this class
        # $accountPerformanceLeftArray holds how many times each account is performing in this class
        arsort($accountPerformanceLeftArray);
        # for each person, estimate what their start time threshold should be based on how many times they're performing
        $accountPerformanceTimeThreshold = array();
        foreach ($accountPerformanceLeftArray as $accountId => $accountPerformancesLeft) {
            $tempPerformanceThreshold = 20 * 60;
            # reduce this person's threshold one performance at a time until the minimum threshold has been met
            while ((($totalPerformanceLeftThisClass * $this->classes[$classId]['performanceTime']) / $accountPerformancesLeft < $tempPerformanceThreshold) && ($tempPerformanceThreshold > $this->minRideThreshold))
                $tempPerformanceThreshold -= $this->classes[$classId]['performanceTime'];
            $accountPerformanceTimeThreshold[$accountId] = $tempPerformanceThreshold;
        }
        $performanceLeft = $totalPerformanceLeftThisClass - $count; // $count: presumably the performances already allocated
        # given the number of performances left in the class,
        # calculate how important it is per account that they get placed in the next slot
        $accountToPerformNextImportanceArray = array();
        $timeLeft = $performanceLeft * $this->classes[$classId]['performanceTime'];
        foreach ($accountPerformanceLeftArray as $accountId => $accountPerformancesLeft) {
            # work out the maximum number that can be used as entropy
            $entropyMax = (20 * 60 / ($timeLeft / 1)) * 0.5;
            $entropy = ((mt_rand(0, $entropyMax * 1000)) / 1000);
            # the absolute minimum amount of time required for this user to perform
            $minTimeRequiredForComfortableSpacing = ($accountPerformancesLeft - 1) * 20 * 60;
            # add a bit of time around the absolute minimum so that it doesn't instantly snap in
            # when this person suddenly has the minimum amount of time left to perform
            $generalTimeRequiredForComfortableSpacing = $minTimeRequiredForComfortableSpacing * 1.7;
            $nearestPerformancePrior = $this->nearest_performance_prior($classDetails['date'], $currentTime, $accountId);
            $nearestPerformanceAfter = $this->nearest_performance_after($classDetails['date'], $currentTime, $accountId);
            # work out how important it is for this rider to ride next based on how many rides they have left
            $importanceRating = 20 * 60 / ($timeLeft / $accountPerformancesLeft);
            # if there's more than enough time left then don't worry about giving this person any importance rating,
            # ie. it's not really important that they perform straight away
            if ($timeLeft > $generalTimeRequiredForComfortableSpacing)
                $importanceRating = 0;
            # add a little bit of random entropy to their importance rating
            $importanceRating += $entropy;
            # if this account has performed too recently to place them in this slot, make them very undesirable for it
            if ((!is_null($nearestPerformancePrior)) && ($nearestPerformancePrior > $currentTime - $accountPerformanceTimeThreshold[$accountId]))
                $importanceRating = -1;
            # likewise if this account will perform too soon afterwards
            if ((!is_null($nearestPerformanceAfter)) && ($nearestPerformanceAfter < $currentTime + $accountPerformanceTimeThreshold[$accountId]))
                $importanceRating = -1;
            $accountToPerformNextImportanceArray[$accountId] = $importanceRating;
        }
        arsort($accountToPerformNextImportanceArray);
        // Then I take the first one from this array and allocate the time for that user.
        $this->set_performance_time($classDetails['date'], $accountId, $currentTime);
        $currentTime += $this->classes[$classId]['performanceTime'];
    }
}
Here is some explanation of the variables:
$accountPerformancesLeft is the total number of performances left for each user. For example, if a user has entered 2 classes (possibly the same class multiple times), $accountPerformancesLeft could be 6 for that user.
A threshold is something like a break.
"Rider" and "account" are conceptually the same.
I know it is hard to picture the output without the actual data, but any help would be appreciated.
Thank you
Well, first let's see what we have and simplify the problem:
There are different competitions (events), but since they are independent of each other we can consider only one.
We have C different classes (senior, junior, ...).
We have G different grounds, and each ground may hold some of the C classes.
There are some persons (competitors), let's say P, who register for the C classes.
Persons need to have the maximum possible break.
So, putting it all together, the problem is:
There are some grounds G = {g1, g2, ..., gm}, each of which contains some persons P = {p1, p2, ..., pn}. We want to maximize the break time of each person between all of their competitions.
The trivial case:
First, let's assume that there is only one ground g1, and a group of persons P = {p1, p2, ..., pn} who want to compete on this ground. Let's define a boolean method isItPossible(breaktime) that tells us whether it is possible to schedule the competition so that each person has at least breaktime to rest. We can simply prove that this method is monotonic, i.e. if there exists a breaktime for which isItPossible(breaktime) is true, then:
isItPossible(t) = true for every t <= breaktime
So we can use binary search to find the maximum value of breaktime. Here is the pseudo code (C++-like syntax):
double low = 0, high = INF;
// iterate until the bounds are close enough (a plain `low < high`
// test on doubles would loop practically forever)
while (high - low > EPS) {
    double mid = (low + high) / 2;
    if (isItPossible(mid))
        low = mid;   // mid works, try a longer break
    else
        high = mid;  // mid doesn't work, shorten the break
}
breakTime = low;
Now the only thing that remains is implementing the isItPossible(breaktime) method. There are a lot of ways to implement it, but I would use a greedy algorithm and a heap-based priority queue. The priority queue maintains tuples; each tuple contains a person, the number of times that person still has to compete, and the earliest time we can schedule a competition for that person. We start from time t0 (the opening time of the ground, e.g. 8.00 a.m.) and each time pick the person from the priority queue with the minimum earliest time. Here is the C++-like pseudo code:
bool isItPossible(double breaktime){
    // Tuple(personId, numberOfCompete, earliestTime), ordered by earliestTime (min-heap)
    priority_queue<Tuple> pq;
    for p in Person_list
        pq.push(Tuple(p, countCompetition(p), t0));
    for(time = t0; time < end_of_ground_time && !pq.isEmpty();){
        person = pq.pop();
        add_person_to_schedule_list(person.personId, max(time, person.earliestTime));
        time = max(time, person.earliestTime) + competition_time;
        if(person.numberOfCompete > 1)
            pq.push(Tuple(person.personId, person.numberOfCompete - 1, time + breaktime));
    }
    return pq.isEmpty();
}
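Since the question is PHP, here is a translation sketch of the same greedy check using SplPriorityQueue as a min-heap on the earliest allowed time (priorities are negated because SplPriorityQueue is a max-heap); the single-ground case, a fixed slot length, and the add_person_to_schedule_list() placeholder are assumptions:
function isItPossible(array $counts, int $t0, int $tEnd, int $slotLen, float $breaktime): bool
{
    // $counts: [personId => number of performances entered]
    $pq = new SplPriorityQueue();
    $pq->setExtractFlags(SplPriorityQueue::EXTR_DATA);
    foreach ($counts as $personId => $n) {
        $pq->insert(['id' => $personId, 'left' => $n, 'earliest' => $t0], -$t0);
    }

    $time = $t0;
    while (!$pq->isEmpty() && $time < $tEnd) {
        $p = $pq->extract();                 // person with the minimum earliest time
        $slot = max($time, $p['earliest']);  // schedule as early as allowed
        // add_person_to_schedule_list($p['id'], $slot);  // record the slot (placeholder)
        $time = $slot + $slotLen;
        if ($p['left'] > 1) {
            $next = $time + $breaktime;      // earliest next slot after the break
            $pq->insert(['id' => $p['id'], 'left' => $p['left'] - 1, 'earliest' => $next], -$next);
        }
    }
    return $pq->isEmpty();                   // everyone scheduled before closing time
}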
The main problem:
After solving the trivial case we are ready to solve the original problem. In this case there are G = {g1, g2, ..., gm} grounds and we want to schedule P = {p1, p2, ..., pn} competitors. As in the trivial case, we define an isItPossible(breaktime) function. Again we can prove that this function is monotonic, so we use binary search to find the maximum value (like the above code). After that we only need to implement the isItPossible(breaktime) method, which in this case is a little tricky.
For this method you can use heuristics or some creative greedy approach (for example, distribute each person's start times, based on breakTime, over all grounds and check whether it is possible to do so for all persons). But again I suggest you use a greedy algorithm and a priority queue, as in the trivial case. Your tuples should also contain the number of times the person competes in each ground, and when you increase the time and sweep, you should iterate over all grounds and schedule them simultaneously.
Hope this helps. Of course, there are also evolutionary algorithms such as genetic algorithms or PSO that can solve it (I can explain them as well if you want), but the above method is much simpler to implement and debug.
What an interesting problem!
Here is how I'd tackle it:
Set up a random schedule which works (but doesn't fit the criteria).
Write a function that can swap 2 performances
Write a shuffler which uses swap() many times in order to get a new timetable
Write a function score() that calculates how good a particular schedule is: does it have a lot of breaks between performances?
The score should sum all the performance gaps together; this is the function we want to maximise.
Write an algorithm that takes a "search" approach with backtracking, and let it run for a couple of hours (a sketch follows below); the backtracking should:
Swap stuff
See if the swapped stuff has a better score
if so, continue from swapped
otherwise, backtrack
It can take a while, but the program can generate a better timetable.
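A minimal PHP sketch of that loop; the schedule representation (a flat array of slots, each ['accountId' => ..., 'time' => unix seconds]) is an assumption, and real code would also have to respect the ground/class constraints when swapping:
function score(array $schedule): int
{
    // sum the gaps between consecutive performances of each competitor
    $byCompetitor = [];
    foreach ($schedule as $slot) {
        $byCompetitor[$slot['accountId']][] = $slot['time'];
    }
    $total = 0;
    foreach ($byCompetitor as $times) {
        sort($times);
        for ($i = 1; $i < count($times); $i++) {
            $total += $times[$i] - $times[$i - 1];
        }
    }
    return $total;
}

function improveSchedule(array $schedule, int $iterations = 100000): array
{
    $best = $schedule;
    $bestScore = score($best);
    $keys = array_keys($schedule);

    for ($i = 0; $i < $iterations; $i++) {
        // swap the competitors of two random slots (the times stay fixed)
        $candidate = $best;
        $a = $keys[array_rand($keys)];
        $b = $keys[array_rand($keys)];
        [$candidate[$a]['accountId'], $candidate[$b]['accountId']] =
            [$candidate[$b]['accountId'], $candidate[$a]['accountId']];

        // keep the swap only if it improves the score; otherwise "backtrack" to $best
        $candidateScore = score($candidate);
        if ($candidateScore > $bestScore) {
            $best = $candidate;
            $bestScore = $candidateScore;
        }
    }
    return $best;
}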
Let us know if this approach helps.

Efficient way of emulating LIMIT (FETCH), OFFSET in Progress OpenEdge 10.1B SQL using PHP

I want to be able to use the equivalent of MySQL's LIMIT, OFFSET in Progress OpenEdge 10.1b.
Whilst the FETCH/OFFSET commands are available as of Progress OpenEdge 11, version 10.1B unfortunately does not have them, so it is difficult to produce paged recordsets (e.g. records 1-10, 11-20, 21-30, etc.).
ROW_NUMBER is also not supported by 10.1B. The functionality seems to be pretty much the same as that found in SQL Server 2000.
If always searching in the order of the primary key id (pkid), this could be achieved by using "SELECT TOP 10 * FROM table ORDER BY pkid ASC", then identifying the last pkid and finding the next set with "SELECT TOP 10 * FROM table WHERE pkid > last_pkid ORDER BY pkid ASC"; this, however, only works when sorting by the pkid.
My solution was to write a PHP function where I could pass the limit and offset and then return only the results where the row number was between those defined values. I use TOP to return no more than the sum of the limit and offset.
function limit_query($sql, $limit = NULL, $offset = 0)
{
    $out = array();
    if ($limit != NULL) {
        // str_replace_first() is a small helper that replaces only the first occurrence
        $sql = str_replace_first("SELECT", "SELECT TOP " . ($limit + $offset), $sql);
    }
    $query = $this->db->query($sql); // $this->db is my DB wrapper class
    $i = 0;
    while ($row = $this->db->fetch($query)) {
        if ($i >= $offset) { // only add to the return array once past the offset
            $out[] = $row;
        }
        $i++;
    }
    $this->db->free_result($query);
    return $out;
}
This works well on small recordsets or on the first few pages of results, but if the total results are in the thousands and you want to see results on page 20, 100 or 300, it is very slow and inefficient (page one queries only the first 10 results, page 2 the first 20, but page 100 queries the first 1000).
In most cases the user will probably not venture past page 2 or 3, so the lack of efficiency isn't perhaps a major issue, but I do wonder whether there is a more efficient way of emulating this functionality.
Sadly, upgrading to a newer version of Progress, or a superior database such as MySQL, is not an option, as the db is provided by third-party software.
Can anyone suggest alternative, more efficient methods?
I am not sure I fully understand the question, so here's an attempt to give you an answer:
You probably won't be able to do what you want with a single hit to the db. Just by sorting records / adding functions you probably won't achieve the paging functionality you are trying to get. As far as I know, Progress won't number the rows unless, as you said, you're sorting by some ascending pkid.
My suggestion would be a procedure on the back end that builds the query with a batch size the same as the page (in your case 10) and uses a loop to get the next batch until you reach the ones you need. Look into batching datasets, or use an open query using MAX-ROWS.
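Not ABL, but in the same spirit on the PHP side, a sketch that walks forward in page-sized batches using the last seen pkid, so each query is a cheap indexed TOP; the table/column names and the $db wrapper (query/fetch/free_result, as in the question) are assumptions:
function fetch_page($db, $page, $pageSize = 10)
{
    $lastPkid = 0;
    $rows = array();
    for ($p = 1; $p <= $page; $p++) {
        $query = $db->query(
            "SELECT TOP $pageSize * FROM mytable WHERE pkid > $lastPkid ORDER BY pkid ASC"
        );
        $rows = array();
        while ($row = $db->fetch($query)) {
            $rows[] = $row;
            $lastPkid = $row['pkid'];
        }
        $db->free_result($query);
        if (count($rows) < $pageSize) {
            break; // ran out of records before the requested page
        }
    }
    return $rows; // the batch for the requested page
}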
Hope it helps, or at least gives you an idea to get this. I actually like your PHP implementation, it seems like a good workaround, not ugly to keep.
You should be able to install an upgraded version of Progress, convert your database(s) and recompile the code against the new version. Normally your support contract through your vendor would provide you with the latest version of Progress (OpenEdge), so this shouldn't be a huge issue. Going from version 10 to 11 shouldn't cause any compile issues and would give you all of the SQL benefits of the newer version.
Honestly your comment about MySql being superior is a little confusing, but that's a discussion for another day. ;D
Best regards!

Time Prediction based on existing date:time records

I have a system that logs date:time and it returns results such as:
05.28.2013 11:58pm
05.27.2013 10:20pm
05.26.2013 09:47pm
05.25.2013 07:30pm
05.24.2013 06:24pm
05.23.2013 05:36pm
What I would like is to produce a list of date:time predictions for the next few days, so a person could see when the next event might occur.
Example of prediction results:
06.01.2013 04:06pm
05.31.2013 03:29pm
05.30.2013 01:14pm
Any thoughts on how to go about doing time prediction of this kind with PHP?
The basic answer is "no". Programming tools are not designed to do prediction. Statistical tools are designed for that purpose. You should be thinking more about R, SPSS, SAS, or some other similar tool. Some databases have rudimentary data analysis tools built-in, which is another (often inferior) option.
The standard statistical technique for time-series prediction is called ARIMA analysis (auto-regressive integrated moving average). It is unlikely that you are going to be implementing that in php/SQL. The standard statistical technique for estimating time between events is Poisson regression. It is also highly unlikely that you are going to be implementing that in php/SQL.
I observe that your data points are once per day in the evening. I might guess that this is the end of some process that runs during the day. The end time is based on the start time and the duration of the process.
What can you do? Often a reasonable prediction is "what happened yesterday". You would be surprised at how hard it is to beat this prediction for weather forecasting and for estimating the stock market. Another very reasonable method is the average of historical values.
If you know something about your process, then an average by day of the week can work well. You can also get more sophisticated, and do Monte Carlo estimates, by measuring the average and standard deviation, and then pulling a random value from a statistical distribution. However, the average value would work just as well in your case.
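To illustrate the "average of historical values" idea in the question's language, a naive sketch (this is a mean-interval projection, not a statistical forecast; the date format just mirrors the question's log):
function predictNext(array $timestamps, int $count = 3): array
{
    if (count($timestamps) < 2) {
        return []; // not enough history to measure an interval
    }
    sort($timestamps); // unix timestamps of the past events

    $gaps = [];
    for ($i = 1; $i < count($timestamps); $i++) {
        $gaps[] = $timestamps[$i] - $timestamps[$i - 1];
    }
    $avgGap = array_sum($gaps) / count($gaps);

    $predictions = [];
    $last = end($timestamps);
    for ($i = 1; $i <= $count; $i++) {
        $predictions[] = date('m.d.Y h:ia', (int) round($last + $i * $avgGap));
    }
    return $predictions;
}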
I would suggest that you study a bit about statistics/data mining/predictive analytics before attempting to do any "predictions". At the very least, if you really have a problem in this domain, you should be looking for the right tools to use.
As Gordon Linoff posted, the simple answer is "no", but you can write some code that will give a rough guess on what the next time will be.
I wrote a very basic example on how to do this on my site http://livinglion.com/2013/05/next-occurrence-in-datetime-sequence/
Here's a possible way that this could be done, using PHP + MySQL:
You can have a table with two fields: a DATE field and a TIME field (essentially storing the date + time portion separately). Say that the table is named "timeData" and the fields are:
eventDate: date
eventTime: time
Your primary key would be the combination of eventDate and eventTime, so that they're never repeated as a pair.
Then, you can do a query like:
SELECT eventTime, count(*) as counter FROM timeData GROUP BY eventTime ORDER BY counter DESC LIMIT 0, 10
The aforementioned query will always return the first 10 most frequent event times, ordered by frequency. You can then order these again from smallest to largest.
This way, you can return quite accurate time prediction results, which will become even more accurate as you gather more data each day.

Popularity Algorithm

I'd like to populate the homepage of my user-submitted-illustrations site with the "hottest" illustrations uploaded.
Here are the measures I have available:
How many people have favourited that illustration (the votes table includes the date voted)
When the illustration was uploaded (the illustration table has a date created)
The number of comments (not so good, as the maximum number of comments totals about 10 at the moment; the comments table has the comment date)
I have searched around, but most algorithms include user authority, which I don't want to play a part.
I also need to work out whether it's better to do the calculation in the MySQL that fetches the data, or to have a PHP/cron method run every hour or so.
I only need 20 illustrations to populate the home page. I don't need any sort of paging for this data.
How do I weight age against votes? Surely a site with fewer submissions needs less weight on date added?
Many sites that use some type of popularity ranking do so with a standard algorithm that determines a score and then lets it decay eternally over time. What I've found works better for sites with less traffic is a multiplier that gives a bonus to new content/activity: it's essentially the same, but the score stops changing after a period of time of your choosing.
For instance, here's a pseudo-example of something you might want to try. Of course, you'll want to adjust how much weight you're attributing to each category based on your own experience with your site. Comments are rare, but take more effort from the user than a favorite/vote, so they probably should receive more weight.
score = (votes / 10) + comments
age = UNIX_TIMESTAMP() - UNIX_TIMESTAMP(date_created)
if(age < 86400) score = score * 1.5
This type of approach would give a bonus to new content uploaded in the past day. If you wanted to approach this in a similar way only for content that had been favorited or commented on recently, you could just add some WHERE constraints on your query that grabs the score out from the DB.
There are actually two big reasons NOT to calculate this ranking on the fly.
Requiring your DB to fetch all of that data and do a calculation on every page load just to reorder items results in an expensive query.
Probably a smaller gotcha, but if you have a relatively small amount of activity on the site, small changes in the ranking can cause content to move pretty drastically.
That leaves you with either caching the results periodically or setting up a cron job to update a new database column holding this score you're ranking by.
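A sketch of the cron-job variant, run hourly; the table and column names (illustrations, votes, comments, score, date_created) are assumptions, and the weights follow the pseudo-example above:
// cron: update_scores.php
$pdo = new PDO('mysql:host=localhost;dbname=gallery', 'user', 'pass');

$pdo->exec("
    UPDATE illustrations i
    SET i.score =
        ( (SELECT COUNT(*) FROM votes v    WHERE v.illustration_id = i.id) / 10
        + (SELECT COUNT(*) FROM comments c WHERE c.illustration_id = i.id) )
        * IF(i.date_created > NOW() - INTERVAL 1 DAY, 1.5, 1)
");

// the homepage then only needs:
// SELECT * FROM illustrations ORDER BY score DESC LIMIT 20;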
Obviously there is some subjectivity in this - there's no one "correct" algorithm for determining the proper balance - but I'd start out with something like votes per unit age. MySQL can do basic math, so you can ask it to sort by the quotient of votes over age; however, for performance reasons it might be a good idea to cache the result of the query. Maybe something like
SELECT images.url FROM images
ORDER BY (SELECT COUNT(*) FROM votes WHERE votes.image_id = images.id)
         / TIMESTAMPDIFF(SECOND, images.date, NOW()) DESC
LIMIT 20
but my SQL is rusty ;-)
Taking a simple average will, of course, bias in favor of new images showing up on the front page. If you want to remove that bias, you could, say, count only those votes that occurred within a certain time limit after the image was posted. For images more recent than that time limit, you'd have to normalize by multiplying the number of votes by the time limit and dividing by the age of the image. Alternatively, you could give the votes a continuously varying weight, something like exp(-time(vote) + time(image)). And so on... depending on how particular you are about what this algorithm should do, it could take some experimentation to figure out which formula gives the best results.
I've no useful ideas as far as the actual algorithm is concerned, but in terms of implementation I'd suggest caching the result somewhere, with a periodic update - if the computation results in an expensive query, you probably don't want to slow your response times.
Something like:
(count_favorited + k) / time_since_last_activity
The higher k is, the less weight the number of favourites carries.
You could also change the time to something like the time it first appeared + the time of the last activity; this would ensure that older illustrations vanish over time.
