I have an application that stores data in a database, and I need search functionality to work on this data.
For this to work I need a "relevance" score: a value calculated from a set of criteria that can then be used to order a set of results.
Say, for instance, the user enters three keywords: X, Y and Z. I need to generate a score for each database entry, and I want the criteria to be based on how many times each keyword appears.
Example:
Database Entry A - X appears 8 times, Y appears once and Z appears once, giving a collective score of 10.
Database Entry B - X appears 24 times, Y does not appear and Z does not appear, giving a collective score of 24.
Here's my problem: Database Entry A IS more relevant to the search for X, Y and Z, because it contains all three keywords rather than just one, yet a straightforward sum would rank Database Entry B as more relevant.
I need a way to calculate a numeric score for each result based not just on how many times each keyword appears, but one that also gives much higher scores to results that contain more of the distinct keywords (i.e. with 10 keywords entered, results where all 10 appear should rank above results with a large number of occurrences of just one keyword).
I need to achieve this in PHP, which retrieves the database results and feeds them back to my web page.
You could compute two relevance scores: one that counts how many of the searched keywords matched at all (field_count), and your regular "how many matches were found" total (match_count). From your examples, that would give:
Example A - field_count: 3, match_count: 10
Example B - field_count: 1, match_count: 24
and then have your query do
ORDER BY field_count DESC, match_count DESC
so that entries matching more of the keywords are sorted first.
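A sketch of how both scores could be computed in the query itself, assuming MySQL, a table named entries and a text column named body (both names are just placeholders for illustration):

-- field_count: how many of the keywords appear at all
-- match_count: total number of occurrences, using the LENGTH/REPLACE counting trick
SELECT id,
       ((body LIKE '%X%') + (body LIKE '%Y%') + (body LIKE '%Z%')) AS field_count,
       ((LENGTH(body) - LENGTH(REPLACE(body, 'X', ''))) / LENGTH('X')
      + (LENGTH(body) - LENGTH(REPLACE(body, 'Y', ''))) / LENGTH('Y')
      + (LENGTH(body) - LENGTH(REPLACE(body, 'Z', ''))) / LENGTH('Z')) AS match_count
FROM entries
ORDER BY field_count DESC, match_count DESC;

In practice you would build the repeated keyword expressions in PHP from the user's search terms and escape them properly.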
Since the (first) presence of a keyword is so important, give it a better score than the rest of the occurrences. For example:
// $keywords maps each searched keyword to how many times it appears in the entry
$score = 0;
foreach ($keywords as $count) {
    // a large fixed bonus for the keyword being present at all,
    // plus the raw occurrence count as a tie-breaker
    $score += $count == 0 ? 0 : 1000000;
    $score += $count;
}
If you apply this algorithm to your example, you will have:
Entry A ---> (1000000 + 8) + (1000000 + 1) + (1000000 + 1) = 3000010
Entry B ---> (1000000 + 24) = 1000024
So Entry A scores better than Entry B, as you wanted.
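For completeness, a small sketch of how the $keywords counts could be built per row in PHP, assuming the searchable text is in a column called body and the user's terms are in $searchTerms (both names are illustrative):

// $row is one database record; count each search term's occurrences in its text
$keywords = array();
foreach ($searchTerms as $term) {
    // substr_count() is case-sensitive, so lower-case both sides first
    $keywords[$term] = substr_count(strtolower($row['body']), strtolower($term));
}
// $keywords can now be fed straight into the scoring loop above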
Related
I want to display 10 rows from the "questions" table, with probability 0.2 for questions that have type_id = 1 and probability 0.8 for questions that have type_id = 2.
Below is my query; how do I add the probability?
$query = "select * from questions ORDER BY RAND() LIMIT 10";
I want to display 10 questions, of which 20% have type_id = 2 and 80% have type_id = 1.
Can someone help me please?
As I noted in the comments, you won't be able to use anything as simple as ORDER BY RAND() if you want to include weighted probabilities; it simply doesn't support that kind of thing. ORDER BY RAND() is also very slow, and not really suitable for use on a table of any significant size anyway.
There are a whole bunch of approaches you can use to do a random sort order with weighting or probabilities; I'm not going to try to discuss them all; I'll just give you a relatively simple one, but please be aware that the best technique for you will depend on your specific use case.
A simple approach would be something like this:
Create a new integer field on your table called weight or something similar.
Add a DB index for this field to enable you to query it quickly.
Set the first record's weight to its weighting expressed as a whole number, i.e. a probability of 0.2 could be a weight of 20.
Set each subsequent record to the maximum value of this field so far plus the weight for that record. So if the second record is also 0.2, it would get a value of 40; if the one after that is only 0.1, it would be 50; and so on.
Do likewise for any new records that get added (a SQL sketch of this step follows this list).
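As a rough sketch of that last step (assuming the questions table has name, type_id and weight columns; adjust the names to your schema), a new record with probability 0.2 could be inserted like this:

-- weight = current maximum weight plus this record's own weight step (0.2 -> 20)
INSERT INTO questions (name, type_id, weight)
SELECT 'New question', 1, COALESCE(MAX(weight), 0) + 20
FROM questions;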
Now you can select a random record, with different weights for each record, as follows:
SELECT * FROM questions
WHERE weight >= FLOOR(RAND() * (SELECT MAX(weight) FROM questions))
ORDER BY weight
LIMIT 1
(note: I'm writing this answer in a hurry and without a way to test it; I haven't run this query, so I may have the syntax wrong, but the basic technique is sound)
This will pick a random number between zero and the largest weight value, and then find the question record with the smallest weight that is greater than or equal to that random number.
Also, because the weight field is indexed, this query will be quick and efficient.
Downsides of this technique: It assumes that the weights for any given record won't change. If the weight of a record does need to change, then you would have to update the weight value for every record after it in the index.
[EDIT]
Let's imagine a table like this:
id Name
1 Question One
2 Question Two
3 Question Three
4 Question Four
5 Question Five
In this example, we want Questions 1 and 2 to have a probability of 0.2, question 3 to have a probability of 0.1 and questions 4 and 5 to have a probability of 0.3. Those probabilities can be expressed as integers by multiplying them by 100 (multiplying by 10 would also work, but 100 means we can have probabilities like 0.15 as well).
We add the weight column and the index for it, and set the weight values as follows:
id Name Weight
1 Question One 20
2 Question Two 40 (ie previous value + 20)
3 Question Three 50 (ie previous value + 10)
4 Question Four 80 (ie previous value + 30)
5 Question Five 110 (ie previous value + 30)
Now we can run our query.
The random part of the query, FLOOR(RAND() * (SELECT MAX(weight) FROM questions)), will produce an integer between 0 and 109. Let's imagine it gives 68.
The rest of the query then picks the first record whose weight is greater than or equal to 68. In this case, that means the record we get is record #4.
This gives us our probability, because the random number could land anywhere, but it is more likely to land in a given record's range when the gap between its weight and the previous record's weight is larger. You'll get record #4 three times as often as record #3.
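To get the 10 questions you asked about, one way is to simply repeat the single-record query, excluding the ids you have already picked. A rough PHP sketch using PDO (assuming an integer primary key column id and at least 10 rows in the table; note that excluding picked rows does shift the remaining probabilities slightly):

// $pdo is an existing PDO connection
$picked = array();
$questions = array();
while (count($questions) < 10) {
    $exclude = $picked ? implode(',', array_map('intval', $picked)) : '0';
    $sql = "SELECT * FROM questions
            WHERE weight >= FLOOR(RAND() * (SELECT MAX(weight) FROM questions))
              AND id NOT IN ($exclude)
            ORDER BY weight
            LIMIT 1";
    $row = $pdo->query($sql)->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        continue; // the random value landed on an already-picked record; try again
    }
    $picked[] = $row['id'];
    $questions[] = $row;
}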
I want to make an order form where a user can choose a clothing size. I have all the available sizes stored in a database, as code, size and category.
size is formatted like 66/68/76 (Waist/Hips/Leg Length). The user is able to input these three values. If the user's size is available, there's no problem; but if it's not, I want the site to offer or change it to the nearest available size. For example, if the user entered 65/66/74 and that exact value doesn't exist (or is unavailable right now), it would be changed to 66/65/74.
You need to define what you mean by "closest". Along the way, you should also store the three values in three different columns. Storing multiple values in a single column is a bad idea.
Sometimes, one is stuck with a particular data format because of someone else's poor design decisions.
One perhaps reasonable measure is the (squared) Euclidean distance -- the sum of the squared differences of the components. You can calculate this in MySQL:
select t.*
from (select t.*,
substring_index(size, '/', 1) as waist,
substring_index(substring_index(size, '/', 2), '/', -1) as hips,
substring_index(size, '/', -1) as legs
from t
) t
order by pow(waist - $waist, 2) + pow(hips - $hips, 2) + pow(legs - $legs, 2)
limit 1;
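A sketch of running that query from PHP with bound values rather than interpolating $waist, $hips and $legs directly into the SQL (PDO; the table name sizes is an assumption):

// $pdo is an existing PDO connection; $waist, $hips, $legs come from the user's input
$sql = "SELECT s.*
        FROM (SELECT s.*,
                     SUBSTRING_INDEX(size, '/', 1) AS waist,
                     SUBSTRING_INDEX(SUBSTRING_INDEX(size, '/', 2), '/', -1) AS hips,
                     SUBSTRING_INDEX(size, '/', -1) AS legs
              FROM sizes s) s
        ORDER BY POW(waist - :waist, 2) + POW(hips - :hips, 2) + POW(legs - :legs, 2)
        LIMIT 1";
$stmt = $pdo->prepare($sql);
$stmt->execute(array('waist' => (int) $waist, 'hips' => (int) $hips, 'legs' => (int) $legs));
$nearest = $stmt->fetch(PDO::FETCH_ASSOC); // closest available size, or false if the table is empty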
One way would be to store the 3 values as a single integer, 666876 (66/68/76).
Then you can query for the minimum value of the subtraction productSize - userSize that is greater than 0.
This approach always picks the next larger size.
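A sketch of that query, assuming a column size_int holding the packed integer and :userSize bound from PHP with the user's three values packed the same way (both names are illustrative):

-- smallest positive difference = the next larger packed size
SELECT *
FROM sizes
WHERE size_int - :userSize > 0
ORDER BY size_int - :userSize
LIMIT 1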
I'm retrieving 4 random rows from a table. However, I'd like it such that more weight is given to rows that had just been inserted into the table, without penalizing older rows much.
Is there a way to do this in PHP / SQL?
SELECT *, (RAND() / id) AS o FROM your_table ORDER BY o LIMIT 4
This will order by o, where o is a random value between 0 and 1/id, which means that the newer the row (the higher its id), the smaller its o value tends to be, so newer rows are more likely to come first (but still in a random order).
I think an agreeable solution would be to use an asymptotic function (1/x) in combination with weighting.
The following has been tested:
SELECT *, (RAND()*10 + (1/(max_id - id + 1))) AS weighted_random
FROM tbl1
ORDER BY weighted_random DESC
LIMIT 4
If you want to get the max_id within the query above, just replace max_id with:
(SELECT id FROM tbl1 ORDER BY id DESC LIMIT 1)
Examples:
Let's say your max_id is 1000 ...
For each of several ids I will calculate out the value:
1/(1000 - id + 1) , which simplifies out to 1/(1001 - id):
id: 1000
1/(1001-1000) = 1/1 = 1
id: 999
1/(1001-999) = 1/2 = .5
id: 998
1/(1001-998) = 1/3 = .333
id: 991
1/(1001-991) = 1/10 = .1
id: 901
1/(1001-901) = 1/100 = .01
The nature of this 1/x makes it so that only the numbers close to max have any significant weighting.
You can see a graph of + more about asymptotic functions here:
http://zonalandeducation.com/mmts/functionInstitute/rationalFunctions/oneOverX/oneOverX.html
Note that the right side of the graph with positive numbers is the only part relevant to this specific problem.
Manipulating our equation to do different things:
(Rand()*a + (1/(b*(max_id - id + 1/b))))
I have added two values, "a", and "b"... each one will do different things:
The larger "a" gets, the less influence order has on selection. It is important to have a relatively large "a", or pretty much only recent ids will be selected.
The larger "b" gets, the more quickly the asymptotic curve will decay to insignificant weighting. If you want more of the recent rows to be weighted, I would suggest experimenting with values of "b" such as: .5, .25, or .1.
The 1/b at the end of the equation offsets problems you have with smaller values of b that are less than one.
Note:
This is not a very efficient solution when you have a large number of ids (just like the other solutions presented so far), since it calculates a value for each separate id.
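Putting it together, a sketch of the tunable query driven from PHP (PDO; $a and $b are the knobs described above, and max_id is computed inline as suggested):

// $a and $b are set in code, never from user input, so interpolating them is safe here
$a = 10;  // larger = ordering by recency matters less
$b = 1;   // larger = the weighting decays faster for older ids
$sql = "SELECT *,
               (RAND() * $a + (1 / ($b * ((SELECT id FROM tbl1 ORDER BY id DESC LIMIT 1) - id + 1/$b)))) AS weighted_random
        FROM tbl1
        ORDER BY weighted_random DESC
        LIMIT 4";
$rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);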
... ORDER BY (RAND() + 0.5 * id/maxId) DESC
This will add half of the id/maxId ratio to the random value, i.e. for the newest entry 0.5 is added (as id/maxId = 1) and for the oldest entry almost nothing is added.
Similarly, you can implement other weighting functions; it depends on how exactly you want to weight the values.
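As a complete statement with maxId computed inline (using the your_table name from the answer above):

SELECT *
FROM your_table
ORDER BY (RAND() + 0.5 * id / (SELECT MAX(id) FROM your_table)) DESC
LIMIT 4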
There are a few hundred book records in the database, and each record has a publish time. On the homepage of the website I am required to write some code to randomly pick 10 books and put them there. The requirement is that newer books need to have a higher chance of being displayed.
Since the time is an integer, I am thinking of calculating the probability for each book like this:
Probability of a book being drawn = (current time - publish time of the book) / ((current time - publish time of book 1) + (current time - publish time of book 2) + ... + (current time - publish time of book n))
After a book is drawn, the next round of the loop subtracts that book's (current time - publish time) from the denominator and recalculates the probability for each of the remaining books; the loop continues until 10 books have been drawn.
Is this algorithm correct?
By the way, the website is written in PHP.
Feel free to suggest some PHP code if you have a better algorithm in mind.
Many thanks to you all.
Here's a very similar question that may help: Random weighted choice The solution is in C# but the code is very readable and close to PHP syntax so it should be easy to adapt.
For example, here's how one could do this in MySQL:
First calculate the total age of all books and store it in a MySQL user variable:
SELECT SUM(TO_DAYS(CURDATE())-TO_DAYS(publish_date)) FROM books INTO @total;
Then choose books randomly, weighted by their age:
SELECT book_id FROM (
SELECT book_id, publish_date, TO_DAYS(CURDATE())-TO_DAYS(publish_date) AS age FROM books
) b
WHERE book_id NOT IN (...list of book_ids chosen so far...)
AND RAND()*@total < b.age AND (@total:=@total-b.age)
ORDER BY b.publish_date DESC
LIMIT 10;
Note that @total decreases only if a book has passed the random-selection test, because of the short-circuiting of AND expressions.
This is not guaranteed to choose 10 books in one pass -- it's not even guaranteed to choose any books on a given pass. So you have to re-run the second step until you've found 10 books. The @total variable retains its decreased value, so you don't have to recalculate it.
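A rough PHP sketch of driving those two steps until 10 books have been collected (PDO; assumes at least 10 rows in books and reuses the queries above on the same connection so @total persists):

// $pdo is an existing PDO connection
$pdo->query("SELECT SUM(TO_DAYS(CURDATE()) - TO_DAYS(publish_date)) FROM books INTO @total");

$chosen = array();
while (count($chosen) < 10) {
    $exclude = $chosen ? implode(',', array_map('intval', $chosen)) : '0';
    $sql = "SELECT book_id FROM (
                SELECT book_id, publish_date,
                       TO_DAYS(CURDATE()) - TO_DAYS(publish_date) AS age
                FROM books
            ) b
            WHERE book_id NOT IN ($exclude)
              AND RAND()*@total < b.age AND (@total := @total - b.age)
            ORDER BY b.publish_date DESC
            LIMIT 10";
    foreach ($pdo->query($sql)->fetchAll(PDO::FETCH_COLUMN) as $bookId) {
        if (count($chosen) < 10) {
            $chosen[] = $bookId;
        }
    }
}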
First off, I think your formula will favour older books rather than newer ones, which is the opposite of what you want. Try setting your initial probabilities based on:
Age - days since publication
Max(Age) - the age of the oldest book in the sample
Book Age(i) - the age of book i
Prob(i) = [Max(Age) + e - Book Age(i)] / sum over all i of [Max(Age) + e - Book Age(i)]
The value e ensures that the oldest book still has some probability of being selected. Now that that is done, you can always recalculate the probabilities for any sample.
Now you have to find an UNBIASED way of picking books. Probably the best way is to calculate the cumulative distribution using the above, then pick a uniform (0,1) random variable, find where that value falls in the cumulative distribution, and pick the corresponding book.
Can't help you on the coding. Make sense?
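Since the approach above is described only in words, here is a rough PHP sketch of it, assuming $books is an array of rows that each have an id and an age in days (names are illustrative, not from the question's schema):

$e = 1; // small constant so the oldest book keeps a non-zero probability
$maxAge = max(array_column($books, 'age'));

$chosen = array();
for ($n = 0; $n < 10 && count($books) > 0; $n++) {
    // build the cumulative distribution over the remaining books
    $cumulative = array();
    $total = 0;
    foreach ($books as $i => $book) {
        $total += $maxAge + $e - $book['age'];
        $cumulative[$i] = $total;
    }

    // draw a uniform value in [0, total) and find the first book whose cumulative weight exceeds it
    $r = mt_rand(0, mt_getrandmax() - 1) / mt_getrandmax() * $total;
    foreach ($cumulative as $i => $threshold) {
        if ($r < $threshold) {
            $chosen[] = $books[$i];
            unset($books[$i]); // don't draw the same book twice
            break;
        }
    }
}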
I'm not sure of the best way to go about this.
I want to create a tournament bracket of 2, 4, 8, 16, 32, etc. teams.
The winner of the first two will play the winner of the next two, and so on.
All the way until there is a winner.
Can anyone help me?
OK, so here's some more information.
Initially I want to come up with a way to create the tournament with 2, 4, 8, 16, etc. players.
Then, when I have all the users in place: if there are 16 players, there are 8 fixtures.
At this point I will send the fixture to the database.
When all the players that won are through to the next round, I would want another SQL query for the 2 winners that meet.
Do you see what I mean?
I did something like this a few years ago. It was quite a while ago and I'm not sure I'd do it the same way (it doesn't really scale to double-elimination or the like). How you output it might be a different question; I resorted to tables, as this was in 2002-2003, and there are certainly better techniques today.
The number of rounds in the tournament is log2(players) + 1, as long as players is one of the numbers you specified above. Using this information you can calculate how many rounds there are; the last round contains the final winner.
I stored the player information something like this (tweak it for best practices):
Tournament
    Name
    Size

Players
    Tournament
    Name
    Position (0 to tournament.size - 1)

Rounds
    Tournament
    Round
    Position (max halves for each round)
    Winner (player position)
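A rough MySQL translation of that structure (a sketch only; column names and types are assumptions, not the original code, and the Rounds table is renamed Matches in the update further down):

CREATE TABLE Tournament (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    size INT NOT NULL                    -- 2, 4, 8, 16, ...
);

CREATE TABLE Players (
    tournament INT NOT NULL,             -- references Tournament.id
    name       VARCHAR(100) NOT NULL,
    position   INT NOT NULL,             -- 0 to tournament.size - 1
    PRIMARY KEY (tournament, position)
);

CREATE TABLE Rounds (
    tournament INT NOT NULL,             -- references Tournament.id
    round      INT NOT NULL,             -- 1 = first round
    position   INT NOT NULL,             -- which match within the round
    winner     INT NULL,                 -- position of the winning player
    PRIMARY KEY (tournament, round, position)
);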
Note that in all my queries below, I don't include the "Tournament = [tournament]" condition needed to identify the tournament; they all need it.
It's rather simple to query this with one query and split it out as needed for the different rounds. To get a player's next opponent (assuming there is one) in round 1, you simply need to get the next or previous player, depending on whether the player's position is even or odd:
SELECT * FROM Players WHERE Position = [playerPosition] + 1
SELECT * FROM Players WHERE Position = [playerPosition] - 1
For the next round, if the user's last Round.Position was even, you'll need to make sure that the next position up has a winner:
SELECT Player FROM Rounds WHERE Position = [playerRoundPosition] - 1
If not, the next player isn't decided, or there's a gap (don't allow gaps!)
If the user's last Round.Position was odd, you'll need to make sure there's a user below them AND that there's a winner below them; otherwise they should automatically be promoted to the next round (as there is no one to play):
SELECT COUNT(*) FROM Players WHERE Position > [Player.Position]
SELECT Player FROM Rounds WHERE Position = [playerRoundPosition] + 1
On a final note, I'm pretty sure you could reduce the number of queries you write by using something like the following:
SELECT Player FROM Rounds WHERE Position + Position % 2 = [playerRoundPosition]
SELECT Player FROM Rounds WHERE Position - Position % 2 = [playerRoundPosition]
Update:
Looking over my original post, I find that the Rounds table was a little ambiguous. In reality, it should be named Matches. A match is a competition between two players, with a winner. The final table should look more like this (only the name changed):
Matches
    Tournament
    Round
    Position (max halves for each round)
    Winner (player position)
Hopefully that makes it a bit clearer. When two players go up against each other (in a match), you store that information in this Matches table. This particular implementation depends on the position of the match to know which players participated.
I started numbering the rounds at 1 because that was clearer in my implementation. You may choose 0 (or even do something completely different, like go backwards) if you prefer.
In the first round, match 1 means players 1 and 2 participated; in match 2, players 3 and 4 participated. Essentially, in the first round, each match is simply the players at position and position + 1. You could also store this information in the Matches table if you need easier access to it; every time I used this data in the program I needed all the round and player information anyway.
After the first round, you look at the last round of matches. In round 2, match 1, the winners from matches 1 and 2 participate; in round 2, match 2, the winners from matches 3 and 4 participate. It should look pretty familiar, except that it uses the Matches table after round 1. I'm sure there's a more efficient way to do this repetitive task; I just never got enough time to refactor that code (it was refactored, just not that much).
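As an illustration of the first-round pairing described above, a rough PHP sketch that creates one Matches row per pair of players (PDO; table and column names follow the schema sketch earlier, with Rounds renamed to Matches, and player positions are assumed to be 0-based):

// $tournamentId identifies the tournament; $pdo is an existing PDO connection
$stmt = $pdo->prepare("SELECT COUNT(*) FROM Players WHERE tournament = ?");
$stmt->execute(array($tournamentId));
$playerCount = (int) $stmt->fetchColumn();

// a match row does not store its players: in round 1, match n is implicitly
// played by the two adjacent player positions (e.g. positions 0 and 1 for match 1)
$insert = $pdo->prepare(
    "INSERT INTO Matches (tournament, round, position, winner) VALUES (?, 1, ?, NULL)"
);
for ($match = 1; $match <= $playerCount / 2; $match++) {
    $insert->execute(array($tournamentId, $match));
}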
Use arrays and remove the losing teams from the main array. (But keep 'em on a separate array, for reference and reuse purposes).