I used this link to create a ranking algorithm for my website. It is working great. I basically have it done like this:
LOG10((s.views) + (s.likes * 2) + 1) * 287015 + UNIX_TIMESTAMP(m.check_time) AS Hotness
The query is then sorted by Hotness, which puts the highest-ranked rows on top. I would like to take Hotness, which is essentially a manipulated timestamp, and convert it to a score on a scale such as 1 - 100. Hotness looks like this in the query results: '1469612365.0402453'. Is there anything I can do with that?
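One way to get there (a sketch, not from the original thread; it assumes the result set has already been fetched into PHP with a Hotness column) is to min-max normalize Hotness onto a 1-100 scale:

// Sketch: map raw Hotness values onto 1-100 via min-max normalization.
// $rows is assumed to be the fetched result set, each row with a 'Hotness' key.
function hotnessToScore(array $rows): array
{
    $values = array_column($rows, 'Hotness');
    $min = min($values);
    $span = max($values) - $min;
    foreach ($rows as &$row) {
        $row['score'] = $span > 0
            ? (int) round(1 + 99 * ($row['Hotness'] - $min) / $span)
            : 100; // all rows identical: give them all the top score
    }
    unset($row);
    return $rows;
}

Because the scale is relative to the current result set, a row's score will drift down as newer, hotter rows arrive, which is usually what you want from a "hotness" display.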
I have a table of over 300,000 rows and I would like to render this data on a graph, but 300,000 rows isn't really necessary all at once. For example, even though there may be 100 rows of data for a given day, I don't need to display all of it if I'm showing a whole year's worth of data. So I would like to "granularize" the data.
I was thinking of getting everything and then using a script to remove what I don't need, but that seems like it would be much slower and harder on the database.
So here's what I have so far.
SET @row_number := 0;
SELECT @row_number := @row_number + 1 AS row_number,
       price, region, `timestamp`
FROM pricehistory;
This gives me all the rows and numbers them. I was planning on adding a WHERE clause to keep every 1000th row (i.e. every nth row), like this:
SET @row_number := 0;
SELECT @row_number := @row_number + 1 AS row_number,
       price, region, `timestamp`
FROM pricehistory
WHERE row_number % 1000 = 0;
But MySQL doesn't see row_number as a column for some reason. Any ideas? I've looked at other solutions online, but they don't seem to work for MySQL in particular.
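For reference, the immediate error is that a column alias from the SELECT list is not visible in the same query's WHERE clause. The usual workaround is to number the rows in a derived table and filter outside it (a sketch, using the question's own user-variable trick):

SELECT row_number, price, region, `timestamp`
FROM (
    SELECT @row_number := @row_number + 1 AS row_number,
           price, region, `timestamp`
    FROM pricehistory
    CROSS JOIN (SELECT @row_number := 0) AS init
) AS numbered
WHERE row_number % 1000 = 0;

On MySQL 8.0+ the ROW_NUMBER() window function does the same job without user variables.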
As Racil's comment suggested, you can just go by an auto-incremented id field if you have one; but you've stated the amount of data can differ between dates, so this could make for a very distorted graph. If you select every 1000th record for a year and half the rows are from the last 3 months ("holiday shopping", for a commerce example), the second half of your year graph will actually cover only the last quarter of the year. For more useful results you're most likely better off with something like this:
SELECT region, DATE(timestamp) AS theDate
     , AVG(price) AS avgPrice, MIN(price) AS minPrice, MAX(price) AS maxPrice
FROM pricehistory
GROUP BY region, theDate
;
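The same shape works at any granularity; for a multi-year view you could bucket by week instead (a sketch reusing the question's table):

SELECT region, YEARWEEK(timestamp) AS theWeek
     , AVG(price) AS avgPrice, MIN(price) AS minPrice, MAX(price) AS maxPrice
FROM pricehistory
GROUP BY region, theWeek
;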
It doesn't look like I'm going to get another answer so I'll go ahead and write the solution I came up with.
My data is pretty evenly distributed as it grabs prices at regular intervals so there's no reason to worry about that.
Here's my solution.
Let's say I have 500,000 rows and I want to display a subset of them, say 5000 rows. 500000/5000 is 100, so I take 100 and use it in my select statement like this: SELECT * FROM pricehistory WHERE id % 100 = 0;
Here is the actual code
public function getScaleFactor($startDate, $endDate) {
    $numPricePoints = $this->getNumPricePointsBetweenDates($startDate, $endDate);
    $scaleFactor = 1;
    if ($numPricePoints > $this->desiredNumPricePoints) {
        $scaleFactor = floor($numPricePoints / $this->desiredNumPricePoints);
    }
    return $scaleFactor;
}
I then use $scaleFactor in the SQL like this: SELECT * FROM pricehistory WHERE id % {$scaleFactor} = 0;
This isn't a perfect solution because you don't always end up with 5000 rows exactly, but I don't NEED exactly 5000 rows. I'm just trying to reduce the resolution of the data while still getting a graph that looks close to what it would be had I used all 500,000 rows.
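For completeness, a sketch of how the pieces could fit together (the PDO handle and the date filter are assumptions, not part of the original code); casting to int also guards the string interpolation:

$scaleFactor = (int) $this->getScaleFactor($startDate, $endDate);
$stmt = $pdo->prepare(
    "SELECT * FROM pricehistory
     WHERE id % {$scaleFactor} = 0
       AND `timestamp` BETWEEN :start AND :end"
);
$stmt->execute([':start' => $startDate, ':end' => $endDate]);
$points = $stmt->fetchAll(PDO::FETCH_ASSOC);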
I have this table:
person_id int(10) pk
fid bigint(20) unique
points int(6) index
birthday date index
4 FK columns int(6)
ENGINE = MyISAM
Important info: the table contains over 8 million rows and is fast growing (1.5M a day at the moment)
What I want: to select 4 random rows in a certain range when I order the table on points
How I do it now: In PHP I pick a random range, say 20% as the low bound and 30% as the high bound. Next I COUNT(*) the number of rows in the table. Then I determine the lowest row number as table count / 100 * low range, and likewise for the high range. After that I pick a random row number with rand(lowest_row, highest_row), and finally I select that row by doing:
SELECT * FROM `persons` WHERE points > 0 ORDER BY points desc LIMIT $random_offset, 1;
The points > 0 is in the query since I only want random rows with at least 1 point.
The above query takes about 1.5 seconds to run, and since I need 4 rows it takes over 6 seconds, which is too slow for me. I figure the ORDER BY points takes most of the time, so I was thinking about making a VIEW of the table, but I have no real experience with views. What do you think: is a view a good option, or are there better solutions?
ADDED:
I forgot to say that it is important that all rows have the same chance of being selected.
Thanks, I appreciate all the help! :)
Kevin
Your query is slow, and will keep getting slower as the table grows, because using LIMIT with a large offset forces a full sort of the table and then a scan up to that offset to produce the result. (This kind of 'abuse' of LIMIT is part of why it's non-standard SQL; MSSQL and Oracle, for example, do not support it.) Instead you should do this work on the PHP end of things as well.
First ensure there's an index on points. That makes SELECT MAX(points), MIN(points) FROM persons a query that returns instantly. From those two results you can determine the points range, and use rand() to pick 4 values in the requested range. Then repeat for each value:
SELECT * FROM persons WHERE points < $myValue ORDER BY points DESC LIMIT 1
Since it only has to retrieve one row, and can determine which one via the index, this'll be in the milliseconds execution time as well.
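A sketch of that approach in PHP (mysqli assumed; the variable names are illustrative):

$range = $db->query("SELECT MIN(points) AS lo, MAX(points) AS hi FROM persons WHERE points > 0")
            ->fetch_object();
$stmt = $db->prepare("SELECT * FROM persons WHERE points < ? ORDER BY points DESC LIMIT 1");
$rows = [];
for ($i = 0; $i < 4; $i++) {
    // strictly above the minimum so the query always finds at least one row
    $cutoff = mt_rand((int)$range->lo + 1, (int)$range->hi + 1);
    $stmt->bind_param('i', $cutoff);
    $stmt->execute();
    $rows[] = $stmt->get_result()->fetch_object();
}

One caveat given the ADDED requirement: this weights rows by the gaps between point values rather than perfectly uniformly, so it trades some exactness for speed.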
Views aren't going to do anything to help your performance here. My suggestion would be to simply run:
SELECT * FROM `persons` WHERE points BETWEEN ? AND ?
Make sure you have an index on points. Also, you SHOULD replace * with only the fields you are concerned about, if applicable. Here, of course, ? represents the upper and lower bounds for your search.
You can then determine the number of rows returned in the result set using mysqli_num_rows() (or similar based on your DB library of choice).
You now have the total number of rows that meet your criteria. You can then easily calculate 4 random numbers within the range of results and use mysqli_data_seek() or similar to jump directly to the record at each random offset and read the values you want from it.
Putting it all together:
$result = mysqli_query($db_conn, $sql); // here $sql is your SQL query
$num_records = 4; // your number of records to return
$num_rows = mysqli_num_rows($result);
$rows = array();
for ($i = 0; $i < $num_records; $i++) {
    $random_offset = rand(0, $num_rows - 1);
    mysqli_data_seek($result, $random_offset);
    $rows[] = mysqli_fetch_object($result);
}
mysqli_free_result($result);
I have an exe which returns an array of 16 elements. I have to pass this array to MySQL using PHP to calculate the Euclidean distance. My table in MySQL has this form:
id | img_id | features_1 | features_2 | features_3 | features_4 | ... | features_16
 1 |      1 | 0.389      | 0.4567     | 0.8981     | 0.2345     | ...
 2 |      2 | 0.9878     | 0.4567     | 0.56122    | 0.4532     | ...
 3 |      3 |
 4 |      4 |
...
So I have 16 features for each image, and I now have 30,000 images, i.e. img_id goes up to 30,000. I have to calculate the Euclidean distance between the array from the exe (passed through PHP) and the data in the database, and return the img_id of the 6 images whose distance is smallest. For example, suppose the exe gives the array A = [0.458, 0.234, 0.4567, 0.2398]. For img_id = 1 the (squared) distance is (0.458-0.389)^2 + (0.234-0.4567)^2 + (0.4567-0.8981)^2 + (0.2398-0.2345)^2, and I have to repeat this for all 30,000 images and return the 6 img_ids with the least distance. What is an efficient and fast way to calculate this?
Since PHP is slow at this, you should do it directly in SQL, like this:
SELECT * FROM tablename
ORDER BY ABS(f1 - :f1) + ABS(f2 - :f2) + ... ASC
LIMIT 6;
Note that I used the absolute (L1) norm instead of the Euclidean norm. Strictly speaking the two can rank neighbours differently, but since all norms on a finite-dimensional vector space are equivalent, the results are typically close when you are not interested in the actual distance values. sqlite, for example, does not provide a SQUARE function, and writing (f1 - :f1) * (f1 - :f1) every time is annoying, so I think this is a nice solution.
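If you do want the true (squared) Euclidean distance, MySQL's POW() works fine, and SQRT() can be skipped since it doesn't change the ordering. A sketch in PHP with bound parameters (PDO and the question's column layout assumed):

$terms = [];
for ($i = 1; $i <= 16; $i++) {
    $terms[] = "POW(features_$i - ?, 2)"; // one squared difference per feature
}
$sql = "SELECT img_id, " . implode(' + ', $terms) . " AS dist
        FROM tablename
        ORDER BY dist ASC
        LIMIT 6";
$stmt = $pdo->prepare($sql);
$stmt->execute($featureVector); // the 16-element array produced by the exe
$closest = $stmt->fetchAll(PDO::FETCH_ASSOC);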
I'm retrieving 4 random rows from a table. However, I'd like it such that more weight is given to rows that had just been inserted into the table, without penalizing older rows much.
Is there a way to do this in PHP / SQL?
SELECT *, (RAND() / id) AS o FROM your_table ORDER BY o LIMIT 4
This will order by o, where o is a random value between 0 and 1/id. That means the newer the row (the higher its id), the lower its o value will tend to be, so recent rows are favoured while the order stays random.
I think an agreeable solution would be to use an asymptotic function (1/x) in combination with weighting.
The following has been tested:
SELECT *, (RAND()*10 + (1/(max_id - id + 1))) AS weighted_random
FROM tbl1
ORDER BY weighted_random DESC
LIMIT 4
If you want to get the max_id within the query above, just replace max_id with:
(SELECT id FROM tbl1 ORDER BY id DESC LIMIT 1)
Examples:
Let's say your max_id is 1000 ...
For each of several ids I will calculate the value 1/(1000 - id + 1), which simplifies to 1/(1001 - id):
id: 1000
1/(1001-1000) = 1/1 = 1
id: 999
1/(1001-999) = 1/2 = .5
id: 998
1/(1001-998) = 1/3 = .333
id: 991
1/(1001-991) = 1/10 = .1
id: 901
1/(1001-901) = 1/100 = .01
The nature of 1/x makes it so that only ids close to the max get any significant weighting.
You can see a graph of, and read more about, asymptotic functions here:
http://zonalandeducation.com/mmts/functionInstitute/rationalFunctions/oneOverX/oneOverX.html
Note that the right side of the graph, with positive numbers, is the only part relevant to this specific problem.
Manipulating our equation to do different things:
(RAND()*a + (1/(b*(max_id - id + 1/b))))
I have added two values, "a", and "b"... each one will do different things:
The larger "a" gets, the less influence order has on selection. It is important to have a relatively large "a", or pretty much only recent ids will be selected.
The larger "b" gets, the more quickly the asymptotic curve will decay to insignificant weighting. If you want more of the recent rows to be weighted, I would suggest experimenting with values of "b" such as: .5, .25, or .1.
The 1/b at the end of the equation offsets problems you have with smaller values of b that are less than one.
Note:
This is not a very efficient solution when you have a large number of ids (just like the other solutions presented so far), since it calculates a value for each separate id.
... ORDER BY (RAND() + 0.5 * id/maxId) DESC
This will add half of the id/maxId ratio to the random value, i.e. for the newest entry 0.5 is added (as id/maxId = 1) and for the oldest entry almost nothing is added. Sorting descending then favours the newer rows.
Similarly you can also implement other weighting functions. This depends on how exactly you want to weight the values.
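For a self-contained version, maxId can come from a subquery (a sketch; the table name is a placeholder):

SELECT t.*
FROM your_table t
CROSS JOIN (SELECT MAX(id) AS maxId FROM your_table) m
ORDER BY RAND() + 0.5 * t.id / m.maxId DESC
LIMIT 4;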
I have an application that stores data in a database. I need search functionality to work on this database.
For this to work I need a "relevance" score, a score that is calculated based on a set of criteria to output as a value that can be then used to order a set of data.
Say for instance the user enters three keywords: X, Y and Z - I need to generate a score based on a database entry. I wish the criteria to be related to how many times each appears.
Example:
Database Entry A - X appears 8 times, Y appears once, and Z appears once, giving a collective score of 10.
Database Entry B - X appears 24 times, Y does not appear, and Z does not appear, giving a collective score of 24.
Here's my problem. Database Entry A IS more relevant for the search XYZ because it contains all three keywords, not just one, yet a naive count would rank Database Entry B as more relevant.
I need to figure out a way to score the results based not just on how many times each keyword appears, but also to give sharply higher scores to results that contain more of the distinct keywords (i.e. entering 10 keywords should rank results where all 10 appear above results with a large count of only one).
I need to achieve this with PHP which will be retrieving my database results and feeding them back to my website page.
You could compute two relevance scores: one that counts how many of the keywords produced a match, and then your regular "how many matches were found". From your examples, that would give:
Example A - field_count: 3, match_count: 10
Example B - field_count: 1, match_count: 24
and then have your query do
ORDER BY field_count DESC, match_count DESC
so that rows matching more of the keywords sort first.
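A sketch of how those two scores could be computed in MySQL itself, assuming a text column body, a placeholder table name, and the three keywords bound as :kw1..:kw3 (the LENGTH/REPLACE trick counts occurrences; boolean comparisons evaluate to 0/1 in MySQL):

SELECT t.*,
       ((body LIKE CONCAT('%', :kw1, '%'))
      + (body LIKE CONCAT('%', :kw2, '%'))
      + (body LIKE CONCAT('%', :kw3, '%'))) AS field_count,
       ((LENGTH(body) - LENGTH(REPLACE(body, :kw1, ''))) / LENGTH(:kw1)
      + (LENGTH(body) - LENGTH(REPLACE(body, :kw2, ''))) / LENGTH(:kw2)
      + (LENGTH(body) - LENGTH(REPLACE(body, :kw3, ''))) / LENGTH(:kw3)) AS match_count
FROM entries t
ORDER BY field_count DESC, match_count DESC;

(With PDO, reusing a named placeholder like this requires emulated prepares; otherwise bind each occurrence separately.)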
Since the (first) presence of a keyword is so important, give it a better score than the rest of the occurrences. For example:
$score = 0;
foreach ($keywords as $count) {
    $score += $count == 0 ? 0 : 1000000; // large bonus for each keyword that appears at all
    $score += $count;
}
If you apply this algorithm to your example, you will have:
Entry1 ---> (1000000 + 8) + (1000000 + 1) + (1000000 + 1) = 3000010
Entry2 ---> (1000000 + 24) = 1000024
So Entry1 scores better than Entry2 as you wanted.