I'm looking to create an SQL query (in MySQL) that will display 6 random, yet popular entries in my web application.
My database has the following tables:
favorites
submissions
submissions_tags
tags
users
submissions_tags and tags are cross-referencing tables that give each submission a certain number of tags.
submissions contains a boolean featured, an int downloads, and an int views, all three of which I'd like to use to weight this query.
The favorites table is again a cross-reference table with the fields submission_id and user_id. Counting the number of times each submission has been favorited would be good to weigh the results with.
So basically I want to select 6 random rows weighted with these four variables - featured, downloads, views, and favorite count. Each time the user refreshes the page, I want a new random 6 to be selected. So maybe the query could limit it to 12 most-recent but only pluck 6 random results out to show. Is that a sensible idea in terms of processing etc.?
So my question is, how can I go about writing this query? Where should I begin? I am using PHP/CodeIgniter to drive this site with. Is it possible to get the entire lot in one query, or will I have to use multiple queries to do this? Or, do I need to simplify my ideas?
Thanks,
Jack
I've implemented something similar to this before. The route I took was to have a script run on the server every XX minutes to fill a table with a pool of items (say 20-30 items). The query in your application would then randomly pick 5 or so from that table.
You just need to set up an algorithm to select those 20-30 items. @Emmerman's answer is similar to what I used before to calculate a popularity_number, where I took weights of multiple associations to the item (views, downloads, etc.) to get an overall number. We also used an age to make sure the pool of items stayed up to date. You'll have to tinker with the algorithm over time to make sure the relevant items are being populated.
The idea is to calculate a popularity score, for example:
popularity = featured*W1 + downloads*W2 + views*W3 + fcount*W4
Where W1-W4 are constant weights.
Then add a random number to popularity and sort by the result.
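To make the formula concrete, here is a minimal sketch in Python (rather than PHP, purely for illustration). The weight values, the jitter range, and the function names are all assumptions of mine, not something from the question; the rows are assumed to already carry the favorite count as fcount.

```python
import random

# Hypothetical weights W1..W4 from the formula above; tune them for your data.
WEIGHTS = {"featured": 50.0, "downloads": 2.0, "views": 0.1, "fcount": 5.0}

def popularity(row):
    """popularity = featured*W1 + downloads*W2 + views*W3 + fcount*W4"""
    return sum(row[name] * weight for name, weight in WEIGHTS.items())

def pick_popular(rows, k=6, jitter=100.0, rng=random):
    """Add a random number in [0, jitter] to each score, sort, keep the top k."""
    scored = [(popularity(row) + rng.uniform(0, jitter), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]

if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    sample = [
        {"id": i, "featured": i % 2, "downloads": i * 3, "views": i * 40, "fcount": i}
        for i in range(1, 21)
    ]
    print([row["id"] for row in pick_popular(sample)])
```

The larger the jitter is relative to the spread of the scores, the more the result changes between page loads; with a jitter of 0 you simply get the 6 most popular rows every time.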
Related
My situation is this: I have a table of opportunities that is sorted. We have a paid service that allows people to view the opportunities on the website at any time. However, we want an unpaid view that will show a random %/# of opportunities, which will always be the same. The opportunities are sorted by date; i.e. they will expire and be removed from the list, and a new one should then appear in the free search. The only problem is that the free view always has to show the same opportunities. (For example, I can't just pick random rows, because it will cycle through them if people keep refreshing, and likewise I can't just take the ones about to expire or furthest from expiry, because people still end up seeing the entire list.)
My only solution thus far is to add an extra column to the table to mark that it is open display. Then to count them on display, and if we are missing rows then to randomly select a few more. Below is a mock up...
SELECT count(id) as total FROM opportunities WHERE display_status="open" LIMIT 1000;
...
while(total < requiredNumber) {
UPDATE opportunities SET display_status="open" WHERE display_status="private" ORDER BY random() LIMIT (required-total);
}
Can anyone think of a better way to solve this problem, preferably one that does not leave me adding another column to the table, and that avoids possible conflicts if many people load the page at the same time? One final note: it can't be a fixed pattern either (e.g. pick one, skip a few, take the next).
Any thoughts/comments would be very helpful,
Thanks.
One way to make sure that a user only sees the same set of random rows is to feed the random number generator a seed that is linked to that user (such as their user_id). That means every user gets a random ordering of rows but it's always the same random ordering for each user.
Your code would be something like:
SELECT ...
FROM ...
WHERE ...
ORDER BY random(<user id>)
LIMIT <however many>
Note: as Twelfth pointed out, as new rows are created, they will get new order values and may end up in your random selection.
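The same idea can be sketched outside the database. This is only an illustration in Python, assuming the row IDs are already loaded; the function name and the sample IDs are made up.

```python
import random

def rows_for_user(row_ids, user_id, limit):
    """Return the same pseudo-random selection of rows for a given user."""
    rng = random.Random(user_id)   # generator seeded with the user's id
    ordered = sorted(row_ids)      # fixed starting order so the shuffle is repeatable
    rng.shuffle(ordered)
    return ordered[:limit]

if __name__ == "__main__":
    ids = list(range(1, 51))
    # Calling twice for the same user gives the same rows.
    print(rows_for_user(ids, user_id=42, limit=5))
    print(rows_for_user(ids, user_id=42, limit=5))
```

As with the SQL version, the selection is only stable while the set of rows is unchanged: adding or removing a row changes the shuffle for everyone.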
I'm the type that doesn't like to lose information...including what random rows someone got to see. However I do not like the modification of your existing table idea...
Create a second table, random_rows or something to that effect, to save the IDs of the users and the IDs of the random records they got to see. Inner join to that table whenever you need to find those same rows again. You can also put expiry dates and the sort order in the table, so the user isn't permanently stuck with the same 10 rows.
I'm working on an existing Facemash-like script. It shows two pictures, and the user chooses which picture he likes better.
I wanted to make a small improvement so that a user is never shown a combination of two pictures he has already voted on.
I tried to do this in two ways, but neither of them is good enough or comfortable for the user.
First one: the two pictures are chosen at random. After a vote, a new record is created in the database with this specific combination and the value of the vote. If a combination of two pictures already exists as a record in the database, the page shows the historical vote and, after a few seconds, refreshes and makes another random combination.
Second one: at the moment the picture names are added to the database, the script creates all possible combinations as records in the database. This works well, because the script pulls a random record that doesn't contain a result yet, and after the vote saves it with a value, so there is no way to get repeats. The main problem with this approach comes when adding new pictures: the database becomes huge right from the start, and creating all possible combinations up front takes forever.
Because of that I'm looking for another solution. Even a small piece of advice that might help me find a way would be welcome.
Your first approach scales better, you just want to avoid showing an historical vote. You need to keep a history of votes anyway, so use that history as a filter. In the SELECT statement you are using to get the random faces, left join on the history table to use the join as a filter.
Example:
-- :user_id is a placeholder for the current user's id
SELECT faces.uid AS f_uid, votes.uid AS v_uid FROM faces
LEFT JOIN votes ON votes.user_id = :user_id
AND (faces.uid = votes.face_id1 OR faces.uid = votes.face_id2)
WHERE votes.uid IS NULL
ORDER BY RAND() LIMIT 2
That will make sure they never see the same face twice. It will become slower the more faces a user votes on. It won't be noticeably slower until they have done many hundreds of votes.
That said, you could change the LIMIT to something like 20 and cache that (i.e. in the session). You then have the next 10 pairings (20/2=10) ready to go. That is sort of a combination of 1 & 2.
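Here is a small runnable sketch of the left-join filter. It uses Python's built-in sqlite3 module as a stand-in for MySQL (so RANDOM() instead of RAND()), and the table layout is my guess based on the column names in the query above, not your real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE faces (uid INTEGER PRIMARY KEY);
CREATE TABLE votes (
    uid INTEGER PRIMARY KEY,
    user_id INTEGER,
    face_id1 INTEGER,
    face_id2 INTEGER
);
INSERT INTO faces (uid) VALUES (1), (2), (3), (4), (5);
-- user 7 has voted on faces 1 and 2, user 8 on faces 3 and 4
INSERT INTO votes (user_id, face_id1, face_id2) VALUES (7, 1, 2), (8, 3, 4);
""")

def unseen_faces(conn, user_id, limit=2):
    """Random faces that do not appear in any vote by this user."""
    sql = """
        SELECT faces.uid
        FROM faces
        LEFT JOIN votes
          ON votes.user_id = :user_id
         AND (faces.uid = votes.face_id1 OR faces.uid = votes.face_id2)
        WHERE votes.uid IS NULL      -- keep only faces with no matching vote
        ORDER BY RANDOM()            -- RAND() in MySQL
        LIMIT :limit
    """
    params = {"user_id": user_id, "limit": limit}
    return [row[0] for row in conn.execute(sql, params)]

if __name__ == "__main__":
    print(unseen_faces(conn, user_id=7))
```

Raising the limit (for example to 20) and keeping the result in the session gives you the batch of upcoming pairings described above.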
I have a table which stores highscores for a game. This game has many levels, and scores are ordered by score DESC (which is indexed) for a given level ID. Would partitioning on this level ID column give the same result as creating many separate level tables (one for each level ID)? I need to separate out the level data somehow, as I'm expecting tens of millions of entries. I hear partitioning could speed this up whilst leaving my tables normalised.
Also, I have an unknown number of levels in my game (levels may be added or removed at any time). Can I specify to partition on this level ID column and have new partitions automatically created when a new (distinct) level ID is added to the highscore table? I may start with 10 separate levels but end up with 50, with all my data still kept in one table but in many partitions. Do I have to index the level ID to make this work?
Thanks in advance for your advice!
Creating an index on a single column is good, but creating an index that contains two columns would be a better solution based on the information you have given. I would run:
alter table highscores add index(columnScore, columnLevel);
This will make performance much better. From a database point of view, no matter what highscores you are looking for, the database will know where to search for them.
On that note, if you can (and you are using MyISAM tables), you could also run:
alter table highscores order by columnScore, columnLevel;
which will then group all your data together, so that even though the database KNOWS where each bit is, it can find all the records that belong to one another nearby - which means less hard drive work - and therefore quicker results.
That second operation too, can make a HUGE difference. My PC at work (horrible old machine that was top of the range in the nineties) has a database with several million records in it that I built - nothing huge, about 2.5gb of data including indexes - and performance was dragging, but ordering the data for the indexes improved query time from about 1.5 minutes per query to around 8 seconds. That's JUST due to hard drive speed in being able to get to all the sectors that contain the data.
If you plan to store data for different users, what about having 2 tables - one with all the information about different levels, another with one row for every user alongside with his scores in XML/json?
I'm running an SQL query to get basic details from a number of tables, sorted by the last update date field. It's terribly tricky, and I'm wondering if there is an alternative to using the UNION clause. I'm working in PHP and MySQL.
Actually I have a few tables containing news, articles, photos, events etc and need to collect all of them in one query to show a simple - whats newly added on the website kind of thing.
Maybe do it in PHP rather than MySQL - if you want the latest n items, then fetch the latest n of each of your news items, articles, photos and events, and sort in PHP (you'll need the last n of each obviously, and you'll then trim the dataset in PHP). This is probably easier than combining those with UNION given they're likely to have lots of data items which are different.
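The merge-and-trim step might look roughly like this. It is sketched in Python rather than PHP, and it assumes each per-table query already returned its latest n rows sorted newest first, as (updated_at, kind, title) tuples; those field names are made up for the example.

```python
import heapq
from itertools import islice

def latest_items(sources, n):
    """Merge several newest-first lists and keep the n most recent items overall."""
    merged = heapq.merge(*sources, key=lambda item: item[0], reverse=True)
    return list(islice(merged, n))

if __name__ == "__main__":
    # Made-up rows standing in for the results of the per-table queries.
    news = [("2024-05-03", "news", "c"), ("2024-05-01", "news", "a")]
    photos = [("2024-05-04", "photo", "d"), ("2024-05-02", "photo", "b")]
    for item in latest_items([news, photos], 3):
        print(item)
```

Because every input is already sorted, the merge never needs to look past the first n items in total. In PHP the simplest equivalent is probably array_merge followed by usort and array_slice.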
I'm not aware of an alternative to UNION that does what you want, and hopefully those fetches won't be too expensive. It would definitely be wise to profile this though.
If you use JOIN in your query, you can select data from different tables that are related by foreign keys.
You can look at this from another angle: do you need absolutely up-to-date information (so that new information appears the moment someone enters it)?
If not, you can have a table holding the results of the query in the format you need (serving as cache), and update this table every 5 minutes or so. Then your query problem becomes trivial, as you can have the updates run as several updates in the background.
What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don't want to hit the DB for every page view. Currently there are ~45k rows returned by this query.
Here are some of the approaches I've considered:
Cache the row count and update it every X minutes
Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
Get the row count once and store it in the user's session
Get rid of the page picker and just have a "Next Page" link
I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I pursued the strategy in stages:
Multi-column indexes: I should have done this first, before attempting a materialized view.
Generating a materialized view: I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ... and then create the new table, and rotate it in:
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;
This kept the lock time on the write-contended mytable down to a minimum, and the pagination queries could hammer away on the atbat materialized view. I've simplified the above, leaving out the actual manipulation, which is unimportant.
Memcache: I then created a wrapper around my database connection to cache these paginated results in memcache. This was a huge performance win. However, it was still not good enough.
Batch generation: I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes to mytable and periodically regenerate all the pages, from the oldest changed record to the most recent one, writing them to the webserver's filesystem. With a bit of mod_rewrite, I could check whether the page existed on disk and serve it up. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since headers and respond with 304 response codes. (Obviously, I removed any option of allowing users to select the number of results per page, an unimportant feature.)
Updated:
RE count(*): When using MyISAM tables, COUNT didn't create a problem when I was able to reduce the amount of read-write contention on the table. If I were doing InnoDB, I would create a trigger that updated an adjacent table with the row count. That trigger would just +1 or -1 depending on INSERT or DELETE statements.
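A rough sketch of that trigger idea follows. I'm using Python's built-in sqlite3 here only so the example runs anywhere; the MySQL trigger syntax differs slightly, and the table, column, and trigger names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER PRIMARY KEY, title TEXT);

-- Adjacent single-row table that holds the current row count.
CREATE TABLE mytable_count (n INTEGER NOT NULL);
INSERT INTO mytable_count (n) VALUES (0);

CREATE TRIGGER mytable_after_insert AFTER INSERT ON mytable
BEGIN
    UPDATE mytable_count SET n = n + 1;
END;

CREATE TRIGGER mytable_after_delete AFTER DELETE ON mytable
BEGIN
    UPDATE mytable_count SET n = n - 1;
END;
""")

def row_count(conn):
    """Read the maintained counter instead of running COUNT(*)."""
    return conn.execute("SELECT n FROM mytable_count").fetchone()[0]

if __name__ == "__main__":
    conn.execute("INSERT INTO mytable (title) VALUES ('first')")
    print(row_count(conn))
```

The pagination code then reads one row from mytable_count instead of counting the whole table. Note that this only gives the total for the unfiltered table, not for an arbitrary WHERE clause.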
RE page-pickers (thumbwheels): When I moved to aggressive query caching, thumbwheel queries were also cached, and when it came to batch generating the pages, I was using temporary tables, so computing the thumbwheel was no problem. A lot of the thumbwheel calculation was simplified because it became a predictable filesystem pattern that only needed the largest page number. The smallest page number was always 1.
Windowed thumbwheel: The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all, so long as you know your maximum number of pages.
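For instance, the window can be computed with plain arithmetic. This is a minimal sketch in Python; the function name and the radius parameter are my own choices.

```python
def page_window(current, last_page, radius=1):
    """Page numbers to show around the current page, clamped to 1..last_page."""
    first = max(1, current - radius)
    last = min(last_page, current + radius)
    return list(range(first, last + 1))

if __name__ == "__main__":
    # Corresponds to the "<< 4 [5] 6 >>" example when there are at least 6 pages.
    print(page_window(5, 10))
```

Whether to show the "<<" and ">>" links then just depends on whether the first entry is greater than 1 and the last entry is less than the maximum page.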
My suggestion is to ask MySQL for 1 row more than you need in each query, and decide based on the number of rows in the result set whether or not to show the next-page link.
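A minimal sketch of that approach, using Python's built-in sqlite3 as a stand-in for MySQL; the table name, the 25 sample rows, and the function name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytable (id) VALUES (?)", [(i,) for i in range(1, 26)])

def fetch_page(conn, page, per_page):
    """Fetch per_page + 1 rows; the extra row only signals that another page exists."""
    offset = (page - 1) * per_page
    rows = conn.execute(
        "SELECT id FROM mytable ORDER BY id LIMIT ? OFFSET ?",
        (per_page + 1, offset),
    ).fetchall()
    has_next = len(rows) > per_page
    return [row[0] for row in rows[:per_page]], has_next

if __name__ == "__main__":
    ids, has_next = fetch_page(conn, page=1, per_page=10)
    print(ids, has_next)
```

No COUNT(*) is needed at all this way, at the cost of not being able to show the total number of pages.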
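MySQL has a specific mechanism to compute the number of rows a query would have returned without its LIMIT clause: add SQL_CALC_FOUND_ROWS to the SELECT and then call FOUND_ROWS(). Note that this gives an exact count rather than an approximation, and that both are deprecated as of MySQL 8.0.17.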
MySQL is quite good in optimizing LIMIT queries.
That means it picks appropriate join buffer, filesort buffer etc just enough to satisfy LIMIT clause.
Also note that with 45k rows you probably don't need exact count. Approximate counts can be figured out using separate queries on the indexed fields. Say, this query:
SELECT COUNT(*)
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
can be approximated by this one:
SELECT COUNT(*) *
       (
       SELECT COUNT(*)
       FROM mytable
       ) / 1000
FROM (
     SELECT 1
     FROM mytable
     WHERE col1 = :myvalue
       AND col2 = :othervalue
     LIMIT 1000
     ) AS sample
, which is much more efficient in MyISAM.
If you give an example of your complex query, I can probably say something more definite about how to improve its pagination.
I'm by no means a MySQL expert, but perhaps giving up the COUNT(*) and going ahead with COUNT(id)?