I am trying to build a poor man's recommendation system for an online store.
I want to implement something like Amazon's "Customers Who Bought This Item Also Bought" feature, and I have read a lot about it.
I know there is Apache Mahout, but I am unable to set up the server for it. There is also the Google Prediction API, but it costs money, so I started experimenting myself.
I have an order history with 250,000+ items, and I wrote a nested MySQL query that finds orders containing the current article, ranks the other items in those orders, and sorts by that ranking, so I get a set of products that other people ordered along with the current article.
The problem is that the query can take up to 10 seconds, so it can't be used directly.
I thought about a caching table, but the query to fill it stops after 20 minutes (there are 60,000 products and 250,000 ordered items), so I am unable to fill that table.
My current workaround is the following:
The recommendation HTML is loaded via AJAX on document ready, so the page loads while the recommendations load in the background. The recommendation data is processed once and stored in a file cache (PEAR simple cache) so it loads faster the next time. The cache is built on demand when someone visits the page and kept for a day or maybe a week.
I ask myself and you: is that an acceptable approach, or is it stupid and slow?
Would it be better to store the cached data in a database or in files (I am thinking about performance and parallel hits)? In the worst case I would end up with 60,000 cache files.
I would prefer a pre-computed table with all the data, but as I said it takes too long and I don't know how to optimize it. (Waiting until the SQL dude comes back from holidays ^^)
Thanks for any hints or opinions.
By the way, this is the query:
SELECT c.ArtNr AS artnr, COUNT(c.ArtNr) AS rank, s.ArtNr AS parent_artnr
FROM (
    SELECT a.ID_order, a.ArtNr
    FROM net_orderposition a
    WHERE a.ArtNr = 'TT-PV0005'
) s
JOIN net_orderposition c
  ON s.ID_order = c.ID_order AND s.ArtNr != c.ArtNr
GROUP BY c.ArtNr
ORDER BY rank DESC, c.Stamp DESC
LIMIT 10;
EDIT:
I thought about the given answers and I think they are similar to my initial idea.
The above code results in the following table:
ID, ParentID, ChildID, Rank
1, TT-PV0005, TT-PV0040, 220
2, TT-PV0005, TT-PV0355, 135
3, TT-PV0005, TT-PV0450, 134
4, TT-PV0005, TT-PV0451, 89
5, TT-PV0005, RH-01V2 , 83
6, TT-PV0005, TT-PV0041, 83
7, TT-PV0005, TT-PV0353, 82
8, TT-PV0005, TT-PV0037, 80
ParentID is the current item, ChildID is an item that was ordered in the past along with the ParentID, and Rank is the precomputed count of how often that child was ordered together with the current item.
Now I can UPDATE or INSERT related items on every new order and increment Rank if the pair is already present in the DB.
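For example, a minimal sketch of what I have in mind for that (untested; the recommendation table and the connection details are just placeholders):

<?php
// Untested sketch: after a new order is written to net_orderposition,
// bump the counter for every pair of distinct articles in that order.
// Assumes a table recommendation(ParentID, ChildID, Rank) with a
// unique key on (ParentID, ChildID); connection details are placeholders.
$db = new mysqli('localhost', 'user', 'password', 'shop');
$newOrderId = 12345; // the order that was just placed

$stmt = $db->prepare(
    'INSERT INTO recommendation (ParentID, ChildID, `Rank`)
     SELECT a.ArtNr, b.ArtNr, 1
     FROM net_orderposition a
     JOIN net_orderposition b
       ON a.ID_order = b.ID_order AND a.ArtNr != b.ArtNr
     WHERE a.ID_order = ?
     ON DUPLICATE KEY UPDATE `Rank` = `Rank` + 1'
);
$stmt->bind_param('i', $newOrderId);
$stmt->execute();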
The only thing I fear is that I will end up with a really, really big table.
Maybe that wouldn't be a problem if I precalculate it offline once a week?
But then I would have to optimize the query so it doesn't take 10 seconds per item.
What do you think?
Check out easyrec; it has the features you need and is free. No tweaking is needed, and you can use the demo instance like Google Analytics. I think it will be much easier to just use this free web service than to code the whole logic on your own.
In a tweet today they mentioned that they are adding full Mahout support to easyrec, so you get the whole thing with easyrec. You can either use easyrec's free web service or deploy the free WAR file on your own web server.
To add to @GalacticCowboy's answer and fill in where your comment was, @Marcus...
One schema to accomplish this would be to create a table like:
RelatedItems
RelatedItemsId
purchasedItemId
relatedItemId
Then, when an order is completed (or viewed, depending on your requirements), you'd write records to the RelatedItems table: each item purchased gets records where its id is the purchasedItemId, and all the other items in the order are written as the relatedItemId.
For example, if I made a purchase of items 5, 9, 12, and 19, 12 records would be written to my table that look like:
RelatedItemsId, PurchasedItemId, RelatedItemId
1, 5, 9
2, 5, 12
3, 5, 19
4, 9, 5
5, 9, 12
6, 9, 19
7, 12, 5
8, 12, 9
9, 12, 19
10, 19, 5
11, 19, 9
12, 19, 12
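For instance, a rough sketch of writing those rows when an order completes (PHP with mysqli is an assumption on my part, as is the $db connection variable):

<?php
// Sketch only: write one RelatedItems row for every ordered pair of
// distinct items in the completed order. Assumes $db is an existing
// mysqli connection and RelatedItemsId is auto-increment.
function recordRelatedItems(mysqli $db, array $purchasedItemIds)
{
    $stmt = $db->prepare(
        'INSERT INTO RelatedItems (purchasedItemId, relatedItemId) VALUES (?, ?)'
    );
    foreach ($purchasedItemIds as $purchased) {
        foreach ($purchasedItemIds as $related) {
            if ($purchased === $related) {
                continue;
            }
            $stmt->bind_param('ii', $purchased, $related);
            $stmt->execute();
        }
    }
}

// The example order above: recordRelatedItems($db, array(5, 9, 12, 19)); // writes 12 rows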
Then you could use a query similar to GalacticCowboy's to get the top 10 items that are normally purchased alongside any of those items.
Please note, this is not the most efficient schema for a task like this; it could be tweaked quite a bit to reduce redundant data. But given that we don't know an awful lot about your system and overall schema design (and what seems to be a shaky understanding of some SQL concepts), I'm not going to go deep into that.
Every time there's an order, store a relationship record between the different items in the order. Then do something like:
SELECT ItemID, COUNT(RelatedItemID) AS RelatedItemCount
FROM RelatedItems
WHERE RelatedItemID = @viewingItemID
GROUP BY ItemID
ORDER BY RelatedItemCount DESC
LIMIT 10
You could also presummarize this using an overnight process or something and have a table that only contains the top n related items for each item ID.
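A rough sketch of what that overnight job could look like (the top_related summary table and the mysqli connection details are assumptions, just to illustrate):

<?php
// Sketch of a nightly job: rebuild a summary table holding only the
// top 10 related items per item. top_related(ItemID, RelatedItemID,
// RelatedItemCount) is hypothetical, as are the connection details.
$db = new mysqli('localhost', 'user', 'password', 'shop');

$db->query('TRUNCATE TABLE top_related');

$insert = $db->prepare(
    'INSERT INTO top_related (ItemID, RelatedItemID, RelatedItemCount)
     SELECT ItemID, RelatedItemID, COUNT(RelatedItemID)
     FROM RelatedItems
     WHERE ItemID = ?
     GROUP BY ItemID, RelatedItemID
     ORDER BY COUNT(RelatedItemID) DESC
     LIMIT 10'
);
$itemId = 0;
$insert->bind_param('i', $itemId);

$items = $db->query('SELECT DISTINCT ItemID FROM RelatedItems');
while ($row = $items->fetch_assoc()) {
    $itemId = (int) $row['ItemID'];
    $insert->execute();
}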
Related
I am thinking about how to arrange an office room.
The office room is always noisy, and you want to separate the users as much as possible so that they don't feel uncomfortable.
If two people are facing each other, we add 1 unhappy point.
INPUT
What we can do here is, based on the given rooms and users, arrange the seating so that people don't feel uncomfortable.
[row, column, users] -> unhappy points
Example 1: [2, 3, 6]
*2 Rows, 3 Columns, 6 people
Example 2: [3, 3, 8]
*3 Rows, 3 Columns, 8 people
Sample Output
Following are some test cases:
[5,2,8]-> 7
[3,5,14]-> 18
[1,16,1]-> 0
[3,5,1]-> 0
[8,2,12]-> 10
[16,1,1]-> 0
[3,3,6]-> 3
[2,6,12]-> 16
[15,1,0]-> 0
[5,3,7]-> 0
[4,3,5]-> 0
I need either a mathematical solution or a programming solution in PHP.
This is not a complete question, because how will you determine that two people are facing each other? And how are you counting 7 unhappy points for the first example?
I've observed that if you take a matrix of, say, 4 by 5, you can put 4*5 = 20 people there. So how could you count more than 20 points when there are only 20 people?
Let me start off by stating that I'm just a self-taught hobbyist at this, so I'm sure I'm doing some things wrong or inefficiently, and any feedback is appreciated. If this question is moot because I've made fundamental errors and need to start from scratch, I guess I need to know so I'll become better.
With that, here's the problem:
I have a database of birth names in MySQL that is intended to let you find the frequency of those names within a given year range. My only table has a lot of columns:
Name    Begins    Popularity    1800    1801    1802
Aaron A 500 6 7 4
Amy A 100 10 2 12
Ashley A 250 2 5 7
...and so forth until 2013.
Right now I've written a PHP page that can call up a list of names based on the start letter over the entire year range (1800-2013). That works, but what I'd like to do is to let the user specify a custom year range from the dropdowns I put on the home page and use that to calculate the frequency of each name for the custom year range only. I'd also like to be able to sort the resulting list based on those frequency values, not the all-time frequency stored in 'Popularity'.
From what I've looked at, I'm thinking part of the solution might lie in using custom views but I just can't seem to put the pieces all together. Or should I somehow pre-calculate all possible combinations?
Here is the working query code I'm using right now:
{$query = "SELECT Name
FROM nametable
WHERE Gender = '$genselect'
AND
(BeginsWith = '$begins')
ORDER BY $sortcolumn $sortorder";
goto resultspage;
}
resultspage:
$result = mysqli_query($dbcnx, $query)
or die ("Error in query: $query.".mysqli_error($dbcnx));
$rows = $result->num_rows;
echo "<br>You found $rows names!<br>";
while($row=mysqli_fetch_assoc($result))
{
echo '<br>'.$row['Name'];
}
I think you're going to have to consider structuring your data in a different way to make the most of using an RDBMS.
If it were me, I'd be looking at normalising data into different tables in the first instance and disposing of unnecessary fields such as "Begins" and "Popularity". That kind of information can easily be reproduced or sought out in PHP or within a query itself. The advantage here is that you also reduce the number of columns that actually need to be maintained.
I haven't worked out a silver bullet schema but, roughly, I'd start with something along these lines and expand/modify where appropriate:
Names
- id
- name
- genderID
Genders
- id
- code
Years
- id
Frequencies
- id
- nameID
- yearID
- number
So, for example, a segment of your data may take the following shape:
Names (1, Aaron, 1)
Genders (1, Male)
Years (1987)
Frequencies (1, 1, 1987, 6), (2, 1, 1988, 19)
The beauty of having your data separated out like this is that it becomes much easier to query it. So, if you wanted the frequency of occurrences of the name Aaron between 1987 and 1988 you could do something like the following:
SELECT SUM(frequencies.number)
FROM frequencies
WHERE frequencies.yearID BETWEEN 1987 AND 1988
AND frequencies.nameID = 1
Furthermore, doing away with the "Begins" column would mean you can structure a query to use "LIKE":
SELECT * FROM names WHERE name LIKE "A%"
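Putting those two ideas together, here's a rough sketch of how the year range from your dropdowns could drive the query and let you sort by the computed frequency; the variable names and the gender code column are assumptions on my part:

<?php
// Sketch only: $genselect, $begins, $yearFrom and $yearTo would come from
// your form; $dbcnx is the connection you already use. Column names follow
// the schema sketched above.
$stmt = $dbcnx->prepare(
    'SELECT names.name, SUM(frequencies.number) AS freq
     FROM names
     JOIN genders     ON genders.id = names.genderID
     JOIN frequencies ON frequencies.nameID = names.id
     WHERE genders.code = ?
       AND names.name LIKE CONCAT(?, "%")
       AND frequencies.yearID BETWEEN ? AND ?
     GROUP BY names.id, names.name
     ORDER BY freq DESC'
);
$stmt->bind_param('ssii', $genselect, $begins, $yearFrom, $yearTo);
$stmt->execute();
$stmt->bind_result($name, $freq);
while ($stmt->fetch()) {
    echo '<br>' . $name . ' (' . $freq . ')';
}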
My examples are perhaps a bit contrived but hopefully they illustrate what I'm getting at.
One thing I haven't touched upon is how you might go about physically entering the data. What happens when a new name is added? Does a corresponding entry get made in the frequencies table automatically? Is a check performed in the frequencies table first and, if an entry exists, does it automatically increment the number?
These are important problems to consider but probably best left until after a schema is settled upon.
I am building an application that helps manage frisbee "hat tournaments". The idea is that people sign up for the hat tournament. When they sign up, they provide us with a numeric value between 1 and 6 which represents their skill level.
Currently, we are taking this huge list of people who signed up, and manually trying to create teams out of this based on the skill levels of each player. I figured, I could automate this by creating an algorithm that splits up the teams as evenly as possible.
The only data feeding into this is the array of "players" and a desired "number of teams". Generally speaking we are looking at 120 players and 8 teams.
My current thought process is basically to keep a running "score" for each team. The running score is the total of the assigned players' skill levels. I loop through each skill level, and inside the skill-level loop I go through rounds of picks. The pick order is recalculated each round based on each team's running score.
This actually works fairly well, but it's not perfect. For example, I had a spread of 5 points between teams with my sample data. I could very easily swap players around manually and bring the discrepancy down to no more than 1 point between teams; the problem is getting that done programmatically.
Here is my code thus far: http://pastebin.com/LAi42Brq
Snippet of what the data looks like:
[2] => Array
(
[user__id] => 181
[user__first_name] => Stephen
[user__skill_level] => 5
)
[3] => Array
(
[user__id] => 182
[user__first_name] => Phil
[user__skill_level] => 6
)
Can anyone think of a better, easier, more efficient way to do this? Many thanks in advance!!
I think you're making things too complicated. If you have T teams, sort your players according to their skill level. Choose the top T players to be captains of the teams. Then, starting with captain 1, each captain in turn chooses the player (s)he wants on the team. This will probably be the person at the top of the list of unchosen players.
This algorithm has worked in playgrounds (and, I dare say on the frisbee fields of California) for aeons and will produce results as 'fair' as any more complicated pseudo-statistical method.
A simple solution could be to first generate a team selection order; then each team "selects" the highest-skilled player available. For the next round the order is reversed: the last team to select a player gets the first pick and the first team gets the last pick. You reverse the picking order every round.
First round picking order could be:
A - B - C - D - E
second round would then be:
E - D - C - B - A
and then
A - B - C - D - E etc.
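A rough PHP sketch of that draft, using the field names from your data snippet (everything else is just an assumption):

<?php
// Sketch of the "snake" draft: sort players by skill, then hand them out
// round by round, reversing the picking order every round.
function snakeDraft(array $players, $numTeams)
{
    // Strongest players first.
    usort($players, function ($a, $b) {
        return $b['user__skill_level'] - $a['user__skill_level'];
    });

    $teams = array_fill(0, $numTeams, array());
    $order = range(0, $numTeams - 1);           // A - B - C - D - E
    foreach (array_chunk($players, $numTeams) as $round) {
        foreach ($round as $i => $player) {
            $teams[$order[$i]][] = $player;
        }
        $order = array_reverse($order);         // E - D - C - B - A next round
    }
    return $teams;
}

// e.g. $teams = snakeDraft($players, 8);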
It looks like this problem really is NP-hard, being a variant of the Multiprocessor scheduling problem.
"h00ligan"s suggestions is equivalent to the LPT algorithm.
Another heuristic strategy would be a variation of this algorithm:
First round: pick the best, second round: pair the teams with the worst (add from the end), etc.
With the example "6,5,5,3,3,1" and 2 teams this would give the teams "6,1,5" (=12) and "5,3,3" (=11). The strategy of "h00ligan" would give the teams "6,3,3" (=12) and "5,5,1" (=11).
This problem is unfortunately NP-Hard. Have a look at bin packing which is probably a good place to start and includes an algorithm you can hopefully tweak, this may or may not be useful depending on how "fair" two teams with the same score need to be.
This is a tough one. There is probably a name for this and I don't know it, so I'll describe the problem exactly.
I have a dataset including a number of user-submitted values. I need to be able to determine based on some sort of average, or better, a "closeness of data", which value is the correct value. For example, if I received the following three submissions from three users, 4, 10, 3, I would know that 3 or 4 would be the "correct" value in this case. If I were to average it out, I'd get 5.6 which is not the intended result.
I'm attempting to do this using MySQL and PHP.
tl;dr Need to find a value from a dataset based on "closeness" of relative values (using MySQL/PHP)
Thanks!
Clustering using a database isn't going to be a single query type of procedure. It takes iterations to generate the clusters effectively.
You first need to decide how many clusters you want. If you wanted only one cluster, then obviously everything would go into it. If you want two, then you can write your program to separate the nodes into two groups using some sort of correlation metric.
In other words, I don't think this is a MySQL question so much as a clustering question.
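To illustrate the idea (not a production approach), here's a rough sketch of a tiny two-cluster pass over one-dimensional values in PHP:

<?php
// Sketch: split a list of numbers into two clusters by repeatedly assigning
// each value to the nearest centroid and moving each centroid to the mean
// of its cluster (a tiny 1-D k-means with k = 2).
function twoClusters(array $values, $iterations = 20)
{
    sort($values);
    $centroids = array($values[0], $values[count($values) - 1]);
    $clusters = array(array(), array());

    for ($i = 0; $i < $iterations; $i++) {
        $clusters = array(array(), array());
        foreach ($values as $v) {
            $nearest = (abs($v - $centroids[0]) <= abs($v - $centroids[1])) ? 0 : 1;
            $clusters[$nearest][] = $v;
        }
        foreach ($clusters as $c => $members) {
            if (count($members) > 0) {
                $centroids[$c] = array_sum($members) / count($members);
            }
        }
    }
    return $clusters;
}

print_r(twoClusters(array(4, 10, 3)));  // e.g. cluster [3, 4] and cluster [10]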
I think this is the kind of thing you're looking for:
SELECT id, MIN(ABS(id - (SELECT AVG(id) FROM `table`))) AS min
FROM `table`
GROUP BY id
ORDER BY min
LIMIT 1;
For example, if your data set contains the IDs 3, 4, 10, the average is 5.6667, and the closest value to 5.6667 is 4. If your data set is 3, 6, 10, 14, with an average of 8.25, the closest value is 10.
This is what this query returns. Hope it helps.
I have the impression you are looking for the median.
E.g. in the list 1 2 3 4 100, the median (central value) is 3.
You may want to search for finding the median in SQL: https://stackoverflow.com/search?q=sql+median
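If the median is indeed what you want, a small PHP sketch after fetching the values could look like this:

<?php
// Sketch: median of an array of values fetched from the database.
function median(array $values)
{
    sort($values);
    $count = count($values);
    $middle = (int) ($count / 2);
    if ($count % 2 == 1) {
        return $values[$middle];
    }
    return ($values[$middle - 1] + $values[$middle]) / 2;
}

echo median(array(1, 2, 3, 4, 100));  // 3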
I'm making a digg-like website that is going to have a homepage with different categories. I want to display the most popular submissions.
Our rating system is simply "likes", like "I like this" and whatnot. We basically want to display the submissions with the highest number of "likes" per time. We want to have three categories: all-time popularity, last week, and last day.
Does anybody know of a way to help? I have no idea how to go about doing this and making it efficient. I thought we could use some sort of cron job that runs every 10 minutes and pulls in the number of likes for the last 10 minutes... but I've been told that's pretty inefficient.
Help?
Thanks!
Typically Digg and Reddit-like sites go by the date of the submission and not the times of the votes. This way all it takes is a simple SQL query to find the top submissions for X time period. Here's a pseudo-query to find the 10 most popular links from the past 24 hours using this method:
select * from submissions
where (current_time - post_time) < 86400
order by score desc limit 10
Basically, this query says to find all the submissions where the number of seconds between now and the time it was posted is less than 86400, which is 24 hours in UNIX time.
If you really want to measure popularity within X time interval, you'll need to store the post and time for every vote in another table:
create table votes (
  post integer,
  time datetime,
  vote integer,  -- +1 for upvote, -1 for downvote
  foreign key (post) references submissions(id)
);
Then you can generate a list of the most popular posts between X and Y times like so:
select sum(vote), post from votes
where X < time and time < Y
group by post
order by sum(vote) desc limit 10;
From here you're just a hop, skip, and inner join away from getting the post data tied to the returned ids.
Do you have a decent DB setup? Can we please hear about your CREATE TABLE details and indices? Assuming a sane setup, the DB should be able to pull the counts you require fast enough to suit your needs! For example (net of indices and keys, that somewhat depend on what DB engine you're using), given two tables:
CREATE TABLE submissions (subid INT, `when` DATETIME, etc etc)
CREATE TABLE likes (subid INT, `when` DATETIME, etc etc)
you can get the top 33 all-time popular submissions as
SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33
and those voted for within a certain time range as
SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
WHERE likes.when BETWEEN initial_time AND final_time
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33
If you were storing "votes" (positive or negative) in likes, instead of just counting each entry there as +1, you could simply use SUM(likes.vote) instead of the COUNTs.
For stable lists like all-time and last-week, which are not supposed to change very quickly, I think you should save the list in your cache with an expiration time of around a day or longer.
If you are concerned about correct counts in real time, you can check on every page view by comparing the viewed page with the lowest-ranked page in the cache.
All you need to do is take care of synchronizing the cache with the actual database.
Queries where the order is some function of the current time can become real performance problems. Things get much simpler if you can bucket by calendar time and update scores for each bucket as people vote.
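For example, a rough sketch of that bucketing; the daily_scores table and its columns are made up for illustration:

<?php
// Sketch: one score bucket per post per calendar day, bumped on every vote.
// Assumes a table daily_scores(post_id, day, score) with a primary key on
// (post_id, day) and an existing mysqli connection in $db; names are made up.
function recordVote(mysqli $db, $postId, $vote)
{
    $stmt = $db->prepare(
        'INSERT INTO daily_scores (post_id, day, score)
         VALUES (?, CURDATE(), ?)
         ON DUPLICATE KEY UPDATE score = score + VALUES(score)'
    );
    $stmt->bind_param('ii', $postId, $vote);
    $stmt->execute();
}

// The "last day" / "last week" lists then just sum the recent buckets:
//   SELECT post_id, SUM(score) AS s
//   FROM daily_scores
//   WHERE day >= CURDATE() - INTERVAL 7 DAY
//   GROUP BY post_id
//   ORDER BY s DESC
//   LIMIT 10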
To complete nobody_'s answer I would suggest you read up on the documentation (if you are using MySQL of course).