I'm developing a whisky information system in PHP connected to a MySQL database with three tables: bottles (about 100 in total), users, and the bottles particular users have added as favourites to their whisky shelf.
I'm attempting to build a function that recommends whiskies to a user based on the whiskies currently on their shelf.
Each whisky has a 'flavour profile' with 12 different flavour features (e.g. whether the whisky is nutty, smoky, etc.), each ranked on a scale from 0 to 4. So I basically have 12 numbers to play with and compare against another 12 numbers.
I've done a fair bit of research on the subject, but can only find simple implementations comparing one rating to another; I can't think of an efficient way to compare 12 numbers and return some kind of match percentage.
Does anyone have suggestions on the best method to compare the whiskies in the database to the whiskies in a user's favourites and recommend the closest matches?
What you are trying to accomplish is, in essence, Pandora for Whiskey. You will have to devise an algorithm which will compare different characteristics and provide some sort of weight that will affect the overall outcome. This is not a trivial process and your algorithm will undergo modification many times before it works optimally.
+-----------------+--------------+------------+------------+
| CHARACTERISTICS | YOUR WHISKEY | WHISKEY #1 | WHISKEY #2 |
+-----------------+--------------+------------+------------+
| Smoky           | x            |            | x          |
| Nutty           |              | x          | x          |
+-----------------+--------------+------------+------------+
In the above example, YOUR WHISKEY is one that you like, and WHISKEY #2 shares more of your desired characteristics than WHISKEY #1 does. That is a very simple comparison, and it doesn't factor in very much.
You need to sit down with your data, create an algorithm, and then try it out on people. If it doesn't quite work right, tweak the algorithm some more. It's a continuous process that will eventually give you what you want.
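Since each flavour profile is just 12 numbers on the same 0-4 scale, one simple starting point is to treat the profiles as vectors and turn the distance between two of them into a match percentage. Here is a minimal PHP sketch of that idea (my own illustration, not a finished recommender; it assumes profiles arrive as 12-element arrays indexed 0-11):

<?php
// Turn the Euclidean distance between two 0-4 scaled profiles into a
// match percentage: identical profiles score 100, maximally different 0.
function matchPercentage(array $a, array $b): float
{
    $sumSquares = 0.0;
    foreach ($a as $i => $value) {
        $sumSquares += ($value - $b[$i]) ** 2;
    }
    $maxDistance = sqrt(count($a) * 4 ** 2); // worst case: every feature 4 apart
    return (1 - sqrt($sumSquares) / $maxDistance) * 100;
}

// Average the shelf's profiles into a single "taste" vector, then rank
// every candidate bottle by how closely it matches that taste.
function recommend(array $shelfProfiles, array $candidateProfiles): array
{
    $taste = array_map(
        fn(...$values) => array_sum($values) / count($values),
        ...$shelfProfiles
    );
    $scores = [];
    foreach ($candidateProfiles as $bottleId => $profile) {
        $scores[$bottleId] = matchPercentage($taste, $profile);
    }
    arsort($scores); // best matches first
    return $scores;
}

Per the point above about weighting, this is only a baseline: boosting features the user's shelf scores consistently high on is where the real tuning would happen.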
This similar post on collaborative filtering and recommendation systems might provide some more useful insight: What is algorithm behind the recommendation sites like last.fm, grooveshark, pandora?
I have a MySQL table with area and lat/lon location columns. Every area has many locations, say 20,000. Is there a way to pick just a few, say 100, that look somewhat evenly distributed on the map?
The distribution doesn't have to be perfect; query speed is more important. If that is not possible directly in MySQL, a very fast algorithm that somehow picks evenly distributed locations might also work.
Thanks in advance.
Edit: answering some requests from the comments. The data doesn't contain anything else that could be used for this; it's just the area and the coordinates of the locations, for example:
+-------+--------------+----------+-----------+------------+--------+--------+
| id | area | postcode | lat | lon | colour | size |
+-------+--------------+----------+-----------+------------+--------+--------+
| 16895 | Athens | 10431 | 37.983917 | 23.7293599 | red | big |
| 16995 | Athens | 11523 | 37.883917 | 23.8293599 | green | medium |
| 16996 | Athens | 10432 | 37.783917 | 23.7293599 | yellow | small |
| 17000 | Thessaloniki | 54453 | 40.783917 | 22.7293599 | green | small |
+-------+--------------+----------+-----------+------------+--------+--------+
There are some more columns with characteristics but those are just used for filtering.
I did try getting the nth row in the meantime; it seems to work, although it's a bit slow:
SET @a = 0;
SELECT * FROM `locations` WHERE (@a := @a + 1) % 200 = 0;
Using ORDER BY RAND() also works, but it's a bit slow too.
Edit 2: it turns out it was easy to add postcodes to the table. Having those, grouping by postcode gives a result that looks nice to the eye. The only issue is that some very large areas have around 3,000 distinct postcodes, so picking just 100 may mean many of them end up in one place; I will probably need to process further in PHP.
Edit 3: answering @RickJames's questions from the comments so they are in one place:
Please define "evenly distributed" -- evenly spaced in latitude? no two are "close" to each other? etc.
"Evenly distributed" was a bad choice of words. We just want to show some locations on the area that are not all in one place
Are the "areas" rectangles? Hexagons? Or gerrymandered congressional districts?
They can be thought of roughly as rectangles, but it shouldn't matter. An important thing I missed: we also need to show locations from multiple areas. Areas may be far apart from each other or neighbouring (but not overlapping). In that case we'd want the sample of 100 to be split between the areas.
Is the "100 per area" fixed? Or can it be "about 100"
It's not fixed, it's around 100 but we can change this if it doesn't look nice
Is there an AUTO_INCREMENT id on the table? Are there gaps in the numbers?
Yes, there is an AUTO_INCREMENT id, and it can have gaps.
Has the problem changed from "100 per area" to "1 per postal code"?
No, the problem is still the same: "show 100 per area in a way that not all of them are in the same place". How this is done doesn't matter.
What are the total row count and desired number of rows in output?
The total row count depends on the area and the criteria; it can be up to 40k in an area. If the total is more than 1,000 we want to fall back to showing just a random 100. If it is 1,000 or less we can just show all of them.
Do you need a different sample each time you run the query?
Either is fine; the same criteria may return the same sample or a different one.
Are you willing to add a column to the table?
It's not up to me, but if I have a good argument then most probably we can add a new column.
Here's an approach that may satisfy the goals.
Preprocess the table, making a new table, to get rid of "duplicate" items.
If the new table is small enough, a full scan of it may be fast enough.
As for "duplicates", consider this as a crude way to discover that two items land in the same spot:
SELECT ROUND(latitude * 5),
       ROUND(longitude * 3),
       MIN(id) AS id_to_keep
FROM tbl
GROUP BY 1, 2;
The "5" and "3" can be tweaked upward (or downard) to cause more (or fewer) ids to be kept. "5" and "3" are different because of the way the lat/lng are laid out; that ratio might work most temperate latitudes. (Use equal numbers near the equator, use a bigger ration for higher latitudes.)
There is a minor flaw... Two items very close to each other might be across the boundaries created by those ROUNDs.
How many rows are in the original table? How many rows does the above query generate? ( SELECT COUNT(*) FROM ( ... ) x; )
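Putting the pieces together with the "show all if 1,000 or fewer, otherwise a random 100" rule from the question, the whole flow might look like the sketch below (my own assembly; the thinned-table name, PDO wiring and credentials are all assumptions):

<?php
// One-off (or scheduled) preprocessing: collapse near-duplicates onto a
// coarse grid, keeping one representative id per cell and area.
$pdo = new PDO('mysql:host=localhost;dbname=geo', 'user', 'pass');
$pdo->exec("CREATE TABLE IF NOT EXISTS locations_thinned AS
            SELECT MIN(id) AS id, area,
                   ROUND(lat * 5) AS cell_lat,
                   ROUND(lon * 3) AS cell_lon
            FROM locations
            GROUP BY area, cell_lat, cell_lon");

// Per request: count what survived the thinning for the area, then either
// show everything or fall back to a random sample of about 100.
function sampleArea(PDO $pdo, string $area): array
{
    $count = $pdo->prepare("SELECT COUNT(*) FROM locations_thinned WHERE area = ?");
    $count->execute([$area]);

    $sql = "SELECT l.*
            FROM locations l
            JOIN locations_thinned t ON t.id = l.id
            WHERE t.area = ?";
    if ((int) $count->fetchColumn() > 1000) {
        // ORDER BY RAND() is tolerable here: the thinned table is small.
        $sql .= " ORDER BY RAND() LIMIT 100";
    }
    $stmt = $pdo->prepare($sql);
    $stmt->execute([$area]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}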
I need an opinion / a possible solution on the matter below.
There is a questionnaire with 67 questions, coded in PHP and using a MySQL database. By design the data table contains an ID column plus one column per question number.
So, I will generate a report from these answers, i.e. I'll get the mean and median for each question and show them on a user report screen. There are 493 rows now, and I want to design something that will not take longer and longer to process as time goes on.
Any opinions, or an approach that makes the process easier (bearable)? Shall I create a class for the calculations, run it for each question, and store the values in a view? I found an answer here for a similar issue but just could not be sure about it. I'd really love to hear any ideas.
Personally, I'd avoid using a table 67 columns wide, and use a three-column table with a two-column primary key instead.
ID | Q | Result
---+---+-------
 1 | 1 | 1
 1 | 2 | 3
 1 | 3 | 2
...
 4 | 5 | 4
Then run stats on that; the table will be 67 times longer, but your stats will all be primary-key lookups. And anything less than a couple million rows will be pretty damned fast anyway.
Oh, and do the stats using MySQL; it's good at that sort of thing. For example:
SELECT AVG(Result) FROM answers WHERE Q = 1;  -- 'answers' being whatever you call the table
And use this solution for the median.
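If pulling the values into PHP is acceptable, the median is also easy to compute there. A small sketch (my own; the answers table name and PDO wiring are assumptions):

<?php
// Fetch one question's results in order and take the middle value(s).
function medianForQuestion(PDO $pdo, int $question): float
{
    $stmt = $pdo->prepare("SELECT Result FROM answers WHERE Q = ? ORDER BY Result");
    $stmt->execute([$question]);
    $values = $stmt->fetchAll(PDO::FETCH_COLUMN);

    $n = count($values);
    if ($n === 0) {
        throw new RuntimeException("No answers for question $question");
    }
    $mid = intdiv($n, 2);
    // Even count: average the two middle values; odd count: the middle one.
    return $n % 2 === 0
        ? ($values[$mid - 1] + $values[$mid]) / 2
        : (float) $values[$mid];
}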
First of all: sorry for the long post. I am trying to explain a hard situation in an easy way and, at the same time, to give as much information as I can.
I have an algorithm that tries to determine user intent during a search. There are a couple of ways I can use it, and I have the same problem with both of them, so let's say I use it for disambiguation. Well, with a db structure like this one (or any other that gets the job done):
post
ID | TITLE
---+----------------------------------------------
1 | Orange developed the first 7G phone
2 | Orange: the fruit of gods
3 | Theory of Colors: Orange
4 | How to prepare the perfect orange juice
keywords
ID | WORD | ABOUT
---+----------+---------
1 | orange | company
2 | orange | fruit
3 | orange | color
post_keywords
ID | POST | KEYWORD
---+-------+---------
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
4 | 4 | 2
If a user searches for the word "orange" in a search box, the algorithm sees that orange may refer to the company, the fruit, or the color and, by asking a couple of questions, tries to determine which one the user is looking for. After all that I get an array like this one:
$e = array(
    'fruit'   => 0.153257,
    'color'   => 0.182332,
    'company' => 0.428191,
);
At this point I know the user is probably looking for information about the fruit (because the fruit's value is the closest to 0), and if I am wrong my second bet is the color. At the bottom of the list, the company.
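For illustration, deriving the FIELD() order mentioned below from $e could look like this (a sketch; the about-to-id mapping mirrors the keywords table above):

<?php
$keywordIds = ['company' => 1, 'fruit' => 2, 'color' => 3];

asort($e); // lowest value first, i.e. the most likely interpretation
$orderedIds = array_map(
    fn(string $about) => $keywordIds[$about],
    array_keys($e)
); // [2, 3, 1]
$orderBy = "ORDER BY FIELD(keywords.id, " . implode(', ', $orderedIds) . ")";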
So, with a JOIN and ORDER BY FIELD(keywords.id, 2, 3, 1) I can give the results the (almost) perfect order:
- Orange: the fruit of gods
- How to prepare the perfect orange juice
- Theory of Colors: Orange
- Orange developed the first 7G phone
Well... as you can imagine, I wouldn't come for help if everything were so nice. The problem is that the previous example has only 4 possible results, so if the user really was looking for the company he can find that result in the 4th position and everything is okay. But if we have 200 posts about the fruit and 100 posts about the color, the first post about the company comes in position 301.
I am looking for a way to alternate the order (in a predictable and repeatable way) now that I know the user is most likely looking for the fruit, followed by the color, with the company at the end. I want to show a post about the fruit in the first position (and possibly the second), followed by a post about the color, followed by one about the company, and then start the cycle again until the results run out.
Edit: I'll be happy with a MySQL trick or with an idea to change the approach, but I can't accept third-party solutions.
You can use user variables to provide a custom sort field.
SELECT
    p.*,
    CASE k.about
        WHEN 'company' THEN @sort_company := @sort_company + 1
        WHEN 'color'   THEN @sort_color   := @sort_color   + 1
        WHEN 'fruit'   THEN @sort_fruit   := @sort_fruit   + 1
        ELSE NULL
    END AS sort_order,
    k.about
FROM post p
JOIN post_keywords pk ON (p.id = pk.post)
JOIN keywords k ON (pk.keyword = k.id)
JOIN (SELECT @sort_fruit := 0, @sort_color := 0, @sort_company := 0) AS vars
ORDER BY sort_order, FIELD(k.id, 2, 3, 1)
Result will look like this:
| id | title | sort_order | about |
|---:|:----------------------------------------|-----------:|:--------|
| 2 | Orange: the fruit of gods | 1 | fruit |
| 3 | Theory of Colors: Orange | 1 | color |
| 1 | Orange developed the first 7G phone | 1 | company |
| 4 | How to prepare the perfect orange juice | 2 | fruit |
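As an aside, assignment to user variables inside expressions is deprecated in recent MySQL; on MySQL 8.0+ the same interleaving can be produced with a window function. A sketch of that variant (my own, including the PDO wiring):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// ROW_NUMBER() restarts at 1 for each keyword meaning, producing the same
// round-robin sort key the user-variable version builds by hand.
$sql = "SELECT p.*, k.about,
               ROW_NUMBER() OVER (PARTITION BY k.about ORDER BY p.id) AS sort_order
        FROM post p
        JOIN post_keywords pk ON p.id = pk.post
        JOIN keywords k ON pk.keyword = k.id
        ORDER BY sort_order, FIELD(k.id, 2, 3, 1)";
foreach ($pdo->query($sql) as $row) {
    echo "{$row['sort_order']} {$row['about']}: {$row['title']}\n";
}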
I think you do need some way of categorizing, or, I would prefer to say, clustering the answers. If you can do this, you can then start by showing the users the top scoring answer from each cluster. Hey, sometimes maximising diversity really is worth doing just for its own sake!
I think you should be able to cluster answers. You have some sort of scoring formula which tells you how good an answer a document is to a user query, perhaps based on a "bag of words" model. I suggest that you use this to tell how close one document is to another by treating one document as the query and the other as the answer. If you do exactly this, you might want to score the pair in both directions and average the two scores, so that the score d(a, b) has the property that d(a, b) = d(b, a).
Now you have a score (unfortunately probably not a distance: with a score, high values mean close together) and you need a clustering algorithm. Ideally you want a fast one, but maybe it just has to be fast enough to beat a human reading through the answers.
One fast clustering algorithm is to keep track of N cluster centres (for some parameter N). Initialise these to the first N documents retrieved, then consider every other document one at a time. At each stage you are trying to reduce the maximum score found between any two of the cluster centres (which amounts to pushing the centres as far apart as possible). When you consider a new document, compute the score between that document and each of the N current cluster centres. If the maximum of these scores is less than the current maximum score between the N current cluster centres, then this document is further from the cluster centres than they are from each other, so you want it. Swap it with one of the N cluster centres - whichever swap makes the maximum score between the new N cluster centres the least.
This isn't a perfect clustering algorithm - for one thing, the result depends on the order in which documents are presented, which is a bad sign. It is, however, reasonably fast for small N, and it has one nice property: if you have k <= N clusters, and (switching from scores to distances) every distance within a cluster is smaller than every distance between two points from different clusters, then the N cluster centres at the end will include at least one point from each of the k clusters. The first time you see a member of a cluster you haven't seen before, it will become a cluster centre, and the number of clusters represented among the centres never goes down: ejecting a point that sits in a different cluster from the other centres could not increase the minimum distance (that is, reduce the maximum score) between any two points held as cluster centres.
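For concreteness, here is a PHP sketch of the swap procedure just described (my own rendering of it; $scoreFn is any symmetric similarity score where high means close, and documents are opaque values):

<?php
function diverseCentres(array $documents, int $n, callable $scoreFn): array
{
    $centres = array_slice($documents, 0, $n);

    // Worst (i.e. highest) pairwise similarity among the current centres.
    $maxPair = function (array $cs) use ($scoreFn): float {
        $max = -INF;
        for ($i = 0; $i < count($cs); $i++) {
            for ($j = $i + 1; $j < count($cs); $j++) {
                $max = max($max, $scoreFn($cs[$i], $cs[$j]));
            }
        }
        return $max;
    };

    foreach (array_slice($documents, $n) as $candidate) {
        $toCentres = array_map(fn($c) => $scoreFn($candidate, $c), $centres);
        // The candidate is less similar to every centre than the two
        // closest centres are to each other, so swapping it in improves
        // diversity; pick the swap leaving the lowest maximum similarity.
        if (max($toCentres) < $maxPair($centres)) {
            $bestScore = INF;
            $best = $centres;
            foreach (array_keys($centres) as $k) {
                $trial = $centres;
                $trial[$k] = $candidate;
                if (($score = $maxPair($trial)) < $bestScore) {
                    $bestScore = $score;
                    $best = $trial;
                }
            }
            $centres = $best;
        }
    }
    return $centres;
}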
I am building a web app which will allow users to create small "spreadsheets" with their data. This is a time keeping application and I am building the reporting tool.
I am looking for a way for different groups to have different reports without having to hard-code the algorithms. I would like to store these algorithms in a MySQL database and pull them out on a per-person basis.
The application allows users to track time and assign it to activities.
The report builder will allow users to select which activities to report on, and will give them the ability to create small equations out of the totals.
For instance: one group uses overtime-given and overtime-paid. The overtime-given needs to be multiplied by 1.5, and then the overtime-paid is subtracted from this to give a total.
This is an example output:
|Month|overtime remaining| overtime-paid | overtime-given|
|-----|------------------|---------------|---------------|
| Jan | 7 | 2 | 6 |
| Feb | 9 | 0 | 6 |
| Mar | 0 | 7 | 0 |
| Apr | 7 | 2 | 6 |
| ... | ... | ... | ... |
|THour| 50 | 55 | 70 |//total in hours
|TDay | 7.14 | 7.85 | 10 |//total in days
I am not sure if the best way is to build a small interpreter and create my own tiny language to describe it, or if there is something out there already like this.
I would like to know how other people are creating per-user customized reporting tools that the users themselves can alter. I can build a simple UI for the altering, so users would not have to code anything.
If someone could point me in the proper direction it would be greatly appreciated.
Thanks,
The general solution to this problem is to use a runtime expression evaluator in the form of scripts. Similar problems appear in handling complex and varying calculations like costing, quoting and payroll.
To take a simple example, the script code might read:
$total = $overtime_given * 1.5 - $overtime_paid
The questions are:
What language to use for the expressions.
How to give the language access to the variables it needs (and nothing else)
The main choices are:
Write your own expression evaluator/language. Hard! (but my favorite)
Use the hosting language, if suitable. I believe PHP can be used in this role, but I defer to others as to how well that would work.
Embed a dynamically compiled/interpreted language designed for the purpose. I would suggest Lua, but it depends on whether it can be embedded on your platform.
This is not a complete answer, but rather a direction to pursue. I have done this in Ruby (embedded my own expression evaluator), but hesitate to suggest it unless you really like compiler technology.
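To make the direction concrete, here is a toy PHP sketch of such an evaluator (entirely my own illustration, not production code: it leans on eval(), kept safe only by the whitelist check; a hand-written parser is the more robust route):

<?php
// Evaluate a stored formula such as "overtime_given * 1.5 - overtime_paid"
// against a row of computed totals.
function evaluateFormula(string $formula, array $vars): float
{
    // Replace each known variable name with its numeric value.
    $expr = preg_replace_callback('/[a-z_][a-z0-9_]*/i', function ($m) use ($vars) {
        if (!array_key_exists($m[0], $vars)) {
            throw new InvalidArgumentException("Unknown variable: {$m[0]}");
        }
        return var_export((float) $vars[$m[0]], true);
    }, $formula);

    // After substitution, only numbers, arithmetic operators, parentheses
    // and whitespace may remain; anything else is rejected.
    if (!preg_match('/^[\d.+\-*\/()\s]+$/', $expr)) {
        throw new InvalidArgumentException('Illegal characters in formula');
    }
    return (float) eval("return $expr;");
}

// The overtime rule from the question: 6 given, 2 paid => 7 remaining.
echo evaluateFormula('overtime_given * 1.5 - overtime_paid',
                     ['overtime_given' => 6, 'overtime_paid' => 2]);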
This is for an upcoming project. I have two tables: the first one keeps track of photos, and the second one keeps track of each photo's rank.
Photos:
+-------+-----------+------------------+
| id | photo | current_rank |
+-------+-----------+------------------+
| 1 | apple | 5 |
| 2 | orange | 9 |
+-------+-----------+------------------+
The photo rank keeps changing on a regular basis, and this is the table that tracks it:
Ranks:
+-------+-----------+----------+-------------+
| id | photo_id | ranks | timestamp |
+-------+-----------+----------+-------------+
| 1 | 1 | 8 | * |
| 2 | 2 | 2 | * |
| 3 | 1 | 3 | * |
| 4 | 1 | 7 | * |
| 5 | 1 | 5 | * |
| 6 | 2 | 9 | * |
+-------+-----------+----------+-------------+ * = current timestamp
Every rank is tracked for reporting/analysis purpose.
[Edit] Users will have access to the statistics on demand.
I talked to someone who has experience in this field, and he told me that storing ranks like above is the way to go. But I'm not so sure yet.
The problem here is data redundancy. There are going to be tens of thousands of photos. The photo rank changes on an hourly basis (often within minutes) for recent photos, but less frequently for older photos. At this rate the table will have millions of records within months. And since I do not have experience working with large databases, this makes me a little nervous.
I thought of this:
Ranks:
+-------+-----------+--------------------+
| id | photo_id | ranks |
+-------+-----------+--------------------+
| 1 | 1 | 8:*,3:*,7:*,5:* |
| 2 | 2 | 2:*,9:* |
+-------+-----------+--------------------+ * = current timestamp
That means some extra code in PHP to split (and sort) the rank/time pairs, but that looks OK to me.
Is this a correct way to optimize the table for performance? What would you recommend?
The first one. Period.
Actually, with the second design you'll lose much more than you save: a timestamp stored in an int column occupies only 4 bytes of space, while the same timestamp stored in string format takes 10 bytes.
Your first design is correct for a relational database. The redundancy in the key columns is preferable because it gives you a lot more flexibility in how you validate and query the rankings. You can do sorts, counts, averages, etc. in SQL without having to write any PHP code to split your string six ways from Sunday.
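For example, the usual reporting aggregates stay single queries against the rank history (a sketch; the connection details are assumptions):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=photos', 'user', 'pass');

// Count, average and range of every photo's rank history in one pass.
$stats = $pdo->query(
    "SELECT photo_id,
            COUNT(*)   AS times_ranked,
            AVG(ranks) AS average_rank,
            MIN(ranks) AS lowest_rank,
            MAX(ranks) AS highest_rank
     FROM Ranks
     GROUP BY photo_id"
)->fetchAll(PDO::FETCH_ASSOC);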
It sounds like you might want a non-SQL database like CouchDB or MongoDB. These would allow you to store a semi-structured list of rankings right in the record for the photo, and subsequently query the rankings efficiently - with the caveat that you don't really know the rankings are in the right format, as you do with SQL.
I would stick with your first approach. With the second you will have a lot of data stored in one row, and it only grows as time goes by; a photo could get thousands and thousands of rankings.
The first approach is also more maintainable, for example if you wish to delete a rank.
I'd say the database 'hit' of over-normalisation (querying the ranks table over and over) is nicely avoided by 'caching' the last rank in current_rank. It does not really matter that ranks grows tremendously if it is seldom queried (analysis/reporting, you said), never updated, and just gets records appended at the end: even a very light box would have no problem with millions of rows in that table.
Your alternative would require lots of updates at different locations on the disk, possibly resulting in degraded performance.
Of course, if you need all the old data, and always by photo_id, you could plan a scheduled run that moves a month's rows to another table rankings_old (with photo_id, year, month and the rankings including timestamps) once the month is over. Retrieving old data stays easy, and there are no updates needed in rankings_old or rankings, only inserts at the end of the table.
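A sketch of that scheduled monthly move (the archive-table layout and PDO wiring are my assumptions, following the description above):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=photos', 'user', 'pass');

// First day of the current month: everything older gets archived.
$cutoff = "DATE_FORMAT(NOW(), '%Y-%m-01')";

$pdo->beginTransaction();
$pdo->exec("INSERT INTO rankings_old (photo_id, ranks, `timestamp`)
            SELECT photo_id, ranks, `timestamp`
            FROM Ranks
            WHERE `timestamp` < $cutoff");
$pdo->exec("DELETE FROM Ranks WHERE `timestamp` < $cutoff");
$pdo->commit();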
And take it from me: millions of records in a pure logging table should be absolutely no problem.
Normalized data or not normalized data. You will find thousands of articles about that. :)
It really depends on your needs.
If you want to build your database with only performance (speed, RAM consumption, ...) in mind, you should trust only the numbers. To do that you have to profile your queries against the expected data volume (you can generate the data with a script you write). To profile your queries, learn how to read the results of the two following statements:
EXPLAIN EXTENDED ...
SHOW STATUS
Then learn what to do to improve the figures (MySQL settings, data structure, hardware, etc.).
As a starter, I really recommend these two great articles:
http://www.xaprb.com/blog/2006/10/12/how-to-profile-a-query-in-mysql/
http://ajohnstone.com/archives/mysql-php-performance-optimization-tips/
If you want to build for the academic beauty of normalization: just follow the books and the general recommendations. :)
Out of the two options - like everyone before me said - it has to be option 1.
What you should really be concerned about are the bottlenecks in the application itself. Are users going to refer to the historical data often, or does it only show up for a few select users? If everyone gets to see the historical rank data, then option 1 is good enough. If you are not going to refer to the historical ranks that often, you could create a third "archive" table and, before updating the ranks, copy the rows of the original rank table into it. This keeps the number of rows minimal on the main table that is being queried.
Remember, if you're updating tens of thousands of rows, it might be more fruitful to compute the results in your code (PHP/Python/etc.), truncate the table and insert the new results, rather than updating it row by row, as that would be a potential bottleneck.
You may want to look up sharding as well (horizontal partitioning) - http://en.wikipedia.org/wiki/Shard_%28database_architecture%29
And never forget to index well.
Hope that helped.
You stated the rank is only linked to the image, in which case all you need is table 1, with the rank updated in real time. Table 2 just stores unnecessary data. The disadvantage of this approach is that a user can't change his vote.
You said the second table is for analysis/statistics, so it isn't something that needs to be stored in the db. My suggestion is to get rid of the second table and use a logging facility to record rank changes.
Your second design is very dangerous if a photo gets 1 million votes. Can PHP handle parsing a string that long?
With the first design you can do all the math at the database level, which will return you a small result set.