We often see 'related items'. For instance, blogs have related posts, book stores have related books, etc. My question is: how is that relevancy compiled? If it's just tags, I often see related items that do not share a tag. For instance, when searching for 'pink', a related item could have a 'purple' tag.
Does anyone have any ideas?
There are many ways to calculate similarity of two items, but for a straightforward method, take a look at the Jaccard Coefficient.
http://en.wikipedia.org/wiki/Jaccard_index
Which is: J(A,B) = |intersection(A,B)| / |union(A,B)|
So let's say you want to compute the coefficient of two items:
Item A, which has the tags "books, school, pencil, textbook, reading"
Item B, which has the tags "books, reading, autobiography"
intersection(A,B) = books, reading
union(A,B) = books, school, pencil, textbook, reading, autobiography
so J(A,B) = 2/6 ≈ 0.333
So the most related item to A would be the item which results in the highest Jaccard Coefficient when paired with A.
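As a minimal sketch of that computation in PHP (assuming tags are kept as plain string arrays), the whole thing fits in a few lines:

<?php
// Jaccard coefficient of two tag sets, given as arrays of strings.
function jaccard(array $a, array $b)
{
    $intersection = count(array_intersect($a, $b));
    $union = count(array_unique(array_merge($a, $b)));
    return $union > 0 ? $intersection / $union : 0.0;
}

$itemA = array('books', 'school', 'pencil', 'textbook', 'reading');
$itemB = array('books', 'reading', 'autobiography');
echo jaccard($itemA, $itemB); // 0.33333333333333

For a small catalogue you can simply compute this against every other item and keep the highest scores; for a large one you'd precompute or index.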
Here are some of the ways:
Manually connecting them. Put up a table with the fields item_id and related_item_id, then make an interface to insert the connections. Useful to relate two items that are related but have no resemblance or do not belong to the same category/tag (or in an uncategorized entry table). Example: Bath tub and rubber ducky
Pull up some items that belong to the same category or have a similar tag. The idea is that those items must be somewhat related since they are in the same category. Example: on the page viewing an LCD monitor, there are random LCD monitors (with the same price range/manufacturer/resolution) in the "Related items" section.
Do a text search matching the current item's name (and/or description) against other items in the table. You get the idea.
To get a simple list of related items based on tags, the basic solution goes like this:
3 tables, one with items, one with tags and one with the connection. The connection table consists of two columns, one for each id from the remaining tables. An entry in the connection table links a tag with an item by putting their respective ids in a row.
Now, to get that list of related items.
Fetch all items which share at least one tag with the original item. Be sure to fetch the tags along with the items, and then use a simple rating mechanism to determine which item shares the most tags with the original one. Each shared tag increases the relation-relevancy by one.
Depending on your tagging habits, it might be smart to add a counter-mechanism to prevent large, overarching tags from skewing the relevancy. To achieve this, you could give greater weight to tags below a certain usage threshold. A threshold which has generally worked nicely for me is total_number_of_tag_uses / total_number_of_tags, which is the average number of uses per tag. If a tag's usage count is smaller than average, it increases the relation-relevancy by two instead of one.
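A sketch of how that could look with PHP and PDO (the schema and names here are my assumptions, following the three-table layout above):

<?php
// Assumed schema: items(id), tags(id), item_tag(item_id, tag_id).
// Rates candidate items by shared tags; tags used less often than
// average count double, as described above.
function relatedItems(PDO $db, $itemId, $avgTagUses)
{
    $sql = "SELECT it2.item_id,
                   (SELECT COUNT(*) FROM item_tag WHERE tag_id = it1.tag_id) AS tag_uses
            FROM item_tag it1
            JOIN item_tag it2
              ON it2.tag_id = it1.tag_id AND it2.item_id <> it1.item_id
            WHERE it1.item_id = ?";
    $stmt = $db->prepare($sql);
    $stmt->execute(array($itemId));

    $scores = array();
    foreach ($stmt as $row) {
        $weight = ($row['tag_uses'] < $avgTagUses) ? 2 : 1; // rare tags count double
        $id = $row['item_id'];
        $scores[$id] = (isset($scores[$id]) ? $scores[$id] : 0) + $weight;
    }
    arsort($scores); // highest relation-relevancy first, keyed by item_id
    return $scores;
}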
It can be more than tags; for example, it can be the average of each word appearing in a paragraph, and then titles, etc.
I would say they use an ontology for that, which adds more great features to the application.
It can also be based on "people who bought this book also bought".
No matter how, you will need some sort of connection between your items, and those connections will mostly be made by human beings.
This is my implementation (gist) of the Jaccard index with PostgreSQL and Ruby on Rails...
Here is an implementation of the Jaccard index between two texts, based on bigrams.
https://packagist.org/packages/darkopetreski/textcategorization
I have the following problem.
I have a database of, say, 1000 items. Each item can have any number of identifying tags associated with it. For the purpose of this question, the items and tags are purely hypothetical. So, for instance, say one of the items is a DVD; then the tags for that item would be:
DVD, The Lone Ranger, western, action, family
And another DVD is tagged with:
DVD, The Magnificent 7, western, action
Now someone on my website searches for the following key words in the search box and clicks Search:
western, action, family, PG13
Both DVDs match at least 2 of the search terms, and neither matches PG13. Also, the first DVD is the closer match to the search terms.
The search is started, and for all 1000 products I have to search through each item's tags to see if they match the search criteria.
So for the first DVD, it matches 3 of the 4 tags, and for the second DVD it matches 2 of the 4 tags.
My question is: how do I optimise this search? For each item, the query looks through that item's tags and matches them against the search terms. When no items matching all search terms are found, it has to "drop" one of the search terms and look to see if any item matches any combination of 3 of the 4 search terms.
Then it drops another search term and searches for 2 of the 4 search terms, trying to match any combination of 2 of the 4.
It is the "dropping" of search terms and searching all possible combinations that I need to optimise. Does anyone know what the best algorithm for this would be, or can anyone provide pseudo code for this?
I have no idea how to approach this; every scenario I can think of still requires searching each possible combination of search terms, which will slow down the speed at which items can be returned to customers.
EDIT: I have thought about giving each item tag a weight, but the problem is that the nature of the tags is such that no tag carries more weight than any other. All tags are equally weighted/important.
The speed at which the database can be queried and results returned is my biggest goal here.
As an approach, I'd explore using a left join for the search terms, with a group by summing up the count each term returns. You'd then have something like:
Title, Term, Count
as the result set. Put this into a pivot query, pivoting on the values of the search terms, to get:
Title, Term1, Term1Count, Term2, Term2Count,.....
You can then wrap that up in a query which eliminates those where all the *Counts are zero, and sorts it in whatever way you want.
This is not suggested as a solution, but as a path to explore.
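As a rough sketch of that counting idea, skipping the pivot step (the table layout here is an assumption, not from the question):

<?php
// Assumed table: item_tag(item_id, tag), one row per item/tag pair.
// Items matching 3 of 4 terms naturally sort above items matching 2 of 4,
// so no search terms ever need to be dropped and retried.
function searchByTags(PDO $db, array $terms)
{
    $placeholders = implode(',', array_fill(0, count($terms), '?'));
    $sql = "SELECT item_id, COUNT(DISTINCT tag) AS matches
            FROM item_tag
            WHERE tag IN ($placeholders)
            GROUP BY item_id
            ORDER BY matches DESC";
    $stmt = $db->prepare($sql);
    $stmt->execute($terms);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

With an index on the tag column, this is a single pass over the matching rows rather than one query per term combination.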
I'm currently building a webshop. This shop allows users to filter products by category, plus a couple of optional, additional filters such as brand, color, etc.
At the moment, various properties are stored in different places, but I'd like to switch to a tag-based system. Ideally, my database should store tags with the following data:
product_id
tag_url_alias (unique)
tag_type (unique) (category, product_brand, product_color, etc.)
tag_value (not unique)
First objective
I would like to search for product_ids that are associated with anywhere between 1 and 5 particular tags. The tags are extracted from an SEO-friendly URL, so I will be retrieving a unique string (the tag_url_alias) for each tag, but I won't know the tag_type.
The search will be an intersection, so my search should return the product_ids that match all of the provided tags.
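One common way to express that intersection (a sketch; the table name product_tags is my assumption) is to group by product and require the match count to equal the number of supplied aliases:

<?php
// Assumed table: product_tags(product_id, tag_url_alias, tag_type, tag_value).
// Returns the product_ids that carry ALL of the given url aliases.
function productsMatchingAllTags(PDO $db, array $aliases)
{
    $placeholders = implode(',', array_fill(0, count($aliases), '?'));
    $sql = "SELECT product_id
            FROM product_tags
            WHERE tag_url_alias IN ($placeholders)
            GROUP BY product_id
            HAVING COUNT(DISTINCT tag_url_alias) = ?";
    $stmt = $db->prepare($sql);
    $stmt->execute(array_merge($aliases, array(count($aliases))));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}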
Second objective
Besides displaying the products that match the current filter, I would also like to display the product-count for other categories and filters which the user might supply.
For instance, my current search is for products that match the tags:
Shoe + Black + Adidas
Now, a visitor of the shop might be looking at the resulting products and wonder which black shoes other brands have to offer. So they might go to the "brand" filter and choose any of the other listed brands. Let's say they have 2 other options there (in practice, there will probably be many more), resulting in the following searches:
Shoe + Black + Nike > 103 results
Shoe + Black + K-swiss > 0 results
In this case, if they see the brand "K-swiss" listed as an available choice in their filter, their search will return 0 results.
This is obviously rather disappointing to the user... I'd much rather know in advance that switching the "brand" from "Adidas" to "K-swiss" will return 0 results, and simply remove that entire option from the filter.
Same thing goes for categories, colors, etc.
In practice this would mean a single page view would not only return the filtered product list described in my primary objective, but potentially hundreds of similar yet different lists. One for each filter value that could replace another filter value, or be added to the existing filter values.
Capacity
I suspect my database will eventually contain:
between 250 and 1,000 unique tags
And it will contain:
between 10,000 and 100,000 unique products
Current Ideas
I did some Google searches and found the following article: http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html
Judging by that article, running hundreds of queries to achieve the second objective is going to be a painfully slow route. The "toxy" example might work for my needs and might be acceptable for my first objective, but it would be unacceptably slow for the second objective.
I was thinking I might run individual queries that match 1 tag to its associated product_ids, cache those queries, and then calculate intersections on the results. But do I calculate these intersections in MySQL or in PHP? If I use MySQL, is there a particular way I should cache these individual queries, or is supplying the right indexes all I need?
I imagine it's also quite possible to cache the intersections between two of these tag/product_id sets. The number of intersections would be limited by the fact that a tag_type can have only one particular value, but I'm not sure how to efficiently manage this type of caching. Again, I don't know if I should do this in MySQL or in PHP, and if I do it in MySQL, what would be the best way to store and combine this type of cached result?
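If the intersections do end up in PHP, array_intersect covers the mechanics; a minimal sketch (the caching layer itself is left out):

<?php
// $tagSets holds one array of product_ids per tag, e.g. each fetched by a
// single cached query. Returns the product_ids present in every set.
function intersectTagSets(array $tagSets)
{
    if (count($tagSets) === 0) {
        return array();
    }
    // Start with the smallest set so later intersections stay cheap.
    usort($tagSets, function ($a, $b) { return count($a) - count($b); });
    $result = array_shift($tagSets);
    foreach ($tagSets as $set) {
        $result = array_intersect($result, $set);
        if (count($result) === 0) {
            break; // already empty; further filters cannot add products back
        }
    }
    return array_values($result);
}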
Using the Sphinx search engine can work this magic for you. It is VERY fast, and can even handle word forms, which can be useful with SEO requests.
In Sphinx terms, make a document per "product", index by tags, choose a proper ranker for the query (e.g., MATCH_ALL_WORDS) and run a batch request with different tag combinations to get the best results.
Don't forget to use a cache like memcached or any other.
I have not tested this yet, but it should be possible to satisfy your second objective with one query rather than triggering several hundred queries...
The query below illustrates how this should work in general.
The idea is to combine the three different conditions at once, group by the dedicated value, and keep only those groups which have any results.
-- Counts matching products per brand (one row per tag_value of type 'brand').
SELECT t3.tag_value, COUNT(*)
FROM tagtable t1, tagtable t2, tagtable t3
WHERE
    t1.product_id = t2.product_id AND
    t2.product_id = t3.product_id AND
    t1.tag_type = 'yourcategoryforShoe' AND t1.tag_value = 'Shoe' AND
    t2.tag_type = 'product_color' AND t2.tag_value = 'Black' AND
    t3.tag_type = 'brand'
GROUP BY t3.tag_value
HAVING COUNT(*) > 0
I'm searching for an algorithm, or some fitness-rating method.
As an example, take Stack Overflow. Posts are divided into groups by:
Rating (+,-,0)
Tags (and tag importance based on activity in them)
Users (user rating/reputation, age, recent activity)
Keywords
And I'm looking for a way to sort them to create an optimized/balanced mix.
I don't want to show ONLY the newest OR ONLY the top rated OR ONLY important tags
Maybe the name would be "multiple-attribute optimal sorting", or something similar.
Can anyone advise something?
Thanks
ADD1: Maybe we are talking about a fitness function ( http://en.wikipedia.org/wiki/Fitness_function )
Generate separate sub-scores for each of those factors, then normalize them, add them together, and sort by the resulting total for each post. For instance,
Rank all of the posts by rating, and then map their position in the ranking to a 0.0-1.0 range (highest rated post is 1.0, lowest is 0.0).
Create a function to take a post's tags and calculate a similar 0.0-1.0 score based on tags only.
Create another function to do the same for the user.
And another for any keywords you want.
If you want certain things to factor in more than others, multiply the subscore by a constant factor before adding it to the total - for instance, if you want rating to be important and the others less so, you might do (3*A)+B+C+D, where A through D are the four subscores.
As for exactly how you translate things into subscores? That's something you really have to determine for your particular app; there's no single way of doing it that is "right".
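A sketch of the combine-and-sort step in PHP (assuming each post already carries subscores normalized to the 0.0-1.0 range; the factor names and the data layout are mine, not from the answer):

<?php
// Combine normalized subscores into one total per post, weighting the
// rating 3x as in the (3*A)+B+C+D example, then sort by the total.
function rankPosts(array $posts)
{
    $weights = array('rating' => 3.0, 'tags' => 1.0, 'user' => 1.0, 'keywords' => 1.0);
    foreach ($posts as &$post) {
        $total = 0.0;
        foreach ($weights as $factor => $weight) {
            $total += $weight * $post['scores'][$factor];
        }
        $post['total'] = $total;
    }
    unset($post);
    usort($posts, function ($a, $b) {
        if ($a['total'] == $b['total']) return 0;
        return ($a['total'] > $b['total']) ? -1 : 1; // highest total first
    });
    return $posts;
}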
I have millions of songs, and each song has a unique Song ID. Corresponding to each Song ID I have some attributes like song name, artist name, album name, year, etc.
Now, I have implemented a mechanism to find out similarity ratio between two songs.
It gives me a value between 0 - 100.
So I need to show similar music to users, which cannot be done at run time; I need to preprocess the similarity values between each and every pair of songs.
Hence, if I create a DB table with three attributes,
song1, song2, similarity
I will have n*n records, where n is the number of songs.
And whenever I want to fetch the similar music, I need to execute this query:
SELECT song2 FROM song_similarity WHERE song1 = x AND similarity > 80 ORDER BY similarity DESC;
Please suggest something to maintain such information.
Thanks.
I think you'd be better off comparing similarity to a "prototypical" song or classification. Devise a fingerprint mechanism that includes metadata about the song plus whatever audio mechanism you use to judge similarity. Place each song into one (or more) categories and score the song within each category: how closely does it match the prototype for the category, using the fingerprint? Note that you could have hundreds or thousands of categories, i.e., they're not the typical categories that you think of when you think of music.
Once you have this done, you can maintain indexes by category, and when finding similar songs you derive a weight based on the category and the similarity measures within the category, say by giving greater weight to the category in which the song is closest to the prototype. Multiply the weight by the square of the difference between the candidate song's and the current song's scores against the category's prototype. Sum the weighted values for, say, the top 3 categories, with lower values being more similar.
This way you only need to store a few items of metadata for each song rather than keeping relationships between pairs of songs. If the main algorithm runs too slowly, you could keep cached pair-wise data for the most common songs and fall back to the algorithmic comparison when a song isn't in your cached data set.
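A rough sketch of that comparison, under my reading of the description above (the data layout and names are illustrative only, not a definitive implementation):

<?php
// Each song stores, per category, how closely it matches that category's
// prototype (0.0-1.0). Lower returned distance means more similar.
function songDistance(array $current, array $candidate, array $categoryWeights, $topN = 3)
{
    arsort($current); // current song's best-matching categories first
    $top = array_slice(array_keys($current), 0, $topN);

    $distance = 0.0;
    foreach ($top as $cat) {
        $candScore = isset($candidate[$cat]) ? $candidate[$cat] : 0.0;
        $diff = $candScore - $current[$cat];
        // weight times squared difference, summed over the top categories
        $distance += $categoryWeights[$cat] * $diff * $diff;
    }
    return $distance;
}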
What you are proposing will work; however, you can cut the number of rows roughly in half by storing each pair only once. Then modify your query to match the song id against either song1 or song2.
Something like:
SELECT IF(song1 = ?, song2, song1) AS similar FROM song_similarity WHERE (song1 = ? OR song2 = ?) AND similarity > 80 ORDER BY similarity DESC;
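To actually store each pair only once, the usual trick (sketched here with an assumed table name of song_similarity, matching the query above) is to fix a canonical order before inserting:

<?php
// Always put the smaller id in song1, so (a,b) and (b,a) map to one row.
function storeSimilarity(PDO $db, $a, $b, $similarity)
{
    $song1 = min($a, $b);
    $song2 = max($a, $b);
    $stmt = $db->prepare(
        'INSERT INTO song_similarity (song1, song2, similarity) VALUES (?, ?, ?)'
    );
    $stmt->execute(array($song1, $song2, $similarity));
}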
It seems to require massive computational power to maintain and access the similarity information. For example, if you already have 2000 songs processed, you still need to perform the similarity analysis 2000 times for the next new song. It may have scalability problems, and this data scheme can make the database slow within a short period of time.
I recommend finding patterns and tagging each song instead. For example, you can analyze the songs for "blues", "rock" or "90's" patterns and give them tags. If you want to find similar songs based on one song, you can just query all the tags that the given song has, e.g. "new age", "slow" and "techno".
So I got a problem that I can't wrap my mind around.
I'm creating a shopping list that is divided into ten categories of various lengths (all of the items come from a database). I got it to work using a single column, but I have to divide the list into four columns. The code should decide which categories go where so that the four columns have the most equal number of items possible.
(The original post included a screenshot of the intended four-column list here.)
Out of these ten categories, four have a specific column they belong in.
The way I've approached this is to count the total number of items and divide by four to compute the average number of items per column. I put the four special categories in their respective columns and kept track of how many items were now in each column.
Now I still have six categories remaining, of various sizes. What is the best approach to placing them in the columns where they fit best? Since some categories are much larger than others, some columns could potentially hold three or four categories.
UPDATE: Right after I posted this I came to the realization that I should find the column with the least items and add the largest category to it. This seems like it will work. And it looks like Dave is suggesting the same!
After writing your 4 "main" categories to the columns, make an array that holds the total for each column:
$columnTotals = array(10, 6, 12, 13);
// example - obviously you'd use count() or something to get the totals
Then, order your non-special categories in an array from largest to smallest:
$subcatTotals = array(18,15,13,12,8,4);
//here, you'll have to get the totals, then use an array sort to order them
//probably want an associative array so you know which total matches which cat.
Then, in a loop, add the first (largest) sub-category to the smallest column, and get a new total for that column.
This SHOULD give you the most even columns you can get - at least it has in all the made-up examples I've tried it with.
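Here's that loop as a quick PHP sketch (the numbers are the made-up totals from above):

<?php
// Greedy fill: repeatedly drop the largest remaining category into
// whichever column currently has the fewest items.
$columnTotals = array(10, 6, 12, 13);        // after the 4 "main" categories
$subcatTotals = array(18, 15, 13, 12, 8, 4); // remaining categories
rsort($subcatTotals);                        // ensure largest-to-smallest order

$placement = array(); // category index => column index
foreach ($subcatTotals as $i => $size) {
    $smallest = array_search(min($columnTotals), $columnTotals);
    $placement[$i] = $smallest;
    $columnTotals[$smallest] += $size;
}

print_r($columnTotals); // 29, 32, 25, 25 for the sample numbers

With an associative array, as suggested above, $placement would map category names to columns instead of indices.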
Your approach is the most practical one in today's context. Let me explain...
The ideal thing to do right now is do your little calculation and split the list into the number of rows & columns.
The alternative is a CSS3 approach; i.e., you can output the whole list in ONE column through PHP, and on the CSS side specify the new "column-count" property.
But there are issues. This is not yet properly standardised, so you've got to specify the -moz- or -webkit- prefix depending on the browser. But the reason I wouldn't go for this is that IE still does not support it, and it's too early to expect every user to have upgraded even if it did.
Going one step further, you ought to modify your splitting algorithm to take into account the category headings.
Hope this helps :)