Search domains by tags?

I have 100k domains, each with its related tags.
I want to search domains by their tags. For example, google.com has the tags search,google,searchengine,engine,web,reference
and bing.com has the tags search,bing,searchengine,engine,web. I have up to 100k domains tagged like this.
Criteria 1
If I search with the tags search,google,searchengine,engine,web,reference, then both google.com and bing.com should appear in the final result.
Criteria 2
If I search with the tags search,searchengine,engine,web, then both google.com and bing.com should also appear in the results.
Criteria 3
If I search with the tags search,searchengine, then both domains should also be displayed.
Criteria 4
If I search with only the tag search, then both domains should also be displayed.
Criteria 5
How do I prioritize results by their tags? For example, if I search with the tags search,google,searchengine,engine,web,reference, then google.com should come first and bing.com second.
Finally, to achieve all of this, how should I design my table and how should I query it?
Thanks

You need at least two columns: domain_name varchar(400) and tags text. Make sure all tags are comma separated.
Then add an index of type FULLTEXT on tags so you can do full-text search. See the MySQL manual for a description: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
A quick Google search turns up many posts on using MySQL full-text search to get results sorted by relevance (I hope this is what you wanted).
One such example is here: http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html It shows various ways to handle tag searches.
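
As a rough illustration of the FULLTEXT approach, here is a minimal PHP sketch that builds a relevance-ranked query. The table name domains, the index name ft_tags, and the function buildTagSearch are assumptions made for this example, not something from the question. It only builds the SQL and the bound parameters, so it runs without a database.

```php
<?php
// Hypothetical schema (all names are assumptions for this sketch):
//   CREATE TABLE domains (
//     domain_name VARCHAR(400) NOT NULL,
//     tags        TEXT NOT NULL,
//     FULLTEXT KEY ft_tags (tags)
//   );

/**
 * Build a relevance-ranked FULLTEXT query for a list of tags.
 * Returns [sql, params] for use with a prepared statement.
 */
function buildTagSearch(array $tags, int $limit = 50): array
{
    // Normalise: trim, lowercase, drop empty strings and duplicates.
    $clean = array_values(array_unique(array_filter(
        array_map(fn($t) => strtolower(trim($t)), $tags),
        fn($t) => $t !== ''
    )));
    $needle = implode(' ', $clean);

    // The same MATCH expression is used for filtering and for the score.
    $sql = 'SELECT domain_name, MATCH(tags) AGAINST (?) AS score '
         . 'FROM domains '
         . 'WHERE MATCH(tags) AGAINST (?) '
         . 'ORDER BY score DESC '
         . 'LIMIT ' . $limit;

    return [$sql, [$needle, $needle]];
}

[$sql, $params] = buildTagSearch(['search', 'Google ', 'searchengine', '']);
// $params is ['search google searchengine', 'search google searchengine']
// With PDO: $stmt = $pdo->prepare($sql); $stmt->execute($params);
```

Two things are worth checking before relying on this: with MyISAM the default minimum indexed word length is 4 characters, so a tag like web would be ignored unless ft_min_word_len is lowered, and natural-language mode on MyISAM skips words that occur in half or more of the rows, which could affect a very common tag such as search.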

Related

How to optimise database search

I have the following problem.
I have a database of e.g. 1000 items. Each item can have any number of identifying tags associated with it. For purpose of question, the item and tags are purely hypothetical. So for instance, say one of the items is a DVD, then the tags for that item would be:
DVD, The Lone Ranger, western, action, family
And another DVD is tagged with:
DVD, The Magnificent 7, western, action
Now someone on my website searches for the following key words in the search box and clicks Search:
western, action, family, PG13
Both DVDs match at least 2 of the search terms, and neither matches PG13. The first DVD is also the closest match to the search terms.
When the search starts, I have to go through each of the 1000 items' tags to see whether they match the search criteria.
So the first DVD matches 3 of the 4 terms, and the second DVD matches 2 of the 4 terms.
My question is, how do I optimise this search? For each item, the query looks through the item's tags and matches them against the search terms. When no item matching all search terms is found, it has to "drop" one of the search terms and check whether any item matches any 3-term combination of the 4 search terms.
Then it drops another search term and searches for 2 of the 4 search terms, trying to match any 2-term combination of the 4 search terms.
It is the "dropping" of search terms and searching all possible combinations that I need to optimise. Does anyone know what the best algorithm for this would be, or can anyone provide pseudo code for this?
I have no idea how to do this, as in each scenario I can think of I would still have to search every possible combination of search terms, which will slow down the speed at which items can be returned to customers.
EDIT: I have thought about giving each item tag a weight, but the nature of the tags is such that no tag carries more weight than any other. All tags are equally weighted/important.
The speed at which the database can be queried and results returned is my biggest goal here.

As an approach, I'd explore using a left join for the search terms with a group by summing up the count each term returns. You'd then have something like:
Title, Term, Count
as the result set. Put this into a Pivot query pivoting on the values of the search terms to get:
Title, Term1, Term1Count, Term2, Term2Count,.....
You can then wrap that up in a query which eliminates those where all the *Counts are zero, and sorts it in whatever way you want.
This is not suggested as a solution, but as a path to explore.
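
To illustrate why the term-dropping step may not be needed at all: counting, in one pass, how many search terms each item matches and sorting by that count gives the same "closest match first" ordering without trying any combinations. The sketch below does this in plain PHP on the sample data from the question. The function name rankByMatchCount is made up for this example, and in a real database the same idea would be a GROUP BY on the item with a count of matching tags.

```php
<?php
/**
 * Rank items by how many of the search terms appear among their tags.
 * $itemTags maps an item title to its list of tags (sample data only).
 * Items that match no term at all are left out.
 */
function rankByMatchCount(array $itemTags, array $searchTerms): array
{
    // Flip the terms so each lookup is a constant-time isset().
    $wanted = array_flip(array_map('strtolower', $searchTerms));
    $scores = [];
    foreach ($itemTags as $title => $tags) {
        $hits = 0;
        foreach (array_unique(array_map('strtolower', $tags)) as $tag) {
            if (isset($wanted[$tag])) {
                $hits++;
            }
        }
        if ($hits > 0) {
            $scores[$title] = $hits;
        }
    }
    arsort($scores); // highest match count first
    return $scores;
}

$items = [
    'The Lone Ranger'   => ['DVD', 'The Lone Ranger', 'western', 'action', 'family'],
    'The Magnificent 7' => ['DVD', 'The Magnificent 7', 'western', 'action'],
];
$ranked = rankByMatchCount($items, ['western', 'action', 'family', 'PG13']);
// $ranked is ['The Lone Ranger' => 3, 'The Magnificent 7' => 2]
```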

PHP, MySQL, Efficient tag-driven search algorithm

I'm currently building a webshop. This shop allows users to filter products by category, and by a couple of optional, additional filters such as brand, color, etc.
At the moment, various properties are stored in different places, but I'd like to switch to a tag-based system. Ideally, my database should store tags with the following data:
product_id
tag_url_alias (unique)
tag_type (unique) (category, product_brand, product_color, etc.)
tag_value (not unique)
First objective
I would like to search for product_ids that are associated with anywhere between 1 and 5 particular tags. The tags are extracted from an SEO-friendly URL. So I will be retrieving a unique string (the tag_url_alias) for each tag, but I won't know the tag_type.
The search will be an intersection, so my search should return the product_id's that match all of the provided tags.
Second objective
Besides displaying the products that match the current filter, I would also like to display the product-count for other categories and filters which the user might supply.
For instance, my current search is for products that match the tags:
Shoe + Black + Adidas
Now, a visitor of the shop might be looking at the resulting products and wonder which black shoes other brands have to offer. So they might go to the "brand" filter and choose any of the other listed brands. Let's say they have 2 different options (in practice, there will probably be many more), resulting in the following searches:
Shoe + Black + Nike > 103 results
Shoe + Black + K-swiss > 0 results
In this case, if they see the brand "K-swiss" listed as an available choice in their filter, their search will return 0 results.
This is obviously rather disappointing to the user... I'd much rather know in advance that switching the "brand" from "adidas" to "k-swiss" will return 0 results, and simply remove that option from the filter.
Same thing goes for categories, colors, etc.
In practice this would mean a single page view would not only return the filtered product list described in my primary objective, but potentially hundreds of similar yet different lists. One for each filter value that could replace another filter value, or be added to the existing filter values.
Capacity
I suspect my database will eventually contain:
between 250 and 1.000 unique tags
And it will contain:
between 10.000 and 100.000 unique products
Current Ideas
I did some Google searches and found the following article: http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html
Judging by that article, running hundreds of queries to achieve the 2nd objective, is going to be a painfully slow route. The "toxy" example might work for my needs and it might be acceptable for my First objective, but it would be unacceptably slow for the Second objective.
I was thinking I might run individual queries that match 1 tag to its associated product_ids, cache those queries, and then calculate intersections on the results. But do I calculate these intersections in MySQL, or in PHP? If I use MySQL, is there a particular way I should cache these individual queries, or is supplying the right indexes all I need?
I would imagine it's also quite possible to maybe even cache the intersections between two of these tag/product_id sets. The amount of intersections would be limited by the fact that a tag_type can have only one particular value, but I'm not sure how to efficiently manage this type of caching. Again, I don't know if I should do this in MySQL or in PHP. And if I do this in MySQL, what would be the best way to store and combine this type of cached results?

The Sphinx search engine can do this magic for you. It is VERY fast and can even handle word forms, which can be useful with SEO requests.
In Sphinx terms: make a "product" document, index it by tags, choose a proper ranker for the query (e.g. MATCH_ALL_WORDS), and run a batch request with the different tag combinations to get the best results.
Don't forget to use a cache such as memcached or any other.

I did not test this yet, but it should be possible to have one query to satisfy your second objective rather than triggering several hundred queries...
The query below illustrates how this should work in general.
The idea is to combine the three different requests at once and group by the dedicated value and collect only those which have any results.
-- One row per brand, with the number of Shoe + Black products it has
SELECT t3.tag_value, COUNT(*) AS product_count
FROM tagtable t1, tagtable t2, tagtable t3 WHERE
t1.product_id = t2.product_id AND
t2.product_id = t3.product_id AND
t1.tag_type='yourcategoryforShoe' AND t1.tag_value='Shoe' AND
t2.tag_type='product_color' AND t2.tag_value='Black' AND
t3.tag_type='product_brand'
GROUP BY t3.tag_value
HAVING COUNT(*) > 0
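
The question also asks whether the intersections could be calculated in PHP from cached per-tag results. A minimal sketch of that idea follows, assuming each tag's product ids have already been loaded (for example from a cache) into an array keyed by tag alias. The function name facetCounts and the sample ids are invented for illustration.

```php
<?php
/**
 * Count, for each candidate tag, how many products would remain if that
 * tag were combined with the currently selected tags.
 * $tagProducts maps a tag alias to the list of product ids carrying it.
 * Candidates that would yield 0 products are omitted from the result.
 */
function facetCounts(array $tagProducts, array $selected, array $candidates): array
{
    // Intersect the id sets of the selected tags, using ids as array keys.
    $base = null;
    foreach ($selected as $tag) {
        $ids = array_flip($tagProducts[$tag] ?? []);
        $base = $base === null ? $ids : array_intersect_key($base, $ids);
    }

    $counts = [];
    foreach ($candidates as $tag) {
        $ids = array_flip($tagProducts[$tag] ?? []);
        $n = $base === null ? count($ids) : count(array_intersect_key($base, $ids));
        if ($n > 0) { // hide options that would give 0 results
            $counts[$tag] = $n;
        }
    }
    return $counts;
}

$tagProducts = [
    'shoe'    => [1, 2, 3, 4],
    'black'   => [1, 2, 4],
    'adidas'  => [1],
    'nike'    => [2, 4],
    'k-swiss' => [3],
];
$brands = facetCounts($tagProducts, ['shoe', 'black'], ['adidas', 'nike', 'k-swiss']);
// $brands is ['adidas' => 1, 'nike' => 2]; k-swiss is dropped
```

Whether this beats a single grouped SQL query would need measuring with real data. With at most 1.000 tags and 100.000 products the id lists should fit in memory, but that is an assumption.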

omit search results

using foursquare api php
I am performing a search for venues with nightlife categories:
$params = array("near"=>"92101", "radius"=>"800", "intent"=>"checkin",
"categoryId"=>"4d4b7105d754a06376d81259", "limit"=>"50");
$venues = $foursquare->GetPublic("venues/search", $params);
works as expected...kind of. the problem is restaurants that have been sub categorized as bars are filling up my return limit. so in that search i may only get a few actual nightlife venues. it would be very helpful if i could omit venues that have certain categories. get 50 nightlife venues but not the ones also labeled as food.
i have searched around and keep re-reading the search endpoint page hoping i overlooked the omit feature. any help?

We have had the same problem (different category types)
What we ended up doing is performing several searches with specific categories. The categoryId field accepts multiple comma delimited categories, so we executed sometimes up to 3 searches with multiple categoryIds.
So instead of asking for a single category, your request would look like this (no 'bars', I just picked a couple of random nightlife categories):
$params = array("near"=>"92101", "radius"=>"800", "intent"=>"checkin",
"categoryId"=>"4bf58dd8d48988d11f941735,4bf58dd8d48988d121941735,...", "limit"=>"50");
And then do another request with the general nightlife
$params = array("near"=>"92101", "radius"=>"800", "intent"=>"checkin",
"categoryId"=>"4d4b7105d754a06376d81259", "limit"=>"50");
And merge the results.
Two things to note with this solution:
You will probably get overlapping results from multiple searches, as venues sometimes have more than one category (as you found out already), so remember to handle the duplicates. We first scanned all the results and kept only the unique ones according to the foursquare ID, then started our processing.
This solution does not scale well with the foursquare API: doing 40 searches will not work (but there is no other way of getting what you need without it, so I am still writing up the entire solution here).
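
The merge and de-duplication step could look roughly like the sketch below. It assumes each search response has already been decoded into a list of venue arrays, and that each venue has an 'id' and a 'categories' list whose entries have an 'id'. That shape is an assumption to verify against the actual response. The optional client-side exclusion of categories (to drop food venues) is an addition for illustration rather than part of the answer above, and the ids used are placeholders, not real foursquare ids.

```php
<?php
/**
 * Merge several decoded venue lists, keep one entry per venue id, and
 * drop venues that carry any excluded category id.
 * Assumes each venue is an array with 'id' and 'categories' (each with 'id').
 */
function mergeVenues(array $resultSets, array $excludedCategoryIds = []): array
{
    $excluded = array_flip($excludedCategoryIds);
    $merged = [];
    foreach ($resultSets as $venues) {
        foreach ($venues as $venue) {
            if (isset($merged[$venue['id']])) {
                continue; // already seen in an earlier search
            }
            foreach ($venue['categories'] ?? [] as $category) {
                if (isset($excluded[$category['id']])) {
                    continue 2; // skip venues tagged with an excluded category
                }
            }
            $merged[$venue['id']] = $venue;
        }
    }
    return array_values($merged);
}

// Placeholder data standing in for two decoded search responses.
$searchA = [
    ['id' => 'v1', 'categories' => [['id' => 'cat-bar']]],
    ['id' => 'v2', 'categories' => [['id' => 'cat-bar'], ['id' => 'cat-food']]],
];
$searchB = [
    ['id' => 'v1', 'categories' => [['id' => 'cat-bar']]],
    ['id' => 'v3', 'categories' => [['id' => 'cat-club']]],
];
$venues = mergeVenues([$searchA, $searchB], ['cat-food']);
// array_column($venues, 'id') is ['v1', 'v3']
```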

Content Tagging

I'm trying to create a small web app to categorize a certain type of YouTube video. When a user submits a video, they choose which categories the video falls under and tag it with ready-made tags, for example:
Video one - Category: Ad - Tags: cute, funny, has animal in it.
I'm trying to sketch my Database for that (I'm using MySQL), so far I have two ideas.
Idea 1:
Table Videos with ID and Category columns, another table Tags with ID and Tag columns while Videos.ID and Tags.ID are linked together. So when the user tries to filter search results by tags, the query will have more conditions (AND Tag = 'something' AND Tag = 'other thing').
Idea 2:
One table Videos with Category and Tags columns, where tags are stored as a string separated by commas. When the user tries to filter search results by tags, the query will have more conditions (AND Tags LIKE '%something%' AND Tags LIKE '%other thing%').
So the question is: Is there any better method? I already think that the 1st one is wasteful (Each video might have up to 40 ready-made tags) and the 2nd one is clumsy. If not, which one do you think is better?

Creating an additional table linking video id and tag id together is the correct solution. Filtering is done by creating additional INNER JOIN conditions. A comma-separated list just won't do: it drastically limits your selection and query possibilities.
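
A minimal sketch of how the "one INNER JOIN per required tag" filter could be generated in PHP. The table and column names (videos, video_tags, video_id, tag_id) and the function name are assumptions for this example. The function only builds the SQL and the bound parameters.

```php
<?php
// Hypothetical tables: videos(id, category), video_tags(video_id, tag_id).

/**
 * Build a query returning videos that carry ALL of the given tag ids,
 * using one INNER JOIN on the linking table per required tag.
 * Returns [sql, params] for a prepared statement.
 */
function buildVideoTagFilter(array $tagIds): array
{
    $sql = 'SELECT v.id, v.category FROM videos v';
    $params = [];
    foreach (array_values($tagIds) as $i => $tagId) {
        $alias = 'vt' . $i; // a distinct alias for each join
        $sql .= " INNER JOIN video_tags $alias"
              . " ON $alias.video_id = v.id AND $alias.tag_id = ?";
        $params[] = (int) $tagId;
    }
    return [$sql, $params];
}

[$sql, $params] = buildVideoTagFilter([3, 7]);
// With PDO: $stmt = $pdo->prepare($sql); $stmt->execute($params);
```

Each join narrows the result to videos that also have that tag, which a single-row condition like Tag = 'a' AND Tag = 'b' cannot express.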

Idea 1 looks good. Creating a separate table for storing tags helps in selection.

Weighing search results

PHP / MySQL backend. I've got a database full of movies YouTube-style. Each video has a name and category. Videos and categories have a m:n relationship.
I'd like my visitors to be able to search for videos by entering their search terms in a single search field. I can't figure out how to return the best search results based on category matches and occurrences in the name.
What's the best way to go about something like this? Scoring? => Check for each search term whether it occurs in the name of the video; if so, award the video a point; check if the video is in categories that are also contained in the search query; if so, award it a point. Sort it by number points received? That sounds very expensive in terms of CPU usage.

Using Full-Text Search may help: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html#function_match
You can test several columns at once against an expression.

First, use full-text search. It can be either MySQL full-text search or some kind of external full-text search engine. I recommend Sphinx. It is very fast, simple, and can even be integrated with MySQL using SphinxSE (so search indexes look like tables in MySQL). However, you have to install and configure it.
Second, think about splitting search results by search type. Any kind of full-text search will return list of matched items sorted by relevancy. You can search by all fields and get a single list. This is bad idea because hits by name and hits by category will be mixed. To solve this you can do multiple searches - search by name first, then search by category.
As a result you'll have two matching sets and you have a lot of options how to display this. Some ideas:
Merge the 2 sets based on the relevancy rate returned by the search engine. This looks like the result of one single query, but you know what each item is (name hit or category hit), so you can highlight this.
Do the same merge as above but assign different weights to the different sets, for example relevancy = 0.7*name_relevancy+0.3*category_relevancy. This will make search results more natural.
Split results into tabs/groups, e.g. 'There are N titles and M categories matching your query'.
Use bands when displaying results. For each page (assuming you are splitting search results using a paginator), display N items from the first set and M items from the second set (you can display the sets one after the other or shuffle the items). If there are not enough items in one of the sets, just take more items from the other set, so there are always M+N items per page.
Any other way you can imagine
And you can use this method for any kind of field: name, category, actor, director, etc. However, the more fields you use, the more search queries you have to execute.
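
The weighted merge from the list above (relevancy = 0.7*name_relevancy+0.3*category_relevancy) can be sketched in a few lines of PHP. It assumes the two searches return maps of item id to relevancy score and that both scores are on a comparable scale, which may need normalising in practice. The function name and the sample scores are made up.

```php
<?php
/**
 * Combine two relevance maps (item id => score) into one ranking using
 * relevancy = 0.7 * name_relevancy + 0.3 * category_relevancy.
 * An item missing from one set contributes 0 for that set.
 */
function mergeWeighted(array $nameScores, array $categoryScores,
                       float $wName = 0.7, float $wCategory = 0.3): array
{
    $combined = [];
    foreach ($nameScores as $id => $score) {
        $combined[$id] = $wName * $score;
    }
    foreach ($categoryScores as $id => $score) {
        $combined[$id] = ($combined[$id] ?? 0.0) + $wCategory * $score;
    }
    arsort($combined); // best combined score first
    return $combined;
}

$ranking = mergeWeighted([10 => 1.0, 11 => 0.4], [11 => 1.0, 12 => 0.9]);
// array_keys($ranking) is [10, 11, 12]
```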

I don't think you can avoid looking at the title and category of every movie for each search. So the CPU usage for that is a given. If you are concerned about the CPU usage of the sort, it would be negligible in most cases, since you would only be sorting the items that have more than zero points.
Having said that, what you probably want is a system that is partially rule-based and partially point-based. For instance, if you have a title that is equal to the search term, it should come first, regardless of points. Architect your search such that you can easily add rules and tweak points as you see fit to yield the best results.
Edit: In the event of an exact title match, you can take advantage of a DB index and not search the whole table. Optionally, the same goes for category.
