I am working on a project that involves searching for videos. These videos are tagged similarly to how questions are tagged on Stack Overflow. Does anyone know of a good 'tag-based' search algorithm?
Thanks!
Depending on which operations (writes? reads? both?) you plan to use the most, there are different approaches.
Here's an interesting read comparing the tag schemas of some well-known websites: Tags: Database schemas.
How about searching tags, then titles, then descriptions, only widening the search from one method to the next if no results are found using the current method?
As an aside; if you want to return non-exact matches to your users, make sure they aren't so non-exact that they start becoming irrelevant! :)
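A minimal sketch of that widening strategy in Python, over an in-memory list of videos (the field names are illustrative):

```python
def tiered_search(videos, query):
    """Search tags first, then titles, then descriptions,
    widening only when the previous tier found nothing."""
    q = query.lower()
    for field in ("tags", "title", "description"):
        hits = []
        for v in videos:
            value = v[field]
            text = " ".join(value) if isinstance(value, list) else value
            if q in text.lower():
                hits.append(v)
        if hits:  # stop at the first tier that produced results
            return field, hits
    return None, []

videos = [
    {"title": "Cooking pasta", "description": "A quick dinner", "tags": ["food", "italian"]},
    {"title": "Street food tour", "description": "Eating in Rome", "tags": ["travel"]},
]
print(tiered_search(videos, "food"))  # matched in the tags tier
print(tiered_search(videos, "rome"))  # only found once widened to descriptions
```

In a real system each tier would be a database query rather than a scan, but the control flow is the same.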
We're building an application that has multiple different entities that are pretty simple with name & description and some specific stuff.
Now we want to add tags, for the purpose of adding extra search keywords. (There will also be a tag cloud somewhere, but that's easy)
I've been reading up on different ways to do a proper search. Solutions like lucene, elastic, mysql fulltext match against and more.
Does anyone have any experience to share on the best solution for an application like this?
Should I put the tags in a string/array field on the same table, or use a separate table? I've also found DoctrineExtensions-Taggable, which seems pretty decent and would make it easy to build a tag cloud too.
For a proper search over name, description & tags, what's the best solution? Lucene? Elastic? MySQL? So far FOSElasticaBundle looks the most mature, but I'm not sure how to add search on tags there (see 1).
Thanks for the advice!
"What's the best..." is usually not a good way to ask here and you will either get very opinionated answers or none at all.
So here is my opinionated answer:
If you have a small set you can surely run with MySQL and MATCH AGAINST (you need to teach Doctrine how to use it!). It's the most straightforward thing to do because you already have Doctrine and just need to teach it a little bit of stuff.
So yeah, Gedmo Taggable and a bit of your own Doctrine Extension and you have your Queries.
If your Searchable Database will get larger you will want to switch to a proper search engine like Elastic or Solr or whatever else.
What you will need there is usually called "Faceted Search" (or in ElasticSearch they are called "Aggregations" nowadays).
You can find more about faceted search on Wikipedia, for example.
Yes, a proper search engine is cooler, faster, and flashier, but if you are working on a schedule and for the first time, it might not be the best solution.
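As a concrete sketch of the separate-table option asked about above: the conventional approach is a many-to-many tag schema, queried together with name and description. This uses SQLite for portability (with MySQL you would swap the LIKEs for MATCH AGAINST; all table and column names are illustrative):

```python
import sqlite3

# Separate tag table plus a join table, rather than a
# string/array column on the entity itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
    CREATE TABLE entity_tag (
        entity_id INTEGER REFERENCES entity(id),
        tag_id    INTEGER REFERENCES tag(id),
        PRIMARY KEY (entity_id, tag_id)
    );
    INSERT INTO entity VALUES (1, 'Acme deal', 'Bulk hardware'),
                              (2, 'Beta deal', 'Software licences');
    INSERT INTO tag VALUES (1, 'hardware'), (2, 'software');
    INSERT INTO entity_tag VALUES (1, 1), (2, 2);
""")

# Search name, description, and tags in a single query.
rows = conn.execute("""
    SELECT DISTINCT e.name
    FROM entity e
    LEFT JOIN entity_tag et ON et.entity_id = e.id
    LEFT JOIN tag t ON t.id = et.tag_id
    WHERE e.name LIKE :q OR e.description LIKE :q OR t.label LIKE :q
""", {"q": "%hardware%"}).fetchall()
print(rows)  # [('Acme deal',)]
```

The same join table also makes the tag cloud a simple GROUP BY over `entity_tag`.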
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I don't know if this type of question has already been asked; actually, I don't know what to search for. Am I asking in the right place?
Just as an example, I always wonder how social media giants like Facebook manage their user settings module. What would the database design be, and how do they manage to hide a user's updates on a friend's timeline if he has chosen not to show his updates on that particular friend's timeline? I mean, if I were programming it, I would have loaded all the settings values into an array, and there would be many conditional statements checking each and every user setting before printing the data accordingly.
But I think this would make the code unmanageable, because so many conditions could lead to undesired results.
So my question is, is there any better approach to do this?
I don't know if I am making any sense here, but I tried to explain my question.
Facebook's data is maintained in a document repository (NoSQL), and efficient indexing is used to quickly resolve tags and searches. This approach to storage and search is markedly different from relational-database-based storage and search.
Google uses a similar scheme to index the entire web and promptly give you back results.
So in simpler terms, your data is stored and indexed the way Google indexes pages; the only difference is that the data also lies with Facebook.
The related technologies are big data, MongoDB, and Apache Hadoop. One of the leading index-management and search libraries is Lucene. Apache Elasticsearch is a user-friendly package around Lucene.
So Facebook treats these security criteria like tags (in simple language) and does a Google-like search, presented in a pleasing frontend that doesn't feel like a search engine.
While setting up your system, you can use Elasticsearch for faster search. Elasticsearch makes implementing Lucene easier, though it definitely has a learning curve. Elasticsearch can also be used alongside an RDBMS; in that case your data is saved in the database, but indexes are also maintained for faster search. The cost is disk space. It makes it possible to have many criteria and still get results quickly.
A quick tutorial on elasticsearch.
There would be many conditions to evaluate, that is correct. But in a SELECT statement you can easily compose all of those conditions in a WHERE clause which is very efficient.
Essentially, as long as you're comparing on equality, the database can easily optimize that, allowing it to quickly search for posts that fit the desired constraints. Even though there are a lot of conditions, they don't really affect performance compared to the fact that there are millions of entries in the table to be searched.
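A hedged sketch of how those conditions compose into a WHERE clause, using SQLite and an invented two-table schema (post plus a per-viewer hide list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, author TEXT, body TEXT);
    -- One row per (author, viewer) pair the author has hidden posts from.
    CREATE TABLE hidden_from (author TEXT, viewer TEXT);
    INSERT INTO post VALUES (1, 'alice', 'hello'), (2, 'bob', 'hi');
    INSERT INTO hidden_from VALUES ('alice', 'carol');
""")

def timeline(viewer):
    # The per-user settings collapse into one WHERE clause the database
    # evaluates with index lookups, instead of if-chains in application code.
    return conn.execute("""
        SELECT p.author, p.body FROM post p
        WHERE NOT EXISTS (
            SELECT 1 FROM hidden_from h
            WHERE h.author = p.author AND h.viewer = :viewer
        )
    """, {"viewer": viewer}).fetchall()

print(timeline("carol"))  # alice's post is filtered out
print(timeline("dave"))   # dave sees both posts
```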
What you're asking for is the result of really tough planning. Whenever you need to develop something with a good potential to become complex, you'll have to plan (engineer) it well using known methodologies.
Usually the DB has many polymorphic relationships between entities, and there are people responsible for writing query procedures that retrieve the wanted data for the developers.
It's really not something you can solve with an easy recipe; the key here is planning, and planning well. There's no one right answer.
If your application is fairly small, you could just implement it your way and then see what can be upgraded. That's pretty much your only way to go. (BTW, that's what most startups do.)
I wish you the best of luck.
Regarding Facebook's DB schemas, how they work, and why the design is good, here are some articles that explain:
The power of the graph
This is posted by Facebook and explains how they manage data. They use the TAO data model; through the application of graph theory, other sophisticated algorithms, advanced memory caching, and careful data handling, they can efficiently manage lots of user data.
But regarding your question: what would the database design be, and how do they manage to hide a user's updates on a friend's timeline if he has chosen not to show his updates to that particular friend?
I think this post will give you some insight into what kind of DB structure Facebook has and how it works for every user: Social Network Friends Relationship Database Design.
Usually, hiding a user's updates from a particular friend's timeline is managed by storing values in the database. You can create a view_type table that determines what kind of view a user can see, then add a WHERE condition to your SQL based on the view the user has selected. There are still many ways to handle this; a good database structure is needed, and planning an efficient database is an important and rigorous process.
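For illustration, the view_type idea might be sketched like this in Python; every table, column, and value name here is hypothetical:

```python
# Hypothetical view_type values stored per post, each mapped to the
# WHERE fragment that enforces it when the query is built.
VIEW_WHERE = {
    "public":  "1 = 1",
    "friends": ("viewer_id IN (SELECT friend_id FROM friendship "
                "WHERE user_id = post.author_id)"),
    "only_me": "viewer_id = post.author_id",
}

def visibility_clause(view_type):
    # Fall back to the most restrictive rule for unknown values.
    return VIEW_WHERE.get(view_type, VIEW_WHERE["only_me"])

print(visibility_clause("friends"))
```

The application then appends the returned fragment to its post-selection query instead of branching on each setting in code.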
Sorry for the long title, but I couldn't think of a good way to put it. I'm currently working on a large web app project, and one of the main features is a detailed search. Without saying too much about the project, it is used to find business-related deals. The search function is currently spread over three pages and offers pretty much every option you'd want if you were in the industry.
But the problem I've got now is that that's a lot of fields, so when it comes to searching for matches in the DB, I don't really know the best way forward. I don't think a standard MySQL LIKE is going to cut it here. I also need to be able to figure out how good a fit each result is and then display that in the results (search result 1 is a 90% fit, etc.).
Does anyone know the best way to tackle this? I know there are external search engines out there, but I don't really know anything about them, so I can't make any sort of logical choice.
Thanks !
Finding relevance in search is a complex topic involving many parameters. MySQL's MATCH() scoring is itself pretty complex, as you can see here. Perhaps you could use that score as your measure; you can customize it to some extent.
Another option, as you mentioned, is an external search engine, something along the lines of Solr. It has all the features you are looking for: it's fast, scalable, and provides customization options to improve "relevance" for your specific needs.
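If you roll your own "% fit" score before adopting a search engine, one simple scheme is a weighted fraction of matched criteria. A sketch (field names and weights are purely illustrative):

```python
def match_score(record, criteria, weights=None):
    """Score a record against search criteria as a percentage:
    matched weight over total weight."""
    weights = weights or {k: 1.0 for k in criteria}
    total = sum(weights[k] for k in criteria)
    matched = sum(weights[k] for k, wanted in criteria.items()
                  if record.get(k) == wanted)
    return round(100.0 * matched / total) if total else 0

deal = {"sector": "retail", "region": "EU", "size": "large"}
criteria = {"sector": "retail", "region": "US", "size": "large"}
print(match_score(deal, criteria))  # 2 of 3 criteria match -> 67
```

Weights let you make, say, sector matter more than region without changing the query structure.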
Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic but I have the entire news article (can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (sphinx, lucene) is ok.
In machine learning/data mining terms, this kind of problem is called a classification problem. The easiest approach is to use past data for future prediction, i.e. statistically oriented:
http://en.wikipedia.org/wiki/Statistical_classification, in which you can start by using the Naive Bayes classifier (commonly used in spam detection)
I would suggest reading this book (written for Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325); it has a good example.
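To make the statistical approach concrete, here is a tiny multinomial Naive Bayes sketch in plain Python, with Laplace smoothing; the topics and training texts are made up for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bags of words."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of docs
        self.vocab = set()

    def train(self, topic, text):
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, text):
        def log_prob(topic):
            total = sum(self.word_counts[topic].values())
            denom = total + len(self.vocab)  # Laplace smoothing
            score = math.log(self.doc_counts[topic] / sum(self.doc_counts.values()))
            for w in text.lower().split():
                score += math.log((self.word_counts[topic][w] + 1) / denom)
            return score
        return max(self.doc_counts, key=log_prob)

nb = NaiveBayes()
nb.train("sport", "goal match team score")
nb.train("finance", "stock market price shares")
print(nb.classify("the team played a great match"))  # -> sport
```

With 100,000 real articles you would hand-label a seed set per topic, train on it, and classify the rest.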
Well, an Apache project providing machine learning libraries is Mahout. Its features include:
[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]
You can find Mahout under http://mahout.apache.org/
Although I have never used Mahout (just considered it ;-)), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)
Dirt simple way to create a classifier:
Hand-read and bucket N example documents from the 100K into each one of your 10 topics. Generally, the more example documents the better.
Create a Lucene/Sphinx index with 10 documents corresponding to each topic. Each document will contain all of the example documents for that topic concatenated together.
To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.
Might not be super-accurate, but it's easy if you don't want to go through the trouble of training a real Naive Bayes classifier. If you want to go that route you can Google for WEKA or MALLET, two good machine learning libraries.
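The scoring idea in steps 2-3 can be sketched without a search engine at all. A toy term-overlap version in Python (real Lucene/Sphinx scoring is more sophisticated, but the shape is the same):

```python
from collections import Counter

def topic_scores(query, topic_docs):
    """Score a document against each topic by summing term overlaps,
    a rough stand-in for Lucene/Sphinx OR-query scoring."""
    terms = query.lower().split()
    scores = {}
    for topic, concatenated in topic_docs.items():
        counts = Counter(concatenated.lower().split())
        scores[topic] = sum(counts[t] for t in terms)
    return scores

# Each "document" is all hand-labelled examples for a topic, concatenated.
topics = {
    "cooking": "recipe oven pasta sauce recipe dinner",
    "travel":  "flight hotel city tour passport",
}
doc = "a pasta recipe for dinner"
scores = topic_scores(doc, topics)
print(max(scores, key=scores.get))  # -> cooking
```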
Excerpt from Chapter 7 of "Algorithms of the Intelligent Web" (Manning 2009):
"In other words, we’ll discuss the adoption of our algorithms in the context of a hypothetical
web application. In particular, our example refers to a news portal, which is inspired by the Google News website."
So, the content of Chapter 7 from that book should provide you with code for, and an understanding of, the problem that you are trying to solve.
You could use Sphinx to search all the articles for each of the 10 topics, then set a threshold on the number of matches that would link an article to a particular topic, and so on.
I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.
I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?
Extract a list of the most non-common words and phrases from each article and use those as tags.
Then I would make a list of topics, assign words and phrases that fall within each topic, and match those against the tags. The problem is that you might get more than one topic per article.
Perhaps the best way would be to use some form of Bayesian classifier to determine which topic best describes the article. It will require that you train the system initially.
This sort of technique is used to determine whether an email is spam or not.
This article might be of some help
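A toy version of the tag-extraction-plus-topic-matching idea above; the common-word list and topic vocabularies are made up and would be much larger in practice:

```python
from collections import Counter

# A real stop-list would be far larger; this is illustrative.
COMMON = {"the", "a", "an", "is", "of", "to", "and", "in", "on", "for"}

def extract_tags(article, n=3):
    """Use the most frequent non-common words of an article as its tags."""
    words = [w for w in article.lower().split() if w not in COMMON]
    return [w for w, _ in Counter(words).most_common(n)]

TOPIC_WORDS = {
    "sport":   {"match", "team", "goal", "league"},
    "finance": {"market", "shares", "stock", "profit"},
}

def topics_for(article):
    tags = set(extract_tags(article))
    # An article can land in more than one topic, as noted above.
    return [t for t, words in TOPIC_WORDS.items() if tags & words]

article = "the team won the match and the team celebrated"
print(extract_tags(article), topics_for(article))
```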
One of our bigger sites has a section where users can send questions to the website owner which get evaluated personally by his staff.
When the same question pops up very often they can add this particular question to the Faq.
In order to prevent them from receiving dozens of similar questions a day we would like to provide a feature similar to the 'Related questions' on this site (stack overflow).
What ways are there to build this kind of feature?
I know that I should somehow evaluate the question and compare it to the questions in the FAQ, but how does this comparison work? Are keywords extracted, and if so, how?
It might be worth mentioning that this site is built on the LAMP stack, thus these are the technologies available.
Thanks!
If you wanted to build something like this yourself from scratch, you'd use something called TF/IDF: Term Frequency / Inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.
In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.
I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
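A rough, illustrative TF-IDF ranking in plain Python, with no stop-list or length normalization; the FAQ texts are made up:

```python
import math
from collections import Counter

def tfidf_rank(query, corpus):
    """Rank documents by TF-IDF weighted term overlap with the query."""
    n = len(corpus)
    docs = [Counter(d.lower().split()) for d in corpus]
    df = Counter()
    for d in docs:
        df.update(d.keys())  # document frequency of each term

    def idf(term):
        return math.log(n / df[term]) if df[term] else 0.0

    scores = []
    for i, d in enumerate(docs):
        score = sum(d[t] * idf(t) for t in query.lower().split())
        scores.append((score, i))
    return sorted(scores, reverse=True)

faq = [
    "how do i buy an elephant",
    "how do i buy a ticket",
    "how do i reset my password",
]
ranked = tfidf_rank("i want to buy an elephant", faq)
print(faq[ranked[0][1]])  # the elephant question ranks first
```

Notice how "elephant", the rarest query term in the corpus, dominates the score, exactly as the answer above describes.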
I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that the matches on rarer tags count for more than matches on common tags.
You might also want to look at term frequency–inverse document frequency.
Given you're working in a LAMP stack, you should be able to make good use of MySQL's fulltext search functions, which I believe work on TF-IDF principles and should make it pretty easy to create the 'related questions' feature you want.
There's a great O'Reilly book, Programming Collective Intelligence, which covers group discovery, recommendations, and other similar topics. The examples are in Python, but I found them easy to understand coming from a PHP background, and within a few hours I had built something akin to what you're after.
Yahoo has a keyword extractor webservice at http://developer.yahoo.com/search/content/V1/termExtraction.html
You can use spell-checking, where the corpus is the titles/text of the existing FAQ entries:
How do you implement a "Did you mean"?