How to build a 'related questions' engine? - php

One of our bigger sites has a section where users can send questions to the website owner, which are then evaluated personally by his staff.
When the same question comes up very often, they can add it to the FAQ.
In order to prevent them from receiving dozens of similar questions a day, we would like to provide a feature similar to the 'Related questions' on this site (Stack Overflow).
What ways are there to build this kind of feature?
I know that I should somehow evaluate the question and compare it to the questions in the FAQ, but how does this comparison work? Are keywords extracted, and if so, how?
It might be worth mentioning that this site is built on the LAMP stack, so those are the technologies available.
Thanks!

If you wanted to build something like this yourself from scratch, you'd use something called TF/IDF: Term Frequency / Inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.
In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.
I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
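To make the idea above concrete, here is a minimal sketch of that ranking. It is written in Python only for brevity; the same logic maps onto PHP arrays. The stop-list, the function names and the sample FAQ entries are all made up for illustration, and the weighting is the simplest possible form of TF-IDF rather than a tuned one.

```python
import math
import re
from collections import Counter

# Hypothetical stop-list; a real one would be much longer.
STOP_WORDS = {"i", "to", "an", "a", "the", "want", "how", "do", "is"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def rank_related(query, documents, top_n=3):
    """Return (score, document) pairs for the best matches, highest score first."""
    doc_tokens = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in doc_tokens for word in tokens)
    scored = []
    for doc, tokens in zip(documents, doc_tokens):
        score = 0.0
        for word in set(tokenize(query)):
            if word in tokens:
                # Rare words (low document frequency) get a high weight.
                idf = math.log(n_docs / df[word])
                score += tokens[word] * idf
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Made-up FAQ entries standing in for previous queries.
faq = [
    "Where can I buy an elephant?",
    "How do I buy a gift card?",
    "How do I buy tickets online?",
    "What are your opening hours?",
]
results = rank_related("I want to buy an elephant", faq)
```

Here the entry mentioning "elephant" ranks first because that word appears in only one document, while the entries that only share "buy" score much lower and the unrelated entry is dropped.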

I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that the matches on rarer tags count for more than matches on common tags.
You might also want to look at term frequency–inverse document frequency.
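As a rough sketch of that guess (not a description of how Stack Overflow actually does it), shared tags can be weighted so that rare tags count for more than common ones. The data shape, function name and sample questions below are assumptions for illustration.

```python
import math
from collections import Counter

def related_by_tags(target_tags, questions, top_n=3):
    """questions maps a question title to its set of tags (hypothetical data shape)."""
    n = len(questions)
    tag_counts = Counter(tag for tags in questions.values() for tag in tags)
    scored = []
    for title, tags in questions.items():
        shared = set(target_tags) & set(tags)
        # Each shared tag contributes more the rarer it is across all questions.
        score = sum(math.log(1 + n / tag_counts[tag]) for tag in shared)
        if score > 0:
            scored.append((score, title))
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_n]]

# Made-up questions and tags.
questions = {
    "Build a movie recommender": {"php", "recommendation-engine"},
    "Parse a date string": {"php", "datetime"},
    "Sort an array by key": {"php", "arrays"},
}
related = related_by_tags({"php", "recommendation-engine"}, questions)
```

In a real system you would also exclude the question you are currently viewing from the candidates.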

Given that you're working on a LAMP stack, you should be able to make good use of MySQL's full-text search functions, which I believe work on TF-IDF principles and should make it pretty easy to create the 'related questions' feature that you want.
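A query along these lines might look like the sketch below. The table name `faq` and the columns `id`, `title` and `body` are hypothetical, and it assumes a FULLTEXT index already exists on `(title, body)`. The snippet only builds the parameterised query, using the `%s` placeholder style of common Python MySQL drivers; from PHP you would bind the same values through a prepared statement.

```python
def related_questions_query(question_text, limit=5):
    """Build a parameterised MySQL full-text query.

    Table and column names are hypothetical; a FULLTEXT index on
    (title, body) is assumed to exist.
    """
    sql = (
        "SELECT id, title, "
        "MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        "FROM faq "
        "WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY score DESC "
        "LIMIT %s"
    )
    # The question text is bound twice: once for the score, once for the filter.
    return sql, (question_text, question_text, limit)

sql, params = related_questions_query("I want to buy an elephant")
```

The `score` column is the relevance value MySQL computes for each row, so ordering by it gives the closest FAQ entries first.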

There's a great O'Reilly book, Programming Collective Intelligence, which covers group discovery, recommendations and other similar topics. The examples are in Python, but I found it easy to understand coming from a PHP background and within a few hours had built something akin to what you're after.
Yahoo has a keyword extractor webservice at http://developer.yahoo.com/search/content/V1/termExtraction.html

You can use spell-checking, where the corpus is the titles/text of the existing FAQ entries:
How do you implement a "Did you mean"?
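One cheap way to try this out is plain string similarity against the FAQ titles. The sketch below uses Python's standard `difflib.get_close_matches`, which is a sequence-similarity matcher rather than a real spell-checker, so treat it as a stand-in for the idea. The function name, cutoff and sample titles are assumptions.

```python
import difflib

def suggest_faq(question, faq_titles, max_suggestions=3, cutoff=0.5):
    """Return FAQ titles that are close string matches to the question."""
    # Compare case-insensitively but return the original titles.
    lowered = {title.lower(): title for title in faq_titles}
    matches = difflib.get_close_matches(
        question.lower(), list(lowered), n=max_suggestions, cutoff=cutoff
    )
    return [lowered[m] for m in matches]

# Made-up FAQ titles.
faq_titles = [
    "How do I reset my password?",
    "How do I change my email address?",
    "Where is my order?",
]
suggestions = suggest_faq("how do i resett my pasword", faq_titles)
```

This copes with typos and small rewordings, but not with questions that mean the same thing in different words.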

Related

An Odd Tag Organization Script

So!
I am working in PHP and have a huge list of taxonomy/tags, say around 100,000.
A similar list of tags can be found in the wealth of tags listed under products at Zazzle.com.
I am attempting to programmatically organize this list into a tiered menu of sorts based on the relationship between words, similar strings, and specificity.
I have toyed around with the levenshtein function, similar_text, searching for sub_str(ings), using the Princeton WordNet database, etc. and just can't crack this nut. Essentially, I am trying to build an Ontology out of this database that goes from very general to very specific in tiers. It doesn't have to be perfect, but I have run out of simple keyphrases to search for and ideas of how to go about doing this in a programmatic way and yet still having some semblance of order.
For instance:
If I use substring matching, I might end up with Dog -> Dogma, Dogra, etc.
If I use levenshtein or similar_text, I might end up with Bog, Log, Cog, and Dog all very closely related.
This database, or taxonomy if you will, is also constantly changing, and thus at least part of the analysis has to be done on the fly. The good news is that only one level of the result needs to be available. For instance, the near results of a query such as Dog might be small dog, large dog, red dog, blue dog, canine, etc.
I know this is a terrible question, but does anyone have a ray of light on what steps I should take, any useful functions I could use, queries to research, methodologies, etc.?
Thank you for your time.
So far, I have two suggestions for programmatically organizing tags into an ontology.
Find co-occurrences of tags to organize them into groups. The idea is that if tags occur together, they are probably related.
Use algorithmic stemming to reduce multiple forms/derivations of words to a common stem. This should reduce the number of tags the script needs to sift through, and may also identify similar tags based on their shared stem.
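The co-occurrence suggestion can be sketched roughly as follows. It assumes each product (or other item) carries several tags, which is what makes co-occurrence measurable at all. The function names and the sample data are made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_items):
    """Count how often each unordered pair of tags appears on the same item."""
    pairs = Counter()
    for tags in tagged_items:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

def related_tags(tag, pairs, top_n=5):
    """Tags that most often appear alongside `tag`, most frequent first."""
    neighbours = Counter()
    for (a, b), count in pairs.items():
        if a == tag:
            neighbours[b] += count
        elif b == tag:
            neighbours[a] += count
    return [t for t, _ in neighbours.most_common(top_n)]

# Made-up items, each with its list of tags.
products = [
    ["dog", "canine", "puppy"],
    ["dog", "canine"],
    ["dog", "leash"],
    ["dogma", "religion"],
]
pairs = cooccurrence_counts(products)
near_dog = related_tags("dog", pairs)
```

Unlike substring or edit-distance matching, this keeps "dogma" away from "dog" because the two never appear on the same item.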
If you have whole sentences, or at least more than just single words available, you might want to have a look into Latent semantic analysis.
Don't be scared by the math; once you've got the basic idea behind it, it's fairly simple:
create a (high-dimensional) term-document matrix of your data
essential step: transform your huge sparse matrix into a lower dimension (Singular value decomposition)
every [collection of tags/terms] can then be specified by a vector in your lower-dimensional model
the (cosine) similarity between two such vectors is a good measure of the similarity of your tags, even if they do not share the same stem (you may find dog and barking related)
a good input for the term-document matrix is vital
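A small sketch of the first and last steps above: building term vectors from a term-document matrix and comparing them with cosine similarity. It deliberately skips the SVD step, which needs a linear algebra library, so this is not full latent semantic analysis, only the comparison it ends with. The sample documents and function names are made up.

```python
import math
from collections import Counter

def term_vector(tag, documents):
    """Row of the term-document matrix: how often `tag` occurs in each document."""
    return [Counter(doc.lower().split())[tag] for doc in documents]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up documents.
documents = [
    "the dog kept barking at the postman",
    "a barking dog seldom bites",
    "church dogma and doctrine",
    "the dog chased the cat",
]
dog = term_vector("dog", documents)
barking = term_vector("barking", documents)
dogma = term_vector("dogma", documents)

dog_vs_barking = cosine_similarity(dog, barking)
dog_vs_dogma = cosine_similarity(dog, dogma)
```

Even without the dimensionality reduction, "dog" comes out closer to "barking" than to "dogma". The SVD step is what would additionally let terms be related when they never share a document directly.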
An excellent read on this [and other IR topics] (Free eBook): Introduction to Information Retrieval
Have a look at the book, it's very well written and helped me a lot with my IR thesis.

best way to handle complex mysql search and measure how good of a match each result is

Sorry for the long title, but I couldn't think of a good way to put it. I'm currently working on a large web app project, and one of the main features is the detailed search. Without saying too much about the project, it is used to find business-related deals. The search function is currently spread over 3 pages and offers pretty much every option you'd want if you were in the industry.
The problem I've got now is that there are a lot of fields, so when it comes to searching for matches in the DB I don't really know the best way forward. I don't think a standard MySQL LIKE is going to cut it here. I also need to be able to figure out how good a match each result is and then display that in the results (search result 1 is a 90% fit, etc.).
Does anyone know the best way to tackle this? I know there are external search engines out there, but I don't know enough about them to make any sort of logical choice.
Thanks !
Finding relevance in search is a complex topic that deals with many parameters. The MySQL match() search itself is pretty complex as you can see here. Perhaps you could use this score itself as your measure. You can customize this to some extent.
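If you do use the match() score as your measure, one simple way to turn it into a "percent fit" is to scale every score against the best one in the result set. This is only one possible convention and an assumption on my part: it is relative, so the top result is always shown as 100%. The function name and the sample scores are made up.

```python
def percent_fit(scored_rows):
    """Scale raw relevance scores so the best row is 100%.

    scored_rows is a list of (row_id, score) pairs, for example taken
    from a MATCH() ... AGAINST() score column.
    """
    if not scored_rows:
        return []
    best = max(score for _, score in scored_rows)
    if best <= 0:
        return [(row_id, 0) for row_id, _ in scored_rows]
    return [(row_id, round(100 * score / best)) for row_id, score in scored_rows]

# Made-up row ids and scores.
fits = percent_fit([(17, 4.2), (8, 3.78), (23, 1.05)])
```

If you need an absolute measure instead, you would have to define what a perfect match means for your fields, for example the share of search criteria a deal satisfies.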
Another option, as you mentioned, is to use an external search engine, something along the lines of Solr. It meets all the requirements you listed: it's fast, scalable, and offers customization options to improve "relevance" for your specific needs.

PHP find relevance

Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic but I have the entire news article (can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (sphinx, lucene) is ok.
In machine learning and data mining terms, this kind of problem is called a classification problem. The easiest approach is a statistical one, i.e. using past data to predict future labels:
http://en.wikipedia.org/wiki/Statistical_classification, in which you can start by using the Naive Bayes classifier (commonly used in spam detection)
I would suggest reading this book (although it is written for Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325). It has a good example.
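For a feel of what a Naive Bayes classifier involves, here is a minimal sketch with add-one smoothing. It is not the book's code, and the class name, training sentences and topic labels are invented. Note that it needs some hand-labelled articles per topic to train on.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of training docs
        self.vocabulary = set()

    def train(self, text, topic):
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic in self.doc_counts:
            # Log prior plus log likelihood, with add-one (Laplace) smoothing.
            score = math.log(self.doc_counts[topic] / total_docs)
            topic_total = sum(self.word_counts[topic].values())
            for word in text.lower().split():
                count = self.word_counts[topic][word]
                score += math.log((count + 1) / (topic_total + len(self.vocabulary)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic

# Made-up training data: two labelled articles per topic.
clf = NaiveBayes()
clf.train("the striker scored a late goal", "sport")
clf.train("the team won the cup final", "sport")
clf.train("the central bank raised interest rates", "finance")
clf.train("shares fell as the market closed", "finance")
label = clf.classify("the bank cut rates")
```

With real articles you would also want a stop-list and far more training examples per topic.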
An Apache project providing machine learning libraries is Mahout. Its features include:
[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]
You can find Mahout under http://mahout.apache.org/
Although I have never used Mahout, just considered it ;-), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)
Dirt simple way to create a classifier:
Hand read and bucket N example documents from the 100K into each one of your 10 topics. Generally, the more example documents the better.
Create a Lucene/Sphinx index with 10 documents, one per topic. Each document will contain all of the example documents for that topic concatenated together.
To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.
Might not be super-accurate, but it's easy if you don't want to go through the trouble of training a real Naive Bayes classifier. If you want to go that route you can Google for WEKA or MALLET, two good machine learning libraries.
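The three steps above can be tried out without a search engine at all. In the sketch below, plain cosine similarity over word counts stands in for the Lucene/Sphinx score, so the numbers will differ from what a real index returns, but the structure (one concatenated "document" per topic, the article used as the query) is the same. Function names and sample texts are made up.

```python
import math
from collections import Counter

def build_topic_index(examples):
    """examples maps topic -> list of hand-labelled texts; returns topic -> word counts."""
    return {
        topic: Counter(" ".join(texts).lower().split())
        for topic, texts in examples.items()
    }

def score_topics(document, index):
    """Score the document against every topic 'document'; higher means more similar."""
    query = Counter(document.lower().split())
    query_norm = math.sqrt(sum(c * c for c in query.values()))
    scores = {}
    for topic, counts in index.items():
        dot = sum(query[w] * counts[w] for w in query)
        topic_norm = math.sqrt(sum(c * c for c in counts.values()))
        scores[topic] = dot / (query_norm * topic_norm) if query_norm and topic_norm else 0.0
    return scores

# Made-up hand-bucketed examples (two topics instead of ten, to keep it short).
examples = {
    "sport": ["the striker scored a late goal", "the team won the cup final"],
    "finance": ["the central bank raised interest rates", "shares fell as the market closed"],
}
index = build_topic_index(examples)
scores = score_topics("bank shares fell after rates rose", index)
best_topic = max(scores, key=scores.get)
```

You get a score for every topic, so you can also spot articles that sit between two topics or match none of them well.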
Excerpt from Chapter 7 of "Algorithms of the Intelligent Web" (Manning 2009):
"In other words, we’ll discuss the adoption of our algorithms in the context of a hypothetical
web application. In particular, our example refers to a news portal, which is inspired by the Google News website."
So, the content of Chapter 7 from that book should provide you with code for, and an understanding of, the problem that you are trying to solve.
You could use Sphinx to search all the articles for each of the 10 topics, and then set a threshold on the number of matches that would link an article to a particular topic, and so on.
I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.
I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?
Extract a list of the least common words and phrases from each article and use those as tags.
Then I would make a list of Topics and assign words and phrases which would fall within that topic and then match that to the tags. The problem is that you might get more than one topic per article.
Perhaps the best way would be to use some form of Bayesian classifiers to determine which topic best describes the article. It will require that you train the system initially.
This sort of technique is used to determine whether an email is spam or not.
This article might be of some help

What's a solid but basic search algorithm for php?

I am working on a project that involves searching for videos. These videos are tagged, similar to how questions are tagged on Stack Overflow. I was wondering if anyone knows of a good 'tag-based' search algorithm.
Thanks!
Depending on which operations (write? read? both?) you plan to use the most, there are different approaches.
Here is an interesting read: "Tags: Database schemas", which compares the tag schemas of some well-known websites.
How about searching tags, then titles, then descriptions, only widening the search from one method to the next if no results are found using the current method?
As an aside; if you want to return non-exact matches to your users, make sure they aren't so non-exact that they start becoming irrelevant! :)
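The widening search described above could be sketched like this. The video records are plain dictionaries with hypothetical `tags`, `title` and `description` keys; in a real application each tier would be a database query instead of a list scan.

```python
def tiered_search(term, videos):
    """Search tags first, then titles, then descriptions; stop at the first tier with hits."""
    term = term.lower()
    tiers = [
        lambda v: term in (t.lower() for t in v["tags"]),  # exact tag match
        lambda v: term in v["title"].lower(),              # substring of the title
        lambda v: term in v["description"].lower(),        # substring of the description
    ]
    for matches in tiers:
        hits = [v for v in videos if matches(v)]
        if hits:
            return hits
    return []

# Made-up video records.
videos = [
    {"title": "Cooking pasta", "tags": ["food", "italian"], "description": "A quick dinner recipe"},
    {"title": "Italian road trip", "tags": ["travel"], "description": "Driving from Rome to Milan"},
]
by_tag = tiered_search("italian", videos)
by_description = tiered_search("rome", videos)
```

Because the search stops at the first tier that produces results, the query "italian" returns only the video tagged with it, not the one that merely has the word in its title.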

Is there something like a translating database for strings?

I am searching for a database of translations so I can have commonly used phrases and words translated by a machine and not by an expensive translator. Is there such a thing as a translation database with words and often-used phrases?
If you don't know any would you use such a service?
edit: the database should only be monitored by people and not some automatic translator, since those tend to be VERY bad
I don't think this is enough. If you're going to translate single words, you need to have some idea of the context in which the word will be used.
For instance, consider the English word "row".
Does this mean
1. A line of things
2. An argument
3. To move a boat with oars
4. An uproar
5. Several things in succession ("they won four years in a row")
These are likely to have very different translations.
So instead, it might well be worth keeping a multi-language glossary, where you record the definition of a term and its translation in all the languages you care about, but I think you'll need a professional translator to get the translations right, and the "lookup" will always need to be manual.
Check: open-tran.eu. It is a database of translations taken from various open source projects.
http://www.google.com/language_tools
So what you want is a database phrase book? What do you want that for? You can't use a phrase book to translate books, software, etc. You can't use machine translation either, even though it can be a useful tool to start with. You have to use human translators who know the source and target languages well, preferably bilingual people.
The only thing a phrase book is good for is asking for directions, and then not understanding the answer... ;)
