So!
I am working in PHP and have a huge list of taxonomy/tags, say around 100,000.
A similar list of tags can be found in the wealth of tags listed under products at Zazzle.com.
I am attempting to programmatically organize this list into a tiered menu of sorts based on the relationship between words, similar strings, and specificity.
I have toyed around with the levenshtein function, similar_text, searching for substrings, using the Princeton WordNet database, etc., and just can't crack this nut. Essentially, I am trying to build an ontology out of this database that goes from very general to very specific in tiers. It doesn't have to be perfect, but I have run out of simple keyphrases to search for and of ideas for how to do this programmatically while still keeping some semblance of order.
For instance:
If I use substring matching, I might end up with Dog->Dogma, Dogra, etc.
If I use levenshtein or similar_text, I might end up with Bog, Log, Cog, and Dog all very closely related.
This database, or taxonomy if you will, is also constantly changing, and thus at least part of the analysis has to be done on the fly. The good news is that only one level of the result needs to be available. For instance, the near results of a query such as Dog might be small dog, large dog, red dog, blue dog, canine, etc.
I know this is a terrible question, but does anyone have a ray of light on at least what steps I should take, any useful functions I could use, queries to research, methodologies, etc.?
Thank you for your time.
So far, I have two suggestions for programmatically organizing tags into an ontology.
Find co-occurrences of tags to organize them into groups. I believe the idea is that if tags occur together, they are probably related.
Use algorithmic stemming to reduce multiple forms/derivations of a word to a common stem. This should reduce the number of tags the script needs to sift through, and may also help identify similar tags based on the shared stem.
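The co-occurrence suggestion can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP, and it assumes each product carries its own list of tags; the sample data and the function names `cooccurrence_counts` and `related_tags` are made up for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_items):
    """Count how often each unordered pair of tags appears on the same item."""
    pair_counts = Counter()
    for tags in tagged_items:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def related_tags(pair_counts, tag, top_n=5):
    """Return the tags that most often co-occur with `tag`, most frequent first."""
    neighbours = Counter()
    for (a, b), n in pair_counts.items():
        if a == tag:
            neighbours[b] += n
        elif b == tag:
            neighbours[a] += n
    return [t for t, _ in neighbours.most_common(top_n)]

# Hypothetical data: each inner list is the tag set of one product.
items = [
    ["dog", "puppy", "pet"],
    ["dog", "canine", "pet"],
    ["dog", "pet"],
    ["dogma", "religion"],
    ["log", "wood"],
]
counts = cooccurrence_counts(items)
top = related_tags(counts, "dog")
```

Unlike substring or edit-distance matching, "dogma" and "log" never show up next to "dog" here, because they never share a product with it.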
If you have whole sentences, or at least more than just single words, available, you might want to have a look at latent semantic analysis (LSA).
Don't be scared by the math; once you've got the basic idea behind it, it's fairly simple:
create a (high-dimensional) term-document matrix of your data
essential step: transform your huge sparse matrix into a lower dimension (Singular value decomposition)
every [collection of tags/terms] can then be specified by a vector in your lower-dimensional model
the (cosine) similarity between two such vectors is a good measure of the similarity of your tags, even if they do not share the same stem (you may find dog and barking related)
a good input for the term-document matrix is vital
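To make the vector idea concrete, here is a rough Python sketch of the first and last steps only: building term-document rows and comparing them with cosine similarity. It deliberately skips the SVD step, so it is not full LSA; in a real setup you would compute the cosine on the reduced vectors instead. The documents and the names `tag_vectors` and `cosine` are assumptions for illustration.

```python
import math
from collections import Counter

def tag_vectors(documents):
    """Build one sparse row per term: {document index: count}."""
    rows = {}
    for doc_id, terms in enumerate(documents):
        for term, n in Counter(terms).items():
            rows.setdefault(term, {})[doc_id] = n
    return rows

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents, each reduced to a list of terms.
docs = [
    ["dog", "barking", "leash"],
    ["dog", "barking"],
    ["dog", "puppy"],
    ["log", "wood"],
]
rows = tag_vectors(docs)
dog_barking = cosine(rows["dog"], rows["barking"])  # share two documents
dog_log = cosine(rows["dog"], rows["log"])          # share none
```

Even without the dimensionality reduction, "dog" comes out close to "barking" and unrelated to "log", although "log" is the nearer string.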
An excellent read on this [and other IR topics] (Free eBook): Introduction to Information Retrieval
Have a look at the book, it's very well written and helped me a lot with my IR thesis.
Related
Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic but I have the entire news article (can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (sphinx, lucene) is ok.
In terms of machine learning/data mining, this kind of problem is called a classification problem. The easiest approach is to use past data to predict future data, i.e. a statistical approach:
http://en.wikipedia.org/wiki/Statistical_classification, in which you can start by using the Naive Bayes classifier (commonly used in spam detection)
I would suggest reading this book (although it is written for Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325). It has a good example.
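For a feel of what a Naive Bayes text classifier involves, here is a small sketch in Python. It assumes you have hand-labelled a few example articles per topic; the class name, the training sentences and the whitespace tokenisation are all simplifications made up for this example, not code from the book.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # log prior + sum of log likelihoods, with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labelled examples.
nb = NaiveBayes()
nb.train("the team won the football match", "sports")
nb.train("the striker scored a late goal", "sports")
nb.train("parliament passed the new budget", "politics")
nb.train("the minister announced an election", "politics")

guess_a = nb.classify("goal in the football match")
guess_b = nb.classify("the minister and the budget")
```

With only four training sentences this is a toy, but the same structure scales to a few hundred labelled articles per topic.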
Well, an Apache project providing machine learning libraries is Mahout. Its features include:
[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]
You can find Mahout under http://mahout.apache.org/
Although I have never used Mahout, just considered it ;-), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)
Dirt simple way to create a classifier:
Hand read and bucket N example documents from the 100K into each one of your 10 topics. Generally, the more example documents the better.
Create a Lucene/Sphinx index with 10 documents corresponding to each topic. Each document will contain all of the example documents for that topic concatenated together.
To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.
Might not be super-accurate, but it's easy if you don't want to go through the trouble of training a real Naive Bayes classifier. If you want to go that route you can Google for WEKA or MALLET, two good machine learning libraries.
Excerpt from Chapter 7 of "Algorithms of the Intelligent Web" (Manning 2009):
"In other words, we’ll discuss the adoption of our algorithms in the context of a hypothetical
web application. In particular, our example refers to a news portal, which is inspired by the Google News website."
So, the content of Chapter 7 from that book should provide you with code for, and an understanding of, the problem that you are trying to solve.
You could use Sphinx to search all the articles for each of the 10 topics, and then set a threshold on the number of matches that links an article to a particular topic, and so on.
I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.
I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?
Extract a list of the least common words and phrases from each article and use those as tags.
Then I would make a list of Topics and assign words and phrases which would fall within that topic and then match that to the tags. The problem is that you might get more than one topic per article.
Perhaps the best way would be to use some form of Bayesian classifiers to determine which topic best describes the article. It will require that you train the system initially.
This sort of technique is used to determine whether an email is spam or not.
This article might be of some help
One of our bigger sites has a section where users can send questions to the website owner which get evaluated personally by his staff.
When the same question pops up very often they can add this particular question to the Faq.
In order to prevent them from receiving dozens of similar questions a day we would like to provide a feature similar to the 'Related questions' on this site (stack overflow).
What ways are there to build this kind of feature?
I know that I should somehow evaluate the question and compare it to the questions in the FAQ, but how does this comparison work? Are keywords extracted, and if so, how?
Might be worth mentioning this site is built on the LAMP stack thus these are the technologies available.
Thanks!
If you wanted to build something like this yourself from scratch, you'd use something called TF/IDF: Term Frequency / Inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.
In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.
I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
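The ranking described above can be sketched in a few lines. This is an illustrative Python version, not a tuned implementation: the stop list, the sample FAQ and the function names are invented for the example, and there is no stemming, so "elephants" does not match "elephant".

```python
import math
import re
from collections import Counter

# A tiny, hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"i", "to", "an", "a", "the", "do", "how", "can", "my", "what"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def build_idf(docs_tokens):
    """Inverse document frequency: rare words get a high weight."""
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}

def rank(query, questions, top_n=3):
    """Return the stored questions most similar to `query`, best first."""
    docs_tokens = [tokenize(q) for q in questions]
    idf = build_idf(docs_tokens)
    query_terms = set(tokenize(query))
    scored = []
    for question, tokens in zip(questions, docs_tokens):
        tf = Counter(tokens)
        score = sum(tf[t] * idf.get(t, 0.0) for t in query_terms)
        if score > 0:
            scored.append((score, question))
    scored.sort(key=lambda pair: -pair[0])
    return [question for _, question in scored[:top_n]]

faq = [
    "Where can I buy an elephant?",
    "How do I buy a ticket?",
    "What do elephants eat?",
    "How do I reset my password?",
]
related = rank("I want to buy an elephant", faq)
```

Here "elephant" appears in one stored question and "buy" in two, so the elephant question outranks the ticket question, which is the weighting the answer describes.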
I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that the matches on rarer tags count for more than matches on common tags.
You might also want to look at term frequency–inverse document frequency.
Given that you're working on a LAMP stack, you should be able to make good use of MySQL's full-text search functions, which I believe work on TF-IDF principles and should make it pretty easy to create the 'related questions' feature that you want.
There's a great O'Reilly book - Programming Collective Intelligence - which covers group discovery, recommendations and other similar topics. From memory the examples are in Python, but I found it easy to understand coming from a PHP background and within a few hours had built something akin to what you're after.
Yahoo has a keyword extractor webservice at http://developer.yahoo.com/search/content/V1/termExtraction.html
You can use spell-checking, where the corpus is the titles/text of the existing FAQ entries:
How do you implement a "Did you mean"?
I have designed a weighted graph using a normalized adjacency list in mysql. Now I need to find the shortest path between two given nodes.
I have tried to use Dijkstra in PHP but I couldn't manage to implement it (too difficult for me). Another problem is that with Dijkstra I would need to consider all the nodes, which may be very inefficient in a large graph. So does anybody have code for the above problem? It would be great if somebody could at least show me a way of solving it. I have been stuck here for almost a week now. Please help.
This sounds like a classic case for the A* algorithm, but if you can't implement Dijkstra, I can't see you implementing A*.
A* on Wikipedia
edit: this assumes that you have a good way to estimate (but it is crucial you don't over-estimate) the cost of getting from one node to the goal.
edit2: you are having trouble with the adjacency list representation. It occurs to me that if you create an object for each vertex in the map then you can link directly to these objects when there is a link. So what you'd have essentially is a list of objects that each contain a list of pointers (or references, if you will) to the nodes they are adjacent to. Now, if you want to access the path for a new node, you just follow the links. Be sure to maintain a list of the paths you've followed for a given vertex to avoid infinite cycles.
As far as querying the DB each time you need to access something, you're going to need to do this anyway. Your best hope is to only query the DB when you NEED to... this means only querying it when you want to get info on a specific edge in the graph, or for all edges for one vertex in the graph (the latter would likely be the better route), so you hit the slow I/O in small, occasional reads rather than in gigantic chunks all at once.
Here is a literate version of the Dijkstra algorithm, in Java, that may help you to figure out how to implement it in PHP.
http://en.literateprograms.org/Dijkstra%27s_algorithm_%28Java%29
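As a compact reference, here is a sketch of Dijkstra's algorithm in Python using a binary heap. It assumes the adjacency list has already been loaded from MySQL into a dictionary of `node -> [(neighbour, weight), ...]`; that shape, the sample graph and the function name are assumptions for this example. It stops as soon as the target is settled, so it does not necessarily visit every node.

```python
import heapq

def shortest_path(graph, source, target):
    """Return (distance, path) from source to target, or (inf, []) if unreachable.

    graph maps each node to a list of (neighbour, weight) pairs with weight >= 0.
    """
    dist = {source: 0}
    prev = {}
    visited = set()
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry; a shorter route was already settled
        visited.add(node)
        if node == target:
            break  # early exit: the target's distance is final
        for neighbour, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

# Hypothetical directed, weighted graph.
graph = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}
distance, path = shortest_path(graph, "A", "D")
```

The structure carries over to PHP fairly directly, with an associative array for `dist` and a priority queue in place of `heapq`.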
Dijkstra's algorithm returns the shortest paths from a given vertex to the other vertices.
You can find its pseudo-code on the wiki.
But I think you need the Floyd algorithm, which finds the shortest paths between all pairs of vertices in a DIRECTED graph.
The computational complexity of the two is fairly close.
I was able to find PHP implementations of both of them via the wiki.
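For comparison, the Floyd (Floyd-Warshall) algorithm is very short to write down. The sketch below is in Python with a made-up edge list; it runs three nested loops over the vertices, so if you only need one pair on a large graph, a single Dijkstra run is usually the cheaper choice.

```python
def floyd_warshall(nodes, edges):
    """All-pairs shortest distances for a directed, weighted graph.

    edges is a list of (from_node, to_node, weight) triples.
    """
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    # Allow each node k in turn to act as an intermediate stop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical directed edges.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 4), ("A", "C", 1), ("C", "B", 2), ("B", "D", 1), ("C", "D", 7)]
all_pairs = floyd_warshall(nodes, edges)
```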
I would like to use named entity recognition (NER) to find adequate tags for texts in a database.
I know there is a Wikipedia article about this and lots of other pages describing NER, but I would prefer to hear about this topic from you:
What experience have you had with the various algorithms?
Which algorithm would you recommend?
Which algorithm is the easiest to implement (PHP/Python)?
How do the algorithms work? Is manual training necessary?
Example:
"Last year, I was in London where I saw Barack Obama." => Tags: London, Barack Obama
I hope you can help me. Thank you very much in advance!
To start with, check out http://www.nltk.org/ if you plan on working with Python. As far as I know the code isn't "industrial strength", but it will get you started.
Check out section 7.5 from http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html but to understand the algorithms you will probably have to read through a lot of the book.
Also check out http://nlp.stanford.edu/software/CRF-NER.shtml. It's written in Java.
NER isn't an easy subject, and probably nobody will tell you "this is the best algorithm"; most of them have their pros and cons.
My 0.05 of a dollar.
Cheers,
It depends on whether you want:
To learn about NER: An excellent place to start is with NLTK, and the associated book.
To implement the best solution:
Here you're going to need to look for the state of the art. Have a look at publications in TREC. A more specialised meeting is Biocreative (a good example of NER applied to a narrow field).
To implement the easiest solution: In this case you basically just want to do simple tagging, and pull out the words tagged as nouns. You could use a tagger from nltk, or even just look up each word in PyWordnet and tag it with the most common wordsense.
Most algorithms require some sort of training, and perform best when they're trained on content that represents what you're going to be asking them to tag.
There are a few tools and APIs out there.
There's a tool built on top of DBPedia called DBPedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki). You can use their REST interface or download and install your own server. The great thing is it maps entities to their DBPedia presence, which means you can extract interesting linked data.
AlchemyAPI (www.alchemyapi.com) have an API that will do this via REST as well, and they use a freemium model.
I think most techniques rely on a bit of NLP to find entities, then use an underlying database like Wikipedia, DBPedia, Freebase, etc to do disambiguation and relevance (so for instance, trying to decide whether an article that mentions Apple is about the fruit or the company... we would choose the company if the article includes other entities that are linked to Apple the company).
You may want to try Yahoo Research's latest Fast entity Linking system - the paper also has updated references to new approaches to NER using neural network based embeddings:
https://research.yahoo.com/publications/8810/lightweight-multilingual-entity-extraction-and-linking
One can use artificial neural networks to perform named-entity recognition.
Here is an implementation of a bi-directional LSTM + CRF Network in TensorFlow (python) to perform named-entity recognition: https://github.com/Franck-Dernoncourt/NeuroNER (works on Linux/Mac/Windows).
It gives state-of-the-art results (or close to it) on several named-entity recognition datasets. As Ale mentions, each named-entity recognition algorithm has its own downsides and upsides.
[Figure: ANN architecture]
[Figure: the network as viewed in TensorBoard]
I don't really know about NER, but judging from that example, you could write an algorithm that searches for capitalized words or something like that. For that I would recommend regex as the easiest solution to implement if you're thinking small.
Another option is to compare the texts with a database, which would match strings pre-identified as tags of interest.
My 5 cents.
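A rough sketch of that capital-letter idea in Python is shown below. The pattern and the function name `naive_entities` are made up for this example. It collects runs of capitalized words and drops any run that starts a sentence, which also means it will miss a name that happens to open a sentence.

```python
import re

# One or more consecutive capitalized words, e.g. "London" or "Barack Obama".
CAPITALISED = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def naive_entities(text):
    """Return capitalized word runs that do not start a sentence."""
    entities = []
    for match in CAPITALISED.finditer(text):
        before = text[:match.start()].rstrip()
        # Skip words that are capitalized only because they open a sentence.
        if not before or before[-1] in ".!?":
            continue
        entities.append(match.group())
    return entities

tags = naive_entities("Last year, I was in London where I saw Barack Obama.")
```

On the example sentence from the question this yields London and Barack Obama, but it has no notion of what kind of entity it found.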
I am in search of a database with translations so I can have commonly used phrases and words translated by a machine and not by an expensive translator. Is there such a thing as a translation database with words and often-used phrases?
If you don't know of any, would you use such a service?
edit: the database should be maintained only by people and not by some automatic translator, since those tend to be VERY bad
I don't think this is enough. If you're going to translate single words, you need to have some idea of the context in which the word will be used.
For instance, consider the English word "row".
Does this mean
1. A line of things
2. An argument
3. To move a boat with oars
4. An uproar
5. Several things in succession ("they won four years in a row")
These are likely to have very different translations.
So instead, it might well be worth keeping a multi-language glossary, where you record the definition of a term and its translation in all the languages you care about, but I think you'll need a professional translator to get the translations right, and the "lookup" will always need to be manual.
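One way to picture such a glossary is to key each entry by the term and its definition, so that every sense carries its own translations. The sketch below is a hypothetical Python layout with a few German entries filled in for illustration; the lookup still relies on a person choosing the right sense.

```python
# Each key is (term, definition); each value maps a language code to a translation.
glossary = {
    ("row", "a line of things"): {"de": "Reihe"},
    ("row", "an argument"): {"de": "Streit"},
    ("row", "to move a boat with oars"): {"de": "rudern"},
}

def senses(term):
    """List the recorded definitions of a term so a person can pick one."""
    return [definition for (t, definition) in glossary if t == term]

def translate(term, definition, lang):
    """Look up the translation of one specific sense, or None if unrecorded."""
    return glossary.get((term, definition), {}).get(lang)

options = senses("row")
chosen = translate("row", "an argument", "de")
```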
Check: open-tran.eu. It is a database of translations taken from various open source projects.
http://www.google.com/language_tools
So what you want is a database phrase book? What do you want that for? You can't use a phrase book to translate books or software etc. You can't use machine translation either, even though it can be a useful tool to start with. You have to use human translators who know the source and target languages well, preferably bilingual people.
The only thing a phrase book is good for is asking for directions, and then not understanding the answer... ;)