Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic but I have the entire news article (can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (sphinx, lucene) is ok.
In machine learning / data mining terms, this kind of problem is called a classification problem. The easiest approach is to use past data to predict future labels, i.e. a statistical approach:
http://en.wikipedia.org/wiki/Statistical_classification. You can start with the Naive Bayes classifier (commonly used in spam detection).
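To illustrate the idea (independent of any particular engine), here is a minimal multinomial Naive Bayes classifier in plain Python. The training texts and labels below are invented; a real system would add proper tokenization, stop-word removal, and far more training data:

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """labeled_docs: list of (text, label) pairs that were categorized by hand."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}

def classify_nb(model, text):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    total_docs = sum(model["docs"].values())
    best_label, best_score = None, float("-inf")
    for label, counts in model["words"].items():
        score = math.log(model["docs"][label] / total_docs)  # log prior
        denom = sum(counts.values()) + len(model["vocab"])   # Laplace smoothing
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The same log-probability-with-smoothing scheme is what spam filters use, just with "spam"/"ham" as the two labels instead of ten topics.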
I would suggest reading this book (although it is written for Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325); it has a good example.
An Apache project providing machine learning libraries is Mahout. Its features include the possibility of:
[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]
You can find Mahout under http://mahout.apache.org/
Although I have never used Mahout, just considered it ;-), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)
Dirt simple way to create a classifier:
Hand-read and bucket N example documents from the 100K into each of your 10 topics. Generally, the more example documents the better.
Create a Lucene/Sphinx index with 10 documents corresponding to each topic. Each document will contain all of the example documents for that topic concatenated together.
To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.
Might not be super-accurate, but it's easy if you don't want to go through the trouble of training a real Naive Bayes classifier. If you want to go that route you can Google for WEKA or MALLET, two good machine learning libraries.
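If you'd rather prototype the scoring idea without standing up Lucene or Sphinx, a rough Python sketch of the same approach - concatenate the hand-bucketed examples per topic, then rank topics by term-overlap similarity - might look like this (topic names and texts are invented):

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term frequencies; a real setup would stem and drop stop words.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(topic_docs, document):
    """topic_docs: {topic: concatenated hand-bucketed example documents}.
    Returns (topic, score) pairs, best first -- like getting all 10 results
    back with a Lucene/Sphinx score per topic."""
    doc_vec = vectorize(document)
    scores = {t: cosine(vectorize(text), doc_vec) for t, text in topic_docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Lucene's actual scoring is more sophisticated (TF-IDF with length normalization), but the shape of the result - a ranked list of topic similarities - is the same.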
Excerpt from Chapter 7 of "Algorithms of the Intelligent Web" (Manning 2009):
"In other words, we’ll discuss the adoption of our algorithms in the context of a hypothetical web application. In particular, our example refers to a news portal, which is inspired by the Google News website."
So, the content of Chapter 7 from that book should provide you with code for, and an understanding of, the problem that you are trying to solve.
You could use Sphinx to search all the articles for each of the 10 topics, then set a threshold on the number of matches that would link an article to a particular topic, and so on.
I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.
I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?
Extract a list of the least common words and phrases from each article and use those as tags.
Then I would make a list of Topics and assign words and phrases which would fall within that topic and then match that to the tags. The problem is that you might get more than one topic per article.
Perhaps the best way would be to use some form of Bayesian classifier to determine which topic best describes the article. It will require that you train the system initially.
This sort of technique is used to determine whether an email is spam or not.
This article might be of some help
I am reading a book and searching the internet about API path hierarchies and have not found anything solid yet. What I really want to know is where to put the id in hierarchical API methods for retrieve/update/delete.
For instance I know I can do:
authority/resource/[id]/catalog/category1/category2
also:
authority/resource/catalog/category1/category2/[id]
The problem with the second example is that the path segment following category2 (the id) could itself be numeric, say to update a value.
I do not really know if there is a standard for building a representational state transfer (REST) API this way.
I could build my own, but I was wondering if there are any standards or recommended approaches.
The "standard" allows lots of interpretation in how you design your hierarchy. There is not really THE way to do it.
However, I think that this presentation:
https://blog.apigee.com/detail/restful_api_design
is a good read on the topic. It outlines some design choices and also shows how some popular APIs (such as the ones offered by Google or Twitter) choose to design their URLs.
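For what it's worth, a convention you will see in many popular APIs (one common choice, not a standard) is to keep collection names plural and put each id directly after the collection it identifies, so an id is never ambiguous:

```
GET    /articles                  collection of articles
POST   /articles                  create a new article
GET    /articles/{id}             one article
PUT    /articles/{id}             update one article
GET    /articles/{id}/comments    sub-collection of a specific article
```

With this layout, a numeric segment after a collection name is always an id for that collection, which avoids the ambiguity described in the question.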
So!
I am working in PHP and have a huge list of taxonomy/tags, say around 100,000.
A similar list of tags can be found in the wealth of tags listed under products at Zazzle.com.
I am attempting to programmatically organize this list into a tiered menu of sorts based on the relationship between words, similar strings, and specificity.
I have toyed around with the levenshtein function, similar_text, searching for substrings (sub_str), using the Princeton WordNet database, etc., and just can't crack this nut. Essentially, I am trying to build an ontology out of this database that goes from very general to very specific in tiers. It doesn't have to be perfect, but I have run out of simple keyphrases to search for and ideas of how to go about doing this programmatically while still having some semblance of order.
For instance:
If I use sub_str, I might end up with Dog->Dogma,Dogra, etc.
If I use levenshtein or similar_text, I might end up with Bog, Log, Cog, and Dog all very closely related.
This database, or taxonomy if you will, is also consistently changing, and thus at least part of the analysis has to be done on the fly. The good news is that only one level of the result needs to be available. For instance, the near results of a query such as Dog might be small dog, large dog, red dog, blue dog, canine, etc.
I know this is a terrible question, but does anyone have a ray of light: at least what steps I should take, any useful functions I could use, queries to research, methodologies, etc.?
Thank you for your time.
So far, I have two suggestions for programmatically organizing tags into an ontology.
Find co-occurrences of tags to organize them into groups, the idea being that if tags occur together, they are probably related.
Use algorithmic stemming to reduce multiple forms/derivations of words to a common stem. This should reduce the number of tags the script needs to sift through, and may also identify similar tags based on a shared root stem.
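A minimal sketch of both suggestions in plain Python; the suffix list in crude_stem is a crude stand-in for a real stemmer such as Porter/Snowball:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tagged_items):
    """tagged_items: one set of tags per article/product.
    Tag pairs that frequently appear together are probably related."""
    pairs = Counter()
    for tags in tagged_items:
        for a, b in combinations(sorted(tags), 2):
            pairs[(a, b)] += 1
    return pairs

def crude_stem(word):
    # Naive suffix stripping; a real system would use a Porter/Snowball stemmer.
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Grouping by crude_stem first, then thresholding the co-occurrence counts, gives a first cut at "one level" of related tags for a query term.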
If you have whole sentences, or at least more than just single words, available, you might want to have a look at Latent Semantic Analysis.
Don't be scared by the math; once you get the basic idea behind it, it's fairly simple:
create a (high-dimensional) term-document matrix of your data
essential step: transform your huge sparse matrix into a lower dimension (Singular value decomposition)
every [collection of tags/terms] can then be represented by a vector in your lower-dimensional model
the (cosine) similarity between two such vectors is a good measure of the similarity of your tags, even if they don't share a stem (you may find dog and barking related)
a good input for the term-document matrix is vital
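The steps above can be sketched with NumPy on a toy corpus (the four documents are invented, so the numbers mean little, but the mechanics are the real thing):

```python
import numpy as np

# Toy corpus (invented): two "dog" documents, two "cat" documents.
docs = [
    "dog barking loud",
    "dog puppy barking",
    "cat meowing loud",
    "cat kitten meowing",
]
vocab = sorted({w for d in docs for w in d.split()})

# 1. High-dimensional term-document matrix (rows = terms, columns = documents).
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# 2. Essential step: singular value decomposition, truncated to k dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # 3. each term is a vector in the low-dim model

# 4. Cosine similarity between term vectors measures relatedness.
def sim(t1, t2):
    a, b = term_vecs[vocab.index(t1)], term_vecs[vocab.index(t2)]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

On this corpus, sim("dog", "barking") comes out far higher than sim("dog", "meowing"), because dog and barking co-occur in the same documents even though the strings share nothing.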
An excellent read on this [and other IR topics] (Free eBook): Introduction to Information Retrieval
Have a look at the book, it's very well written and helped me a lot with my IR thesis.
I want to make a searching option for my site, and for fun I decided I should at least try to make it myself (If I fail, there's always Google Custom Search).
The problem is, I don't even know how to approach this monster! Here are the requirements:
Not all keywords will be required in the search (should one search for "Big happy world", it would also search for "Big world", "happy world", etc.)
Handling of common spelling mistakes (from a database, via edit distance, or a predefined list of common mistakes: rather then => rather than, etc.)
Search in both the content and the titles of posts, with an emphasis on titles.
Don't suck
I've searched my old pal Google for it, but the only reasonable things I found were academic level papers on the subject (English isn't my native, I'm good but not that good =( ).
So in short: does anyone know of a good place to start, a tutorial, an article, an example?
Thanks in advance.
There are several options you could try:
Apache Lucene (A PHP based implementation exists in the Zend Framework)
ElasticSearch (provides a REST-like API on top of Lucene)
Xapian
Sphinx
Probably a bunch of others too.
If you want to create your own search engine, Apache Lucene is a mature open-source library that can take care of a big part of the functionality for you.
Using Lucene, you first index your information [using an IndexWriter]. This is done offline, to create the index.
At search time, you use an IndexSearcher to find documents that match your query.
If you want some theoretical knowledge on "how it works", you should read more on information retrieval. A good place to start is Stanford's Introduction to Information Retrieval.
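To make the index-then-search flow concrete, here is a toy inverted index in Python. Lucene's IndexWriter/IndexSearcher do this at scale with far better scoring; this sketch just counts matching query words:

```python
from collections import defaultdict

class TinyIndex:
    """Minimal inverted index illustrating the index-then-search flow."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of doc ids
        self.docs = {}

    def add(self, doc_id, text):
        # The "IndexWriter" step, done offline: map each word to the docs
        # that contain it.
        self.docs[doc_id] = text
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query):
        # The "IndexSearcher" step: rank docs by how many query words they
        # contain (real engines weight terms, e.g. with TF-IDF).
        hits = defaultdict(int)
        for word in query.lower().split():
            for doc_id in self.postings[word]:
                hits[doc_id] += 1
        return sorted(hits, key=lambda d: -hits[d])
```

The point of the inverted index is that a query only touches the postings lists of its own words, instead of scanning every document.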
One of our bigger sites has a section where users can send questions to the website owner which get evaluated personally by his staff.
When the same question pops up very often, they can add that particular question to the FAQ.
In order to prevent them from receiving dozens of similar questions a day we would like to provide a feature similar to the 'Related questions' on this site (stack overflow).
What ways are there to build this kind of feature?
I know that I should somehow evaluate the question and compare it to the questions in the FAQ, but how does this comparison work? Are keywords extracted, and if so, how?
It might be worth mentioning that this site is built on the LAMP stack, so those are the technologies available.
Thanks!
If you wanted to build something like this yourself from scratch, you'd use something called TF/IDF: Term Frequency / Inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.
In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.
I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
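A bare-bones version of that ranking in Python, with a hard-coded stop-list and a smoothed IDF (both are simplifications, and the corpus below is invented):

```python
import math
from collections import Counter

STOP_WORDS = frozenset({"i", "to", "an", "a", "the", "is"})

def rank(corpus, query):
    """Rank documents in corpus by the summed TF-IDF weight of query terms."""
    docs = [d.lower().split() for d in corpus]
    n = len(docs)

    def idf(word):
        # Smoothed inverse document frequency: uncommon words weigh more.
        df = sum(1 for d in docs if word in d)
        return math.log((n + 1) / (df + 1))

    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)  # term frequency within this document
        scored.append((sum(tf[w] * idf(w) for w in terms), i))
    scored.sort(reverse=True)
    return [corpus[i] for _, i in scored]
```

With the "elephant" example, "elephant" appears in fewer documents than "buy", so it gets a higher IDF and dominates the ranking, exactly as described above.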
I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that the matches on rarer tags count for more than matches on common tags.
You might also want to look at term frequency–inverse document frequency.
Given that you're working on a LAMP stack, you should be able to make good use of MySQL's full-text search functions, which I believe work on TF-IDF principles and should make it pretty easy to create the 'related questions' feature you want.
There's a great O'Reilly book - Programming Collective Intelligence - which covers group discovery, recommendations and other similar topics. The examples are in Python, but I found it easy to understand coming from a PHP background, and within a few hours I had built something akin to what you're after.
Yahoo has a keyword extractor webservice at http://developer.yahoo.com/search/content/V1/termExtraction.html
You can use spell-checking, where the corpus is the titles/text of the existing FAQ entries:
How do you implement a "Did you mean"?
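For illustration, a minimal "did you mean" in plain Python, using Levenshtein distance against a vocabulary drawn from your existing FAQ titles (the max_distance cutoff of 2 is an arbitrary choice):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def did_you_mean(word, vocabulary, max_distance=2):
    """Suggest the closest known word, or None if nothing is close enough."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else None
```

Scanning the whole vocabulary per word is O(n); for a large corpus you would want an n-gram index or a BK-tree to prune candidates first.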
I would like to use named entity recognition (NER) to find adequate tags for texts in a database.
I know there is a Wikipedia article about this and lots of other pages describing NER, but I would prefer to hear about this topic from you:
What experiences did you make with the various algorithms?
Which algorithm would you recommend?
Which algorithm is the easiest to implement (PHP/Python)?
How do the algorithms work? Is manual training necessary?
Example:
"Last year, I was in London where I saw Barack Obama." => Tags: London, Barack Obama
I hope you can help me. Thank you very much in advance!
To start with, check out http://www.nltk.org/ if you plan on working with Python. As far as I know the code isn't "industrial strength", but it will get you started.
Check out section 7.5 of http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html, but to understand the algorithms you will probably have to read through much of the book.
Also check out http://nlp.stanford.edu/software/CRF-NER.shtml. It's written in Java.
NER isn't an easy subject, and probably nobody will tell you "this is the best algorithm"; most of them have their pros and cons.
My 0.05 of a dollar.
Cheers,
It depends on whether you want:
To learn about NER: An excellent place to start is with NLTK, and the associated book.
To implement the best solution:
Here you're going to need to look at the state of the art. Have a look at publications from TREC. A more specialised meeting is BioCreative (a good example of NER applied to a narrow field).
To implement the easiest solution: In this case you basically just want to do simple tagging, and pull out the words tagged as nouns. You could use a tagger from nltk, or even just look up each word in PyWordnet and tag it with the most common wordsense.
Most algorithms require some sort of training, and perform best when they're trained on content that represents what you're going to ask them to tag.
There are a few tools and APIs out there.
There's a tool built on top of DBPedia called DBPedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki). You can use their REST interface or download and install your own server. The great thing is it maps entities to their DBPedia presence, which means you can extract interesting linked data.
AlchemyAPI (www.alchemyapi.com) have an API that will do this via REST as well, and they use a freemium model.
I think most techniques rely on a bit of NLP to find entities, then use an underlying database like Wikipedia, DBPedia, Freebase, etc to do disambiguation and relevance (so for instance, trying to decide whether an article that mentions Apple is about the fruit or the company... we would choose the company if the article includes other entities that are linked to Apple the company).
You may want to try Yahoo Research's recent Fast Entity Linking system - the paper also has updated references to newer approaches to NER using neural-network-based embeddings:
https://research.yahoo.com/publications/8810/lightweight-multilingual-entity-extraction-and-linking
One can use artificial neural networks to perform named-entity recognition.
Here is an implementation of a bi-directional LSTM + CRF Network in TensorFlow (python) to perform named-entity recognition: https://github.com/Franck-Dernoncourt/NeuroNER (works on Linux/Mac/Windows).
It gives state-of-the-art results (or close to it) on several named-entity recognition datasets. As Ale mentions, each named-entity recognition algorithm has its own downsides and upsides.
[Figure: the ANN architecture, as viewed in TensorBoard]
I don't really know about NER, but judging from that example, you could make an algorithm that searches for capitalized words, or something like that. For that I would recommend a regex as the easiest solution to implement if you're thinking small.
Another option is to compare the texts against a database, which would match strings pre-identified as tags of interest.
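For the capital-letters idea, a minimal regex sketch in Python; the stop set of sentence-initial words is ad hoc, and real NER needs far more than this:

```python
import re

def naive_entities(text):
    # Grab runs of consecutive capitalized words ("Barack Obama", "London").
    candidates = re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)
    # Drop common sentence-initial words -- an ad hoc list, not real NER.
    stop = {"Last", "The", "In", "On", "A", "An"}
    return [c for c in candidates if c not in stop]
```

On the question's example sentence this yields London and Barack Obama, but it will miss lowercase entities, all-caps acronyms, and any capitalized non-entity not in the stop set, which is why trained models exist.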
my 5 cents.