Algorithms for named entity recognition - php

I would like to use named entity recognition (NER) to find adequate tags for texts in a database.
I know there is a Wikipedia article about this and lots of other pages describing NER, but I would prefer to hear about this topic from you:
What experience have you had with the various algorithms?
Which algorithm would you recommend?
Which algorithm is the easiest to implement (PHP/Python)?
How do the algorithms work? Is manual training necessary?
Example:
"Last year, I was in London where I saw Barack Obama." => Tags: London, Barack Obama
I hope you can help me. Thank you very much in advance!

To start with, check out http://www.nltk.org/ if you plan on working with Python. As far as I know the code isn't "industrial strength", but it will get you started.
Check out section 7.5 of http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html, but to understand the algorithms you will probably have to read through a lot of the book.
Also check out http://nlp.stanford.edu/software/CRF-NER.shtml. It's written in Java.
NER isn't an easy subject, and probably nobody will tell you "this is the best algorithm"; most of them have their pros and cons.
My 0.05 of a dollar.
Cheers,

It depends on whether you want:
To learn about NER: An excellent place to start is with NLTK, and the associated book.
To implement the best solution: Here you're going to need to look at the state of the art. Have a look at publications in TREC. A more specialised meeting is BioCreative (a good example of NER applied to a narrow field).
To implement the easiest solution: In this case you basically just want to do simple tagging, and pull out the words tagged as nouns. You could use a tagger from NLTK, or even just look up each word in PyWordNet and tag it with the most common word sense.
Most algorithms require some sort of training, and perform best when they're trained on content that represents what you're going to ask them to tag.

There are a few tools and APIs out there.
There's a tool built on top of DBpedia called DBpedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki). You can use their REST interface or download and install your own server. The great thing is that it maps entities to their DBpedia presence, which means you can extract interesting linked data.
AlchemyAPI (www.alchemyapi.com) has an API that will do this via REST as well, and it uses a freemium model.
I think most techniques rely on a bit of NLP to find entities, then use an underlying database like Wikipedia, DBPedia, Freebase, etc to do disambiguation and relevance (so for instance, trying to decide whether an article that mentions Apple is about the fruit or the company... we would choose the company if the article includes other entities that are linked to Apple the company).
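A minimal sketch of that disambiguation step, assuming a tiny hand-made table of related entities in place of a real knowledge base such as DBpedia. The table contents and the function name disambiguate are hypothetical; the idea is only that the sense sharing the most entities with the rest of the article wins.

```python
# Hypothetical stand-in for a knowledge base: each sense of an ambiguous
# name is mapped to entities that are known to be linked to it.
RELATED_ENTITIES = {
    "Apple (company)": {"iPhone", "Steve Jobs", "Microsoft", "Cupertino"},
    "Apple (fruit)": {"orchard", "cider", "pear", "harvest"},
}


def disambiguate(candidate_senses, context_entities, knowledge_base=RELATED_ENTITIES):
    """Pick the sense that shares the most entities with the article."""
    context = set(context_entities)

    def overlap(sense):
        return len(knowledge_base.get(sense, set()) & context)

    # On a tie, max() keeps the first candidate in the list.
    return max(candidate_senses, key=overlap)


if __name__ == "__main__":
    senses = ["Apple (company)", "Apple (fruit)"]
    print(disambiguate(senses, ["Steve Jobs", "iPhone", "pear"]))
```

A real system would weight the links (for example by how often two entities co-occur) rather than just counting them.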

You may want to try Yahoo Research's latest Fast entity Linking system - the paper also has updated references to new approaches to NER using neural network based embeddings:
https://research.yahoo.com/publications/8810/lightweight-multilingual-entity-extraction-and-linking

One can use artificial neural networks to perform named-entity recognition.
Here is an implementation of a bi-directional LSTM + CRF Network in TensorFlow (python) to perform named-entity recognition: https://github.com/Franck-Dernoncourt/NeuroNER (works on Linux/Mac/Windows).
It gives state-of-the-art results (or close to it) on several named-entity recognition datasets. As Ale mentions, each named-entity recognition algorithm has its own downsides and upsides.
[Figure: ANN architecture]
[Figure: the network as viewed in TensorBoard]

I don't really know about NER, but judging from that example, you could write an algorithm that searches for capitalized words or something like that. For that I would recommend regex as the easiest solution to implement if you're thinking small.
Another option is to compare the texts against a database, which would match strings pre-identified as tags of interest.
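As a rough sketch of the capital-letter idea, using only Python's standard re module: the function name candidate_tags and the sentence-start rule are my own assumptions, not an established NER method. It collects runs of capitalized words and skips any run that opens a sentence, since the first word of a sentence is capitalized anyway.

```python
import re

# One or more capitalized words in a row, e.g. "London" or "Barack Obama".
# A lone "I" does not match because at least one lowercase letter is required.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def candidate_tags(text):
    """Return capitalized word runs that do not start a sentence."""
    tags = []
    for match in CAPITALIZED_RUN.finditer(text):
        preceding = text[:match.start()].rstrip()
        at_sentence_start = preceding == "" or preceding[-1] in ".!?"
        if not at_sentence_start:
            tags.append(match.group())
    return tags


if __name__ == "__main__":
    print(candidate_tags("Last year, I was in London where I saw Barack Obama."))
```

The obvious weakness is that an entity at the start of a sentence is dropped, and capitalized words that are not entities (month names, for instance) slip through.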
my 5 cents.


Algorithm for Dating Website

Out of curiosity, I'm wondering if anyone has had any experience writing a dating website. I hear the word "algorithm" used a lot but have never really come across a situation where I've needed to use one (or so I thought). I also hear that people use algorithms for dating websites to find matches for people. What sort of language do these sites use for their logic? PHP perhaps?
My question in summary: could you use PHP to construct a dating website and use "algorithms" to find matches for people, or is that not how it's done?
An algorithm is a logical construct that performs a task.
If a dating website offers any sort of functionality to "match" people, whether it is calculating compatibility or doing searches based on some sort of "suitability" parameter then it is using an "algorithm" of some sort.
If a dating website merely lets users search the database based on the data entered (where they live, gender, etc. etc.) then it is not using any sort of algorithm beyond those used internally in its components.
The answer to your question in summary is "yes" and "whatever". Yes you could use PHP and you could use "algorithms" to find matches for people. As to "how it is done", I imagine that there is no one single way currently implemented, and certainly there is always room for someone to invent a new way of doing it even if that is how "it is done". Don't feel constrained by custom.
I hear the word "algorithm" used a lot but have never really come across a situation where I've needed to use one
Then you've never written a computer program, never planned anything in advance, never solved a mathematical problem, never cooked using a recipe?
I think you've not understood what 'algorithm' means.
Yes you could use PHP or lots of other programming languages and tools for this purpose. Computers and programming languages are just one way to implement algorithms - and a computer/program is by definition an implementation of an algorithm.
There's a nice chapter in "Programming Collective Intelligence" that talks about one algorithm in great detail. The examples are all in Python, but any language could be used.
Yes you can and you should. For example, Facebook, which isn't really a dating website, tries to find your "possible friends" by comparing your friends with another user's. If you have, let's say, 50 or more friends in common, that means you may also be friends.
For a dating website, where you live, age, interests ... and many more can be added to the algorithm.
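To make that concrete, here is a small sketch in Python (PHP would work just as well). The profile fields, the weights and the 20-year age window are all made-up assumptions for illustration, not how any real site does it.

```python
MAX_AGE_GAP = 20  # hypothetical: beyond this gap the age component is zero


def mutual_friends(friends_a, friends_b):
    """Friends that two users have in common."""
    return set(friends_a) & set(friends_b)


def match_score(a, b):
    """Combine shared interests, location and age into a score in [0, 1]."""
    interests_a, interests_b = set(a["interests"]), set(b["interests"])
    union = interests_a | interests_b
    # Jaccard similarity: shared interests divided by all interests.
    interest_score = len(interests_a & interests_b) / len(union) if union else 0.0
    city_score = 1.0 if a["city"] == b["city"] else 0.0
    age_score = max(0.0, 1.0 - abs(a["age"] - b["age"]) / MAX_AGE_GAP)
    # The weights are arbitrary and would need tuning against real data.
    return 0.5 * interest_score + 0.3 * city_score + 0.2 * age_score


if __name__ == "__main__":
    alice = {"interests": ["hiking", "jazz"], "city": "Leeds", "age": 30}
    bob = {"interests": ["jazz", "chess"], "city": "Leeds", "age": 35}
    print(round(match_score(alice, bob), 3))
```

Sorting all other users by match_score and showing the top few is already a "matching algorithm" in the sense discussed above.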
I wrote a Dating site in PHP, combined with an "adult store" (which I won't put the link to here...).
But yeah, that's how I did it - Used PHP functions to "match" people...

PHP find relevance

Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic but I have the entire news article (can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (sphinx, lucene) is ok.
In terms of machine learning/data mining, this kind of problem is called a classification problem. The easiest approach is to use past data to predict future data, i.e. a statistical approach:
http://en.wikipedia.org/wiki/Statistical_classification, where you can start with the Naive Bayes classifier (commonly used in spam detection).
I would suggest reading this book (although its examples are written in Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325); it has a good example.
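For a feel of what a Naive Bayes classifier involves, here is a minimal sketch in plain Python. It assumes you have hand-labelled a few example articles per topic; the class name, the whitespace tokenizer and the add-one smoothing are my own simplifications, not the book's code.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> training documents
        self.vocabulary = set()

    def train(self, text, topic):
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, -math.inf
        for topic in self.doc_counts:
            # Log prior: how common the topic is in the training data.
            score = math.log(self.doc_counts[topic] / total_docs)
            total_words = sum(self.word_counts[topic].values())
            for word in words:
                # Add-one smoothing so unseen words do not zero the score.
                count = self.word_counts[topic][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("the team won the match", "sports")
    nb.train("the striker scored a goal", "sports")
    nb.train("parliament passed the budget", "politics")
    nb.train("the minister proposed a new tax", "politics")
    print(nb.classify("the striker won"))
```

With 100,000 articles you would want proper tokenization and stop-word removal, but the training and scoring loop stays the same.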
Well, an Apache project providing machine learning libraries is Mahout. Its features include:
[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from exisiting categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]
You can find Mahout under http://mahout.apache.org/
Although I have never used Mahout, just considered it ;-), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)
Dirt simple way to create a classifier:
Hand read and bucket N example documents from the 100K into each one of your 10 topics. Generally, the more example documents the better.
Create a Lucene/Sphinx index with 10 documents, one per topic. Each document will contain all of the example documents for that topic concatenated together.
To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.
Might not be super-accurate, but it's easy if you don't want to go through the trouble of training a real Naive Bayes classifier. If you want to go that route you can Google for WEKA or MALLET, two good machine learning libraries.
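The steps above can be imitated without a search engine to see how the idea behaves. In this sketch plain cosine similarity over word counts stands in for the Lucene/Sphinx score, which is an assumption on my part: the real engines use their own, more elaborate scoring. The function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words count vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_topics(document, topic_examples):
    """Score a document against each topic's concatenated examples."""
    doc_vec = term_vector(document)
    scores = {
        topic: cosine(doc_vec, term_vector(" ".join(examples)))
        for topic, examples in topic_examples.items()
    }
    # Highest-scoring topic first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    examples = {
        "sports": ["team won match", "striker scored goal"],
        "finance": ["bank raised rates", "stock market fell"],
    }
    print(rank_topics("striker scored twice", examples))
```

As with the search-engine version, every topic gets a score and the top one is taken as the label.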
Excerpt from Chapter 7 of "Algorithms of the Intelligent Web" (Manning 2009):
"In other words, we’ll discuss the adoption of our algorithms in the context of a hypothetical web application. In particular, our example refers to a news portal, which is inspired by the Google News website."
So, the content of Chapter 7 from that book should provide you with code for, and an understanding of, the problem that you are trying to solve.
You could use Sphinx to search all the articles for each of the 10 topics, and then set a threshold on the number of matches that would link an article to a particular topic, and so on.
I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.
I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?
Extract a list of the least common words and phrases from each article and use those as tags.
Then I would make a list of Topics and assign words and phrases which would fall within that topic and then match that to the tags. The problem is that you might get more than one topic per article.
Perhaps the best way would be to use some form of Bayesian classifiers to determine which topic best describes the article. It will require that you train the system initially.
This sort of technique is used to determine whether an email is spam or not.
This article might be of some help

Using Visio (Dia) to Map Out Algorithms

I was just wondering how many experienced programmers out there actually map out their processes or algorithms in a program like MS Visio or GNOME Dia?
I am trying to code some complex PHP for my website and just seem to be missing something. Is a diagram program going to help or should I be looking in another area?
I use Visio only for quick diagrams that don't need to follow UML rules. It is sometimes useful for documentation that isn't about the details of the code: if you need to show a high-level, abstract view of your code, Visio does the job. Examples: documentation that shows how the big parts communicate, or a simple activity diagram...
You can find a SO list of free UML editor if you require to do intensive UML design.
Every time I've tried to make a truly usable diagram in Visio, it has always ended up being more work than it's worth. Never underestimate the power of pencil & paper, or better yet, a whiteboard.
But yes, explaining or writing out your problems will more quickly lead to a solution than merely sitting there and thinking about it.
OmniGraffle. Class diagrams. Sequence diagrams. Interaction diagrams. 'Nuff said.
When I want to make a sketch with 3 boxes and a handful of arrows I use graphviz.
I hate graphical stuff where you have to realign everything each time you change a name.
It's (nearly) as simple as writing:
Input -> Frobnicator -> Output
in a text file (wrapped in digraph G { ... }), then running "dot -Tpng -O myfile".
Give it a try...
But be warned that graphical representations only work for very high-level views (i.e. with few objects).
I use MagicDraw to chart out my use cases (so my team and I understand exactly which features are needed), and then I do activity charts and class diagrams for the more complex features. You can also do database architecture in there and have it generate the SQL for you (a godsend if your database is huge). MagicDraw isn't free, however, but if you anticipate doing a fair amount of complex projects it might be worth the investment. Outside of going the diagramming route, you can look into using a PHP framework that might take care of some things for you, e.g. Zend Framework or CodeIgniter.

Is there something like a translating database for strings?

I am searching for a database of translations so I can have commonly used phrases and words translated by a machine and not by an expensive translator. Is there such a thing as a translation database with words and often-used phrases?
If you don't know any would you use such a service?
edit: the database should only be monitored by people and not by some automatic translator, since those tend to be VERY bad
edit: the database should only be monitored by people and not by some automatic translator, since those tend to be VERY bad
I don't think this is enough. If you're going to translate single words, you need to have some idea of the context in which the word will be used.
For instance, consider the English word "row".
Does this mean
1. A line of things
2. An argument
3. To move a boat with oars
4. An uproar
5. Several things in succession ("they won four years in a row")
These are likely to have very different translations.
So instead, it might well be worth keeping a multi-language glossary, where you record the definition of a term and its translation in all the languages you care about, but I think you'll need a professional translator to get the translations right, and the "lookup" will always need to be manual.
Check: open-tran.eu. It is a database of translations taken from various open source projects.
http://www.google.com/language_tools
So what you want is a database phrase book? What do you want that for? You can't use a phrase book to translate books, software, etc. You can't use machine translation either, even though it can be a useful tool to start with. You have to use human translators who know the source and target languages well, preferably bilingual people.
The only thing a phrase book is good for is asking for directions, and then not understanding the answer... ;)

General programming knowledge?

I am completely new to programming - my interest lies in PHP & MySql for building a dynamic web application for Military Band Administration purposes. i.e. General info and social networking for members + added functionality for the management team to communicate effectively.
OK so the question - as I learn more about PHP there are terms used that I do not understand that must come from a common basis of familiarity between all languages i.e. "stack overflow" appears to be an obvious one - "using too many recursive functions may smash the stack" is another.
So is there a book (a primer perhaps) about programming in general which allows someone like me to have a better understanding of what all this means?
Bear in mind I am 57 years old (young) and am really just starting out.
Steve
Wikipedia is probably your best resource for general information on programming terminology. A large segment of their community overlaps with the programming community, so tech-related pages are normally very accurate, educational, and up to date. See their pages on stack overflows and recursion as examples.
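To make those two terms concrete, here is a small sketch in Python rather than PHP (the function names are made up for illustration). Every nested function call takes a slot on the call stack; recurse too deeply and the stack runs out. Python guards against this with a recursion limit and raises RecursionError, and other languages handle the same situation in their own ways.

```python
import sys


def count_down_recursive(n):
    """Each call waits on the next one, so n calls pile up on the stack."""
    if n == 0:
        return 0
    return count_down_recursive(n - 1)


def count_down_iterative(n):
    """Same result with a loop: only one stack frame, whatever n is."""
    while n > 0:
        n -= 1
    return n


def overflows(depth):
    """True if recursing `depth` levels deep exhausts Python's call stack limit."""
    try:
        count_down_recursive(depth)
        return False
    except RecursionError:
        return True


if __name__ == "__main__":
    too_deep = sys.getrecursionlimit() * 2
    print(overflows(10))                   # shallow recursion is fine
    print(overflows(too_deep))             # too many nested calls
    print(count_down_iterative(too_deep))  # the loop has no such problem
```

This is what the warning about "too many recursive functions" is about: the problem is the depth of nested calls, not recursion as such.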
Also, PHP.net is the best place to learn about PHP specifically, but skip the main page and go straight to the tutorial if you're just starting out.
Finally, I highly recommend the book Head First SQL for learning about databases. All of the examples in the book use MySQL. The entire Head First series is great. I hear they have a PHP book as well, but I haven't read that one.
Update: Head First PHP & MySQL is now available.
It sounds like you're missing some of the fundamentals covered in a computer science program. Not to worry, the information is readily available. You don't have to pay someone to teach it (though it's sometimes nice). Wikipedia's computer science entry isn't too bad for highlighting the major fields you're likely to encounter. Topics that are good to know:
Discrete math (Helps to understand formal logic, algorithmic complexity, probability)
Programming fundamentals (sounds like you may have a good start on these)
Data structures (Store and manipulate your data in an appropriate way for a task. For instance, why use a hashtable versus an array versus a linked list? From your list, the stack in a stackoverflow is a data structure.)
Algorithms (Manipulate your data structures in the most efficient way possible or at least know the cost. From your list, using too many recursive functions to "smash the stack" is an algorithmic choice.)
Computer Architecture (Understand what's really happening to your code after it's compiled or interpreted.)
Networks (Learn protocols, what happens to your software when it wants to talk to a machine it's not running on)
Comparative Programming Languages (PHP is just one way to skin a cat. Learn why its designers made the decisions they did and gain exposure to alternatives.)
Operating Systems (Knowing how hardware interacts with your software is good but it's probably more important to understand how it interacts with its operating system. File systems, process management, memory management, security)
Formal Languages/Theory of computation (Models of computing, grammars [used to validate and interpret code], limits of computing. Typically not used day to day as a software engineer. Then again, regular expressions finally made sense after this class.)
Software Design and Life cycle Methodologies (Be deliberate about designing, coding, testing, releasing, and maintaining your software.)
As far as books, I'd start by checking a trusted school's computer science program reading.
Stanford offers a set of classes online for free: http://see.stanford.edu/see/courses.aspx
MIT lets you download course materials for free: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/
Check youtube for computer science related lectures.
If you want something less school-focused, a quick search on Amazon with any of the topics above will give you results with user reviews.
Obviously, taken together, this list isn't really an introduction. I'd start with a topic that sounds interesting and jump in.
Well, it is not a book about PHP, but I think a book like "Learn to Program" from the Pragmatic Bookshelf might be useful for someone in your situation.
To get a good understanding of the inner workings of computer hardware and software in a very readable (not too technical) manner, I can recommend Code: The Hidden Language of Computer Hardware and Software by Charles Petzold.
The later chapters in particular talk about some of these more general programming concepts that are present in most programming languages. The earlier chapters focus more on the history of the computer and software, so they are not as relevant to your question.
However, it's not a large book, so reading it all should be interesting and useful anyway.
There are better books if you are looking for an introduction to PHP/MySQL programming specifically, however if you want more general knowledge about how software and hardware works, Code is great.
I'm not sure if there's one book that'll teach you "The Fundamentals of Programming." The only way I know of learning all these things is simply practice. Get a PHP tutorial and start building things. Always keep your mind open to learning new things. When you find a better way to do something than a likely very inefficient hack you put together from incomplete knowledge, then use it and learn it and integrate it into your knowledge... after a few years of this you'll be golden.
It's been a couple of years since I read it, but Programming PHP did a nice job of introducing the fundamental programming concepts along the way. It didn't do much to help with more advanced concepts like MVC (which is rare in the PHP world), but things like arrays, functions, callbacks, classes, etc. were covered.
I like Corbin's comment, but I'll take the opposite approach.
Basically, with systems today, you really don't need to know all of the low level details of systems. Really, you don't.
If you find this stuff interesting, the entire internet is at your disposal; simply let your inner muse guide you, through either necessity or simple curiosity. You can go as deep or as high as you want.
The truth is, computing today is simply SO fabulous that the project you want to embark upon is just a perfect opportunity to learn more about the arcane world of computing. The fact that folks can get as far as they can "without having a clue" is testament to how far the field has advanced. It's a good thing.
Is it a good thing to understand the process soup to nuts? Sure. Do you actually NEED that understanding to be productive and get useful results out of your time investment? No, you don't.
And, as you progress, if you actually enjoy this work (you well may not), the field goes as broad as you want.
Computing today is like "Home Depot". You can do it, we can help. There are hundreds of forums and thousands of pages of documentation, books, blogs, and discussion available for most any topic.
The key thing to focus on is simply getting your task done. Don't worry about getting it perfect, don't worry about "doing it the right way", don't "engineer" it. Just hammer bits together until you get something close to resembling what you want to get done using whatever you happen to find or intuit yourself. Because that's the easiest way to find out what you don't know, and how to not do things in your application. Try it and see.
You will be blinded by options, techniques, patterns, frameworks, etc. Not only is there "more than one way to do it", there are HUNDREDS of ways "to do it". Ignore the hundreds, and focus on the "doing it", however seems natural to you.
And don't let the yahoos in their ivory towers pooh-pooh your questions or shred your design. Unless their name is "Babbage" or they were cutting gears for the artillery computers back in WWII, we're all standing on the shoulders of giants here, and we all started somewhere. Honest criticism should always be welcome, but some folks seem to be beyond being able to offer that and instead resort to belittling.
I marvel at the applications I've seen "hacked", "butchered" and "OMG'd" together that folks get real, practical use out of -- and that's the real name of the game.
Good luck on your journey. Success in all your endeavors.
Learning is quite individual, so your mileage may vary, but I find that asking questions in public fora is very helpful. If you don't know a lot of the topic it's easy to ask the wrong questions, or somehow focus on the wrong things. Getting direct feedback from someone more experienced usually helps with that. PHP is blessed in that it has a very large and friendly community. Further, a lot of its users are amateurs or inexperienced programmers, which means that there is a culture of asking basic questions about terminology and the like. I suggest that you tap into this source.
One place to start could be SitePoint, but there are other places.
If you're more of a visual learner I would check out some video tutorials. Start with things like basic programming concepts and then move up to titles like Up and Running with PHP and Advanced PHP and MySQL. Then I'd suggest an MVC framework like CodeIgniter.
You can find great video tutorials on Lynda.com or Pluralsight.com and several other places.
