How to determine if a sentence is talking about a specific subject?

I have a list of predefined words and would like to know whether the primary subject of a sentence is one of those words.
Example:
Predefined words:
iPhone, Nexus, HTC
Sentence:
I like the new design of iPhone - primary subject is iPhone
I am listening to Nirvana on my Nexus. - primary subject is not in predefined words
The HTC phone is better than iPhone - primary subject is HTC
I would like to do this in PHP, or in something that can have a PHP interface.

Alias-i has a natural language parser for PHP.
Edit: this page says Alias-i's parser is written in PHP, but Alias-i's website says it is written in Java.

The short version: By Keywords.
This method works only with a limited set of Keywords.
A related question might be: Using preg_match to find all words in a list
The long version: By parsing the language and making the computer system understand it.
The latter is something linguists do. They develop such systems, and it takes years. You can probably find some implementations, but I do not know any from memory. I would need to ask a friend.
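A minimal sketch of the keyword approach in PHP, assuming a case-insensitive whole-word match is good enough. The function name findKeywords is made up for this illustration. It only reports which predefined words occur and in what order; it does not decide which one is the primary subject, so the Nirvana/Nexus sentence from the question would still report Nexus.

```php
<?php
// Hypothetical helper: returns the predefined keywords found in $sentence,
// in order of first appearance, using the spelling from $keywords.
function findKeywords(string $sentence, array $keywords): array
{
    if ($keywords === []) {
        return [];
    }
    // Escape each keyword so it is matched literally inside the pattern.
    $quoted = array_map(function ($k) { return preg_quote($k, '/'); }, $keywords);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';
    preg_match_all($pattern, $sentence, $matches);

    // Map lowercase spellings back to the canonical keyword.
    $canonical = [];
    foreach ($keywords as $k) {
        $canonical[strtolower($k)] = $k;
    }
    $found = [];
    foreach ($matches[1] as $match) {
        $found[] = $canonical[strtolower($match)];
    }
    return array_values(array_unique($found));
}

print_r(findKeywords('The HTC phone is better than iPhone', ['iPhone', 'Nexus', 'HTC']));
```

Taking the first element of the result is one crude way to guess the subject, but that is an assumption and not something the keyword method guarantees.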

Try to come up with good heuristics and evaluate them.
Examples:
1. The keyword is at the beginning of the sentence.
2. There is only one keyword in the text.
3. There is a continuous form such as "listening"; this usually leads to a subjective/uninformative message.
Write a classifier on top of those features. I would recommend Mallet.
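The three heuristics above could be turned into features roughly like this. It is only a sketch: the function name extractFeatures, the "within the first three words" cutoff, and the crude "-ing" test are all assumptions made for illustration, and the classifier itself (Mallet or otherwise) is not shown.

```php
<?php
// Hypothetical helper: computes the three heuristic features for one sentence.
function extractFeatures(string $sentence, array $keywords, int $leadingWords = 3): array
{
    $count = 0;
    $matches = [];
    if ($keywords !== []) {
        $quoted = array_map(function ($k) { return preg_quote($k, '/'); }, $keywords);
        $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
        $count = (int) preg_match_all($pattern, $sentence, $matches, PREG_OFFSET_CAPTURE);
    }

    $keywordFirst = false;
    if ($count > 0) {
        // Count the words that precede the first keyword occurrence.
        $offset = $matches[0][0][1];
        $wordsBefore = str_word_count(substr($sentence, 0, $offset));
        $keywordFirst = $wordsBefore < $leadingWords;
    }

    return [
        'keyword_first'  => $keywordFirst,   // heuristic 1 (assumed cutoff)
        'single_keyword' => $count === 1,    // heuristic 2
        // Heuristic 3, very crude: also fires on words like "thing".
        'has_ing_word'   => preg_match('/\b\w+ing\b/i', $sentence) === 1,
    ];
}

print_r(extractFeatures('I am listening to Nirvana on my Nexus.', ['iPhone', 'Nexus', 'HTC']));
```

Each sentence then becomes a small feature vector, which is the kind of input a classifier can be trained on once you have labelled examples.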

Related

Extract URL containing /find/ from numerous URL's?

I'm really a major novice at RegEx and could do with some help.
I have a long string containing lots of URLs and other text, and one of the URLs has /find/ in it, i.e.:
1. http://www.example.com/not/index.html
2. http://www.example.com/sat/index.html
3. http://www.example.com/find/index.html
4. http://www.example.com/rat/mine.html
5. http://www.example.com/mat/find.html
What sort of RegEx would I use to return the URL that is number 3 in that list but not return me number 5 as well? I suppose basically what I'm looking for is a way of returning a whole word that contains a specific set of letters and / in order.
TIA

I would assume you want preg_match("%/find/%",$input); or similar.
EDIT: To get the full line, use:
preg_match("%^.*?/find/.*$%m",$input);

I suggest using RegExr to build regular expressions.
You can type in a sample list (like the one above) and use a palette to create a RegExp and test it in real time. The program is available both online and as a downloadable Adobe AIR package.
Unfortunately I cannot access their site now, so I'm attaching the AIR package of the downloadable version.
I really recommend it, since it helped a RegExp newbie like me design even the most complex patterns.
However, for your question, I think that just
\/find\/
works if you want a yes/no result (i.e. whether or not the input contains /find/); otherwise, to obtain the full line, use
.*\/find\/.*
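If the goal is to pull out only the URL itself (not the whole line) from a string that mixes URLs with other text, something like the following might do. It is a sketch under the assumption that URLs start with http:// or https:// and end at whitespace, a quote, or an angle bracket; the function name findUrlsWithPathSegment is made up here.

```php
<?php
// Hypothetical helper: returns every URL in $text whose path contains /$segment/.
function findUrlsWithPathSegment(string $text, string $segment): array
{
    // % is used as the delimiter so the slashes need no escaping.
    $pattern = '%https?://[^\s"<>]*/' . preg_quote($segment, '%') . '/[^\s"<>]*%i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$input = "1. http://www.example.com/not/index.html\n"
       . "3. http://www.example.com/find/index.html\n"
       . "5. http://www.example.com/mat/find.html\n";
print_r(findUrlsWithPathSegment($input, 'find'));
```

Because the pattern requires a slash on both sides of find, a file name such as find.html does not match.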

In addition to Kolink's answer, in case you wanted to regex match the whole URI:
This is by no means an exhaustive regex for URIs, but it is a good starting point. I threw in a few options at key points, like .com, .net, and .org. In reality you'll have a fairly hard time matching URIs with regular expressions due to the lack of conformity, but you can come very close.
The regex from the above link:
/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is

PHP library for word clustering/NLP?

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?

Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation to be a separate word first, e.g. "Something, like this." -> "Something , like this ." OR, you can just remove all punctuation.
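$content = strtolower($content);                                  // the character class below is lowercase-only
$content = preg_replace('/[^a-z\s]/', '', $content);              // remove punctuation
$words = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY);  // split the text into words
$stopwords = 'the|and|is|your|me|for|where|etc...';
$stopwords = explode('|', $stopwords);
$stopwords = array_flip($stopwords);                              // flip so isset() gives a fast lookup
$result = array(); $temp = array();
foreach ($words as $s) {
    if (isset($stopwords[$s]) || strlen($s) < 3) {
        // a stopword or very short word ends the current phrase
        if (count($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}
if (count($temp) > 0) $result[] = implode(' ', $temp);            // flush the last phrase
$phrases = array_count_values($result);                           // phrase => number of occurrences
arsort($phrases);                                                 // most frequent first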
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would see if any of the top 3 array keys match any of the top 3 from any other in the data. These are then your groups.
Let me know if you have any trouble with this.

"... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?

If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.

You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.

This may be way off but check out OpenCalais. They have a web service which allows you to pass a block of text in and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in PHP and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?

If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.

Calling wordnet from php (Wordnet class or API for PHP)

I am trying to write a program to find the similarity between two documents, and since I'm using only English, I decided to use WordNet. However, I cannot find a way to link WordNet with PHP, and I cannot find any WordNet API for PHP.
I saw that someone on the forum (Spudley) said he called WordNet from PHP (using the shell_exec() function):
Thesaurus class or API for PHP [edited]
I would really like to know a method used or some example code, a tutorial perhaps to start using the wordnet with php.
many thanks

The PHP extension which is linked to from the WordNet site is very old and out of date -- it claims to work with PHP4, so I don't think it's been looked at in years.
There aren't any other APIs available for WordNet->PHP, so I rolled my own solution.
WordNet can be run from the command-line, so PHP's shell_exec() function can read the output.
If you run WordNet from the command-line (cd to Wordnet's directory, then just wn) without any parameters, it will show you a list of possible functions that Wordnet supports.
Still in the command-line, if you then try one/some of those functions, you'll see how Wordnet outputs its results. For example, if you want synonyms for the word 'star', you could try the -synsn function:
wn star -synsn
This will produce output that looks a bit like this:
Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun star
8 senses of star
Sense 1 star
=> celestial body, heavenly body
Sense 2 ace, adept, champion, sensation, maven, mavin, virtuoso, genius, hotshot, star, superstar, whiz, whizz, wizard, wiz
=> expert
Sense 3 star
=> celestial body, heavenly body
Sense 4 star
=> plane figure, two-dimensional figure
Sense 5 star, principal, lead
=> actor, histrion, player, thespian, role player
Sense 6 headliner, star
=> performer, performing artist
Sense 7 asterisk, star
=> character, grapheme, graphic symbol
Sense 8 star topology, star
=> topology, network topology
In PHP, you can read this same output using the shell_exec() function.
$result = shell_exec('/path/to/wn ' . escapeshellarg($word) . ' -synsn'); // escape $word so user input cannot inject shell commands
Now $result should contain the block of text quoted above.
At this point, you have to do some proper coding. You'll need to take that block of text and parse it for the data you want.
This is where it gets tricky. Because the data is presented in a format designed to be read by a human rather than by a program, it is tricky to parse accurately.
It is important to note that different search options present their output slightly differently. And, some of the results that are returned can be somewhat esoteric. I ended up writing a weighting system to score the results, but it was fairly specific to my needs, so you'll need to experiment with it to come up with your own system.
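As a starting point for the parsing step, here is a rough sketch that turns -synsn style output into an array keyed by sense number. It assumes the layout quoted above ("Sense N" followed by the synonyms, then a "=>" line with hypernyms) and also tolerates the synonyms appearing on the line after "Sense N". The function names are made up for this example, and other wn search options will need their own handling.

```php
<?php
// Hypothetical helper: splits "a, b, c" into ['a', 'b', 'c'], dropping empty items.
function splitTerms(string $list): array
{
    $terms = array_map('trim', explode(',', $list));
    return array_values(array_filter($terms, function ($t) { return $t !== ''; }));
}

// Hypothetical parser for the text returned by shell_exec() for "wn <word> -synsn".
function parseWordnetSenses(string $output): array
{
    $senses = [];
    $current = null;
    foreach (preg_split('/\R/', $output) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('/^Sense (\d+)\s*(.*)$/', $line, $m)) {
            // Start of a new sense; synonyms may follow on the same line.
            $current = (int) $m[1];
            $senses[$current] = ['synonyms' => splitTerms($m[2]), 'hypernyms' => []];
        } elseif ($current !== null && strpos($line, '=>') === 0) {
            // Hypernym line belonging to the current sense.
            $senses[$current]['hypernyms'] = array_merge(
                $senses[$current]['hypernyms'],
                splitTerms(substr($line, 2))
            );
        } elseif ($current !== null && $senses[$current]['synonyms'] === []) {
            // Synonyms printed on the line after "Sense N".
            $senses[$current]['synonyms'] = splitTerms($line);
        }
        // Header lines before the first sense are ignored.
    }
    return $senses;
}

$sample = "8 senses of star\nSense 5 star, principal, lead\n=> actor, histrion, player, thespian, role player\n";
print_r(parseWordnetSenses($sample));
```

Any scoring or weighting of the senses would be built on top of this array and depends on your own needs.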
I hope that's enough help for you. :)

I know it's kinda too late but recently I made a library to scratch my own itch
Wordnet php wrapper

Is there any way to detect strings like putjbtghguhjjjanika?

People search in my website and some of these searches are these ones:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question: is there any way to detect strings similar to the ones above?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
edit: I mean the "gibberish searches". For example, some people search for strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect gibberish searches.
It doesn't matter whether the search returns 0 results or anything else. I can't use this logic.
If I only accept "regular words", some new brands or products will be ignored.
Thank you for your help

You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
Edit, I implemented this here in Python:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
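For anyone who wants to see the transition-table idea without reading the linked repositories, here is a small PHP sketch. It is not the code from either repository. The function names, the 27-character alphabet (a-z plus space), the add-one smoothing, and the tiny training string are all assumptions made for illustration; a usable model needs a large body of English text and a threshold tuned on known good and bad queries.

```php
<?php
// Lowercase the text and keep only a-z and single spaces.
function normalizeText(string $text): string
{
    $text = preg_replace('/\s+/', ' ', strtolower($text));
    return preg_replace('/[^a-z ]+/', '', $text);
}

// Build a table of log-probabilities for each character-to-character transition.
function trainBigramModel(string $corpus): array
{
    $chars = str_split('abcdefghijklmnopqrstuvwxyz ');
    $counts = [];
    foreach ($chars as $a) {
        foreach ($chars as $b) {
            $counts[$a][$b] = 1; // add-one smoothing so unseen pairs are not impossible
        }
    }
    $text = normalizeText($corpus);
    for ($i = 0; $i + 1 < strlen($text); $i++) {
        $counts[$text[$i]][$text[$i + 1]]++;
    }
    $model = [];
    foreach ($counts as $a => $row) {
        $total = array_sum($row);
        foreach ($row as $b => $count) {
            $model[$a][$b] = log($count / $total); // normalize counts to probabilities
        }
    }
    return $model;
}

// Average log-probability per transition; lower means more gibberish-like.
// Returns null when the query is too short to contain a transition.
function averageLogProbability(string $query, array $model): ?float
{
    $text = normalizeText($query);
    $transitions = strlen($text) - 1;
    if ($transitions < 1) {
        return null;
    }
    $sum = 0.0;
    for ($i = 0; $i < $transitions; $i++) {
        $sum += $model[$text[$i]][$text[$i + 1]];
    }
    return $sum / $transitions;
}

// Toy corpus, far too small for real use.
$model = trainBigramModel('my name is rob and i like to hack. is this thing working? i hope so.');
foreach (['is this thing working', 'ytjkacvzw'] as $query) {
    printf("%s: %.3f\n", $query, averageLogProbability($query, $model));
}
```

Summing logs and dividing by the number of transitions is the same as taking the product of the probabilities and normalizing by length, but it avoids floating-point underflow on long queries.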

You could do what Stackoverflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
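The answer does not say which entropy measure is meant. One plausible reading is Shannon entropy over the characters of the string, which could be sketched as follows (the function name shannonEntropy is made up for this example). Note that this mainly flags very repetitive input such as "qwe qwe qwe a"; it will not by itself separate "tapoktrpasawe" from a real word of similar length.

```php
<?php
// Shannon entropy in bits per character, computed over the characters of $text.
function shannonEntropy(string $text): float
{
    // Split into UTF-8 characters; fall back to bytes if the input is not valid UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        $chars = str_split($text);
    }
    $length = count($chars);
    if ($length === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (array_count_values($chars) as $count) {
        $p = $count / $length;
        $entropy -= $p * log($p, 2);
    }
    return $entropy;
}

foreach (['qwe qwe qwe a', 'tapoktrpasawe', 'aıe qwo ıak kqw'] as $query) {
    printf("%s: %.3f bits/char\n", $query, shannonEntropy($query));
}
```

Any cutoff you apply to the result would have to be chosen by looking at your own query logs.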

Assuming you mean gibberish searches... It would be more trouble than it's worth. You are providing them with search functionality; let them use it however they please. I'm sure there are some algorithms out there that detect strange character groupings, but it would probably be more resource/labour intensive than simply returning no results.

I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.

I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?
As to why users are searching for these strings, that's the bigger question. You may be able to head off the gibberish searches some other way. For example, if it's comment spam phrases that people (or a script) are looking for, then install a CAPTCHA.
Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.

As some people commented, there are no hits in google for tapoktrpasawe or putjbtghguhjjjanika (Well, there are now, of course) so if you have a way to do a quick google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question - are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is just let it go, even if there is a way to prevent it. Some other weirdness will come along.

Short answer - Gibberish Search
A probabilistic language model works.
Logic
A word is made up of a sequence of characters. If we sum the frequencies with which each pair of adjacent characters in the word occurs together in English, and the sum crosses a threshold, the word is said to be a proper English word. In brief, this logic is known as a Markov chain.
Link
For the mathematics of gibberish and a better understanding, refer to this video: https://www.youtube.com/watch?v=l15C8UJu17s Thanks!

If the search is performed on products, you could cache their names or codes and check queries against that list before querying the database. Otherwise, if your site is for English-speaking users, you can build a dictionary of strings that aren't used in the English language, like qwkfagsd. That said, I agree with the other answer: this will be more resource intensive than not filtering at all.

Determine context/meaning of a web page (or paragraph of text)

Of course Google has been doing this for years! However, rather than start from scratch, spend 10 years+ and squander large sums of money :) I was wondering if anyone knows of a simple PHP library that would return a list of important words (and/or some sort of context) from a web page or chunk of text using PHP?
On a basic level, I am guessing that most spiders pull in words, remove words without real meaning, then count the rest. The most frequently occurring words would most likely be what I'm interested in.
Any sort of pointers would be really appreciated!

Latent Semantic Indexing.
I can give you pointers, but you want to look up/research Latent Semantic Indexing.
Rather than explain it, here is a quick snippet from a webpage.
Latent semantic indexing is essentially a way of extracting the meaning from a document without matching a specific phrase. A simple example would be that a document featuring the words ‘Windows’, ‘Bing’, ‘Excel’ and ‘Outlook’ would be about Microsoft. You wouldn’t need ‘Microsoft’ to appear again and again to know that.
This example also highlights the importance of taking into account related words because if ‘windows’ appeared on a page that also featured ‘glazing’, it would most likely be an entirely different meaning.
You can of course go down the easy route of dropping all stop words from the text corpus, but LSI is definitely more accurate.
I will update this post with more info in about 30 minutes.
(Still intending to update this post - Got too busy with work).
Update
Okay, so the basic idea behind LSA is to offer a new/different approach for retrieving a document based on a particular search term. You could very easily use it to determine the meaning of a document too.
One of the problems with the search engines of yesteryear was that they were based on keyword analysis. If you take Yahoo/Altavista from the late 1990s through to probably 2002/03 (don't quote me on this), they were extremely dependent on using ONLY keywords as a factor in retrieving a document from their index. Keywords, however, don't translate to anything other than the keyword which they represent.
However, the keyword "hot" means lots of things depending on the context in which it is placed. If you were to take the term "hot" and identify that it was placed around other terms such as "chillies", "spices" or "herbs", then conceptually it means something totally different from the term "hot" when surrounded by other terms such as "heat" or "warmth" or "sexy" and "girl".
LSA attempts to overcome these deficiencies by working on a matrix of statistical probabilities (which you build yourself).
Anyway, on to some tools that help you to build this matrix of documents/terms (and cluster them in a proximity which relates to their corpus). This works to the benefit of search engines by transposing keywords into concepts, so that if you search for a particular keyword, that keyword might not even appear in the documents which are retrieved, but the concept which the keyword represents does.
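To make "matrix of document/terms" a little more concrete, here is a sketch of just the first step: counting how often each term occurs in each document. The function name buildTermDocumentMatrix and the sample sentences are made up for this illustration. Real LSA goes further (typically weighting the counts and then applying a singular value decomposition), and that part is not shown here.

```php
<?php
// Builds $matrix[$term][$docIndex] = number of times $term occurs in that document.
function buildTermDocumentMatrix(array $documents): array
{
    $documents = array_values($documents); // make sure documents are indexed 0..n-1
    $matrix = [];
    foreach ($documents as $docIndex => $text) {
        // Split on anything that is not a letter; lowercase so "Hot" and "hot" match.
        $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!isset($matrix[$word])) {
                $matrix[$word] = array_fill(0, count($documents), 0);
            }
            $matrix[$word][$docIndex]++;
        }
    }
    ksort($matrix); // alphabetical term order, just for readable output
    return $matrix;
}

$matrix = buildTermDocumentMatrix([
    'Hot chillies and hot spices',
    'Hot weather and heat',
]);
print_r($matrix['hot']);
```

Terms whose rows look alike across many documents tend to appear in the same contexts, which is the signal LSA tries to compress into concepts.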
I've always used Lucene / Solr for search. A quick Google search for Solr LSA LSI returned a few links.
http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
This guy seems to have created a plugin for it.
http://github.com/algoriffic/lsa4solr
I might check it out over the next few weeks and see how it gets on.

Go have a look at Calais and Zemanta. Very cool stuff!

Personally, I'd be inclined to use something like a Brill parser to identify the part of speech of each word, discarding pronouns, verbs, etc and using that to extract a list of nouns (possibly with any qualifying adjectives) to build that list of keywords. You can find a PHP implementation of a Brill Parser on Ian Barber's PHP/IR site.
