I want to implement some applications with n-grams (preferably in PHP).
Which type of n-gram is better suited for most purposes: a word-level or a character-level n-gram? And how could you implement an n-gram tokenizer in PHP?
First, I would like to know what n-grams exactly are. Is the following correct? This is how I understand n-grams:
Sentence: "I live in NY."
word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"
character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"
When you have this array of n-gram-parts, you drop the duplicate ones and add a counter for each part giving the frequency:
word level bigrams: [1, 1, 1, 1, 1]
character level bigrams: [2, 1, 1, ...]
Is this correct?
Furthermore, I would like to learn more about what you can do with n-grams:
How can I identify the language of a text using n-grams?
Is it possible to do machine translation using n-grams even if you don't have a bilingual corpus?
How can I build a spam filter (spam, ham)? Combine n-grams with a Bayesian filter?
How can I do topic spotting? For example: is a text about basketball or about dogs? My approach (do the following with a Wikipedia article for "dogs" and for "basketball"): build the n-gram vectors for both documents, normalize them, calculate the Manhattan/Euclidean distance; the closer the result is to 0, the higher the similarity.
What do you think about my application approaches, especially the last one?
I hope you can help me. Thanks in advance!
Word n-grams will generally be more useful for most of the text analysis applications you mention, with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would create an n-gram vector for a corpus of text in each language you are interested in detecting, and then compare the frequencies of trigrams in each corpus to the trigrams in the document you are classifying. For example, the trigram "the" probably appears much more frequently in English than in German and would provide some level of statistical correlation. Once you have your documents in n-gram format, you have a choice of many algorithms for further analysis: Bayesian filters, k-Nearest Neighbors, Support Vector Machines, etc.
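As a hedged sketch of that comparison (the function names are made up, and these few words of "corpus" stand in for the megabytes of text a real language profile needs):

```php
// Toy sketch of trigram-based language detection; trigramProfile() and
// profileOverlap() are made-up names, and the profiles are built from toy data.
function trigramProfile($text) {
    $text = strtolower(preg_replace('/[^a-z ]/i', '', $text));
    $profile = array();
    for ($i = 0; $i <= strlen($text) - 3; $i++) {
        $t = substr($text, $i, 3);
        $profile[$t] = isset($profile[$t]) ? $profile[$t] + 1 : 1;
    }
    return $profile;
}

// Score: how much of the sample's trigram mass also occurs in the profile.
function profileOverlap($sample, $profile) {
    $score = 0;
    foreach ($sample as $trigram => $count) {
        if (isset($profile[$trigram])) {
            $score += min($count, $profile[$trigram]);
        }
    }
    return $score;
}

$profiles = array(
    'en' => trigramProfile('the quick brown fox jumps over the lazy dog'),
    'de' => trigramProfile('der schnelle braune fuchs springt ueber den faulen hund'),
);
$sample = trigramProfile('the dog ran over the fox');
$guess = null;
$best = -1;
foreach ($profiles as $lang => $profile) {
    $score = profileOverlap($sample, $profile);
    if ($score > $best) {
        $best = $score;
        $guess = $lang;
    }
}
echo $guess; // en
```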
Of the applications you mention, machine translation is probably the most far-fetched, as n-grams alone will not bring you very far down the path. Converting an input file to an n-gram representation is just a way to put the data into a format for further feature analysis, but as you lose a lot of contextual information, it may not be useful for translation.
One thing to watch out for is that it isn't enough to create a vector [1,1,1,2,1] for one document and a vector [2,1,2,4] for another document if the dimensions don't match. That is, the first entry in the vector cannot be "the" in one document and "is" in another, or the algorithms won't work. You will wind up with vectors like [0,0,0,0,1,1,0,0,2,0,0,1], as most documents will not contain most of the n-grams you are interested in. This 'lining up' of features is essential, and it requires you to decide 'in advance' which n-grams you will be including in your analysis. Often this is implemented as a two-pass algorithm, where the first pass determines the statistical significance of various n-grams to decide which to keep. Google 'feature selection' for more information.
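The alignment step can be sketched like this (variable names and counts are invented for illustration):

```php
// Two documents' raw n-gram counts; their keys do not line up. The counts
// are invented for illustration.
$docA = array('the' => 2, 'dog' => 1, 'runs' => 1);
$docB = array('the' => 1, 'cat' => 2);

// Build one shared vocabulary (the union of all keys), then emit vectors
// over that vocabulary, padding missing n-grams with 0.
$vocab = array_keys($docA + $docB);
$vecA = array();
$vecB = array();
foreach ($vocab as $term) {
    $vecA[] = isset($docA[$term]) ? $docA[$term] : 0;
    $vecB[] = isset($docB[$term]) ? $docB[$term] : 0;
}

print_r($vecA); // 2, 1, 1, 0
print_r($vecB); // 1, 0, 0, 2  (same dimensions, so the vectors are comparable)
```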
Word-based n-grams plus Support Vector Machines is an excellent way to perform topic spotting, but you need a large corpus of text pre-classified into 'on topic' and 'off topic' to train the classifier. You will find a large number of research papers explaining various approaches to this problem on a site like citeseerx. I would not recommend the Euclidean distance approach to this problem, as it does not weight individual n-grams based on statistical significance, so two documents that both include "the", "a", "is", and "of" would be considered a better match than two documents that both included "Bayesian". Removing stop words from your n-grams of interest would improve this somewhat.
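To make that point concrete, here is a hedged sketch (invented counts) that drops stop words before comparing; cosine similarity is used here instead of Euclidean distance because it is insensitive to overall document length:

```php
// Sketch of why stop words distort similarity. Counts are invented.
function dropStopWords($counts, $stopWords) {
    return array_diff_key($counts, array_flip($stopWords));
}

// 1.0 means identical term distribution, 0.0 means no shared terms.
function cosine($a, $b) {
    $dot = 0; $normA = 0; $normB = 0;
    foreach ($a as $term => $count) {
        $normA += $count * $count;
        if (isset($b[$term])) {
            $dot += $count * $b[$term];
        }
    }
    foreach ($b as $count) {
        $normB += $count * $count;
    }
    return ($normA && $normB) ? $dot / (sqrt($normA) * sqrt($normB)) : 0.0;
}

$stopWords = array('the', 'a', 'is', 'of', 'and');
$docA = dropStopWords(array('the' => 9, 'basketball' => 3, 'court' => 2), $stopWords);
$docB = dropStopWords(array('the' => 8, 'basketball' => 2, 'dog' => 1), $stopWords);
echo cosine($docA, $docB); // ~0.74, driven by "basketball" rather than "the"
```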
You are correct about the definition of n-grams.
You can use word level n-grams for search type applications. Character level n-grams can be used more for analysis of the text itself. For example, to identify the language of a text, I would use the frequencies of the letters as compared to the established frequencies of the language. That is, the text should roughly match the frequency of occurrence of letters in that language.
An n-gram tokenizer for words in PHP can be done using strtok:
http://us2.php.net/manual/en/function.strtok.php
For characters, use str_split:
http://us2.php.net/manual/en/function.str-split.php
Then you can just slice the resulting array into n-grams of any size you like.
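A minimal sketch of both tokenizers (getCharNgrams and getWordNgrams are made-up names):

```php
// Character-level: slide a window of $n characters over the string.
function getCharNgrams($text, $n = 2) {
    $ngrams = array();
    for ($i = 0; $i <= strlen($text) - $n; $i++) {
        $ngrams[] = substr($text, $i, $n);
    }
    return $ngrams;
}

// Word-level: split on whitespace first, then slide the window over words.
function getWordNgrams($text, $n = 2) {
    $words = preg_split('/\s+/', trim($text));
    $ngrams = array();
    for ($i = 0; $i <= count($words) - $n; $i++) {
        $ngrams[] = implode(' ', array_slice($words, $i, $n));
    }
    return $ngrams;
}

print_r(getCharNgrams('live'));          // li, iv, ve
print_r(getWordNgrams('I live in NY.')); // "I live", "live in", "in NY."

// array_count_values() then turns the list into the frequency table:
print_r(array_count_values(getCharNgrams('banana'))); // ba => 1, an => 2, na => 2
```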
Bayesian filters need to be trained for use as spam filters, and they can be used in combination with n-grams. However, you need to give them plenty of input in order for them to learn.
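As a toy illustration of that training step, here is a naive Bayes sketch over word unigrams (the same idea extends to n-grams; the training data is invented and far too small for real use):

```php
// Count word occurrences across a class's training messages.
function trainCounts($messages) {
    $counts = array();
    foreach ($messages as $msg) {
        foreach (preg_split('/\s+/', strtolower($msg)) as $word) {
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }
    return $counts;
}

// Log-probability of $text under one class, with add-one smoothing.
function scoreClass($text, $counts, $vocabSize) {
    $total = array_sum($counts);
    $score = 0.0;
    foreach (preg_split('/\s+/', strtolower($text)) as $word) {
        $c = isset($counts[$word]) ? $counts[$word] : 0;
        $score += log(($c + 1) / ($total + $vocabSize));
    }
    return $score;
}

$spam = trainCounts(array('buy cheap pills now', 'cheap pills cheap'));
$ham  = trainCounts(array('meeting at noon tomorrow', 'see you at the meeting'));
$vocabSize = count($spam + $ham);

$text = 'cheap pills tomorrow';
$label = scoreClass($text, $spam, $vocabSize) > scoreClass($text, $ham, $vocabSize)
    ? 'spam' : 'ham';
echo $label; // spam
```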
Your last approach sounds decent as far as learning the context of a page... this is still fairly difficult to do, but n-grams sound like a good starting point for doing so.
I'm not sure if this is possible, but is there a way (pre-written library or known scientific detection scheme) to analyse a few sentences of text and determine if the sentences rhyme? A colleague suggested comparing the first and last word and using a thesaurus, but I don't quite understand how that would work.
High accuracy is not what I am aiming for; an accuracy of even 20% would be awesome. It's for a gimmicky little web application idea I have, nothing important, I just thought it would be cool.
I am open to trying other languages, perhaps even Python, which I've heard is great for analysing text, but PHP would be preferable.
Metaphone http://www.php.net/manual/en/function.metaphone.php
You could classify an input into phonetics (sounds) and then check if the same sound appears frequently. Since each one should match up with a syllable, you could calculate the Levenshtein distance (counting the syllables between the matches) to see if they fit into some known pattern, e.g. a haiku.
http://www.php.net/manual/en/function.levenshtein.php
http://php.net/manual/en/function.soundex.php
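A rough sketch of that idea (mightRhyme and lastWord are made-up helper names, the edit-distance threshold of 1 is arbitrary, and metaphone() largely ignores vowels, so expect false positives and negatives):

```php
// Very rough rhyme heuristic: two lines "might rhyme" if the phonetic codes
// of their final words are close.
function lastWord($line) {
    $words = preg_split('/\W+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    return strtolower(end($words));
}

function mightRhyme($lineA, $lineB) {
    $a = metaphone(lastWord($lineA));
    $b = metaphone(lastWord($lineB));
    // Codes within edit distance 1 count as "close enough".
    return levenshtein($a, $b) <= 1;
}

var_dump(mightRhyme('The cat sat on the mat.', 'He wore a funny hat.'));
```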
I'm attempting to create an algorithm that will suggest Mad Gab style phrases.
The input is a set of phrases. I also have a set of keywords that I'd like to use when possible. Currently, my solution is simply brute force:
loop over phrases (character by character)
    if keyword is found
        store keyword and branch (recursion)
    increment character count
However, the problems I am running into are:
Account for compound keywords, e.g. "catches" can be "catches" or "cat" + "cheeses".
Allow literal terms - "the", "and", "one", "two", "three".
How to suggest terms that are not keywords, i.e. fall back on something like the system dictionary when keywords or literals cannot be found.
Skip phrase segments. Right now it just does one pass through. But consider the case where the phrase starts with something unmatched but a few characters later contains matches.
I am most familiar with PHP and MySQL. However, I am open to another technology if it provides a better solution.
I am also interested in any additional suggestions. Particularly ways to use the second parameter of metaphone() to make harder suggestions.
Perhaps start with a syllable-division algorithm on the phrase bank. You can even use a simple resource that teaches children to divide syllables to create your rough divider method:
http://www.ewsdonline.org/education/components/scrapbook/default.php?sectiondetailid=7584
If you want a more technical, completely accurate way, there was a Ph.D. dissertation about how to do it:
http://www.tug.org/docs/liang/
Then turn each syllable into a phonetic representation using either something you roll yourself or metaphone(). You can use a similar site that explains vowel sound rules. These will only be generalizations. You will process vowels separately from consonants if you roll your own. Metaphone just uses consonants, which is fine, but not as cool as if you also took into account vowels.
Vowels:
http://www.eslgold.com/pronunciation/english_vowel_sounds.html
Consonants:
http://usefulenglish.ru/phonetics/english-consonant-sounds
Then, you have a dictionary of English words for your word bank. There are many open-source dictionaries available that you could stick into a MySQL table.
Start with the first syllable and look for a random word in the dictionary that matches the soundex test. If you can't find one (this will generally only find one-syllable words), add the additional syllable and search again.
Example:
"Logical consequence"
A. Syllable split
"lo gi cal con se quence"
B. Vowel Sounds applied
"lah gee cahl con see quince"
C. Consonant Sounds applied
"lah jee kahl kon see quinse"
D. Soundex test (one-syllable soundex; obviously too easy to guess, but it proves the concept)
"Law Gee Call Con Sea Quints"
Comparing soundex values with strcmp returns a number. So if you like, you could compute the soundex values of everything in your word bank in advance. Then you can run the strcmp quickly.
An example of a Soundex MySQL comparison is:
select strcmp(soundex('lah'), soundex('law'));
I think using the MySQL soundex is easier for you than the PHP soundex test if you want a random result from a big database and you've already captured the soundex value in a field in your dictionary table.
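If you do stay in PHP, precomputing the codes looks like this (a tiny invented word bank stands in for the dictionary table):

```php
// Precompute the soundex code of every word in the bank once, keyed by code;
// matching a new syllable is then a single array lookup.
$bank = array('law', 'gee', 'call', 'con', 'sea', 'quints');
$byCode = array();
foreach ($bank as $word) {
    $byCode[soundex($word)][] = $word;
}

// 'lah' and 'law' share the same soundex code, so the lookup finds 'law'.
$matches = isset($byCode[soundex('lah')]) ? $byCode[soundex('lah')] : array();
print_r($matches); // law
```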
My suggestion may be inefficient, but optimization is a different question.
Update:
I didn't mean to imply that my solution would only yield one-syllable words. I used one syllable as the example, but if you took two of the syllables together, you'd get multi-syllable matches. In fact, you could probably just start by jamming all the syllables together and running soundex in MySQL. If you find an answer, great. But then you can roll off syllables until you get the longest match you can. Then you're left with the end of the phrase and can take those together and run a match. I think that's the essence of the solution below from the other contributor, but I think you need to avoid jamming all the letters together without spaces. In English, you'd lose information that way. Think of a phrase beginning with a "th" sound. If you jam the phrase together, you lose which "th" sound is needed. "Theremin" (the instrument) has a different "th" sound than "There, a man".
Taking a different tack from Jonathan Barlow's solution, I recommend an O(n²) algorithm that gives you the properties you seek in randomness, robustness, and scalable difficulty. The complexity of this algorithm can be further improved in constant time or with optimizations to the modality of the search, but because the size of your input phrases is guaranteed to be small, it's not that big a deal.
Construct a hash table of all known words in the Oxford English Dictionary and a map of lists of words by soundex() value. This initially sounds intractable, until you realize that there aren't actually that many of them in current use. Assuming a decent one-way hashing algorithm, this should take several megabytes, tops.
Consider the words in your input phrase as a single, compressed string of characters with no word identity whatsoever, discarding whitespace and all punctuation. From this, walk the space for all character lengths, starting with a length of one, up to the full length of the amalgamated phrase minus one. For each string produced by this walk, perform a hash lookup against the OED. When a word is encountered that's present in the dictionary, append the word and its position to the end of a list in memory. (This pass will always take sum(n) time, which is by definition 0.5n(n+1). So, O(n²) it is. Its space complexity is worst-case O(n²), but in practice a fully connected set of terms is extremely unlikely.)
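A hedged sketch of that walk (findEmbeddedWords is a made-up name, and the tiny flipped array stands in for the OED hash table):

```php
// Walk every substring of the compressed phrase and collect dictionary hits
// together with their start positions.
function findEmbeddedWords($phrase, $dictionary) {
    // Compress: strip everything but letters, lowercase.
    $s = strtolower(preg_replace('/[^a-z]/i', '', $phrase));
    $found = array();
    $len = strlen($s);
    // All substring lengths from 1 up to the full length minus one.
    for ($size = 1; $size < $len; $size++) {
        for ($pos = 0; $pos + $size <= $len; $pos++) {
            $candidate = substr($s, $pos, $size);
            if (isset($dictionary[$candidate])) {
                $found[] = array($candidate, $pos);
            }
        }
    }
    return $found;
}

$dictionary = array_flip(array('cat', 'catch', 'at', 'chess'));
print_r(findEmbeddedWords('catches', $dictionary));
// finds "at" at 1, "cat" at 0, "catch" at 0
```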
Now comes your difficulty slider. From the produced list, chop off the first N% of the found terms, where N is your level of difficulty. The principle here is, smaller words are easier for someone to lexically process, while longer words are more difficult to sound out and differentiate.
Construct an array conforming to the original length of the phrase (without spaces and punctuation) and shuffle your list of encountered words. Now, walk the shuffled list. For each element, verify if all of the slots in the array are free for that word at its original position. If they are, keep the word and its position, marking the slots as used in the array. If they are not, iterate to the next word until the list is exhausted.*
From the final output array, construct a partitioned list of unused characters in the space, treating each bag of characters as its own phrase. For this list, perform syllable detection exactly as sketched out here, passing the results to metaphone() with a percentage chance of glomming two or more syllables together. Then, for the bag of output dictionary words from the placement step above, perform soundex(), pulling a random word from the word's mapped list of comparable soundex values. For every word that can only soundex() to itself according to the backing map of lists, perform partitioning and metaphone(). Finally, stitch the two lists of results together by sorting on position and print your result.
This is a random algorithm with what I believe to be all of the desired properties, but it's still rough in my mind.
* Extra credit: determine the allowed overlaps for your system by character or syllable. This can make for an even larger gamut of accepted output phrases and a much higher level of difficulty.
Would it be possible to compare two strings to find alliteration and assonance?
I mainly use JavaScript or PHP.
I'm not sure that a regex would be the best way of building a robust comparison tool. A simple regex might be part of a larger solution that used more sophisticated algorithms for non-exact matching.
There are a variety of readily-available options for English, some of which could be extended fairly simply to languages that use the Latin alphabet. Most of these algorithms have been around for years or even decades and are well-documented, though they all have limits.
I imagine that there are similar algorithms for non-Latin alphabets but I can't comment on their availability firsthand.
Phonetic Algorithms
The Soundex algorithm is nearly 100 years old and has been implemented in multiple programming languages. It maps a string to a short code based on its pronunciation. It is not precise, but it may be useful for identifying similar-sounding words/syllables. I've experimented with it in MS SQL Server, and it is available in PHP.
http://php.net/manual/en/function.soundex.php
General consensus (including the PHP docs) is that Metaphone is much more accurate than Soundex when dealing with the English language. There are numerous implementations available (Wikipedia has a long list at the end of the article) and it is included in PHP.
http://www.php.net/manual/en/function.metaphone.php
Double Metaphone supports a second encoding of a word corresponding to an alternate pronunciation of the word.
As with Metaphone, Double Metaphone has been implemented in many programming languages.
Word Deconstruction
Levenshtein can be used to suggest alternate spellings (for example, to normalize user input) and might be useful as part of a more granular algorithm for alliteration and assonance.
http://www.php.net/manual/en/function.levenshtein.php
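For example, a minimal spelling-normalization helper (closestWord is a made-up name, and the word list is invented):

```php
// Snap a possibly misspelled input to the closest word from a known list.
function closestWord($input, $words) {
    $best = null;
    $bestDist = PHP_INT_MAX;
    foreach ($words as $word) {
        $d = levenshtein(strtolower($input), $word);
        if ($d < $bestDist) {
            $bestDist = $d;
            $best = $word;
        }
    }
    return $best;
}

echo closestWord('allitteration', array('alliteration', 'assonance', 'rhyme'));
// alliteration
```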
Logically, it would help to understand the syllabication of the words in the string so that each word could be deconstructed. The syllable break could resolve ambiguity as to how two adjacent letters should be pronounced. This thread has a few links:
PHP Syllable Detection
To find alliterations in a text, you simply iterate over all words, omitting those that are too short or too common, and collect them as long as their initial letters match.
var text = ''
    + '\nAs I looked to the east right into the sun,'
    + '\nI saw a tower on a toft worthily built;'
    + '\nA deep dale beneath a dungeon therein,'
    + '\nWith deep ditches and dark and dreadful of sight'
    + '\nA fair field full of folk found I in between,'
    + '\nOf all manner of men the rich and the poor,'
    + '\nWorking and wandering as the world asketh.';

var skipWords = ['the', 'and'];
var curr = [];
text.toLowerCase().replace(/\b\w{3,}\b/g, function(word) {
    if (skipWords.indexOf(word) >= 0)
        return;
    var len = curr.length;
    if (!len || curr[len - 1].charAt(0) == word.charAt(0)) {
        curr.push(word);
    } else {
        if (len > 2)
            console.log(curr);
        curr = [word];
    }
});
Results:
["deep", "ditches", "dark", "dreadful"]
["fair", "field", "full", "folk", "found"]
["working", "wandering", "world"]
For more advanced parsing and also to find assonances and rhymes you first have to translate a text into phonetic spelling. You didn't say which language you're targeting, for English there are some phonetic dictionaries available online, for example from Carnegie Mellon: ftp://ftp.cs.cmu.edu/project/fgdata/dict
This is something I'm working on and I'd like input from the intelligent people here on StackOverflow.
What I'm attempting is a function to repair text based on combining various bad versions of the same text page. Basically this can be used to combine different OCR results into one with greater accuracy than any of them individually.
I start with a dictionary of 600,000 English words, that's pretty much everything including legal and medical terms and common names. I have this already.
Then I have 4 versions of the text sample.
Something like this:
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
I am attempting to combine the above to get an output which looks like:
$text = 'First text sample is this line.';
Don't tell me it's impossible, because it is certainly not, just very difficult.
I would very much appreciate any ideas anyone has towards this.
Thank you!
My current thoughts:
Just checking the words against the dictionary will not work, since some of the spaces are in the wrong place and occasionally the word will not be in the dictionary.
The major concern is repairing broken spacings; once this is fixed, the most commonly occurring dictionary word can be chosen if one exists, or else the most commonly occurring non-dictionary word.
Have you tried using a longest common subsequence algorithm? These are commonly seen in the "diff" text comparison tools used in source control apps and some text editors. A diff algorithm helps identify changed and unchanged characters in two text samples.
http://en.wikipedia.org/wiki/Diff
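For reference, the core of such a comparison is the longest-common-subsequence length; here is a minimal character-level sketch (aligning all four OCR strings at once needs more machinery on top of this):

```php
// Minimal longest-common-subsequence length, the core of a diff comparison,
// using a rolling one-dimensional DP table.
function lcsLength($a, $b) {
    $m = strlen($a);
    $n = strlen($b);
    $prev = array_fill(0, $n + 1, 0);
    for ($i = 1; $i <= $m; $i++) {
        $curr = array_fill(0, $n + 1, 0);
        for ($j = 1; $j <= $n; $j++) {
            $curr[$j] = ($a[$i - 1] === $b[$j - 1])
                ? $prev[$j - 1] + 1
                : max($prev[$j], $curr[$j - 1]);
        }
        $prev = $curr;
    }
    return $prev[$n];
}

echo lcsLength('First text', 'Fir5t text'); // 9
```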
Some years ago I worked on an OCR app similar to yours. Rather than applying multiple OCR engines to one image, I used one OCR engine to analyze multiple versions of the same image. Each of the processed images was the result of applying a different denoising technique to the original image: one technique worked better for low contrast, another worked better when the characters were poorly formed. A "voting" scheme that compared OCR results on each image improved the read rate for arbitrary strings of text such as "BQCM10032". Other voting schemes are described in the academic literature for OCR.
On occasion you may need to match a word for which no combination of OCR results will yield all the letters. For example, a middle letter may be missing, as in either "w rd" or "c tch" (likely "word" and "catch"). In this case it can help to access your dictionary with any of three keys: initial letters, middle letters, and final letters (or letter combinations). Each key is associated with a list of words sorted by frequency of occurrence in the language. (I used this sort of multi-key lookup to improve the speed of a crossword generation app; there may well be better methods out there, but this one is easy to implement.)
To save on memory, you could apply the multi-key method only to the first few thousand common words in the language, and then have only one lookup technique for less common words.
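A hedged sketch of such a multi-key index (buildKeys is a made-up helper, the word list is invented, and the middle-letter key is one simplistic choice):

```php
// File each word under its first, middle, and last letters so an incomplete
// read like "c tch" can still reach "catch".
function buildKeys($words) {
    $index = array();
    foreach ($words as $word) {
        $keys = array(
            'first:' . $word[0],
            'last:'  . $word[strlen($word) - 1],
            'mid:'   . $word[intdiv(strlen($word), 2)],
        );
        foreach ($keys as $key) {
            $index[$key][] = $word;
        }
    }
    return $index;
}

$index = buildKeys(array('catch', 'word', 'bird', 'both'));
print_r($index['first:c']); // catch
print_r($index['last:d']);  // word, bird
```

In real use, each list would additionally be sorted by frequency of occurrence in the language, as described above.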
There are several online lists of word frequency.
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
If you want to get fancy, you can also rely on prior frequency of occurrence in the text. For example, if "Byrd" appears multiple times, then it may be the better choice if the OCR engine(s) reports either "bird" or "bard" with a low confidence score. You might load a medical dictionary into memory only if there is a statistically unlikely occurrence of medical terms on the same page--otherwise leave medical terms out of your working dictionary, or at least assign them reasonable likelihoods. "Prosthetics" is a common word; "prostatitis" less so.
If you have experience with image processing techniques such as denoising and morphological operations, you can also try preprocessing the image before passing it to the OCR engine(s). Image processing could also be applied to select areas after your software identifies the words or regions where the OCR engine(s) fared poorly.
Certain letter/letter and letter/numeral substitutions are common. The numeral 0 (zero) can be confused with the letter O, C for O, 8 for B, E for F, P for R, and so on. If a word is found with low confidence, or if there are two common words that could match an incompletely read word, then ad hoc shape-matching rules could help. For example, "bcth" could match either "both" or "bath", but for many fonts (and contexts) "both" is the more likely match since "o" is more similar to "c" in shape. In a long string of words such as a paragraph from a novel or magazine article, "bath" is a better match than "b8th."
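One simplistic way to apply such rules (ocrCandidates is a made-up helper, and the confusion table below is illustrative, not a complete set of substitutions):

```php
// Generate spelling candidates by applying common OCR confusions, then keep
// only the candidates present in the dictionary.
function ocrCandidates($word, $dictionary) {
    $confusions = array('0' => 'o', '1' => 'i', '5' => 's', '8' => 'b', 'c' => 'o');
    $candidates = array($word);
    foreach ($confusions as $bad => $good) {
        $next = array();
        foreach ($candidates as $c) {
            $next[] = $c;
            if (strpos($c, $bad) !== false) {
                $next[] = str_replace($bad, $good, $c);
            }
        }
        $candidates = $next;
    }
    return array_values(array_intersect(array_unique($candidates), $dictionary));
}

print_r(ocrCandidates('bcth', array('both', 'bath', 'line'))); // both
print_r(ocrCandidates('l1ne', array('both', 'bath', 'line'))); // line
```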
Finally, you could probably write a plugin or script to pass the results into a spellcheck engine that checks for noun-verb agreement and other grammar checks. This may catch a few additional errors. Maybe you could try VBA for Word or whatever other script/app combo is popular these days.
Tackling complex algorithms like this by yourself will probably take longer and be more error-prone than using a third-party tool. Unless you really need to program this yourself, you can check the Yahoo Spelling Suggestion API. They allow 5,000 requests per IP per day, I believe.
Others may offer something similar (I think there's a bing API, too).
UPDATE: Sorry, I just read that they've stopped this service in April 2011. They claim to offer a similar service called "Spelling Suggestion YQL table" now.
This is indeed a rather complicated problem.
When I wonder how to spell a word, the direct way is to open a dictionary. But what if it is a small complex sentence that I'm trying to spell correctly? One of my personal tricks, which works most of the time, is to call Google. I place my sentence between quotes on Google and count the results. Here is an example: entering "your very smart" on Google gives 13,600k pages. Entering "you're very smart" gives 20,000k pages. Then, most likely, the correct spelling is "you're very smart". And... indeed it is ;)
Based on this concept, I guess you have samples which, for the most part, are correctly spelled (well, maybe not if you develop for a teen gaming site...). Can you try to divide the samples into sub-pieces, not necessarily splitting all the way down to single words, and match these by frequency? The most frequent piece is the most likely to be correctly spelled. Prior to this, you can already run a dictionary spellcheck with your 600,000 terms to increase the chance that small spelling mistakes will already be corrected. This should increase the frequency of correct sub-pieces.
Dividing the sentences into pieces and finding the right "piece size" is also tricky.
What concerns me a little too: how do you extract the samples and match them together, so you know each correctly spelled sentence is the same (or very close)? Your question seems to assume you have this, which also seems very complex to me.
Well, what precedes is just a general tip based on my personal and human experience. I don't know if this can help. This is obviously not a real answer and is not meant to be one.
You could try using Google n-grams to achieve this.
If you need to derive the right string only by comparing the others, then something like this may help.
It's not finished yet, but it already gives some results.
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
function getRight($arr){
    $_final = '';
    $count = count($arr);
    $len = array();
    // Remove multiple spaces AND get string lengths
    for ($i = 0; $i < $count; $i++) {
        $arr[$i] = preg_replace('/\s\s+/', ' ', $arr[$i]);
        $len[$i] = strlen($arr[$i]);
    }
    // Max length
    $_max = max($len);
    for ($i = 0; $i < $_max; $i++) {
        $_el = array();
        for ($j = 0; $j < $count; $j++) {
            // Checking letter counts (skip strings shorter than position $i)
            if (!isset($arr[$j][$i])) continue;
            $_letter = $arr[$j][$i];
            if (isset($_el[$_letter])) $_el[$_letter]++;
            else $_el[$_letter] = 1;
        }
        // Most probable letter at this position
        list($mostProbably) = array_keys($_el, max($_el));
        $_final .= $mostProbably;
        // If the most probable letter is not a space
        if ($mostProbably != ' ') {
            // TODO: remove the space from the lines where $arr[$j][$i] is a space,
            // so that the remaining strings stay aligned
        }
    }
    return $_final;
}
echo getRight($text);
I am doing an experimental project.
What I am trying to achieve is to find out what the keywords in a text are.
My approach is to make a list of how many times each word appears in the text, sorted with the most used words at the top.
But the problem is that some common words like "is", "was", and "were" are always at the top, and these are obviously not worth keeping.
Can you suggest some good logic, so that it always finds good, related keywords?
Use something like the Brill tagger to identify the different parts of speech, like nouns. Then extract only the nouns and sort them by frequency.
Well, you could use preg_split to get the list of words and how often they occur; I'm assuming that's the bit you've got working so far.
The only thing I can think of for stripping the non-important words is to keep a dictionary of words you want to ignore (stop words), containing "a", "I", "the", "and", etc. Use this dictionary to filter out the unwanted words.
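Putting both steps together, a minimal sketch (topKeywords is a made-up helper, and the stop-word list is far from complete):

```php
// Count word frequencies, drop stop words, and return the top keywords.
function topKeywords($text, $stopWords, $limit = 5) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    $counts = array_diff_key($counts, array_flip($stopWords));
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

$stopWords = array('is', 'was', 'were', 'a', 'i', 'the', 'and', 'of');
$text = 'The dog was chasing the ball and the dog caught the ball';
print_r(topKeywords($text, $stopWords));
// dog and ball (2 each), then chasing and caught (1 each)
```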
Why are you doing this? Is it for searching page content? If so, most back-end databases offer some kind of text-search functionality. Both MySQL and PostgreSQL have a full-text search engine, for example, that automatically discards unimportant words. I'd recommend using the full-text features of the back-end database you're using, as chances are it already implements something that meets your requirements.
My first approach to something like this would be more mathematical modeling than pure programming.
There are two "simple" ways you can attack a problem like this:
a) an exclusion list (penalize a collection of words which you deem useless)
b) a weight function which, for example, builds on word length, so that small words such as prepositions ("in", "at", ...) and pronouns ("I", "you", "me", "his", ...) are penalized and hopefully fall mid-table
I am not sure if this was what you were looking for, but I hope it helps.
By the way, I know that contextual text processing is a subject of active research, you might find a number of projects which may be interesting.