I am having a hard time finding a way to detect whether two English words rhyme. They do not have to share the same final syllable; I am after something closer to phonetic similarity.
I cannot believe that in 2009 the only way to do this is with old-fashioned rhyme dictionaries. Do you know of any resources (PHP would be a plus) to help me with this painful task?
Thank you.
Your hints were all really helpful. I will take some time to investigate them. In the meantime, more information about Double Metaphone can be found here, as proper PHP code (the other one is an extension).
There is interesting information about the metaphone function and Double Metaphone on php.net.
They specifically warn about how slow Double Metaphone is compared with metaphone (something like 100 times slower).
Soundex won't help you. Soundex focuses on the beginning of the word, not its ending. In general, I think you'll have a hard time finding any tool to do this. Even to a linguist, the root of a word is more interesting than its ending.
Generally, what you'll have to do is divide the words into syllables and compare their last syllables. Even better, divide them into phonemes, reverse the order, and compare the reversed words. You might try comparing the last part of the metaphone keys.
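As a rough sketch of the last suggestion, the function below compares the trailing characters of two metaphone keys. The helper name metaphoneTailMatches and the default tail length of 2 are assumptions made for illustration. Because metaphone drops most vowels, this can only shortlist candidates; it cannot confirm a rhyme.

```php
<?php
// Hypothetical helper: treats two words as rhyme candidates when the
// last $tailLength characters of their metaphone keys are identical.
function metaphoneTailMatches(string $a, string $b, int $tailLength = 2): bool
{
    $keyA = metaphone($a);
    $keyB = metaphone($b);

    // Keys shorter than the requested tail cannot be compared reliably.
    if (strlen($keyA) < $tailLength || strlen($keyB) < $tailLength) {
        return false;
    }

    return substr($keyA, -$tailLength) === substr($keyB, -$tailLength);
}

var_dump(metaphoneTailMatches('nation', 'station'));
var_dump(metaphoneTailMatches('cat', 'dog'));
```

A longer tail makes the check stricter, at the cost of rejecting short words outright.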
See Bradley Buda's CS project summary from U. Michigan, which uses Levenshtein distance as an atom in finding rhyming English words. I believe combining Levenshtein and soundex should give better results.
Besides the soundex() function ramonzoellner mentioned, there is another function called levenshtein() which calculates the levenshtein distance between the two words. That may help you further.
It seems you need a database containing pronunciation, and possibly stress/emphasis: multisyllabic words with similar last syllables but stress on different syllables don't quite rhyme, at least not well enough to use them in poems; e.g. "poems" and "hems". The other answers (levenshtein & soundex) should help locate candidates, but they won't confirm a rhyme, as these similarly spelled words show:
tough
cough
dough
through
bough
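To illustrate why pronunciation data settles what spelling cannot, here is a minimal sketch that compares phonemes from the last primary-stressed vowel onward. The six entries are hand-written in CMUdict-style ARPAbet notation purely for illustration ("rough" is added so that one pair actually rhymes); they and the function names rhymePart and rhymes are assumptions, and a real application would load a full pronunciation dictionary instead.

```php
<?php
// Illustrative, hand-written pronunciations (an assumption, not real data
// loaded from a dictionary). A trailing "1" marks primary stress.
$pronunciations = [
    'tough'   => ['T', 'AH1', 'F'],
    'rough'   => ['R', 'AH1', 'F'],
    'cough'   => ['K', 'AO1', 'F'],
    'dough'   => ['D', 'OW1'],
    'through' => ['TH', 'R', 'UW1'],
    'bough'   => ['B', 'AW1'],
];

// Returns the phonemes from the last primary-stressed vowel to the end.
function rhymePart(array $phonemes): array
{
    for ($i = count($phonemes) - 1; $i >= 0; $i--) {
        if (substr($phonemes[$i], -1) === '1') {
            return array_slice($phonemes, $i);
        }
    }
    return $phonemes; // no stress mark found: compare the whole word
}

// Two words rhyme here when their rhyme parts are identical.
function rhymes(string $a, string $b, array $dict): bool
{
    if (!isset($dict[$a], $dict[$b])) {
        return false; // unknown word: cannot decide
    }
    return rhymePart($dict[$a]) === rhymePart($dict[$b]);
}

var_dump(rhymes('tough', 'rough', $pronunciations));
var_dump(rhymes('tough', 'cough', $pronunciations));
```

Under these entries, only "tough" and "rough" match, even though all of the words end in the same four letters.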
Did you try the soundex() function? It should give you at least some indication if the words sound alike.
Related
I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos: sometimes those writing instructions for their team in the game misspell a building name, and obviously the regex then does not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions for making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that can handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing, and others with letters added.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
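The approach described above could be sketched roughly as follows. The function name closestBuilding, the sample building list, and the threshold of 2 are assumptions for illustration; only levenshtein() and strtolower() are PHP built-ins.

```php
<?php
// Hypothetical helper: returns the building name closest to $word, or
// null when nothing is within $maxDistance edits.
function closestBuilding(string $word, array $buildings, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;

    foreach ($buildings as $name) {
        // Compare case-insensitively; levenshtein() itself is case-sensitive.
        $distance = levenshtein(strtolower($word), strtolower($name));
        if ($distance < $bestDistance) {
            $best = $name;
            $bestDistance = $distance;
        }
    }

    return $best;
}

$buildings = ['University', 'Barracks', 'Farm']; // sample data, assumed
var_dump(closestBuilding('Unversity', $buildings));
var_dump(closestBuilding('Castle', $buildings));
```

A fixed threshold is too generous for very short names, so scaling it with the word length may work better in practice.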
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex() ? – Teifion (28 mins ago)
Soundex is similar to the levenshtein function Triptych mentions: it is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the day we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.
I'm pulling in data from six live feeds, which sometimes have slightly different formatting, e.g. I might have
'arsenal' and 'arsenal fc'
'T Walcot' and 'Theo Walcot' and 'T. Walcot'
What I was wondering is whether there is a simple way to check if strings match each other, on the basis that if they share a certain percentage of letters in the same order they would be considered the same.
I suppose I could set up a list of related words and terms, but that would mean setting it up in advance. I was wondering if there is an easier, automated, on-the-fly way, as I won't be able to compile a full list for a long time.
There's a function just for that:
similar_text('Theo Walcott', 'T. Walcott', $similarity);
echo $similarity;
Have a look at the soundex function http://php.net/soundex and the similar_text function to get a percentage of similarity.
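Building on similar_text, a threshold check might look like the sketch below. The function name looksLikeSameName and the 75% cutoff are assumptions for illustration and would need tuning against real feed data.

```php
<?php
// Hypothetical helper: treats two labels as the same entity when
// similar_text() reports at least $threshold percent similarity.
function looksLikeSameName(string $a, string $b, float $threshold = 75.0): bool
{
    // similar_text() writes the similarity percentage into its third argument.
    similar_text(strtolower($a), strtolower($b), $percent);
    return $percent >= $threshold;
}

var_dump(looksLikeSameName('arsenal', 'arsenal fc'));
var_dump(looksLikeSameName('arsenal', 'chelsea'));
```

Abbreviated first names such as "T." reduce the percentage quickly, so stripping punctuation before comparing may help.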
I want to see how phonetically similar two non-English strings are. AFAIK, soundex and metaphone implementations only work correctly for English-based strings. For instance, coração and corassão sound exactly the same in Portuguese, but metaphone() returns KR and KRS. The same thing happens with other phonemes: chita and xita return XT and ST, but they sound the same.
I've also tried this Double Metaphone implementation (demo) but the results are exactly the same.
So, is there any alternative algorithm that works with Portuguese words? I've read about Lucene in this other question, but I've never used it before and I'm not sure how it works or how to use it.
If not, does anyone know what kind of data I need to gather to develop a metaphone-like algorithm?
In case anyone is interested, I found a promising work-in-progress here and some other cool projects.
I'm not sure if this is possible, but is there a way (pre-written library or known scientific detection scheme) to analyse a few sentences of text and determine if the sentences rhyme? A colleague suggested comparing the first and last word and using a thesaurus, but I don't quite understand how that would work.
High accuracy is not what I am aiming for; an accuracy of even 20% would be awesome. It's for a gimmicky little web application idea I have. Nothing important, I just thought it would be cool.
I am open to trying other languages, perhaps even Python which I've heard is great for analysing text but PHP would be preferable.
Metaphone http://www.php.net/manual/en/function.metaphone.php
You could classify an input into phonemes (sounds) and then check whether the same sound appears frequently. Since each one should match up with syllables, you could calculate the Levenshtein distance (count the syllables between the matches) to see if they fit into some known pattern, e.g. a haiku.
http://www.php.net/manual/en/function.levenshtein.php
http://php.net/manual/en/function.soundex.php
I have this problem to solve in my PHP project where some keywords (from a few hundred to a few thousand, of varying lengths) need to be searched for in a string about 100-300 characters long, sometimes shorter at 30-50 characters. I can preprocess the keywords and reuse them for new search strings. I am fairly new to PHP and did not find a way to do this in the PHP library. After a bit of searching, I found a few good candidates in the Aho-Corasick algorithm and then this improvement by Sun Wu and Udi Manber, which also seems to be known as agrep (or is a part of agrep): http://webglimpse.net/pubs/TR94-17.pdf
There are also Rabin-Karp, suffix trees, etc., but they did not look quite suitable: the first is for fixed-length keywords, and the latter seems quite generic and would need rather a lot of work.
Can anyone let me know if implementing the Agrep/Sun Wu-Manber on my own in php is a good way to solve this problem? Any other feedback?
EDIT: as I mentioned below in a comment, there are hundreds or more of distinct search keywords, so regex will not help. So that response is not helpful.
I think you can solve this problem by using "Levenshtein distance" metric.
From Wikipedia:
In information theory and computer science, the Levenshtein distance
is a string metric for measuring the amount of difference between two
sequences.
Plus, PHP has a levenshtein() function. Use your keyword list as an array and the searchable string as input, iterate over the array, and call levenshtein() in each iteration to find matches.
As of PHP 5.5, PHP's strtr uses the Wu-Manber algorithm for multi-pattern matching. See commit ccf15cf2 in the PHP git repository for details about the implementation. It is quite efficient, in my experience.
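As a small sketch of how the array form of strtr handles many keywords in one pass, the example below wraps every keyword it finds in brackets. The keyword list and the helper name markKeywords are assumptions for illustration. Note that strtr replaces rather than reports matches, so this marks occurrences instead of returning their positions.

```php
<?php
// Sample keywords (assumed); the map can be built once and reused
// for every new search string.
$keywords = ['agrep', 'suffix tree', 'regex'];

$map = [];
foreach ($keywords as $keyword) {
    $map[$keyword] = '[' . $keyword . ']';
}

// Hypothetical helper: marks every keyword occurrence in $text.
// With an array argument, strtr() tries the longest keys first and
// never rescans text it has already replaced.
function markKeywords(string $text, array $map): string
{
    return strtr($text, $map);
}

$marked = markKeywords('agrep beats regex here', $map);
echo $marked, PHP_EOL;
```

If the returned string is identical to the input, none of the keywords occurred in it.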
A pure-PHP implementation of the Aho-Corasick algorithm is available here: https://packagist.org/packages/wikimedia/aho-corasick