Localized (Double) Metaphone for Portuguese (pt_PT)

I want to see how phonetically similar two non-English strings are. AFAIK, soundex and metaphone implementations only work correctly for English-based strings. For instance, coração and corassão sound exactly the same in Portuguese, but metaphone() returns KR and KRS. The same thing happens with other phonemes: chita and xita return XT and ST, but they sound the same.
I've also tried this Double Metaphone implementation (demo) but the results are exactly the same.
So, is there any alternative algorithm that works with Portuguese words? I've read about Lucene in this other question, but I've never used it before and I'm not sure how it works or how to use it.
If not, does anyone know what kind of data I need to gather to develop a metaphone-like algorithm?
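One possible starting point is a small table of spelling-to-sound rewrite rules that is applied before building a consonant key. The JavaScript sketch below illustrates the idea with the two word pairs from the question. The rule list and the `ptKey` name are assumptions made up for this example, not a validated rule set for Portuguese. One known gap: it folds `ss` into `s`, so it cannot tell a voiced intervocalic `s` (as in casa) from `ss`.

```javascript
// Hypothetical, deliberately incomplete spelling-to-sound rules (assumptions).
const RULES = [
  [/ss/g, 's'],          // corassao -> corasao
  [/ch/g, 'x'],          // chita -> xita
  [/c(?=[ei])/g, 's'],   // soft c before e or i
  [/qu(?=[ei])/g, 'k'],  // quero -> kero
  [/[cq]/g, 'k'],        // any remaining c or q is treated as hard
];

function ptKey(word) {
  let s = word.normalize('NFC').toLowerCase();
  if (!s) return '';
  // Handle c-cedilla first: accent stripping would otherwise turn it into a plain c.
  s = s.replace(/\u00e7/g, 's');
  // Decompose accented letters and drop the combining marks.
  s = s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  for (const [pattern, replacement] of RULES) {
    s = s.replace(pattern, replacement);
  }
  // Keep the first letter, drop later vowels, collapse doubled letters.
  const key = s[0] + s.slice(1).replace(/[aeiou]/g, '');
  return key.replace(/(.)\1+/g, '$1').toUpperCase();
}

console.log(ptKey('coração'), ptKey('corassão'));
console.log(ptKey('chita'), ptKey('xita'));
```

A real rule set would need data on how each letter sequence is pronounced in context (for example a pronunciation dictionary), which is the kind of material the question asks about.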

In case anyone is interested, I found a promising work-in-progress here and some other cool projects.

Related

Accuracy of metaphone for word search in dictionary database in php

I am going to implement a "did you mean" feature for my application in PHP. Words from the database are used as the dictionary. I think similar words can be found more accurately with metaphone than with levenshtein, similar_text, soundex, etc. Can anyone comment on the accuracy of metaphone?
Thanks in advance.

Metaphone is an improvement on Soundex. The built-in PHP function uses the original algorithm from 1990. There have been two further improvements since then. The first, Double Metaphone, is available as a PECL extension from http://pecl.php.net/package/doublemetaphone or as a class file from http://swoodbridge.com/DoubleMetaPhone/.
(You can find a MySQL function at http://www.atomodo.com/code/double-metaphone/metaphone.sql/view)
The next improvement, Metaphone 3, can be bought at http://www.amorphics.com/buy_metaphone3.html

Analysing English Text Sentences To Detect Rhymes in PHP

I'm not sure if this is possible, but is there a way (pre-written library or known scientific detection scheme) to analyse a few sentences of text and determine if the sentences rhyme? A colleague suggested comparing the first and last word and using a thesaurus, but I don't quite understand how that would work.
High accuracy is not what I am aiming for; even 20% accuracy would be awesome. It's for a gimmicky little web application idea I have, nothing important. I just thought it would be cool.
I am open to trying other languages, perhaps even Python which I've heard is great for analysing text but PHP would be preferable.

Metaphone http://www.php.net/manual/en/function.metaphone.php
You could convert the input into phonetic codes (sounds) and then check whether the same sound appears frequently. Since each one should match up with syllables, you could calculate the Levenshtein distance (count the syllables between the matches) to see if they fit into some known pattern, e.g. a haiku.
http://www.php.net/manual/en/function.levenshtein.php
http://php.net/manual/en/function.soundex.php

RegEx: Compare two strings to find Alliteration and Assonance

Would it be possible to compare two strings to find alliteration and assonance?
I mainly use JavaScript or PHP.

I'm not sure that a regex would be the best way of building a robust comparison tool. A simple regex might be part of a larger solution that used more sophisticated algorithms for non-exact matching.
There are a variety of readily-available options for English, some of which could be extended fairly simply to languages that use the Latin alphabet. Most of these algorithms have been around for years or even decades and are well-documented, though they all have limits.
I imagine that there are similar algorithms for non-Latin alphabets but I can't comment on their availability firsthand.
Phonetic Algorithms
The Soundex algorithm is nearly 100 years old and has been implemented in multiple programming languages. It reduces a string to a short code (a letter followed by three digits) based on its pronunciation. It is not precise, but it may be useful for identifying similar-sounding words or syllables. I've experimented with it in MS SQL Server and it is available in PHP.
http://php.net/manual/en/function.soundex.php
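To make the idea concrete, here is a minimal JavaScript sketch of classic American Soundex. It is an illustration written for this page, not a port of PHP's soundex(), whose handling of edge cases may differ.

```javascript
// Minimal sketch of American Soundex: first letter plus three digits.
function soundex(word) {
  const codes = {
    b: 1, f: 1, p: 1, v: 1,
    c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
    d: 3, t: 3,
    l: 4,
    m: 5, n: 5,
    r: 6,
  };
  const letters = word.toLowerCase().replace(/[^a-z]/g, '');
  if (!letters) return '';

  let result = letters[0].toUpperCase();
  let prev = codes[letters[0]] || 0; // code of the previous letter (0 = vowel-like)

  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters[i];
    if (ch === 'h' || ch === 'w') continue; // h and w do not separate equal codes
    const code = codes[ch] || 0;
    if (code !== 0 && code !== prev) result += code;
    prev = code; // a vowel resets prev, so a repeated code after it counts again
  }
  return result.padEnd(4, '0');
}

console.log(soundex('Robert'), soundex('Rupert')); // R163 R163
console.log(soundex('Tymczak'));                   // T522
```

Because only the first letter and the next three consonant classes survive, words that differ mainly in their endings often collapse to the same code.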
General consensus (including the PHP docs) is that Metaphone is much more accurate than Soundex when dealing with the English language. There are numerous implementations available (Wikipedia has a long list at the end of the article) and it is included in PHP.
http://www.php.net/manual/en/function.metaphone.php
Double Metaphone supports a second encoding of a word corresponding to an alternate pronunciation of the word.
As with Metaphone, Double Metaphone has been implemented in many programming languages (example).
Word Deconstruction
Levenshtein can be used to suggest alternate spellings (for example, to normalize user input) and might be useful as part of a more granular algorithm for alliteration and assonance.
http://www.php.net/manual/en/function.levenshtein.php
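JavaScript has no built-in counterpart to PHP's levenshtein(), so for the JavaScript side a small dynamic-programming version is enough. This is a sketch of the standard algorithm with unit costs for insertion, deletion and substitution; the `closest` helper is a hypothetical addition that shows how it could suggest an alternate spelling.

```javascript
// Levenshtein distance with unit costs, keeping only two rows of the table.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: pick the candidate with the smallest distance.
function closest(input, candidates) {
  let best = null;
  let bestDistance = Infinity;
  for (const candidate of candidates) {
    const d = levenshtein(input, candidate);
    if (d < bestDistance) {
      best = candidate;
      bestDistance = d;
    }
  }
  return best;
}

console.log(levenshtein('kitten', 'sitting')); // 3
console.log(closest('aliteration', ['alliteration', 'assonance', 'rhyme']));
```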
Logically, it would help to understand the syllabication of the words in the string so that each word could be deconstructed. The syllable break could resolve ambiguity as to how two adjacent letters should be pronounced. This thread has a few links:
PHP Syllable Detection

To find alliterations in a text you simply iterate over all words, omitting too short and too common words, and collect them as long as their initial letters match.
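var text = ''
    + '\nAs I looked to the east right into the sun,'
    + '\nI saw a tower on a toft worthily built;'
    + '\nA deep dale beneath a dungeon therein,'
    + '\nWith deep ditches and dark and dreadful of sight'
    + '\nA fair field full of folk found I in between,'
    + '\nOf all manner of men the rich and the poor,'
    + '\nWorking and wandering as the world asketh.';

// Words that are too common to count toward an alliteration.
var skipWords = ['the', 'and'];
// Current run of consecutive words that share an initial letter.
var curr = [];

// Report the current run only if it holds at least three words.
function flush() {
    if (curr.length > 2)
        console.log(curr);
}

// Visit every word of three or more characters, in order.
(text.toLowerCase().match(/\b\w{3,}\b/g) || []).forEach(function (word) {
    if (skipWords.indexOf(word) >= 0)
        return;
    var len = curr.length;
    if (!len || curr[len - 1].charAt(0) == word.charAt(0)) {
        curr.push(word);
    } else {
        flush();
        curr = [word];
    }
});
flush(); // do not lose a run that ends the text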
Results:
["deep", "ditches", "dark", "dreadful"]
["fair", "field", "full", "folk", "found"]
["working", "wandering", "world"]
For more advanced parsing, and also to find assonances and rhymes, you first have to translate the text into phonetic spelling. You didn't say which language you're targeting. For English, there are some phonetic dictionaries available online, for example from Carnegie Mellon: ftp://ftp.cs.cmu.edu/project/fgdata/dict

Multiple keyword (100s to 1000s) search (string-search algorithm) in PHP

I have this problem to solve in my PHP project: some keywords (from a few hundred to a few thousand, of varying lengths) need to be searched for in a string about 100-300 characters long, sometimes shorter (30-50 characters). I can preprocess the keywords and reuse the result for new search strings. I am kind of new to PHP and did not find a way to do this in the PHP library. After a bit of searching, I found a few good candidates: the Aho-Corasick algorithm, and this improvement by Sun Wu and Udi Manber, which also seems to be known as agrep (or to be a part of agrep): http://webglimpse.net/pubs/TR94-17.pdf
There are also Rabin-Karp, suffix trees, etc., but they did not look quite suitable: the first is for fixed-length keywords, and the latter seems quite generic and would need rather a lot of work.
Can anyone let me know if implementing the Agrep/Sun Wu-Manber on my own in php is a good way to solve this problem? Any other feedback?
EDIT: as I mentioned below in a comment, there are hundreds or more of distinct search keywords, so regex will not help. So that response is not helpful.

I think you can solve this problem by using the Levenshtein distance metric.
From Wikipedia:
In information theory and computer science, the Levenshtein distance
is a string metric for measuring the amount of difference between two
sequences.
Plus, PHP has a levenshtein() function. Put your keyword list in an array, take the searchable string as input, iterate over the array, and call levenshtein() in each iteration for matching.

As of PHP 5.5, PHP's strtr uses the Wu-Manber algorithm for multi-pattern matching. See commit ccf15cf2 in the PHP git repository for details about the implementation. It is quite efficient, in my experience.
A pure-PHP implementation of the Aho-Corasick algorithm is available here: https://packagist.org/packages/wikimedia/aho-corasick
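To show what the Aho-Corasick approach involves, here is a compact JavaScript sketch. It is not the code of the package above, and the names `buildAutomaton` and `search` are made up for this example. The automaton is built once from the keyword list and can then be reused for every search string, which matches the preprocessing requirement in the question.

```javascript
// Build the keyword trie plus failure links (done once per keyword list).
function buildAutomaton(keywords) {
  // Each node: outgoing edges, failure link, keywords recognised at this node.
  const nodes = [{ next: new Map(), fail: 0, out: [] }];
  for (const kw of keywords) {
    let state = 0;
    for (let i = 0; i < kw.length; i++) {
      const ch = kw[i];
      if (!nodes[state].next.has(ch)) {
        nodes.push({ next: new Map(), fail: 0, out: [] });
        nodes[state].next.set(ch, nodes.length - 1);
      }
      state = nodes[state].next.get(ch);
    }
    nodes[state].out.push(kw);
  }
  // Breadth-first pass; nodes directly below the root keep fail = 0.
  const queue = [...nodes[0].next.values()];
  while (queue.length > 0) {
    const state = queue.shift();
    for (const [ch, child] of nodes[state].next) {
      let f = nodes[state].fail;
      while (f !== 0 && !nodes[f].next.has(ch)) f = nodes[f].fail;
      nodes[child].fail = nodes[f].next.has(ch) ? nodes[f].next.get(ch) : 0;
      // Inherit keywords that end at the failure target (proper suffixes).
      nodes[child].out.push(...nodes[nodes[child].fail].out);
      queue.push(child);
    }
  }
  return nodes;
}

// Scan the text once and report every keyword occurrence with its start index.
function search(nodes, text) {
  const matches = [];
  let state = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    while (state !== 0 && !nodes[state].next.has(ch)) state = nodes[state].fail;
    state = nodes[state].next.has(ch) ? nodes[state].next.get(ch) : 0;
    for (const kw of nodes[state].out) {
      matches.push({ keyword: kw, index: i - kw.length + 1 });
    }
  }
  return matches;
}

const automaton = buildAutomaton(['he', 'she', 'his', 'hers']);
console.log(search(automaton, 'ushers'));
```

The scan time grows with the length of the text plus the number of matches, not with the number of keywords, which is why this family of algorithms suits hundreds or thousands of keywords.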

Rhyme in PHP

I am having a hard time finding a way to detect whether two words rhyme in English. It does not have to be the same syllabic ending, but something closer to phonetic similarity.
I cannot believe that in 2009 the only way of doing it is using those old-fashioned rhyme dictionaries. Do you know any resources (in PHP would be a plus) to help me in this painful task?
Thank you.
Your hints were all really helpful. I will take some time to investigate them. Anyway, more info about DoubleMetaPhone can be found here in proper PHP code (the other one is an extension).
There is interesting information about the metaphone function and doublemetaphone on php.net.
In particular, they warn about how slow Double Metaphone is compared with metaphone (something like 100 times slower).

Soundex won't help you. Soundex focuses on the beginning of the word, not its ending. Generally, I think you'll have a hard time finding any tool to do this. Even to a linguist, the root of the word is more interesting than its ending.
Generally, what you'll have to do is divide words into syllables and compare their last syllables. Even better would be to divide each word into phonemes, reverse their order, and compare the reversed words. You might try comparing the last part of the metaphone keys.

See Bradley Buda's CS project summary from U. Michigan, which uses Levenshtein distance as an atom in finding rhyming English words. I believe combining Levenshtein and soundex should give better results.

Besides the soundex() function ramonzoellner mentioned, there is another function called levenshtein(), which calculates the Levenshtein distance between two words. That may help you further.

Seems like you need to find a database containing pronunciation, and possibly stress/emphasis: multisyllabic words with similar last syllables, but stresses on different syllables don't quite rhyme, at least in the sense of being able to use them in poems; e.g. "poems" and "hems". The other answers (levenshtein & soundex) should help for locating candidates, but they won't confirm it:
tough
cough
dough
through
bough
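With pronunciation data in hand, a common working definition is that two words rhyme when their phonemes match from the last stressed vowel to the end. The JavaScript sketch below assumes that definition. Its tiny word table is hand-typed in the style of a pronunciation dictionary (ARPAbet-like phonemes, with a trailing 1 marking primary stress) purely for illustration; a real application would load the entries from an actual dictionary file.

```javascript
// Illustrative, hand-typed pronunciations (assumption: a trailing "1" marks
// the primary-stressed vowel, as in CMU-dictionary-style notation).
const PRONUNCIATIONS = new Map([
  ['tough', ['T', 'AH1', 'F']],
  ['rough', ['R', 'AH1', 'F']],
  ['dough', ['D', 'OW1']],
  ['though', ['DH', 'OW1']],
  ['through', ['TH', 'R', 'UW1']],
]);

// Phonemes from the last primary-stressed vowel to the end of the word.
function rhymePart(phonemes) {
  for (let i = phonemes.length - 1; i >= 0; i--) {
    if (phonemes[i].endsWith('1')) return phonemes.slice(i).join(' ');
  }
  return phonemes.join(' '); // no stress mark: fall back to the whole word
}

// True when both words are known, are different words, and share a rhyme part.
function rhymes(a, b) {
  const wa = a.toLowerCase();
  const wb = b.toLowerCase();
  if (wa === wb) return false;
  const pa = PRONUNCIATIONS.get(wa);
  const pb = PRONUNCIATIONS.get(wb);
  if (!pa || !pb) return false;
  return rhymePart(pa) === rhymePart(pb);
}

console.log(rhymes('tough', 'rough'));   // true
console.log(rhymes('tough', 'dough'));   // false
console.log(rhymes('dough', 'though'));  // true
```

Spelling-based measures would score tough and dough as close, while the phoneme comparison separates them.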

Did you try the soundex() function? It should give you at least some indication if the words sound alike.
