Best way in php to find most similar strings? - php

Hello,
PHP has a lot of string functions like levenshtein, similar_text and soundex that can compare strings for similarity.
http://www.php.net/manual/en/function.levenshtein.php
Which is the best for accuracy and performance?

similar_text() has a complexity of O(max(n,m)^3) and levenshtein() a complexity of O(n*m), where n and m are the lengths of the strings, so levenshtein() should be much faster. Both are 100% accurate in the sense that they are deterministic - they always give the same output for the same input - but they measure different things, so their outputs will differ from each other. If you are using a different measure of accuracy, you'll have to write your own comparison function.
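For illustration, a minimal sketch of how differently the two built-ins report similarity for the same input:

<?php
// levenshtein(): number of single-character edits (lower = more similar).
// similar_text(): number of matching characters, with an optional third
// argument that receives the similarity as a percentage (higher = more similar).
$a = 'apple';
$b = 'appel';

echo levenshtein($a, $b), PHP_EOL;   // 2 edits

similar_text($a, $b, $percent);
echo round($percent, 1), PHP_EOL;    // 80 (percent of matching characters)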

You did not describe your use case, but in many cases involving natural language, words matter more than characters, so both similar_text() and levenshtein() may give less meaningful results at a very high computational cost.
For example, searching a database of even a few thousand articles for ones with similar titles using the functions above can easily clog up a server.
What I usually do is write a simple function that accepts two strings, splits each at whitespace into an array of words, and counts the intersection to get a more natural matching score at a low CPU cost.
With a few improvements it can really excel in several use cases, such as quickly recommending related articles on a blog, filtered from other content (a rough sketch follows at the end of this answer).
Improvements I usually implement:
lowercase the strings
give each matched word a score equal to its length raised to the power of 2, since longer words are harder to match by chance and tend to indicate a more meaningful similarity between topics
throw out common words that only modulate meaning before comparison - this is language specific; in English it might be a list such as: was, were, no, not, than, then, here, there, etc.
throw out all punctuation marks from the strings before comparison
when dealing with synthetic languages that may attach various endings, enrich the array of words with variants truncated by the most common suffix lengths before taking the intersection
It is not perfect, but for comparison: this algorithm processes roughly 5,000 blog posts and returns three very good similar articles with no noticeable performance impact, while doing the same with levenshtein on the same server takes a good 10-15 seconds, which is obviously not acceptable for a page load.
And if you need difference instead of similarity, the score can be inverted, or you could simply use the non-matching terms left after an array diff instead of counting the matching terms from an array intersect.
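A rough sketch of such a function (the tokenizer, the stop-word list and the squared-length weighting here are just example choices, not a canonical implementation):

<?php
// Sketch: score two strings by the words they share, weighting longer
// matched words more heavily (length squared), after lowercasing,
// stripping punctuation and dropping stop words.
function wordMatchScore(string $a, string $b, array $stopWords = []): float
{
    $tokenize = function (string $s) use ($stopWords): array {
        $s = mb_strtolower($s);
        $s = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $s);    // drop punctuation
        $words = preg_split('/\s+/u', trim($s), -1, PREG_SPLIT_NO_EMPTY);
        return array_diff(array_unique($words), $stopWords);  // dedupe, drop stop words
    };

    $score = 0.0;
    foreach (array_intersect($tokenize($a), $tokenize($b)) as $word) {
        $score += mb_strlen($word) ** 2;   // longer shared words count for more
    }
    return $score;
}

$stop = ['was', 'were', 'no', 'not', 'than', 'then', 'here', 'there'];
echo wordMatchScore('Speeding up levenshtein in PHP',
                    'How to speed up levenshtein / similar_text in PHP', $stop);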

Related

Appropriate data structure for faster retrieval process (data size: around 200,000 values all string)

I have a large data set of around 200,000 values, all of them strings. Which data structure should I use so that searching and retrieval are fast? Insertion is one-time, so even if insertion is slow it wouldn't matter much.
A hash map could be one solution, but what are the other choices?
Thanks
Edit:
some pointers
1. I am looking for exact matches and not the partial ones.
2. I have to accomplish this in PHP.
3. Is there any way I can keep this amount of data in a cache, in the form of a tree or some other format?
You really should consider not using maps or hash dictionaries if all you need is a string lookup. When using those, your complexity guarantees for N items and a lookup string of size M are O(M x log(N)) or, best amortised for the hash, O(M) with a large constant multiplier. It is much more efficient to use an acyclic deterministic finite automaton (ADFA) for basic lookups, or a trie if there is a need to associate data. These walk the data structure one character at a time, giving O(M) complexity with a very small multiplier.
Basically, you want a data structure that parses your string as it is consumed by the data structure, not one that must do full string compares at each node of the lookup. The common orders of complexity you see thrown around for red-black trees and the like assume O(1) compares, which is not true for strings. Strings are O(M), and that propagates to all the compares used.
Maybe a trie data structure.
A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings
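A minimal array-based trie sketch in PHP, for exact lookups only (illustrative, not from the original answers):

<?php
// Sketch: each node is a PHP array of children keyed by character,
// with a special key marking that a word ends at that node.
class Trie
{
    private $root = [];

    public function insert(string $word): void
    {
        $node = &$this->root;
        foreach (str_split($word) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];   // descend, creating nodes as needed
        }
        $node['#end'] = true;        // marker: a word terminates here
    }

    public function contains(string $word): bool
    {
        $node = $this->root;
        foreach (str_split($word) as $char) {
            if (!isset($node[$char])) {
                return false;
            }
            $node = $node[$char];
        }
        return isset($node['#end']);
    }
}

$trie = new Trie();
$trie->insert('apple');
$trie->insert('applet');
var_dump($trie->contains('apple'));  // true
var_dump($trie->contains('appl'));   // false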
Use a TreeMap in that case; search and retrieval will be O(log n). With a HashMap, search can be O(n) in the worst case, but retrieval is O(1).
For 200000 values, it probably won't matter much though unless you are working with hardware constraints. I have used HashMaps with 2 million Strings and they were still fast enough. YMMV.
You can use B+ trees if you want to ensure your search is minimal at the cost of insertion time.
You can also try bucket push and search.
Use a hashmap. Assuming implementation similar to Java's, and a normal collision rate, retrieval is O(m) - the main cost is computing the hashcode and then one string-compare. That's hard to beat.
For any tree/trie implementation, factor in the hard-to-quantify costs of the additional pipeline stalls caused by additional non-localized data fetches. The only reason to use one (a trie, in particular) would be to possibly save memory. Memory will be saved only with long strings. With short strings, the memory savings from reduced character storage are more than offset by all the additional pointers/indices.
Fine print: worse behavior can occur when there are lots of hashcode collisions due to an ill-chosen hashing function. Your mileage may vary. But it probably won't.
I don't do PHP - there may be language characteristics that skew the answer here.

How can I create a threshold for similar strings using Levenshtein distance and account for typos?

We recently encountered an interesting problem at work where we discovered duplicate user-submitted data in our database. We realized that the Levenshtein distance between most of this data was simply the difference between the two strings in question. That indicates that if we simply add characters from one string into the other then we end up with the same string, and for most things this seems like the best way for us to account for items that are duplicates.
We also want to account for typos. So we started to think about how often, on average, people make typos online per word, and tried to use that data within this distance. We could not find any such statistic.
Is there any way to account for typos when creating this sort of threshold for a match of data?
Let me know if I can clarify!
First off, Levenshtein distance is defined as the minimum number of edits required to transform string A into string B, where an edit is the insertion or deletion of a single character, or the replacement of one character with another. So it's very much the "difference between two strings", for a certain definition of distance. =)
It sounds like you're looking for a distance function F(A, B) that gives a distance between strings A and B and a threshold N where strings with distance less than N from each other are candidates for typos. In addition to Levenshtein distance you might also consider Needleman–Wunsch. It's basically the same thing but it lets you provide a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards though.
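As a sketch of the weighted-substitution idea (the tiny adjacency map and the 0.5 cost below are made-up values for illustration, not the answer's own code):

<?php
// Sketch: weighted edit distance. Substituting a character for one of its
// keyboard neighbours costs 0.5 instead of 1, so likely typos score closer.
$adjacent = [
    'q' => ['w', 'a'],
    'w' => ['q', 'e', 's'],
    'e' => ['w', 'r', 'd'],
    'r' => ['e', 't', 'f'],
    't' => ['r', 'y', 'g'],
    // ... the rest of the keyboard would be filled in the same way
];

function weightedDistance(string $a, string $b, array $adjacent): float
{
    $m = strlen($a);
    $n = strlen($b);
    $d = array_fill(0, $m + 1, array_fill(0, $n + 1, 0.0));
    for ($i = 0; $i <= $m; $i++) { $d[$i][0] = (float) $i; }  // deletions
    for ($j = 0; $j <= $n; $j++) { $d[0][$j] = (float) $j; }  // insertions

    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $ca = $a[$i - 1];
            $cb = $b[$j - 1];
            if ($ca === $cb) {
                $sub = 0.0;                                    // same character
            } elseif (in_array($cb, $adjacent[$ca] ?? [], true)) {
                $sub = 0.5;                                    // neighbouring key
            } else {
                $sub = 1.0;                                    // unrelated character
            }
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1.0,       // deletion
                $d[$i][$j - 1] + 1.0,       // insertion
                $d[$i - 1][$j - 1] + $sub   // (weighted) substitution
            );
        }
    }
    return $d[$m][$n];
}

echo weightedDistance('test', 'tesr', $adjacent), PHP_EOL;  // 0.5 (r sits next to t)
echo weightedDistance('test', 'tesm', $adjacent), PHP_EOL;  // 1.0 (m does not)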
If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:
Apologies if this is obvious, but Levenshtein distance is symmetrical, so make sure you aren't computing F(A, B) and F(B, A).
abs(len(A) - len(B)) is a lower bound on the distance between strings A and B. So you can skip checking strings whose lengths are too different.
One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those to be identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", etc. That dictionary might get pretty big, but I don't know a better way to deal with this issue.
Since you tagged this question with php, I'm assuming you want to use PHP for this. PHP has a built-in levenshtein() function, but both strings have to be 255 characters or less. If that's not long enough you'll have to make your own. Alternatively, you could investigate using Python's difflib.
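To make the length lower bound and the threshold concrete, a small sketch (the threshold of 2 is just an example value):

<?php
// Sketch: flag likely-duplicate pairs whose Levenshtein distance is within
// a small threshold, skipping pairs whose length difference alone already
// exceeds that threshold (so the expensive call is never made for them).
function findLikelyDuplicates(array $strings, int $threshold = 2): array
{
    $pairs = [];
    $count = count($strings);
    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {   // distance is symmetric: check each pair once
            $a = $strings[$i];
            $b = $strings[$j];
            if (abs(strlen($a) - strlen($b)) > $threshold) {
                continue;                         // lower bound already too large
            }
            if (levenshtein($a, $b) <= $threshold) {
                $pairs[] = [$a, $b];
            }
        }
    }
    return $pairs;
}

print_r(findLikelyDuplicates(['1st Steet', '1st Street', 'First Street', 'Main Road']));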
You should check out this book:
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
It has a good chapter (3.3) on spell checking.
The references at the end of the chapter list some papers that discuss probabilistic models.
Good luck

Speeding up levenshtein / similar_text in PHP

I am currently using similar_text to compare a string against a list of ~50,000 strings, which works, although due to the number of comparisons it's very slow. It takes around 11 minutes to compare ~500 unique strings.
Before running this I check the database to see whether the string has been processed in the past, so every run after the initial one is close to instant.
I'm sure using levenshtein would be slightly faster, and the LevenshteinDistance function someone posted in the manual looks interesting. Am I missing something that could make this significantly faster?
In the end, both levenshtein and similar_text were too slow with the number of strings they had to go through, even with lots of checks and only using one of them as a last resort.
As an experiment, I ported some of the code to C# to see how much faster it would be over interpreted code. It ran in about 3 minutes with the same dataset.
Next I added an extra field to the table and used the double metaphone PECL extension to generate keys for each row. The results were good, although since some strings included numbers this caused duplicates. I guess I could then have run each one through the above functions, but decided not to.
In the end I opted for the simplest approach, MySQL's full-text search, which worked very well. Occasionally there are mistakes, although they are easy to detect and correct. It also runs very fast, in around 3-4 seconds.
Perhaps you could 'short-circuit' some checks by first testing your string for an exact match (and, before that, by checking whether the lengths are identical), and if it matches, skip the more expensive similar_text call.
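Something along these lines, as a sketch (the 90% threshold and the length pre-filter factor are arbitrary examples):

<?php
// Sketch: return the best candidate above a similarity threshold, bailing out
// early on an exact match and skipping candidates whose length is far off.
function bestMatch(string $needle, array $candidates, float $minPercent = 90.0): ?string
{
    $best = null;
    $bestPercent = $minPercent;
    foreach ($candidates as $candidate) {
        if ($candidate === $needle) {
            return $candidate;                    // exact match: stop immediately
        }
        // Heuristic pre-filter: very different lengths rarely score highly.
        if (abs(strlen($candidate) - strlen($needle)) > strlen($needle) * 0.3) {
            continue;
        }
        similar_text($needle, $candidate, $percent);  // the expensive call, now rarer
        if ($percent > $bestPercent) {
            $bestPercent = $percent;
            $best = $candidate;
        }
    }
    return $best;
}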
As @jason noted, an O(N^3) algorithm is never going to be a good choice.
With a Levenshtein automaton (an automaton that matches every string within distance k of a base string) you can check a string for a match in O(n), where n is the length of the string you are checking. Constructing the automaton takes O(kn), where k is the maximum distance and n the length of the base string.

PHP - How to suggest terms for search, "did you mean...?"

When searching the db with terms that retrieve no results I want to allow "did you mean..." suggestion (like Google).
So for example if someone looks for "jquyer", it would output "did you mean jquery?"
Of course, suggestion results have to be matched against the values inside the db (I'm using MySQL).
Do you know a library that can do this? I've googled this but haven't found any great results.
Or perhaps you have an idea how to construct this on my own?
A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.
In a nutshell, the SOUNDEX function was originally used to deal with common typos and alternate spellings of family names, and it encapsulates very well many common spelling mistakes (in the English language). Because of its focus on family names, the original Soundex function may be limiting (for example, encoding stops after the third or fourth non-repeating consonant), but it is easy to extend the algorithm.
The interest of this type of function is that it allows computing, ahead of time, a single value which can be associated with the word. This is unlike string distance functions such as edit distance functions (such as Levenshtein, Hamming or even Ratcliff/Obershelp) which provide a value relative to a pair of strings.
By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run-time, quickly search the dictionary/database based on the [run-time] calculated SOUNDEX value of the user-supplied search terms. This Soundex search can be done systematically, as complement to the plain keyword search, or only performed when the keyword search didn't yield a satisfactory number of records, hence providing the hint that maybe the user-supplied keyword(s) is (are) misspelled.
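A sketch of that precompute-and-index flow in PHP (the words table, its columns and the connection details are hypothetical):

<?php
// Sketch: store a SOUNDEX code alongside each dictionary word once,
// then look misspelled search terms up by code at query time.
// Assumes a hypothetical `words` table with columns `word` and `soundex_code`.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Indexing step (run once per dictionary word):
$insert = $pdo->prepare('INSERT INTO words (word, soundex_code) VALUES (?, ?)');
foreach (['jquery', 'javascript', 'mysql'] as $word) {
    $insert->execute([$word, soundex($word)]);
}

// Query time: a misspelling usually maps to the same code as the intended word.
$select = $pdo->prepare('SELECT word FROM words WHERE soundex_code = ?');
$select->execute([soundex('jquyer')]);
print_r($select->fetchAll(PDO::FETCH_COLUMN));  // likely includes "jquery" (same code)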
A totally different approach, only applicable to user queries which include several words, is based on running multiple queries against the dictionary/database, excluding one (or several) of the user-supplied keywords. These alternate queries' result lists provide a list of distinct words; this [reduced] list of words is typically small enough that pair-based distance functions can be applied to select, within the list, the words which are closest to the allegedly misspelled word(s). The word frequency (within the result lists) can be used both to limit the number of words (only evaluate similarity for words which are found more than x times) and to provide weight, to slightly skew the similarity measurements (i.e. favoring words found "in quantity" in the database, even if their similarity measurement is slightly lower).
How about the levenshtein function, or similar_text function?
Actually, I believe Google's "did you mean" function is generated by what users type in after they've made a typo. However, that's obviously a lot easier for them since they have unbelievable amounts of data.
You could use Levenshtein distance as mgroves suggested (or Soundex), but store results in a database. Or, run separate scripts based on common misspellings and your most popular misspelled search terms.
http://www.phpclasses.org/browse/package/4859.html
Here's an off-the-shelf class that's rather easy to implement, which employs minimum edit distance. All you need to do is have a token (not type) list of all the words you want to work with handy. My suggestion is to make sure it's the complete list of words within your search index, and only within your search index. This helps in two ways:
Domain specificity helps avoid misleading probabilities from overtaking your implementation
Ex: "Memoize" may be spell-corrected to "Memorize" for most off-the-shelf, dictionaries, but that's a perfectly good search term for a computer science page.
Proper nouns that are available within your search index are now accounted for.
Ex: If you're Dell, and someone searches for 'inspiran', there's absolutely no chance the spell-correct function will know you mean 'inspiron'. It will probably spell-correct to 'inspiring' or something more common, and, again, less domain-specific.
When I did this a couple of years ago, I already had a custom built index of words that the search engine used. I studied what kinds of errors people made the most (based on logs) and sorted the suggestions based on how common the mistake was.
If someone searched for jQuery, I would build a select-statement that went
SELECT Word, 1 AS Relevance
FROM keywords
WHERE Word IN ('qjuery','juqery','jqeury' etc)
UNION
SELECT Word, 2 AS Relevance
FROM keywords
WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc etc
ORDER BY Relevance, Word
The resulting words were my suggestions and it worked really well.
You should keep track of common misspellings that come through your search (or generate some yourself with a typo generator) and store the misspelling and the word it matches in a database. Then, when you have nothing matching any search results, you can check against the misspelling table, and use the suggested word.
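A sketch of that fallback (the misspellings table and its columns are invented for illustration):

<?php
// Sketch: when a search returns nothing, consult a misspellings table
// (hypothetical columns: misspelling, suggestion) for a "did you mean".
function suggestTerm(PDO $pdo, string $term): ?string
{
    $stmt = $pdo->prepare('SELECT suggestion FROM misspellings WHERE misspelling = ?');
    $stmt->execute([mb_strtolower($term)]);
    $suggestion = $stmt->fetchColumn();
    return $suggestion !== false ? $suggestion : null;
}

// Usage, after an empty result set:
// if ($hint = suggestTerm($pdo, $query)) {
//     echo 'Did you mean "' . htmlspecialchars($hint) . '"?';
// }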
Writing your own custom solution will take quite some time and is not guaranteed to work if your dataset isn't big enough, so I'd recommend using an API from a search giant such as Yahoo. Yahoo's results aren't as good as Google's but I'm not sure whether Google's is meant to be public.
You can simply use an API like this one: https://www.mashape.com/marrouchi/did-you-mean

Implementing keyword comparison scheme (reverse search)

I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds etc) and find which keywords from the database are present in the text. The database of keywords is much larger than the text.
Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).
The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.
SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');
Another approach would be to create a hash-table in memory (using something like memcache) and to check against it in the same manner.
Does anyone have any experience with this kind of searching, and any suggestions on how to better implement it? I haven't tried either of those approaches yet; I'm just gathering ideas at this point.
The classic way of searching a text stream for multiple keywords is the Aho-Corasick finite automaton, which uses time linear in the text to be searched. You'll want minor adaptations to recognize strings only on word boundaries, or perhaps it would be simpler just to check the keywords found and make sure they are not embedded in larger words.
You can find an implementation in fgrep. Even better, Preston Briggs wrote a pretty nice implementation in C that does exactly the kind of keyword search you are talking about. (It searches programs for occurrences of 'interesting' identifiers.) Preston's implementation is distributed as part of the Noweb literate-programming tool. You could find a way to call this code from PHP, or you could rewrite it in PHP - the recognizer itself is about 220 lines of C, and the main program is another 135 lines.
All the proposed solutions, including Aho-Corasick, have these properties in common:
A preprocessing step that takes time and space proportional to the number of keywords in the database.
A search step that takes time and space proportional to the length of the text plus the number of keywords found.
Aho-Corasick offers considerably better constants of proportionality on the search step, but if your texts are small, this won't matter. In fact, if your texts are small and your database is large, you probably want to minimize the amount of memory used in the preprocessing step. Andrew Appel's DAWG data structure from the world's fastest scrabble program will probably do the trick.
In general,
a. break the text into words
b. convert words back to canonical root form
c. drop common conjunction words
d. strip duplicates
then insert the words into a temporary table and do an inner join against the keywords table, or (as you suggested) build the keywords into a complex query criterion (a rough sketch follows at the end of this answer).
It may be worthwhile to cache a 3- or 4-letter hash array with which to pre-filter potential keywords; you will have to experiment to find the best tradeoff between memory size and effectiveness.
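A rough sketch of steps a-d plus the lookup (the "canonical root form" step is reduced to plain lowercasing here; the table layout follows the query in the question):

<?php
// Sketch: tokenize the incoming text, then ask the keywords table which of
// those words are being watched. Table layout (user_id, keyword) follows the
// query in the question; the stemming step is reduced to lowercasing here.
function matchKeywords(PDO $pdo, string $text, array $stopWords): array
{
    // a. break the text into words (b. lowercase as a crude canonical form)
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // c. drop common words, d. strip duplicates
    $words = array_values(array_unique(array_diff($words, $stopWords)));
    if ($words === []) {
        return [];
    }

    // Parameterized IN (...) clause, one placeholder per word.
    $placeholders = implode(',', array_fill(0, count($words), '?'));
    $stmt = $pdo->prepare(
        "SELECT user_id, keyword FROM keywords WHERE keyword IN ($placeholders)"
    );
    $stmt->execute($words);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}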
I'm not 100% clear on what you're asking, but maybe what you're looking for is an inverted index?
Update:
You can use an inverted index to match multiple keywords at once.
Split up the new document into tokens, and insert the tokens paired with an identifier for the document into the inverted index table. A (rather denormalized) inverted index table:
inverted_index
-----
document_id keyword
If you're searching for 3 keywords manually:
select document_id, count(*) from inverted_index
where keyword in (keyword1, keyword2, keyword3)
group by document_id
having count(*) = 3
If you have a table of the keywords you care about, just use an inner join rather than an in() operation:
keyword_table
----
keyword othercols
select keyword_table.keyword, keyword_table.othercols from inverted_index
inner join keyword_table on keyword_table.keyword=inverted_index.keyword
where inverted_index.document_id=id_of_some_new_document
Is any of this closer to what you want?
Have you considered graduating to a fulltext solution such as Sphinx?
I'm talking out of my hat here, because I haven't used it myself. But it's getting a lot of attention as a high-speed fulltext search solution. It will probably scale better than any relational solution you use.
Here's a blog about using Sphinx as a fulltext search solution in MySQL.
I would do 2 things here.
First (and this isn't directly related to the question), I'd break up and partition the user keywords by user: more tables with less data in each, ideally on different servers for distributed lookups, with slices or ranges of users living on different slices. I.e., all of user A's data lives on slice one, user B's on slice two, etc.
Second, I'd have some sort of in-memory hash table to determine the existence of keywords. This would likely be federated as well, to distribute the lookups: for n keyword-existence servers, hash the keyword, mod it by n, and distribute ranges of those keys across the memcached servers. This quick check lets you ask "is keyword x being watched?" - hash it, determine which server it would live on, then make the lookup and collect/aggregate the keywords being tracked.
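The "hash it and mod it by n" routing might look something like this sketch (the server list, key convention and use of the Memcached extension are assumptions for the example):

<?php
// Sketch: route each keyword-existence check to one of n cache servers by
// hashing the keyword and taking the result modulo the server count.
$servers = [
    ['host' => 'cache1.example.com', 'port' => 11211],
    ['host' => 'cache2.example.com', 'port' => 11211],
    ['host' => 'cache3.example.com', 'port' => 11211],
];

function serverFor(string $keyword, array $servers): array
{
    // The mask keeps the hash non-negative on 32-bit builds.
    $index = (crc32(mb_strtolower($keyword)) & 0x7fffffff) % count($servers);
    return $servers[$index];
}

function isKeywordWatched(string $keyword, array $servers): bool
{
    $server = serverFor($keyword, $servers);
    $cache = new Memcached();
    $cache->addServer($server['host'], $server['port']);
    // Convention (illustrative): the key "kw:<keyword>" exists only if someone watches it.
    return $cache->get('kw:' . mb_strtolower($keyword)) !== false;
}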
At that point you'll at least know which keywords are being tracked and you can take your user slices and perform subsequent lookups to determine which users are tracking which keywords.
In short: SQL is not an ideal solution here.
I hacked up some code for scanning for multiple keywords using a DAWG (as suggested above, referencing the Scrabble paper), although I wrote it from first principles and I don't know whether it is anything like the Aho-Corasick algorithm or not.
http://www.gtoal.com/wordgames/spell/multiscan.c.html
A friend made some hacks to my code after I first posted it on the wordgame programmers mailing list, and his version is probably more efficient:
http://www.gtoal.com/wordgames/spell/multidawg.c.html
Scales fairly well...
G
