Speeding up levenshtein / similar_text in PHP - php

I am currently using similar_text to compare a string against a list of ~50,000, which works, although due to the number of comparisons it's very slow: it takes around 11 minutes to compare ~500 unique strings.
Before running this I check the database to see whether the string has been processed in the past, so every time after the initial run it's close to instant.
I'm sure using levenshtein would be slightly faster, and the LevenshteinDistance function someone posted in the manual looks interesting. Am I missing something that could make this significantly faster?

In the end, both levenshtein and similar_text were too slow given the number of strings they had to go through, even with lots of pre-checks and only falling back to one of them as a last resort.
As an experiment, I ported some of the code to C# to see how much faster it would be over interpreted code. It ran in about 3 minutes with the same dataset.
Next I added an extra field to the table and used the double metaphone PECL extension to generate keys for each row. The results were good, although since some strings included numbers, this produced duplicate keys. I suppose I could then have run each of those through the above functions, but decided not to.
In the end I opted for the simplest approach: MySQL's full-text search, which worked very well. Occasionally there are mistakes, but they are easy to detect and correct. It also runs very fast, in around 3-4 seconds.

Perhaps you could 'short-circuit' some checks by first testing your string for an exact match (and, before that, checking whether the lengths are close enough), and if so skip the more expensive similar_text call.
As @jason noted, an O(N^3) algorithm is never going to be a good choice.
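A sketch of that short-circuit idea (the function name and the 80% threshold are illustrative; the length filter uses the fact that similar_text's percentage can never exceed 200·min(n,m)/(n+m), since matched characters are bounded by the shorter string):

```php
<?php
// Cheap pre-checks before the expensive similar_text() call.
// $needle is the string being matched; $candidates is the big list.
function findSimilar(string $needle, array $candidates, float $minPct = 80.0): array
{
    $matches = [];
    $len = strlen($needle);
    foreach ($candidates as $candidate) {
        // 1. Exact match: no similarity scoring needed at all.
        if ($candidate === $needle) {
            $matches[] = [$candidate, 100.0];
            continue;
        }
        // 2. Length filter: matched chars <= min(len), so the best
        //    possible percentage is 200 * min / (len1 + len2).
        $clen = strlen($candidate);
        $maxPct = 200 * min($len, $clen) / max(1, $len + $clen);
        if ($maxPct < $minPct) {
            continue; // can never reach the threshold, skip early
        }
        // 3. Only now pay for the O(N^3) comparison.
        similar_text($needle, $candidate, $pct);
        if ($pct >= $minPct) {
            $matches[] = [$candidate, $pct];
        }
    }
    return $matches;
}
```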

When using a Levenshtein automaton (an automaton that matches strings within distance k of a base string), you can check a candidate in O(n), where n is the length of the string you are checking. Constructing the automaton takes O(kn), where k is the maximum distance and n the length of the base string.
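Building a full Levenshtein automaton is involved; a simpler stand-in that captures much of the benefit is a k-bounded distance check that abandons the computation as soon as every cell in a DP row exceeds k. This is not the automaton itself, just a cheap bounded check, sketched here:

```php
<?php
// k-bounded Levenshtein: returns the distance if it is <= $k, or
// $k + 1 as soon as it can prove the distance exceeds $k.
function boundedLevenshtein(string $a, string $b, int $k): int
{
    $n = strlen($a);
    $m = strlen($b);
    if (abs($n - $m) > $k) {
        return $k + 1; // length difference alone exceeds the bound
    }
    $prev = range(0, $m); // classic DP, one row at a time
    for ($i = 1; $i <= $n; $i++) {
        $curr = [$i];
        $rowMin = $i;
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min($prev[$j] + 1, $curr[$j - 1] + 1, $prev[$j - 1] + $cost);
            $rowMin = min($rowMin, $curr[$j]);
        }
        if ($rowMin > $k) {
            return $k + 1; // every cell already exceeds k: early exit
        }
        $prev = $curr;
    }
    return min($prev[$m], $k + 1);
}
```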

Related

Mathematical formula string to php variables and operators

I have the following problem. I find it difficult to explain, as I am a hobby coder. Please forgive me if I commit any major faux pas:
I am working on a baseball database that deals with baseball-specific and non-specific metrics/stats. Most of the data output is pretty simple when the metrics are cumulative, or when I only want to display the dataset of one day that has been entered manually or imported from a CSV. (All percentages are calculated and fed into the db as numbers.)
For example, if I put in the stats
{ Hits: 1, Walks: 2, AB: 2, Bavg: 0.500 }
for day one, and
{ Hits: 2, Walks: 2, AB: 6, Bavg: 0.333 }
for day two, and then try to get the totals, Hits, Walks and ABs are a simple SUM. But Bavg has to be a formula (Hits/AB). Other (non-baseball-specific) metrics, like vertical jump or 60-yard times, are pretty straightforward too: MAX or MIN.
The user should be able to add his own metrics. So he has to be able to input a formula for the calculation. This calculation is stored in the database table as a column next to the metric name, id, type (cumulative, max, min, calculated).
The PHP script that produces the HTML table is set up to be dynamic with respect to whatever metrics, and however many of them, the query sends (a metric can be part of several categories).
In the end result, I want to replace all values of metrics of the calculated types with their formula.
My approach is to get the formula from the MySQL table as a string. Then, in PHP, convert a string such as Strikes/Pitches*100 into $strikes/$pitches*100 - assuming that is something I could put into a PHP SQL query. However, before it is put into the $strikes/$pitches*100 format, I need to have those variables available to define them. That I'm sure I can do, but I'll cross that bridge when I get there.
Could you point me in the right direction of either how to accomplish that or tell where or what to search for? I'm sure this has been done before somewhere...
I highly appreciate any help!
Clemens
The correct solution has already been given by Vilx-. So I will give you a not-so-correct, dirty solution.
As the correct solution states, eval is evil. But, it is also easy and powerful (as evil often is -- but I'll spare you my "hhh join the Dark Side, Luke hhh" spiel).
And since what you need to do is a very small and simple subset of SQL, you actually can use eval() - or, even better, its SQL equivalent: plugging user-supplied code into a SQL query - as long as you do it safely. With requirements this small, that is possible.
(In the general case it absolutely is not. So keep it in mind - this solution is easy, quick, but does not scale. If the program grows beyond a certain complexity, you'll have to adopt Vilx-'s solution anyway).
You can verify the user-supplied string to ensure that, while it might not be syntactically or logically correct, at least it won't execute arbitrary code.
This is okay:
SELECT SUM(pitch)+AVG(runs)-5*(MIN(balls)) /* or whatever */
and this, while wrong, is harmless too:
SELECT SUM(pitch +
but this absolutely is not (mandatory XKCD reference):
SELECT "Robert'); DROP TABLE Students;--
and this is even worse, since the above would not work on a standard MySQL (that doesn't allow multiple statements by default), while this would:
SELECT SLEEP(3600)
So how do we tell the harmless from the harmful? We start by defining placeholder variables that you can use in your formula. Let us say they will always be in the form {name}. So, get those - which we know to be safe - out of the formula:
$verify = preg_replace('#{[a-z]+}#', '', $formula);
Then, arithmetic operators are also removed; they are safe too.
$verify = preg_replace('#[+*/-]#', '', $verify);
Then numbers and things that look like numbers:
$verify = preg_replace('#[0-9.]+#', '', $verify);
Finally, a certain number of functions you trust. The arguments of these functions may have been variables or combinations of variables; those have already been deleted, so a function now has no arguments - say, SUM() - or you may have nested functions or nested parentheses, like SUM (SUM( ()())).
You keep replacing () (with any spaces inside) with a single space until the replacement no longer finds anything:
do {
    $old = $verify;
    $verify = preg_replace('#\s*\(\s*\)\s*#', ' ', $verify);
} while ($old !== $verify);
Now you remove from the result the occurrences of any function you trust, as an entire word:
do {
    $old = $verify;
    $verify = preg_replace('#\b(SUM|AVG|MIN|MAX)\b#', ' ', $verify);
} while ($old !== $verify);
The last two steps have to be merged because you might have both nested parentheses and functions, interfering with one another:
do {
    $old = $verify;
    $verify = preg_replace('#\s*(\b(SUM|AVG|MIN|MAX)\b|\(\s*\))\s*#', ' ', $verify);
} while ($old !== $verify);
And at this point, if you are left with nothing, it means the original string was harmless (at worst it could have triggered a division by 0, or a SQL exception if it was syntactically wrong). If instead you're left with something, the formula is rejected and never saved in the database.
When you have a valid formula, you can replace variables using preg_replace_callback() so that they become numbers (or names of columns). You're left with what is either valid, innocuous SQL code, or incorrect SQL code. You can plug this directly into the query, after wrapping it in try/catch to intercept any PDOException or division by zero.
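Putting the steps above together, a compact validator might look like the following sketch (the {name} placeholder syntax and the trusted-function list come from the steps above; the do/while restates the replace-until-stable loop):

```php
<?php
// Returns true if, after stripping everything we consider safe,
// nothing is left of the formula - i.e. it is at worst syntactically
// wrong SQL, never arbitrary code.
function isFormulaSafe(string $formula): bool
{
    $verify = preg_replace('#{[a-z]+}#', '', $formula); // {name} placeholders
    $verify = preg_replace('#[+*/-]#', '', $verify);    // arithmetic operators
    $verify = preg_replace('#[0-9.]+#', '', $verify);   // numeric literals
    // Repeatedly strip trusted function names and empty parentheses,
    // since nesting lets them interfere with one another.
    do {
        $old = $verify;
        $verify = preg_replace('#\s*(\b(SUM|AVG|MIN|MAX)\b|\(\s*\))\s*#', ' ', $verify);
    } while ($old !== $verify);
    return trim($verify) === '';
}
```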
I'll assume that the requirement is indeed to allow the user to enter arbitrary formulas. As noted in the comments, this is indeed no small task, so if you can settle for something less, I'd advise doing so. But, assuming that nothing less will do, let's see what can be done.
The simplest idea is, of course, to use PHP's eval() function. You can execute arbitrary PHP code from a string. All you need to do is to set up all necessary variables beforehand and grab the return value.
It does have drawbacks however. The biggest one is security. You're essentially executing arbitrary user-supplied code on your server. And it can do ANYTHING that your own code can. Access files, use your database connections, change variables, whatever. Unless you completely trust your users, this is a security disaster.
Also syntax or runtime errors can throw off the rest of your script. And eval() is pretty slow too, since it has to parse the code every time. Maybe not a big deal in your particular case, but worth keeping an eye on.
All in all, in every language that has an eval() function, it is almost universally considered evil and to be avoided at all costs.
So what's the alternative? Well, a dedicated formula parser/executor would be nice. I've written one a few times, but it's far from trivial. The job is easier if the formula is written in Polish notation or Reverse Polish Notation, but those are a pain to write unless you've practiced. For normal formulas, take a look at the Shunting Yard Algorithm. It's straightforward enough and can be easily adapted to functions and whatnot. But it's still fairly tedious.
So, unless you want to do it as a fun challenge, look for a library that has already done it. There seem to be a bunch of them out there. Search for something along the lines of "arithmetic expression parser library php".
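To give a sense of what such a parser does internally, here is a compact Shunting Yard sketch (assumes PHP 8; handles only numbers, +, -, *, / and parentheses - no functions, no error handling - so substitute variables with their values before calling):

```php
<?php
// Convert an infix formula to Reverse Polish Notation (Shunting Yard),
// then evaluate the RPN with a value stack.
function toRpn(string $expr): array
{
    preg_match_all('#\d+(?:\.\d+)?|[+*/()-]#', $expr, $m);
    $prec = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];
    $out = [];
    $ops = [];
    foreach ($m[0] as $tok) {
        if (is_numeric($tok)) {
            $out[] = $tok;
        } elseif ($tok === '(') {
            $ops[] = $tok;
        } elseif ($tok === ')') {
            while (end($ops) !== '(') {
                $out[] = array_pop($ops);
            }
            array_pop($ops); // discard the '('
        } else {
            // Pop operators of equal or higher precedence first.
            while ($ops && end($ops) !== '(' && $prec[end($ops)] >= $prec[$tok]) {
                $out[] = array_pop($ops);
            }
            $ops[] = $tok;
        }
    }
    return array_merge($out, array_reverse($ops));
}

function evalRpn(array $rpn): float
{
    $stack = [];
    foreach ($rpn as $tok) {
        if (is_numeric($tok)) {
            $stack[] = (float)$tok;
            continue;
        }
        $b = array_pop($stack);
        $a = array_pop($stack);
        $stack[] = match ($tok) {
            '+' => $a + $b, '-' => $a - $b, '*' => $a * $b, '/' => $a / $b,
        };
    }
    return $stack[0];
}
```

Usage: evalRpn(toRpn('(2+3)*4')) gives 20. Extending it with functions and named variables is exactly the "fairly tedious" part the answer mentions, which is why a ready-made library is usually the better call.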

What is the best strategy to compare two Paragraphs in PHP & MySQL?

I have already developed typing software to capture text typed by candidates in my institutes, using PHP & MySQL. Continuing that work, I am stuck on a strategic issue: how should I compare the similarity of the texts typed by the candidates with the standard paragraph I gave them to type (in the form of a hard copy, though the same copy is also stored in the MySQL database)? My dilemma is whether I should use the Levenshtein distance algorithm in PHP or directly in MySQL, so that performance is optimized. Actually, I am afraid that programming it in PHP might turn out erroneous while evaluating the texts. It is worthwhile to mention here that the texts would be compared to rank candidates on the basis of words typed per minute.
The simplest solution would be to utilize PHP's built-in levenshtein() function to compare the two blocks of text. If you wanted to push the processing off to the MySQL database, you could implement the solution listed in the "Levenshtein: MySQL + PHP" question on Stack Overflow.
Another PHP option might be the similar_text() function.
The unfortunate drawback of the PHP levenshtein() function is that it cannot handle strings longer than 255 characters. As per the PHP manual:
This function returns the Levenshtein distance between the two argument strings or -1, if one of the argument strings is longer than the limit of 255 characters.
So, if your paragraphs are longer than that, you may be forced to implement a MySQL solution. I suppose you could break the paragraphs up into 255-character blocks for comparison (though I can't say definitively that this won't "break" the Levenshtein measure).
I'm not an expert in linguistics parsing and processing, so I can't speak to whether these are the best solutions (as you mention in your question). They are, however, very straightforward and simple to implement.

Best way in php to find most similar strings?

Hello,
PHP has a lot of string functions like levenshtein, similar_text and soundex that can compare strings for similarity.
http://www.php.net/manual/en/function.levenshtein.php
Which is the best for accuracy and performance?
similar_text has a complexity of O(max(n,m)^3) and levenshtein a complexity of O(m*n), where n and m are the lengths of the strings, so levenshtein should be much faster. Both are deterministic, in that they give the same output for the same input, but the outputs of the two functions differ from each other. If you are using a different measure of accuracy, you'll have to create your own comparison function.
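A quick illustration of the two calls and what they return - an edit count for levenshtein() versus a matched-character count (plus an optional percentage via the third argument) for similar_text():

```php
<?php
// levenshtein() returns the number of single-character edits;
// similar_text() returns the count of matching characters and can
// fill in a similarity percentage via its third argument.
$edits = levenshtein('kitten', 'sitting'); // 3 edits: k->s, e->i, +g
$common = similar_text('kitten', 'sitting', $pct);
printf("edits=%d common=%d pct=%.1f\n", $edits, $common, $pct);
```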
You did not describe your use case, but in many cases when we speak about natural language, words are more important than characters, so both similar_text() and levenshtein() may give less meaningful results at a very high computational cost.
For example, searching a database for articles with similar titles using the functions above can easily clog up a server with only a few thousand articles.
What I usually do is write a simple function that accepts two strings, splits them at whitespace into arrays, and counts the intersection, to get a more natural matching score at a low CPU cost.
With a few improvements it can really excel in several use cases, such as quickly recommending articles in a blog, filtered from other content.
Improvements I usually implement:
lowercase the strings
score each matched element by its length raised to the power of 2, since longer words are harder to match and tend to indicate a more meaningful similarity between topics
throw out common words that only modulate meaning before comparison - this is language-specific; in English it may be a list such as: was, were, no, not, than, then, here, there, etc.
throw out all punctuation marks from the strings before comparison
when dealing with synthetic languages, which may attach various endings to words, enrich the array of words with variants truncated by the most common suffix lengths before taking the intersection
It is not perfect, but for comparison: this algorithm processes circa 5,000 blog posts and gives 3 very good similar articles with no noticeable performance impact, while doing the same with levenshtein on the same server takes a good 10-15 seconds, which is obviously not acceptable for a page load.
And if you need difference instead of similarity, the score can be reciprocated, or you could just use the non-matching terms after an array diff instead of counting the matching terms after an array intersect.
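A minimal sketch of such a word-overlap scorer, following the improvements above (the function name and the stopword list are illustrative; suffix truncation for inflected languages is left out):

```php
<?php
// Word-overlap similarity: lowercase, strip punctuation, drop
// stopwords, then score the intersection by squared word length.
function wordScore(string $a, string $b): int
{
    $stop = ['was', 'were', 'no', 'not', 'than', 'then', 'here', 'there',
             'the', 'a', 'an', 'of', 'in', 'on', 'and', 'or'];
    $tokenize = function (string $s) use ($stop): array {
        $s = strtolower($s);
        $s = preg_replace('#[[:punct:]]+#', ' ', $s);       // drop punctuation
        $words = preg_split('#\s+#', $s, -1, PREG_SPLIT_NO_EMPTY);
        return array_diff(array_unique($words), $stop);     // drop stopwords
    };
    $score = 0;
    foreach (array_intersect($tokenize($a), $tokenize($b)) as $word) {
        // Longer matched words suggest a more meaningful topical overlap.
        $score += strlen($word) ** 2;
    }
    return $score;
}
```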

How can I create a threshold for similar strings using Levenshtein distance and account for typos?

We recently encountered an interesting problem at work where we discovered duplicate user-submitted data in our database. We realized that the Levenshtein distance between most of this data was simply the difference in length between the two strings in question: if we simply add characters from one string into the other, we end up with the same string, and for most things this seems like the best way for us to account for items that are duplicates.
We also want to account for typos. So we started to think about how often, on average, people make typos online per word, and tried to use that within this distance threshold. We could not find any such statistic.
Is there any way to account for typos when creating this sort of threshold for a match of data?
Let me know if I can clarify!
First off, Levenshtein distance is defined as the minimum number of edits required to transform string A into string B, where an edit is the insertion or deletion of a single character, or the replacement of one character with another. So it's very much "the difference between two strings", for a certain definition of distance. =)
It sounds like you're looking for a distance function F(A, B) that gives a distance between strings A and B and a threshold N where strings with distance less than N from each other are candidates for typos. In addition to Levenshtein distance you might also consider Needleman–Wunsch. It's basically the same thing but it lets you provide a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards though.
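A sketch of that weighted-distance idea: the standard Levenshtein DP, but with a substitution-cost function that charges less when the two characters are keyboard neighbours. The $adjacent map here is a tiny illustrative fragment of a full QWERTY adjacency table, and the 0.5 discount is an arbitrary choice:

```php
<?php
// Weighted edit distance: neighbouring keys are a likely typo, so
// substituting one for the other costs half as much. Assumes PHP 8.
function weightedDistance(string $a, string $b): float
{
    // Illustrative fragment only - a real table covers the whole keyboard.
    $adjacent = ['a' => 'qwsz', 's' => 'awedxz', 'e' => 'wsdr', 'r' => 'edft'];
    $subCost = function (string $x, string $y) use ($adjacent): float {
        if ($x === $y) {
            return 0.0;
        }
        if (str_contains($adjacent[$x] ?? '', $y) || str_contains($adjacent[$y] ?? '', $x)) {
            return 0.5; // adjacent keys: plausible typo
        }
        return 1.0;
    };
    $n = strlen($a);
    $m = strlen($b);
    $prev = range(0, $m);
    for ($i = 1; $i <= $n; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $m; $j++) {
            $curr[$j] = min(
                $prev[$j] + 1,                                   // deletion
                $curr[$j - 1] + 1,                               // insertion
                $prev[$j - 1] + $subCost($a[$i - 1], $b[$j - 1]) // substitution
            );
        }
        $prev = $curr;
    }
    return $prev[$m];
}
```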
If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:
Apologies if this is obvious, but Levenshtein distance is symmetrical, so make sure you aren't computing F(A, B) and F(B, A).
abs(len(A) - len(B)) is a lower bound on the distance between strings A and B. So you can skip checking strings whose lengths are too different.
One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", etc. That dictionary might get pretty big, but I don't know a better way to deal with this issue.
Since you tagged this question with php, I'm assuming you want to use PHP for this. PHP has a built-in levenshtein() function, but both strings have to be 255 characters or less. If that's not long enough, you'll have to make your own. Alternatively, you could investigate Python's difflib.
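The symmetry and length-bound shortcuts above can be sketched as a pairwise driver (the function name and the use of PHP's built-in levenshtein() for the inner comparison are illustrative):

```php
<?php
// Pairwise duplicate scan: compare each unordered pair only once
// (symmetry), and skip pairs whose length difference already exceeds
// the threshold, since that is a lower bound on the distance.
function findTypoPairs(array $strings, int $maxDist): array
{
    $pairs = [];
    $k = count($strings);
    for ($i = 0; $i < $k; $i++) {
        for ($j = $i + 1; $j < $k; $j++) { // j > i: F(A,B) only, never F(B,A)
            $a = $strings[$i];
            $b = $strings[$j];
            if (abs(strlen($a) - strlen($b)) > $maxDist) {
                continue; // length gap alone rules this pair out
            }
            if (levenshtein($a, $b) <= $maxDist) {
                $pairs[] = [$a, $b];
            }
        }
    }
    return $pairs;
}
```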
You should check out this book:
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
It has a good chapter (3.3) on spell checking.
The references at the end of the chapter list some papers that discuss probabilistic models.
Good luck

Searching large data, all numeric, 1 billion bytes in PHP

I was wondering how I could quickly search a data string of up to 1 billion bytes of data. The data is all numeric. Currently, we have the data split into 250k files and the searches using strpos (fastest built-in function) on each file until it finds something.
Is there a way I can index to make it go faster? Any suggestions?
Eventually I would like to find multiple occurrences, which, as of now, would be done with the offset parameter on strpos.
Any help would surely lead to recognition where needed.
Thanks!
- James Hartig
Well, your tags indicate what you should do (the tag I am referring to is "indexing").
Basically, you should have separate files holding the indexes for the data: the data strings you are looking for, along with the file and byte positions where each occurs.
You would then access the index, look up your value, find the location(s) in the original file(s), and process from there.
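A toy in-memory sketch of that idea: record the byte offset of every k-digit block, so a query jumps straight to candidate positions instead of strpos()-scanning the whole data. For a billion bytes the index would of course have to be persisted to files (e.g. one per gram prefix) rather than held in a PHP array, and k = 4 is an illustrative choice:

```php
<?php
// Build a gram => [offsets] index over a numeric data string.
function buildIndex(string $data, int $k = 4): array
{
    $index = [];
    $last = strlen($data) - $k;
    for ($i = 0; $i <= $last; $i++) {
        $index[substr($data, $i, $k)][] = $i; // every k-digit block
    }
    return $index; // in practice, persist this to index files
}

// Find all (possibly overlapping) occurrences of $query.
function findAll(string $data, array $index, string $query, int $k = 4): array
{
    if (strlen($query) < $k) {
        return []; // short queries would need a smaller-gram index
    }
    $hits = [];
    // Candidates are wherever the query's first k digits occur.
    foreach ($index[substr($query, 0, $k)] ?? [] as $offset) {
        if (substr($data, $offset, strlen($query)) === $query) {
            $hits[] = $offset;
        }
    }
    return $hits;
}
```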
A good answer may require that you get a little more specific.
How long is the search query? 1 digit? 10 digits? Arbitrary length?
How "fast" is fast enough? 1 second? 10 seconds? 1 minute?
How many total queries per second / minute / hour do you expect?
How frequently does the data change? Every day? Hour? Continuously?
When you say "multiple occurrences" it sounds like you mean overlapping matches.
What is the "value" of the answer and to how many people?
A billion ain't what it used to be, so you could just index the crap out of the whole thing and have an index that is 10 or even 100 times the size of the original data. But if the data is changing by the minute, that would mean you were burning more cycles creating the index than searching it.
The amount of time and money you put into a solution is a function of the value of that solution.
You should definitely get a girlfriend. Besides helping you spend your time better it can grow fat without bursting. Oh, and the same goes for databases.
All of Peter Rowell's questions pertain. If you absolutely must have an out-of-the-box answer, then try grep. You can even exec it from PHP if you like. It is orders of magnitude faster than strpos. We've actually used it quite successfully as a solution for something that couldn't deal with indexing.
But again, Peter's questions still all apply. I'd answer them before diving into a solution.
Would a hash function/table work? Or a Suffix Array/Tree?
