After searching I found how to do fuzzy searching on a single string,
but I have an array of strings:
$search = ["a" => "laptop", "b" => "screen", ...];
that I retrieved from a MySQL database.
Is there any PHP class or function that does fuzzy searching on an array of words,
or at least a link with some useful info?
I saw a comment recommending PostgreSQL
and its fuzzy-searching capability, but
the company already has a MySQL database.
Is there any recommendation?
You could do this in MySQL since you already have a MySQL database; see How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?, which mentions the MySQL Double Metaphone implementation and includes an SQL implementation for MySQL 5.0+.
Edit: Sorry, answering here as there is more than could fit in a comment…
Since you've already accepted an answer using the PHP levenshtein function, I suggest you try that approach first. Software is iterative; the PHP array search may be exactly what you want, but you have to test and implement it against your requirements first. As I said in your other question, a find-as-you-type solution might be the simplest option here: it simply narrows the product list as the user types. There might not be a need to implement any fuzzy searching at all, since you are using the user to do the fuzzy search themselves :-)
For example, a user starts typing S, a, m, which allows you to narrow the products to those beginning with "Sam". That way you only ever let the user select a product you already know is valid, as in the sketch below.
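A minimal find-as-you-type sketch, assuming an existing PDO connection and a hypothetical products(name) table:
<?php
// Hypothetical schema: products(name). $pdo is an existing PDO connection.
// Returns product names starting with what the user has typed so far.
function suggestProducts(PDO $pdo, $typed, $limit = 10) {
    $stmt = $pdo->prepare(
        'SELECT name FROM products WHERE name LIKE ? ORDER BY name LIMIT ' . (int) $limit
    );
    // Escape LIKE wildcards in the user input, then anchor it as a prefix.
    $stmt->execute([addcslashes($typed, '%_\\') . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}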
Look at the Levenshtein function
Basically it gives you the difference (in terms of cost) between two strings, i.e. what it costs to transform string A into string B.
Set yourself a threshold Levenshtein distance; anything under that threshold for two words means they're similar.
Also, the Bitap algorithm is faster, since it can be implemented with bitwise operators, but I believe you will have to implement it yourself unless there is a PHP lib for it somewhere.
EDIT
To use the levenshtein method: say the search string is "maptop" and you set your cost threshold to 2. That means you want any words that are at most two string-transform operations away from your search string.
So you loop through your array A of strings and keep every entry where
levenshtein(A[i], searchString) <= 2
Those are your matches. You may get more than one word that matches, so it is up to you how you want to handle the extra results. A sketch follows.
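A minimal sketch of that loop, reusing the array from the question:
<?php
$search = ["a" => "laptop", "b" => "screen"]; // the array retrieved from the DB
$searchString = "maptop";
$threshold = 2;

$matches = [];
foreach ($search as $key => $word) {
    // Keep every word within $threshold edit operations of the query.
    if (levenshtein($word, $searchString) <= $threshold) {
        $matches[$key] = $word;
    }
}
// $matches is now ["a" => "laptop"] (distance 1)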
Related
We are currently running our app on MySQL and are planning to move to MongoDB. We have moved some parts already, but we are having issues with MongoRegex performance.
We have an autocomplete search box that joins 6 tables (indexed and non-indexed fields) and returns results super fast on MySQL. The same thing on MongoDB performs really slowly: it takes about 2.3 seconds on just one collection, so the user has to wait a long time. The connection time is 0.064 seconds; the query time is 2.36 seconds. I did a bit of Googling and couldn't find a good answer. Everyone says MongoRegex is slow. If that's true, how are other companies overcoming this problem?
What is the best way to improve autocomplete performance and experience when running it on MongoDB?
First of all, you will have to design your query carefully: select properly indexed fields and design the query around them. Also, if you are using a regex, make sure you write it in a way that forces the query to use an indexed field; something like /^prefix/ will do, as in the sketch below. [See this link: http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use]
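A minimal sketch using the legacy PHP Mongo driver (the one that provides MongoRegex); the database, collection, and field names here are hypothetical:
<?php
// Hypothetical names: database "shop", collection "products", field "name".
$client = new MongoClient();
$products = $client->shop->products;
// Anchored, case-sensitive prefix regex: this form can use an index on "name".
$cursor = $products->find(array('name' => new MongoRegex('/^Sam/')));
foreach ($cursor as $doc) {
    echo $doc['name'], "\n";
}
// Note: /Sam/ (unanchored) or /^sam/i (case-insensitive) cannot use the index efficiently.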
I have seen many implementations using MongoDB range queries, but I'm not sure that's the best option, since instantaneous results are a key requirement here.
Apart from that, I have seen someone recommend prefix trees, which effectively store a prefix in one field and all the words starting with that particular prefix in another field as an array. This solution sounds convincing and fast, since the prefix field is supposed to be indexed, but you will have to think about the storage cost as well. A sketch of the document shape follows.
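A rough sketch of what such a prefix document might look like, again with hypothetical names and the legacy driver:
<?php
// One document per prefix; an index on "prefix" makes each lookup a single
// indexed equality match instead of a regex scan.
$client = new MongoClient();
$client->shop->prefixes->insert(array(
    'prefix' => 'sam',
    'words'  => array('samsung', 'sample', 'samaritan'),
));
// Autocomplete for "sam":
$hit = $client->shop->prefixes->findOne(array('prefix' => 'sam'));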
It is hard to tell without seeing the search query, but it is worth mentioning that, after taking a look at the documentation, it appears MongoDB is able to use an index for a regular-expression search only if the pattern is anchored and prefixed by a constant string:
http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use
So, to quote the doc:
A regular expression is a “prefix expression” if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
But /abc.*/ or /^.*abc/ will not use the index.
I have already developed typing software in PHP & MySQL to capture text typed by candidates at my institute. I am now stuck on a strategic issue: how should I compare the similarity of the text typed by each candidate with the standard paragraph I gave them to type (handed out as a hard copy, though the same copy is also stored in the MySQL database)? My dilemma is whether I should run the Levenshtein distance algorithm in PHP or directly in MySQL, so that performance is optimized. Frankly, I am afraid that programming it in PHP could turn out error-prone when evaluating the texts. It is worth mentioning that the texts are compared in order to rank candidates by words typed per minute.
The simplest solution would be to utilize PHP's built-in levenshtein function to compare the two blocks of text. If you wanted to push the processing off to the MySQL database, you could implement the solution listed in Levenshtein: MySQL + PHP.
Another PHP option might be the similar_text function.
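For reference, a quick sketch of both built-ins on two short strings:
<?php
$typed    = "The quick brown fox jumps ovr the lazy dog";
$standard = "The quick brown fox jumps over the lazy dog";

// levenshtein: number of insert/delete/replace operations (1 here).
echo levenshtein($typed, $standard), "\n";

// similar_text: matching characters; the third argument receives a percentage.
similar_text($typed, $standard, $percent);
echo round($percent, 1), "%\n";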
The unfortunate drawback of the PHP levenshtein function is that it cannot handle strings longer than 255 characters. As per the PHP manual:
This function returns the Levenshtein-Distance between the two argument strings or -1, if one of the argument strings is longer than the limit of 255 characters.
So, if your paragraphs are longer than that, you may be forced to implement a MySQL solution. Alternatively, you could break the paragraphs up into 255-character blocks for comparison, though I can't say definitively that this won't "break" the levenshtein algorithm.
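A rough sketch of that chunking idea; edits that shift text across chunk boundaries will inflate the total, so treat the result as an approximation only:
<?php
// Approximate the distance between long texts by comparing 255-char chunks.
function chunkedLevenshtein($a, $b, $size = 255) {
    $aChunks = str_split($a, $size);
    $bChunks = str_split($b, $size);
    $total = 0;
    $n = max(count($aChunks), count($bChunks));
    for ($i = 0; $i < $n; $i++) {
        $ca = isset($aChunks[$i]) ? $aChunks[$i] : '';
        $cb = isset($bChunks[$i]) ? $bChunks[$i] : '';
        $total += levenshtein($ca, $cb);
    }
    return $total;
}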
I'm not an expert in linguistic parsing and processing, so I can't speak to whether these are the best solutions (as you ask in your question). They are, however, very straightforward and simple to implement.
Essentially what I want to do is search a number of MySQL databases and return results where a certain field is more than 50% similar to another record in the databases.
What am I trying to achieve?
I have a number of writers who add content to a network of websites that I own. I need a tool that will tell me if any of the pages they have written are too similar to any of the pages currently published on the network. This could run on post/update or as a cron... either way would work for me.
I've tried making something in PHP, drawing the records from the database and using the function similar_text(), which gives a percentage similarity between two strings. This, however, is not a workable solution, as you have to compare every entry against every other entry, and I worked out with microtime() that it would take around 80 hours to search all of the entries!
Wondering if it's even possible!?
Thanks!
What you are probably looking for is SOUNDEX, the only sound-based search in MySQL. If you have a lot of data to compare, you will probably need to pregenerate the SOUNDEX values and compare the soundex columns, or use it live like this:
SELECT * FROM data AS t1 LEFT JOIN data AS t2 ON SOUNDEX(t1.fieldtoanalyse) = SOUNDEX(t2.fieldtoanalyse)
Note that you can also use the
t1.fieldtoanalyze SOUNDS LIKE t2.fieldtoanalyze
syntax.
Finally, you can save the SOUNDEX value to a column: just create the column and run
UPDATE data SET fieldsoundex = SOUNDEX(fieldtoanalyze)
and then compare live with pregenerated values
More on Soundex
Soundex is a function that analyzes the composition of a word, but in a very crude way. It is very useful for comparisons like "Color" vs "Colour" and "Armor" vs "Armour", but it can also dish out weird results with long words, because the SOUNDEX of a word is a letter plus a three-digit code. Sadly, there is only so much you can do with those combinations.
Note that there is no Levenshtein or Metaphone implementation in MySQL, not yet anyway; Levenshtein would probably have been the best fit for your case.
Anything is possible.
Without knowing your criteria for "similar", it's difficult to offer a specific solution. However, my suggestion would be to pre-build a similarity table, utilizing a function such as similar_text(), and use it as your index table when searching by term.
You'll take an initial hit to build such an index, but it is easy to maintain as new records are added; see the sketch below.
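A minimal sketch of building such a table with PDO; the schema, pages(id, body) and similarity(page_a, page_b, percent), is hypothetical:
<?php
// Hypothetical schema: pages(id, body) and similarity(page_a, page_b, percent).
// $pdo is an existing PDO connection. This initial build is still O(n^2).
$pages = $pdo->query('SELECT id, body FROM pages')->fetchAll(PDO::FETCH_KEY_PAIR);
$insert = $pdo->prepare(
    'INSERT INTO similarity (page_a, page_b, percent) VALUES (?, ?, ?)'
);
$ids = array_keys($pages);
for ($i = 0; $i < count($ids); $i++) {
    for ($j = $i + 1; $j < count($ids); $j++) {
        similar_text($pages[$ids[$i]], $pages[$ids[$j]], $percent);
        if ($percent > 50) { // only store pairs above the 50% threshold
            $insert->execute([$ids[$i], $ids[$j], $percent]);
        }
    }
}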
Thanks for your answers, guys. For anyone facing a similar problem: I used the SOUNDEX function to pull out entries that had a similar title, then compared them with the similar_text() function, roughly as in the sketch below. Not quite a complete database comparison, but as near as I could get it!
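A sketch of that combination, assuming a hypothetical pages(id, title, body) table and a new article held in $newId, $newTitle, $newBody:
<?php
// Prefilter by sound-alike titles, then score the survivors with similar_text().
$stmt = $pdo->prepare(
    'SELECT id, body FROM pages WHERE SOUNDEX(title) = SOUNDEX(?) AND id <> ?'
);
$stmt->execute([$newTitle, $newId]);
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    similar_text($newBody, $row['body'], $percent);
    if ($percent > 50) {
        echo "Page {$row['id']} is {$percent}% similar\n";
    }
}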
When searching the db with terms that retrieve no results, I want to offer a "did you mean..." suggestion (like Google does).
So, for example, if someone searches for "jquyer", it would output "did you mean jquery?".
Of course, the suggestions have to be matched against the values inside the db (I'm using MySQL).
Do you know a library that can do this? I've googled it but haven't found any great results.
Or perhaps you have an idea of how I could build this on my own?
A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.
In a nutshell, the SOUNDEX function was originally used to deal with common typos and alternate spellings of family names, and it encapsulates very well many common spelling mistakes (in the English language). Because of its focus on family names, the original soundex function may be limiting (for example, encoding stops after the third or fourth non-repeating consonant), but it is easy to extend the algorithm.
The appeal of this type of function is that it allows computing, ahead of time, a single value that can be associated with each word. This is unlike string-distance functions such as edit distances (Levenshtein, Hamming, or even Ratcliff/Obershelp), which produce a value relative to a pair of strings.
By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run time, quickly search the dictionary/database based on the SOUNDEX value calculated from the user-supplied search terms. This SOUNDEX search can be done systematically, as a complement to the plain keyword search, or only when the keyword search didn't yield a satisfactory number of records, which provides the hint that the user-supplied keyword(s) may be misspelled. A sketch follows.
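A minimal sketch of the precomputed-SOUNDEX lookup, assuming a hypothetical dictionary(word, sdx) table where sdx holds SOUNDEX(word) and is indexed:
<?php
// Hypothetical schema: dictionary(word, sdx), sdx = SOUNDEX(word), indexed.
function didYouMean(PDO $pdo, $term) {
    $stmt = $pdo->prepare('SELECT word FROM dictionary WHERE sdx = SOUNDEX(?)');
    $stmt->execute([$term]);
    $candidates = $stmt->fetchAll(PDO::FETCH_COLUMN);
    // Rank the sound-alikes by edit distance to what the user actually typed.
    usort($candidates, function ($a, $b) use ($term) {
        return levenshtein($term, $a) - levenshtein($term, $b);
    });
    return $candidates;
}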
A totally different approach, only applicable to user queries that include several words, is based on running multiple queries against the dictionary/database, each excluding one (or several) of the user-supplied keywords. The result lists of these alternate queries provide a list of distinct words; this [reduced] list is typically small enough that pair-based distance functions can be applied to select, within it, the words closest to the allegedly misspelled word(s). The word frequency within the result lists can be used both to limit the number of words evaluated (only measure similarity for words found more than x times) and to provide weight, slightly skewing the similarity measurements (i.e. favoring words found "in quantity" in the database, even if their similarity measurement is slightly lower). A sketch of this ranking step follows.
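A sketch of just the ranking step, assuming $resultTexts already holds the text returned by the alternate queries; the frequency weighting factor is an arbitrary illustrative choice:
<?php
// Harvest distinct words from the alternate queries' results, then rank them.
function rankCandidates(array $resultTexts, $suspect, $minFreq = 2) {
    $freq = array();
    foreach ($resultTexts as $text) {
        foreach (str_word_count(strtolower($text), 1) as $word) {
            $freq[$word] = isset($freq[$word]) ? $freq[$word] + 1 : 1;
        }
    }
    $scores = array();
    foreach ($freq as $word => $count) {
        if ($count < $minFreq) {
            continue; // only evaluate similarity for frequent words
        }
        // Slightly favor words found "in quantity" (the 0.1 factor is arbitrary).
        $scores[$word] = levenshtein($suspect, $word) - 0.1 * log($count);
    }
    asort($scores); // lowest (best) score first
    return array_keys($scores);
}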
How about the levenshtein function, or the similar_text function?
Actually, I believe Google's "did you mean" function is generated by what users type in after they've made a typo. However, that's obviously a lot easier for them since they have unbelievable amounts of data.
You could use Levenshtein distance as mgroves suggested (or Soundex), but store results in a database. Or, run separate scripts based on common misspellings and your most popular misspelled search terms.
http://www.phpclasses.org/browse/package/4859.html
Here's an off-the-shelf class that's rather easy to implement and employs minimum edit distance. All you need to do is have a token (not type) list of all the words you want to work with handy. My suggestion is to make sure it's the complete list of words within your search index, and only within your search index. This helps in two ways:
1. Domain specificity helps avoid misleading probabilities from overtaking your implementation.
Ex: "Memoize" may be spell-corrected to "Memorize" by most off-the-shelf dictionaries, but that's a perfectly good search term for a computer science page.
2. Proper nouns that are available within your search index are now accounted for.
Ex: If you're Dell, and someone searches for "inspiran", there's absolutely no chance the spell-correct function will know you mean "Inspiron". It will probably correct to "inspiring" or something more common and, again, less domain-specific.
When I did this a couple of years ago, I already had a custom built index of words that the search engine used. I studied what kinds of errors people made the most (based on logs) and sorted the suggestions based on how common the mistake was.
If someone searched for jQuery, I would build a select-statement that went
SELECT Word, 1 AS Relevance
FROM keywords
WHERE Word IN ('qjuery','juqery','jqeury' etc)
UNION
SELECT Word, 2 AS Relevance
FROM keywords
WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc etc
ORDER BY Relevance, Word
The resulting words were my suggestions, and it worked really well.
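For completeness, one way the transposition variants in the IN (...) list above could be generated; the original answer doesn't show this step, so this helper is only an assumption:
<?php
// Adjacent-swap (transposition) variants of a word. For "jquery" this yields
// "qjuery", "juqery", "jqeury", "jqurey", "jqueyr".
function transpositions($word) {
    $variants = array();
    for ($i = 0; $i < strlen($word) - 1; $i++) {
        $v = $word;
        $v[$i] = $word[$i + 1];
        $v[$i + 1] = $word[$i];
        $variants[] = $v;
    }
    return $variants;
}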
You should keep track of common misspellings that come through your search (or generate some yourself with a typo generator) and store each misspelling alongside the word it matches in a database. Then, when a search matches nothing, you can check the misspellings table and use the suggested word.
Writing your own custom solution will take quite some time and is not guaranteed to work if your dataset isn't big enough, so I'd recommend using an API from a search giant such as Yahoo. Yahoo's results aren't as good as Google's, but I'm not sure whether Google's is meant to be public.
You can simply use an API like this one: https://www.mashape.com/marrouchi/did-you-mean
I'm looking for an efficient way (using PHP with a MySQL database) to suggest alternative spellings for a query.
I know I can use services such as Yahoo's Spelling Suggestion but I want the suggestions to be based on what is currently available in the database.
For example: the user has to fill in a form with a "City" field, and I want to make sure that everyone uses the same spelling for a given city (so I don't end up with people filling in "Pitsburgh" when they mean "Pittsburgh").
This was only an example, but basically I want to search what is already in the database for entries whose spelling is really close to what the user entered...
Any algorithm, tutorials or ideas on how to achieve this?
I would do it as the user types and suggest by prefix (a la Google Suggest). A trie would be nice for this. It wouldn't help to correct misspelled first letters, but those are pretty rare.
MySQL has no built-in function for the Levenshtein edit distance; you can add one as a stored function, but it's quite slow. I'd use the auto-complete approach offered above, or simply clean up entries after the fact every week or so.
Maybe this will help http://jquery.bassistance.de/autocomplete/demo/
It uses jQuery (client side) and PHP (server side).
The example feeds from an array but can easily be modified to use a MySQL database.
Spelling alternatives are often implemented using the Levenshtein distance between two words (the one the user typed and the one inside, for example, your database).
Here is the pseudocode for the algorithm (from Wikipedia):
int LevenshteinDistance(char s[1..m], char t[1..n])
    // d is a table with m+1 rows and n+1 columns
    declare int d[0..m, 0..n]

    for i from 0 to m
        d[i, 0] := i
    for j from 0 to n
        d[0, j] := j

    for i from 1 to m
        for j from 1 to n
        {
            if s[i] = t[j] then cost := 0
            else cost := 1
            d[i, j] := minimum(
                d[i-1, j] + 1,      // deletion
                d[i, j-1] + 1,      // insertion
                d[i-1, j-1] + cost  // substitution
            )
        }

    return d[m, n]
And here you can find real implementations in all sorts of languages: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
I've used the pspell (http://uk.php.net/pspell) package to do this. Take the search term and check its spelling; if it's not OK, pspell will make suggestions.
You can even run the suggestions through your search, count the results, and then say: Your search for "foo" returned 0 results. Did you mean "baz" (12 results) or "bar" (3 results)?
If you are worried about performance, only do this when a search returns 0 results.
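A minimal pspell sketch; countResults() is a hypothetical stand-in for running the real search:
<?php
// Requires the pspell extension and an installed aspell English dictionary.
$dict = pspell_new('en');
$term = 'jquyer';
if (!pspell_check($dict, $term)) {
    foreach (pspell_suggest($dict, $term) as $suggestion) {
        $hits = countResults($suggestion); // hypothetical: run your real search
        if ($hits > 0) {
            echo "Did you mean \"$suggestion\" ($hits results)?\n";
        }
    }
}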
Please take a look at the Yahoo! UI Library Autocomplete Component. I think it is just what you're looking for. The section "Using DataSources" explains how to use different kinds of data sources, including server-side ones like yours.
Have a look at these JavaScript examples; the page lists code for 13 different autocompleting fields.
I've used something similar on one of my sites. I essentially have a div layer set up under the text box; as the user types, each letter fires off an Ajax-based HTTP request to my SQL query script, and the div gets updated with any matching DB entries, which the user can click to select.
I believe SoundEx is a better fit than Levenshtein distance.
SoundEx is a function that produces a hash of a word/phrase based on the sound it would make in English. It is great for helping people who can't spell match the canonical spelling.
I have used it very successfully to find when two people registered the same company in a database with slightly different variants on the name.
SoundEx is built into MySQL. Here is one tutorial on its use.