Can you give me some tips on how to generate a suggestion based on a word entered by the user? It's not a misspelling thing: when a user enters the word "hello" and the database does not contain "hello" but does contain "helo" or "helol", I want to suggest those.
Thank you.
FYI
You should look into PHP's levenshtein function. It returns the edit distance between two strings, so you can score every word in a dictionary file against the input and pick the closest ones. I know you said it's not misspelling, but the dictionary file can be anything, and you can have more than one, depending on how you want to use it.
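As a minimal sketch of that idea: the `$known` list below is a hypothetical stand-in for the words in the database, and `suggest()` and its `$maxDistance` threshold are illustrative names, not part of PHP.

```php
<?php
// Hypothetical word list standing in for the database contents.
$known = ['helo', 'helol', 'world', 'goodbye'];

/**
 * Return the known words within $maxDistance edits of $input, closest first.
 */
function suggest(string $input, array $known, int $maxDistance = 2): array
{
    $scored = [];
    foreach ($known as $word) {
        $distance = levenshtein(strtolower($input), strtolower($word));
        if ($distance <= $maxDistance) {
            $scored[$word] = $distance;
        }
    }
    asort($scored);             // smallest distance first, keys preserved
    return array_keys($scored);
}

print_r(suggest('hello', $known));
```

Scanning every word this way is linear in the size of the dictionary, so for a large table it probably makes sense to narrow the candidates first (for example by first letter or length) before computing distances.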
It will be way too complex to do with MySQL alone. You need to index commonly used words using something like Sphinx Search (a stand-alone full text search engine) and then run the queries against Sphinx.
There is a pretty good thread about it at http://sphinxsearch.com/forum/view.html?id=5898
You can use the soundex function and compare the submitted string to a dictionary database, i.e.:
soundex("Hellllo") == soundex("Hello");
All you have to do is store the soundex of each suggestion in a table. Then, when a user submits a word, you can search for its soundex code in your table and return the words with the same or a close pronunciation.
The soundex method is fairly fast, but the soundex column of your dictionary table has to be indexed if you need performance.
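A rough sketch of the lookup, using an in-memory array in place of the indexed table. The `$dictionary` contents and the `suggestBySound()` helper are assumptions made for illustration.

```php
<?php
// Hypothetical dictionary; in practice these rows would live in an indexed table.
$dictionary = ['Hello', 'Help', 'World'];

// Precompute soundex => words, the in-memory analogue of a soundex column.
$bySoundex = [];
foreach ($dictionary as $word) {
    $bySoundex[soundex($word)][] = $word;
}

/** Return every dictionary word that shares the soundex code of $input. */
function suggestBySound(string $input, array $bySoundex): array
{
    return $bySoundex[soundex($input)] ?? [];
}

print_r(suggestBySound('Hellllo', $bySoundex));
```

Because the code is computed once per dictionary word and stored, each lookup is a single exact match on the code rather than a comparison against every word.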
Related
I'm working on a program that needs to find similar text on a webpage. In SQL we have 400,000 search terms. For example, the search terms can be 'San Miguel Pale Pilsen', 'Schaumburger Bali' and 'Rizmajer Cortez'.
Right now I look up each word on the webpage in the database: for each word I send a SELECT query with a LIKE '%word%' pattern, and for each result I call PHP's similar_text. If the search term has more words than the word I am checking, I append the following words from the webpage so that both have the same number of words.
(And yes, I know that it isn't smart.)
The problem is that it takes a lot of time and the server has to work hard for it.
What is the best and fastest way to find similar text on a webpage?
The LIKE operator will always be slow if you start the pattern with a % wildcard, because a leading wildcard prevents MariaDB from using an index on the column.
Considering you need to find words at any location in the VARCHAR column, the best solution is to implement bona fide Full Text Search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, not to mention scalability.
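A sketch of what the query side could look like from PHP. The table `search_terms` and its column `term` are hypothetical, and a FULLTEXT index on that column is assumed. `toBooleanQuery()` is an illustrative helper that requires every word of the input in boolean mode.

```php
<?php
/**
 * Turn free text into a boolean-mode expression that requires every word,
 * e.g. "San Miguel" becomes "+San +Miguel".
 */
function toBooleanQuery(string $text): string
{
    // Keep letters and digits only, so user input cannot inject operators.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return implode(' ', array_map(fn ($w) => '+' . $w, $words));
}

/**
 * Assumes a table search_terms(term) with FULLTEXT(term); both names are hypothetical.
 */
function findTerms(PDO $pdo, string $text): array
{
    $sql = 'SELECT term FROM search_terms
            WHERE MATCH(term) AGAINST (:q IN BOOLEAN MODE)';
    $stmt = $pdo->prepare($sql);
    $stmt->execute([':q' => toBooleanQuery($text)]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

echo toBooleanQuery('San Miguel Pale Pilsen'), "\n";
```

Note that words shorter than the server's configured minimum word length are not indexed, so very short words may not behave as expected in a full-text query.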
I'm developing a system where users can create their own personal recipes with corresponding ingredients and save them (in MySQL).
The problem is that every time an ingredient is saved, I check whether it already exists in the ingredients table by comparing the names of the ingredients.
To be able to build proper shopping lists from the recipes, I want to make sure that near-duplicates such as:
apple - apples - fresh apples
cannot appear.
So if "apple" is created first and I'm trying to save "apples", I want to check whether something similar already exists.
Does an algorithm like the one I'm trying to describe already exist?
Hope you have some input!
While it is possible to use soundex or Levenshtein distance, it would still require finding the key word in the phrase - with 'apple' and 'apples' it would probably work, but with 'dozen of fresh apples' - probably not.
From my experience, in that application nothing beats more manual algorithms:
create a base list of ingredients ("flour", "apple", "ham")
when adding a new recipe, match its ingredient list against the base list, possibly allowing for some fuzziness using Levenshtein or regexes
create a backend page with a list of "original" vs "match", with the possibility to mark wrong matches with a single click
create a simple interface to do a manual matching for bad hits
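The matching step above could be sketched roughly as follows. The base list, the `matchIngredient()` helper and the distance threshold of 1 are all assumptions for illustration. Each word of the entered phrase is compared separately, which is how a longer phrase can still hit its key word.

```php
<?php
// Hypothetical base list, as in the first step above.
$baseIngredients = ['flour', 'apple', 'ham'];

/**
 * Return the base ingredient closest to any word of $entered,
 * or null when nothing is within $maxDistance edits.
 */
function matchIngredient(string $entered, array $base, int $maxDistance = 1): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    $words = preg_split('/\s+/', strtolower(trim($entered)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        foreach ($base as $ingredient) {
            $distance = levenshtein($word, $ingredient);
            if ($distance < $bestDistance) {
                $best = $ingredient;
                $bestDistance = $distance;
            }
        }
    }
    return $best; // null means: queue for manual matching
}

var_dump(matchIngredient('dozen of fresh apples', $baseIngredients));
var_dump(matchIngredient('cinnamon', $baseIngredients));
```

Short base words are prone to false hits (with a threshold of 1, "jam" would match "ham"), which is exactly why the manual review page in the last two steps is worth having.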
You might have some luck with MySQL's SOUNDEX() function, assuming that the words are similar enough and, probably, simple enough.
Documentation can be found here: https://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Basically, it reduces a given word to a short code representing how it sounds. A standard soundex code is four characters long, although MySQL's SOUNDEX() can return a longer string. The code should be the same for any two words that sound largely the same.
In MySQL you can use the SOUNDEX() function.
If you want to implement it in PHP, there are the levenshtein and similar_text functions.
So, suppose I have a simple array of sentences. What would be the best way to search it based on user input, and return the closest match?
The Levenshtein functions seem promising, but I don't think I want to use them. User input may be as simple as highest mountain, in which case I'd want to search for the sentence in the array that has highest mountain. If that exact phrase does not exist, then I'd want to search for the sentence that has highest AND mountain, but not back-to-back, and so on. The Levenshtein functions work on a per-character basis, but what I really need is a per-word basis.
Of course, to some degree, Levenshtein functions may still be useful, as I'd also want to take into account the possibility of the sentence containing the phrase highest mountains (notice the S) or similar.
What do you suggest? Are there any systems for PHP that do this that already exist? Would Levenshtein functions alone be an adequate solution? Is there a word-based Levenshtein function that I don't know about?
Thanks!
EDIT - I have considered MySQL fulltext search, and have also considered breaking both A) the input and B) each sentence into separate arrays of words, and then comparing them that way, using Levenshtein functions to account for variations in words (color, colour, colors, etc.). However, I am concerned that this method, though possibly clever, may be computationally taxing.
As I am not a fan of writing code for you, I would normally ask you what you have tried first. However, I was currently stuck on something, so took a break to write this:
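// $array holds the sentences and $searchterm the user input (both assumed to be set).
$results = array();
// First pass: sentences containing the whole search term (case-insensitive).
foreach ($array as $sentence) {
    if (stripos($sentence, $searchterm) !== false) {
        $results[] = $sentence;
    }
}
// Fallback: no phrase match, so look for each word separately.
if (count($results) == 0) {
    $wordlist = explode(" ", $searchterm);
    foreach ($wordlist as $word) {
        foreach ($array as $sentence) {
            // Skip empty words and sentences that were already added.
            if ($word !== '' && stripos($sentence, $word) !== false && !in_array($sentence, $results)) {
                $results[] = $sentence;
            }
        }
    }
}
print_r($results);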
This searches an array of sentences for the exact term, so it will not find a result if you typed "microsift" and the sentence contained "Microsoft". It is case-insensitive, so differences in capitalization do not matter. If no results are found for the full term, the term is broken up and searched word by word. Hope this at least points you to a starting place.
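To get past the "microsift" limitation, one option is the per-word comparison described in the question's edit: split both sides into words and allow a small Levenshtein distance per word. The sketch below is only one way to do that. `wordScore()`, `closestSentence()`, the tolerance of 1 and the sample sentences are all illustrative assumptions.

```php
<?php
/**
 * Count how many query words have a close match (edit distance <= $tolerance)
 * among the words of $sentence.
 */
function wordScore(string $query, string $sentence, int $tolerance = 1): int
{
    $split = fn (string $s) => preg_split('/\W+/', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
    $score = 0;
    foreach ($split($query) as $q) {
        foreach ($split($sentence) as $w) {
            if (levenshtein($q, $w) <= $tolerance) {
                $score++;
                break; // count each query word at most once
            }
        }
    }
    return $score;
}

/** Return the sentence with the highest score, or null if nothing matches. */
function closestSentence(string $query, array $sentences): ?string
{
    $best = null;
    $bestScore = 0;
    foreach ($sentences as $sentence) {
        $score = wordScore($query, $sentence);
        if ($score > $bestScore) {
            $best = $sentence;
            $bestScore = $score;
        }
    }
    return $best;
}

// Hypothetical sample data.
$sentences = [
    'Microsoft was founded in 1975.',
    'Everest is among the highest mountains on Earth.',
];
echo closestSentence('highest mountain', $sentences), "\n";
echo closestSentence('microsift', $sentences), "\n";
```

The cost is roughly (query words) x (sentence words) distance computations per sentence, which is fine for a modest array but would need an index or a search engine for large collections. A fixed tolerance of 1 is also too loose for very short words ("is" matches "in"), so scaling the tolerance with word length may be worth considering.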
Check this: http://framework.zend.com/manual/en/zend.search.lucene.overview.html
Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
There are no built-in functions in PHP to do this. That is because what you are asking for involves search relevance, related terms, iterative searching, and many more complex operations that need to mimic human logic in searching. You can try looking for PHP-based search classes, although the ones I know of are database search engines rather than array search classes. Making your own is prohibitively complex.
I have to match a string against the strings in a column of a MySQL table, and I need to select the strings that match by more than 80%. Is there any function in MySQL that will do this?
For example, the string "quote by placing" matches the string "quote by place" by more than 80%. This is how I need to filter.
Thanks!
A FULLTEXT search would probably be the best approach for what you're doing. No need to pick an arbitrary % otherwise.
If you're doing more intensive searching, check out some of the engines like Sphinx.
Try chopping 20% off the string in PHP, then use LIKE with the remaining 80% of your string against all the strings you want to compare.
You could try looking at the MySQL SOUNDEX function; that may give you what you need.
As far as I understand your criteria, you want a fuzzy search. This is not implemented in MySQL. You will have to find a way to externalize the check for this.
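One way to externalize the check is to fetch candidate rows and filter them in PHP with similar_text(), which reports a similarity percentage through its third argument. This is only a sketch: `filterBySimilarity()` and the `$rows` sample are hypothetical, and the percentage is similar_text's own measure, which may or may not be the "80%" the question has in mind.

```php
<?php
/**
 * Return the candidates whose similar_text() percentage against $needle
 * is at least $minPercent.
 */
function filterBySimilarity(string $needle, array $candidates, float $minPercent = 80.0): array
{
    $matches = [];
    foreach ($candidates as $candidate) {
        similar_text($needle, $candidate, $percent); // $percent is set by reference
        if ($percent >= $minPercent) {
            $matches[] = $candidate;
        }
    }
    return $matches;
}

// Hypothetical rows already fetched from the column.
$rows = ['quote by place', 'something else entirely'];
print_r(filterBySimilarity('quote by placing', $rows));
```

Since this compares the needle with every row, it is best combined with a cheaper pre-filter in SQL (such as a FULLTEXT match or a prefix LIKE) so that only a small candidate set reaches PHP.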
How do you make it so that when you search for "alien vs predator" you also get results containing the string "alienS vs predator", with the "S"?
Example: http://www.torrentz.com/search?q=alien+vs+predator
How have they implemented this?
Is this advanced search engine stuff?
This is known as Word Stemming. When the text is indexed, words are "stemmed" to their "roots". So fighting becomes fight, skiing becomes ski, runs becomes run, etc. The same thing is done to the text that a user enters at search time, so when the search terms are compared to the values in the index, they match.
The Lucene project supports this. I wouldn't consider it an advanced feature, especially given the expectations that Google has set.
Checking for plurals is a form of stemming. Stemming is a common feature of search engines and other text matching. See the Wikipedia page http://en.wikipedia.org/wiki/Stemming for a host of algorithms that perform stemming.
Typically when one sets up a search engine to search for text, one will construct a query that's something like:
SELECT * FROM TBLMOVIES WHERE NAME LIKE '%ALIEN%'
This means that the substring ALIEN can appear anywhere in the NAME field, so you'll get back strings like ALIENS.
When words are indexed, they are indexed by their root form. For example, "aliens", "alien", "alien's" and "aliens'" are all stored as "alien".
And when a query is run, the search engine likewise searches only for the root form "alien".
A well-known way to do this is the Porter Stemming Algorithm. You can download an implementation for your favorite language here - http://tartarus.org/~martin/PorterStemmer/
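To illustrate the idea only, here is a deliberately naive sketch that strips possessives and a trailing plural "s". It is not the Porter algorithm, and `naiveStem()` and `stemPhrase()` are made-up helper names. The point is that the same function is applied to the indexed text and to the query.

```php
<?php
/**
 * Very naive plural/possessive stripping; NOT the Porter algorithm.
 */
function naiveStem(string $word): string
{
    $word = strtolower($word);
    $word = preg_replace("/('s|s'|')$/", '', $word); // alien's, aliens' -> alien
    if (strlen($word) > 3 && substr($word, -1) === 's' && substr($word, -2) !== 'ss') {
        $word = substr($word, 0, -1);                 // aliens -> alien
    }
    return $word;
}

/** Stem every word of a phrase so index and query can be compared. */
function stemPhrase(string $phrase): string
{
    $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_map('naiveStem', $words));
}

var_dump(stemPhrase('alienS vs predator') === stemPhrase('alien vs predator'));
```

A real stemmer handles far more cases (for example "skiing" or "fighting"), so for anything beyond a demonstration an existing Porter implementation or a search engine with built-in stemming is the safer choice.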
This is a basic feature of a search engine, rather than just a program that matches your query with a set of pre-defined results.
If you have the time, this is a great read, all about different algorithms, and how they are implemented.
You could try using soundex() as a fuzzy match on your strings. If you save the soundex code alongside the title and then compare a prefix of the query's code against it using LIKE 'XXX%', you should get a decent match. The longer the prefix you compare, the closer the match has to be.
see docs: http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_soundex