I have to match a string against the strings in a column of a MySQL table, and I need to select the strings that match by more than 80%. Is there a function in MySQL that will do this?
For example, the string "quote by placing" matches the string "quote by place" by more than 80%. This is how I have to filter.
Thanks!
A FULLTEXT search would probably be the best approach for what you're doing. That way there is no need to pick an arbitrary percentage.
If you're doing more intensive searching, check out some of the engines like Sphinx.
Try chopping 20% off the string in PHP, then use LIKE with the remaining 80% of your string against all the strings to compare.
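A minimal sketch of that idea, assuming the kept 80% is the leading part of the string. The helper name likePrefixPattern and the table/column names in the comment are made up for illustration. Note that this only finds rows that start with the same prefix, so it is a rough approximation of an "80% match".

```php
<?php
// Hypothetical helper: build a LIKE pattern from the leading part of a string.
function likePrefixPattern(string $needle, float $fraction = 0.8): string
{
    // Number of bytes to keep (use the mb_* functions for multibyte text).
    $keep = (int) ceil(strlen($needle) * $fraction);
    $prefix = substr($needle, 0, $keep);

    // Escape the LIKE wildcards so they are matched literally, then add
    // a trailing % so anything may follow the prefix.
    return addcslashes($prefix, '%_\\') . '%';
}

$pattern = likePrefixPattern('quote by placing');

// Bind $pattern as a parameter of a prepared statement, for example:
//   SELECT your_column FROM your_table WHERE your_column LIKE ?
echo $pattern, PHP_EOL;
```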
You could try looking at the MySQL SOUNDEX function, which may give you what you need.
As far as I understand your criteria, you want a fuzzy search. This is not implemented in MySQL. You will have to find a way to externalize the check for this.
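One way to externalize the check is to fetch the candidate strings and filter them in PHP. This is only a sketch: filterBySimilarity is a made-up helper, the sample rows stand in for whatever the query returns, and similar_text() is just one possible similarity measure.

```php
<?php
// Hypothetical helper: keep the candidates whose similar_text() percentage
// against $needle is at least $threshold.
function filterBySimilarity(string $needle, array $candidates, float $threshold = 80.0): array
{
    $matches = [];
    foreach ($candidates as $candidate) {
        // The third argument is filled in by reference with the percentage.
        similar_text($needle, $candidate, $percent);
        if ($percent >= $threshold) {
            $matches[] = $candidate;
        }
    }
    return $matches;
}

// In practice these would be the values selected from the MySQL column.
$rows = ['quote by place', 'something unrelated'];

print_r(filterBySimilarity('quote by placing', $rows));
```

Comparing every row in PHP gets slow on large tables, so it may help to narrow the candidates with a cheap SQL condition first.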
Related
I'm pulling in data from 6 live feeds, which sometimes have slightly different formatting, i.e. I might have
'arsenal' and 'arsenal fc'
'T Walcot' and 'Theo Walcot' and 'T. Walcot'
What I was wondering is whether there is a simple way to check if the strings match each other: if they have a certain % of letters in the same order, they would be considered the same.
I suppose I could set up a list of related words and terms, but this would mean having to set it up in advance. I was wondering if there is an easier, on-the-fly, automated way, as I won't be able to compile a full list for a long time.
There's a function just for that:
similar_text('Theo Walcott', 'T. Walcott', $similarity);
echo $similarity;
Have a look at the soundex function (http://php.net/soundex), and at the similar_text function to get a percentage of similarity.
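For the feed names in the question, something along these lines might work. It is a sketch, not a tuned solution: normalizeName and isSameName are invented names, and the 80% threshold is an assumption you would have to adjust against real feed data.

```php
<?php
// Hypothetical helper: lowercase, drop punctuation, collapse whitespace.
function normalizeName(string $name): string
{
    $name = strtolower($name);
    $name = preg_replace('/[^a-z0-9 ]+/', '', $name);
    return trim(preg_replace('/\s+/', ' ', $name));
}

// Hypothetical helper: treat two names as the same when the similar_text()
// percentage of their normalized forms reaches the threshold.
function isSameName(string $a, string $b, float $threshold = 80.0): bool
{
    similar_text(normalizeName($a), normalizeName($b), $percent);
    return $percent >= $threshold;
}

var_dump(isSameName('arsenal', 'arsenal fc'));
var_dump(isSameName('T. Walcot', 'T Walcot'));
var_dump(isSameName('arsenal', 'chelsea'));
```

Short names leave little room for error, so a percentage threshold can merge different teams or players; checking the results by hand at first is probably wise.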
My PHP script needs to check for matches throughout an array of data. It's currently looking for exact string matches. I'd like it to be less strict.
For example, if the array holds the string "Tom and Jerry" I would like to return true for: "Tom Jerry", "Tom & Jerry" and maybe even "Tom and Jery". I found links to PHP search engines, but they are more complex and not really what I need. My data is fairly small and dynamic, so there's no indexing.
I know I could write a big hairy regular expression, but I'm pretty sure I would be reinventing the wheel, because I'm sure others have already done this. Any advice on where to look or how to approach this would be much appreciated.
EDIT: To clarify, I'm trying to avoid entering all the dynamically generated data into a DB.
If the data were in MySQL, you could use a full text search. This is quite easy to develop; the question is: would that be too heavy-weight of a solution?
It may require some trial and error but you could do:
Make a manual list of words that may be absent, such as 'and', 'in', 'of', et cetera (as in your Tom Jerry example).
Compute the edit distance between the string and the search query. (Hamming distance is only defined for strings of equal length, so a measure such as Levenshtein distance fits better here.) If it is low (perhaps at most one or two), return true.
Otherwise, return false.
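The steps above can be sketched roughly as follows. The function names and the threshold of 2 are assumptions for illustration, and the sketch uses PHP's levenshtein() rather than Hamming distance, because the strings being compared usually differ in length.

```php
<?php
// Hypothetical helper: lowercase the text, turn punctuation into spaces
// and drop the words that are allowed to be absent.
function normalizeTitle(string $text, array $stopWords): string
{
    $text = preg_replace('/[^a-z0-9]+/', ' ', strtolower($text));
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_diff($words, $stopWords));
}

// Hypothetical helper: true when the normalized strings are within
// $maxDistance single-character edits of each other.
function looseMatch(string $query, string $target, int $maxDistance = 2): bool
{
    $stopWords = ['and', 'in', 'of'];
    $distance = levenshtein(
        normalizeTitle($query, $stopWords),
        normalizeTitle($target, $stopWords)
    );
    return $distance <= $maxDistance;
}

foreach (['Tom Jerry', 'Tom & Jerry', 'Tom and Jery', 'Spike and Tyke'] as $query) {
    echo $query, ': ', var_export(looseMatch($query, 'Tom and Jerry'), true), PHP_EOL;
}
```

A fixed distance of 2 is generous for very short strings, so it may be worth scaling the limit with the string length.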
I just discovered two functions which appear to do what I want:
similar_text()
levenshtein()
Both seem to return an integer representing the "closeness" of the match between two strings. The difference between the two is over my head.
My search was aided by this S.O. question.
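As a rough illustration of the difference: similar_text() returns the number of matching characters (higher means closer) and can also report a percentage through its third argument, while levenshtein() returns the number of single-character insertions, deletions and replacements needed to turn one string into the other (lower means closer). The sample words below are arbitrary.

```php
<?php
$a = 'hello';
$b = 'helo';

// Count of matching characters; $percent is filled in by reference.
$common = similar_text($a, $b, $percent);

// Number of single-character edits needed to turn $a into $b.
$distance = levenshtein($a, $b);

printf("similar_text: %d matching characters (%.1f%%)\n", $common, $percent);
printf("levenshtein:  %d edit(s)\n", $distance);
```

So a threshold on similar_text() reads as "at least this similar", and a threshold on levenshtein() reads as "at most this many typos".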
Can you give me some tips on how I can generate a suggestion based on the word entered by the user? It's not a misspelling thing: when a user enters the word "hello" and the database does not contain the word "hello" but does contain "helo" or "helol", I want to suggest that.
Thank you.
FYI
You should look into PHP's levenshtein function. It can be used to find the closest matching words based on a score, by comparing the input against a dictionary file... I know you said it's not misspelling, but the dictionary file can be anything and you can have more than one, depending on how you want to use it.
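A minimal sketch of that approach, assuming the "dictionary" is simply an array of words already loaded from your database or file. The function name closestWord is made up.

```php
<?php
// Hypothetical helper: return the dictionary word with the smallest
// Levenshtein distance to $input, or null for an empty dictionary.
function closestWord(string $input, array $dictionary): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;

    foreach ($dictionary as $word) {
        $distance = levenshtein($input, $word);
        if ($distance < $bestDistance) {
            $best = $word;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Stand-in for the words stored in the database.
$dictionary = ['helol', 'helo', 'world'];

echo closestWord('hello', $dictionary), PHP_EOL;
```

This scans the whole list for every lookup, which is fine for small dictionaries but will not scale to very large ones.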
It will be way too complex to do with MySQL alone. You need to index commonly used words using something like Sphinx Search (a stand-alone full text search engine) and then run the queries against Sphinx.
There is a pretty good thread about it at http://sphinxsearch.com/forum/view.html?id=5898
You can use the soundex function and compare the submitted string to a dictionary database, i.e.:
soundex("Hellllo") == soundex("Hello");
All you have to do is store the soundex of each suggestion in a table. Then, when a user submits a word, you can search for its soundex hash in your table and return the words with the same / close pronunciation.
The soundex method is fairly fast, but your dictionary table has to be indexed if you need performance.
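To illustrate the lookup, here is a sketch that uses a PHP array in place of the indexed table. The function names are invented, and in a real setup the soundex code would be a stored, indexed column rather than an in-memory map.

```php
<?php
// Hypothetical helper: group the dictionary words by their soundex code,
// mimicking a table with an indexed soundex column.
function buildSoundexIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

// Hypothetical helper: return the words that share the input's soundex code.
function suggestBySound(string $input, array $index): array
{
    return $index[soundex($input)] ?? [];
}

$index = buildSoundexIndex(['Hello', 'World']);

print_r(suggestBySound('Hellllo', $index));
```

Soundex was designed for English names, so it tends to group quite different words together and may need a second pass (for example with levenshtein) to rank the candidates.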
I am in need of a lightweight fast search solution.
Today I use FULLTEXT in boolean mode, where every search word is mandatory in the results.
The function is fast, working and meets the requirements.
BUT some of the full-text limitations (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) have turned out to be a problem. The site is on a hosted server and I'm not allowed to change the MySQL settings (e.g. the minimum word length).
E.g.
the search must be able to find red, 11 and ab.cd, which today's full-text solution can't.
http://sphinxsearch.com/ is what you're looking for
Though you have to understand that the shorter the words you want to find, the bigger the indexes you need.
Use Lucene; it's very often used alongside MySQL, and it'll be both faster and richer in features.
Using the built-in FTS engine is relatively bad practice, especially since (before MySQL 5.6) it doesn't work with the slightly more reliable InnoDB engine.
The only thing that would come to mind, would to be basing your search off the number of occurrences you can find. Your actual index method could vary, depending on what the DB offers.
Assuming DB size isn't an issue, a (very) basic approach would be to break the search blobs (say, a post on stackoverflow) into each word, normalize it (remove plurals, strip 'logic' words such as and, etc.) then insert each word as a new record, together with the ID that identifies your indexed resource.
Count the instances of the ID, order by count, higher number = more relevant.
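A very small in-memory sketch of this word-count idea. The arrays stand in for the database table of (word, ID) records, all function names are made up, and the plural handling is deliberately crude.

```php
<?php
// Hypothetical helper: split text into normalized words, dropping
// 'logic' words and a trailing plural "s".
function tokenize(string $text): array
{
    $stopWords = ['and', 'or', 'the', 'a', 'of', 'in'];
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue;
        }
        if (strlen($word) > 3 && substr($word, -1) === 's') {
            $word = substr($word, 0, -1); // crude plural removal
        }
        $tokens[] = $word;
    }
    return $tokens;
}

// One entry per word occurrence: word => list of resource IDs.
function buildIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $id => $text) {
        foreach (tokenize($text) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

// Count the matching records per ID; a higher count means more relevant.
function search(string $query, array $index): array
{
    $scores = [];
    foreach (tokenize($query) as $word) {
        foreach ($index[$word] ?? [] as $id) {
            $scores[$id] = ($scores[$id] ?? 0) + 1;
        }
    }
    arsort($scores);
    return $scores; // ID => number of matching records
}

$documents = [1 => 'Red cars and red bikes', 2 => 'Blue cars', 3 => 'Green trees'];
print_r(search('red cars', buildIndex($documents)));
```

In SQL the same ranking would roughly be a GROUP BY on the ID with ORDER BY COUNT(*) DESC over the word table.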
Not exactly my field though, so tread carefully! =]
I'd recommend you try distance searching: Levenshtein
Or search for "N-gram fulltext indexing".
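To give an idea of what N-gram indexing means, here is a toy in-memory version using 2-character grams, so that short terms such as "11" can still be found. Everything here (function names, gram size, sample documents) is an assumption for illustration; a real setup would store the grams in an indexed table or use a search engine that supports them.

```php
<?php
// Split a string into its distinct n-character substrings.
function ngrams(string $text, int $n = 2): array
{
    $text = strtolower($text);
    $grams = [];
    for ($i = 0; $i + $n <= strlen($text); $i++) {
        $grams[] = substr($text, $i, $n);
    }
    return array_values(array_unique($grams));
}

// gram => set of document IDs containing that gram.
function buildNgramIndex(array $documents, int $n = 2): array
{
    $index = [];
    foreach ($documents as $id => $text) {
        foreach (ngrams($text, $n) as $gram) {
            $index[$gram][$id] = true;
        }
    }
    return $index;
}

// Find documents containing $query as a substring: intersect the ID sets
// of all query grams, then verify each candidate with stripos().
// Queries shorter than $n characters produce no grams and return nothing.
function ngramSearch(string $query, array $documents, array $index, int $n = 2): array
{
    $candidates = null;
    foreach (ngrams($query, $n) as $gram) {
        $ids = $index[$gram] ?? [];
        $candidates = $candidates === null ? $ids : array_intersect_key($candidates, $ids);
    }

    $hits = [];
    foreach (array_keys($candidates ?? []) as $id) {
        if (stripos($documents[$id], $query) !== false) {
            $hits[] = $id;
        }
    }
    return $hits;
}

$documents = [1 => 'A red car', 2 => 'Route 11', 3 => 'file ab.cd here', 4 => 'credit'];
$index = buildNgramIndex($documents);

print_r(ngramSearch('11', $documents, $index));
```

Because this is substring matching, a search for "red" also finds "credit"; whether that is acceptable depends on the requirements.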
I haven't mucked around with it, but I read the theory of full text searching (with mysql at least) a little while back.
If memory serves me correctly, you can use full-text search for what you want, but you need to reconfigure it (and, I think, recompile) to get it to work on shorter search terms. I think the default minimum is 4 characters. You'll want to change it to 2 characters, with a few other options thrown in, and test the results you get.
Someone correct me if this is incorrect. I would rather not send him after a red herring.
How do you make a search for "alien vs predator" also return results containing the string "alienS vs predator", with the "S"?
Example: http://www.torrentz.com/search?q=alien+vs+predator
How have they implemented this?
Is this advanced search engine stuff?
This is known as Word Stemming. When the text is indexed, words are "stemmed" to their "roots". So fighting becomes fight, skiing becomes ski, runs becomes run, etc. The same thing is done to the text that a user enters at search time, so when the search terms are compared to the values in the index, they match.
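The idea can be shown with a deliberately naive stemmer that only strips "ing" and a trailing "s". This is not the algorithm any particular search engine uses, and naiveStem and stemAll are invented names; real stemmers handle far more cases.

```php
<?php
// Naive illustration only: strip "ing" or a trailing "s" when at least
// three characters remain. Real stemmers are much more careful.
function naiveStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// Apply the same stemming to indexed text and to the user's query,
// so that both sides are compared in root form.
function stemAll(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/i', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('naiveStem', $words);
}

var_dump(stemAll('alienS vs predator') === stemAll('alien vs predator'));
```

The important part is the symmetry: whatever transformation is applied at index time must also be applied to the search terms.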
The Lucene project supports this. I wouldn't consider it an advanced feature. Especially with the expectations that Google has set.
Checking for plurals is a form of stemming. Stemming is a common feature of search engines and other text matching. See the wikipedia page: http://en.wikipedia.org/wiki/Stemming for a host of algorithms to perform stemming.
Typically when one sets up a search engine to search for text, one will construct a query that's something like:
SELECT * FROM TBLMOVIES WHERE NAME LIKE '%ALIEN%'
This means that the substring ALIEN can appear anywhere in the NAME field, so you'll get back strings like ALIENS.
When words are indexed, they are indexed by root form. For example, "aliens", "alien", "alien's" and "aliens'" are all stored as "alien".
And when a search is run, the search engine likewise searches only for the root form "alien".
A well-known way of doing this is the Porter Stemming Algorithm. You can download an implementation for your favorite language here - http://tartarus.org/~martin/PorterStemmer/
This is a basic feature of a search engine, rather than just a program that matches your query with a set of pre-defined results.
If you have the time, this is a great read, all about different algorithms, and how they are implemented.
You could try using soundex() as a fuzzy match on your strings. If you save the soundex with the title and then compare that index against a substring using LIKE 'XXX%', you should have a decent match. The longer the substring you compare, the closer the strings have to match.
see docs: http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_soundex