I'm developing a system where users can create their own personal recipes with corresponding ingredients and save them (in MySQL).
The problem is that every time an ingredient is saved I check if it already exists in the ingredients table by comparing the names of the ingredients.
If I want to be able to make proper shopping lists from the recipes, I want to make sure that, for example:
apple - apples - fresh apples
can't appear as separate ingredients.
So if "apple" is created first and I'm trying to save "apples", I want to check whether something similar already exists.
Does an algorithm like the one I'm trying to describe already exist?
Hope you have some input!
While it is possible to use soundex or Levenshtein distance, it would still require finding the key word in the phrase - with 'apple' and 'apples' it would probably work, but with 'dozen of fresh apples' - probably not.
In my experience, for this kind of application nothing beats a more manual algorithm:
create a base list of ingredients ("flour", "apple", "ham")
when adding a new recipe, match the ingredient list against that base list, possibly allowing for some fuzziness using Levenshtein or regexes (see the sketch after this list)
create a backend page with a list of "original" vs "match", with the possibility to mark wrong matches with a single click
create a simple interface to do a manual matching for bad hits
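A minimal PHP sketch of the matching step, assuming a made-up base list and an arbitrary Levenshtein threshold of 3 (tune both to your data):
// Hypothetical base list of canonical ingredients.
$baseIngredients = array('flour', 'apple', 'ham');
function matchIngredient($raw, array $baseIngredients, $maxDistance = 3) {
    $raw = strtolower(trim($raw));
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($baseIngredients as $base) {
        // A plain substring check catches phrases like "dozen of fresh apples".
        if (strpos($raw, $base) !== false) {
            return $base;
        }
        $distance = levenshtein($raw, $base);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $base;
        }
    }
    // Close enough counts as a match; everything else goes to the review page.
    return $bestDistance <= $maxDistance ? $best : null;
}
echo matchIngredient('apples', $baseIngredients);                // apple
echo matchIngredient('dozen of fresh apples', $baseIngredients); // apple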
You might have some luck with MySQL's SOUNDEX() function, assuming that the words are similar enough and, probably, simple enough.
Documentation can be found here: https://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Basically, what it does is reduce a given word to a four character string representing it. The string should be the same for any two words that sound largely the same.
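For example, checking a new ingredient against an existing table could look roughly like this (a sketch assuming a hypothetical ingredients table with a name column; SOUNDEX() only really works on single words):
// Hypothetical connection and table/column names.
$pdo = new PDO('mysql:host=localhost;dbname=recipes', 'user', 'pass');
$stmt = $pdo->prepare('SELECT name FROM ingredients WHERE SOUNDEX(name) = SOUNDEX(?)');
$stmt->execute(array('apples'));
$similar = $stmt->fetchAll(PDO::FETCH_COLUMN);
// If $similar is not empty, ask the user whether they meant one of those instead.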
In MySQL you can use the SOUNDEX() function.
If you want to implement it in PHP, there are the levenshtein and similar_text functions.
Current Situation:
I am currently running a keyword search using multiple keywords in PHP and SQL. The field I'm applying the search to is the title field, which is a VARCHAR(250) field.
A user can input a single keyword, e.g. "apple" or also multiple, e.g. "apple banana yellow". The first option is trivial. For the second option, my current algorithm works like this:
Try and find items that match the exact entire string "apple banana yellow" in the title. Order the results by index id.
If no more results matching the exact entire string are found, or if none are found in the first place, search for all titles containing either "apple", "banana", or "yellow". Order the results by index id.
The algorithm is very basic but, funnily enough, works pretty well.
What I'm looking for:
However, I am now looking to implement a smarter search algorithm without having to rely on external paid services like Amazon's. I'm looking for a way to implement the following:
fuzzy search (I've read about SOUNDEX or Levenshtein, which may achieve this)
smarter keyword search (don't just return items that match ALL of the words or JUST A SINGLE WORD, but maybe also items matching 2 or 3 of the words)
order by relevance/likeness (Order by likeness of the search to the title, and not just the index id)
(Bonus: maybe even implement search for exact strings, like using " " on google to find exactly the words between the quotation marks)
What is the best way to get started with such a search? I am using InnoDB for MySQL.
Assuming MySQL, you can add a FULLTEXT index. Then there are a number of functions that will allow you to do basic searches that meet all the needs you list: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
You end up using syntax like:
SELECT * FROM table_name WHERE MATCH(column_with_fulltext_index_on_it)
AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE)
To see the match score:
SELECT column_with_fulltext_index_on_it, MATCH(column_with_fulltext_index_on_it)
AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE) AS score FROM table_name WHERE MATCH(column_with_fulltext_index_on_it)
AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE)
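From PHP, creating the index and running that first query might look roughly like this (a sketch reusing the placeholder table/column names above; InnoDB has supported FULLTEXT since MySQL 5.6):
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
// One-time setup: add the FULLTEXT index.
$pdo->exec('ALTER TABLE table_name ADD FULLTEXT (column_with_fulltext_index_on_it)');
// Natural language search with the user's keywords bound as a parameter.
$stmt = $pdo->prepare(
    'SELECT * FROM table_name
     WHERE MATCH(column_with_fulltext_index_on_it) AGAINST(? IN NATURAL LANGUAGE MODE)'
);
$stmt->execute(array('apple banana yellow'));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);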
There can be a bit of a learning curve in working out how to tweak the MATCH clause for your needs, but your examples seem pretty basic (except the smarter search).
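For instance, boolean mode gets you close to the quoted-phrase search in your bonus point; a minimal sketch, reusing the $pdo handle and placeholder names from above:
// In BOOLEAN MODE, double quotes inside the search string require the exact phrase.
$stmt = $pdo->prepare(
    'SELECT * FROM table_name
     WHERE MATCH(column_with_fulltext_index_on_it) AGAINST(? IN BOOLEAN MODE)'
);
$stmt->execute(array('"apple banana"'));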
Also, good to note: there are system settings that control the min/max length of the words/tokens that get indexed. You can read https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html to get a deeper understanding of the indexing options. Percona is a good resource as well: https://www.percona.com/blog/2013/02/26/myisam-vs-innodb-full-text-search-in-mysql-5-6-part-1/ (typically more digestible than the MySQL docs).
If you need to do more complex searches, you can look at adding other technologies like Solr. But I've always recommended getting the basics working with what you've got; only adopt a new technology if you hit a brick wall, or if you have good metrics on the existing solution and know the new tech will somehow improve things (speed, storage space, quality of results, etc.). If you can't quantify it, stick to the basics until you can.
Here's a good tutorial: http://www.w3resource.com/mysql/mysql-full-text-search-functions.php
I'm pulling in data from 6 live feeds which sometimes have slightly different formatting, i.e. I might have
'arsenal' and 'arsenal fc'
'T Walcot' and 'Theo Walcot' and 'T. Walcot'
What I was wondering is: is there a simple way to check whether the strings match each other, on the basis that if they share a certain % of letters in the same order they would be considered the same?
I suppose I could set up a list of related words and terms, but this would mean having to set it up in advance. I was wondering if there was an easier, on-the-fly automated way, as I won't be able to compile a full list for a long time.
There's a function just for that:
similar_text('Theo Walcott', 'T. Walcott', $similarity);
echo $similarity;
Have a look at the soundex function http://php.net/soundex and the similar_text function to get a percentage of similarity.
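A tiny sketch applying both to your first feed example:
// Phonetic comparison: both strings reduce to the same 4-character code here.
var_dump(soundex('arsenal') === soundex('arsenal fc')); // bool(true)
// Character-level similarity: the third argument receives a percentage.
similar_text('arsenal', 'arsenal fc', $percent);
echo $percent; // roughly 82
// In practice you'd accept the match above some threshold, e.g. 80%.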
I've got a website which lists sports scores. It currently works like this:
Firstname Lastname 1-0 Firstname Lastname
It explodes this based on spaces, then explodes the third element (containing the score) based on the "-".
The problem with this is that it does not support names with more than 2 words. If I explode using "-" first, it would not support names with a "-" in them. The results are added in a textarea, because I have many thousands that need to be added, so I don't want to make multiple fields to input data into; as it is, I can add matches quickly, listing one result per line. Does anyone have advice on how to make the system handle both multi-word names and special characters? Is there maybe a way to split when it encounters a number, then select the first chunk as the first name, the last as that player's score, and the rest as the last name?
I don't know if there's any way to teach a simple parsing command, or even a regular expression, to do what you want. Some cases will always be ambiguous. For example, if you have the names "Mary Ann Steiner" and "Constantin Van Dyke", the patterns are exactly the same, but one needs to be split (2/1) and the other needs to be split (1/2).
You could possibly find a library that knows how to make educated guesses based on a huge dictionary of known names, but failing that...
I think in this case you need the human brain inputting the data to make some of the decisions, and indicate them during data entry. In my experience using multiple fields isn't that slow if you navigate using the tab key instead of mousing around. You could also enter the data using a delimiter of your own, like:
Mary Ann,Steiner,2-3
Constantin,Van Dyke,4-2
Then you'd run something that explodes those lines based on "," and enters the elements into your db.
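A rough sketch of that step, assuming the results arrive in a textarea named "results" (a made-up field name) with one line per match:
$lines = preg_split('/\R/', trim($_POST['results']));
foreach ($lines as $line) {
    $parts = array_map('trim', explode(',', $line));
    if (count($parts) !== 3) {
        // Flag malformed lines for manual review instead of guessing.
        continue;
    }
    list($firstNames, $lastName, $score) = $parts;
    list($homeScore, $awayScore) = explode('-', $score);
    // ...insert $firstNames, $lastName, $homeScore and $awayScore into the db
}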
If you're copy/pasting or scraping the data from an external site, another option would be to just explode every line using the method you're currently using. This should work for most records, and when it doesn't work, it will be obvious -- the resulting record will have too many elements. You can have your script flag just those records for human intervention.
Can you give me some tips on how I can generate a suggestion based on the word entered by the user? It's not a misspelling thing; I want that when a user enters the word "hello" and the database does not contain the word "hello" but does contain "helo" or "helol", those words get suggested.
Thank you.
FYI, you should look into PHP's levenshtein function. It finds the closest matching words based on a score, using a dictionary file... I know you said it's not misspelling, but the dictionary file can be anything, and you can have more than one, depending on how you want to use it.
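The usual pattern (close to the example in the PHP manual) is to compute the distance to every word in your own word list and suggest the closest one; here $words is just a made-up stand-in for whatever you load from your database or dictionary file:
$input = 'hello';
$words = array('helo', 'helol', 'world'); // hypothetical word list
$closest = null;
$shortest = -1;
foreach ($words as $word) {
    $distance = levenshtein($input, $word);
    if ($distance === 0) {
        $closest = $word;   // exact match, no suggestion needed
        $shortest = 0;
        break;
    }
    if ($shortest < 0 || $distance < $shortest) {
        $closest = $word;
        $shortest = $distance;
    }
}
echo ($shortest === 0) ? "Found: $closest" : "Did you mean: $closest?";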
It will be way too complex to do with MySQL alone. You need to index commonly used words using something like Sphinx Search (a stand-alone full text search engine) and then run the queries against Sphinx.
There is a pretty good thread about it at http://sphinxsearch.com/forum/view.html?id=5898
You can use the Soundex function and compare the submitted string to a dictionary database, i.e.:
soundex("Hellllo") == soundex("Hello");
All you have to do is store the soundex of each suggestion in a table. Then, when a user submits a word, you can search for its soundex hash in your table and return the words with the same / close pronunciation.
The soundex method is reasonably fast, but your dictionary table has to be indexed if you need performance.
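A sketch of that setup, assuming a hypothetical suggestions table with word and word_soundex columns and an existing PDO connection in $pdo:
// When adding a word, store its soundex hash alongside it.
$stmt = $pdo->prepare('INSERT INTO suggestions (word, word_soundex) VALUES (?, ?)');
$stmt->execute(array('hello', soundex('hello')));
// When the user submits a word, look up everything with the same hash.
$stmt = $pdo->prepare('SELECT word FROM suggestions WHERE word_soundex = ?');
$stmt->execute(array(soundex($userInput)));
$suggestions = $stmt->fetchAll(PDO::FETCH_COLUMN);
// An index on word_soundex keeps this lookup fast.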
I am doing an experimental project.
What I am trying to achieve is to find out what the keywords in a given text are.
The way I am trying to do this is by making a list of how many times each word appears in the text, sorted with the most used words at the top.
But the problem is that some common words like is, was, were are always at the top. Obviously these are not worth keeping as keywords.
Can you suggest some good logic for this, so that it always finds good, related keywords?
Use something like a Brill Parser to identify the different parts of speech, like nouns. Then extract only the nouns, and sort them by frequency.
Well you could use preg_split to get the list of words and how often they occur, I'm assuming that that's the bit you've got working so far.
Only thing I could think of regarding stripping the non-important words is to have a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc. Use this dictionary to filter out the unwanted words.
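Putting those two ideas together might look like this (the stop word list here is just a tiny made-up sample; extend it as needed):
$text = 'The apple was red and the apple was sweet';
$stopWords = array('a', 'an', 'and', 'i', 'is', 'the', 'was', 'were');
// Split on anything that is not a letter and lowercase everything.
$words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
// Count occurrences, drop the stop words, sort by frequency (highest first).
$counts = array_count_values($words);
foreach ($stopWords as $stop) {
    unset($counts[$stop]);
}
arsort($counts);
$keywords = array_slice($counts, 0, 10, true); // top candidate keywords
print_r($keywords); // apple => 2, plus red and sweet with 1 each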
Why are you doing this, is it for searching page content? If it is, then most back end databases offer some kind of text search functionality, both MySQL and Postgres have a fulltext search engine, for example, that automatically discards the unimportant words. I'd recommend using the fulltext features of the backend database you're using, as chances are they're already implementing something that meets your requirements.
My first approach to something like this would be more mathematical modelling than pure programming.
There are two "simple" ways you can attack a problem like this:
a) exclusion list (penalize a collection of words which you deem useless)
b) use a weight function which, for example, builds on the word length, so that small words such as prepositions (in, at, ...) and pronouns (I, you, me, his, ...) are penalized and hopefully fall mid-table (a minimal sketch follows this list)
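A minimal sketch of option b), with a deliberately crude weight (frequency times word length) just to show the idea:
// $counts maps each word to how often it occurs, e.g. from array_count_values().
$counts = array('in' => 12, 'apple' => 5, 'cultivation' => 3);
$weighted = array();
foreach ($counts as $word => $frequency) {
    // Short prepositions and pronouns score low even when they occur often.
    $weighted[$word] = $frequency * strlen($word);
}
arsort($weighted);
print_r($weighted); // cultivation => 33, apple => 25, in => 24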
I am not sure if this was what you were looking for, but I hope it helps.
By the way, I know that contextual text processing is a subject of active research, you might find a number of projects which may be interesting.