I'm extremely new to Solr, so go easy on me :)
I have a field that, for argument's sake, stores a product SKU. If the SKU in a document was 'SKU12345', how would I return that document when the query '1234' is entered?
I have previously tried using solr.EdgeNGramFilterFactory in the field type for the SKU, but unfortunately that only matches from the start of the string!
I want to avoid wildcards to keep performance optimal!
Thankssss :)
If you are new to Solr and beginning to implement features like this, I would recommend reading thoroughly through the chapter Understanding Analyzers, Tokenizers, and Filters of the reference guide. There are several ways to make your query match, and the best choice depends on what you need.
Arun's suggestion is not bad, but N-grams alone are geared more towards finding arbitrary fragments of words. You would need this if you want to do some sort of type-ahead or auto-completion, e.g. a user starts to type in an input field somewhere and you want to suggest previously made input that matches in fragments. If you try to make this match with N-grams alone, your index may become quite large, since you may be required to index all permutations of the words so as not to miss where the numbers/words start or end.
For your requirement I would tend to suggest the WordDelimiterFilter with splitOnNumerics="1". The input SKU12345 would then be indexed as follows:
SKU12345
12345
SKU
So if a user searches for 12345, this would match.
If you also want to match fragments of that - like 1234, as you said - I would place an N-GramFilter afterwards. You will then need to play around with minGramSize and maxGramSize, and you will want to keep the gap between the two values small, since the larger the gap, the bigger your index will become.
e.g.
* minGramSize=4 and maxGramSize=5, gap of 1, few permutations
* minGramSize=1 and maxGramSize=5, gap of 4, more permutations
This depends on how short the user input is allowed to be and still make a match.
If the input should only match from the start of a word and not hit fragments in the middle, I would suggest the EdgeNGramFilter as an even better choice than the N-GramFilter. It only generates fragments from the start of a word, not from the middle, which further reduces the index size and improves performance.
So if you want 2345 to match SKU12345 you need the N-gram filter; if only input like 1234, starting from the beginning of the number, needs to match SKU12345, the EdgeNGram filter will do.
You can also set side to "back" to generate the ngrams from right to left.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
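Putting that together, here is a rough sketch of what such a field type could look like in the schema. The type name, tokenizer choice and gram sizes are only examples you would adapt (keeping the gap between the gram sizes small, as described above); preserveOriginal and the word/number part options are additions so the unsplit SKU token also goes through the chain.

    <!-- sketch only: adapt the name, attributes and gram sizes to your schema -->
    <fieldType name="sku_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- splits SKU12345 into SKU and 12345, keeping the original token too -->
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1"
                generateWordParts="1" generateNumberParts="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- emits prefixes of every token, e.g. 123, 1234, 12345 -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- split the query the same way, but without generating grams -->
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1"
                generateWordParts="1" generateNumberParts="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With this, a query of 1234 matches the gram generated from 12345, while a query of SKU12345 still matches the preserved original token.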
I am attempting what I thought would be a simple exercise, but unless I'm missing a trick, it seems anything but simple.
I'm attempting to clean up user input from a form before saving it. The particular problem I have is with hyphenated town names. For example, take Bourton-on-the-Water. Assume the user has Caps Lock on, or puts spaces next to the hyphens, or any other screw-up that might come to mind. How do I, within reason, turn it into what it's meant to be?
You can use trim() to remove whitespace (or other characters) from the beginning and end of a string. You can also use explode() to break strings into parts by a specified character and then recreate your string as you like.
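As a rough sketch, here is one way to normalise the casing and the spacing around hyphens using preg_replace()/preg_split() rather than explode(). The list of lowercase joining words is just an assumption you would extend.

    <?php
    // Hypothetical helper: normalise casing and hyphen spacing in a town name.
    function normalizeTownName($input) {
        // collapse whitespace and remove spaces around hyphens
        $name = trim(preg_replace('/\s+/', ' ', $input));
        $name = preg_replace('/\s*-\s*/', '-', $name);

        // words that usually stay lowercase inside a name (assumption)
        $small = array('on', 'the', 'upon', 'in', 'under', 'by');

        $parts = preg_split('/([ -])/', strtolower($name), -1, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($parts as $i => $part) {
            if ($part === ' ' || $part === '-') {
                continue;
            }
            // capitalise unless it is a small joining word (always capitalise the first word)
            if ($i === 0 || !in_array($part, $small)) {
                $parts[$i] = ucfirst($part);
            }
        }
        return implode('', $parts);
    }

    echo normalizeTownName('BOURTON -ON-THE- WATER') . "\n"; // Bourton-on-the-Water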
I think the only way you can really accomplish this is by improving the way the user inputs their data.
For example use a postcode lookup system that enters an address based on what they type.
Or have an autocomplete from a predefined list of towns (similar to how Facebook shows towns).
To consider every possible permutation of Bourton On The Water / Bourton-On-The-Water etc... is pretty much impossible.
Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in set A and a string in set B.
Can you please help me in finding such an algorithm?
I normally program in PHP; a SQL-only solution would be even better, but any other language would do.
As Voitcus said before, you have to clean your data before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers which do not match the pattern, try to adjust them to it. Then you are able to look for duplicates.
Moreover, you should do the data cleaning before persisting the data, maybe into a separate column. You then don't have to care about that when looking for duplicates ... just to avoid performance peaks.
Algorithms like levenshtein() or similar_text() in PHP don't fit this use case very well.
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways; a regular expression would be best, but see below.
Then, if possible, you can find the country dialling code, if the user's country is known. If there is none, assume a default and add it to the string. The same probably applies to city codes. You can also try looking at the place the person lives, their zip code, etc.
At the end of this you should have uniform phone numbers which can be easily compared.
The other way is to compare strings with the country (and city) code removed.
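A small sketch of that normalisation step in PHP could look like this. The default country code, the handling of the "00" prefix and the rules in general are assumptions you would adapt to your data.

    <?php
    // Hypothetical normaliser: keep digits only and force a default country code.
    function normalizePhone($raw, $defaultCountryCode = '1') {
        // strip everything that is not a digit
        $digits = preg_replace('/\D+/', '', $raw);

        // "00" is a common way of writing the international prefix
        if (substr($digits, 0, 2) === '00') {
            $digits = substr($digits, 2);
        } elseif (substr($digits, 0, strlen($defaultCountryCode)) !== $defaultCountryCode) {
            // no country code given: assume the default one (naive check)
            $digits = $defaultCountryCode . $digits;
        }
        return $digits;
    }

    echo normalizePhone('001-555-123456') . "\n";           // 1555123456
    echo normalizePhone('555-123456 call me later') . "\n"; // 1555123456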
About searching for "the longest common substring": once filtered like this, the numbers are already identical, but you might still need it, e.g. if someone typed "call me after 6 p.m.". If you're sure that the phone number is always at the beginning, and that nobody typed something like 555-SUPERMAN (which translates to 555-78737626), another possibility is to cut the string off at the first letter (removing that letter and everything after it).
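If you do end up needing it, a brute-force sketch in PHP could look like this. The function names are mine, and the quadratic approach is only reasonable for small sets of numbers per profile.

    <?php
    // Longest common substring of two strings (simple O(n*m) dynamic programming).
    function longestCommonSubstring($a, $b) {
        $best = '';
        $lenA = strlen($a);
        $lenB = strlen($b);
        $prev = array_fill(0, $lenB + 1, 0);
        for ($i = 1; $i <= $lenA; $i++) {
            $curr = array_fill(0, $lenB + 1, 0);
            for ($j = 1; $j <= $lenB; $j++) {
                if ($a[$i - 1] === $b[$j - 1]) {
                    $curr[$j] = $prev[$j - 1] + 1;
                    if ($curr[$j] > strlen($best)) {
                        $best = substr($a, $i - $curr[$j], $curr[$j]);
                    }
                }
            }
            $prev = $curr;
        }
        return $best;
    }

    // Longest common substring between any string of set A and any string of set B,
    // comparing only the digits.
    function longestCommonAcrossSets(array $setA, array $setB) {
        $best = '';
        foreach ($setA as $a) {
            foreach ($setB as $b) {
                $lcs = longestCommonSubstring(preg_replace('/\D/', '', $a),
                                              preg_replace('/\D/', '', $b));
                if (strlen($lcs) > strlen($best)) {
                    $best = $lcs;
                }
            }
        }
        return $best;
    }

    echo longestCommonAcrossSets(array('001-555-123456'),
                                 array('555-123456 call me in the evening')); // 555123456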
There is also the possibility of filtering such data in the SQL statement. Consider something like SELECT ..., [your trimming function](phone_number) AS trimmed_phone ... WHERE (trimmed_phone does not consist of numeric characters only) GROUP BY trimmed_phone. If the trimming function removes only whitespace and special dividers like -, +, . (commonly in use in Germany), and perhaps ",", this query would leave you with all phone numbers that are trimmed but still contain non-numeric characters -- take a look at the results, probably mostly digits and letters. How many of them are there? Maybe they have something in common? Maybe there are some typical phrases you can filter out too?
If such a query does not return too many rows, maybe it's easier just to clean them up by hand?
I am in need of a lightweight fast search solution.
Today I use Fulltext in boolean mode, where every searchword is mandatory in the results.
The function is fast, working and meets the requirements.
BUT some of the fulltext limitations (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) have turned out to be a problem. The site is on a hosted server and I'm not allowed to change the MySQL settings (e.g. the minimum word length).
E.g.
the search must be able to find red, 11 and ab.cd, which today's fulltext solution can't.
http://sphinxsearch.com/ is what you're looking for
though you have to understand that the smaller the words you want to find, the bigger the indexes you will use.
Use Lucene; it's very often used together with MySQL, and it'll be both faster and more featureful.
Using the built-in FTS engine is relatively bad practice, especially since it doesn't work with the slightly more reliable InnoDB engine.
The only thing that comes to mind would be basing your search on the number of occurrences you can find. Your actual indexing method could vary, depending on what the DB offers.
Assuming DB size isn't an issue, a (very) basic approach would be to break the search blobs (say, a post on Stack Overflow) into individual words, normalize them (remove plurals, strip 'logic' words such as 'and', etc.), then insert each word as a new record, together with the ID that identifies your indexed resource.
Count the instances of the ID, order by count, higher number = more relevant.
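As a very rough PHP sketch of that idea (the stop-word list, the naive plural stripping and the row layout are assumptions):

    <?php
    // Rough sketch: turn a text blob into (word, document_id, count) rows.
    function indexWords($documentId, $text, array $stopWords) {
        // lower-case and split on anything that is not a letter or a digit
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

        $counts = array();
        foreach ($words as $word) {
            // very naive plural stripping
            if (strlen($word) > 3 && substr($word, -1) === 's') {
                $word = substr($word, 0, -1);
            }
            if (in_array($word, $stopWords)) {
                continue;
            }
            if (!isset($counts[$word])) {
                $counts[$word] = 0;
            }
            $counts[$word]++;
        }

        // one row per distinct word; at query time you would count matching
        // document IDs per search word and order by that count
        $rows = array();
        foreach ($counts as $word => $count) {
            $rows[] = array('word' => $word, 'document_id' => $documentId, 'count' => $count);
        }
        return $rows;
    }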
Not exactly my field though, so tread carefully! =]
I'd recommend you try distance searching: Levenshtein
Or search for "N-gram fulltext indexing".
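For instance, PHP's built-in levenshtein() gives you an edit distance you could use to rank candidate terms pulled from your index. This is only a toy example, not a full search solution:

    <?php
    // Rank candidate terms by edit distance to the query (smallest distance first).
    $query      = 'ab.cd';
    $candidates = array('abcd', 'red', '11', 'ab.cd');

    usort($candidates, function ($a, $b) use ($query) {
        return levenshtein($query, $a) - levenshtein($query, $b);
    });

    print_r($candidates); // 'ab.cd' first, then 'abcd', ...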
I haven't mucked around with it, but I read the theory of full text searching (with mysql at least) a little while back.
If memory serves me correctly, you can use fulltext search for what you want, but you need to change the configuration (and I think recompile) to get it to work on a smaller number of search characters. I think the minimum word length is set to 4 characters by default. You'll want to change it to 2 characters, with a few other options thrown in, and then test the results you get.
Someone correct me if this is wrong; I would rather not send him off on a red herring.
I am doing an experimental project.
What I am trying to achieve is to find out what the keywords in that text are.
The way I am trying to do this is to make a list of how many times each word appears in the text, sorted with the most used words at the top.
But the problem is that some common words like 'is', 'was' and 'were' are always at the top, and obviously these are not worth much as keywords.
Can you suggest some good logic for doing this, so that it always finds good, related keywords?
Use something like a Brill Parser to identify the different parts of speech, like nouns. Then extract only the nouns, and sort them by frequency.
Well, you could use preg_split() to get the list of words and how often they occur; I'm assuming that's the bit you've got working so far.
The only thing I can think of regarding stripping out the unimportant words is to have a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc., and to use this dictionary to filter out the unwanted words.
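For example (the stop-word list here is just a small sample you would extend):

    <?php
    $text = 'Solr is a search server. A search server indexes text and makes the text searchable.';

    // small sample stop-word list -- extend as needed
    $stopWords = array('a', 'i', 'the', 'and', 'is', 'was', 'were', 'in', 'that', 'this');

    // split into lower-cased words and count them
    $words  = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);

    // drop the unwanted words and sort by frequency, most used first
    $counts = array_diff_key($counts, array_flip($stopWords));
    arsort($counts);

    print_r($counts); // 'search', 'server' and 'text' come out on top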
Why are you doing this? Is it for searching page content? If it is, then most back-end databases offer some kind of text search functionality; both MySQL and Postgres have a fulltext search engine, for example, which automatically discards the unimportant words. I'd recommend using the fulltext features of the back-end database you're using, as chances are it already implements something that meets your requirements.
My first approach to something like this would be more mathematical modelling than pure programming.
There are two "simple" ways you can attack a problem like this:
a) exclusion list (penalize a collection of words which you deem useless)
b) use a weight function which, for example, builds on the word length, so that small words such as prepositions (in, at...) and pronouns (I, you, me, his...) are penalized and hopefully fall to mid-table
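For option b), a tiny PHP sketch of one possible weight function, taking the word counts you already have; the exact formula (frequency times a length-based factor) is just an assumption to experiment with:

    <?php
    // Weight each word by frequency times a length-based factor, so short
    // function words (in, at, I, ...) sink towards the bottom of the table.
    function weightedKeywords(array $wordCounts) {
        $weights = array();
        foreach ($wordCounts as $word => $count) {
            $weights[$word] = $count * log(1 + strlen($word));
        }
        arsort($weights);
        return $weights;
    }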
I am not sure if this was what you were looking for, but I hope it helps.
By the way, I know that contextual text processing is a subject of active research, so you might find a number of projects which may be interesting.
I have a particular problem and need to know the best way to go about solving it.
I have a PHP string that can contain a number of keywords (tags, actually). For example:
"seo, adwords, google"
or
"web development, community building, web design"
I want to create pools of keywords that are related, e.g. all SEO / online marketing related keywords, or all web development related keywords.
I want to check the keyword/tag string against these pools of keywords, and if, for example, seo or adwords is contained within the keyword string, it is matched against the keyword pool for online marketing and a particular piece of content is served.
I would like to know the best way of coding this. I'm guessing some kind of hash table or array, but I'm not sure of the best way to approach it.
Any ideas?
Thanks
Jonathan
Three approaches come to mind, although I'm sure there could be more. Of course, in any case, I would store the values in a database table (or a config file, or whatever, depending on your application) so they can be edited easily.
1) Easiest: Convert the list into a regular expression of the form "keyword1|keyword2|keyword3" and see if the input matches.
2) Medium: Add the words to a hashtable, then split the input into words (you may have to use regular expression replacing to remove punctuation) and try to find each word of input in the hashtable.
3) Hardest: This may not work depending on your exact situation, but if all the possible content can be indexed by a search solution (like Apache SOLR, for example) then your list of keywords could be used as a search string and you could return results above a particular level of relevance.
It's hard to know exactly which solution would work best without knowing more about your source data. A large number of keywords may jam up a regular expression, but if it's a short list then it might work great. If your inputs are long then #2 won't work so well because you have to test each and every input word. As always your mileage may vary, so I would start with the easiest solution I thought would work and see if the performance is acceptable.
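For illustration, here is a rough PHP sketch of options 1 and 2. The pool names and keywords are only examples; in practice you would load them from your table or config file.

    <?php
    // Example pools -- in practice these come from a database table or config file.
    $pools = array(
        'online marketing' => array('seo', 'adwords', 'google'),
        'web development'  => array('web development', 'community building', 'web design'),
    );

    $input = 'seo, adwords, google';

    // Option 1: one regular expression per pool ("keyword1|keyword2|keyword3")
    foreach ($pools as $pool => $keywords) {
        $pattern = '/\b(' . implode('|', array_map('preg_quote', $keywords)) . ')\b/i';
        if (preg_match($pattern, $input)) {
            echo "regex match: $pool\n";
        }
    }

    // Option 2: a lookup table keyed by keyword, checked tag by tag
    $lookup = array();
    foreach ($pools as $pool => $keywords) {
        foreach ($keywords as $keyword) {
            $lookup[strtolower($keyword)] = $pool;
        }
    }
    foreach (array_map('trim', explode(',', strtolower($input))) as $tag) {
        if (isset($lookup[$tag])) {
            echo "lookup match: $tag -> {$lookup[$tag]}\n";
        }
    }

Option 2 has the advantage that the lookup stays fast even with a large number of keywords, whereas the single regular expression in option 1 is simpler but can get unwieldy as the list grows.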