I have a particular problem and need to know the best way to go about solving it.
I have a php string that can contain a number of keywords (tags actually). For example:-
"seo, adwords, google"
or
"web development, community building, web design"
I want to create a pool of keywords that are related, so all seo, online marketing related keywords or all web development related keywords.
I want to check the keyword / tag string against these pools of keywords and if for example seo or adwords is contained within the keyword string it is matched against the keyword pool for online marketing and a particular piece of content is served.
I wish to know the best way of coding this. I'm guessing some kind of hash table or array but not sure the best way to approach it.
Any ideas?
Thanks
Jonathan
Three approaches come to my mind, although I'm sure there could be more. Of course in any case I would store the values in a database table (or config file, or whatever depending on your application) so it can be edited easily.
1) Easiest: Convert the list into a regular expression of the form "keyword1|keyword2|keyword3" and see if the input matches.
2) Medium: Add the words to a hashtable, then split the input into words (you may have to use regular expression replacing to remove punctuation) and try to find each word of input in the hashtable.
3) Hardest: This may not work depending on your exact situation, but if all the possible content can be indexed by a search solution (like Apache SOLR, for example) then your list of keywords could be used as a search string and you could return results above a particular level of relevance.
It's hard to know exactly which solution would work best without knowing more about your source data. A large number of keywords may jam up a regular expression, but if it's a short list then it might work great. If your inputs are long then #2 won't work so well because you have to test each and every input word. As always your mileage may vary, so I would start with the easiest solution I thought would work and see if the performance is acceptable.
Related
I'm busy with a program that needs to find similar text on a webpage. In SQL we have 400.000 search terms. For example, the search terms can be ‘San Miguel Pale Pilsen’, ‘Schaumburger Bali’ and ‘Rizmajer Cortez’.
Now I'm checking each word on the webpage in the database. For each word on the webpage I send a select query with a %like% operator. For each result I use similar text with php. If the word and the search term aren’t equal to the amount of words in it, it will get some extra words of the webpage to make it equal.
(And yes I know that it isn’t smart)
The problem is it takes a lot of time and server must work hard for it.
What is the best and fastest way to find similar text on a webpage?
The LIKE operator will be always slow if you start the pattern with a % wild card. This happens since you are negating the ability of MariaDB to use any indexing.
Considering you need to find words in any location of the VARCHAR column the best solution is to implement bona fide Full Text Search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, not to mention scalability.
I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos, sometimes those writing instructions for their team in the game will misspell a building name and obviously the regex will then not pick it up (i.e. spelling "University" and "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that's able to handle a lot more load so I have as a last resort to algorithmically create versions of each word with a letter missing and then another with letters added in.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regex's for each special case.
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex() ? – Teifion (28 mins ago)
A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the days we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.
I'm extremely new to Solr so go easy on me :)
I have a field for arguments sake stores a product sku! If the sku in a document was 'SKU12345' - how would I return the document if the query '1234' was entered?
I have previously tried using solr.EdgeNGramFilterFactory in the field type specific for the SKU but unfortunately this only works as a string prefix!
I want to try and avoid wild cards to keep performance optimal!
Thankssss :)
If you are new to Solr and you are beginning to implement features like this, I would recommend to read thoroughly through the chapter Understanding Analyzers, Tokenizers, and Filters of the reference guide. Since there are several ways to make your query match, but the best choice would depend on what you need.
Arun's suggestion is not bad, but the Ngrams alone are more geared to find general fractions of words. You would need this, if you want to do some sort of type-ahead or auto-completion. e.g. a User starts to type within an input field somewhere and you want to suggest previously made input that does match in fractions. If you try to make this match with Ngrams alone, your index may become quite large. Since you maybe required to index all permutations of the words to not miss the place where numbers/words start or end.
For your requirement I would tend to suggest the WordDelimiterFilter with splitOnNumerics="1". So the input SKU12345 would be indexed as follows
SKU12345
12345
SKU
So if a user searches for 12345 this would make a match.
If you want to match also fragments of that - like you said 1234 - I would then place a N-GramFilter afterwards. Then you will need to play around with minGramSize and maxGramSize. You will want to keep the gap between the two values low. Since the higher the gap the bigger your index will become.
e.g.
* minGramSize=4 and maxGramSize=5, gap of 1, few permutations
* minGramSize=1 and maxGramSize=5, gap of 4, more permutations
This depend on how small the user input shall be allowed to make a match.
If only the input shall match only from the start and shall not hit fragments in the middle, I would suggest the EdgeN-GramFilter as even better choice over the N-GramFilter. This will only generate fragments from the start of a word, not from the middle. This will lead to further reduction of the index size and better performance.
So if you want to make 2345 match SKU12345 you need Ngram, if only input as 1234 shall match SKU12345 EdgeNgram will do.
You can also set side to "back" to generate the ngrams from right to left.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
This is something I'm working on and I'd like input from the intelligent people here on StackOverflow.
What I'm attempting is a function to repair text based on combining various bad versions of the same text page. Basically this can be used to combine different OCR results into one with greater accuracy than any of them individually.
I start with a dictionary of 600,000 English words, that's pretty much everything including legal and medical terms and common names. I have this already.
Then I have 4 versions of the text sample.
Something like this:
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
I attempting to combine the above to get an output which looks like:
$text = 'First text sample is this line.';
Don't tell me it's impossible, because it is certainly not, just very difficult.
I would very much appreciate any ideas anyone has towards this.
Thank you!
My current thoughts:
Just checking the words against the dictionary will not work, since some of the spaces are in the wrong place and occasionally the word will not be in the dictionary.
The major concern is repairing broken spacings, once this is fixed then then the most commonly occurring dictionary word can be chosen if exists, or else the most commonly occurring non-dictionary word.
Have you tried using a longest common subsequence algorithm? These are commonly seen in the "diff" text comparison tools used in source control apps and some text editors. A diff algorithm helps identify changed and unchanged characters in two text samples.
http://en.wikipedia.org/wiki/Diff
Some years ago I worked on an OCR app similar to yours. Rather than applying multiple OCR engines to one image, I used one OCR engine to analyze multiple versions of the same image. Each of the processed images was the result of applying different denoising technique to the original image: one technique worked better for low contrast, another technique worked better when the characters were poorly formed. A "voting" scheme that compared OCR results on each image improved the read rate for arbitrary strings of text such as "BQCM10032". Other voting schemes are described in the academic literature for OCR.
On occasion you may need to match a word for which no combination of OCR results will yield all the letters. For example, a middle letter may be missing, as in either "w rd" or "c tch" (likely "word" and "catch"). In this case it can help to access your dictionary with any of three keys: initial letters, middle letters, and final letters (or letter combinations). Each key is associated with a list of words sorted by frequency of occurrence in the language. (I used this sort of multi-key lookup to improve the speed of a crossword generation app; there may well be better methods out there, but this one is easy to implement.)
To save on memory, you could apply the multi-key method only to the first few thousand common words in the language, and then have only one lookup technique for less common words.
There are several online lists of word frequency.
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
If you want to get fancy, you can also rely on prior frequency of occurrence in the text. For example, if "Byrd" appears multiple times, then it may be the better choice if the OCR engine(s) reports either "bird" or "bard" with a low confidence score. You might load a medical dictionary into memory only if there is a statistically unlikely occurrence of medical terms on the same page--otherwise leave medical terms out of your working dictionary, or at least assign them reasonable likelihoods. "Prosthetics" is a common word; "prostatitis" less so.
If you have experience with image processing techniques such as denoising and morphological operations, you can also try preprocessing the image before passing it to the OCR engine(s). Image processing could also be applied to select areas after your software identifies the words or regions where the OCR engine(s) fared poorly.
Certain letter/letter and letter/numeral substitutions are common. The numeral 0 (zero) can be confused with the letter O, C for O, 8 for B, E for F, P for R, and so on. If a word is found with low confidence, or if there are two common words that could match an incompletely read word, then ad hoc shape-matching rules could help. For example, "bcth" could match either "both" or "bath", but for many fonts (and contexts) "both" is the more likely match since "o" is more similar to "c" in shape. In a long string of words such as a a paragraph from a novel or magazine article, "bath" is a better match than "b8th."
Finally, you could probably write a plugin or script to pass the results into a spellcheck engine that checks for noun-verb agreement and other grammar checks. This may catch a few additional errors. Maybe you could try VBA for Word or whatever other script/app combo is popular these days.
Tackling complex algorithms like this by yourself will probably take longer and be more error prone than using a third party tool - unless you really need to program this yourself, you can check the Yahoo Spelling Suggestion API. They allow 5.000 requests per IP per day, I believe.
Others may offer something similar (I think there's a bing API, too).
UPDATE: Sorry, I just read that they've stopped this service in April 2011. They claim to offer a similar service called "Spelling Suggestion YQL table" now.
This is indeed a rather complicated problem.
When I do wonder how to spell a word, the direct way is to open a dictionary. But what if it is a small complex sentence that I'm trying to spell correctly ? One of my personal trick, which works most of the time, is to call Google. I place my sentence between quotes on Google and count the results. Here is an example : entering "your very smart" on Google gives 13'600k page. Entering "you're very smart" gives 20'000k pages. Then, likely, the correct spelling is "you're very smart". And... indeed it is ;)
Based on this concept, I guess you have samples which, for the most parts, are correctly misspelled (well, maybe not if your develop for a teens gaming site...). Can you try to divide the samples into sub pieces, not going up to the words, and matching these by frequency ? The most frequent piece is the most likely correctly spelled. Prior to this, you can already make a dictionary spellcheck with your 600'000 terms to increase the chance that small spelling mistakes will alredy be corrected. This should increase the frequency of correct sub pieces.
Dividing the sentences in pieces and finding the right "piece-size" is also tricky.
What concerns me a little too : how do you extract the samples and match them together to know the correctly spelled sentence is the same (or very close?). Your question seems to assume you have this, which also seems something very complex for me.
Well, what precedes is just a general tip based on my personal and human experience. Donno if this can help. This is obviously not a real answer and is not meant to be one.
You could try using google n-grams to achieve this.
If you need to get right string only by comparing other. Then Something like this maybe will help.
It not finished yet, but already gives some results.
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
function getRight($arr){
$_final='';
$count=count($arr);
// Remove multi spaces AND get string lengths
for($i=0;$i<$count;$i++){
$arr[$i]=preg_replace('/\s\s+/', ' ',$arr[$i]);
$len[$i]=strlen($arr[$i]);
}
// Max length
$_max=max($len);
for($i=0;$i<$_max;$i++){
$_el=array();
for($j=0;$j<$count;$j++){
// Cheking letter counts
$_letter=$arr[$j][$i];
if(isset($_el[$_letter]))$_el[$_letter]++;
else$_el[$_letter]=1;
}
//Most probably count
list($mostProbably) = array_keys($_el, max($_el));
$_final.=$mostProbably;
// If probbaly example is not space
if($_el!=' '){
// THERE NEED TO BE CODE FOR REMOVING SPACE FROM LINES WHERE $text[$i] is space
}
}
return $_final;
}
echo getRight($text);
I am doing an experimental project.
What i am trying to achieve is, i want to find that what are the keywords in that text.
How i am trying to do this is i make a list of how many times a word appear in the text sorted by most used words at top.
But problem is some common words like is,was,were are always at top. Apparently these are not worth.
Can you people suggest me some good logic to do it, so it finds good related keywords always?
Use something like a Brill Parser to identify the different parts of speech, like nouns. Then extract only the nouns, and sort them by frequency.
Well you could use preg_split to get the list of words and how often they occur, I'm assuming that that's the bit you've got working so far.
Only thing I could think of regarding stripping the non-important words is to have a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc. Use this dictionary to filter out the unwanted words.
Why are you doing this, is it for searching page content? If it is, then most back end databases offer some kind of text search functionality, both MySQL and Postgres have a fulltext search engine, for example, that automatically discards the unimportant words. I'd recommend using the fulltext features of the backend database you're using, as chances are they're already implementing something that meets your requirements.
my first approach to something like this would be more mathematical modeling than pure programming.
there are two "simple" ways you can attack a problem like this;
a) exclusion list (penalize a collection of words which you deem useless)
b) use a weight function, which for ex. builds on the word length, thus small words such as prepositions (in, at...) and pronouns (I,you,me,his... ) will be penalized and hopefully fall mid-table
I am not sure if this was what you were looking for, but I hope it helps.
By the way, I know that contextual text processing is a subject of active research, you might find a number of projects which may be interesting.