Here is the task.
I need to recognize whether a string contains a town name.
In other words: recognize a town in some text.
As input I have the text to search and a geocode.
Depending on the geocode, a list of towns is loaded from the DB.
My current implementation loops over that list of towns and tries to match each one, using short-circuit evaluation.
Like:
if (stripos($text, $currentTown) !== false &&
    preg_match('#\b' . preg_quote($currentTown, '#') . '\b#i', $text)) {
    // add town to recognized list
}
The problem is that for a large list, e.g. the towns of the UK (about 40,000), the loop takes "quite a while".
So my question is: how do I optimize the recognition time?
Maybe there is some advanced way to search the array?
Any ideas are welcome.
Thanks.
Although my first instinct would be to use MySQL full-text search, I will attempt to solve your problem as stated. I will start with the suggestions that should give the best results.
Keep all your town data in lowercase (or at least the field you search in) and call $text = strtolower($text); once before searching, so that you can use strpos(). A case-sensitive search is faster than a case-insensitive one.
Why bother with preg_match() when stripos() already does 99% of the same work? You can skip it unless you really need the word-boundary check.
Perhaps add small sanity checks: if strlen($text) < 4, don't even try to search, as it gives horrible results.
Order your data by length (this is expensive, so do it once and store the result) and skip the towns that are longer than the input.
Order your data alphabetically and only go through the part that matches the first letter (or even the first and second letters).
Possibly cache results or searches. Then you only have to look in your cache to see whether it has a matching row (though a cache miss hurts).
If you have large data sets, maybe PHP's Iterator interface can help out. It might speed up the process of going over each record.
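A few of the tips above can be combined into one loop. This is only a minimal sketch: recognizeTowns() and the $minLength parameter are hypothetical names, and it assumes the town list is already stored in lowercase. It keeps the word-boundary check, but runs it only for towns that pass the cheap strpos() test.

```php
<?php
// Hypothetical helper; $towns is assumed to be lowercase already.
function recognizeTowns(string $text, array $towns, int $minLength = 4): array
{
    // Lowercase the text once so the cheaper case-sensitive strpos() can be used.
    $text = strtolower($text);
    $textLength = strlen($text);
    $found = [];

    // Very short inputs give poor results, so skip them entirely.
    if ($textLength < $minLength) {
        return $found;
    }

    foreach ($towns as $town) {
        // A town longer than the text cannot be contained in it.
        if (strlen($town) > $textLength) {
            continue;
        }
        // Cheap substring test first.
        if (strpos($text, $town) === false) {
            continue;
        }
        // Word-boundary check only for the few towns that got this far.
        if (preg_match('#\b' . preg_quote($town, '#') . '\b#', $text) === 1) {
            $found[] = $town;
        }
    }

    return $found;
}

print_r(recognizeTowns('I moved from Bath to Oxford last year', ['bath', 'oxford', 'ford', 'york']));
```

Here "ford" is rejected by the boundary check even though it occurs inside "oxford".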
Related
I'm working on a program that needs to find similar text on a webpage. In SQL we have 400,000 search terms. For example, the search terms can be ‘San Miguel Pale Pilsen’, ‘Schaumburger Bali’ and ‘Rizmajer Cortez’.
Right now I check each word on the webpage against the database. For each word on the webpage I send a SELECT query with a LIKE '%word%' pattern. For each result I use similar_text() in PHP. If the word and the search term don't contain the same number of words, I take some extra words from the webpage to make them equal.
(And yes, I know that it isn't smart.)
The problem is that it takes a lot of time and the server has to work hard for it.
What is the best and fastest way to find similar text on a webpage?
The LIKE operator will always be slow if you start the pattern with a % wildcard, because a leading wildcard prevents MariaDB from using an index.
Since you need to find words at any position in the VARCHAR column, the best solution is to implement bona fide full-text search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, and the approach scales much better.
So, suppose I have a simple array of sentences. What would be the best way to search it based on user input, and return the closest match?
The Levenshtein functions seem promising, but I don't think I want to use them. User input may be as simple as highest mountain, in which case I'd want to search for the sentence in the array that has highest mountain. If that exact phrase does not exist, then I'd want to search for the sentence that has highest AND mountain, but not back-to-back, and so on. The Levenshtein functions work on a per-character basis, but what I really need is a per-word basis.
Of course, to some degree, Levenshtein functions may still be useful, as I'd also want to take into account the possibility of the sentence containing the phrase highest mountains (notice the S) or similar.
What do you suggest? Are there any systems for PHP that do this that already exist? Would Levenshtein functions alone be an adequate solution? Is there a word-based Levenshtein function that I don't know about?
Thanks!
EDIT - I have considered MySQL fulltext search, and I have also considered breaking both A) the input and B) each sentence into separate arrays of words, and then comparing them that way, using Levenshtein functions to account for variations in words (color, colour, colors, etc.). However, I am concerned that this method, though possibly clever, may be computationally taxing.
As I am not a fan of writing code for you, I would normally ask you what you have tried first. However, I was currently stuck on something, so took a break to write this:
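$results = array();
// First pass: look for the complete search term in each sentence
foreach ($array as $sentence) {
    if (stripos($sentence, $searchterm) !== false) {
        $results[] = $sentence;
    }
}
// Fallback: no full match, so search for each word separately
if (count($results) == 0) {
    $wordlist = explode(" ", $searchterm);
    foreach ($wordlist as $word) {
        foreach ($array as $sentence) {
            if (stripos($sentence, $word) !== false) {
                $results[] = $sentence;
            }
        }
    }
    // A sentence that matches several words would otherwise be listed several times
    $results = array_unique($results);
}
print_r($results);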
This will search an array of sentences for terms exactly. It will not find a result if you typed in "microsift" and the sentence had the word "Microsoft". It is case insensitive, so it should work better. If no results are found using the full term, it is broken up and searched by word. Hope this at least points you to a starting place.
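If typos such as "microsift" should still match, one possible extension is to compare word by word with PHP's levenshtein(). The sketch below is only an illustration: scoreSentence(), closestSentence() and the $maxDistance threshold are hypothetical names and values, not part of the code above. An exact phrase match scores one point more than the number of query words, so it always ranks above a sentence where every word only matched fuzzily.

```php
<?php
// Hypothetical scoring: exact phrase beats per-word fuzzy matches.
function scoreSentence(string $sentence, string $query, int $maxDistance = 1): int
{
    $sentenceLower = strtolower($sentence);
    $queryLower = strtolower(trim($query));
    $queryWords = preg_split('/\s+/', $queryLower, -1, PREG_SPLIT_NO_EMPTY);

    // The whole phrase appears verbatim: best possible score.
    if ($queryLower !== '' && strpos($sentenceLower, $queryLower) !== false) {
        return count($queryWords) + 1;
    }

    // Otherwise count the query words that have a close match in the sentence.
    $sentenceWords = preg_split('/[^a-z0-9]+/', $sentenceLower, -1, PREG_SPLIT_NO_EMPTY);
    $score = 0;
    foreach ($queryWords as $queryWord) {
        foreach ($sentenceWords as $sentenceWord) {
            if (levenshtein($queryWord, $sentenceWord) <= $maxDistance) {
                $score++;
                break; // count each query word at most once
            }
        }
    }
    return $score;
}

// Returns the best-scoring sentence, or null when nothing matches at all.
function closestSentence(array $sentences, string $query): ?string
{
    $best = null;
    $bestScore = 0;
    foreach ($sentences as $sentence) {
        $score = scoreSentence($sentence, $query);
        if ($score > $bestScore) {
            $best = $sentence;
            $bestScore = $score;
        }
    }
    return $best;
}

$sentences = [
    'Everest is the highest mountain on Earth.',
    'The highest peaks include many a mountain.',
    'Microsoft makes software.',
];
echo closestSentence($sentences, 'microsift'), "\n";
```

This compares every query word with every sentence word, so it is fine for a small array but would need an index for large collections.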
Check this: http://framework.zend.com/manual/en/zend.search.lucene.overview.html
Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
There are no built-in functions in PHP to do this. That is because what you are asking for involves search relevance, related terms, iterative searching, and many more complex operations that need to mimic human logic in searching. You can try looking for PHP-based search classes, although the ones that I know of are database search engines rather than array search classes. Making your own is prohibitively complex.
This is something I'm working on and I'd like input from the intelligent people here on StackOverflow.
What I'm attempting is a function to repair text based on combining various bad versions of the same text page. Basically this can be used to combine different OCR results into one with greater accuracy than any of them individually.
I start with a dictionary of 600,000 English words, that's pretty much everything including legal and medical terms and common names. I have this already.
Then I have 4 versions of the text sample.
Something like this:
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
I am attempting to combine the above to get an output which looks like:
$text = 'First text sample is this line.';
Don't tell me it's impossible, because it is certainly not, just very difficult.
I would very much appreciate any ideas anyone has towards this.
Thank you!
My current thoughts:
Just checking the words against the dictionary will not work, since some of the spaces are in the wrong place and occasionally the word will not be in the dictionary.
The major concern is repairing broken spacing. Once this is fixed, the most commonly occurring dictionary word can be chosen if one exists, or else the most commonly occurring non-dictionary word.
Have you tried using a longest common subsequence algorithm? These are commonly seen in the "diff" text comparison tools used in source control apps and some text editors. A diff algorithm helps identify changed and unchanged characters in two text samples.
http://en.wikipedia.org/wiki/Diff
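To make the idea concrete, here is a minimal sketch of the classic dynamic-programming longest common subsequence, applied to two of the OCR samples from the question. The function name longestCommonSubsequence() is hypothetical, and the sketch works on single-byte characters only.

```php
<?php
// Classic O(n*m) dynamic-programming LCS on single-byte strings.
function longestCommonSubsequence(string $a, string $b): string
{
    $n = strlen($a);
    $m = strlen($b);

    // $table[$i][$j] = LCS length of the first $i chars of $a and first $j chars of $b.
    $table = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                $table[$i][$j] = $table[$i - 1][$j - 1] + 1;
            } else {
                $table[$i][$j] = max($table[$i - 1][$j], $table[$i][$j - 1]);
            }
        }
    }

    // Walk back through the table to recover one LCS.
    $result = '';
    $i = $n;
    $j = $m;
    while ($i > 0 && $j > 0) {
        if ($a[$i - 1] === $b[$j - 1]) {
            $result = $a[$i - 1] . $result;
            $i--;
            $j--;
        } elseif ($table[$i - 1][$j] >= $table[$i][$j - 1]) {
            $i--;
        } else {
            $j--;
        }
    }
    return $result;
}

echo longestCommonSubsequence('Fir5t text', 'First te*t'), "\n";
```

The characters that drop out of the result ("5"/"s" and "x"/"*") are exactly the positions where the two readings disagree, which is where a vote or a dictionary lookup is needed.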
Some years ago I worked on an OCR app similar to yours. Rather than applying multiple OCR engines to one image, I used one OCR engine to analyze multiple versions of the same image. Each of the processed images was the result of applying different denoising technique to the original image: one technique worked better for low contrast, another technique worked better when the characters were poorly formed. A "voting" scheme that compared OCR results on each image improved the read rate for arbitrary strings of text such as "BQCM10032". Other voting schemes are described in the academic literature for OCR.
On occasion you may need to match a word for which no combination of OCR results will yield all the letters. For example, a middle letter may be missing, as in either "w rd" or "c tch" (likely "word" and "catch"). In this case it can help to access your dictionary with any of three keys: initial letters, middle letters, and final letters (or letter combinations). Each key is associated with a list of words sorted by frequency of occurrence in the language. (I used this sort of multi-key lookup to improve the speed of a crossword generation app; there may well be better methods out there, but this one is easy to implement.)
To save on memory, you could apply the multi-key method only to the first few thousand common words in the language, and then have only one lookup technique for less common words.
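A rough sketch of such a multi-key lookup follows. It is simplified to two keys (first letter and last letter) and marks an unread letter with a space; buildIndexes(), lookupPartial() and the tiny word list are hypothetical. The word list is assumed to be sorted by frequency, most common first, so the candidates come back in that order.

```php
<?php
// Build two indexes over a frequency-sorted word list:
// one keyed by first letter, one keyed by last letter.
function buildIndexes(array $wordsByFrequency): array
{
    $indexes = ['first' => [], 'last' => []];
    foreach ($wordsByFrequency as $word) {
        if ($word === '') {
            continue;
        }
        $indexes['first'][$word[0]][] = $word;
        $indexes['last'][substr($word, -1)][] = $word;
    }
    return $indexes;
}

// $partial uses a space for each letter the OCR could not read, e.g. "w rd".
function lookupPartial(string $partial, array $indexes): array
{
    if ($partial === '') {
        return [];
    }
    $first = $partial[0];
    $last = substr($partial, -1);

    // Pick whichever key is actually known.
    if ($first !== ' ') {
        $pool = $indexes['first'][$first] ?? [];
    } elseif ($last !== ' ') {
        $pool = $indexes['last'][$last] ?? [];
    } else {
        return [];
    }

    // Keep words of the same length whose known letters all agree.
    $matches = array_filter($pool, function (string $word) use ($partial): bool {
        if (strlen($word) !== strlen($partial)) {
            return false;
        }
        for ($i = 0; $i < strlen($partial); $i++) {
            if ($partial[$i] !== ' ' && $partial[$i] !== $word[$i]) {
                return false;
            }
        }
        return true;
    });
    return array_values($matches);
}

$indexes = buildIndexes(['the', 'word', 'with', 'catch', 'ward', 'cinch', 'couch']);
print_r(lookupPartial('w rd', $indexes));
```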
There are several online lists of word frequency.
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
If you want to get fancy, you can also rely on prior frequency of occurrence in the text. For example, if "Byrd" appears multiple times, then it may be the better choice if the OCR engine(s) reports either "bird" or "bard" with a low confidence score. You might load a medical dictionary into memory only if there is a statistically unlikely occurrence of medical terms on the same page--otherwise leave medical terms out of your working dictionary, or at least assign them reasonable likelihoods. "Prosthetics" is a common word; "prostatitis" less so.
If you have experience with image processing techniques such as denoising and morphological operations, you can also try preprocessing the image before passing it to the OCR engine(s). Image processing could also be applied to select areas after your software identifies the words or regions where the OCR engine(s) fared poorly.
Certain letter/letter and letter/numeral substitutions are common. The numeral 0 (zero) can be confused with the letter O, C for O, 8 for B, E for F, P for R, and so on. If a word is found with low confidence, or if there are two common words that could match an incompletely read word, then ad hoc shape-matching rules could help. For example, "bcth" could match either "both" or "bath", but for many fonts (and contexts) "both" is the more likely match since "o" is more similar to "c" in shape. In a long string of words such as a paragraph from a novel or magazine article, "bath" is a better match than "b8th."
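One simple way to apply such substitution rules is to generate every variant of the misread word and keep the ones found in the dictionary. The sketch below is an assumption-laden illustration: candidateWords() is a hypothetical name, the confusion table is a small made-up sample (the "5"/"s" and "1"/"l"/"i" pairs are taken from the question's examples, not from the list above), and the dictionary is assumed to be an array keyed by lowercase word.

```php
<?php
// Expand a misread word into all variants allowed by the confusion table,
// then keep only the variants that exist in the dictionary.
function candidateWords(string $read, array $confusions, array $dictionary): array
{
    $variants = [''];
    foreach (str_split(strtolower($read)) as $char) {
        // Each character may stand for itself or for any look-alike.
        $options = array_merge([$char], $confusions[$char] ?? []);
        $next = [];
        foreach ($variants as $prefix) {
            foreach ($options as $option) {
                $next[] = $prefix . $option;
            }
        }
        $variants = $next;
    }
    return array_values(array_filter(
        $variants,
        fn($variant) => isset($dictionary[$variant])
    ));
}

// Sample table: misread character => characters it may really be.
$confusions = [
    '0' => ['o'],
    '8' => ['b'],
    'c' => ['o'],
    '5' => ['s'],
    '1' => ['l', 'i'],
];
// Dictionary keyed by word for O(1) lookups.
$dictionary = ['both' => true, 'bath' => true, 'first' => true, 'line' => true];

print_r(candidateWords('bcth', $confusions, $dictionary));
```

The number of variants grows exponentially with the number of ambiguous characters, so this is only practical for single words, not whole lines.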
Finally, you could probably write a plugin or script to pass the results into a spellcheck engine that checks for noun-verb agreement and other grammar checks. This may catch a few additional errors. Maybe you could try VBA for Word or whatever other script/app combo is popular these days.
Tackling complex algorithms like this by yourself will probably take longer and be more error-prone than using a third-party tool. Unless you really need to program this yourself, you can check the Yahoo Spelling Suggestion API. They allow 5,000 requests per IP per day, I believe.
Others may offer something similar (I think there's a Bing API, too).
UPDATE: Sorry, I just read that they stopped this service in April 2011. They claim to offer a similar service called "Spelling Suggestion YQL table" now.
This is indeed a rather complicated problem.
When I wonder how to spell a word, the direct way is to open a dictionary. But what if it is a short but tricky phrase that I'm trying to spell correctly? One of my personal tricks, which works most of the time, is to ask Google. I put the phrase in quotes and count the results. Here is an example: entering "your very smart" on Google gives 13,600k pages. Entering "you're very smart" gives 20,000k pages. So the correct spelling is likely "you're very smart". And indeed it is ;)
Based on this concept, I guess your samples are, for the most part, correctly spelled (well, maybe not if you develop for a teens' gaming site...). Can you try to divide the samples into sub-pieces, without going all the way down to single words, and compare these by frequency? The most frequent piece is the most likely to be correctly spelled. Before this, you can already run a dictionary spellcheck with your 600,000 terms, so that small spelling mistakes are corrected up front. This should increase the frequency of the correct sub-pieces.
Dividing the sentences into pieces and finding the right "piece size" is also tricky.
What also concerns me a little: how do you extract the samples and match them up, so that you know they represent the same (or nearly the same) sentence? Your question seems to assume you have this, which also seems very complex to me.
Well, the above is just a general tip based on my personal, human experience. I don't know if this helps. This is obviously not a real answer and is not meant to be one.
You could try using Google n-grams to achieve this.
If you need to get the right string only by comparing the samples with each other, then something like this may help.
It is not finished yet, but it already gives some results.
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
function getRight($arr){
    $_final = '';
    $count = count($arr);
    $len = array();
    // Collapse runs of whitespace AND record the string lengths
    for ($i = 0; $i < $count; $i++) {
        $arr[$i] = preg_replace('/\s\s+/', ' ', $arr[$i]);
        $len[$i] = strlen($arr[$i]);
    }
    // Max length
    $_max = max($len);
    for ($i = 0; $i < $_max; $i++) {
        $_el = array();
        for ($j = 0; $j < $count; $j++) {
            // Shorter samples have no character at this position
            if (!isset($arr[$j][$i])) continue;
            // Count how often each character occurs at position $i
            $_letter = $arr[$j][$i];
            if (isset($_el[$_letter])) $_el[$_letter]++;
            else $_el[$_letter] = 1;
        }
        // Most frequent character at this position
        list($mostProbably) = array_keys($_el, max($_el));
        $_final .= $mostProbably;
        // If the most probable character is not a space
        if ((string)$mostProbably !== ' ') {
            // THERE NEEDS TO BE CODE FOR REMOVING THE SPACE FROM LINES WHERE $arr[$j][$i] is a space
        }
    }
    return $_final;
}
echo getRight($text);
Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing these tweets in a local MySQL database. I want to be able to compute trending topics, like Twitter does, that can be anywhere from 1 to 3 words in length.
Is it possible to write a script to do something like this in PHP and MySQL?
I've found answers on how to compute which terms are "hot" once you're able to get counts of the terms, but I'm stuck at the first part. How should I store the data in the database, and how can I count the frequency of terms in the database that are 1 to 3 words in length?
My recipe for trending topics:
1. fetch the tweets
2. split each tweet by spaces into an n-gram array (up to 3-grams if you want phrases of up to 3 words)
3. filter URLs, @usernames, common words and junk characters out of each array
4. count the frequency of every unique keyword / phrase
5. mute some junk words / phrases
Yes, you can do it in PHP and MySQL ;)
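The recipe above can be sketched in plain PHP as follows. This is a simplified illustration, not a finished implementation: countNgrams() and the stop-word list are hypothetical, the tweets are passed in as an array rather than read from MySQL, and stop words are removed before the n-grams are built, so a phrase may join two words that were separated by a stop word.

```php
<?php
// Count 1- to $maxWords-word phrases across a list of tweets.
function countNgrams(array $tweets, array $stopWords, int $maxWords = 3): array
{
    $stop = array_flip($stopWords);
    $counts = [];

    foreach ($tweets as $tweet) {
        $tokens = preg_split('/\s+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);

        // Filter out URLs, @mentions, punctuation and common words.
        $words = [];
        foreach ($tokens as $token) {
            if (preg_match('#^(https?://|@)#', $token)) {
                continue;
            }
            $token = trim($token, ".,!?;:\"'()#");
            if ($token === '' || isset($stop[$token])) {
                continue;
            }
            $words[] = $token;
        }

        // Slide a window of 1..$maxWords words over the cleaned tweet.
        $total = count($words);
        for ($n = 1; $n <= $maxWords; $n++) {
            for ($i = 0; $i + $n <= $total; $i++) {
                $phrase = implode(' ', array_slice($words, $i, $n));
                $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
            }
        }
    }

    arsort($counts); // most frequent phrases first
    return $counts;
}

$tweets = [
    'Loving the new iPhone camera http://t.co/x',
    '@bob the new iPhone camera is great',
    'New iPhone day!',
];
$counts = countNgrams($tweets, ['the', 'is']);
print_r(array_slice($counts, 0, 5, true));
```

To find what is trending rather than merely frequent, the same counts would be computed per time window and compared between windows.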
How about first decomposing your tweets into single-word tokens and calculating the number of occurrences of every word?
Once you have those, you could decompose the tweets into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.
You might also want to add some kind of dictionary of words you don't want to count.
What you need is either
document classification, or
automatic tagging.
Probably the second one. Only then can you track the popularity of the tags over time.
Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in the database (file, SQL table, whatever), run the regex and count the matches.
It depends on which way around you want to do it: count everything and subtract what is common, thereby finding what is truly trending, or look up a set list of phrases. In the first case, you'll find a lot that might not interest you and you'll need an extensive blocklist. In the other case, you'll need a huge whitelist.
To go beyond that, you need natural language processing tools to determine the meaning of what is said.
I am currently performing a full text search on my "pages" in a database. While users get the results they want, I am unable to give them relevant information about why in the world those particular results came up.
Specifications on what I am looking for:
I have HTML data, meaning that if you search for a term such as "test" and the resulting page contains <b>here is some test</b> page, I should be able to highlight the term without adversely affecting the HTML code on the page.
I only want to return a portion of the document, much like Google does, where the portion returned contains a good share of my search terms. How can I determine which portion contains the most terms? Would it be best to pick the section with the most term occurrences overall, or the section that covers the most distinct search terms, or a combination of both? Or should multiple snippets be included?
I would like to do this server side, if that is a viable option.
I am unsure what the best way of doing these two things is. I do know of one issue that is easily overlooked and needs to be taken into account:
a. Snipping off html data at random points can completely ruin the page if you are not careful, for example, not closing a div tag can throw my whole layout off. What are the best solutions around this?
What are the best methods for achieving a search system like the one above?
I would not keep the HTML formatting in the search results. That would make your results page very messy. It doesn't make sense to include headings, line breaks, images, paragraph margins, etc. in the result descriptions, especially if you're only going to print short excerpts of truncated content.
I think in most cases, a result that matches 100% of the search terms one time is going to be more relevant than a result that only matches 50% of the search terms repeated twice. But this also depends on the query.
Doing this server side is the only viable option, unless you want to send the client all of the result pages at once.
Since you're using MySQL's built-in fulltext search function, you can't really show the user why the results are what they are--not without a detailed understanding of how the fulltext search determines relevance. What you can do is show the user excerpts from each page that may be relevant to their search and may help them make useful determinations of which results to look into.
I would first strip the page content of any markup using strip_tags(), then explode() the content into an array of individual sentences. Then you could iterate through the array to determine the relevance of each sentence and then simply display the most relevant sentence(s) to the user. If the most relevant sentence is too long, then truncate it at word boundaries.
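$text = strip_tags($content);
$sentences = explode('. ', $text);
$relevance = array();
foreach ($sentences as $i => $sentence) {
    // calcRel() is described below
    $relevance[$i] = calcRel($sentence);
}
// Sort by relevance, highest first, keeping the sentence indexes as keys
arsort($relevance);
// Indexes of the two most relevant sentences
list($i, $j) = array_keys($relevance);
// Mark the gap if the two sentences are not adjacent
$ellips = (abs($i - $j) > 1 ? '...' : '');
// Put the two sentences in document order
if ($i < $j) {
    $description = $sentences[$i] . $ellips . $sentences[$j];
} else {
    $description = $sentences[$j] . $ellips . $sentences[$i];
}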
calcRel($sentence) would return a numeric value representing relevance calculated by:
Searching $sentence for the entire query string. For each occurrence, the relevance number would be increased by 2^n; where n is the number of words in the query string.
Search for partial matches--again weighted by 2^n; n being the number of words matched.
Search for individual query words, giving each match a weight of 1.
Lastly, in each of the above searches, the matching words/phrases should be removed from $sentence so they aren't counted more than once.
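One possible reading of these rules is sketched below. It is an interpretation, not the definitive calcRel(): "partial matches" are taken to mean contiguous sub-phrases of the query, matching is done on whole words, and the query is passed as an explicit second parameter (an assumption; the call shown earlier passes only the sentence).

```php
<?php
// Hypothetical calcRel(): longer phrase matches weigh 2^n, single words weigh 1.
function calcRel(string $sentence, string $query): int
{
    $sentence = strtolower($sentence);
    $words = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $relevance = 0;

    // Try the longest phrases first: the full query, then shorter sub-phrases,
    // and finally the individual words.
    for ($n = $total; $n >= 1; $n--) {
        // Single words of a multi-word query count 1; everything else counts 2^n.
        $weight = ($n === 1 && $total > 1) ? 1 : 2 ** $n;
        for ($start = 0; $start + $n <= $total; $start++) {
            $phrase = implode(' ', array_slice($words, $start, $n));
            $pattern = '/\b' . preg_quote($phrase, '/') . '\b/';
            // Remove each match so it is not counted again by a shorter phrase.
            $sentence = preg_replace($pattern, ' ', $sentence, -1, $matches);
            $relevance += $weight * $matches;
        }
    }
    return $relevance;
}

echo calcRel('The highest mountain is Everest, the highest of all.', 'highest mountain'), "\n";
```

In the example, the full phrase matches once (weight 4) and the leftover "highest" matches once more (weight 1).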
An alternate strategy could be just to scan the entire text for the search terms, recording the position of each match. Then using simple arithmetic, you can find the tightest cluster of search keywords and select your excerpt that way, truncating at word boundaries or sentence boundaries.
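A minimal sketch of that cluster approach: collect the positions of all matches, then pick the fixed-width window that starts at a match and covers the most matches. bestExcerpt() and the $width default are hypothetical, the text is assumed to be plain (markup already stripped), and trimming the window to word or sentence boundaries is left out.

```php
<?php
// Return the $width-byte window of $text that contains the most term matches.
function bestExcerpt(string $text, array $terms, int $width = 80): string
{
    // Record the position of every occurrence of every term.
    $positions = [];
    foreach ($terms as $term) {
        if ($term === '') {
            continue;
        }
        $offset = 0;
        while (($pos = stripos($text, $term, $offset)) !== false) {
            $positions[] = $pos;
            $offset = $pos + strlen($term);
        }
    }

    // No match at all: fall back to the start of the text.
    if ($positions === []) {
        return substr($text, 0, $width);
    }

    sort($positions);
    $bestStart = $positions[0];
    $bestCount = 0;
    $total = count($positions);

    // Try a window starting at each match and count the matches inside it.
    foreach ($positions as $index => $start) {
        $count = 0;
        for ($k = $index; $k < $total && $positions[$k] < $start + $width; $k++) {
            $count++;
        }
        if ($count > $bestCount) {
            $bestCount = $count;
            $bestStart = $start;
        }
    }

    return substr($text, $bestStart, $width);
}

$text = str_repeat('filler ', 20) . 'search engine results for search terms' . str_repeat(' filler', 20);
$excerpt = bestExcerpt($text, ['search', 'terms'], 40);
echo $excerpt, "\n";
```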
Try preg_match() together with preg_replace().
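For the highlighting part, one way to avoid damaging markup is to work on the plain-text excerpt and escape it while wrapping the matches. This is a sketch under assumptions: highlightTerms() is a hypothetical name, the input is assumed to have been passed through strip_tags() already, and the <mark> element is just one choice of wrapper. It uses preg_split() rather than preg_replace() so that each piece can be escaped separately.

```php
<?php
// Escape a plain-text excerpt for HTML output and wrap every term match in <mark>.
function highlightTerms(string $plainText, array $terms): string
{
    // Quote each non-empty term so it is matched literally.
    $quoted = array_map(
        fn($term) => preg_quote($term, '/'),
        array_filter($terms, 'strlen')
    );
    if ($quoted === []) {
        return htmlspecialchars($plainText);
    }

    // With one capture group, the odd-numbered parts are the matched terms.
    $pattern = '/(' . implode('|', $quoted) . ')/i';
    $parts = preg_split($pattern, $plainText, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($parts as $index => $part) {
        $escaped = htmlspecialchars($part);
        $html .= ($index % 2 === 1) ? "<mark>$escaped</mark>" : $escaped;
    }
    return $html;
}

echo highlightTerms('A test & more TEST', ['test']), "\n";
```

Escaping each piece separately means a search term can never match inside an entity such as &amp;amp;.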