My PHP script needs to check for matches throughout an array of data. It's currently looking for exact string matches. I'd like it to be less strict.
For example, if the array holds the string "Tom and Jerry", I would like to return true for "Tom Jerry", "Tom & Jerry", and maybe even "Tom and Jery". I found links to PHP search engines, but they are more complex than what I need. My data is fairly small and dynamic, so there's no indexing.
I know I could write a big hairy regular expression, but I'm pretty sure I would be reinventing the wheel, because I'm sure others have already done this. Any advice on where to look or how to approach this would be much appreciated.
EDIT: To clarify, I'm trying to avoid entering all the dynamically generated data into a DB.
If the data were in MySQL, you could use a full-text search. This is quite easy to develop; the question is whether that would be too heavyweight a solution.
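For reference, a sketch of what that query would look like, assuming a hypothetical titles table with a FULLTEXT index on the name column:

// Sketch only: natural-language mode matches on whole words and skips
// stop words such as 'and', so 'Tom Jerry' still finds 'Tom and Jerry'
// (though MySQL's default minimum indexed word length of 4 would skip 'Tom')
$sql = "SELECT name FROM titles WHERE MATCH(name) AGAINST('Tom Jerry')";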
It may require some trial and error, but you could do the following:
Make a manual list of words that may be absent, such as 'and', 'in', 'of', et cetera (as in your Tom Jerry example), and strip them from both strings before comparing.
Compute the edit (Levenshtein) distance between the stripped string and the search query; note that Hamming distance only applies to strings of equal length. If it is low (perhaps at most one or two), return true.
Otherwise, return false.
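A minimal sketch of that recipe, using PHP's built-in levenshtein() for the edit-distance step; the stop-word list and the distance threshold are assumptions to tune for your data:

// Strip assumed stop words, then let edit distance absorb small typos
function fuzzyMatch($haystack, $needle, $maxDistance = 2)
{
    $stopwords = array('and', 'in', 'of', 'the', '&');

    $normalize = function ($s) use ($stopwords) {
        $words = preg_split('/\s+/', strtolower(trim($s)), -1, PREG_SPLIT_NO_EMPTY);
        return implode(' ', array_diff($words, $stopwords));
    };

    return levenshtein($normalize($haystack), $normalize($needle)) <= $maxDistance;
}

var_dump(fuzzyMatch('Tom and Jerry', 'Tom Jerry'));    // bool(true)
var_dump(fuzzyMatch('Tom and Jerry', 'Tom & Jerry'));  // bool(true)
var_dump(fuzzyMatch('Tom and Jerry', 'Tom and Jery')); // bool(true)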
I just discovered two functions which appear to do what I want:
similar_text()
levenshtein()
Both return an integer representing the "closeness" of the match between two strings. The difference between the two is over my head.
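For anyone else comparing the two, a quick illustration of the difference (both functions ship with PHP):

$a = 'Tom and Jerry';
$b = 'Tom and Jery';

// similar_text() counts matching characters and can also report a
// percentage through its third (by-reference) argument
similar_text($a, $b, $percent);
echo similar_text($a, $b); // 12
echo round($percent, 1);   // 96

// levenshtein() counts the single-character edits (insertions,
// deletions, substitutions) needed to turn one string into the other
echo levenshtein($a, $b);  // 1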
My search was aided by this S.O. question.
I'm pulling in data from 6 live feeds which sometimes have slightly different formatting, i.e. I might have
'arsenal' and 'arsenal fc'
'T Walcot' and 'Theo Walcot' and 'T. Walcot'
What I was wondering is: is there a simple way to check whether the strings match each other, on the basis that if they have a certain % of letters in the same order they would be considered the same?
I suppose I could set up a list of related words and terms, but that would mean setting it up in advance. I was wondering if there is an easier, on-the-fly, automated way, as I won't be able to compile a full list for a long time.
There's a function just for that:
similar_text('Theo Walcott', 'T. Walcott', $similarity);
echo $similarity;
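That third argument receives the similarity as a percentage, which you can compare against a threshold of your choosing:

similar_text('Theo Walcott', 'T. Walcott', $similarity);
if ($similarity >= 70) { // ~82 here; the threshold is an assumption to tune
    // treat the two strings as the same player
    echo 'match';
}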
Have a look at the soundex() function (http://php.net/soundex) and at similar_text() to get a percentage of similarity.
So, suppose I have a simple array of sentences. What would be the best way to search it based on user input, and return the closest match?
The Levenshtein functions seem promising, but I don't think I want to use them. User input may be as simple as highest mountain, in which case I'd want to search for the sentence in the array that has highest mountain. If that exact phrase does not exist, then I'd want to search for the sentence that has highest AND mountain, but not back-to-back, and so on. The Levenshtein functions work on a per-character basis, but what I really need is a per-word basis.
Of course, to some degree, Levenshtein functions may still be useful, as I'd also want to take into account the possibility of the sentence containing the phrase highest mountains (notice the S) or similar.
What do you suggest? Are there any systems for PHP that do this that already exist? Would Levenshtein functions alone be an adequate solution? Is there a word-based Levenshtein function that I don't know about?
Thanks!
EDIT - I have considered MySQL full-text search, and have also considered breaking both A) the input and B) each sentence into separate arrays of words and comparing them that way, using Levenshtein functions to account for variations in words (color, colour, colors, etc.). However, I am concerned that this method, though possibly clever, may be computationally taxing.
As I am not a fan of writing code for you, I would normally ask what you have tried first. However, I was stuck on something, so I took a break to write this:
$results = array();

// First pass: look for the full search term as a substring (case-insensitive)
foreach ($array as $sentence) {
    if (stripos($sentence, $searchterm) !== false) {
        $results[] = $sentence;
    }
}

// Fall back to matching individual words if the full term found nothing
if (count($results) == 0) {
    $wordlist = explode(" ", $searchterm);
    foreach ($wordlist as $word) {
        foreach ($array as $sentence) {
            // the in_array() guard keeps a sentence from being added once per matching word
            if (stripos($sentence, $word) !== false && !in_array($sentence, $results)) {
                $results[] = $sentence;
            }
        }
    }
}

print_r($results);
This will search an array of sentences for the term exactly; it will not find a result if you typed "microsift" and the sentence contained the word "Microsoft". It is case-insensitive, so it should work better. If no results are found using the full term, the term is broken up and searched word by word. Hope this at least points you to a starting place.
Check this: http://framework.zend.com/manual/en/zend.search.lucene.overview.html
Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
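For context, $index there is an index handle you create or open first; a minimal sketch (the index path is just an example):

$index = Zend_Search_Lucene::create('/tmp/my-index'); // or ::open() for an existing index

$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);

// Query it later
$hits = $index->find('tom jerry');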
There are no built-in PHP functions to do this. That's because what you are asking for involves search relevance, related terms, iterative searching, and many more complex operations that mimic human logic in searching. You can try looking for PHP-based search classes, although the ones I know of are database search engines rather than array search classes. Writing your own is prohibitively complex.
I have a problem to solve in my PHP project where some keywords (from a few hundred to a few thousand, of varying lengths) need to be searched for in a string about 100-300 characters long, sometimes shorter at 30-50 characters. I can preprocess the keywords and reuse them across new search strings. I am fairly new to PHP and did not find a way to do this in the standard library. Searching around, I found a few good candidates in the Aho-Corasick algorithm and an improvement on it by Sun Wu and Udi Manber, which also seems to be known as agrep (or to be part of agrep): http://webglimpse.net/pubs/TR94-17.pdf
There are Rabin-Karp, suffix trees, etc. too, but they did not look quite suitable: the former is for fixed-length keywords, and the latter seems quite generic and would need rather a lot of work.
Can anyone let me know if implementing Agrep/Sun Wu-Manber on my own in PHP is a good way to solve this problem? Any other feedback?
EDIT: as I mentioned below in a comment, there are hundreds or more distinct search keywords, so a regex will not help, and that response is therefore not helpful.
I think you can solve this problem by using the "Levenshtein distance" metric.
From Wikipedia:
"In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences."
Plus, PHP has a levenshtein() function. Treat your keyword list as an array and the searchable string as input, iterate over the array, and call levenshtein() in each iteration to test for a match.
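A minimal sketch of that loop (the keyword list and distance threshold are placeholders); note that it is O(words x keywords), which may be too slow for thousands of keywords:

$keywords = array('predator', 'alien', 'walcott'); // your preprocessed list
$input    = 'Theo Walcot scores against the aliens';

$matches = array();
foreach (preg_split('/\W+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY) as $word) {
    foreach ($keywords as $keyword) {
        if (levenshtein($word, $keyword) <= 1) { // allow one typo
            $matches[$keyword] = true;
        }
    }
}
print_r(array_keys($matches)); // walcott, alien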
As of PHP 5.5, PHP's strtr uses the Wu-Manbers algorithm for multi-pattern matching. See commit ccf15cf2 in the PHP git repository for details about the implementation. It is quite efficient, in my experience.
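strtr() is a replacement function rather than a pure matcher, but with an array argument it scans for all patterns in a single pass over the subject, trying the longest keys first. For example:

$patterns = array(
    'T Walcot'  => 'Theo Walcott',
    'T. Walcot' => 'Theo Walcott',
    'arsenal'   => 'Arsenal FC',
);
echo strtr('T Walcot plays for arsenal', $patterns);
// Theo Walcott plays for Arsenal FC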
A pure-PHP implementation of the Aho-Corasick algorithm is available here: https://packagist.org/packages/wikimedia/aho-corasick
How do you make it so that when you search for "alien vs predator" you also get results containing the string "aliens vs predator", with the "s"?
Example: http://www.torrentz.com/search?q=alien+vs+predator
How have they implemented this?
Is this advanced search-engine stuff?
This is known as Word Stemming. When the text is indexed, words are "stemmed" to their "roots". So fighting becomes fight, skiing becomes ski, runs becomes run, etc. The same thing is done to the text that a user enters at search time, so when the search terms are compared to the values in the index, they match.
The Lucene project supports this. I wouldn't consider it an advanced feature, especially given the expectations that Google has set.
Checking for plurals is a form of stemming. Stemming is a common feature of search engines and other text matching. See the wikipedia page: http://en.wikipedia.org/wiki/Stemming for a host of algorithms to perform stemming.
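A crude illustration of the idea; a real stemmer (e.g. the Porter algorithm linked below) handles far more cases:

// Naive suffix stripping, for illustration only
function naiveStem($word)
{
    foreach (array('ing', 'es', 's') as $suffix) {
        if (strlen($word) > strlen($suffix) + 2
            && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, -strlen($suffix));
        }
    }
    return $word;
}

echo naiveStem('aliens'); // alien
echo naiveStem('skiing'); // ski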
Typically when one sets up a search engine to search for text, one will construct a query that's something like:
SELECT * FROM TBLMOVIES WHERE NAME LIKE '%ALIEN%'
This means that the substring ALIEN can appear anywhere in the NAME field, so you'll get back strings like ALIENS.
When words are indexed, they are indexed by root form. For example, "aliens", "alien", "alien's", and "aliens'" are all stored as "alien".
And when a word is searched for, the search engine likewise searches only for the root form "alien".
This is usually done with the Porter stemming algorithm. You can download an implementation for your favorite language here: http://tartarus.org/~martin/PorterStemmer/
This is a basic feature of a search engine, rather than just a program that matches your query with a set of pre-defined results.
If you have the time, this is a great read, all about different algorithms, and how they are implemented.
You could try using soundex() as a fuzzy match on your strings. If you save the soundex code alongside the title and then compare against a prefix of that code using LIKE 'XXX%', you should get a decent match; the longer the prefix you compare, the closer the match.
see docs: http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_soundex
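In PHP terms, the idea looks like this:

$a = soundex('alien');  // "A450"
$b = soundex('aliens'); // "A452"

// Loose match on a prefix of the codes; requiring a longer prefix
// makes the match stricter
var_dump(substr($a, 0, 3) === substr($b, 0, 3)); // bool(true)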
I am doing an experimental project.
What I am trying to achieve is to find the keywords in a given text.
The way I am trying to do this is to build a list of how many times each word appears in the text, sorted with the most-used words at the top.
The problem is that common words like "is", "was", and "were" are always at the top, and obviously these are not worth keeping.
Can you suggest some good logic for this, so it always finds good, related keywords?
Use something like a Brill tagger to identify the different parts of speech, such as nouns. Then extract only the nouns and sort them by frequency.
Well, you could use preg_split() to get the list of words and how often they occur; I'm assuming that's the bit you've got working so far.
The only thing I can think of for stripping the non-important words is to keep a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc., and use it to filter out the unwanted words, as in the sketch below.
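Something like this, where the stop-word list is just a starting point to extend:

$text = 'This is the text you are analysing, and this text is short';
$stopwords = array('a', 'i', 'the', 'and', 'is', 'was', 'were', 'this', 'you', 'are');

$words  = preg_split('/[^a-z\']+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
$counts = array_count_values($words);

// Drop the ignore-list words, then sort by frequency, highest first
$counts = array_diff_key($counts, array_flip($stopwords));
arsort($counts);

print_r($counts); // text => 2, analysing => 1, short => 1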
Why are you doing this? Is it for searching page content? If so, most back-end databases offer some kind of text-search functionality; both MySQL and Postgres have a full-text search engine, for example, that automatically discards the unimportant words. I'd recommend using the full-text features of your back-end database, as chances are they already implement something that meets your requirements.
My first approach to something like this would be more mathematical modelling than pure programming.
There are two "simple" ways you can attack a problem like this:
a) an exclusion list (penalize a collection of words which you deem useless)
b) a weight function which, for example, builds on word length, so that small words such as prepositions (in, at, ...) and pronouns (I, you, me, his, ...) are penalized and hopefully fall mid-table; see the sketch after this list
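A sketch of option b) with made-up counts: frequency is scaled by the log of the word length, so frequent but short words drop down the table.

// Example frequencies only; in practice these come from your word count
$counts = array('is' => 12, 'the' => 10, 'keyword' => 7, 'extraction' => 5);

$weights = array();
foreach ($counts as $word => $count) {
    $weights[$word] = $count * log(strlen($word));
}
arsort($weights);
print_r($weights); // keyword, extraction, the, is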
I am not sure if this was what you were looking for, but I hope it helps.
By the way, I know that contextual text processing is a subject of active research, you might find a number of projects which may be interesting.