I have a string about the length of an average sentence; it can be made up of any random words. I also have a file (around 600 kB) which contains some more random words.
I want to find the common words between these two as efficiently as possible. Right now I use two nested loops to match each word from the string against each word in the file, but that seems a bit inefficient. Is there a better, more efficient way to get the common words?
Load one set into an array as keys (the values can be anything). Then loop over the other set and test whether the array has those keys. This way you don't have two nested loops but two independent ones (a load loop and a test loop), and a key lookup is fast compared to a value lookup.
If you are testing multiple sentences against one file, loading the file into the array is clearly better. If your file is larger than your memory (which really shouldn't happen with 600 kB), then do it the other way around.
Alternatively, you can just make two arrays and use array_intersect or array_intersect_key. If PHP is smart, array_intersect_key will use the above procedure; in any case it should be fast because it is implemented in C. The downside is that you must load everything into memory (again, probably not an issue).
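A minimal sketch of the key-lookup idea (the file contents below are a stand-in; in practice you would read them with file_get_contents):

```php
<?php
// Stand-in for the 600 kB file's contents (an assumption for this sketch).
$fileText = "alpha beta gamma delta words random";
$sentence = "a random sentence with some words";

// Load one set into array keys; the values can be anything.
$lookup = array_fill_keys(
    preg_split('/\s+/', $fileText, -1, PREG_SPLIT_NO_EMPTY),
    true
);

// Loop the other set and test for key existence.
$common = [];
foreach (preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY) as $word) {
    if (isset($lookup[$word])) {
        $common[] = $word;
    }
}
print_r($common); // random, words
```

The array_intersect variant collapses the second loop into one call: array_intersect($sentenceWords, $fileWords).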
Your current algorithm's complexity is O(N*M). To improve it, use a hashtable to store the words from the file. In PHP, associative arrays are implemented as hashtables, so your array will look like this:
$array = ['abc' => true, 'dfg' => true]; // and so on
Use array_key_exists to check whether a word is in the array; that check is O(1). Finally, iterate over the words in your sentence, which is O(N), where N is the number of words. Building the table is a one-time O(M) pass, so the per-sentence complexity is O(N).
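For what it's worth, array_key_exists and isset differ on null values; a small sketch:

```php
<?php
$array = ['abc' => true, 'dfg' => true];

// O(1) membership test per word:
var_dump(array_key_exists('abc', $array)); // bool(true)
var_dump(array_key_exists('xyz', $array)); // bool(false)

// isset() is usually a touch faster, but it treats a null value as absent:
$withNull = ['abc' => null];
var_dump(isset($withNull['abc']));            // bool(false)
var_dump(array_key_exists('abc', $withNull)); // bool(true)
```

Since the values here are always true, either check works for this problem.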
I am working on a project and I need your suggestions on a database query. I am using PHP and MySQL.
Context
I have a table named phrases with a phrases column that stores phrases, each consisting of one to three words.
I have a text string which contains 500-1000 words.
I need to highlight all the phrases in the text string which exist in my phrases database table.
My solution
I go through every phrase in the phrase list and compare it against the text, but the number of phrases is large (100k), so it takes about two minutes or more to do the matching.
Is there any more efficient way of doing this?
I'm going to focus on how to do the comparison part with 100K values. This will require two steps.
a) Write a C++ library and link it to PHP using an extension. Google PHP-CPP; it is a framework that allows you to do this.
b) Inside C/C++, you need to create a data structure with a search time of O(n), n being the length of the phrase you're searching for. Normally, this is called a trie. It is conventionally used for single words without spaces (not phrases), but you can certainly adapt it to yours.
Here is a link containing the single-word (dictionary) implementation:
http://www.geeksforgeeks.org/trie-insert-and-search/
This takes quite a bit of memory, since the count is 100K; fair to say you'll need a large system. But when you're looking for better performance, memory tends to be the tradeoff.
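For illustration only (the answer above recommends C++ for real performance), a trie can also be sketched in plain PHP with nested arrays; phrase keys simply store the space like any other character:

```php
<?php
// Minimal trie sketch in plain PHP (illustration only).
function trie_insert(array &$root, string $key): void {
    $node = &$root;
    foreach (str_split($key) as $ch) {
        if (!isset($node[$ch])) {
            $node[$ch] = [];
        }
        $node = &$node[$ch];
    }
    $node['#'] = true; // end-of-phrase marker
}

function trie_search(array $root, string $key): bool {
    $node = $root;
    foreach (str_split($key) as $ch) {
        if (!isset($node[$ch])) {
            return false;
        }
        $node = $node[$ch];
    }
    return isset($node['#']);
}

$trie = [];
trie_insert($trie, 'big data');
trie_insert($trie, 'big');
var_dump(trie_search($trie, 'big data')); // bool(true)
var_dump(trie_search($trie, 'big dat'));  // bool(false)
```

Search time depends only on the length of the key, not on the 100K phrase count.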
Alternative Approach
Only PHP. Here, extract the candidate phrases from your text input and convert each into a hash key. The table data that you have should also be stored in a hash. [Needs huge memory.] The performance here will be rocket fast: O(1) per lookup, so for a sentence of k words the whole pass is O(k), since there are at most 3k candidate phrases of one to three words.
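A sketch of that hash approach, assuming phrases of at most three words (the phrase list here is made up):

```php
<?php
// Hypothetical phrase table, stored as hash keys for O(1) lookup.
$phrases = array_fill_keys(['big data', 'machine learning', 'php'], true);

$text = "php powers big data pipelines";
$words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

// Extract every candidate phrase of one to three words: at most 3k
// candidates for k words, each checked in O(1).
$found = [];
$n = count($words);
for ($i = 0; $i < $n; $i++) {
    for ($len = 1; $len <= 3 && $i + $len <= $n; $len++) {
        $candidate = implode(' ', array_slice($words, $i, $len));
        if (isset($phrases[$candidate])) {
            $found[] = $candidate;
        }
    }
}
print_r($found); // php, big data
```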
So, suppose I have a simple array of sentences. What would be the best way to search it based on user input, and return the closest match?
The Levenshtein functions seem promising, but I don't think I want to use them. User input may be as simple as highest mountain, in which case I'd want to search for the sentence in the array that has highest mountain. If that exact phrase does not exist, then I'd want to search for the sentence that has highest AND mountain, but not back-to-back, and so on. The Levenshtein functions work on a per-character basis, but what I really need is a per-word basis.
Of course, to some degree, Levenshtein functions may still be useful, as I'd also want to take into account the possibility of the sentence containing the phrase highest mountains (notice the S) or similar.
What do you suggest? Are there any systems for PHP that do this that already exist? Would Levenshtein functions alone be an adequate solution? Is there a word-based Levenshtein function that I don't know about?
Thanks!
EDIT - I have considered MySQL full-text search, and have also considered the possibility of breaking both A) the input and B) each sentence into separate arrays of words, and then comparing them that way, using Levenshtein functions to account for variations in words (color, colour, colors, etc.). However, I am concerned that this method, though possibly clever, may be computationally taxing.
As I am not a fan of writing code for people, I would normally ask what you have tried first. However, I was stuck on something myself, so I took a break to write this:
$results = array();
foreach ($array as $sentence) {
    if (stripos($sentence, $searchterm) !== false)
        $results[] = $sentence;
}
if (count($results) == 0) {
    $wordlist = explode(" ", $searchterm);
    foreach ($wordlist as $word) {
        foreach ($array as $sentence) {
            if (stripos($sentence, $word) !== false)
                $results[] = $sentence;
        }
    }
    // A sentence matching several words would otherwise appear once per word:
    $results = array_unique($results);
}
print_r($results);
This will search an array of sentences for the exact term. It will not find a result if you typed "microsift" and the sentence had the word "Microsoft". It is case-insensitive, so it should work better. If no results are found using the full term, the term is broken up and searched word by word. Hope this at least points you to a starting place.
Check this: http://framework.zend.com/manual/en/zend.search.lucene.overview.html
Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
There are no built-in PHP functions to do this, because what you are asking for involves search relevance, related terms, iterative searching, and many more complex operations that mimic human logic in searching. You can try looking for PHP-based search classes, although the ones that I know of are database search engines rather than array search classes. Writing your own is prohibitively complex.
I'm attempting to create an algorithm that will suggest Mad Gab style phrases.
The input is a set of phrases. I also have a set of keywords that I'd like to use when possible. Currently, my solution is simply brute force:
loop over phrases (character by character)
    if keyword is found
        store keyword and branch (recursion)
    increment character count
However, the problems I am running into are:
Account for compound keywords, e.g. "catches" can be "catches" or "cat" + "cheeses"
Allow literal terms - "the", "and", "one", "two", "three".
How to suggest terms that are not keywords. i.e. fall back on something like the system dictionary when keywords or literals can not be found.
Skip phrase segments. Right now it just does one pass through. But consider the case where the phrase starts with something unmatched but a few characters later contains matches.
I am most familiar with PHP and MySQL. However, I am open to another technology if it provides a better solution.
I am also interested in any additional suggestions. Particularly ways to use the second parameter of metaphone() to make harder suggestions.
Perhaps start with a syllable division algorithm on the phrase bank. You can use even a simple resource that teaches children to divide syllables to create your rough divider method:
http://www.ewsdonline.org/education/components/scrapbook/default.php?sectiondetailid=7584
If you want a more technical, completely accurate way, there was a Ph.D. dissertation about how to do it:
http://www.tug.org/docs/liang/
Then turn each syllable into a phonetic representation, using either something you roll yourself or metaphone(). You can use similar sites that explain vowel and consonant sound rules; these will only be generalizations. If you roll your own, you will process vowels separately from consonants. Metaphone just uses consonants, which is fine, but not as cool as if you also took vowels into account.
Vowels:
http://www.eslgold.com/pronunciation/english_vowel_sounds.html
Consonants:
http://usefulenglish.ru/phonetics/english-consonant-sounds
Then, you have a dictionary of English words for your word bank. There are many open-source dictionaries available that you could stick into a MySQL table.
Start with the first syllable and look for a random word in the dictionary that passes the soundex test. If you can't find one (this will generally find only one-syllable words), add the next syllable and search again.
Example:
"Logical consequence"
A. Syllable split
"lo gi cal con se quence"
B. Vowel Sounds applied
"lah gee cahl con see quince"
C. Consonant Sounds applied
"lah jee kahl kon see quinse"
D. Soundex test (one-syllable soundex; obviously too easy to guess, but it proves the concept)
"Law Gee Call Con Sea Quints"
A STRCMP of two soundex values returns a number (0 when they match). So, if you like, you can compute the soundex values of everything in your word bank in advance; then you can run the STRCMP quickly.
An example of a Soundex MySQL comparison is:
select strcmp(soundex('lah'), soundex('law'));
I think using the MySQL soundex is easier than the PHP soundex test if you want a random result from a big database and you've already captured the soundex value in a field in your dictionary table.
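The same check works in PHP directly; soundex() here should agree with MySQL's SOUNDEX() for plain ASCII input:

```php
<?php
// soundex() keeps the first letter and encodes the following consonants,
// so similar-sounding syllables collapse to the same key:
echo soundex('lah'), "\n"; // L000
echo soundex('law'), "\n"; // L000

// Equal keys mean the two syllables "sound alike":
var_dump(soundex('lah') === soundex('law')); // bool(true)
```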
My suggestion may be inefficient, but optimization is a different question.
Update:
I didn't mean to imply that my solution would only yield one-syllable words. I used one syllable as the example, but if you take two syllables together, you get multi-syllable matches. In fact, you could probably just start by jamming all the syllables together and running soundex in MySQL. If you find an answer, great; if not, you can roll off syllables until you get the longest match you can, then take the remaining end of the phrase together and run a match on it. I think that's the essence of the other contributor's solution, but I think you need to avoid jamming all the letters together without spaces. In English, you'd lose information that way. Think of a phrase beginning with a "th" sound: if you jam the phrase together, you lose which "th" sound is needed. "Theremin" (the instrument) has a different "th" sound than "There, a man".
Taking a different tack from Jonathan Barlow's solution, I recommend an O(n²) algorithm that gives you the properties you seek, in randomness, robustness, and scalable difficulty. The complexity of this algorithm can be further improved by constant factors or with optimizations to the modality of the search, but because the size of your input phrases is guaranteed to be small, it's not that big a deal.
1. Construct a hash table of all known words in the Oxford English Dictionary, along with a map from each soundex() value to the list of words that share it. This initially sounds intractable, until you realize that there aren't actually that many words in current use. Assuming a decent one-way hashing algorithm, this should take several megabytes, tops.
2. Consider the words in your input phrase as a single, compressed string of characters with no word identity whatsoever, discarding whitespace and all punctuation. From this, walk the space of all substring lengths, starting with a length of one, up to the full length of the amalgamated phrase minus one. For each string produced by this walk, perform a hash lookup against the OED table. When a word is encountered that's present in the dictionary, append the word and its position to the end of a list in memory. (This pass always takes 1 + 2 + ... + n = n(n+1)/2 time, so O(n²) it is. Its space complexity is worst-case O(n²), but in practice a fully connected set of terms is extremely unlikely.)
3. Now comes your difficulty slider. From the produced list, chop off the first N% of the found terms, where N is your level of difficulty. The principle here is that smaller words are easier for someone to lexically process, while longer words are more difficult to sound out and differentiate.
4. Construct an array conforming to the original length of the phrase (without spaces and punctuation) and shuffle your list of encountered words. Now walk the shuffled list. For each element, verify that all of the slots in the array are free for that word at its original position. If they are, keep the word and its position, marking the slots as used in the array. If they are not, iterate to the next word until the list is exhausted.*
5. From the final output array, construct a partitioned list of unused characters in the space, treating each bag of characters as its own phrase. For this list, perform syllable detection exactly as sketched out above, passing the results to metaphone() with a percentage chance of glomming two or more syllables together. Then, for the bag of output dictionary words from step 4, perform soundex(), pulling a random word from the word's mapped list of comparable soundex values. For every word that can only soundex() to itself according to the backing map of lists, perform partitioning and metaphone(). Finally, stitch the two lists of results together by sorting on position, and print your result.
This is a random algorithm with what I believe to be all of the desired properties, but it's still rough in my mind.
* Extra credit: determine the allowed overlaps for your system by character or syllable. This can make for an even larger gamut of accepted output phrases and a much higher level of difficulty.
I have an array of words, I need to figure out how many words each letter shows up in. The number of times per word doesn't matter, only the number of words.
I only need to check a-z, but since the array of words can be quite large (over 100,000 at a time), 26 passes through the whole array will take far too long.
What's a quicker way to check this? 260,000 iterations is far too many.
You have to loop through all the words. You could use count_chars to quickly get all the unique letters used in each word, but beyond that there isn't much you can do. You can either test all letters against a word (str_split and array_unique), or split the word into letters and find the unique ones (count_chars).
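A sketch of the single-pass approach: one loop over the words, with count_chars() giving each word's distinct letters so every word is counted once per letter it contains:

```php
<?php
// Toy word list standing in for the 100,000-word array.
$words = ['apple', 'banana', 'cherry'];
$letterCounts = array_fill_keys(range('a', 'z'), 0);

foreach ($words as $word) {
    // count_chars(..., 3) returns a string of the distinct bytes used,
    // so each letter is counted at most once per word.
    foreach (str_split(count_chars(strtolower($word), 3)) as $ch) {
        if (isset($letterCounts[$ch])) {
            $letterCounts[$ch]++;
        }
    }
}
// e.g. $letterCounts['a'] is 2 ("apple" and "banana"),
//      $letterCounts['e'] is 2 ("apple" and "cherry")
```

This is one pass over the data rather than 26, at the cost of a small per-word split.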
EDIT: If you are looking for absolute performance, then you simply have to benchmark the different combinations. The point is that, algorithm-wise, there isn't much one can do if your data is dynamic or "unknown".
I have two very large strings. How can I compare them to tell if they're identical, or if one of them is different than the other? (That way I can leave the identical strings alone and process the ones that have changed).
The most efficient way is to do:
$string1 === $string2
It does a byte-by-byte comparison of the strings (after an initial length check), so it is at worst O(n), where n is the size of the smaller string; the identity operator === also avoids the type-juggling that == can apply to numeric-looking strings. I don't think you're going to get much better than that (short of keeping track of the strings and whether they were changed, but the way your question is worded, it seems all you want to do is compare them).
You could compare hash values, or create a wrapper class containing the string in question and a "changed" flag that is set to true each time the string is altered.
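For a sense of the two options (the million-character strings here are made up):

```php
<?php
// === compares lengths first, then bytes, short-circuiting on the
// first difference: O(n) worst case, often much faster.
$a = str_repeat('x', 1000000);
$b = $a . '!';
var_dump($a === $b); // bool(false) -- lengths differ, near-instant

// Hash-and-compare is useful when you want to detect changes later
// without keeping the old copy of the string around:
$fingerprint = md5($a);
// ... the string may be modified elsewhere ...
$changed = (md5($a) !== $fingerprint);
var_dump($changed); // bool(false)
```

Hashing costs a full pass over the string each time, so it only pays off when storing the original for direct comparison is impractical.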