I have a word list and I want to unscramble words using this word list, in PHP.
It seems to me that PHP doesn't have a built-in function that does this. So could someone please suggest a good algorithm to do this, or at least point me in the right direction?
EDIT: edited to add example
So basically, what I'm talking about is I have a list of words:
apple
banana
orange
Then, I'm given a bunch of jumbled letters.
pplea
nanaba
eroang
Given a dictionary of known words:
foreach ($list as $word)
{
if (count_chars($scrambled_word,1) == count_chars($word,1))
echo "$word\n";
}
Edit: A simple optimization would be to move the count_chars($scrambled_word,1)) outside the loop since it never changes:
$letters = count_chars($scrambled_word,1)
foreach ($list as $word)
{
if ($letters == count_chars($word,1))
echo "$word\n";
}
Warning: I rarely use PHP, so this is dealing only with a general algorithm that should work in almost any language, not anything specific to PHP.
Presumably you have a word in which the letters have been rearranged, and you want to find what word(s) could be made from those letters.
If that's correct, the general idea is fairly simple: take a copy of your word list, and sort the letters in each word into alphabetical order. Put the sorted and unsorted versions of each word side by side, and sort the whole thing by the sorted words (but keeping each unsorted word along with its sorted version). You may want to collapse duplicates together, so that (for example) instead of {abt : bat} and {abt : tab}, you have: {abt: bat, tab}
Then, to match up a scrambled word, sort its letters in alphabetical order. Look for matches in your dictionary (since it's sorted, you can use a binary search). When you find a match, the result is the word (or words) associated with that sorted letter group. Using the example above, if the scrambled word was "tba", you'd sort it to get "abt", then look up "abt" to get "bat" and "tab".
Edit: As #Moron pointed out in the comments, sorting and binary search aren't really crucial points in themselves. The basic points are to turn all equivalent inputs into identical keys, then use some sort of fast lookup by key to find the word(s) for that key.
Sorting the letters in each word is one easy way to turn equivalent inputs into identical keys. Sorting the list and doing a binary search is one easy way to do fast lookups by key.
In both cases, there are quite a few alternatives. I'm not at all sure the alternatives are likely to improve performance a lot, but they certainly could.
Just for example, instead of a pure binary search you could have a second level of index that told you where the keys starting with 'a' were, the keys starting with 'b', and so on. Given that a couple of extremely frequently-used letters are near the beginning of the alphabet ('e' and 'a', for example) you might be better off sorting the words so that relatively uncommon letters ('q', 'z', etc.) are toward the front of the key, and the most commonly used letters are at the end. This would give that first lookup based on the initial character the greatest discrimination.
On the sort/binary search side, there are probably more alternatives, and probably better arguments to be made in favor of using something else. Hash tables typically allow lookups in (nearly) constant time. Tries can reduce storage substantially, especially when many words share a common prefix. The only obvious disadvantage is that the code for either one is probably more work (though PHP's array type is hash-based, so you could probably use it quite nicely).
It is possible to unscramble in O(log p + n) where
p = size of dictionary
n = length of word to be unscrambled
Assume a constant, c, the most occurrences of some letter within any word plus 1.
Assume a constant, k, the number of letters in the alphabet.
Assume a constant, j, the most number of words that can share the same hash or letter-sorted version.
Initialization of O(p) space:
1. Using the dictionary, D, create an associated list of letter sorted words, L, which will be size at most p since each word has a one sorted version.
2. Associate another column to L with a numerical hash of integers which can range [0, c^k-1].
3. For each word in L, generate its hash with the following function:
hash(word) = 0 if word is empty or (c^i + hash(remaining substring of the word))
where i is the zero-based alphabet index of the first letter.
Algorithm:
1. In O(n), determine hash, h, of the letter sorted version of the word in question.
2. In O(log p), search for the hash in L.
3. In O(n), list j associated words of length n.
Try these
http://www.php.net/manual/en/function.similar-text.php
http://www.php.net/manual/en/function.soundex.php
http://www.php.net/manual/en/function.levenshtein.php
The slow option would be to generate all permutations of the letters in a scrambled word, then probe them via pspell_check().
If you however can use a raw dictionary text file, then the best option is to just use a regular expression to scan it:
$dict = file_get_contents("words.txt"); // one word per line
$n = strlen($word);
if (preg_match('/^[$word]{$n}$/im', $dict, $match)) {
print $match[0];
}
I'm quite certain PCRE is significantly faster at searching for permutations than PHP and the guessing approach.
Make use of PHP's array functions since they can solve this for you.
$words = array('hello', 'food', 'stuff', 'happy', 'fast');
$scrambled_word = 'oehll';
foreach ($words as $word)
{
// Same length?
if (strlen($scrambled_word) === strlen($word))
{
// Convert to an array and match
if( ! array_diff(str_split($word), str_split($scrambled_word)))
{
print "Your word is: $word";
}
}
}
Basically, you look for something the same length - then you ask PHP to see if all the letters are the same.
If you have a really large list of words and want this unscramble operation to be fast, I'd try putting the word list into a database. Next add a field to your word list table that is the sum of the ascii values of the word, and then add an index on this ascii sum.
Whenever you want to retrieve a list of possible matches just search the word table for ascii sums that match the sum of the scrambled letters. Keep in mind that you may have a few false matches so you'll have to compare all of the matched words to ensure they contain only the letters of your scrambled word (but the result set should be pretty small).
If you don't want to use a database you could implement the same basic idea using a file, just sort the list by the sum value for faster retrieval of all matches.
Example Data assumes all lowercase (a=97, b=98, c=99, ...)
bat => 311,
cat => 312, ...
Example php function to figure out the sum for a word
function asciiSum($word) {
$characters = str_split(strtolower($word));
$sum = 0;
foreach($characters as $character) {
$sum += ord($character);
}
return $sum;
}
Even faster: add another field to the database that represents the string length, then you can search for words based on an ascii sum and a string length which would further reduce the number of false matches you would need to check for.
Related
I need to match longest word of given string using regex:
for example given string
S = "hello night axe axbxbxx prom etc..."
character set 1 = [abcdexy]
character set 2 = [mnrpo]
I need to get only one word that match 2 constriants, all the word should contain characters from one set only and the chosen word should be the longest, I tried to solve this using php regex such as:
preg_match("/\b[abcdexy]+/",$s, $match1);
preg_match("/\b[mnrpo]+/",$s, $match2);
if(strlen($match1[0]) > strlen($match2[0]))
{
//output match1[0];
}
else
{
//output match2[0]
}
The expected output should be axbxbxx since it contain only characters from set 1 and it is the longest between words that belong to one of the two sets.
My question is, can I make this work using only regex without need for strlen() testing?
You can write a single regex expression that uses a pipe to match both character ranges, then sort the matched values by descending length and access the first element's value.
Code: (Demo)
$string='hello proxy night pom-pom-mop axe prom etc decayed';
if (preg_match_all('~\b(?:[a-exy]+|[m-pr]+)\b~', $string, $out)) {
usort($out[0], function($a, $b) {return strlen($b) - strlen($a);}); // or spaceship operator if you like
echo $out[0][0];
} else {
echo "no matches";
}
Output:
decayed
The above method is not "tie-aware" so if you have two values or more values that share the highest length, you will only get one value in the output. I think you need to build in some additional logic to handle these fringe cases like:
Output all highest length values or
Set a secondary criteria to break ties on length
I'll not bother coding up these solution extensions since I prefer not to go down rabbit holes.
I have an array of strings generated randomly. Now, how am I going to check if a string is correctly spelled or not, based on US English dictionary. This way, I can remove non-English words from the list.
What I did right now is to loop through the list and have it queried to a database of dictionary words. Unfortunately, it is not efficient especially if my list contains hundred of words.
I have read about Aspell but unfortunately, I have to install it, and I am restricted because I am hosting the program in a shared web hosting.
Anyway, here's what I have so far:
// generate random strings using the method I coded
// returns a string array of generated strings
// no duplicates generated here
// just plain permutations
$generated_list = generate();
Since I have read an article that instead of looping and do query for each string, I just did a single query, like this Performing A Query In A Loop :
$only_english_list = [];
if (count($generated_list)) {
$result = $connection->query("SELECT `word` FROM `us_eng` WHERE `word` IN (" . implode(',', $generated_list));
while ($row = $result->fetch_row()) {
$only_english_list[] = $row['word'];
}
}
However, is there more efficient in checking if a string is in English dictionary? Something like a method that will return true or false?
I now have an answer to this problem. Here's what I did. Instead of generating permutations, which takes a MASSIVE amount of time and resources, I just utilize IMMEDIATELY the capability of MySQL. That is, I used REGEXP or LIKE against a table of English words of a certain length.
So, for English words that can be formed from vleoly, I used this query to a table of English words of length 6, noted by us_6.
SELECT word FROM us_6 WHERE
word REGEXP 'v' AND
word REGEXP 'l.*l' AND
word REGEXP 'e' AND
word REGEXP 'o' AND
word REGEXP 'y'
And results generated are lovely and volley.
For more information, check MySQL, REGEXP - Find Words Which Contain Only The Following Exact Letters
I'm trying to extract the parts which are similar from multiple strings.
The purpose of this is an attempt to extract the title of a book from multiple OCRings of the title page.
This applies to only the beginning of the string, the ends of the strings don't need to be trimmed and can stay as they are.
For example, my strings might be:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='published by xyz publisher the historv of the internot, expanded and';
$title[3]='history of the internet';
So basically I would want to trim each string so that it starts at the most probable starting point. Considering that there may be OCR errors (e.g. "historv", "internot") I thought it might be best to take the number of characters from each word, which would give me an array for each string (so a multi-dimensional array) with a the length of each word. This can then be used to find running matches and trim the beginnings of the string to the most likely.
The strings should be cut to:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='the historv of the internot, expanded and';
$title[3]='XXX history of the internet';
So I need to be able to recognize that "history of the internet" (7 2 3 8) is the run which matches all strings, and that the preceding "the" is most probably correct seeing as it occurs in >50% of the strings, and therefore the beginning of each string is trimmed to "the" and a placeholder of the same length is added onto the string missing "the".
So far I have got:
function CompareSimilarStrings($array)
{
$n=count($array);
// Get length of each word in each string >
for($run=0; $run<$n; $run++)
{
$temp=explode(' ',$array[$run]);
foreach($temp as $key => $val)
$len[$run][$key]=strlen($val);
}
for($run=0; $run<$n; $run++)
{
}
}
As you can see, I'm stuck on finding the running matches.
Any ideas?
You should look into Smith-Waterman algorithm for local string alignment. It is a dynamic programming algorithm which finds parts of the string which are similar in that they have low edit distance.
So if you want to try it out, here is a php implementation of the algorithm.
I am using cakephp 1.3 and I have textarea where users submit articles. On submit, I want to look into the article for certain key words and and add respective tags to the article.
I was thinking of preg_match, But preg_match pattern has to be string. So I would have to loop through an array(big).
Is there a easier way to plug in the keywords array for the pattern.
I appreciate all your help.
Thanks.
I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.
I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
$arr = explode(" ", strtolower($article));
$tracker = array();
foreach($arr as $word){
if(in_array($word, $keywords)){
if(isset($tracker[$word]))
$tracker[$word]++;
else
$tracker[$word] = 1;
}
}
The $tracker array would output: "computer" => 1, "dog" => 2. You can then do the process to decide what tags to use. Or if you don't care about the number of times the keyword appears, you can skip the tracker part and add the tags as the keywords appear.
EDIT: The keyword array may need to be an inverted index array to ensure the fastest lookup. I am not sure how in_array() works, but if it searches, then this isn't as fast as it should be. An inverted index array would look like
array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any value
Then you would do isset($keywords[$word]) to check if the word matches a keyword, instead of in_array(), which should give you O(1). Someone else may be able to clarify this for me though.
If you don't need the power of regular expressions, you should just use strpos().
You will still need to loop through the array of words, but strpos is much, much faster than preg_match.
Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.
Instead, you can try with a different approach, such as splitting the text into words and checking if the words are interesting or not. I would make use of str_word_count() using someting like:
$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));
that splits the text into words and counts occurrences. Then you can check with in_array($keyword, $words_freq) or array_intersect(array_keys($words_freq), $my_keywords).
If you are not interested, as I guess, to the keywords case, you can strtolower() the whole text before proceeding with words splitting.
Of course, the only way to determine which approach is the best is to setup some testing, by running various search functions against some "representative" and quite long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).
EDIT: I cleaned up a bit the code and added a missing semi-colon :)
If you want to look for multiple words from an array, then combine said array into an regular expression:
$regex_array = implode("|", array_map("preg_escape", $array));
preg_match_all("/($regex_array)/", $src, $tags);
This converts your array into /(word|word|word|word|word|...)/. The arrray_map and preg_escape part is optional, only needed if the $array might contain special characters.
Avoid strpos and loops for this case. preg_match is faster for searching after alternatives.
strtr()
If given two arguments, the second
should be an array in the form
array('from' => 'to', ...). The return
value is a string where all the
occurrences of the array keys have
been replaced by the corresponding
values. The longest keys will be tried
first. Once a substring has been
replaced, its new value will not be
searched again.
Add tags manually? Just like we add tags here at SO.
How can you match the following words by PHP, either by regex/globbing/...?
Examples
INNO, heppeh, isi, pekkep, dadad, mum
My attempt would be to make a regex which has 3 parts:
1st match match [a-zA-Z]*
[a-zA-Z]?
rotation of the 1st match // Problem here!
The part 3 is the problem, since I do not know how to rotate the match.
This suggests me that regex is not the best solution here, since it is too very inefficient for long words.
I think regex are a bad solution. I'd do something with the condition like: ($word == strrev($word)).
Regexs are not suitable for finding palindromes of an arbitrary length.
However, if you are trying to find all of the palindromes in a large set of text, you could use regex to find a list of things that might be palindromes, and then filter that list to find the words that actually are palindromes.
For example, you can use a regex to find all words such that the first X characters are the reverse of the last X characters (from some small fixed value of X, like 2 or 3), and then run a secondary filter against all the matches to see if the whole word is in fact a palindrome.
In PHP once you get the string you want to check (by regex or split or whatever) you can just:
if ($string == strrev($string)) // it's a palindrome!
i think this regexp can work
$re = '~([a-z])(.?|(?R))\1~';