PHP Extract Similar Parts from Multiple Strings - php

I'm trying to extract the parts which are similar from multiple strings.
The purpose of this is an attempt to extract the title of a book from multiple OCRings of the title page.
This applies to only the beginning of the string, the ends of the strings don't need to be trimmed and can stay as they are.
For example, my strings might be:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='published by xyz publisher the historv of the internot, expanded and';
$title[3]='history of the internet';
So basically I want to trim each string so that it starts at the most probable starting point. Considering that there may be OCR errors (e.g. "historv", "internot"), I thought it might be best to take the number of characters in each word, which would give me an array for each string (so a multi-dimensional array) with the length of each word. This can then be used to find running matches and trim the beginning of each string to the most likely starting point.
The strings should be cut to:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='the historv of the internot, expanded and';
$title[3]='XXX history of the internet';
So I need to be able to recognize that "history of the internet" (7 2 3 8) is the run which matches all strings, and that the preceding "the" is most probably correct seeing as it occurs in >50% of the strings, and therefore the beginning of each string is trimmed to "the" and a placeholder of the same length is added onto the string missing "the".
So far I have got:
function CompareSimilarStrings($array)
{
    $n = count($array);
    // Get length of each word in each string
    for ($run = 0; $run < $n; $run++)
    {
        $temp = explode(' ', $array[$run]);
        foreach ($temp as $key => $val)
            $len[$run][$key] = strlen($val);
    }
    for ($run = 0; $run < $n; $run++)
    {
    }
}
As you can see, I'm stuck on finding the running matches.
Any ideas?

You should look into the Smith-Waterman algorithm for local string alignment. It is a dynamic programming algorithm that finds the parts of two strings which are similar in that they have a low edit distance, so it tolerates small differences such as OCR errors.
So if you want to try it out, here is a php implementation of the algorithm.
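As a rough illustration of the idea, here is a minimal character-level sketch of Smith-Waterman. The function name smithWaterman, the returned keys and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for this example, not part of any library, and it aligns only two strings at a time.

```php
<?php
// Hypothetical helper: Smith-Waterman local alignment over characters.
// The scoring values are illustrative assumptions, not tuned for OCR data.
function smithWaterman(string $a, string $b, int $match = 2, int $mismatch = -1, int $gap = -1): array
{
    $n = strlen($a);
    $m = strlen($b);
    // $H[$i][$j] = best score of a local alignment ending at $a[$i-1] and $b[$j-1]
    $H = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    $best = 0;
    $bi = 0;
    $bj = 0;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $diag = $H[$i - 1][$j - 1] + ($a[$i - 1] === $b[$j - 1] ? $match : $mismatch);
            $H[$i][$j] = max(0, $diag, $H[$i - 1][$j] + $gap, $H[$i][$j - 1] + $gap);
            if ($H[$i][$j] > $best) {
                $best = $H[$i][$j];
                $bi = $i;
                $bj = $j;
            }
        }
    }
    // Trace back from the best cell until the score drops to zero
    $i = $bi;
    $j = $bj;
    while ($i > 0 && $j > 0 && $H[$i][$j] > 0) {
        $diag = $H[$i - 1][$j - 1] + ($a[$i - 1] === $b[$j - 1] ? $match : $mismatch);
        if ($H[$i][$j] === $diag) {
            $i--;
            $j--;
        } elseif ($H[$i][$j] === $H[$i - 1][$j] + $gap) {
            $i--;
        } else {
            $j--;
        }
    }
    return [
        'score'  => $best,
        'aStart' => $i, // offset in $a where the aligned region begins
        'a'      => substr($a, $i, $bi - $i),
        'b'      => substr($b, $j, $bj - $j),
    ];
}

$r = smithWaterman(
    'published by xyz publisher the historv of the internot, expanded and',
    'the history of the internet'
);
$trimmed = substr('published by xyz publisher the historv of the internot, expanded and', $r['aStart']);
```

Here $r['aStart'] gives the offset at which the noisy string could be trimmed. Handling the ">50% of the strings" vote and the placeholder for a missing leading word would still need extra logic on top of the pairwise alignments.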

Related

PHP - How to check if the spelling of a string is correct or not? No “suggestions for misspelled words” needed. Just return plain true or false

I have an array of strings generated randomly. Now, how am I going to check if a string is correctly spelled or not, based on US English dictionary. This way, I can remove non-English words from the list.
What I do right now is loop through the list and query a database of dictionary words for each entry. Unfortunately, this is not efficient, especially if my list contains hundreds of words.
I have read about Aspell but unfortunately, I have to install it, and I am restricted because I am hosting the program in a shared web hosting.
Anyway, here's what I have so far:
// generate random strings using the method I coded
// returns a string array of generated strings
// no duplicates generated here
// just plain permutations
$generated_list = generate();
I read an article (Performing A Query In A Loop) that recommends a single query instead of one query per string, so I did this:
$only_english_list = [];
if (count($generated_list)) {
    // Quote and escape every word before building the IN (...) list
    $quoted = [];
    foreach ($generated_list as $word) {
        $quoted[] = "'" . $connection->real_escape_string($word) . "'";
    }
    $result = $connection->query("SELECT `word` FROM `us_eng` WHERE `word` IN (" . implode(',', $quoted) . ")");
    while ($row = $result->fetch_assoc()) {
        $only_english_list[] = $row['word'];
    }
}
However, is there a more efficient way to check whether a string is in an English dictionary? Something like a method that returns true or false?
I now have an answer to this problem. Instead of generating permutations, which takes a MASSIVE amount of time and resources, I let MySQL do the work directly: I used REGEXP or LIKE against a table of English words of a given length.
So, for English words that can be formed from vleoly, I ran this query against a table of English words of length 6, named us_6.
SELECT word FROM us_6 WHERE
word REGEXP 'v' AND
word REGEXP 'l.*l' AND
word REGEXP 'e' AND
word REGEXP 'o' AND
word REGEXP 'y'
And results generated are lovely and volley.
For more information, check MySQL, REGEXP - Find Words Which Contain Only The Following Exact Letters
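A query like the one above could be generated from the scrambled letters. The following is a minimal sketch under stated assumptions: the helper name buildLetterConditions is made up for this example, input is restricted to lowercase a-z so that it is safe to embed in SQL, and the us_N table naming follows the answer above.

```php
<?php
// Hypothetical helper: builds the REGEXP conditions for one scrambled word.
// Assumes lowercase a-z only, which also keeps the generated SQL safe.
function buildLetterConditions(string $letters, string $column = 'word'): string
{
    if (!preg_match('/^[a-z]+\z/', $letters)) {
        throw new InvalidArgumentException('Only lowercase a-z is supported in this sketch.');
    }
    $conditions = [];
    // count_chars(..., 1) maps byte value => number of occurrences
    foreach (count_chars($letters, 1) as $byte => $count) {
        // A letter that occurs twice becomes 'l.*l', three times 'l.*l.*l', and so on
        $pattern = implode('.*', array_fill(0, $count, chr($byte)));
        $conditions[] = "$column REGEXP '$pattern'";
    }
    return implode(' AND ', $conditions);
}

$scrambled = 'vleoly';
$sql = 'SELECT word FROM us_' . strlen($scrambled) . ' WHERE ' . buildLetterConditions($scrambled);
```

The conditions come out in alphabetical order of the letters rather than in the order typed, which does not change the result of the query.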

Extracting specific parts of a predictably structured string with unpredictable contents

Ok, I have a complex problem for you guys.
I am trying to extract some values from a load of old data. It's a bunch of strings which are basically 7 parts concatenated with ||
test1||keep||1:1||test||3462||7885||test
Rules
Each section of the string could have any character in it, except | or two arrows like this <> (see further down) which are reserved as separators.
Any of the sections could be empty.
e.g. In this one the 1st, 5th and 6th sections are empty, and the 3rd contains lots of non-alphanumeric characters.
||keep||test's\ (o-kay?).go_od||test||||||test
Furthermore...
Some of the strings are made up of multiple ones of these 7 pieces, further separated with <>
test1||keep||1:1||test||3462||7885||test<>test1||keep||1:1||test||3462||7885||test<>test1||keep||1:1||test||3462||7885||test
Remember, any of the inner sections could be empty.
test54||keep||test's\ (o-kay?).go_od||test||||||<>test||keep||test545's'/.||test||||test||test
The Goal
Extract just the second part of every string, and put into an array. In my examples above, it is every part which has the word keep inside.
So for this example:
||keep||test's\ (o-kay?).go_od||test||||||test
I want to get:
array('keep')
And for this example:
test1||keep-me||1:1||test||3462||7885||test<>||keep||||||3462||7885||<>test1||keep-me-too!||1:1||test||3462||||test
It can be seen as 3 different strings which are separated by <>:
test1||keep-me||1:1||test||3462||7885||test
||keep||||||3462||7885||
test1||keep-me-too!||1:1||test||3462||||test
And I want to extract:
array('keep-me', 'keep', 'keep-me-too!')
Notes
I have tried doing this with preg_match but look-behind doesn't like searching for non-fixed length strings.
I cannot change the data. It is old data I just have to work with.
$array = [];
$strings = explode('<>', $yourContent);
foreach ($strings as $string) {
    $array[] = explode('||', $string)[1];
}
This uses array dereferencing introduced in PHP 5.4.
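For illustration, the same approach can be wrapped in a function and run against the sample from the question. This is only a sketch: the name extractSecondParts is an assumption, and the isset() guard for records without any || separator is an addition that the original snippet does not have.

```php
<?php
// Hypothetical wrapper around the explode() approach shown above.
function extractSecondParts(string $content): array
{
    $result = [];
    foreach (explode('<>', $content) as $record) {
        $parts = explode('||', $record);
        // Skip malformed records that have no second section at all
        if (isset($parts[1])) {
            $result[] = $parts[1];
        }
    }
    return $result;
}

$data = 'test1||keep-me||1:1||test||3462||7885||test<>||keep||||||3462||7885||<>test1||keep-me-too!||1:1||test||3462||||test';
$kept = extractSecondParts($data);
```

An empty second section is still returned, as an empty string, because the rules say any section may be empty.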

What is the best way to unscramble words with PHP?

I have a word list and I want to unscramble words using this word list, in PHP.
It seems to me that PHP doesn't have a built-in function that does this. So could someone please suggest a good algorithm to do this, or at least point me in the right direction?
EDIT: edited to add example
So basically, what I'm talking about is I have a list of words:
apple
banana
orange
Then, I'm given a bunch of jumbled letters.
pplea
nanaba
eroang
Given a dictionary of known words:
foreach ($list as $word)
{
    if (count_chars($scrambled_word, 1) == count_chars($word, 1))
        echo "$word\n";
}
Edit: A simple optimization would be to move the count_chars($scrambled_word, 1) call outside the loop since it never changes:
$letters = count_chars($scrambled_word, 1);
foreach ($list as $word)
{
    if ($letters == count_chars($word, 1))
        echo "$word\n";
}
Warning: I rarely use PHP, so this is dealing only with a general algorithm that should work in almost any language, not anything specific to PHP.
Presumably you have a word in which the letters have been rearranged, and you want to find what word(s) could be made from those letters.
If that's correct, the general idea is fairly simple: take a copy of your word list, and sort the letters in each word into alphabetical order. Put the sorted and unsorted versions of each word side by side, and sort the whole thing by the sorted words (but keeping each unsorted word along with its sorted version). You may want to collapse duplicates together, so that (for example) instead of {abt : bat} and {abt : tab}, you have: {abt: bat, tab}
Then, to match up a scrambled word, sort its letters in alphabetical order. Look for matches in your dictionary (since it's sorted, you can use a binary search). When you find a match, the result is the word (or words) associated with that sorted letter group. Using the example above, if the scrambled word was "tba", you'd sort it to get "abt", then look up "abt" to get "bat" and "tab".
Edit: As @Moron pointed out in the comments, sorting and binary search aren't really crucial points in themselves. The basic points are to turn all equivalent inputs into identical keys, then use some sort of fast lookup by key to find the word(s) for that key.
Sorting the letters in each word is one easy way to turn equivalent inputs into identical keys. Sorting the list and doing a binary search is one easy way to do fast lookups by key.
In both cases, there are quite a few alternatives. I'm not at all sure the alternatives are likely to improve performance a lot, but they certainly could.
Just for example, instead of a pure binary search you could have a second level of index that told you where the keys starting with 'a' were, the keys starting with 'b', and so on. Given that a couple of extremely frequently-used letters are near the beginning of the alphabet ('e' and 'a', for example) you might be better off sorting the words so that relatively uncommon letters ('q', 'z', etc.) are toward the front of the key, and the most commonly used letters are at the end. This would give that first lookup based on the initial character the greatest discrimination.
On the sort/binary search side, there are probably more alternatives, and probably better arguments to be made in favor of using something else. Hash tables typically allow lookups in (nearly) constant time. Tries can reduce storage substantially, especially when many words share a common prefix. The only obvious disadvantage is that the code for either one is probably more work (though PHP's array type is hash-based, so you could probably use it quite nicely).
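Since this answer describes the algorithm without code, here is a minimal PHP sketch of the sorted-letters key combined with PHP's hash-based arrays for lookup. The function names sortLetters and buildAnagramIndex are assumptions made for this example.

```php
<?php
// Hypothetical helper: turns equivalent inputs into identical keys
// by sorting the letters of a word ("tab" and "bat" both become "abt").
function sortLetters(string $word): string
{
    $chars = str_split(strtolower($word));
    sort($chars);
    return implode('', $chars);
}

// Hypothetical helper: groups dictionary words by their sorted-letter key.
function buildAnagramIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[sortLetters($word)][] = $word;
    }
    return $index;
}

$index = buildAnagramIndex(['bat', 'tab', 'apple', 'banana', 'orange']);
// Lookup is a single hash access on the sorted letters of the scrambled word
$matches = $index[sortLetters('tba')] ?? [];
```

The index only has to be built once per dictionary, so in practice it could be cached or stored alongside the word list.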
It is possible to unscramble in O(log p + n) where
p = size of dictionary
n = length of word to be unscrambled
Assume a constant, c, the most occurrences of some letter within any word plus 1.
Assume a constant, k, the number of letters in the alphabet.
Assume a constant, j, the most number of words that can share the same hash or letter-sorted version.
Initialization of O(p) space:
1. Using the dictionary, D, create an associated list of letter-sorted words, L, which will be of size at most p since each word has exactly one sorted version.
2. Associate another column to L with a numerical hash of integers which can range [0, c^k-1].
3. For each word in L, generate its hash with the following function:
hash(word) = 0 if word is empty or (c^i + hash(remaining substring of the word))
where i is the zero-based alphabet index of the first letter.
Algorithm:
1. In O(n), determine hash, h, of the letter sorted version of the word in question.
2. In O(log p), search for the hash in L.
3. In O(n), list j associated words of length n.
Try these
http://www.php.net/manual/en/function.similar-text.php
http://www.php.net/manual/en/function.soundex.php
http://www.php.net/manual/en/function.levenshtein.php
The slow option would be to generate all permutations of the letters in a scrambled word, then probe them via pspell_check().
If you however can use a raw dictionary text file, then the best option is to just use a regular expression to scan it:
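$dict = file_get_contents("words.txt"); // one word per line
$n = strlen($word);
// Build the pattern by concatenation: variables are not interpolated inside single quotes
$pattern = '/^[' . preg_quote($word, '/') . ']{' . $n . '}$/im';
if (preg_match($pattern, $dict, $match)) {
    print $match[0];
}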
I'm quite certain PCRE is significantly faster at searching for permutations than PHP and the guessing approach.
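One caveat with a character class is that it also accepts words that reuse a letter more often than the scrambled word contains it. A possible refinement, sketched here with the hypothetical function name unscrambleWithRegex and an in-memory dictionary string standing in for words.txt, is to treat the regex hits as candidates and confirm them with count_chars():

```php
<?php
// Hypothetical helper: the regex narrows the dictionary down to candidates,
// then count_chars() confirms that the letter counts really match.
function unscrambleWithRegex(string $word, string $dict): array
{
    $pattern = '/^[' . preg_quote($word, '/') . ']{' . strlen($word) . '}$/im';
    preg_match_all($pattern, $dict, $m);
    $target = count_chars(strtolower($word), 1);
    $confirmed = array_filter($m[0], function ($candidate) use ($target) {
        return count_chars(strtolower($candidate), 1) == $target;
    });
    return array_values($confirmed);
}

// Stand-in for file_get_contents("words.txt"), one word per line
$dict = "hello\nhollo\nworld\n";
$found = unscrambleWithRegex('oehll', $dict);
```

In this example "hollo" passes the regex because all of its letters are in the class, but it is rejected by the letter-count check.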
Make use of PHP's array functions since they can solve this for you.
$words = array('hello', 'food', 'stuff', 'happy', 'fast');
$scrambled_word = 'oehll';
foreach ($words as $word)
{
    // Same length?
    if (strlen($scrambled_word) === strlen($word))
    {
        // Convert to an array and match
        if ( ! array_diff(str_split($word), str_split($scrambled_word)))
        {
            print "Your word is: $word";
        }
    }
}
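Basically, you look for something the same length - then you ask PHP to see if all the letters are the same. Note that array_diff() ignores how many times each letter occurs, so words such as "aab" and "abb" would also be reported as a match; comparing count_chars() results avoids this.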
If you have a really large list of words and want this unscramble operation to be fast, I'd try putting the word list into a database. Next add a field to your word list table that is the sum of the ascii values of the word, and then add an index on this ascii sum.
Whenever you want to retrieve a list of possible matches just search the word table for ascii sums that match the sum of the scrambled letters. Keep in mind that you may have a few false matches so you'll have to compare all of the matched words to ensure they contain only the letters of your scrambled word (but the result set should be pretty small).
If you don't want to use a database you could implement the same basic idea using a file, just sort the list by the sum value for faster retrieval of all matches.
Example Data assumes all lowercase (a=97, b=98, c=99, ...)
bat => 311,
cat => 312, ...
Example php function to figure out the sum for a word
function asciiSum($word) {
    $characters = str_split(strtolower($word));
    $sum = 0;
    foreach ($characters as $character) {
        $sum += ord($character);
    }
    return $sum;
}
Even faster: add another field to the database that represents the string length, then you can search for words based on an ascii sum and a string length which would further reduce the number of false matches you would need to check for.

Any faster, simpler alternative to php preg_match

I am using cakephp 1.3 and I have a textarea where users submit articles. On submit, I want to look through the article for certain keywords and add the respective tags to the article.
I was thinking of preg_match, but the preg_match pattern has to be a string, so I would have to loop through a (big) array.
Is there an easier way to plug the keywords array into the pattern?
I appreciate all your help.
Thanks.
I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.
I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
$arr = explode(" ", strtolower($article));
$tracker = array();
foreach ($arr as $word) {
    if (in_array($word, $keywords)) {
        if (isset($tracker[$word]))
            $tracker[$word]++;
        else
            $tracker[$word] = 1;
    }
}
The $tracker array would output: "computer" => 1, "dog" => 2. You can then do the process to decide what tags to use. Or if you don't care about the number of times the keyword appears, you can skip the tracker part and add the tags as the keywords appear.
EDIT: The keyword array may need to be an inverted index array to ensure the fastest lookup. I am not sure how in_array() works, but if it searches, then this isn't as fast as it should be. An inverted index array would look like
array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any value
Then you would do isset($keywords[$word]) to check if the word matches a keyword, instead of in_array(), which should give you O(1). Someone else may be able to clarify this for me though.
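The isset() variant described in the edit could look roughly like this. It is a sketch rather than a benchmarked solution; array_flip() is used here as one way to build the inverted index from the plain keyword list, and str_word_count() replaces explode() so that punctuation does not stick to the words.

```php
<?php
// Inverted index: keyword => any value, so isset() is a hash lookup
$keywords = array_flip(['computer', 'dog', 'sandwich']);
$article = "This is a test using your computer when your dog is being a dog";

$tracker = [];
foreach (str_word_count(strtolower($article), 1) as $word) {
    if (isset($keywords[$word])) {
        // Count how often each keyword appears in the article
        $tracker[$word] = ($tracker[$word] ?? 0) + 1;
    }
}
```

in_array() scans the array linearly, so with a large keyword list the isset() lookup should make a noticeable difference.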
If you don't need the power of regular expressions, you should just use strpos().
You will still need to loop through the array of words, but strpos is much, much faster than preg_match.
Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.
Instead, you can try a different approach, such as splitting the text into words and checking whether the words are interesting or not. I would make use of str_word_count() using something like:
$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));
that splits the text into words and counts occurrences. Because the words are the array keys, you can then check with isset($words_freq[$keyword]) or array_intersect(array_keys($words_freq), $my_keywords).
If, as I guess, you don't care about the case of the keywords, you can strtolower() the whole text before splitting it into words.
Of course, the only way to determine which approach is the best is to setup some testing, by running various search functions against some "representative" and quite long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).
EDIT: I cleaned up a bit the code and added a missing semi-colon :)
If you want to look for multiple words from an array, then combine said array into an regular expression:
$regex_array = implode("|", array_map("preg_quote", $array));
preg_match_all("/($regex_array)/", $src, $tags);
This converts your array into /(word|word|word|word|word|...)/. The array_map and preg_quote part is optional, only needed if the $array might contain special characters. If a keyword can contain the / delimiter, pass it as the second argument to preg_quote.
Avoid strpos and loops for this case. preg_match is faster when searching for alternatives.
strtr()
If given two arguments, the second should be an array in the form array('from' => 'to', ...). The return value is a string where all the occurrences of the array keys have been replaced by the corresponding values. The longest keys will be tried first. Once a substring has been replaced, its new value will not be searched again.
Add tags manually? Just like we add tags here at SO.

php regular expression to filter out junk

So I have an interesting problem: I have a string, and for the most part i know what to expect:
http://www.someurl.com/st=????????
Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ
Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.
The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward as many as I need to fill up the 8 character quota. Is there a regex way to do this or will i need to roll up my sleeves and go nested-loop style?
update:
To clear up some confusion, I get an input string that's like this:
[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????
except the garbage is in unpredictable locations in the string (except the end is never garbage), and has unpredictable length (at least, I have been able to find patterns in neither). Usually the ?s are all together hence me just grabbing the last 8 chars, but sometimes they aren't which results in some missing data and returned garbage :-\
$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case
$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);
Hah, that was a joke. Here's a regex for you:
$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":
__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__
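Within that limitation, the heuristic described in the question (take the trailing run of upper-case letters and digits, and if it is too short, fill up from the characters after st=) could be sketched as follows. The function name extractCode is hypothetical, and the sketch assumes that the literal st= survives intact and that the garbage between st= and the end contains no upper-case letters or digits of its own.

```php
<?php
// Hypothetical helper implementing the heuristic from the question.
// Returns null when it cannot assemble $len characters.
function extractCode(string $s, int $len = 8): ?string
{
    // Trailing run of upper-case letters and digits (may be empty)
    preg_match('/[A-Z0-9]*\z/', $s, $m);
    $tail = $m[0];
    if (strlen($tail) >= $len) {
        return substr($tail, -$len);
    }
    // Fallback: fill the missing characters from whatever follows "st="
    $pos = strpos($s, 'st=');
    if ($pos === false) {
        return null;
    }
    $start = $pos + 3;
    $middle = substr($s, $start, strlen($s) - $start - strlen($tail));
    $clean = preg_replace('/[^A-Z0-9]/', '', $middle);
    $head = substr($clean, 0, $len - strlen($tail));
    return strlen($head) + strlen($tail) === $len ? $head . $tail : null;
}

$ok = extractCode('http://www.someurl.com/st=AB12CD34');
$broken = extractCode('http://www.someurl.com/st=AB1xy2C!!D34');
```

If the garbage itself contains upper-case letters or digits, the fallback will pick them up, which is exactly the ambiguity described above.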
What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().
You can use this regular expression :
if (preg_match('/[\'^£$%&*()}{##~?><>,|=_+¬-]/', $string) ==1)
