Compare sentences with high similarity - php

I'm trying to create a method/function that compares two sentences and returns a percentage of their similarity.
For example, PHP has a function called similar_text, but it does not work well for this.
Here are a few examples that should get a high similarity when compared against each other:
In the backyard there is a green tree and the sun is shinnying.
The sun is shinnying in the backyard and there is a green tree too.
A yellow tree is in the backyard with a shinnying sun.
In the front yard there is a green tree and the sun is shinnying.
In the front yard there is a red tree and the sun is no shinnying.
Does anyone know of a good example?
I would prefer to use PHP, but I don't mind using Java or Python.
On the internet I found this function:
function compareStrings($s1, $s2) {
//one is empty, so no result
if (strlen($s1)==0 || strlen($s2)==0) {
return 0;
}
//replace non-alphabetic characters with spaces
//"-" is kept in case it's used to combine words
$s1clean = preg_replace("/[^A-Za-z-]/", ' ', $s1);
$s2clean = preg_replace("/[^A-Za-z-]/", ' ', $s2);
//replace double spaces with single spaces
$s1clean = str_replace("  ", " ", $s1clean);
$s2clean = str_replace("  ", " ", $s2clean);
//create arrays
$ar1 = explode(" ",$s1clean);
$ar2 = explode(" ",$s2clean);
$l1 = count($ar1);
$l2 = count($ar2);
//flip the arrays if needed so ar1 is always largest.
if ($l2>$l1) {
$t = $ar2;
$ar2 = $ar1;
$ar1 = $t;
}
//flip array 2, to make the words the keys
$ar2 = array_flip($ar2);
$maxwords = max($l1, $l2);
$matches = 0;
//find matching words
foreach($ar1 as $word) {
if (array_key_exists($word, $ar2))
$matches++;
}
return ($matches / $maxwords) * 100;
}
But it's only returning 80%. similar_text is returning just 39%.
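One option that ignores word order is cosine similarity over word counts. This is a different technique from the function above, not a fix for it. The sketch below is minimal and makes some assumptions: words are runs of letters, case folding is ASCII-only, and the names wordCounts and cosineSimilarity are made up for illustration.

```php
<?php
// Hypothetical helper: lowercases a sentence and counts how often each word occurs.
function wordCounts(string $sentence): array
{
    // \pL+ matches runs of letters; the u flag treats the input as UTF-8.
    preg_match_all('/\pL+/u', strtolower($sentence), $matches);
    return array_count_values($matches[0]);
}

// Cosine similarity of the two word-count vectors, scaled to 0..100.
function cosineSimilarity(string $s1, string $s2): float
{
    $a = wordCounts($s1);
    $b = wordCounts($s2);
    if ($a === [] || $b === []) {
        return 0.0; // nothing to compare
    }
    // Dot product over the words of the first sentence.
    $dot = 0;
    foreach ($a as $word => $count) {
        $dot += $count * ($b[$word] ?? 0);
    }
    // Euclidean length of each count vector.
    $normA = sqrt(array_sum(array_map(fn ($c) => $c * $c, $a)));
    $normB = sqrt(array_sum(array_map(fn ($c) => $c * $c, $b)));
    return 100.0 * $dot / ($normA * $normB);
}

echo cosineSimilarity(
    'In the backyard there is a green tree and the sun is shinnying.',
    'The sun is shinnying in the backyard and there is a green tree too.'
), "\n";
```

Because only word counts are compared, reordering a sentence does not lower the score. Swapping "green" for "red" or adding "no" changes it only slightly, which may or may not be what you want.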

Related

How can I split a string into three word chunks?

I need to split a string into overlapping chunks of three words using PHP:
"This is an example of what I need."
The output would be:
This is an
is an example
an example of
example of what
of what I
what I need
I have this example in Java:
String myString = "This is an example of what I need.";
String[] words = myString.split("\\s+");
for (int i = 0; i < words.length; i++) {
String threeWords;
if (i == words.length - 1)
threeWords = words[i];
else if(i == words.length - 2)
threeWords = words[i] + " " + words[i + 1];
else
threeWords = words[i] + " " + words[i + 1] + " " + words[i + 2];
System.out.println(threeWords);
}
A solution that uses explode, array_slice and implode:
$example = 'This is an example of what I need.';
$arr = explode(" ",$example);
foreach($arr as $key => $word){
//if($key == 0) continue;
$subArr = array_slice($arr,$key,3);
if(count($subArr) < 3) break;
echo implode(" ",$subArr)."<br>\n";
}
Output:
This is an
is an example
an example of
example of what
of what I
what I need.
If you want to suppress the first output, the one starting with This, uncomment the line
//if($key == 0) continue;
If the example can have fewer than 3 words and these should still be output, then the line with the break must read as follows:
if(count($subArr) < 3 AND $key != 0) break;
For strings that are not only separated by single spaces, preg_split is recommended. Example:
$example = "This is an example of what I need.
Sentence,containing a comma and Raphaël.";
$arr = preg_split('/[ .,;\r\n\t]+/u', $example, 0, PREG_SPLIT_NO_EMPTY);
foreach($arr as $key => $word){
$subArr = array_slice($arr,$key,3);
if(count($subArr) < 3) break;
echo implode(" ",$subArr)."<br>\n";
}
Output:
This is an
is an example
an example of
example of what
of what I
what I need
I need Sentence
need Sentence containing
Sentence containing a
containing a comma
a comma and
comma and Raphaël
To include only words (notice the full stop is removed from the last trigram), use str_word_count() to form the array of words.
Then loop for as long as there are three elements to print.
Code:
$example = 'This is an example of what I need.';
$words = str_word_count($example, 1);
for ($x = 0; isset($words[$x + 2]); ++$x) {
printf("%s %s %s\n", $words[$x], $words[$x + 1], $words[$x + 2]);
}
Output:
This is an
is an example
an example of
example of what
of what I
what I need
If you don't like printf(), you could echo implode(' ', [the three elements]).
If you want to print a single string of words when there are fewer than 3 total words in the string, then you could use a post-test loop.
And then, of course, if we are going to stumble down the rocky road of "what is a word", an ironclad definition of a word will need to be agreed on first, and then a regex (potentially with multibyte support) will need to be crafted to match it.
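The post-test loop mentioned above could look roughly like this. It is a sketch, not the original demo: the function name wordTrigrams is made up, and returning an array instead of printing is my own choice so the result is easy to check.

```php
<?php
// Hypothetical helper: returns all word trigrams, or the whole input as a
// single entry when it contains fewer than three words.
function wordTrigrams(string $text): array
{
    $words = str_word_count($text, 1); // format 1 returns the list of words
    if ($words === []) {
        return [];
    }
    $result = [];
    $x = 0;
    do {
        // array_slice returns up to three words, so a short input still
        // produces one entry on the first pass.
        $result[] = implode(' ', array_slice($words, $x, 3));
        ++$x;
    } while (isset($words[$x + 2])); // continue while a full trigram remains
    return $result;
}

print_r(wordTrigrams('This is an example of what I need.'));
print_r(wordTrigrams('Hello world')); // a single entry: "Hello world"
```

The body runs once before the condition is tested, which is what lets a one- or two-word input through.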

How can I make a string from two strings with the common and uncommon words in php?

I have tried this, but the code returns the common words only. I want the output for the two variables below to look like this: "Warfarin 5 mg Tablet, Syrup".
$string1 = "Warfarin 5 mg Tablet";
$string2 = "Warfarin 5 mg Syrup";
function show_unique_strings($a, $b) {
$aArray = explode(" ",$a);
$bArray = explode(" ",$b);
$intersect = array_intersect($aArray, $bArray);
$str = implode(" ", array_merge(array_diff($aArray, $intersect), array_diff($bArray, $intersect)));
return $str;
}
return show_unique_strings($string1, $string2);
Please try this.
Let's say the two strings are "Warfarin 5 mg Tablet" and "Warfarin 5 mg Syrup":
$string1 = "Warfarin 5 mg Tablet";
$string2 = "Warfarin 5 mg Syrup";
$diff = array_diff(explode(" ", $string2), explode(" ", $string1));
$get = current($diff);
$response = $string1.', '.$get;
But again I have to ask: are those strings constant or dynamic?
Assuming the word-count is the same across all Strings:
function show_unique_strings(...$strings) {
$stringsAsArray = array_map(fn ($string) => explode(' ', $string), $strings);
$countWords = count($stringsAsArray[0]);
$resultArray = [];
for ($i = 0; $i < $countWords; $i++) {
$words = array_unique(array_column($stringsAsArray, $i));
$resultArray[] = implode(', ', $words);
}
return implode(' ', $resultArray);
}
Code in words:
We convert every supplied string to an array via explode within array_map.
Then we count how many words the first string has (count on the exploded string).
For every word position $i, array_column collects the word at that position from all strings and array_unique keeps only the distinct ones, which gives an array of unique words for position $i across all strings.
We join this array using implode and append it to $resultArray.
Finally we join $resultArray likewise and return it.
Working example: https://3v4l.org/BlFWW
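If the word counts can differ, the position-by-position comparison no longer lines up. Here is a rough sketch of a variant, assuming "common" means words that occur in every string and that these should be printed first. The name mergeCommonAndUncommon is made up for illustration.

```php
<?php
// Hypothetical variant: words shared by all strings come first, followed by
// each string's remaining words, with the per-string remainders joined by ", ".
function mergeCommonAndUncommon(string ...$strings): string
{
    if ($strings === []) {
        return '';
    }
    $wordLists = array_map(fn ($s) => explode(' ', $s), $strings);
    // Words present in every string, in the order of the first string.
    $common = count($wordLists) > 1
        ? array_intersect(...$wordLists)
        : $wordLists[0];
    $parts = [];
    foreach ($wordLists as $words) {
        // Whatever is left of this string after removing the common words.
        $rest = array_diff($words, $common);
        if ($rest !== []) {
            $parts[] = implode(' ', $rest);
        }
    }
    return trim(implode(' ', $common) . ' ' . implode(', ', $parts));
}

echo mergeCommonAndUncommon('Warfarin 5 mg Tablet', 'Warfarin 5 mg Syrup'), "\n";
// prints "Warfarin 5 mg Tablet, Syrup"
```

Note that this moves all common words to the front, so it only reproduces the original word order when the differing words come last, as in the example.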

Find most repeated sub strings in array

I have an array:
$myArray=array(
'hello my name is richard',
'hello my name is paul',
'hello my name is simon',
'hello it doesn\'t matter what my name is'
);
I need to find the sub string (min 2 words) that is repeated the most often, maybe in an array format, so my return array could look like this:
$return=array(
array('hello my', 3),
array('hello my name', 3),
array('hello my name is', 3),
array('my name', 4),
array('my name is', 4),
array('name is', 4),
);
So I can see from this array of arrays how often each string was repeated amongst all strings in the array.
Is the only way to do it like this?..
function repeatedSubStrings($array){
foreach($array as $string){
$phrases=//Split each string into maximum number of sub strings
foreach($phrases as $phrase){
//Then count the $phrases that are in the strings
}
}
}
I've tried a solution similar to the above but it was too slow, processing around 1000 rows per second, can anyone do it faster?
A solution to this might be
function getHighestRecurrence($strs){
/*Storage for individual words*/
$words = Array();
/*Process multiple strings*/
if(is_array($strs))
foreach($strs as $str)
$words = array_merge($words, explode(" ", $str));
/*Prepare single string*/
else
$words = explode(" ",$strs);
/*Array for word counters*/
$index = Array();
/*Aggregate word counters*/
foreach($words as $word)
/*Increment count or create if it doesn't exist*/
(isset($index[$word]))? $index[$word]++ : $index[$word] = 1;
/*Sort array by highest value*/
arsort($index);
/*Return the word*/
return key($index);
}
While this has a higher runtime, I think it's simpler from an implementation perspective:
$substrings = array();
foreach ($myArray as $str)
{
$subArr = explode(" ", $str);
for ($i=0;$i<count($subArr);$i++)
{
$substring = "";
for ($j=$i;$j<count($subArr);$j++)
{
if ($i==0 && ($j==count($subArr)-1))
break;
$substring = trim($substring . " " . $subArr[$j]);
if (str_word_count($substring, 0) > 1)
{
if (array_key_exists($substring, $substrings))
$substrings[$substring]++;
else
$substrings[$substring] = 1;
}
}
}
}
arsort($substrings);
print_r($substrings);
I'm assuming by "substring" you really mean "substring split along word boundaries" since that's what your example shows.
In that case, assuming any maximum repeated substring will do (since there may be ties), you can always choose just a single word as a maximum repeated substring, if you think about it. For any phrase "A B", the phrases "A" and "B" individually must occur at least as often as "A B" because they both occur every time "A B" does and they may occur at other times. Therefore, a single word must have a count that at least ties with the count of any substring that contains that word.
So you just need to split all phrases into a set of unique words, and then just count the words and return one of the words with the highest count. This will run way faster than actually counting every possible substring.
This should run in O(n) time
$twoWordPhrases = function($str) {
$words = preg_split('#\s+#', $str, -1, PREG_SPLIT_NO_EMPTY);
$phrases = array();
// a plain for loop avoids range() counting downwards when there is only one word
for ($offset = 0; $offset < count($words) - 1; $offset++) {
$phrases[] = array_slice($words, $offset, 2);
}
return $phrases;
};
$frequencies = array();
foreach ($myArray as $str) {
$phrases = $twoWordPhrases($str);
foreach ($phrases as $phrase) {
$key = join('/', $phrase);
if (!isset($frequencies[$key])) {
$frequencies[$key] = 0;
}
$frequencies[$key]++;
}
}
print_r($frequencies);
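The two-word counting above can be extended to every phrase of two or more words, which is closer to the return format the question asks for. This is a sketch under stated assumptions: words are split on whitespace only, every occurrence is counted (including repeats inside one string), and the function name countPhrases is hypothetical.

```php
<?php
// Hypothetical helper: counts every run of at least $minWords consecutive
// words across all input strings.
function countPhrases(array $strings, int $minWords = 2): array
{
    $counts = [];
    foreach ($strings as $str) {
        $words = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
        $n = count($words);
        for ($start = 0; $start < $n; $start++) {
            $phrase = '';
            // Grow the phrase one word at a time instead of re-slicing.
            for ($end = $start; $end < $n; $end++) {
                $phrase = $end === $start ? $words[$end] : $phrase . ' ' . $words[$end];
                if ($end - $start + 1 >= $minWords) {
                    $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
                }
            }
        }
    }
    arsort($counts); // most frequent phrases first
    return $counts;
}

$myArray = array(
    'hello my name is richard',
    'hello my name is paul',
    'hello my name is simon',
    'hello it doesn\'t matter what my name is'
);
print_r(countPhrases($myArray));
```

The work per string grows quadratically with its word count, so this is only reasonable for short strings. For long texts a cap on the phrase length would be worth adding.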

"Unfolding" a String

I have a set of strings, each string has a variable number of segments separated by pipes (|), e.g.:
$string = 'abc|b|ac';
Each segment with more than one char should be expanded into all the possible one-char combinations. For 3 segments the following "algorithm" works wonderfully:
$result = array();
$string = explode('|', 'abc|b|ac');
foreach (str_split($string[0]) as $i)
{
foreach (str_split($string[1]) as $j)
{
foreach (str_split($string[2]) as $k)
{
$result[] = implode('|', array($i, $j, $k)); // more...
}
}
}
print_r($result);
Output:
$result = array('a|b|a', 'a|b|c', 'b|b|a', 'b|b|c', 'c|b|a', 'c|b|c');
Obviously, for more than 3 segments the code starts to get extremely messy, since I need to add (and check) more and more inner loops. I tried coming up with a dynamic solution but I can't figure out how to generate the correct combination for all the segments (individually and as a whole). I also looked at some combinatorics source code but I'm unable to combine the different combinations of my segments.
I would appreciate it if anyone could point me in the right direction.
Recursion to the rescue (you might need to tweak a bit to cover edge cases, but it works):
function explodinator($str) {
$segments = explode('|', $str);
$pieces = array_map('str_split', $segments);
return e_helper($pieces);
}
function e_helper($pieces) {
if (count($pieces) == 1)
return $pieces[0];
$first = array_shift($pieces);
$subs = e_helper($pieces);
foreach($first as $char) {
foreach ($subs as $sub) {
$result[] = $char . '|' . $sub;
}
}
return $result;
}
print_r(explodinator('abc|b|ac'));
Outputs:
Array
(
[0] => a|b|a
[1] => a|b|c
[2] => b|b|a
[3] => b|b|c
[4] => c|b|a
[5] => c|b|c
)
As seen on ideone.
This looks like a job for recursive programming! :P
I first looked at this and thought it was going to be a one-liner (and it probably is in Perl).
There are other non-recursive ways (enumerate all combinations of indexes into segments then loop through, for example) but I think this is more interesting, and probably 'better'.
$str = explode('|', 'abc|b|ac');
$strlen = count( $str );
$results = array();
function splitAndForeach( $bchar , $oldindex, $tempthread) {
global $strlen, $str, $results;
$temp = $tempthread;
$newindex = $oldindex + 1;
if ( $bchar != '') { array_push($temp, $bchar ); }
if ( $newindex <= $strlen ){
print "starting foreach loop on string '".$str[$newindex-1]."' \n";
foreach(str_split( $str[$newindex - 1] ) as $c) {
print "Going into next depth ($newindex) of recursion on char $c \n";
splitAndForeach( $c , $newindex, $temp);
}
} else {
$found = implode('|', $temp);
print "Array length (max recursion depth) reached, result: $found \n";
array_push( $results, $found );
$temp = $tempthread;
$index = 0;
print "***************** Reset index to 0 *****************\n\n";
}
}
splitAndForeach('', 0, array() );
print "your results: \n";
print_r($results);
You could have two arrays: the alternatives and a current counter.
$alternatives = array(array('a', 'b', 'c'), array('b'), array('a', 'c'));
$counter = array(0, 0, 0);
Then, in a loop, you increment the "last digit" of the counter, and if that is equal to the number of alternatives for that position, you reset that "digit" to zero and increment the "digit" to its left. This works just like counting with decimal numbers.
The string for each step is built by concatenating the $alternatives[$i][$counter[$i]] for each digit.
You are finished when the "first digit" becomes as large as the number of alternatives for that digit.
Example: for the above variables, the counter would get the following values in the steps:
0,0,0
0,0,1
1,0,0 (overflow in the last two digits)
1,0,1
2,0,0 (overflow in the last two digits)
2,0,1
3,0,0 (finished, since the first "digit" has only 3 alternatives)
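The counter idea described above can be sketched without recursion as follows. The function name unfold is made up, and the code assumes every segment contains at least one character. Instead of letting the first "digit" reach its limit, it stops as soon as the carry runs off the left end, which amounts to the same thing.

```php
<?php
// Hypothetical helper implementing the counter ("odometer") approach.
function unfold(string $input): array
{
    // One list of single characters per pipe-separated segment.
    $alternatives = array_map('str_split', explode('|', $input));
    $last = count($alternatives) - 1;
    $counter = array_fill(0, count($alternatives), 0);
    $results = [];
    while (true) {
        // Build the string for the current counter state.
        $chars = [];
        foreach ($counter as $i => $digit) {
            $chars[] = $alternatives[$i][$digit];
        }
        $results[] = implode('|', $chars);
        // Increment the last digit; on overflow reset it and carry leftwards.
        $pos = $last;
        while ($pos >= 0 && ++$counter[$pos] === count($alternatives[$pos])) {
            $counter[$pos] = 0;
            $pos--;
        }
        if ($pos < 0) {
            break; // the first digit overflowed, so every combination was emitted
        }
    }
    return $results;
}

print_r(unfold('abc|b|ac'));
// a|b|a, a|b|c, b|b|a, b|b|c, c|b|a, c|b|c
```

This works for any number of segments and keeps no call stack, at the cost of being a little less obvious than the recursive version.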

Affinity between a text and a list of keywords?

I have a portion of text (500-1500 chars),
and I have a list of keywords (1000 records).
What should I do to find the keywords from that list that are related to my given text?
I was thinking of searching my text for occurrences of each keyword in the list, but that seems a bit "expensive", I think.
Thanks
If the keywords always stay the same you could create an index over them which improves search speed (tremendously). The standard data structure to handle this is the trie but a much better (!) alternative is the Aho-Corasick automaton or another multi-pattern search algorithm such as multi-pattern Horspool (also known as Wu-Manber algorithm).
Finally, a very simple alternative is to concatenate all your keywords with pipes (|) and use the result as a regular expression. Technically, this approaches the Aho-Corasick automaton and is much simpler for you to implement.
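The regular-expression variant could look roughly like this. It is a sketch under some assumptions: keywords begin and end with a word character (so that \b behaves as expected), matching is case-insensitive, and the function name findKeywords is hypothetical.

```php
<?php
// Hypothetical helper: finds which keywords occur in $text using a single
// combined regular expression, and how often.
function findKeywords(string $text, array $keywords): array
{
    if ($keywords === []) {
        return []; // an empty alternation would match everywhere
    }
    // Longer keywords first, so "green tree" is preferred over "green".
    usort($keywords, fn ($a, $b) => strlen($b) <=> strlen($a));
    // Escape regex metacharacters, including the "/" delimiter.
    $escaped = array_map(fn ($k) => preg_quote($k, '/'), $keywords);
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';
    preg_match_all($pattern, $text, $matches);
    // Fold case so "Sun" and "sun" are counted together.
    return array_count_values(array_map('strtolower', $matches[0]));
}

$text = 'In the backyard there is a green tree and the Sun is shining';
print_r(findKeywords($text, ['tree', 'sun', 'moon', 'green tree']));
// green tree => 1, sun => 1
```

Matches do not overlap, so once "green tree" has matched, the shorter keyword "tree" is not counted again at the same spot.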
I throw my hat in the ring …
function extractWords($text, $minWordLength = null, array $stopwords = array(), $caseIgnore = true)
{
$pattern = '/\w'. (is_null($minWordLength) ? '+' : '{'.$minWordLength.',}') .'/';
$matches = array();
preg_match_all($pattern, $text, $matches);
$words = $matches[0];
if ($caseIgnore) {
$words = array_map('strtolower', $words);
// lowercase the stopwords as well, so the array_diff below ignores case
$stopwords = array_map('strtolower', $stopwords);
}
$words = array_diff($words, $stopwords);
return $words;
}
function countKeywords(array $words, array $keywords, $threshold = null, $caseIgnore = true)
{
if ($caseIgnore) {
$keywords = array_map('strtolower', $keywords);
}
$words = array_intersect($words, $keywords);
$counts = array_count_values($words);
arsort($counts, SORT_NUMERIC);
if (!is_null($threshold)) {
$counts = array_filter($counts, function ($count) use ($threshold) { return $count >= $threshold; });
}
return $counts;
}
Usage:
$text = 'a b c a'; // your text
$keywords = array('a', 'b'); // keywords from your database
$words = extractWords($text);
$count = countKeywords($words, $keywords);
print_r($count);
$total = array_sum($count);
var_dump($total);
$affinity = ($total == 0 ? 0 : 1 / (count($words) / $total));
var_dump($affinity);
Prints
Array
(
[a] => 2
[b] => 1
)
int(3)
float(0.75)
