Find most repeated sub strings in array

I have an array:
$myArray=array(
'hello my name is richard',
'hello my name is paul',
'hello my name is simon',
'hello it doesn\'t matter what my name is'
);
I need to find the sub string (min 2 words) that is repeated the most often, maybe in an array format, so my return array could look like this:
$return=array(
array('hello my', 3),
array('hello my name', 3),
array('hello my name is', 3),
array('my name', 4),
array('my name is', 4),
array('name is', 4),
);
So I can see from this array of arrays how often each string was repeated amongst all strings in the array.
Is this the only way to do it?
function repeatedSubStrings($array){
foreach($array as $string){
$phrases = array(); // TODO: split each string into the maximum number of sub strings
foreach($phrases as $phrase){
//Then count the $phrases that are in the strings
}
}
}
I've tried a solution similar to the above, but it was too slow: it processed around 1000 rows per second. Can anyone do it faster?

A solution to this might be
function getHighestRecurrence($strs){
/*Storage for individual words*/
$words = Array();
/*Process multiple strings*/
if(is_array($strs))
foreach($strs as $str)
$words = array_merge($words, explode(" ", $str));
/*Prepare single string*/
else
$words = explode(" ",$strs);
/*Array for word counters*/
$index = Array();
/*Aggregate word counters*/
foreach($words as $word)
/*Increment count or create if it doesn't exist*/
(isset($index[$word]))? $index[$word]++ : $index[$word] = 1;
/*Sort array by highest value first*/
arsort($index);
/*Return the word*/
return key($index);
}

While this has a higher runtime, I think it's simpler from an implementation perspective:
$substrings = array();
foreach ($myArray as $str)
{
$subArr = explode(" ", $str);
for ($i=0;$i<count($subArr);$i++)
{
$substring = "";
for ($j=$i;$j<count($subArr);$j++)
{
if ($i==0 && ($j==count($subArr)-1))
break;
$substring = trim($substring . " " . $subArr[$j]);
if (str_word_count($substring, 0) > 1)
{
if (array_key_exists($substring, $substrings))
$substrings[$substring]++;
else
$substrings[$substring] = 1;
}
}
}
}
arsort($substrings);
print_r($substrings);

I'm assuming by "substring" you really mean "substring split along word boundaries" since that's what your example shows.
In that case, assuming any maximum repeated substring will do (since there may be ties), you can always choose just a single word as a maximum repeated substring, if you think about it. For any phrase "A B", the phrases "A" and "B" individually must occur at least as often as "A B" because they both occur every time "A B" does and they may occur at other times. Therefore, a single word must have a count that at least ties with any substring that contains that word.
So you just need to split all phrases into a set of unique words, and then just count the words and return one of the words with the highest count. This will run way faster than actually counting every possible substring.

This counts two-word phrases only and should run in O(n) time, where n is the total number of words:
$twoWordPhrases = function($str) {
$words = preg_split('#\s+#', $str, -1, PREG_SPLIT_NO_EMPTY);
$phrases = array();
// A plain for loop avoids range(0, -1), which counts downwards for one-word strings
for ($offset = 0; $offset < count($words) - 1; $offset++) {
$phrases[] = array_slice($words, $offset, 2);
}
return $phrases;
};
$frequencies = array();
foreach ($myArray as $str) {
$phrases = $twoWordPhrases($str);
foreach ($phrases as $phrase) {
$key = join('/', $phrase);
if (!isset($frequencies[$key])) {
$frequencies[$key] = 0;
}
$frequencies[$key]++;
}
}
print_r($frequencies);
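
The two-word version above can be extended to every phrase length of two words or more, which is closer to the output the question asks for. The following is only a sketch under some assumptions: the function name countWordNgrams and the $minWords parameter are made up for illustration, words are split on whitespace, and each phrase is counted at most once per input string.

```php
<?php
// Hypothetical helper (not from the thread): count every run of consecutive
// words that is at least $minWords long, at most once per input string.
function countWordNgrams(array $strings, int $minWords = 2): array
{
    $counts = [];
    foreach ($strings as $str) {
        $words = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
        $n = count($words);
        $seen = []; // phrases already found in this string
        for ($start = 0; $start < $n; $start++) {
            for ($len = $minWords; $start + $len <= $n; $len++) {
                $seen[implode(' ', array_slice($words, $start, $len))] = true;
            }
        }
        foreach (array_keys($seen) as $phrase) {
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent phrases first
    return $counts;
}

$myArray = [
    'hello my name is richard',
    'hello my name is paul',
    'hello my name is simon',
    'hello it doesn\'t matter what my name is',
];
$counts = countWordNgrams($myArray);
```

With the question's $myArray this gives 4 for 'my name' and 3 for 'hello my'. It still enumerates a quadratic number of phrases per string, so it is not O(n); it only avoids rescanning the input for every phrase.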

Related

Compare sentence with high similarity

I'm trying to create a method/function that compares two sentences and returns their similarity as a percentage.
For example, PHP has a function called similar_text, but it doesn't work well for this.
Here are a few example sentences that should get a high similarity when compared against each other:
In the backyard there is a green tree and the sun is shinnying.
The sun is shinnying in the backyard and there is a green tree too.
A yellow tree is in the backyard with a shinnying sun.
In the front yard there is a green tree and the sun is shinnying.
In the front yard there is a red tree and the sun is no shinnying.
Does anyone know a good way to do this?
I would prefer to use PHP for it, but I don't mind using Java or Python.
On the internet I found this function:
function compareStrings($s1, $s2) {
//one is empty, so no result
if (strlen($s1)==0 || strlen($s2)==0) {
return 0;
}
//replace non-alphabetic characters
//I left - in, in case it's used to combine words
$s1clean = preg_replace("/[^A-Za-z-]/", ' ', $s1);
$s2clean = preg_replace("/[^A-Za-z-]/", ' ', $s2);
//remove double spaces
$s1clean = str_replace("  ", " ", $s1clean);
$s2clean = str_replace("  ", " ", $s2clean);
//create arrays
$ar1 = explode(" ",$s1clean);
$ar2 = explode(" ",$s2clean);
$l1 = count($ar1);
$l2 = count($ar2);
//flip the arrays if needed so ar1 is always largest.
if ($l2>$l1) {
$t = $ar2;
$ar2 = $ar1;
$ar1 = $t;
}
//flip array 2, to make the words the keys
$ar2 = array_flip($ar2);
$maxwords = max($l1, $l2);
$matches = 0;
//find matching words
foreach($ar1 as $word) {
if (array_key_exists($word, $ar2))
$matches++;
}
return ($matches / $maxwords) * 100;
}
But it's only returning 80%. similar_text is returning just 39%.
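
One possible alternative, offered only as a sketch, is to ignore word order and compare the two sentences as sets of words using the Jaccard index (shared words divided by all distinct words). This is a different technique from the function above, and the names wordSet and jaccardSimilarity are made up for illustration. It assumes English text and treats anything that is not a letter as a separator.

```php
<?php
// Hypothetical helper: lowercase the sentence and return its distinct words.
function wordSet(string $sentence): array
{
    $words = preg_split('/[^a-z]+/', strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Jaccard index of the two word sets, scaled to a percentage.
function jaccardSimilarity(string $a, string $b): float
{
    $setA = wordSet($a);
    $setB = wordSet($b);
    $union = count(array_unique(array_merge($setA, $setB)));
    if ($union === 0) {
        return 0.0; // both sentences are empty
    }
    $intersection = count(array_intersect($setA, $setB));
    return 100.0 * $intersection / $union;
}
```

Because it works on sets, reordered sentences score the same as the original order, while every word that differs lowers the score. Whether that matches the intended notion of "similar" is a judgment call.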

PHP: count occurrences of two lists of words in a string

I have two lists of words. The idea is to count how many times each word appears in an article, then calculate the difference.
Example:
List1 = "how, now, brown, cow"
List2 = "he, usually, urges, an, umbrella, upon, us"
Content: "How can I buy a cow when the umbrella is cheaper?"
Result: List1(2) - List2(1) = 1
I have fairly noobish PHP skills.

In this case, we can use the PHP functions explode and, as incognito-skull mentioned, in_array. You can do it like so:
$list1 = ['how', 'now', 'brown', 'cow'];
$list2 = ['he', 'usually', 'urges', 'an', 'umbrella', 'upon', 'us'];
$timesAList1WordAppeared = 0;
$timesAList2WordAppeared = 0;
$text = "how can I buy a cow when the umbrella is cheaper?";
$wordArray = explode(' ', $text);
foreach ($wordArray as $word) {
if (in_array($word, $list1)) {
$timesAList1WordAppeared++;
}
if (in_array($word, $list2)) {
$timesAList2WordAppeared++;
}
}
echo "The difference is: ".($timesAList1WordAppeared - $timesAList2WordAppeared);
Let's go at it step by step
At first, we initialize the array and counter variables
$list1 = ['how', 'now', 'brown', 'cow'];
$list2 = ['he', 'usually', 'urges', 'an', 'umbrella', 'upon', 'us'];
$timesAList1WordAppeared = 0;
$timesAList2WordAppeared = 0;
Then, we initialize the text
$text = "how can I buy a cow when the umbrella is cheaper?";
Then, we split this text using a space to get the words. This is where the explode function comes in and we use it like so
$wordArray = explode(' ', $text);
The first argument is the character or string we will use to split the text and the second argument is the text itself. Then we go through our words and count how many times a word in our list1 and list2 appears in the text. We do it like so
foreach ($wordArray as $word) {
if (in_array($word, $list1)) {
$timesAList1WordAppeared++;
}
if (in_array($word, $list2)) {
$timesAList2WordAppeared++;
}
}
The code goes like this, for each word in our wordArray, if that word is in_[the]_array list1, increment the timesAList1WordAppeared. If that word is also in_[the]_array list2, increment the timesAList2WordAppeared.
Lastly, we print the result
echo "The difference is: ".($timesAList1WordAppeared - $timesAList2WordAppeared);
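
Note that splitting on spaces is case-sensitive and keeps punctuation attached, so "How" or "cow?" would not match the lists. A minimal sketch of one way around that follows, assuming English words; the function name countListWords is made up for illustration.

```php
<?php
// Hypothetical helper: count how many words of $text appear in $list,
// ignoring case and punctuation.
function countListWords(string $text, array $list): int
{
    // Flip the list so membership is a key lookup instead of a linear in_array scan
    $lookup = array_flip(array_map('strtolower', $list));
    // Split on anything that is not a letter or an apostrophe
    $words = preg_split('/[^a-z\']+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $count = 0;
    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            $count++;
        }
    }
    return $count;
}

$list1 = ['how', 'now', 'brown', 'cow'];
$list2 = ['he', 'usually', 'urges', 'an', 'umbrella', 'upon', 'us'];
$content = 'How can I buy a cow when the umbrella is cheaper?';
$difference = countListWords($content, $list1) - countListWords($content, $list2);
```

For the example from the question this gives 2 matches for the first list and 1 for the second, so the difference is 1.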

Working with substr_count() and arrays in PHP

So what I need is to compare a string to an array (string as a haystack and array as a needle) and get the elements from the string that repeat within the array. For this purpose I've taken a sample function for using an array as a needle in the substr_count function.
$animals = array('cat','dog','bird');
$toString = implode(' ', $animals);
$data = array('a');
function substr_count_array($haystack, $needle){
$initial = 0;
foreach ($needle as $substring) {
$initial += substr_count($haystack, $substring);
}
return $initial;
}
echo substr_count_array($toString, $data);
The problem is that if I search for a character such as 'a', it gets through the check and validates as a legit value because 'a' is contained within the first element. So the above outputs 1. I figured this was due to the foreach() but how do I bypass that? I want to search for a whole string match, not partial.

You can break up the $haystack into individual words, then do an in_array() check over it to make sure the word exists in that array as a whole word before doing your substr_count():
$animals = array('cat','dog','bird', 'cat', 'dog', 'bird', 'bird', 'hello');
$toString = implode(' ', $animals);
$data = array('cat');
function substr_count_array($haystack, $needle){
$initial = 0;
$bits_of_haystack = explode(' ', $haystack);
foreach ($needle as $substring) {
if(!in_array($substring, $bits_of_haystack))
continue; // skip this needle if it doesn't exist as a whole word
$initial += substr_count($haystack, $substring);
}
return $initial;
}
echo substr_count_array($toString, $data);
Here, cat is 2, dog is 2, bird is 3, hello is 1 and lion is 0.
Edit: here's another alternative using array_keys() with the search parameter set to the $needle:
function substr_count_array($haystack, $needle){
$bits_of_haystack = explode(' ', $haystack);
return count(array_keys($bits_of_haystack, $needle[0]));
}
Of course, this approach requires a string as the needle. I'm not 100% sure why you need to use an array as the needle, but perhaps you could do a loop outside the function and call it for each needle if you need to - just another option anyway!

Just throwing my solution in the ring here; the basic idea, as outlined by scrowler as well, is to break up the search subject into separate words so that you can compare whole words.
function substr_count_array($haystack, $needle)
{
$substrings = explode(' ', $haystack);
return array_reduce($substrings, function($total, $current) use ($needle) {
return $total + count(array_keys($needle, $current, true));
}, 0);
}
The array_reduce() step is basically this:
$total = 0;
foreach ($substrings as $substring) {
$total = $total + count(array_keys($needle, $substring, true));
}
return $total;
The array_keys() expression returns the keys of $needle for which the value equals $substring. The size of that array is the number of occurrences.
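
Another way to express the same whole-word idea, sketched here under the assumption that words in the haystack are separated by single spaces, is to build a frequency table once with array_count_values() and then look each needle up in it. The function name countWholeWords is made up for illustration.

```php
<?php
// Hypothetical helper: total number of whole-word occurrences of all needles.
function countWholeWords(string $haystack, array $needles): int
{
    // word => number of times it occurs as a whole word
    $frequencies = array_count_values(explode(' ', $haystack));
    $total = 0;
    foreach ($needles as $needle) {
        $total += $frequencies[$needle] ?? 0; // a missing word counts as 0
    }
    return $total;
}

$haystack = 'cat dog bird cat dog bird bird hello';
```

This scans the haystack once regardless of how many needles there are, and a partial needle such as 'a' contributes nothing.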

"Unfolding" a String

I have a set of strings, each string has a variable number of segments separated by pipes (|), e.g.:
$string = 'abc|b|ac';
Each segment with more than one char should be expanded into all the possible one char combinations, for 3 segments the following "algorithm" works wonderfully:
$result = array();
$string = explode('|', 'abc|b|ac');
foreach (str_split($string[0]) as $i)
{
foreach (str_split($string[1]) as $j)
{
foreach (str_split($string[2]) as $k)
{
$result[] = implode('|', array($i, $j, $k)); // more...
}
}
}
print_r($result);
Output:
$result = array('a|b|a', 'a|b|c', 'b|b|a', 'b|b|c', 'c|b|a', 'c|b|c');
Obviously, for more than 3 segments the code starts to get extremely messy, since I need to add (and check) more and more inner loops. I tried coming up with a dynamic solution but I can't figure out how to generate the correct combination for all the segments (individually and as a whole). I also looked at some combinatorics source code but I'm unable to combine the different combinations of my segments.
I'd appreciate it if anyone could point me in the right direction.

Recursion to the rescue (you might need to tweak a bit to cover edge cases, but it works):
function explodinator($str) {
$segments = explode('|', $str);
$pieces = array_map('str_split', $segments);
return e_helper($pieces);
}
function e_helper($pieces) {
if (count($pieces) == 1)
return $pieces[0];
$first = array_shift($pieces);
$subs = e_helper($pieces);
$result = array(); // initialise so an empty segment still returns an array
foreach($first as $char) {
foreach ($subs as $sub) {
$result[] = $char . '|' . $sub;
}
}
return $result;
}
print_r(explodinator('abc|b|ac'));
Outputs:
Array
(
[0] => a|b|a
[1] => a|b|c
[2] => b|b|a
[3] => b|b|c
[4] => c|b|a
[5] => c|b|c
)
As seen on ideone.

This looks like a job for recursive programming! :P
I first looked at this and thought it was going to be a one-liner (and it probably is in Perl).
There are other non-recursive ways (enumerate all combinations of indexes into segments then loop through, for example) but I think this is more interesting, and probably 'better'.
$str = explode('|', 'abc|b|ac');
$strlen = count( $str );
$results = array();
function splitAndForeach( $bchar , $oldindex, $tempthread) {
global $strlen, $str, $results;
$temp = $tempthread;
$newindex = $oldindex + 1;
if ( $bchar != '') { array_push($temp, $bchar ); }
if ( $newindex <= $strlen ){
print "starting foreach loop on string '".$str[$newindex-1]."' \n";
foreach(str_split( $str[$newindex - 1] ) as $c) {
print "Going into next depth ($newindex) of recursion on char $c \n";
splitAndForeach( $c , $newindex, $temp);
}
} else {
$found = implode('|', $temp);
print "Array length (max recursion depth) reached, result: $found \n";
array_push( $results, $found );
$temp = $tempthread;
$index = 0;
print "***************** Reset index to 0 *****************\n\n";
}
}
splitAndForeach('', 0, array() );
print "your results: \n";
print_r($results);

You could have two arrays: the alternatives and a current counter.
$alternatives = array(array('a', 'b', 'c'), array('b'), array('a', 'c'));
$counter = array(0, 0, 0);
Then, in a loop, you increment the "last digit" of the counter, and if that is equal to the number of alternatives for that position, you reset that "digit" to zero and increment the "digit" to its left. This works just like counting with decimal numbers.
The string for each step is built by concatenating the $alternatives[$i][$counter[$i]] for each digit.
You are finished when the "first digit" becomes as large as the number of alternatives for that digit.
Example: for the above variables, the counter would get the following values in the steps:
0,0,0
0,0,1
1,0,0 (overflow in the last two digits)
1,0,1
2,0,0 (overflow in the last two digits)
2,0,1
3,0,0 (finished, since the first "digit" has only 3 alternatives)
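
The counter idea can be sketched as follows. The function name unfoldWithCounter is made up for illustration, and the sketch assumes every segment contains at least one character. Instead of letting the first "digit" reach 3, it stops as soon as the carry runs off the left end, which amounts to the same condition.

```php
<?php
// Hypothetical helper: expand 'abc|b|ac' into all one-char-per-segment
// combinations by counting like an odometer with a different base per digit.
function unfoldWithCounter(string $input): array
{
    $alternatives = array_map('str_split', explode('|', $input));
    $counter = array_fill(0, count($alternatives), 0);
    $results = [];
    while (true) {
        // Build the string for the current counter state
        $parts = [];
        foreach ($counter as $i => $digit) {
            $parts[] = $alternatives[$i][$digit];
        }
        $results[] = implode('|', $parts);
        // Increment the last digit, carrying to the left on overflow
        $pos = count($alternatives) - 1;
        while ($pos >= 0) {
            $counter[$pos]++;
            if ($counter[$pos] < count($alternatives[$pos])) {
                break; // no overflow at this position
            }
            $counter[$pos] = 0;
            $pos--;
        }
        if ($pos < 0) {
            break; // the first digit overflowed: every combination was produced
        }
    }
    return $results;
}
```

No recursion is involved, so the number of segments only affects the length of the counter array.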

Affinity between a text and a list of keywords?

I have a portion of text (500-1500 chars).
And I have a list of keywords (1000 records).
What should I do to find the keywords from that list that are related to my given text?
I was thinking of searching the text for occurrences of every keyword in the list, but that seems a bit "expensive", I think.
Thanks

If the keywords always stay the same you could create an index over them which improves search speed (tremendously). The standard data structure to handle this is the trie but a much better (!) alternative is the Aho-Corasick automaton or another multi-pattern search algorithm such as multi-pattern Horspool (also known as Wu-Manber algorithm).
Finally, a very simple alternative is to concatenate all your keywords with pipes (|) and use the result as a regular expression. Technically, this approaches the Aho-Corasick automaton and is much simpler for you to implement.
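
The regular-expression alternative could look roughly like this. It is a sketch only: the function name findKeywords is made up, matching is case-insensitive, and keywords are assumed to start and end with word characters so that \b works as a word boundary.

```php
<?php
// Hypothetical helper: return keyword => number of whole-word matches in $text.
function findKeywords(string $text, array $keywords): array
{
    if ($keywords === []) {
        return [];
    }
    // Try longer keywords first so 'cat food' is not shadowed by 'cat'
    usort($keywords, fn($a, $b) => strlen($b) - strlen($a));
    // Escape each keyword, then join them with pipes into one alternation
    $escaped = array_map(fn($k) => preg_quote($k, '/'), $keywords);
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';
    preg_match_all($pattern, $text, $matches);
    return array_count_values(array_map('strtolower', $matches[0]));
}
```

Keywords that never occur are simply absent from the result. With around 1000 keywords the pattern gets long, so it is worth measuring before relying on it.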

I throw my hat in the ring …
function extractWords($text, $minWordLength = null, array $stopwords = array(), $caseIgnore = true)
{
$pattern = '/\w'. (is_null($minWordLength) ? '+' : '{'.$minWordLength.',}') .'/';
$matches = array();
preg_match_all($pattern, $text, $matches);
$words = $matches[0];
if ($caseIgnore) {
$words = array_map('strtolower', $words);
$stopwords = array_map('strtolower', $stopwords); // same variable name as the parameter; PHP variables are case-sensitive
}
$words = array_diff($words, $stopwords);
return $words;
}
function countKeywords(array $words, array $keywords, $threshold = null, $caseIgnore = true)
{
if ($caseIgnore) {
$keywords = array_map('strtolower', $keywords);
}
$words = array_intersect($words, $keywords);
$counts = array_count_values($words);
arsort($counts, SORT_NUMERIC);
if (!is_null($threshold)) {
$counts = array_filter($counts, function ($count) use ($threshold) { return $count >= $threshold; });
}
return $counts;
}
Usage:
$text = 'a b c a'; // your text
$keywords = array('a', 'b'); // keywords from your database
$words = extractWords($text);
$count = countKeywords($words, $keywords);
print_r($count);
$total = array_sum($count);
var_dump($total);
$affinity = ($total == 0 ? 0 : 1 / (count($words) / $total));
var_dump($affinity);
Prints
Array
(
[a] => 2
[b] => 1
)
int(3)
float(0.75)
