Finding the 3 most occurring substrings in a string in PHP


I wanted to go over my thought process here, since I am not sure how to improve it. I have a string of substrings separated by commas, some of which recur, and I want to find the 3 most occurring substrings.
I was going to explode the string by commas into an array.
Perform a substr_count on the original string for each element in the array and store the counts in a separate array? (I am not sure how to improve this, since it would create duplicate counts for the same substring.)
Perform a max on the counts array to find the first, second, and third most occurring substrings.
Return an array with the first, second, and third most occurring substrings.
I am guessing that after I perform the explode, I can do a quick sort and go from there?
This is what I have tried so far:
$result = findThreeMostOccuringStrings("apple, apple, berry, cherry, cherry, cherry, dog, dog, dog");
var_dump($result);
function findThreeMostOccuringStrings($str){
$first = PHP_INT_MIN;
$second = PHP_INT_MIN;
$third = PHP_INT_MIN;
$arr = explode(",", $str);
for ($i = 0; $i < count($str); $i++){
$arrIdx[] = substr_count($arr[$i]);
}
$first = max($arrIdx);
$arrIdx[$first] = -1;
$second = max($arrIdx);
$arrIdx[$first] = -1;
$third = max($arrIdx);
$arrIdx[$first] = -1;
$threeMostOccuringStrings = array($first, $second, $third);
return $threeMostOccuringStrings;
}

If by substring you mean only the strings separated by commas and not substrings of these, use array_count_values after explode
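A minimal sketch of that suggestion, assuming the terms are separated by commas and may carry surrounding spaces. The function name topThreeTerms is hypothetical; explode, array_count_values, arsort and array_slice are standard PHP functions. The relative order of terms with equal counts is not something this sketch tries to control.

```php
<?php
// Hypothetical helper: tally comma-separated terms and keep the three most frequent.
function topThreeTerms(string $input): array
{
    // Split on commas and strip the whitespace around each term.
    $terms = array_map('trim', explode(',', $input));
    // Map each distinct term to the number of times it occurs.
    $counts = array_count_values($terms);
    // Sort by count, highest first, keeping the term => count association.
    arsort($counts);
    // Keep the first three entries; preserve_keys = true also keeps numeric-looking terms as keys.
    return array_slice($counts, 0, 3, true);
}

print_r(topThreeTerms("apple, apple, berry, cherry, cherry, cherry, dog, dog, dog"));
```

The result maps each of the three most frequent terms to its count, so array_keys() on it gives just the terms.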

If you're looking for an efficient way to solve substring search, the answer is a trie, or prefix tree. It reduces the lookup time for a substring because it creates a deterministic path for every prefix.
Consider the string "A cat does not categorize its food while among other cats." Here the words cat, categorize and cats all share the same prefix, cat. So finding the most frequent prefix in the tree is easy: count the number of EOW (end-of-word) nodes stemming from each branch of the root.
Building the tree is also quite simple.
function insertString($str, $trie) {
$str = strtolower($str); // normalize case
$node = $trie;
foreach(str_split($str) as $letter) {
if (!isset($node->$letter)) {
$node->$letter = new stdClass;
}
$node = $node->$letter;
}
$node->EOW = true; // place EOW (end-of-word) node
}
function countNodes($branch) {
$n = 0;
foreach($branch as $e => $node) {
if ($node instanceof stdClass) {
$n += countNodes($node);
} elseif ($e === 'EOW') {
$n++;
}
}
return $n;
}
$trie = new stdClass;
$str = "A cat does not categorize its food while among other cats";
foreach(explode(' ', $str) as $word) {
insertString($word, $trie);
}
$s = [];
foreach($trie as $n => $rootNodes) {
$s[$n] = countNodes($rootNodes);
}
var_dump($s);
Which should give you...
array(8) {
["a"]=>
int(2)
["c"]=>
int(3)
["d"]=>
int(1)
["n"]=>
int(1)
["i"]=>
int(1)
["f"]=>
int(1)
["w"]=>
int(1)
["o"]=>
int(1)
}
From there you can see that the root branch c has the highest number of words (which, if walked, match cat, cats, and categorize).

Based on your post and the comments, you actually want to tally terms in a comma delimited list of terms, and then rank them, so: let's just do that, using an associative array for tallying and arsort to sort that associative array by value, in reverse order (so that the highest counts are at the start of the array):
function rank($input) {
$terms = explode(',', $input);
$ranked = array();
foreach($terms as $word) {
$word = trim($word);
if (!isset($ranked[$word])) {
$ranked[$word] = 0;
}
$ranked[$word]++;
}
arsort($ranked);
return $ranked;
}
So if we run that through print_r(rank("apple, apple, berry, cherry, cherry, cherry, dog, dog, dog")) we get:
Array
(
[dog] => 3
[cherry] => 3
[apple] => 2
[berry] => 1
)
Splendid.

Related

similar substring in other string PHP

How can I check substrings in PHP by prefix or postfix?
For example, I have a search string named $to_search as follows:
$to_search = "abcdef";
And three cases to check as substrings of $to_search:
$cases = ["abc def", "def", "deff", ... Other values ...];
Now I have to detect the first three cases using the substr() function.
How can I detect "abc def", "def", and "deff" as substrings of "abcdef" in PHP?

You might find the Levenshtein distance between the two words useful: it will have a value of 1 for abc def. However, your problem is not well defined; matching strings that are "similar" doesn't mean anything concrete.
Edit: If you set the deletion cost to 0, this very closely models the problem you are proposing. Just check that the Levenshtein distance is less than 1 for everything in the array.
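To make the deletion-cost idea concrete, here is a small sketch using PHP's built-in levenshtein(), whose optional arguments are the insertion, replacement and deletion costs, in that order. The variable names are illustrative only. With free deletions, "def" comes out at distance 0, while "abc def" and "deff" each still need one insertion and come out at 1, so the threshold you pick decides which cases count as matches.

```php
<?php
$to_search = "abcdef";
$cases = ["abc def", "def", "deff"];

$distances = [];
foreach ($cases as $case) {
    $distances[$case] = [
        // Default costs: insertion, replacement and deletion each cost 1.
        'default' => levenshtein($to_search, $case),
        // Deletion cost 0: dropping characters from $to_search is free.
        'free_deletions' => levenshtein($to_search, $case, 1, 1, 0),
    ];
}

print_r($distances);
```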

This will find if any of the strings inside $cases are a substring of $to_search.
foreach($cases as $someString){
if(strpos($to_search, $someString) !== false){
// $someString is found inside $to_search
}
}
Only "def" qualifies, though, since none of the other strings actually occurs inside $to_search.
Also, on a side note: the terms are prefix and suffix, not postfix.

To find any of the cases that either begin with or end with either the beginning or ending of the search string, I don't know of another way to do it than to just step through all of the possible beginning and ending combinations and check them. There's probably a better way to do this, but this should do it.
$to_search = "abcdef";
$cases = ["abc def", "def", "deff", "otherabc", "noabcmatch", "nodefmatch"];
$matches = array();
$len = strlen($to_search);
for ($i=1; $i <= $len; $i++) {
// get the beginning and end of the search string of length $i
$pre_post = array();
$pre_post[] = substr($to_search, 0, $i);
$pre_post[] = substr($to_search, -$i);
foreach ($cases as $case) {
// get the beginning and end of each case of length $i
$pre = substr($case, 0, $i);
$post = substr($case, -$i);
// check if any of them match
if (in_array($pre, $pre_post) || in_array($post, $pre_post)) {
// using the case as the array key for $matches will keep it distinct
$matches[$case] = true;
}
}
}
// use array_keys() to get the keys back to values
var_dump(array_keys($matches));

You can use array_filter function like this:
$cases = ["cake", "cakes", "flowers", "chocolate", "chocolates"];
$to_search = "chocolatecake";
$search = strtolower($to_search);
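$arr = array_filter($cases, function($val) use ($search) {
// lower-case the candidate, strip a trailing "s" and any spaces, then look for it in $search
$needle = str_replace(' ', '', preg_replace('/s$/', '', strtolower($val)));
return strpos($search, $needle) !== FALSE;
});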
print_r($arr);
Output:
Array
(
[0] => cake
[1] => cakes
[3] => chocolate
[4] => chocolates
)
As you can see, it prints all the values you expected apart from deff, which is not part of the search string abcdef, as I commented above.

PHP strpos() function comparision of string conflicts in numbers

I am using strpos() function to find the string in an array key members.
foreach($_POST as $k => $v){
// for all types of questions of chapter 1 only
$count = 0;
if(strpos($k, 'chap1') !== false){
$count++;
}
}
I know that it works only until the keys are (chap1e1, chap1m1, chap1h1) but when it comes to (chap10e1, chap10m1, chap10h1), my logic won't be working on those.
Isn't there any way, so that, I can distinguish the comparison between (chap1 & chap10)?
Or, Is there any alternative way of doing this? Please give me some ideas on it. Thank you!

Basically, preg_match would do just that:
$count = 0;
foreach($_POST as $k => $v)
{
if (preg_match('/\bchap1(?!\d)/', $k)) ++$count;
}
How the pattern works:
\b: a word boundary. It matches chap1 but not schap1, so chap1 can't be the tail of a longer word
chap1: matches a literal string (because it's preceded by \b, this literal can't be preceded by a word character, but it can be the beginning of a string, for example)
(?!\d): a negative lookahead that fails when the next character is a digit, so chap10 is not a match, but chap1e and a bare chap1 are
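A quick way to check the chap1 versus chap10 distinction is to run a pattern against a few sample keys. The sketch below is an illustration, not the asker's real data: the $post array stands in for $_POST, and the negative lookahead (?!\d) is one way of requiring that chap1 is not followed by another digit. Note that $count is initialised once, before the loop.

```php
<?php
// Sample keys standing in for $_POST; the values are placeholders.
$post = [
    'chap1e1'  => 'x',
    'chap1m1'  => 'x',
    'chap10e1' => 'x',
    'chap10h1' => 'x',
];

$count = 0; // initialised once, outside the loop
foreach ($post as $k => $v) {
    // (?!\d) is a negative lookahead: "chap1" must not be followed by a digit.
    if (preg_match('/\bchap1(?!\d)/', (string) $k)) {
        $count++;
    }
}

echo $count, PHP_EOL; // prints 2 (chap1e1 and chap1m1)
```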
To deal with all of these "chapters" at once, try this:
$count = array();
foreach($_POST as $k => $v)
{
if (preg_match('/\bchap(\d+)(.*)/', $k, $match))
{
$match[2] = $match[2] ? $match[2] : 'empty';//default value
if (!isset($count[$match[1]])) $count[$match[1]] = array();
$count[$match[1]][] = $match[2];
}
}
Now this pattern is a bit more complex, but not much:
\bchap: same as before, word boundary + literal
(\d+): match AND GROUP all digits (one or more; keys with no digits aren't accepted). We use this group as the key later on ($match[1])
(.*): match and group the rest of the string. If there's nothing there, that's OK. If you don't want to match keys like chap1, and require something after the digits, replace the * asterisk with a plus sign
Now, I've turned the $count variable into an array, that will look like this:
array('1' => array('e1', 'a3'),
'10'=> array('s3')
);
When $_POST looks something like this:
array(
chap1e1 => ?,
chap1a3 => ?,
chap10s3=> ?
)
What you do with the values is up to you. One thing you could do is to group the key-value pairs per "chapter"
$count = array();
foreach($_POST as $k => $v)
{
if (preg_match('/\bchap(\d+)/', $k, $match))
{
if (!isset($count[$match[1]])) $count[$match[1]] = array();
$count[$match[1]][$k] = $v;//$match[1] == chap -> array with full post key->value pairs
}
}
Note that, if this is a viable solution for you, it's not a bad idea to simplify the expression (because regex's are best kept simple), and just omit the (.*) at the end.
With the code above, to get the count of any "chap\d" param, simply use:
echo 'count 1: ', isset($count[1]) ? count($count[1]) : 0, PHP_EOL;

You may need to tweak the regex, but this will give you a start:
if (preg_match("/chap[0-9]{1,3}/i", $v, $match)) {
$count++;
}

You can use:
$rest = substr($k, 0, 5);
and then compare your string like ($rest !== 'chap1')
I hope this works.

I have tested the below code on Execute PHP Online
<?php
$_POST['chap1e1'] = 'test1';
$_POST['chap10e1'] = 'test2';
foreach($_POST as $k => $v){
// for all types of questions of chapter 1 only
$count = 0;
var_dump($k);
var_dump(strpos($k, 'chap1'));
var_dump(strpos($k, 'chap1') !== false);
}
foreach($_POST as $k => $v){
// for all types of questions of chapter 1 only
//$count = 0;
var_dump($count);
if(strpos($k, 'chap1') !== false){
$count++;
}
}
echo $count;
?>
And I get the output below:
string(7) "chap1e1" int(0) bool(true) string(8) "chap10e1" int(0) bool(true) int(0) int(1) 2
This indicates that strpos is able to locate "chap1" in "chap10e1",
but the $count total is wrong because your code always resets $count to 0 inside the foreach loop.

"Unfolding" a String

I have a set of strings, each string has a variable number of segments separated by pipes (|), e.g.:
$string = 'abc|b|ac';
Each segment with more than one char should be expanded into all the possible one char combinations, for 3 segments the following "algorithm" works wonderfully:
$result = array();
$string = explode('|', 'abc|b|ac');
foreach (str_split($string[0]) as $i)
{
foreach (str_split($string[1]) as $j)
{
foreach (str_split($string[2]) as $k)
{
$result[] = implode('|', array($i, $j, $k)); // more...
}
}
}
print_r($result);
Output:
$result = array('a|b|a', 'a|b|c', 'b|b|a', 'b|b|c', 'c|b|a', 'c|b|c');
Obviously, for more than 3 segments the code starts to get extremely messy, since I need to add (and check) more and more inner loops. I tried coming up with a dynamic solution but I can't figure out how to generate the correct combination for all the segments (individually and as a whole). I also looked at some combinatorics source code but I'm unable to combine the different combinations of my segments.
I would appreciate it if anyone could point me in the right direction.

Recursion to the rescue (you might need to tweak a bit to cover edge cases, but it works):
function explodinator($str) {
$segments = explode('|', $str);
$pieces = array_map('str_split', $segments);
return e_helper($pieces);
}
function e_helper($pieces) {
if (count($pieces) == 1)
return $pieces[0];
$first = array_shift($pieces);
$subs = e_helper($pieces);
$result = array(); // initialise so an empty segment cannot leave $result undefined
foreach($first as $char) {
foreach ($subs as $sub) {
$result[] = $char . '|' . $sub;
}
}
return $result;
}
print_r(explodinator('abc|b|ac'));
Outputs:
Array
(
[0] => a|b|a
[1] => a|b|c
[2] => b|b|a
[3] => b|b|c
[4] => c|b|a
[5] => c|b|c
)
As seen on ideone.

This looks like a job for recursive programming! :P
I first looked at this and thought it was going to be a one-liner (and it probably is in Perl).
There are other non-recursive ways (enumerate all combinations of indexes into segments then loop through, for example) but I think this is more interesting, and probably 'better'.
$str = explode('|', 'abc|b|ac');
$strlen = count( $str );
$results = array();
function splitAndForeach( $bchar , $oldindex, $tempthread) {
global $strlen, $str, $results;
$temp = $tempthread;
$newindex = $oldindex + 1;
if ( $bchar != '') { array_push($temp, $bchar ); }
if ( $newindex <= $strlen ){
print "starting foreach loop on string '".$str[$newindex-1]."' \n";
foreach(str_split( $str[$newindex - 1] ) as $c) {
print "Going into next depth ($newindex) of recursion on char $c \n";
splitAndForeach( $c , $newindex, $temp);
}
} else {
$found = implode('|', $temp);
print "Array length (max recursion depth) reached, result: $found \n";
array_push( $results, $found );
$temp = $tempthread;
$index = 0;
print "***************** Reset index to 0 *****************\n\n";
}
}
splitAndForeach('', 0, array() );
print "your results: \n";
print_r($results);

You could have two arrays: the alternatives and a current counter.
$alternatives = array(array('a', 'b', 'c'), array('b'), array('a', 'c'));
$counter = array(0, 0, 0);
Then, in a loop, you increment the "last digit" of the counter, and if that is equal to the number of alternatives for that position, you reset that "digit" to zero and increment the "digit" left to it. This works just like counting with decimal numbers.
The string for each step is built by concatenating the $alternatives[$i][$counter[$i]] for each digit.
You are finished when the "first digit" becomes as large as the number of alternatives for that digit.
Example: for the above variables, the counter would get the following values in the steps:
0,0,0
0,0,1
1,0,0 (overflow in the last two digits)
1,0,1
2,0,0 (overflow in the last two digits)
2,0,1
3,0,0 (finished, since the first "digit" has only 3 alternatives)
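The counter idea above can be sketched as follows. This is only one possible reading of the description: the function name unfoldIterative is hypothetical, every segment is assumed to be non-empty, and the loop stops when the carry runs past the first "digit", which is equivalent to the first digit reaching its number of alternatives.

```php
<?php
// Hypothetical helper: iterative "odometer" enumeration of all combinations.
// Assumes every pipe-separated segment contains at least one character.
function unfoldIterative(string $input): array
{
    // 'abc|b|ac' becomes [['a','b','c'], ['b'], ['a','c']].
    $alternatives = array_map('str_split', explode('|', $input));
    $n = count($alternatives);
    $counter = array_fill(0, $n, 0);
    $results = [];

    while (true) {
        // Build the string for the current counter state.
        $parts = [];
        foreach ($counter as $i => $digit) {
            $parts[] = $alternatives[$i][$digit];
        }
        $results[] = implode('|', $parts);

        // Increment the last digit and carry to the left on overflow.
        $pos = $n - 1;
        while ($pos >= 0) {
            $counter[$pos]++;
            if ($counter[$pos] < count($alternatives[$pos])) {
                break; // no overflow at this position
            }
            $counter[$pos] = 0;
            $pos--;
        }
        if ($pos < 0) {
            break; // the first digit overflowed: every combination has been produced
        }
    }

    return $results;
}

print_r(unfoldIterative('abc|b|ac'));
```

Because nothing is nested per segment, the same loop handles any number of segments without extra code.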

Random But Unique Pairings, with Conditions

I need some help/direction in setting up a PHP script to randomly pair up items in an array.
The items should be randomly paired up each time.
The items should not match themselves ( item1-1 should not pair up with item1-1 )
Most of the items have a mate (ie. item1-1 and item1-2). The items should not be paired with their mate.
I've been playing around with the second script in this post but, I haven't been able to make any progress. Any help is appreciated.

Very simple approach, but hopefully helpful to you:
(mates, if grouped in an array (e.g. array('a1', 'a2')), will not be paired.)
function matchUp($array) {
$result = array();
while($el = array_pop($array)) {
shuffle($array);
if (sizeof($array) > 0) {
$candidate = array_pop($array);
$result[] = array(
array_pop($el),
array_pop($candidate)
);
if (sizeof($el) > 0) {
$array[] = $el;
}
if (sizeof($candidate) > 0) {
$array[] = $candidate;
}
}
else {
$result[] = array(array_pop($el));
}
}
return $result;
}
$array = array(
array('a1', 'a2'),
array('b1', 'b2'),
array('c1'),
array('d1'),
array('e1', 'e2'),
array('f1'),
array('g1', 'g2'),
);
Update:
foreach(matchUp($array) as $pair) {
list($a, $b) = $pair + array(null, null);
echo '<div style="border: solid 1px #000000;">' . $a . ' + ' . $b . '</div>';
}

With the randomness, there is no guarantee that a full correct solution will be reached.
Certain problem sets are more likely to be solved than others. Some will be impossible.
You can configure how many times it will try to achieve a good solution. After the specified number of tries it will return the best solution it could find.
function pairUp (array $subjectArray) {
// Config options
$tries = 50;
// Variables
$bestPaired = array();
$bestUnpaired = array();
for($try = 1; $try <= $tries; $try++) {
$paired = array();
$unpaired = array();
$toBePaired = $subjectArray;
foreach($subjectArray as $subjectIndex => $subjectValue) {
// Create array without $subjectValue anywhere, from the unpaired items
$cleanArray = array();
foreach($toBePaired as $index => $value) {
if($value != $subjectValue) {
array_push($cleanArray, array(
'index' => $index,
'value' => $value
));
}
}
sort($cleanArray); // reset indexes in array
// See if we have any different values left to match
if(count($cleanArray) == 0) {
array_push($unpaired, $subjectValue);
continue;
}
// Get a random item from the clean array
$randomIndex = rand(0,count($cleanArray)-1);
// Store this pair
$paired[$subjectIndex] = $subjectValue . '-' . $cleanArray[$randomIndex]['value'];
// This item has been paired, remove it from unpairedItems
unset($toBePaired[$cleanArray[$randomIndex]['index']]);
sort($toBePaired);
}
// Decide if this is our best try
if(count($paired) > count($bestPaired)) {
$bestPaired = $paired;
$bestUnpaired = $unpaired;
}
// If we had no failures, this was a perfect try - finish
if(count($unpaired) == 0) { break; }
}
// We're done, send our array of pairs back.
return array(
'paired' => $bestPaired,
'unpaired' => $bestUnpaired
);
}
var_dump(pairUp(array('a','b','c','d','e','a','b','c','d','e')));
/*
Example output:
array(2) {
["paired"]=>
array(10) {
[0]=>
string(3) "a-b"
[1]=>
string(3) "b-c"
[2]=>
string(3) "c-d"
[3]=>
string(3) "d-e"
[4]=>
string(3) "e-a"
[5]=>
string(3) "a-b"
[6]=>
string(3) "b-e"
[7]=>
string(3) "c-d"
[8]=>
string(3) "d-c"
[9]=>
string(3) "e-a"
}
["unpaired"]=>
array(0) {
}
}
*/

Case 1: if all elements had a mate
If all elements had a mate, the following solution would work, although I don't know if it would be perfectly random (as in, all possible outputs having the same probability):
Shuffle the list of elements, keeping mates together
original list = (a1,a2),(b1,b2),(c1,c2),(d1,d2)
shuffled = (c1,c2),(d1,d2),(a1,a2),(b1,b2)
Shift the second mate to the right. The matches have been formed.
shifted = (c1,b2),(d1,c2),(a1,d2),(b1,a2)
(Edit1: if applied exactly as described, there is no way a1 ends up matched with b1. So, before shifting, you may want to throw a coin for each pair of mates to decide whether they should change their order or not.)
Case 2: if only some elements have a mate
Since in your question only some elements will have a mate, I guess one could come up with the following:
Arbitrarily pair up those elements who don't have a mate. There should be an even number of such elements. Otherwise, the total number of elements would be odd, so no matching could be done in the first place.
original list = (a1,a2),(b1,b2),c1,d1,e1,f1 // c1,d1,e1 and f1 don't have mates
list2 = (a1,a2),(b1,b2),(c1,d1),(e1,f1) // pair them up
Shuffle and shift as in case 1 to form the matches.
shuffled = (e1,f1),(a1,a2),(c1,d1),(b1,b2)
shifted = (e1,b2),(a1,f1),(c1,a2),(b1,d1)
Again, I don't know if this is perfectly random, but I think it should work.
(Edit2: simplified the solution)
(Edit3: if the total number of elements is odd, someone will be left without a match, so pick an element randomly at the beginning to leave it out and then apply the algorithm above).
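The shuffle-and-shift steps can be sketched as below. This is a rough illustration under stated assumptions: every group holds exactly two elements (real mates, or the arbitrary pairs formed in case 2), there are at least two groups, and the function name shuffleAndShift is hypothetical. It includes the coin toss from Edit1 and makes no claim about the uniformity of the resulting distribution.

```php
<?php
// Hypothetical helper: shuffle the groups, then match each group's first
// element with the second element of the previous group (wrapping around).
// Assumes at least two groups, each with exactly two elements.
function shuffleAndShift(array $groups): array
{
    shuffle($groups); // random group order, mates stay together

    // Coin toss per group so either mate can end up in the first slot.
    foreach ($groups as $i => $group) {
        if (mt_rand(0, 1) === 1) {
            $groups[$i] = array_reverse($group);
        }
    }

    $n = count($groups);
    $matches = [];
    foreach ($groups as $i => $group) {
        // "Shift the second mate to the right": take it from the group on the left.
        $left = $groups[($i - 1 + $n) % $n];
        $matches[] = [$group[0], $left[1]];
    }

    return $matches;
}

$groups = [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2'], ['d1', 'd2']];
print_r(shuffleAndShift($groups));
```

Since each match takes its two elements from two different groups, no element is paired with itself or with its mate, and every element is used exactly once.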

Convert and reconvert a version to number to store in database

Is there any algorithm to convert a string like 1.0.0 to a sortable number in PHP?
It should be possible to convert the number back to the same string. Simply removing the dots is not an option. Also, the length of the version is unknown, for example 1.0.0, 11.222.0, 0.8.1526

If you just want to sort versions, there is no need to convert.
<?php
$versions = array('1.0.0', '11.222.0', '0.8.1256');
usort($versions, 'version_compare');
var_dump($versions);
Output:
array(3) {
[0]=>
string(8) "0.8.1256"
[1]=>
string(5) "1.0.0"
[2]=>
string(8) "11.222.0"
}

If you want to compare version numbers, you could just use the version_compare() function.
And if you have an array of versions that you need to sort, you could use a function such as usort() / uasort(), with a callback based on version_compare().

If you insist on an arbitrary length, there is no way to map the versions uniquely to numbers while also preserving their order. Maybe you just want to sort the version numbers without conversion (see other answers)?

If you expect version segmentation with numbers like 12345 (eg. 0.9.12345.2), then you may be best off exploding the string and storing each segment in separate field in SQL.
That way you can sort it how ever you wish.

One option would be using explode:
function cmp($a, $b)
{
$a = explode('.', $a);
$b = explode('.', $b);
$m = min(count($a), count($b));
for ($i = 0; $i < $m; $i++) {
if (intval($a[$i]) < intval($b[$i]))
return -1;
if (intval($a[$i]) > intval($b[$i]))
return 1;
// equal segments: fall through and compare the next segment
}
return 0;
}
EDIT: Didn't know about version_compare, that might be a better option if it works as you need.

Here are a couple of functions that convert version to string and vice-versa.
So you can store the strings in your database and be able to sort them. I've used a length of 5 characters per segment, but you can adapt that to your needs.
function version_to_str($version) {
$list = explode('.', $version);
$str = '';
foreach ($list as $element) {
$str .= sprintf('%05d', $element);
}
return $str;
}
function str_to_version($str) {
$version = array();
for ($i=0; $i<strlen($str); $i+=5) {
$version[] = intval(substr($str, $i, 5));
}
return implode('.', $version);
}
$versions = array('1.0.0', '11.222.0', '0.8.1526');
$versions = array_map("version_to_str", $versions);
sort($versions);
$versions = array_map("str_to_version", $versions);
print_r($versions);
output:
Array
(
[0] => 0.8.1526
[1] => 1.0.0
[2] => 11.222.0
)
