Get the two most frequent words within several strings - php

I have a list of phrases and I want to know which two words occurred the most often in all of my phrases.
I tried playing with regex and other codes and I just cannot find the right way to do this.
Can anyone help?
eg:
I am purchasing a wallet
a wallet for 20$
purchasing a bag
I'd know that
a wallet occurred 2 times
purchasing a occurred 2 times

<?php
$string = "I am purchasing a wallet a wallet for 20$ purchasing a bag";
//split string into words
$words = explode(' ', $string);
//make chunks block ie [0,1][2,3]...
$chunks = array_chunk($words, 2);
//remove first array element
unset($words[0]);
//make chunks block ie [0,1][2,3]...
//but since first element is removed , the real block will be [1,2][3,4]...
$alternateChunks = array_chunk($words, 2);
//merge both chunks
$totalChunks = array_merge($chunks,$alternateChunks);
$finalChunks = array();
foreach($totalChunks as $t)
{
//join each inner chunk into a phrase using +
//+ can be replaced with a space, if needed
//+ is used instead of white space so the phrases work as associative keys
$finalChunks[] = implode('+', $t);
}
//count the words inside array
$result = array_count_values($finalChunks);
echo "<pre>";
print_r($result);

I hesitate to suggest this, as it's an extremely brute force way to go about it:
Take your string of words, explode it using the explode(" ", $string); command, then run it through a for loop checking every two word combination against every two words in the string.
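$string = "I am purchasing a wallet a wallet for 20$ purchasing a bag";
$words = explode(" ", $string);
$count = array();
// the last word has no successor, so there are count($words) - 1 pairs
$pairs = count($words) - 1;
for ($t = 0; $t < $pairs; $t++)
{
    for ($i = 0; $i < $pairs; $i++)
    {
        $pair = $words[$i] . " " . $words[$i+1];
        if (($words[$t] . " " . $words[$t+1]) == $pair) {
            // initialise the key on first use to avoid a notice
            $count[$pair] = isset($count[$pair]) ? $count[$pair] + 1 : 1;
        }
    }
}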
So the nested for loop steps in, grabs the first two words, compares them to every other pair of consecutive words, then grabs the next two words and does it again. Every pair will have a count of at least 1 (it will always match itself), but sorting the resulting array by count will give you the most repeated pairs.
Note that this will run (n-1)*(n-1) iterations, which could get unwieldy FAST.

Place them all into an array, and access them by the current word index and next word index.
I think this should do the trick. It will grab pairs of words, unless you are at the end of the string, where you'll get only one word.
$str = "I purchased a wallet because I wanted a wallet a wallet a wallet";
$words = explode(" ", $str);
$array_results = array();
for ($i = 0; $i<count($words); $i++) {
if ($i < count($words)-1) {
$pair = $words[$i] . " " . $words[$i+1]; echo $pair . "\n";
// Have to check if the key is in use yet to avoid a notice
$array_results[$pair] = isset($array_results[$pair]) ? $array_results[$pair] + 1 : 1;
}
// At the end of the array, just use a single word
else $array_results[$words[$i]] = isset($array_results[$words[$i]]) ? $array_results[$words[$i]] + 1 : 1;
}
// Sort the results
// use arsort() instead to get the highest first
asort($array_results);
// Prints:
Array
(
[I wanted] => 1
[wanted a] => 1
[wallet] => 1
[because I] => 1
[wallet because] => 1
[I purchased] => 1
[purchased a] => 1
[wallet a] => 2
[a wallet] => 4
)
Update changed ++ to +1 above since it wasn't working when tested...

Try to put it with explode into an array and count the values with array_count_values.
<?php
$text = "whatever";
$text_array = explode( ' ', $text);
$double_words = array();
for($c = 1; $c < count($text_array); $c++)
{
$double_words[] = $text_array[$c -1] . ' ' . $text_array[$c];
}
$result = array_count_values($double_words);
?>
I updated it now to two word version. Does this work for you?
array(9) {
["I am"]=> int(1)
["am purchasing"]=> int(1)
["purchasing a"]=> int(2)
["a wallet"]=> int(2)
["wallet a"]=> int(1)
["wallet for"]=> int(1)
["for 20$"]=> int(1)
["20$ purchasing"]=> int(1)
["a bag"]=> int(1)
}
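Note that joining all phrases into one string also counts pairs that span a phrase boundary, such as "wallet a" above. A minimal sketch of the same explode-and-count idea applied per phrase, assuming the phrases are available as an array (the helper name countWordPairs is illustrative, not from the original answers):

```php
<?php
// Hypothetical helper: count consecutive word pairs inside each phrase,
// so that no pair crosses from one phrase into the next.
function countWordPairs(array $phrases): array
{
    $pairs = [];
    foreach ($phrases as $phrase) {
        // Split on any run of whitespace and drop empty pieces
        $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
        for ($i = 1; $i < count($words); $i++) {
            $pairs[] = $words[$i - 1] . ' ' . $words[$i];
        }
    }
    $counts = array_count_values($pairs);
    arsort($counts); // most frequent pairs first
    return $counts;
}

$counts = countWordPairs([
    'I am purchasing a wallet',
    'a wallet for 20$',
    'purchasing a bag',
]);
// Show only the two most frequent pairs
print_r(array_slice($counts, 0, 2, true));
```

For the three sample phrases, "a wallet" and "purchasing a" each end up with a count of 2, and "wallet a" no longer appears.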

Since you used the excel tag, I thought I'd give it a shot, and it's actually really easy.
1. Split string using space as delimiter. Data > Text to Columns... > Delimited > Delimiter: Space. Each word is now in its own cell.
2. Transpose the result (not strictly required but much easier to visualize). Copy, Edit > Paste Special... > Transpose.
3. Make cells containing consecutive word pairs. So if your words are in cells B5:B15, cell C5 should be =B5&" "&B6 (and drag down).
4. Count occurrences of each word pair: In cell D5, =COUNTIF($C$5:$C$15,"="&C5), drag down.
5. Highlight the winner(s). Select C5:D15, Format > Conditional Formatting... > Formula Is =$D5=MAX($D$5:$D$15) and choose e.g. a yellow background.
Note that there is some inefficiency in step 4 because the count of each word pair will be calculated multiple times if that word pair occurs multiple times. If this is a concern, then you can first make a list of unique word pairs using Data > Filter > Advanced Filter... > Unique records only.
An automated VBA solution could easily be crafted by recording a macro of the above followed by some minor editing.

One way to go about it is to use SPLIT or a regex to split the sentences into words and store each into an array. Then take the array and create a dictionary object. When you add a term to the dictionary, if it's already there, add 1 to the .value to tally the count.
Here is some example code (far from perfect, as it's just to show the underlying concept) that will take all the strings in column A and generate a word frequency list in columns B and C. It's not exactly what you want, but I hope it gives you some ideas on how you can go about doing it:
Sub FrequencyList()
Dim vArray As Variant
Dim myDict As Variant
Set myDict = CreateObject("Scripting.Dictionary")
Dim i As Long
Dim cell As range
With myDict
For Each cell In range("A1", cells(Rows.count, "A").End(xlUp))
vArray = Split(cell.Value, " ")
For i = LBound(vArray) To UBound(vArray)
If Not .exists(vArray(i)) Then
.Add vArray(i), 1
Else
.Item(vArray(i)) = .Item(vArray(i)) + 1
End If
Next
Next
range("B1").Resize(.count).Value = Application.Transpose(.keys)
range("C1").Resize(.count).Value = Application.Transpose(.items)
End With
End Sub

Related

I want to count the duplicate words from an uploaded file in PHP. How can I do it?

I want to count the duplicate words from an uploaded file in PHP, how do I perform this task?
Assuming we are working with a text file, this is a relatively simple task:
<?php
// Get the contents of the file
$contents = "Duplicate duplicate duplicate three times three is twenty six thousand three hundred and fifty one. There are fifty nine people in the universe. Times that by nine and divide it by three, then negate a million and you can calculate the IQ of donald trump. This is pure waffle and I can repeat this nine times but shall restrain and instead talk like ollie and use big words and sound very cool. Funny thing is I've typed n instead of and so many times because it's a habit n I like eating rabbit meat as it is very succulent and juicy. Duplicate words shall be found!";
// Split the contents into an array of individual words
$words = explode(' ', $contents);
// Define arrays to track occurrences and duplicates
$occurrences = [];
$duplicates = [];
// Iterate through each word in the sample
foreach ($words as $word) {
// Convert word to lower case (case-insensitivity)
$word = strtolower($word);
// Increment the current occurrence count of current word
$occurrences[$word] = isset($occurrences[$word]) ? $occurrences[$word] + 1 : 1;
// If the word has occurred more than once, add it to our duplicates
// Remove the in_array call if you wish to count each instance of a duplicated word instead of once per duplicate
if ($occurrences[$word] > 1 && !in_array($word, $duplicates)) {
$duplicates[] = $word;
}
}
// Output the duplicates in a comma separated format
echo "Duplicates in file: " . join(", ", $duplicates);
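The example above works on a hard-coded string and splits only on spaces, so "one." and "one" count as different words. A minimal sketch of a variant that ignores punctuation, assuming plain ASCII text; the helper name duplicateWords and the form field name "upload" are assumptions for illustration:

```php
<?php
// Hypothetical helper: return every word that occurs more than once,
// mapped to the number of times it occurs.
function duplicateWords(string $contents): array
{
    // With format 1, str_word_count() returns the words as an array
    // and skips punctuation such as "." and "!"
    $words = str_word_count(strtolower($contents), 1);
    $counts = array_count_values($words);
    // Keep only the words seen at least twice
    return array_filter($counts, function ($n) {
        return $n > 1;
    });
}

// Assumption: the upload form field is named "upload".
// $contents = file_get_contents($_FILES['upload']['tmp_name']);
$contents = "Duplicate duplicate words. Words shall be found!";
print_r(duplicateWords($contents));
```

For the sample string this reports "duplicate" and "words", each with a count of 2.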

PHP - get total number of array items with a specific number sequence in value

So, basically I'm trying to count the number of landline phone numbers in a list of both landlines and mobile phone numbers $mobile_list (071234567890,02039989435,0781...)
$mobile_array = explode(",",$mobile_list); // turn into an array
$landlines = array_count_values($mobile_array); // create count variable
echo $landlines["020..."]; // print the number of numbers
So, I get the basic count specific elements function, but I don't see where I can specify if an element 'starts with' or 'contains' a sequence. With the above you can only specify an exact phone number (obviously not useful).
Any help would be great!
I don't see any reason to first explode the string to an array, and then check each array item.
That is a complete waste of performance!
I suggest using preg_match_all and match with word boundary "020".
That means the "word" has to start with 020.
$mobile_list = "071234567890,02039989435,0781,020122,123020";
preg_match_all("/\b020\d+\b/", $mobile_list, $m);
var_dump($m);
echo count($m[0]); // 2
https://3v4l.org/ucSDm
The lightest and fastest method I have found is to explode on ",020".
The array that is returned has item 0 as undefined, meaning we don't know if it's a 020 number so I have to look at that manually.
$temp = explode(",020", $mobile_list);
$cnt = count($temp);
if(substr($temp[0],0,3) != "020") $cnt--;
echo $cnt;
A small scale test shows this as the fastest method.
https://3v4l.org/rD54d
You can use array_reduce() to count the occurrences of strings beginning with '020'
$mobile_list = "02039619491,07143502893,02088024526,07351261813,02095694897";
$mobile_array = explode(',', $mobile_list);
function landlineCount($carry, $item)
{
if (substr($item, 0, 3) === '020') {
return $carry += 1;
}
return $carry;
}
$count = array_reduce($mobile_array, 'landlineCount', 0); // start from 0 so an empty result is 0, not null
echo $count;
prints 3
I'm sure the OP has finished what they needed to do hours ago but for fun here is a faster way to count the landlines.
I hadn't spotted that the question's original code was exploding the string.
That isn't necessary: you can just count the substrings with substr_count(). This would miss the first number, which has no comma before it, so I check for that too with substr().
If you need the total count of all numbers you can just count the commas with substr_count() again and add one.
$count = substr($mobile_list, 0, 3) === '020' ? 1 : 0;
$count += substr_count($mobile_list, ",020");
$totalCount = substr_count($mobile_list, ",") + 1;
echo $count;
echo $totalCount;
Here is the benchmark, run 1000 times to get an average.
https://3v4l.org/Sma66
Use the array_filter() or preg_grep() functions to find all numbers that contain or start with a given number sequence.
Note: There is an easier and better solution in other answers that covers the request to find values that start with a given number sequence.
Because you mentioned "but I don't see where I can specify if an element 'starts with' or 'contains' a sequence", my code assumes that you want to find any occurrence of the sequence, not only at the start of each item.
$mobile_list = '02000, 02032435, 039002300, 00305600';
$mobile_array = explode(",",$mobile_list); // turn into an array
$landlines = array_count_values($mobile_array); // create count variable
$sequence = '020'; // the sequence to look for
function filter_phone_numbers($mobile_array, $sequence){
return array_filter($mobile_array, function ($item) use ($sequence) {
if (stripos($item, $sequence) !== false) {
return true;
}
return false;
});
}
$filtered_items = array_unique (filter_phone_numbers($mobile_array, $sequence)); //use array_unique in case we find same number that both contains or starts with sequence
echo count($filtered_items);
Or with preg_grep():
$mobile_list = '02000, 02032435, 039002300, 00305600';
$mobile_array = explode(",",$mobile_list); // turn into an array
$landlines = array_count_values($mobile_array); // create count variable
$sequence = preg_quote('020', '~'); // the sequence to look for, escaped for use in the pattern
function grep_phone_numbers($mobile_array, $sequence){
return preg_grep('~' . $sequence . '~', $mobile_array);
}
//use array_unique in case we find same number that both contains or starts with sequence
$filtered_items = array_unique(grep_phone_numbers($mobile_array, $sequence));
echo count($filtered_items);
I recommend doing this with the database. The database is designed to manage data and can do it a lot more efficiently than PHP can. You can simply put it into a query and get the result you want in one go:
SELECT * FROM phone_numbers WHERE number LIKE '020%'
If you get the data from the database anyway, that LIKE adds a little time to the query, but less than it takes PHP to loop, strpos and store the results. Also, as you return a smaller dataset, fewer resources are used.

Can the for loop be eliminated from this piece of PHP code?

I have a range of whole numbers that might or might not have some numbers missing. Is it possible to find the smallest missing number without using a loop structure? If there are no missing numbers, the function should return the maximum value of the range plus one.
This is how I solved it using a for loop:
$range = [0,1,2,3,4,6,7];
// sort just in case the range is not in order
asort($range);
$range = array_values($range);
$first = true;
for ($x = 0; $x < count($range); $x++)
{
// don't check the first element
if ( ! $first )
{
if ( $range[$x - 1] + 1 !== $range[$x])
{
echo $range[$x - 1] + 1;
break;
}
}
// if we're on the last element, there are no missing numbers
if ($x + 1 === count($range))
{
echo $range[$x] + 1;
}
$first = false;
}
Ideally, I'd like to avoid looping completely, as the range can be massive. Any suggestions?
Algo solution
There is a way to check if there is a missing number using an algorithm. It's explained here. Basically, if we need to add the numbers from 1 to 100, we don't need to sum them one by one; we just need to compute (100 * (100 + 1)) / 2. So how is this going to solve our issue?
We're going to get the first element of the array and the last one, and calculate the expected sum with this formula. We then use array_sum() to calculate the actual sum. If the results are the same, then there is no missing number. Otherwise we can "backtrack" the missing number by subtracting the actual sum from the expected one. This of course only works if exactly one number is missing and will fail if several are missing. So let's put this in code:
$range = range(0,7); // Creating an array
echo check($range) . "\r\n"; // check
unset($range[3]); // unset offset 3
echo check($range); // check
function check($array){
if($array[0] == 0){
unset($array[0]); // get rid of the zero
}
sort($array); // sorting
$first = reset($array); // get the first value
$last = end($array); // get the last value
$sum = ($last * ($first + $last)) / 2; // the algo
$actual_sum = array_sum($array); // the actual sum
if($sum == $actual_sum){
return $last + 1; // no missing number
}else{
return $sum - $actual_sum; // missing number
}
}
Output
8
3
Online demo
If there are several numbers missing, then just use array_map() or something similar to do an internal loop.
Regex solution
Let's take this to a new level and use regex ! I know it's nonsense, and it shouldn't be used in real world application. The goal is to show the true power of regex :)
So first let's make a string out of our range in the following format: I,II,III,IIII for the range 1 to 4.
$range = range(0,7);
if($range[0] === 0){ // get rid of 0
unset($range[0]);
}
$str = implode(',', array_map(function($val){return str_repeat('I', $val);}, $range));
echo $str;
The output should be something like: I,II,III,IIII,IIIII,IIIIII,IIIIIII.
I've come up with the following regex: ^(?=(I+))(^\1|,\2I|\2I)+$. So what does this mean ?
^ # match begin of string
(?= # positive lookahead, we use this to not "eat" the match
(I+) # match I one or more times and put it in group 1
) # end of lookahead
( # start matching group 2
^\1 # match begin of string followed by what's matched in group 1
| # or
,\2I # match a comma, with what's matched in group 2 (recursive !) and an I
| # or
\2I # match what's matched in group 2 and an I
)+ # repeat one or more times
$ # match end of line
Let's see what's actually happening ....
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^
(I+) do not eat but match I and put it in group 1
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^
^\1 match what was matched in group 1, which means I gets matched
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^^^ ,\2I match what was matched in group 1 (one I in this case) and add an I to it
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^^^^ \2I match what was matched previously in group 2 (,II in this case) and add an I to it
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^^^^^ \2I match what was matched previously in group 2 (,III in this case) and add an I to it
We're moving forward since there is a + sign which means match one or more times,
this is actually a recursive regex.
We put the $ to make sure it's the end of string
If the number of I's don't correspond, then the regex will fail.
See it working and failing. And Let's put it in PHP code:
$range = range(0,7);
if($range[0] === 0){
unset($range[0]);
}
$str = implode(',', array_map(function($val){return str_repeat('I', $val);}, $range));
if(preg_match('#^(?=(I*))(^\1|,\2I|\2I)+$#', $str)){
echo 'works !';
}else{
echo 'fails !';
}
Now let's also return the number that's missing: we remove the $ end anchor so the regex does not fail, and we use group 2 to recover the missing number:
$range = range(0,7);
if($range[0] === 0){
unset($range[0]);
}
unset($range[2]); // remove 2
$str = implode(',', array_map(function($val){return str_repeat('I', $val);}, $range));
preg_match('#^(?=(I*))(^\1|,\2I|\2I)+#', $str, $m); // REGEEEEEX !!!
$n = strlen($m[2]); //get the length ie the number
$sum = array_sum($range); // array sum
if($n == $sum){
echo $n + 1; // no missing number
}else{
echo $n - 1; // missing number
}
Online demo
EDIT: NOTE
This question is about performance. Functions like array_diff and array_filter are not magically fast. They can add a huge time penalty. Replacing a loop in your code with a call to array_diff will not magically make things fast, and will probably make things slower. You need to understand how these functions work if you intend to use them to speed up your code.
This answer uses the assumption that no items are duplicated and no invalid elements exist to allow us to use the position of the element to infer its expected value.
This answer is theoretically the fastest possible solution if you start with a sorted list. The solution posted by Jack is theoretically the fastest if sorting is required.
In the series [0,1,2,3,4,...], the n'th element has the value n if no elements before it are missing. So we can spot-check at any point to see if our missing element is before or after the element in question.
So you start by cutting the list in half and checking whether the item at position x equals x:
[ 0 | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 ]
^
Yup, list[4] == 4. So move halfway from your current point to the end of the list.
[ 0 | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 ]
^
Uh-oh, list[6] == 7. So somewhere between our last checkpoint and the current one, one element was missing. Divide the difference in half and check that element:
[ 0 | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 ]
^
In this case, list[5] == 5
So we're good there. So we take half the distance between our current check and the last one that was abnormal. And oh.. it looks like cell n+1 is one we already checked. We know that list[6]==7 and list[5]==5, so the element number 6 is the one that's missing.
Since each step divides the number of elements to consider in half, you know that your worst-case performance is going to check no more than log2 of the total list size. That is, this is an O(log(n)) solution.
If this whole arrangement looks familiar, it's because you learned it back in your second year of college in a Computer Science class. It's a minor variation on the binary search algorithm, one of the most widely used index schemes in the industry. Indeed, this question appears to be a perfectly contrived application for this searching technique.
You can of course repeat the operation to find additional missing elements, but since you've already tested the values at key elements in the list, you can avoid re-checking most of the list and go straight to the interesting ones left to test.
Also note that this solution assumes a sorted list. If the list isn't sorted then obviously you sort it first. Except, binary searching has some notable properties in common with quicksort. It's quite possible that you can combine the process of sorting with the process of finding the missing element and do both in a single operation, saving yourself some time.
Finally, to sum up the list, that's just a stupid math trick thrown in for good measure. The sum of a list of numbers from 1 to N is just N*(N+1)/2. And if you've already determined that any elements are missing, then obviously just subtract the missing ones.
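The halving procedure described above can be sketched in PHP roughly as follows. This is only an illustration under the answer's stated assumptions (the list is sorted, has no duplicates, holds only integers, and starts at 0); the function name firstMissing is made up for this example.

```php
<?php
// Hypothetical helper: binary search for the smallest missing value in a
// sorted, duplicate-free, zero-indexed list of integers that starts at 0.
function firstMissing(array $sorted): int
{
    $lo = 0;
    $hi = count($sorted); // the answer always lies in [0, count]
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($sorted[$mid] === $mid) {
            // Position and value agree, so nothing is missing up to $mid
            $lo = $mid + 1;
        } else {
            // The value has run ahead of its position: a gap exists at or before $mid
            $hi = $mid;
        }
    }
    return $lo;
}

echo firstMissing([0, 1, 2, 3, 4, 5, 7, 8, 9]), PHP_EOL; // the list from the diagrams above
echo firstMissing(range(0, 7)), PHP_EOL;                 // nothing missing: maximum plus one
```

The first call checks positions 4, 7, 6 and 5 and returns 6; the second returns 8. The PHP loop is still there, but it runs about log2(n) times instead of n times.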
Technically, you can't really do without the loop (unless you only want to know if there's a missing number). However, you can accomplish this without first sorting the array.
The following algorithm uses O(n) time with O(n) space:
$range = [0, 1, 2, 3, 4, 6, 7];
$N = count($range);
$temp = str_repeat('0', $N); // assume all values are out of place
foreach ($range as $value) {
if ($value < $N) {
$temp[$value] = 1; // value is in the right place
}
}
// count number of leading ones
echo strspn($temp, '1'), PHP_EOL;
It builds an ordered identity map of N entries, marking each value against its position as "1"; in the end all entries must be "1", and the first "0" entry is the smallest value that's missing.
Btw, I'm using a temporary string instead of an array to reduce physical memory requirements.
I honestly don't get why you wouldn't want to use a loop. There's nothing wrong with loops. They're fast, and you simply can't do without them. However, in your case, there is a way to avoid having to write your own loops, using PHP core functions. They do loop over the array, though, but you simply can't avoid that.
Anyway, I gather what you're after, can easily be written in 3 lines:
function highestPlus(array $in)
{
$compare = range(min($in), max($in));
$diff = array_diff($compare, $in);
return empty($diff) ? max($in) +1 : reset($diff);//array_diff preserves keys, so $diff[0] may not exist
}
Tested with:
echo highestPlus(range(0,11));//echoes 12
$arr = array(9,3,4,1,2,5);
echo highestPlus($arr);//echoes 6
And now, to shamelessly steal Pé de Leão's answer (but "augment" it to do exactly what you want):
function highestPlus(array $range)
{//an unreadable one-liner... horrid, so don't, but know that you can...
return min(array_diff(range(0, max($range)+1), $range)) ?: max($range) +1;
}
How it works:
$compare = range(min($in), max($in));//range(lowest value in array, highest value in array)
$diff = array_diff($compare, $in);//get all values present in $compare, that aren't in $in
return empty($diff) ? max($in) +1 : reset($diff);
//-------------------------------------------------
// read as:
if (empty($diff))
{//every number in min-max range was found in $in, return highest value +1
return max($in) + 1;
}
//there were numbers in min-max range, not present in $in, return first missing number
//(array_diff preserves keys, so use reset() rather than $diff[0]):
return reset($diff);
That's it, really.
Of course, if the supplied array might contain null or falsy values, or even strings, and duplicate values, it might be useful to "clean" the input a bit:
function highestPlus(array $in)
{
$clean = array_filter(
$in,
'is_numeric'//or even is_int
);
$compare = range(min($clean), max($clean));
$diff = array_diff($compare, $clean);//duplicates aren't an issue here
return empty($diff) ? max($clean) + 1 : reset($diff);
}
Useful links:
The array_diff man page
The max and min functions
Good Ol' range, of course...
The array_filter function
The array_map function might be worth a look
Just as array_sum might be
$range = array(0,1,2,3,4,6,7);
// sort just in case the range is not in order
asort($range);
$range = array_values($range);
$indexes = array_keys($range);
$diff = array_diff($indexes,$range);
echo reset($diff); // >> will print: 5 (array_diff preserves keys, so $diff[0] is not set here)
// if $diff is an empty array - you can print
// the "maximum value of the range plus one": $range[count($range)-1]+1
echo min(array_diff(range(0, max($range)+1), $range));
Simple
$array1 = array(0,1,2,3,4,5,6,7);// array with actual number series
$array2 = array(0,1,2,4,6,7); // array with your custom number series
$missing = array_diff($array1,$array2);
sort($missing);
echo $missing[0];
$range = array(0,1,2,3,4,6,7);
$max=max($range);
$expected_total=($max*($max+1))/2; // sum if no number was missing.
$actual_total=array_sum($range); // sum of the input array.
if($expected_total==$actual_total){
echo $max+1; // no difference, so no number is missing: echo the maximum plus one.
}else{
echo $expected_total-$actual_total; // the difference will be the missing number.
}
you can use array_diff() like this
<?php
$range = array("0","1","2","3","4","6","7","9");
asort($range);
$len=count($range);
if($range[$len-1]==$len-1){
$r=$range[$len-1];
}
else{
$ref= range(0,$len-1);
$result = array_diff($ref,$range);
$r=implode($result);
}
echo $r;
?>
function missing( $v ) {
static $p = -1;
$d = $v - $p - 1;
$p = $v;
return $d?1:0;
}
$result = array_search( 1, array_map( "missing", $ARRAY_TO_TEST ) );

Count unique appearance of substring in a list of words without knowing the substr?

I am trying to count the unique appearances of a substring inside a list of words.
So check the list of words, detect whether any substrings of at least a minimum number of characters occur multiple times, and count them. I don't know the substrings in advance.
This is a working solution where you know the substring, but what if you do not know it?
There's a minimum character count that the substrings are based on.
It will find all the words where "Book" is a substring of the word, using the PHP function below.
Wanted outcome instead:
book count (5)
stor count (2)
Given a string of length 100
book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789... ... 100
your algorithm could be:
Investigate substrings from different starting points and substring lengths.
You take all substrings starting from 0 with a length from 1-100, so: 0-1, 0-2, 0-3,... and see if any of those substrings occurs more than once in the overall string.
Progress through the string by starting at increasing positions, searching all substrings starting from 1, i.e. 1-2, 1-3, 1-4,... and so on until you reach 99-100.
Keep a table of all substrings and their number of occurrences and you can sort them.
You can optimize by specifying a minimum and maximum length, which reduces the number of searches and improves hit accuracy quite dramatically. Additionally, once you find a substring, save it in an array of searched substrings. If you encounter the substring again, skip it (i.e. hits for book that you already counted should not be counted again when you hit the next book substring). Furthermore, you will never have to search for strings that are longer than half of the total string.
For the example string you might run an additional test for the uniqueness of a string.
You'd have
o x ..
oo x 7
bo x 7
ok x 6
book x 5
booking x 2
bookingservice x 1
Disregarding strings shorter than 3 (and longer than half of the total text string), you'd get
book x 5
booking x 2
bookingservice x 1
which is already quite a plausible result.
[edit] This would obviously look through all of the string, not just natural words.
[edit] Normally I don't like writing code for OPs, but in this case I got a bit interested myself:
$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";
function search($string, $min = 4, $max = 16, $threshhold = 2) {
echo "<pre><br/>";
echo "searching <em>'$string'</em> for string occurances ";
echo "of length $min - $max: <br/>";
$hits = array();
$foundStrings = array();
// no string longer than half of the total string will be found twice
if ($max > strlen($string) / 2) {
$max = strlen($string);
}
// examine substrings:
// start from 0, 1, 2...
for ($start = 0; $start < $max; $start++) {
// and string length 1, 2, 3, ... $max
for ($length = $min; $length < strlen($string); $length++) {
// get the substring in question,
// but search for natural words (trim)
$substring = trim(substr($string, $start, $length));
// if substring was not counted yet,
// add the found count to the hits
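if (!in_array($substring, $foundStrings)) {
// remember the substring so it is only searched once
$foundStrings[] = $substring;
// preg_quote() keeps characters in the text from being read as regex syntax
preg_match_all("/" . preg_quote($substring, "/") . "/i", $string, $matches);
$hits[$substring] = count($matches[0]);
}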
}
}
// sort the hits array desc by number of hits
arsort($hits);
// remove substrings with fewer hits than the threshold
foreach ($hits as $substring => $count) {
if ($count < $threshhold) {
unset($hits[$substring]);
}
}
print_r($hits);
}
search($string);
?>
The comments and variable names should make the code explain itself. In your case, $string would come from a file you read. This example would output:
searching 'book bookshelf booking foobar bar booking selfservice
bookingservice cooking' for string occurances of length 4 - 16:
Array
(
[ook] => 6
[book] => 5
[boo] => 5
[bookin] => 3
[booking] => 3
[booki] => 3
[elf] => 2
)
Let me know how you implement it :)
This is my first approximation: unfinished, untested, has at least 1 bug, and is written in Eiffel. Well, I am not going to do all the work for you.
deferred class
SUBSTRING_COUNT
feature
threshold : INTEGER_32 =5
biggest_starting_substring_length(a,b:STRING):INTEGER_32
deferred
end
biggest_starting_substring(a,b:STRING):STRING
do
Result := a.substring(0,biggest_starting_substring_length(a,b))
end
make_list_of_substrings(a,b:STRING)
local
index:INTEGER_32
this_one: STRING
do
from
a_index := b_index + 1
invariant
a_index >=0 and a_index <= a.count
until
a_index >= a.count
loop
this_one := biggest_starting_substring(a.substring (a_index, a.count-1),b)
if this_one.count > threshold then
list.extend (this_one)
end
variant
a.count - a_index
end
end -- biggest_substring
list : ARRAYED_LIST[STRING]
end

Listing by alphabet, groups letters with few entries together (PHP or JS)

I am working on a Web Application that includes long listings of names. The client originally wanted to have the names split up into divs by letter so it is easy to jump to a particular name on the list.
Now, looking at the list, the client pointed out several letters that have only one or two names associated with them. He now wants to know if we can combine several consecutive letters if there are only a few names in each.
(Note that letters with no names are not displayed at all.)
What I do right now is have the database server return a sorted list, then keep a variable containing the current character. I run through the list of names, incrementing the character and printing the opening and closing div and ul tags as I get to each letter. I know how to adapt this code to combine some letters, however, the one thing I'm not sure about how to handle is whether a particular combination of letters is the best-possible one. In other words, say that I have:
A - 12 names
B - 2 names
C - 1 name
D - 1 name
E - 1 name
F - 23 names
I know how to end up with a group A-C and then have D by itself. What I'm looking for is an efficient way to realize that A should be by itself and then B-D should be together.
I am not really sure where to start looking at this.
If it makes any difference, this code will be used in a Kohana Framework module.
UPDATE 2012-04-04:
Here is a clarification of what I need:
Say the minimum number of items I want in a group is 30. Now say that letter A has 25 items, letters B, C, and D, have 10 items each, and letter E has 32 items. I want to leave A alone because it will be better to combine B+C+D. The simple way to combine them is A+B, C+D+E - which is not what I want.
In other words, I need the best fit that comes closest to the minimum per group.
If a letter contains more than 10 names, or whatever reasonable limit you set, do not combine it with the next one. However, if you start combining letters, you might have it run until 15 or so names are collected if you want, as long as no individual letter has more than 10. That's not a universal solution, but it's how I'd solve it.
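The rule described above could be sketched in PHP as follows. This is a rough illustration, not a best-fit solver: the function name groupLetters and the parameters $soloLimit and $groupTarget are made up for the example, and the limits 10 and 15 are simply the figures suggested above.

```php
<?php
// Hypothetical helper implementing the rule above.
// $counts maps each letter to its number of names, in alphabetical order.
function groupLetters(array $counts, int $soloLimit = 10, int $groupTarget = 15): array
{
    $groups = [];
    $letters = []; // letters in the group currently being built
    $total = 0;    // names collected in that group so far
    // Close the current group, labelling it "B" or "B-E"
    $flush = function () use (&$groups, &$letters, &$total) {
        if ($letters) {
            $label = count($letters) > 1
                ? $letters[0] . '-' . end($letters)
                : $letters[0];
            $groups[$label] = $total;
            $letters = [];
            $total = 0;
        }
    };
    foreach ($counts as $letter => $n) {
        if ($n > $soloLimit) {
            $flush();              // finish any pending group of small letters
            $groups[$letter] = $n; // a large letter always stands alone
            continue;
        }
        $letters[] = $letter;
        $total += $n;
        if ($total >= $groupTarget) {
            $flush();
        }
    }
    $flush(); // leftover small letters form the last group
    return $groups;
}

print_r(groupLetters(['A' => 12, 'B' => 2, 'C' => 1, 'D' => 1, 'E' => 1, 'F' => 23]));
```

With the counts from the question, A (12) and F (23) stay on their own and B through E are combined into one group of 5 names.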
I came up with this function using PHP.
It groups consecutive letters until the group holds at least $ammount names.
function split_by_initials($names,$ammount,$tollerance = 0) {
$total = count($names);
foreach($names as $name) {
$filtered[$name[0]][] = $name;
}
$count = 0;
$key = '';
$temp = array();
foreach ($filtered as $initial => $split) {
$count += count($split);
$temp = array_merge($split,$temp);
$key .= $initial.'-';
if ($count >= $ammount || $count >= $ammount - $tollerance) {
$result[$key] = $temp;
$count = 0;
$key = '';
$temp = array();
}
}
return $result;
}
The third parameter is for when you want to leave a single letter in its own group even though it doesn't reach the amount specified but is close enough.
For example: I want to split into groups of 30, but A has 25.
So if you set a tolerance of 5, A will be left alone and the other letters will be grouped.
I forgot to mention it, but the function returns a multidimensional array keyed by the letters each group contains, with the names in that group as the value.
Something like
Array
(
[A-B-C-] => Array
(
[0] => Bandice Bergen
[1] => Arey Lowell
[2] => Carmen Miranda
)
)
It is not exactly what you needed but I think it's close enough.
Using the jsfiddle that mrsherman put, I came up with something that could work: http://jsfiddle.net/F2Ahh/
Obviously that is to be used as pseudocode; some techniques could be applied to make it more efficient. But it gets the job done.
JavaScript version: enhanced version with sorting and symbol grouping
function group_by_initials(names,ammount,tollerance) {
tollerance = tollerance || 0; // default to no tolerance when the argument is omitted
var total = names.length;
var filtered={}
var result={};
$.each(names,function(key,value){
val=value.trim();
var pattern = /[a-zA-Z0-9&_\.-]/
if(val[0].match(pattern)) {
intial=val[0];
}
else
{
intial='sym';
}
if(!(intial in filtered))
filtered[intial]=[];
filtered[intial].push(val);
})
var count = 0;
var key = '';
var temp = [];
$.each(Object.keys(filtered).sort(),function(ky,value){
count += filtered[value].length;
temp = temp.concat(filtered[value])
key += value+'-';
if (count >= ammount || count >= ammount - tollerance) {
key = key.substring(0, key.length - 1);
result[key] = temp;
count = 0;
key = '';
temp = [];
}
})
return result;
}
