I'm trying to create a basic concordance script that will print the ten words before and after the value found inside an array. I did this by splitting the text into an array, identifying the position of the value, and then printing -10 and +10 with the searched value in the middle. However, this only presents the first such occurrence. I know I can find the others by using array_keys (found in positions 52, 78, 80), but I'm not quite sure how to cycle through the matches, since array_keys also results in an array. Thus, using $matches (with array_keys) in place of $location below doesn't work, since you cannot use the same operands on an array as an integer. Any suggestions? Thank you!!
<?php
$text = <<<EOD
The spread of a deadly new virus is accelerating, Chinese President Xi Jinping warned, after holding a special government meeting on the Lunar New Year public holiday.
The country is facing a "grave situation" Mr Xi told senior officials.
The coronavirus has killed at least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a real possibility that China will not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned from central districts of Wuhan, the source of the outbreak.
EOD;
$new = explode(" ", $text);
$location = array_search("in", $new, FALSE);
$concordance = 10;
$top_range = $location + $concordance;
$bottom_range = $location - $concordance;
while($bottom_range <= $top_range) {
echo $new[$bottom_range] . " ";
$bottom_range++;
}
?>
You can just iterate over the values returned by array_keys, using array_slice to extract the $concordance words either side of the location and implode to put the sentence back together again:
$words = explode(' ', $text);
$concordance = 10;
$results = array();
foreach (array_keys($words, 'in') as $idx) {
$results[] = implode(' ', array_slice($words, max($idx - $concordance, 0), $concordance * 2 + 1));
}
print_r($results);
Output:
Array
(
[0] => least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a
[1] => not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will
[2] => able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned
)
If you want to avoid generating similar phrases where a word occurs twice within $concordance words (e.g. indexes 1 and 2 in the above array), you can maintain a position for the end of the last match, and skip occurrences that occur in that match:
$words = explode(' ', $text);
$concordance = 10;
$results = array();
$last = 0;
foreach (array_keys($words, 'in') as $idx) {
if ($idx < $last) continue;
$results[] = implode(' ', array_slice($words, max($idx - $concordance, 0), $concordance * 2 + 1));
$last = $idx + $concordance;
}
print_r($results);
Output:
Array
(
[0] => least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a
[1] => not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will
)
Demo on 3v4l.org
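One detail worth checking in the snippets above: the slice length is always $concordance * 2 + 1, so a match that sits fewer than $concordance words from the start of the text picks up extra trailing words. The sketch below clamps both ends instead. It is only an illustration, assuming PHP 7 or later; the function name concordance() and its parameters are made up for this example, and it splits on any whitespace (not just single spaces) so that "Wuhan." and "Meanwhile," on adjacent lines count as separate words.

```php
<?php
// Hypothetical helper: return one context string per occurrence of $needle.
function concordance(string $text, string $needle, int $width = 10): array
{
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $results = [];
    foreach (array_keys($words, $needle, true) as $idx) {
        $start = max($idx - $width, 0);                 // clamp at the first word
        $end = min($idx + $width, count($words) - 1);   // clamp at the last word
        $results[] = implode(' ', array_slice($words, $start, $end - $start + 1));
    }
    return $results;
}

print_r(concordance('a b c in d e f in g', 'in', 2));
```

With a width of 2, the short sample text yields the contexts "b c in d e" and "e f in g".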
Try this:
<?php
$text = <<<EOD
The spread of a deadly new virus is accelerating, Chinese President Xi Jinping warned, after holding a special government meeting on the Lunar New Year public holiday.
The country is facing a "grave situation" Mr Xi told senior officials.
The coronavirus has killed at least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a real possibility that China will not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned from central districts of Wuhan, the source of the outbreak.
EOD;
$words = explode(" ", $text);
$concordance = 10; // range -+
$result = []; // result array
$index = 0;
if (count($words) === 0) // be sure there is no empty array
exit;
do {
$location = array_search("in", $words, false);
if ($location === false) // break loop if "in" is not found (index 0 is a valid match)
break;
$count = count($words);
// check range of array indexes
$minRange = ($location - $concordance > 0) ? ($location-$concordance) : 0; // array can't have index less than 0 (shorthand if)
$maxRange = (($location + $concordance) < ($count - 1)) ? ($location+$concordance) : $count - 1; // array can't have index equal or higher than array count (shorthand if)
for ($range = $minRange; $range <= $maxRange; $range++) { // <= so the last word of the range is included
$result[$index][] = $words[$range]; // group words by index
}
unset($words[$location]); // delete element which contain "in"
$words = array_values($words); // reindex array
$index++;
} while ($location !== false); // repeat while a match exists
print_r($result); // <--- here's your results
?>
I have created a function which randomly generates a phrase from a hardcoded list of words. I have a function get_words() which has a string of hardcoded words, which it turns into an array then shuffles and returns.
get_words() is called by generate_random_phrase(), which iterates through the result of get_words() n times, and on every iteration concatenates the nth word onto the final phrase, which is then returned to the user.
My problem is, for some reason PHP keeps giving me inconsistent results. It does give me words which are randomized, but it gives inconsistent number of words. I specify 4 words as the default and it gives me phrases ranging from 1-4 words instead of 4. This program is so simple it is almost unbelievable I can't pinpoint the exact issue. It seems like the broken link in the chain is the $words array which is being indexed, it seems like for some reason sometimes the indexing fails. I am unfamiliar with PHP, can someone explain this to me?
<?php
function generate_random_phrase() {
$words = get_words();
$number_of_words = get_word_count();
$phrase = "";
$symbols = "!##$%^&*()";
echo print_r($phrase);
for ($i = 0;$i < $number_of_words;$i++) {
$phrase .= " ".$words[$i];
}
if (isset($_POST['include_numbers']))
$phrase = $phrase.rand(0, 9);
if (isset($_POST['include_symbols']))
$phrase = $phrase.$symbols[rand(0, 9)];
return $phrase;
}
function get_word_count() {
if ($_POST['word_count'] < 1 || $_POST['word_count'] > 9)
$word_count = 4; #default
else
$word_count = $_POST['word_count'];
return $word_count;
}
function get_words() {
$BASE_WORDS = "my sentence really hope you
like narwhales bacon at midnight but only
ferver where can paper laptops spoon door knobs
head phones watches barbeque not say";
$words = explode(' ', $BASE_WORDS);
shuffle($words);
return $words;
}
?>
The tabs and newlines in $BASE_WORDS end up as extra entries in the exploded array; that's why. Remove the newlines and tabs and it will generate the correct result, i.e.:
$BASE_WORDS = "my sentence really hope you like narwhales bacon at midnight but only ferver where can paper laptops spoon door knobs head phones watches barbeque not say";
Your function is a bit inconsistent because the exploded array also contains whitespace entries, and your loop picks those up as if they were words. Five entries (four real words plus one whitespace entry) is not really correct. You could filter out the empty and whitespace-only entries first.
Here is the visual representation of what I mean:
Array
(
[0] => // hello im a whitespace, i should not be in here since im not really a word
[1] => but
[2] =>
[3] => bacon
[4] => spoon
[5] => head
[6] => barbeque
[7] =>
[8] =>
[9] => sentence
[10] => door
[11] => you
[12] =>
[13] => watches
[14] => really
[15] => midnight
[16] =>
So when you loop over it, you pick up the whitespace entries too. In this case, if the number of words is 5, you don't really get 5 words: from indexes 0 to 4 it will look like you only got 3 (1 => but, 3 => bacon, 4 => spoon).
Here is a modified version:
function generate_random_phrase() {
$words = get_words();
$number_of_words = get_word_count();
$phrase = "";
$symbols = "!##$%^&*()";
$words = array_filter(array_map('trim', $words)); // filter empty words
$phrase = implode(' ', array_slice($words, 0, $number_of_words)); // no need for a loop
// this simply gets the array from the first until the desired number of words (0,5 or 0,9 whatever)
// and then implode, just glues all the words with space
// so this ensure its always according to how many words you want
if (isset($_POST['include_numbers']))
$phrase = $phrase.rand(0, 9);
if (isset($_POST['include_symbols']))
$phrase = $phrase.$symbols[rand(0, 9)];
return $phrase;
}
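An alternative to trimming and filtering afterwards is to split on whitespace in the first place. The sketch below is a variation, not the code from the answer above: get_words_clean() is a hypothetical name, and it takes the word list as a parameter only to keep the example self-contained. It relies on preg_split() with the PREG_SPLIT_NO_EMPTY flag, which treats any run of spaces, tabs, or newlines as one separator.

```php
<?php
// Hypothetical variant of get_words(): split on any run of whitespace.
function get_words_clean(string $baseWords): array
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces that explode(' ') would keep.
    $words = preg_split('/\s+/', $baseWords, -1, PREG_SPLIT_NO_EMPTY);
    shuffle($words);
    return $words;
}

$base = "my sentence really hope you
    like narwhales bacon at midnight but only";
$words = get_words_clean($base);

// Always four real words, because no entry is empty or whitespace-only.
echo implode(' ', array_slice($words, 0, 4)), "\n";
```

Because every entry is a real word, slicing the first n entries should always give exactly n words, as long as the list holds at least n of them.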
Inconsistent spacing in your words list is the issue.
Here is a fix:
function get_words() {
// concatenate the pieces so that no newline or indentation ends up inside the string
$BASE_WORDS = "my|sentence|really|hope|you|"
. "like|narwhales|bacon|at|midnight|but|only|"
. "ferver|where|can|paper|laptops|spoon|door|knobs|"
. "head|phones|watches|barbeque|not|say";
$words = explode('|', $BASE_WORDS);
shuffle($words);
return $words;
}
EDIT 1 - Since posting I have learnt that the underlying question is about how to find the CARTESIAN PRODUCT (worth googling). However, I don't want every permutation: I only want the Cartesian products that never use the same sub-array key more than once per permutation. My 'extra' question is then about how to minimise the workload that a Cartesian product would require (accepting a small error rate, I have to say).
Imagine... I have four cooks and four recipes, each cook has a score for each recipe and today I'd like each cook to make one dish (but no dish should be made twice) and the decision should be based on the best (highest total scores) permutation for all four (so maybe a cook won't make his personal best).
I have put the data into a multi-dimensional array as such
array(
array (1,2,3,4),
array (35,0,0,0),
array (36,33,1,1),
array (20,20,5,3)
)
it has the same number of valuepairs in each sub array as the number of sub-arrays (if that helps any)
in reality the number of sub-arrays would reach a maximum of 8 (max perms therefore =8!, approx 40,000 not 8^8 because many combinations are not allowed)
the choice of having the data in this format is flexible if that helps
I am trying to create a second array that would output the best (ie HIGHEST value) possible combination of the sub-arrays as per KEYs where only ONE of each subarray can be used
--so here each subarray [0][1][2][3] would be used once per permutation
and each subarray key [0][1][2][3] would be used once per permutation; in my actual problem I'm using associative arrays, but that is extra to this issue.--
So the example would create an array as such
newArray (35,33,5,4) // note that [2][0] was not used
IDEALLY I would prefer to not produce the ALL perms but rather, SOMEHOW, discard many combinations that would clearly not be best fit.
Any ideas for how to start? I would accept pseudo code.
For an example on SO about Cartesian Product, see PHP 2D Array output all combinations
EDIT 2
for more on making cartesian products more efficient, and maybe why it has to be case specific if you want to see if you can cut corners (with risk) Efficient Cartesian Product algorithm
Apologies, but this is going to be more of a logic layout than code...
It's not quite clear to me whether the array(1,2,3,4) are the scores for the first dish or for the first cook, but I would probably use an array such that
$array[$cook_id][$dish_number] = $score;
arsort() each array so that $array[$cook_id] = array($highest_scored_dish,...,$lowest);
Define a cook's weighted preference for a dish as the difference between the score of that cook's best dish and the score of another dish.
As a very simple example, cooks a,b,c and dishes 0,1,2
$array['a'] = array(0=>100, 1=>50, 2=>0); // cook a prefers 0 over 1 with weight 50, over 2 with weight 100
$array['b'] = array(0=>100, 1=>100, 2=>50); // cook b prefers 0,1 over 2 with weight 50
$array['c'] = array(0=>50, 1=>50, 2=>100); // cook c prefers 2 with weight 50
After arsort():
$array['a'] = array(0=>100, 1=>50, 2=>0);
$array['b'] = array(0=>100, 1=>100, 2=>50);
$array['c'] = array(2=>100, 0=>50, 1=>50);
Start with cook 'a', who prefers dish 0 over his next best dish by 50 points (weight). Cook 'b' also prefers dish 0, but with a weight of 0 over the next dish. Therefore it's likely (though not yet certain) that cook 'a' should make dish 0.
Consider dish 0 to be reserved and move on to cook 'b'. Excluding dish 0, cook 'b' prefers dish 1. No other cook prefers dish 1, so cook 'b' is assigned dish 1.
Cook 'c' gets dish 2 by default.
This is a VERY convenient example where each cook gets to cook something that's a personal max, but I hope it's illustrative of some logic that would work out.
Let's make it less convenient:
$array['a'] = array(0=>75, 1=>50, 2=>0);
$array['b'] = array(0=>100, 1=>50, 2=>50);
$array['c'] = array(0=>100, 1=>25, 2=>25);
Start again with cook 'a' and see that 0 is preferred, but this time with weight 25. Cook 'b' prefers with a weight of 50 and cook 'c' prefers with a weight of 75. Cook 'c' wins dish 0.
Going back to the list of available cooks, 'a' prefers 1 with a weight of 50, but 'b' prefers it with weight 0. 'a' gets dish 1 and 'b' gets dish 2.
This still doesn't take care of all complexities, but it's a step in the right direction. Sometimes the assumption made for the first cook/dish combination will be wrong.
WAY less convenient:
$array['a'] = array(0=>200, 1=>148, 2=>148, 3=>0);
$array['b'] = array(0=>200, 1=>149, 2=>0, 3=>0);
$array['c'] = array(0=>200, 1=>150, 2=>147, 3=>147);
$array['d'] = array(0=>69, 1=>18, 2=>16, 3=>15);
'a' gets 0 since that's the max and no one else who prefers 0 has a higher weight
'b' wins 1 with a weight of 149
'd' wins 2 since 'c' doesn't have a preference from the available options
'c' gets 3
score: 200+149+147+16 = 512
While that's a good guess that's gathered without checking every permutation, it may be wrong. From here, ask, "If one cook traded with any one other cook, would the total increase?"
The answer is YES, a(0)+d(2) = 200+16 = 216, but a(2)+d(0) = 148+69 = 217.
I'll leave it to you to write the code for the "best guess" using the weighted approach, but after that, here's a good start for you:
// a totally uneducated guess...
$picks = array(0=>'a', 1=>'b', 2=>'c', 3=>'d');
do {
$best_change = false;
$best_change_weight = 0;
foreach ($picks as $dish1 => $cook1) {
foreach ($picks as $dish2 => $cook2) {
if (($array[$cook1][$dish1] + $array[$cook2][$dish2]) <
($array[$cook1][$dish2] + $array[$cook2][$dish1]))
{
$old_score = $array[$cook1][$dish1] + $array[$cook2][$dish2];
$new_score = $array[$cook1][$dish2] + $array[$cook2][$dish1];
if (($new_score - $old_score) > $best_change_weight) {
$best_change_weight = $new_score - $old_score;
$best_change = $dish2;
}
}
}
if ($best_change !== false) {
$cook2 = $picks[$best_change];
$picks[$dish1] = $cook2;
$picks[$best_change] = $cook1; // swap with the best candidate ($dish2 only holds the last dish iterated)
break;
}
}
} while ($best_change !== false);
I can't find a counter example to show that this doesn't work, but I'm suspicious of the case where
($array[$cook1][$dish1] + $array[$cook2][$dish2])
==
($array[$cook1][$dish2] + $array[$cook2][$dish1])
Maybe someone else will follow up with an answer to this "What if?"
Given this matrix, where the items in brackets are the "picks"
[a1] a2 a3
b1 [b2] b3
c1 c2 [c3]
If a1 + b2 == a2 + b1, then 'a' and 'b' will not switch dishes. The case I'm not 100% sure about is if there exists a matrix such that this is a better choice:
a1 [a2] a3
b1 b2 [b3]
[c1] c2 c3
Getting from the first state to the second requires two switches, the first of which seems arbitrary since it doesn't change the total. But, only by going through this arbitrary change can the last switch be made.
I tried to find an example 3x3 such that based on the "weighted preference" model I wrote about above, the first would be selected, but also such that the real optimum selection is given by the second. I wasn't able to find an example, but that doesn't mean that it doesn't exist. I don't feel like doing more matrix algebra right now, but maybe someone will pick up where I left off. Heck, maybe the case doesn't exist, but I thought I should point out the concern.
If it does work and you start with the correct pick, the above code will only loop through 64 times (8x8) for 8 cooks/dishes. If the pick is not correct and the first cook has a change, then it will go up to 72. If the 8th cook has a change, it's up to 128. It's possible that the do-while will loop several times, but I doubt it will get up near the CPU cycles required to sum all of the 40k combinations.
I may have a starting point for you with this algorithm that tries to choose cooks based on their ratio of max score to sum of scores (thus trying to choose chefs who are really good at one recipe but bad at the rest of the recipes to do that recipe)
$cooks = array(
array(1,2,3,4),
array(35,0,0,0),
array(36,33,1,1),
array(20,20,5,3)
);
$results = array();
while (count($cooks)) {
$curResult = array(
'cookId' => -1,
'recipe' => -1,
'score' => -1,
'ratio' => -1
);
foreach ($cooks as $cookId => $scores) {
$max = max($scores);
$ratio = $max / array_sum($scores);
if ($ratio > $curResult['ratio']) {
$curResult['cookId'] = $cookId;
$curResult['ratio'] = $ratio;
foreach ($scores as $recipe => $score) {
if ($score == $max) {
$curResult['recipe'] = $recipe;
$curResult['score'] = $score;
}
}
}
}
$results[$curResult['recipe']] = $curResult['score'];
unset($cooks[$curResult['cookId']]);
foreach ($cooks as &$cook) {
unset($cook[$curResult['recipe']]);
}
}
For the dataset provided, it does find what seems to be the optimum answer (35,33,5,4). However, it is still not perfect, for example, with the array:
$cooks = array(
array(1,2,3,4),
array(35,0,33,0),
array(36,33,1,1),
array(20,20,5,3)
);
The ideal answer would be (36,20,33,4), a total of 93, whereas this algorithm returns (35,33,5,4), a total of 77.
But since the question was asking for ideas of where to start, I guess this at least might suffice as something to start from :P
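For small inputs it can help to check a heuristic like this against an exhaustive search. The sketch below is not part of the answer above: best_assignment() is a hypothetical helper that tries every cook-to-recipe assignment, assuming a square score matrix with 0-based keys and PHP 7 or later. With at most 8 cooks that is 8! = 40,320 assignments, which should still be manageable.

```php
<?php
// Hypothetical helper: try every cook-to-recipe assignment and keep the best total.
// $scores[$cook][$recipe] is the score; assumes a square matrix with 0-based keys.
function best_assignment(array $scores): array
{
    $n = count($scores);
    $best = ['total' => -INF, 'recipes' => []];

    // $assigned[$cook] holds the recipe chosen for each cook handled so far.
    $search = function (array $assigned, array $free, $total) use (&$search, &$best, $scores, $n) {
        $cook = count($assigned);
        if ($cook === $n) {
            if ($total > $best['total']) {
                $best = ['total' => $total, 'recipes' => $assigned];
            }
            return;
        }
        foreach ($free as $i => $recipe) {
            $rest = $free;
            unset($rest[$i]); // this recipe is no longer available to later cooks
            $search(array_merge($assigned, [$recipe]), $rest, $total + $scores[$cook][$recipe]);
        }
    };
    $search([], range(0, $n - 1), 0);
    return $best;
}

$result = best_assignment(array(
    array(1, 2, 3, 4),
    array(35, 0, 33, 0),
    array(36, 33, 1, 1),
    array(20, 20, 5, 3),
));
print_r($result);
```

For the second dataset this reports a total of 93, with cook 0 on recipe 3, cook 1 on recipe 2, cook 2 on recipe 0, and cook 3 on recipe 1.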
Try this
$mainArr = array(
array (1,2,3,4) ,
array (35,0,0,0) ,
array (36,33,1,1) ,
array (20,20,5,3)
);
$i = 0;
foreach( $mainArr as $subArray )
{
foreach( $subArray as $key => $value)
{
$newArr[$key][$i]=$value;
$i++;
}
}
$finalArr = array();
foreach( $newArr as $newSubArray )
{
$finalArr[] = max($newSubArray);
}
print_r( $finalArr );
OK, here is a solution that finds the best assignment of one cook to one recipe, where no cook works twice and no recipe is made twice.
Credit for the code that calculates the permutations of an array goes to O'Reilly:
http://docstore.mik.ua/orelly/webprog/pcook/ch04_26.htm
CONSIDERATIONS:
The number of cooks and the number of recipes are the same.
Going above a 5 by 5 matrix as here will get very big very fast. (see part 2 to be posted shortly)
The logic:
A permutation of an array assigns each element a position as well as including it (a combination only does the latter). So why not map each key of such an array to a recipe? The permutation guarantees that no cook is repeated, and the keys guarantee that no recipe is repeated.
Please let me know if there are improvements or errors in my thinking or my code but here it is!
<?php
function pc_next_permutation($p, $size) {
//this is from http://docstore.mik.ua/orelly/webprog/pcook/ch04_26.htm
// slide down the array looking for where we're smaller than the next guy
for ($i = $size - 1; $i >= 0 && $p[$i] >= $p[$i+1]; --$i) { } // the $i >= 0 guard avoids reading $p[-1]
// if this doesn't occur, we've finished our permutations
// the array is reversed: (1, 2, 3, 4) => (4, 3, 2, 1)
if ($i == -1) { return false; }
// slide down the array looking for a bigger number than what we found before
for ($j = $size; $p[$j] <= $p[$i]; --$j) { }
// swap them
$tmp = $p[$i]; $p[$i] = $p[$j]; $p[$j] = $tmp;
// now reverse the elements in between by swapping the ends
for (++$i, $j = $size; $i < $j; ++$i, --$j) {
$tmp = $p[$i]; $p[$i] = $p[$j]; $p[$j] = $tmp;
}
return $p;
}
$cooks[441] = array(340=>5,342=>43,343=>50,344=>9,345=>0);
$cooks[442] = array(340=>5,342=>-33,343=>-30,344=>29,345=>0);
$cooks[443] = array(340=>5,342=>3,343=>0,344=>9,345=>10,);
$cooks[444] = array(340=>25,342=>23,343=>20,344=>19,345=>20,);
$cooks[445] = array(340=>27,342=>27,343=>26,344=>39,345=>50,);
//a consideration: this solution requires that the number of cooks equal the number of recipes
foreach ($cooks as $cooksCode => $cooksProfile){
$arrayOfCooks[]=$cooksCode;
$arrayOfRecipes = (array_keys($cooksProfile));
}
echo "<br/> here is the array of the different cooks<br/>";
print_r($arrayOfCooks);
echo "<br/> here is the array of the different recipes<br/>";
print_r($arrayOfRecipes);
$set = $arrayOfCooks;
$size = count($set) - 1;
$perm = range(0, $size);
$j = 0;
do {
foreach ($perm as $i) { $perms[$j][] = $set[$i]; }
} while ($perm = pc_next_permutation($perm, $size) and ++$j);
echo "<br/> here are all the permutations of the cooks<br/>";
print_r($perms);
$bestCombo = PHP_INT_MIN; // start below any possible total so all-negative score tables still produce a result
foreach($perms as $perm){
$thisScore =0;
foreach($perm as $key =>$cook){
$recipe= $arrayOfRecipes[$key];
$cookScore =$cooks[$cook][$recipe];
$thisScore = $thisScore+$cookScore;
}
if ($thisScore>$bestCombo){
$bestCombo=$thisScore;
$bestArray= $perm;
}
}
echo "<br/> here is the very best array<br/>";
print_r ($bestArray);
echo "<br/> best recipe assignment value is:".$bestCombo."<br/><br/>";
?>
I am trying to count the unique appearances of a substring inside a list of words.
So: check the list of words, detect whether any substrings of at least a minimum number of characters occur in multiple words, and count them. I don't know the substrings in advance.
This is a working solution where you know the substring, but what if you do not know it?
There is a minimum character count that the substrings are based on.
The PHP function below will find all the words where "Book" is a substring of the word.
Wanted outcome instead:
book count (5)
stor count (2)
Given a string of length 100
book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789... ... 100
your algorithm could be:
Investigate substrings from different starting points and substring lengths.
You take all substrings starting from 0 with a length from 1-100, so: 0-1, 0-2, 0-3,... and see if any of those substrings occurs more than once in the overall string.
Progress through the string by starting at increasing positions, searching all substrings starting from 1, i.e. 1-2, 1-3, 1-4,... and so on until you reach 99-100.
Keep a table of all substrings and their number of occurrences and you can sort them.
You can optimize by specifying a minimum and maximum length, which reduces the number of searches and improves hit accuracy quite dramatically. Additionally, once you find a substring, save it in an array of searched substrings. If you encounter the substring again, skip it (i.e. hits for book that you already counted should not be counted again when you hit the next book substring). Furthermore, you will never have to search strings that are longer than half of the total string.
For the example string you might run an additional test for the uniqueness of a string.
You'd have
o x ..
oo x 7
bo x 7
ok x 6
book x 5
booking x 2
bookingservice x 1
Disregarding strings shorter than 3 (and longer than half of the total text string), you'd get
book x 5
booking x 2
bookingservice x 1
which is already quite a plausible result.
[edit] This would obviously look through all of the string, not just natural words.
[edit] Normally I don't like writing code for OPs, but in this case I got a bit interested myself:
$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";
function search($string, $min = 4, $max = 16, $threshhold = 2) {
echo "<pre><br/>";
echo "searching <em>'$string'</em> for string occurances ";
echo "of length $min - $max: <br/>";
$hits = array();
$foundStrings = array();
// no string longer than half of the total string will be found twice
if ($max > strlen($string) / 2) {
$max = (int) floor(strlen($string) / 2); // cap at half the string length, as the comment above says
}
// examin substrings:
// start from 0, 1, 2...
for ($start = 0; $start < $max; $start++) {
// and string length 1, 2, 3, ... $max
for ($length = $min; $length < strlen($string); $length++) {
// get the substring in question,
// but search for natural words (trim)
$substring = trim(substr($string, $start, $length));
// if substring was not counted yet,
// add the found count to the hits
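if (!in_array($substring, $foundStrings)) {
// remember the substring so it is only counted once
$foundStrings[] = $substring;
// preg_quote() keeps characters such as "." or "/" from being read as regex syntax
preg_match_all('/' . preg_quote($substring, '/') . '/i', $string, $matches);
$hits[$substring] = count($matches[0]);
}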
}
}
// sort the hits array desc by number of hits
arsort($hits);
// remove substring hits with hits less that threshhold
foreach ($hits as $substring => $count) {
if ($count < $threshhold) {
unset($hits[$substring]);
}
}
print_r($hits);
}
search($string);
?>
The comments and variable names should make the code explain itself. In your case, $string would come from a file you read. This example would output:
searching 'book bookshelf booking foobar bar booking selfservice
bookingservice cooking' for string occurances of length 4 - 16:
Array
(
[ook] => 6
[book] => 5
[boo] => 5
[bookin] => 3
[booking] => 3
[booki] => 3
[elf] => 2
)
Let me know how you implement it :)
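If the input really is a list of separate words, as in the question, a per-word count may be closer to the wanted "book count (5)" outcome than scanning one long string. The following is a sketch of that different approach, not the search() function above: common_substrings() and its parameters are hypothetical names, and it assumes PHP 7 or later and single-byte characters. For every substring of at least $minLen characters it counts how many words contain it.

```php
<?php
// Hypothetical helper: for every substring of at least $minLen characters,
// count how many words in the list contain it.
function common_substrings(array $words, int $minLen = 4, int $threshold = 2): array
{
    $counts = [];
    foreach ($words as $word) {
        $seen = []; // substrings already found in this word, so each word counts once
        $len = strlen($word);
        for ($start = 0; $start + $minLen <= $len; $start++) {
            for ($size = $minLen; $start + $size <= $len; $size++) {
                $seen[substr($word, $start, $size)] = true;
            }
        }
        foreach (array_keys($seen) as $sub) {
            $counts[$sub] = ($counts[$sub] ?? 0) + 1;
        }
    }
    // Keep substrings shared by at least $threshold words, most frequent first.
    $counts = array_filter($counts, function ($c) use ($threshold) {
        return $c >= $threshold;
    });
    arsort($counts);
    return $counts;
}

$result = common_substrings(['book', 'bookstore', 'bookworm', 'booking', 'book']);
print_r($result);
```

For these five words the only substring of four or more characters shared by at least two words is "book", with a count of 5.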
This is my first approximation: unfinished, untested, has at least 1 bug, and is written in Eiffel. Well, I am not going to do all the work for you.
deferred class
SUBSTRING_COUNT
feature
threshold : INTEGER_32 =5
biggest_starting_substring_length(a,b:STRING):INTEGER_32
deferred
end
biggest_starting_substring(a,b:STRING):STRING
do
Result := a.substring(0,biggest_starting_substring_length(a,b))
end
make_list_of_substrings(a,b:STRING)
local
index:INTEGER_32
this_one: STRING
do
from
a_index := b_index + 1
invariant
a_index >=0 and a_index <= a.count
until
a_index >= a.count
loop
this_one := biggest_starting_substring(a.substring (a_index, a.count-1),b)
if this_one.count > threshold then
list.extend (this_one)
end
variant
a.count - a_index
end
end -- biggest_substring
list : ARRAYED_LIST[STRING]
end
I am working on a Web Application that includes long listings of names. The client originally wanted to have the names split up into divs by letter so it is easy to jump to a particular name on the list.
Now, looking at the list, the client pointed out several letters that have only one or two names associated with them. He now wants to know if we can combine several consecutive letters if there are only a few names in each.
(Note that letters with no names are not displayed at all.)
What I do right now is have the database server return a sorted list, then keep a variable containing the current character. I run through the list of names, incrementing the character and printing the opening and closing div and ul tags as I get to each letter. I know how to adapt this code to combine some letters, however, the one thing I'm not sure about how to handle is whether a particular combination of letters is the best-possible one. In other words, say that I have:
A - 12 names
B - 2 names
C - 1 name
D - 1 name
E - 1 name
F - 23 names
I know how to end up with a group A-C and then have D by itself. What I'm looking for is an efficient way to realize that A should be by itself and then B-D should be together.
I am not really sure where to start looking at this.
If it makes any difference, this code will be used in a Kohana Framework module.
UPDATE 2012-04-04:
Here is a clarification of what I need:
Say the minimum number of items I want in a group is 30. Now say that letter A has 25 items, letters B, C, and D, have 10 items each, and letter E has 32 items. I want to leave A alone because it will be better to combine B+C+D. The simple way to combine them is A+B, C+D+E - which is not what I want.
In other words, I need the best fit that comes closest to the minimum per group.
If a letter contains more than 10 names, or whatever reasonable limit you set, do not combine it with the next one. However, if you start combining letters, you might have it run until 15 or so names are collected if you want, as long as no individual letter has more than 10. That's not a universal solution, but it's how I'd solve it.
I came up with this function using PHP.
It groups letters that, combined, have at least $ammount names.
function split_by_initials($names,$ammount,$tollerance = 0) {
$total = count($names);
foreach($names as $name) {
$filtered[$name[0]][] = $name;
}
$count = 0;
$key = '';
$temp = array();
foreach ($filtered as $initial => $split) {
$count += count($split);
$temp = array_merge($split,$temp);
$key .= $initial.'-';
if ($count >= $ammount || $count >= $ammount - $tollerance) {
$result[$key] = $temp;
$count = 0;
$key = '';
$temp = array();
}
}
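if ($temp) { // keep any trailing letters that never reached the threshold
$result[$key] = $temp;
}
return $result;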
}
The third parameter is for when you want to leave a letter in a group of its own even though it doesn't reach the specified $ammount but is close enough.
For example, say you want to split into groups of 30, but A has 25.
If you set a $tollerance of 5, A will be left alone and the other letters will be grouped.
I forgot to mention that it returns a multidimensional array keyed by the letters each group contains, with the names in that group as the value.
Something like
Array
(
[A-B-C-] => Array
(
[0] => Bandice Bergen
[1] => Arey Lowell
[2] => Carmen Miranda
)
)
It is not exactly what you needed but i think it's close enough.
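The function above works greedily from left to right, so it cannot look ahead to see that a later grouping would fit better. One way to get the "best fit" asked for in the question is a small dynamic program over the per-letter counts. The sketch below is an assumption-laden illustration rather than a drop-in replacement: group_letters() is a hypothetical name, it takes letter => count pairs instead of the names themselves, and it defines "best" as the smallest sum of squared differences between each group's size and the target.

```php
<?php
// Hypothetical helper: split consecutive letters into groups whose sizes
// stay as close as possible to $target (smallest sum of squared deviations).
function group_letters(array $counts, int $target): array
{
    $letters = array_keys($counts);
    $sizes = array_values($counts);
    $n = count($sizes);

    // $best[$i] = minimal cost for the first $i letters;
    // $cut[$i] = index where the last group of that solution starts.
    $best = array_fill(0, $n + 1, INF);
    $cut = array_fill(0, $n + 1, 0);
    $best[0] = 0;

    for ($end = 1; $end <= $n; $end++) {
        $sum = 0;
        for ($start = $end - 1; $start >= 0; $start--) {
            $sum += $sizes[$start]; // size of the group covering letters $start..$end-1
            $cost = $best[$start] + ($sum - $target) ** 2;
            if ($cost < $best[$end]) {
                $best[$end] = $cost;
                $cut[$end] = $start;
            }
        }
    }

    // Walk the cut points backwards to recover the groups.
    $groups = [];
    for ($end = $n; $end > 0; $end = $cut[$end]) {
        $slice = array_slice($letters, $cut[$end], $end - $cut[$end]);
        array_unshift($groups, implode('-', $slice));
    }
    return $groups;
}

print_r(group_letters(['A' => 25, 'B' => 10, 'C' => 10, 'D' => 10, 'E' => 32], 30));
```

For the counts from the question's update (A = 25, B = C = D = 10, E = 32, target 30) this returns the groups A, B-C-D, and E. The squared penalty is a design choice; a different cost function would change which splits count as best.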
Using the jsfiddle that mrsherman put, I came up with something that could work: http://jsfiddle.net/F2Ahh/
Obviously that is to be used as a pseudocode, some techniques to make it more efficient could be applied. But that gets the job done.
JavaScript version: enhanced version with sort and symbols grouping
function group_by_initials(names,ammount,tollerance) {
tollerance = tollerance || 0; // default the parameter itself; the check further down reads tollerance
total = names.length;
var filtered={}
var result={};
$.each(names,function(key,value){
val=value.trim();
var pattern = /[a-zA-Z0-9&_\.-]/
if(val[0].match(pattern)) {
intial=val[0];
}
else
{
intial='sym';
}
if(!(intial in filtered))
filtered[intial]=[];
filtered[intial].push(val);
})
var count = 0;
var key = '';
var temp = [];
$.each(Object.keys(filtered).sort(),function(ky,value){
count += filtered[value].length;
temp = temp.concat(filtered[value])
key += value+'-';
if (count >= ammount || count >= ammount - tollerance) {
key = key.substring(0, key.length - 1);
result[key] = temp;
count = 0;
key = '';
temp = [];
}
})
return result;
}
I have a list of phrases and I want to know which two words occurred the most often in all of my phrases.
I tried playing with regex and other codes and I just cannot find the right way to do this.
Can anyone help?
eg:
I am purchasing a wallet
a wallet for 20$
purchasing a bag
I'd know that
a wallet occurred 2 times
purchasing a occurred 2 times
<?php
$string = "I am purchasing a wallet a wallet for 20$ purchasing a bag";
//split string into words
$words = explode(' ', $string);
//make chunks block ie [0,1][2,3]...
$chunks = array_chunk($words, 2);
//remove first array element
unset($words[0]);
//make chunks block ie [0,1][2,3]...
//but since first element is removed , the real block will be [1,2][3,4]...
$alternateChunks = array_chunk($words, 2);
//merge both chunks
$totalChunks = array_merge($chunks,$alternateChunks);
$finalChunks = array();
foreach($totalChunks as $t)
{
//join each chunk into a phrase using +
//+ can be replaced with a space if needed
//+ is used instead of white space to keep the associative keys working
$finalChunks[] = implode('+', $t);
}
//count the words inside array
$result = array_count_values($finalChunks);
echo "<pre>";
print_r($result);
I hesitate to suggest this, as it's an extremely brute force way to go about it:
Take your string of words, explode it using the explode(" ", $string); command, then run it through a for loop checking every two word combination against every two words in the string.
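$string = "I am purchasing a wallet a wallet for 20$ purchasing a bag";
$words = explode(" ", $string);
$count = array();
$last = count($words) - 1; // the last word cannot start a pair
for ($t = 0; $t < $last; $t++)
{
    $pair = $words[$t] . " " . $words[$t+1];
    // recount from scratch each time this pair comes up
    $count[$pair] = 0;
    for ($i = 0; $i < $last; $i++)
    {
        if ($pair == $words[$i] . " " . $words[$i+1]) { $count[$pair]++; }
    }
}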
So the nested for loop grabs the first two words, compares them to every set of two consecutive words, then grabs the next two words and does it again. Every pair will have a count of at least 1 (it always matches itself), and sorting the resulting array by count (for example with arsort) will give you the most repeated pairs.
Note that this will run (n-1)*(n-1) iterations, which could get unwieldy FAST.
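For comparison, the quadratic comparison can be avoided by tallying each pair in a lookup table during a single pass over the words. The following is only a rough sketch of that idea, written in Python rather than PHP; the function name count_pairs is made up for illustration and is not taken from any answer in this thread.

```python
def count_pairs(words):
    """Count each pair of consecutive words in one pass over the list."""
    counts = {}
    # zip(words, words[1:]) yields (word, next_word) for every adjacent pair
    for first, second in zip(words, words[1:]):
        pair = first + " " + second
        counts[pair] = counts.get(pair, 0) + 1
    return counts


words = "I am purchasing a wallet a wallet for 20$ purchasing a bag".split(" ")
counts = count_pairs(words)
# Sort the pairs by count, highest first
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

With n words this looks at each of the n-1 pairs once, instead of comparing every pair against every other pair.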
Place them all into an array, and access them by the current word index and next word index.
I think this should do the trick. It will grab pairs of words, unless you are at the end of the string, where you'll get only one word.
$str = "I purchased a wallet because I wanted a wallet a wallet a wallet";
$words = explode(" ", $str);
$array_results = array();
for ($i = 0; $i < count($words); $i++) {
    if ($i < count($words) - 1) {
        $pair = $words[$i] . " " . $words[$i+1];
        // Have to check if the key is in use yet to avoid a notice
        $array_results[$pair] = isset($array_results[$pair]) ? $array_results[$pair] + 1 : 1;
    }
    // At the end of the array, just use a single word
    else $array_results[$words[$i]] = isset($array_results[$words[$i]]) ? $array_results[$words[$i]] + 1 : 1;
}
// Sort the results, lowest count first
// use arsort() instead to get the highest first
asort($array_results);
print_r($array_results);
// Prints:
Array
(
[I wanted] => 1
[wanted a] => 1
[wallet] => 1
[because I] => 1
[wallet because] => 1
[I purchased] => 1
[purchased a] => 1
[wallet a] => 2
[a wallet] => 4
)
Update: changed ++ to +1 above, since ++ wasn't working when tested.
Try splitting the text into an array with explode, building the consecutive word pairs, and counting them with array_count_values.
<?php
$text = "whatever";
$text_array = explode( ' ', $text);
$double_words = array();
for($c = 1; $c < count($text_array); $c++)
{
$double_words[] = $text_array[$c -1] . ' ' . $text_array[$c];
}
$result = array_count_values($double_words);
?>
I've now updated it to a two-word version. Does this work for you? With the question's three phrases joined into one string as $text, var_dump($result) gives:
array(9) {
["I am"]=> int(1)
["am purchasing"]=> int(1)
["purchasing a"]=> int(2)
["a wallet"]=> int(2)
["wallet a"]=> int(1)
["wallet for"]=> int(1)
["for 20$"]=> int(1)
["20$ purchasing"]=> int(1)
["a bag"]=> int(1)
}
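The output above includes pairs such as "wallet a" and "20$ purchasing", which only exist because the separate phrases were joined into one string. If the phrases are kept as a list, the pairs can be built per phrase so that none spans a phrase boundary. Here is a minimal sketch of that variation in Python (not PHP); the function name count_pairs_per_phrase is hypothetical, and splitting on whitespace is an assumption about how the phrases are delimited.

```python
from collections import Counter


def count_pairs_per_phrase(phrases):
    """Tally consecutive word pairs, never pairing words across phrases."""
    counts = Counter()
    for phrase in phrases:
        words = phrase.split()  # assumption: words are separated by whitespace
        # Pair each word with its successor within this phrase only
        counts.update(a + " " + b for a, b in zip(words, words[1:]))
    return counts


phrases = [
    "I am purchasing a wallet",
    "a wallet for 20$",
    "purchasing a bag",
]
pair_counts = count_pairs_per_phrase(phrases)
# The two most frequent pairs, as (pair, count) tuples
top_two = pair_counts.most_common(2)
```

For the three example phrases this leaves seven distinct pairs, with "a wallet" and "purchasing a" each counted twice, which matches what the question asks for.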
Since you used the excel tag, I thought I'd give it a shot, and it's actually really easy.
1. Split the string using space as the delimiter. Data > Text to Columns... > Delimited > Delimiter: Space. Each word is now in its own cell.
2. Transpose the result (not strictly required but much easier to visualize). Copy, Edit > Paste Special... > Transpose.
3. Make cells containing consecutive word pairs. So if your words are in cells B5:B15, cell C5 should be =B5&" "&B6 (and drag down).
4. Count the occurrences of each word pair: In cell D5, =COUNTIF($C$5:$C$15,"="&C5), drag down.
5. Highlight the winner(s). Select C5:D15, Format > Conditional Formatting... > Formula Is =$D5=MAX($D$5:$D$15) and choose e.g. a yellow background.
Note that there is some inefficiency in step 4 because the count of each word pair will be calculated multiple times if that word pair occurs multiple times. If this is a concern, then you can first make a list of unique word pairs using Data > Filter > Advanced Filter... > Unique records only.
An automated VBA solution could easily be crafted by recording a macro of the above followed by some minor editing.
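Outside Excel, the highlighting step amounts to keeping every pair whose count equals the maximum, so that ties are all reported. As a rough illustration in Python (not VBA), assuming the pair counts have already been tallied into a dictionary; the function name top_pairs and the sample tallies are made up for this sketch.

```python
def top_pairs(counts):
    """Return every pair whose count equals the maximum, sorted by name."""
    if not counts:
        return []
    best = max(counts.values())
    return sorted(pair for pair, n in counts.items() if n == best)


# A hand-typed subset of the tallies for the question's example phrases
counts = {"I am": 1, "purchasing a": 2, "a wallet": 2, "a bag": 1}
winners = top_pairs(counts)
```

Here both "a wallet" and "purchasing a" come back as winners, just as both rows would be highlighted by the conditional formatting rule.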
One way to go about it is to use Split or a regex to break the sentences into words and store them in an array. Then loop over the array and add each word to a dictionary object. If a term is already in the dictionary, add 1 to its .Item value to tally the count.
Here is some example code (far from perfect, as it's just to show the overall concept) that takes all the strings in column A and generates a word frequency list in columns B and C. It's not exactly what you want, but I hope it gives you some ideas on how to go about it:
Sub FrequencyList()
Dim vArray As Variant
Dim myDict As Variant
Set myDict = CreateObject("Scripting.Dictionary")
Dim i As Long
Dim cell As range
With myDict
For Each cell In range("A1", cells(Rows.count, "A").End(xlUp))
vArray = Split(cell.Value, " ")
For i = LBound(vArray) To UBound(vArray)
If Not .exists(vArray(i)) Then
.Add vArray(i), 1
Else
.Item(vArray(i)) = .Item(vArray(i)) + 1
End If
Next
Next
range("B1").Resize(.count).Value = Application.Transpose(.keys)
range("C1").Resize(.count).Value = Application.Transpose(.items)
End With
End Sub