pronounceability algorithm - php

I am struggling to find/create an algorithm that can determine the pronounceability of random 5 letter combinations.
The closest thing I've found so far is from this 3 year old StackOverflow thread:
Measure the pronounceability of a word?
<?php
// Score: 1
echo pronounceability('namelet') . "\n";
// Score: 0.71428571428571
echo pronounceability('nameoic') . "\n";
function pronounceability($word) {
static $vowels = array
(
'a',
'e',
'i',
'o',
'u',
'y'
);
static $composites = array
(
'mm',
'll',
'th',
'ing'
);
if (!is_string($word)) return false;
// Remove non letters and put in lowercase
$word = preg_replace('/[^a-z]/i', '', $word);
$word = strtolower($word);
// Special case
if ($word == 'a') return 1;
$len = strlen($word);
// Let's not parse an empty string
if ($len == 0) return 0;
$score = 0;
$pos = 0;
while ($pos < $len) {
// Check if is allowed composites
foreach ($composites as $comp) {
$complen = strlen($comp);
if (($pos + $complen) < $len) {
$check = substr($word, $pos, $complen);
if ($check == $comp) {
$score += $complen;
$pos += $complen;
continue 2;
}
}
}
// Is it a vowel? If so, check if previous wasn't a vowel too.
if (in_array($word[$pos], $vowels)) {
if (($pos - 1) >= 0 && !in_array($word[$pos - 1], $vowels)) {
$score += 1;
$pos += 1;
continue;
}
} else { // Not a vowel, check if next one is, or if is end of word
if (($pos + 1) < $len && in_array($word[$pos + 1], $vowels)) {
$score += 2;
$pos += 2;
continue;
} elseif (($pos + 1) == $len) {
$score += 1;
break;
}
}
$pos += 1;
}
return $score / $len;
}
?>
... but it is far from perfect, giving some rather strange false positives:
Using this function, all of the following rate as pronounceable, (above 7/10)
ZTEDA
LLFDA
MMGDA
THHDA
RTHDA
XYHDA
VQIDA
Can someone smarter than me tweek this algorithm perhaps so that:
'MM', 'LL', and 'TH' are only valid when followed or preceeded by a
vowel?
3 or more consonants in a row is a no-no, (except when the first or
last is an 'R' or 'L')
any other refinements you can think of...
(I have done a fair amount of research/googling, and this seems to be the main pronounceability function that everyone has been referencing/using for the last 3 years, so I'm sure an updated, more refined version would be appreciated by the wider community, not just me!).

Based on a suggestion on the linked question to "Use a Markov model on letters"
Use a Markov model (on letters, not words, of course). The probability of a word is a pretty good proxy for ease of pronunciation.
I thought I would try it out and had some success.
My Methodology
I copied a list of real 5-letter words into a file to serve as my dataset (here...um, actually here).
Then I use a Hidden Markov model (based on One-grams, Bi-grams, and Tri-grams) to predict how likely a target word would appear in that dataset.
(Better results could be achieved with some sort of phonetic transcription as one of the steps.)
First, I calculate the probabilities of character sequences in the dataset.
For example, if 'A' occurs 50 times, and there is only 250 characters in the dataset, then 'A' has a 50/250 or .2 probability.
Do the same for the bigrams 'AB', 'AC', ...
Do the same for the trigrams 'ABC', 'ABD', ...
Basically, my score for the word "ABCDE" is composed of:
prob( 'A' )
prob( 'B' )
prob( 'C' )
prob( 'D' )
prob( 'E' )
prob( 'AB' )
prob( 'BC' )
prob( 'CD' )
prob( 'DE' )
prob( 'ABC' )
prob( 'BCD' )
prob( 'CDE' )
You could multiply all of these together to get the estimated probability of the target word appearing in the dataset, (but that is very small).
So instead, we take the logs of each and add them together.
Now we have a score which estimates how likely our target word would appear in the dataset.
My code
I have coded this is C#, and find that a score greater than negative 160 is pretty good.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace Pronouncability
{
class Program
{
public static char[] alphabet = new char[]{ 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z' };
public static List<string> wordList = loadWordList(); //Dataset of 5-letter words
public static Random rand = new Random();
public const double SCORE_LIMIT = -160.00;
/// <summary>
/// Generates random words, until 100 of them are better than
/// the SCORE_LIMIT based on a statistical score.
/// </summary>
public static void Main(string[] args)
{
Dictionary<Tuple<char, char, char>, int> trigramCounts = new Dictionary<Tuple<char, char, char>, int>();
Dictionary<Tuple<char, char>, int> bigramCounts = new Dictionary<Tuple<char, char>, int>();
Dictionary<char, int> onegramCounts = new Dictionary<char, int>();
calculateProbabilities(onegramCounts, bigramCounts, trigramCounts);
double totalTrigrams = (double)trigramCounts.Values.Sum();
double totalBigrams = (double)bigramCounts.Values.Sum();
double totalOnegrams = (double)onegramCounts.Values.Sum();
SortedList<double, string> randomWordsScores = new SortedList<double, string>();
while( randomWordsScores.Count < 100 )
{
string randStr = getRandomWord();
if (!randomWordsScores.ContainsValue(randStr))
{
double score = getLikelyhood(randStr,trigramCounts, bigramCounts, onegramCounts, totalTrigrams, totalBigrams, totalOnegrams);
if (score > SCORE_LIMIT)
{
randomWordsScores.Add(score, randStr);
}
}
}
//Right now randomWordsScores contains 100 random words which have
//a better score than the SCORE_LIMIT, sorted from worst to best.
}
/// <summary>
/// Generates a random 5-letter word
/// </summary>
public static string getRandomWord()
{
char c0 = (char)rand.Next(65, 90);
char c1 = (char)rand.Next(65, 90);
char c2 = (char)rand.Next(65, 90);
char c3 = (char)rand.Next(65, 90);
char c4 = (char)rand.Next(65, 90);
return "" + c0 + c1 + c2 + c3 + c4;
}
/// <summary>
/// Returns a score for how likely a given word is, based on given trigrams, bigrams, and one-grams
/// </summary>
public static double getLikelyhood(string wordToScore, Dictionary<Tuple<char, char,char>, int> trigramCounts, Dictionary<Tuple<char, char>, int> bigramCounts, Dictionary<char, int> onegramCounts, double totalTrigrams, double totalBigrams, double totalOnegrams)
{
wordToScore = wordToScore.ToUpper();
char[] letters = wordToScore.ToCharArray();
Tuple<char, char>[] bigrams = new Tuple<char, char>[]{
new Tuple<char,char>( wordToScore[0], wordToScore[1] ),
new Tuple<char,char>( wordToScore[1], wordToScore[2] ),
new Tuple<char,char>( wordToScore[2], wordToScore[3] ),
new Tuple<char,char>( wordToScore[3], wordToScore[4] )
};
Tuple<char, char, char>[] trigrams = new Tuple<char, char, char>[]{
new Tuple<char,char,char>( wordToScore[0], wordToScore[1], wordToScore[2] ),
new Tuple<char,char,char>( wordToScore[1], wordToScore[2], wordToScore[3] ),
new Tuple<char,char,char>( wordToScore[2], wordToScore[3], wordToScore[4] ),
};
double score = 0;
foreach (char c in letters)
{
score += Math.Log((((double)onegramCounts[c]) / totalOnegrams));
}
foreach (Tuple<char, char> pair in bigrams)
{
score += Math.Log((((double)bigramCounts[pair]) / totalBigrams));
}
foreach (Tuple<char, char, char> trio in trigrams)
{
score += 5.0*Math.Log((((double)trigramCounts[trio]) / totalTrigrams));
}
return score;
}
/// <summary>
/// Build the probability tables based on the dataset (WordList)
/// </summary>
public static void calculateProbabilities(Dictionary<char, int> onegramCounts, Dictionary<Tuple<char, char>, int> bigramCounts, Dictionary<Tuple<char, char, char>, int> trigramCounts)
{
foreach (char c1 in alphabet)
{
foreach (char c2 in alphabet)
{
foreach( char c3 in alphabet)
{
trigramCounts[new Tuple<char, char, char>(c1, c2, c3)] = 1;
}
}
}
foreach( char c1 in alphabet)
{
foreach( char c2 in alphabet)
{
bigramCounts[ new Tuple<char,char>(c1,c2) ] = 1;
}
}
foreach (char c1 in alphabet)
{
onegramCounts[c1] = 1;
}
foreach (string word in wordList)
{
for (int pos = 0; pos < 3; pos++)
{
trigramCounts[new Tuple<char, char, char>(word[pos], word[pos + 1], word[pos + 2])]++;
}
for (int pos = 0; pos < 4; pos++)
{
bigramCounts[new Tuple<char, char>(word[pos], word[pos + 1])]++;
}
for (int pos = 0; pos < 5; pos++)
{
onegramCounts[word[pos]]++;
}
}
}
/// <summary>
/// Get the dataset (WordList) from file.
/// </summary>
public static List<string> loadWordList()
{
string filePath = "WordList.txt";
string text = File.ReadAllText(filePath);
List<string> result = text.Split(' ').ToList();
return result;
}
}
}
In my example, I scale the trigram probabilities by 5.
I also add one to all of the counts, so we don't multiply by zero.
Final notes
I'm not a php programmer, but the technique is pretty easy to implement.
Play around with some scaling factors, try different datasets, or add in some other checks like what you suggested above.

How about generating a reasonably pronounceable combination from the start? I have done something where I generate a random Soundex code, and work back from that to a (usually) pronounceable original.

If anyone's looking for a way to do this with Node.js, I found a module called pronouncable that seems to implement what Xantix's answer describes.
npm i pronounceable
You can test in without installing anything on RunKit.

Related

I need to find a smallest section of a string that contains all of the characters I'm searching for [duplicate]

I have recently come across an interesting question on strings. Suppose you are given following:
Input string1: "this is a test string"
Input string2: "tist"
Output string: "t stri"
So, given above, how can I approach towards finding smallest substring of string1 that contains all the characters from string 2?
To see more details including working code, check my blog post at:
http://www.leetcode.com/2010/11/finding-minimum-window-in-s-which.html
To help illustrate this approach, I use an example: string1 = "acbbaca" and string2 = "aba". Here, we also use the term "window", which means a contiguous block of characters from string1 (could be interchanged with the term substring).
i) string1 = "acbbaca" and string2 = "aba".
ii) The first minimum window is found.
Notice that we cannot advance begin
pointer as hasFound['a'] ==
needToFind['a'] == 2. Advancing would
mean breaking the constraint.
iii) The second window is found. begin
pointer still points to the first
element 'a'. hasFound['a'] (3) is
greater than needToFind['a'] (2). We
decrement hasFound['a'] by one and
advance begin pointer to the right.
iv) We skip 'c' since it is not found
in string2. Begin pointer now points to 'b'.
hasFound['b'] (2) is greater than
needToFind['b'] (1). We decrement
hasFound['b'] by one and advance begin
pointer to the right.
v) Begin pointer now points to the
next 'b'. hasFound['b'] (1) is equal
to needToFind['b'] (1). We stop
immediately and this is our newly
found minimum window.
The idea is mainly based on the help of two pointers (begin and end position of the window) and two tables (needToFind and hasFound) while traversing string1. needToFind stores the total count of a character in string2 and hasFound stores the total count of a character met so far. We also use a count variable to store the total characters in string2 that's met so far (not counting characters where hasFound[x] exceeds needToFind[x]). When count equals string2's length, we know a valid window is found.
Each time we advance the end pointer (pointing to an element x), we increment hasFound[x] by one. We also increment count by one if hasFound[x] is less than or equal to needToFind[x]. Why? When the constraint is met (that is, count equals to string2's size), we immediately advance begin pointer as far right as possible while maintaining the constraint.
How do we check if it is maintaining the constraint? Assume that begin points to an element x, we check if hasFound[x] is greater than needToFind[x]. If it is, we can decrement hasFound[x] by one and advancing begin pointer without breaking the constraint. On the other hand, if it is not, we stop immediately as advancing begin pointer breaks the window constraint.
Finally, we check if the minimum window length is less than the current minimum. Update the current minimum if a new minimum is found.
Essentially, the algorithm finds the first window that satisfies the constraint, then continue maintaining the constraint throughout.
You can do a histogram sweep in O(N+M) time and O(1) space where N is the number of characters in the first string and M is the number of characters in the second.
It works like this:
Make a histogram of the second string's characters (key operation is hist2[ s2[i] ]++).
Make a cumulative histogram of the first string's characters until that histogram contains every character that the second string's histogram contains (which I will call "the histogram condition").
Then move forwards on the first string, subtracting from the histogram, until it fails to meet the histogram condition. Mark that bit of the first string (before the final move) as your tentative substring.
Move the front of the substring forwards again until you meet the histogram condition again. Move the end forwards until it fails again. If this is a shorter substring than the first, mark that as your tentative substring.
Repeat until you've passed through the entire first string.
The marked substring is your answer.
Note that by varying the check you use on the histogram condition, you can choose either to have the same set of characters as the second string, or at least as many characters of each type. (Its just the difference between a[i]>0 && b[i]>0 and a[i]>=b[i].)
You can speed up the histogram checks if you keep a track of which condition is not satisfied when you're trying to satisfy it, and checking only the thing that you decrement when you're trying to break it. (On the initial buildup, you count how many items you've satisfied, and increment that count every time you add a new character that takes the condition from false to true.)
Here's an O(n) solution. The basic idea is simple: for each starting index, find the least ending index such that the substring contains all of the necessary letters. The trick is that the least ending index increases over the course of the function, so with a little data structure support, we consider each character at most twice.
In Python:
from collections import defaultdict
def smallest(s1, s2):
assert s2 != ''
d = defaultdict(int)
nneg = [0] # number of negative entries in d
def incr(c):
d[c] += 1
if d[c] == 0:
nneg[0] -= 1
def decr(c):
if d[c] == 0:
nneg[0] += 1
d[c] -= 1
for c in s2:
decr(c)
minlen = len(s1) + 1
j = 0
for i in xrange(len(s1)):
while nneg[0] > 0:
if j >= len(s1):
return minlen
incr(s1[j])
j += 1
minlen = min(minlen, j - i)
decr(s1[i])
return minlen
I received the same interview question. I am a C++ candidate but I was in a position to code relatively fast in JAVA.
Java [Courtesy : Sumod Mathilakath]
import java.io.*;
import java.util.*;
class UserMainCode
{
public String GetSubString(String input1,String input2){
// Write code here...
return find(input1, input2);
}
private static boolean containsPatternChar(int[] sCount, int[] pCount) {
for(int i=0;i<256;i++) {
if(pCount[i]>sCount[i])
return false;
}
return true;
}
public static String find(String s, String p) {
if (p.length() > s.length())
return null;
int[] pCount = new int[256];
int[] sCount = new int[256];
// Time: O(p.lenght)
for(int i=0;i<p.length();i++) {
pCount[(int)(p.charAt(i))]++;
sCount[(int)(s.charAt(i))]++;
}
int i = 0, j = p.length(), min = Integer.MAX_VALUE;
String res = null;
// Time: O(s.lenght)
while (j < s.length()) {
if (containsPatternChar(sCount, pCount)) {
if ((j - i) < min) {
min = j - i;
res = s.substring(i, j);
// This is the smallest possible substring.
if(min==p.length())
break;
// Reduce the window size.
sCount[(int)(s.charAt(i))]--;
i++;
}
} else {
sCount[(int)(s.charAt(j))]++;
// Increase the window size.
j++;
}
}
System.out.println(res);
return res;
}
}
C++ [Courtesy : sundeepblue]
#include <iostream>
#include <vector>
#include <string>
#include <climits>
using namespace std;
string find_minimum_window(string s, string t) {
if(s.empty() || t.empty()) return;
int ns = s.size(), nt = t.size();
vector<int> total(256, 0);
vector<int> sofar(256, 0);
for(int i=0; i<nt; i++)
total[t[i]]++;
int L = 0, R;
int minL = 0; //gist2
int count = 0;
int min_win_len = INT_MAX;
for(R=0; R<ns; R++) { // gist0, a big for loop
if(total[s[R]] == 0) continue;
else sofar[s[R]]++;
if(sofar[s[R]] <= total[s[R]]) // gist1, <= not <
count++;
if(count == nt) { // POS1
while(true) {
char c = s[L];
if(total[c] == 0) { L++; }
else if(sofar[c] > total[c]) {
sofar[c]--;
L++;
}
else break;
}
if(R - L + 1 < min_win_len) { // this judge should be inside POS1
min_win_len = R - L + 1;
minL = L;
}
}
}
string res;
if(count == nt) // gist3, cannot forget this.
res = s.substr(minL, min_win_len); // gist4, start from "minL" not "L"
return res;
}
int main() {
string s = "abdccdedca";
cout << find_minimum_window(s, "acd");
}
Erlang [Courtesy : wardbekker]
-module(leetcode).
-export([min_window/0]).
%% Given a string S and a string T, find the minimum window in S which will contain all the characters in T in complexity O(n).
%% For example,
%% S = "ADOBECODEBANC"
%% T = "ABC"
%% Minimum window is "BANC".
%% Note:
%% If there is no such window in S that covers all characters in T, return the emtpy string "".
%% If there are multiple such windows, you are guaranteed that there will always be only one unique minimum window in S.
min_window() ->
"eca" = min_window("cabeca", "cae"),
"eca" = min_window("cfabeca", "cae"),
"aec" = min_window("cabefgecdaecf", "cae"),
"cwae" = min_window("cabwefgewcwaefcf", "cae"),
"BANC" = min_window("ADOBECODEBANC", "ABC"),
ok.
min_window(T, S) ->
min_window(T, S, []).
min_window([], _T, MinWindow) ->
MinWindow;
min_window([H | Rest], T, MinWindow) ->
NewMinWindow = case lists:member(H, T) of
true ->
MinWindowFound = fullfill_window(Rest, lists:delete(H, T), [H]),
case length(MinWindow) == 0 orelse (length(MinWindow) > length(MinWindowFound)
andalso length(MinWindowFound) > 0) of
true ->
MinWindowFound;
false ->
MinWindow
end;
false ->
MinWindow
end,
min_window(Rest, T, NewMinWindow).
fullfill_window(_, [], Acc) ->
%% window completed
Acc;
fullfill_window([], _T, _Acc) ->
%% no window found
"";
fullfill_window([H | Rest], T, Acc) ->
%% completing window
case lists:member(H, T) of
true ->
fullfill_window(Rest, lists:delete(H, T), Acc ++ [H]);
false ->
fullfill_window(Rest, T, Acc ++ [H])
end.
REF:
http://articles.leetcode.com/finding-minimum-window-in-s-which/#comment-511216
http://www.mif.vu.lt/~valdas/ALGORITMAI/LITERATURA/Cormen/Cormen.pdf
Please have a look at this as well:
//-----------------------------------------------------------------------
bool IsInSet(char ch, char* cSet)
{
char* cSetptr = cSet;
int index = 0;
while (*(cSet+ index) != '\0')
{
if(ch == *(cSet+ index))
{
return true;
}
++index;
}
return false;
}
void removeChar(char ch, char* cSet)
{
bool bShift = false;
int index = 0;
while (*(cSet + index) != '\0')
{
if( (ch == *(cSet + index)) || bShift)
{
*(cSet + index) = *(cSet + index + 1);
bShift = true;
}
++index;
}
}
typedef struct subStr
{
short iStart;
short iEnd;
short szStr;
}ss;
char* subStringSmallest(char* testStr, char* cSet)
{
char* subString = NULL;
int iSzSet = strlen(cSet) + 1;
int iSzString = strlen(testStr)+ 1;
char* cSetBackUp = new char[iSzSet];
memcpy((void*)cSetBackUp, (void*)cSet, iSzSet);
int iStartIndx = -1;
int iEndIndx = -1;
int iIndexStartNext = -1;
std::vector<ss> subStrVec;
int index = 0;
while( *(testStr+index) != '\0' )
{
if (IsInSet(*(testStr+index), cSetBackUp))
{
removeChar(*(testStr+index), cSetBackUp);
if(iStartIndx < 0)
{
iStartIndx = index;
}
else if( iIndexStartNext < 0)
iIndexStartNext = index;
else
;
if (strlen(cSetBackUp) == 0 )
{
iEndIndx = index;
if( iIndexStartNext == -1)
break;
else
{
index = iIndexStartNext;
ss stemp = {iStartIndx, iEndIndx, (iEndIndx-iStartIndx + 1)};
subStrVec.push_back(stemp);
iStartIndx = iEndIndx = iIndexStartNext = -1;
memcpy((void*)cSetBackUp, (void*)cSet, iSzSet);
continue;
}
}
}
else
{
if (IsInSet(*(testStr+index), cSet))
{
if(iIndexStartNext < 0)
iIndexStartNext = index;
}
}
++index;
}
int indexSmallest = 0;
for(int indexVec = 0; indexVec < subStrVec.size(); ++indexVec)
{
if(subStrVec[indexSmallest].szStr > subStrVec[indexVec].szStr)
indexSmallest = indexVec;
}
subString = new char[(subStrVec[indexSmallest].szStr) + 1];
memcpy((void*)subString, (void*)(testStr+ subStrVec[indexSmallest].iStart), subStrVec[indexSmallest].szStr);
memset((void*)(subString + subStrVec[indexSmallest].szStr), 0, 1);
delete[] cSetBackUp;
return subString;
}
//--------------------------------------------------------------------
Edit: apparently there's an O(n) algorithm (cf. algorithmist's answer). Obviously this have this will beat the [naive] baseline described below!
Too bad I gotta go... I'm a bit suspicious that we can get O(n). I'll check in tomorrow to see the winner ;-) Have fun!
Tentative algorithm:
The general idea is to sequentially try and use a character from str2 found in str1 as the start of a search (in either/both directions) of all the other letters of str2. By keeping a "length of best match so far" value, we can abort searches when they exceed this. Other heuristics can probably be used to further abort suboptimal (so far) solutions. The choice of the order of the starting letters in str1 matters much; it is suggested to start with the letter(s) of str1 which have the lowest count and to try with the other letters, of an increasing count, in subsequent attempts.
[loose pseudo-code]
- get count for each letter/character in str1 (number of As, Bs etc.)
- get count for each letter in str2
- minLen = length(str1) + 1 (the +1 indicates you're not sure all chars of
str2 are in str1)
- Starting with the letter from string2 which is found the least in string1,
look for other letters of Str2, in either direction of str1, until you've
found them all (or not, at which case response = impossible => done!).
set x = length(corresponding substring of str1).
- if (x < minLen),
set minlen = x,
also memorize the start/len of the str1 substring.
- continue trying with other letters of str1 (going the up the frequency
list in str1), but abort search as soon as length(substring of strl)
reaches or exceed minLen.
We can find a few other heuristics that would allow aborting a
particular search, based on [pre-calculated ?] distance between a given
letter in str1 and some (all?) of the letters in str2.
- the overall search terminates when minLen = length(str2) or when
we've used all letters of str1 (which match one letter of str2)
as a starting point for the search
Here is Java implementation
public static String shortestSubstrContainingAllChars(String input, String target) {
int needToFind[] = new int[256];
int hasFound[] = new int[256];
int totalCharCount = 0;
String result = null;
char[] targetCharArray = target.toCharArray();
for (int i = 0; i < targetCharArray.length; i++) {
needToFind[targetCharArray[i]]++;
}
char[] inputCharArray = input.toCharArray();
for (int begin = 0, end = 0; end < inputCharArray.length; end++) {
if (needToFind[inputCharArray[end]] == 0) {
continue;
}
hasFound[inputCharArray[end]]++;
if (hasFound[inputCharArray[end]] <= needToFind[inputCharArray[end]]) {
totalCharCount ++;
}
if (totalCharCount == target.length()) {
while (needToFind[inputCharArray[begin]] == 0
|| hasFound[inputCharArray[begin]] > needToFind[inputCharArray[begin]]) {
if (hasFound[inputCharArray[begin]] > needToFind[inputCharArray[begin]]) {
hasFound[inputCharArray[begin]]--;
}
begin++;
}
String substring = input.substring(begin, end + 1);
if (result == null || result.length() > substring.length()) {
result = substring;
}
}
}
return result;
}
Here is the Junit Test
#Test
public void shortestSubstringContainingAllCharsTest() {
String result = StringUtil.shortestSubstrContainingAllChars("acbbaca", "aba");
assertThat(result, equalTo("baca"));
result = StringUtil.shortestSubstrContainingAllChars("acbbADOBECODEBANCaca", "ABC");
assertThat(result, equalTo("BANC"));
result = StringUtil.shortestSubstrContainingAllChars("this is a test string", "tist");
assertThat(result, equalTo("t stri"));
}
//[ShortestSubstring.java][1]
public class ShortestSubstring {
public static void main(String[] args) {
String input1 = "My name is Fran";
String input2 = "rim";
System.out.println(getShortestSubstring(input1, input2));
}
private static String getShortestSubstring(String mainString, String toBeSearched) {
int mainStringLength = mainString.length();
int toBeSearchedLength = toBeSearched.length();
if (toBeSearchedLength > mainStringLength) {
throw new IllegalArgumentException("search string cannot be larger than main string");
}
for (int j = 0; j < mainStringLength; j++) {
for (int i = 0; i <= mainStringLength - toBeSearchedLength; i++) {
String substring = mainString.substring(i, i + toBeSearchedLength);
if (checkIfMatchFound(substring, toBeSearched)) {
return substring;
}
}
toBeSearchedLength++;
}
return null;
}
private static boolean checkIfMatchFound(String substring, String toBeSearched) {
char[] charArraySubstring = substring.toCharArray();
char[] charArrayToBeSearched = toBeSearched.toCharArray();
int count = 0;
for (int i = 0; i < charArraySubstring.length; i++) {
for (int j = 0; j < charArrayToBeSearched.length; j++) {
if (String.valueOf(charArraySubstring[i]).equalsIgnoreCase(String.valueOf(charArrayToBeSearched[j]))) {
count++;
}
}
}
return count == charArrayToBeSearched.length;
}
}
This is an approach using prime numbers to avoid one loop, and replace it with multiplications. Several other minor optimizations can be made.
Assign a unique prime number to any of the characters that you want to find, and 1 to the uninteresting characters.
Find the product of a matching string by multiplying the prime number with the number of occurrences it should have. Now this product can only be found if the same prime factors are used.
Search the string from the beginning, multiplying the respective prime number as you move into a running product.
If the number is greater than the correct sum, remove the first character and divide its prime number out of your running product.
If the number is less than the correct sum, include the next character and multiply it into your running product.
If the number is the same as the correct sum you have found a match, slide beginning and end to next character and continue searching for other matches.
Decide which of the matches is the shortest.
Gist
charcount = { 'a': 3, 'b' : 1 };
str = "kjhdfsbabasdadaaaaasdkaaajbajerhhayeom"
def find (c, s):
Ns = len (s)
C = list (c.keys ())
D = list (c.values ())
# prime numbers assigned to the first 25 chars
prmsi = [ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89 , 97]
# primes used in the key, all other set to 1
prms = []
Cord = [ord(c) - ord('a') for c in C]
for e,p in enumerate(prmsi):
if e in Cord:
prms.append (p)
else:
prms.append (1)
# Product of match
T = 1
for c,d in zip(C,D):
p = prms[ord (c) - ord('a')]
T *= p**d
print ("T=", T)
t = 1 # product of current string
f = 0
i = 0
matches = []
mi = 0
mn = Ns
mm = 0
while i < Ns:
k = prms[ord(s[i]) - ord ('a')]
t *= k
print ("testing:", s[f:i+1])
if (t > T):
# included too many chars: move start
t /= prms[ord(s[f]) - ord('a')] # remove first char, usually division by 1
f += 1 # increment start position
t /= k # will be retested, could be replaced with bool
elif t == T:
# found match
print ("FOUND match:", s[f:i+1])
matches.append (s[f:i+1])
if (i - f) < mn:
mm = mi
mn = i - f
mi += 1
t /= prms[ord(s[f]) - ord('a')] # remove first matching char
# look for next match
i += 1
f += 1
else:
# no match yet, keep searching
i += 1
return (mm, matches)
print (find (charcount, str))
(note: this answer was originally posted to a duplicate question, the original answer is now deleted.)
C# Implementation:
public static Tuple<int, int> FindMinSubstringWindow(string input, string pattern)
{
Tuple<int, int> windowCoords = new Tuple<int, int>(0, input.Length - 1);
int[] patternHist = new int[256];
for (int i = 0; i < pattern.Length; i++)
{
patternHist[pattern[i]]++;
}
int[] inputHist = new int[256];
int minWindowLength = int.MaxValue;
int count = 0;
for (int begin = 0, end = 0; end < input.Length; end++)
{
// Skip what's not in pattern.
if (patternHist[input[end]] == 0)
{
continue;
}
inputHist[input[end]]++;
// Count letters that are in pattern.
if (inputHist[input[end]] <= patternHist[input[end]])
{
count++;
}
// Window found.
if (count == pattern.Length)
{
// Remove extra instances of letters from pattern
// or just letters that aren't part of the pattern
// from the beginning.
while (patternHist[input[begin]] == 0 ||
inputHist[input[begin]] > patternHist[input[begin]])
{
if (inputHist[input[begin]] > patternHist[input[begin]])
{
inputHist[input[begin]]--;
}
begin++;
}
// Current window found.
int windowLength = end - begin + 1;
if (windowLength < minWindowLength)
{
windowCoords = new Tuple<int, int>(begin, end);
minWindowLength = windowLength;
}
}
}
if (count == pattern.Length)
{
return windowCoords;
}
return null;
}
I've implemented it using Python3 at O(N) efficiency:
def get(s, alphabet="abc"):
seen = {}
for c in alphabet:
seen[c] = 0
seen[s[0]] = 1
start = 0
end = 0
shortest_s = 0
shortest_e = 99999
while end + 1 < len(s):
while seen[s[start]] > 1:
seen[s[start]] -= 1
start += 1
# Constant time check:
if sum(seen.values()) == len(alphabet) and all(v == 1 for v in seen.values()) and \
shortest_e - shortest_s > end - start:
shortest_s = start
shortest_e = end
end += 1
seen[s[end]] += 1
return s[shortest_s: shortest_e + 1]
print(get("abbcac")) # Expected to return "bca"
String s = "xyyzyzyx";
String s1 = "xyz";
String finalString ="";
Map<Character,Integer> hm = new HashMap<>();
if(s1!=null && s!=null && s.length()>s1.length()){
for(int i =0;i<s1.length();i++){
if(hm.get(s1.charAt(i))!=null){
int k = hm.get(s1.charAt(i))+1;
hm.put(s1.charAt(i), k);
}else
hm.put(s1.charAt(i), 1);
}
Map<Character,Integer> t = new HashMap<>();
int start =-1;
for(int j=0;j<s.length();j++){
if(hm.get(s.charAt(j))!=null){
if(t.get(s.charAt(j))!=null){
if(t.get(s.charAt(j))!=hm.get(s.charAt(j))){
int k = t.get(s.charAt(j))+1;
t.put(s.charAt(j), k);
}
}else{
t.put(s.charAt(j), 1);
if(start==-1){
if(j+s1.length()>s.length()){
break;
}
start = j;
}
}
if(hm.equals(t)){
t = new HashMap<>();
if(finalString.length()<s.substring(start,j+1).length());
{
finalString=s.substring(start,j+1);
}
j=start;
start=-1;
}
}
}
JavaScript solution in bruteforce way:
function shortestSubStringOfUniqueChars(s){
var uniqueArr = [];
for(let i=0; i<s.length; i++){
if(uniqueArr.indexOf(s.charAt(i)) <0){
uniqueArr.push(s.charAt(i));
}
}
let windoww = uniqueArr.length;
while(windoww < s.length){
for(let i=0; i<s.length - windoww; i++){
let match = true;
let tempArr = [];
for(let j=0; j<uniqueArr.length; j++){
if(uniqueArr.indexOf(s.charAt(i+j))<0){
match = false;
break;
}
}
let checkStr
if(match){
checkStr = s.substr(i, windoww);
for(let j=0; j<uniqueArr.length; j++){
if(uniqueArr.indexOf(checkStr.charAt(j))<0){
match = false;
break;
}
}
}
if(match){
return checkStr;
}
}
windoww = windoww + 1;
}
}
console.log(shortestSubStringOfUniqueChars("ABA"));
# Python implementation
s = input('Enter the string : ')
s1 = input('Enter the substring to search : ')
l = [] # List to record all the matching combinations
check = all([char in s for char in s1])
if check == True:
for i in range(len(s1),len(s)+1) :
for j in range(0,i+len(s1)+2):
if (i+j) < len(s)+1:
cnt = 0
b = all([char in s[j:i+j] for char in s1])
if (b == True) :
l.append(s[j:i+j])
print('The smallest substring containing',s1,'is',l[0])
else:
print('Please enter a valid substring')
Java code for the approach discussed above:
private static Map<Character, Integer> frequency;
private static Set<Character> charsCovered;
private static Map<Character, Integer> encountered;
/**
* To set the first match index as an intial start point
*/
private static boolean hasStarted = false;
private static int currentStartIndex = 0;
private static int finalStartIndex = 0;
private static int finalEndIndex = 0;
private static int minLen = Integer.MAX_VALUE;
private static int currentLen = 0;
/**
* Whether we have already found the match and now looking for other
* alternatives.
*/
private static boolean isFound = false;
private static char currentChar;
public static String findSmallestSubStringWithAllChars(String big, String small) {
if (null == big || null == small || big.isEmpty() || small.isEmpty()) {
return null;
}
frequency = new HashMap<Character, Integer>();
instantiateFrequencyMap(small);
charsCovered = new HashSet<Character>();
int charsToBeCovered = frequency.size();
encountered = new HashMap<Character, Integer>();
for (int i = 0; i < big.length(); i++) {
currentChar = big.charAt(i);
if (frequency.containsKey(currentChar) && !isFound) {
if (!hasStarted && !isFound) {
hasStarted = true;
currentStartIndex = i;
}
updateEncounteredMapAndCharsCoveredSet(currentChar);
if (charsCovered.size() == charsToBeCovered) {
currentLen = i - currentStartIndex;
isFound = true;
updateMinLength(i);
}
} else if (frequency.containsKey(currentChar) && isFound) {
updateEncounteredMapAndCharsCoveredSet(currentChar);
if (currentChar == big.charAt(currentStartIndex)) {
encountered.put(currentChar, encountered.get(currentChar) - 1);
currentStartIndex++;
while (currentStartIndex < i) {
if (encountered.containsKey(big.charAt(currentStartIndex))
&& encountered.get(big.charAt(currentStartIndex)) > frequency.get(big
.charAt(currentStartIndex))) {
encountered.put(big.charAt(currentStartIndex),
encountered.get(big.charAt(currentStartIndex)) - 1);
} else if (encountered.containsKey(big.charAt(currentStartIndex))) {
break;
}
currentStartIndex++;
}
}
currentLen = i - currentStartIndex;
updateMinLength(i);
}
}
System.out.println("start: " + finalStartIndex + " finalEnd : " + finalEndIndex);
return big.substring(finalStartIndex, finalEndIndex + 1);
}
private static void updateMinLength(int index) {
if (minLen > currentLen) {
minLen = currentLen;
finalStartIndex = currentStartIndex;
finalEndIndex = index;
}
}
private static void updateEncounteredMapAndCharsCoveredSet(Character currentChar) {
if (encountered.containsKey(currentChar)) {
encountered.put(currentChar, encountered.get(currentChar) + 1);
} else {
encountered.put(currentChar, 1);
}
if (encountered.get(currentChar) >= frequency.get(currentChar)) {
charsCovered.add(currentChar);
}
}
private static void instantiateFrequencyMap(String str) {
for (char c : str.toCharArray()) {
if (frequency.containsKey(c)) {
frequency.put(c, frequency.get(c) + 1);
} else {
frequency.put(c, 1);
}
}
}
public static void main(String[] args) {
String big = "this is a test string";
String small = "tist";
System.out.println("len: " + big.length());
System.out.println(findSmallestSubStringWithAllChars(big, small));
}
def minimum_window(s, t, min_length = 100000):
d = {}
for x in t:
if x in d:
d[x]+= 1
else:
d[x] = 1
tot = sum([y for x,y in d.iteritems()])
l = []
ind = 0
for i,x in enumerate(s):
if ind == 1:
l = l + [x]
if x in d:
tot-=1
if not l:
ind = 1
l = [x]
if tot == 0:
if len(l)<min_length:
min_length = len(l)
min_length = minimum_window(s[i+1:], t, min_length)
return min_length
l_s = "ADOBECODEBANC"
t_s = "ABC"
min_length = minimum_window(l_s, t_s)
if min_length == 100000:
print "Not found"
else:
print min_length

Balanced word wrap (Minimum raggedness) in PHP

I'm going to make a word wrap algorithm in PHP. I want to split small chunks of text (short phrases) in n lines of maximum m characters (n is not given, so there will be as much lines as needed). The peculiarity is that lines length (in characters) has to be much balanced as possible across lines.
Example of input text:
How to do things
Wrong output (this is the normal word-wrap behavior), m=6:
How to
do
things
Desired output, always m=6:
How
to do
things
Does anyone have suggestions or guidelines on how to implement this function? Basically, I'm searching something for pretty print short phrases on two or three (as much as possible) equal length lines.
Update: It seems I'm searching exactly for a Minimum raggedness word wrap algorithm. But I can't find any implementation in a real programming language (anyone, then I can convert it in PHP).
Update 2: I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searching procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm.
I've implemented on the same lines of Alex, coding the Wikipedia algorithm, but directly in PHP (an interesting exercise to me). Understanding how to use the optimal cost function f(j), i.e. the 'recurrence' part, is not very easy. Thanks to Alex for the well commented code.
/**
* minimumRaggedness
*
* #param string $input paragraph. Each word separed by 1 space.
* #param int $LineWidth the max chars per line.
* #param string $lineBreak wrapped lines separator.
*
* #return string $output the paragraph wrapped.
*/
function minimumRaggedness($input, $LineWidth, $lineBreak = "\n")
{
$words = explode(" ", $input);
$wsnum = count($words);
$wslen = array_map("strlen", $words);
$inf = 1000000; //PHP_INT_MAX;
// keep Costs
$C = array();
for ($i = 0; $i < $wsnum; ++$i)
{
$C[] = array();
for ($j = $i; $j < $wsnum; ++$j)
{
$l = 0;
for ($k = $i; $k <= $j; ++$k)
$l += $wslen[$k];
$c = $LineWidth - ($j - $i) - $l;
if ($c < 0)
$c = $inf;
else
$c = $c * $c;
$C[$i][$j] = $c;
}
}
// apply recurrence
$F = array();
$W = array();
for ($j = 0; $j < $wsnum; ++$j)
{
$F[$j] = $C[0][$j];
$W[$j] = 0;
if ($F[$j] == $inf)
{
for ($k = 0; $k < $j; ++$k)
{
$t = $F[$k] + $C[$k + 1][$j];
if ($t < $F[$j])
{
$F[$j] = $t;
$W[$j] = $k + 1;
}
}
}
}
// rebuild wrapped paragraph
$output = "";
if ($F[$wsnum - 1] < $inf)
{
$S = array();
$j = $wsnum - 1;
for ( ; ; )
{
$S[] = $j;
$S[] = $W[$j];
if ($W[$j] == 0)
break;
$j = $W[$j] - 1;
}
$pS = count($S) - 1;
do
{
$i = $S[$pS--];
$j = $S[$pS--];
for ($k = $i; $k < $j; $k++)
$output .= $words[$k] . " ";
$output .= $words[$k] . $lineBreak;
}
while ($j < $wsnum - 1);
}
else
$output = $input;
return $output;
}
?>
Quick and dirty, in c++
#include <sstream>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <memory.h>
using namespace std;
int cac[1000][1000];
string res[1000][1000];
vector<string> words;
int M;
int go(int a, int b){
if(cac[a][b]>= 0) return cac[a][b];
if(a == b) return 0;
int csum = -1;
for(int i=a; i<b; ++i){
csum += words[i].size() + 1;
}
if(csum <= M || a == b-1){
string sep = "";
for(int i=a; i<b; ++i){
res[a][b].append(sep);
res[a][b].append(words[i]);
sep = " ";
}
return cac[a][b] = (M-csum)*(M-csum);
}
int ret = 1000000000;
int best_sp = -1;
for(int sp=a+1; sp<b; ++sp){
int cur = go(a, sp) + go(sp,b);
if(cur <= ret){
ret = cur;
best_sp = sp;
}
}
res[a][b] = res[a][best_sp] + "\n" + res[best_sp][b];
return cac[a][b] = ret;
}
int main(int argc, char ** argv){
memset(cac, -1, sizeof(cac));
M = atoi(argv[1]);
string word;
while(cin >> word) words.push_back(word);
go(0, words.size());
cout << res[0][words.size()] << endl;
}
Test:
$ echo "The quick brown fox jumps over a lazy dog" |./a.out 10
The quick
brown fox
jumps over
a lazy dog
EDIT: just looked at the wikipedia page for minimum raggedness word wrap. Changed algorithm to the given one (with squared penalties)
A C version:
// This is a direct implementation of the minimum raggedness word wrapping
// algorithm from http://en.wikipedia.org/wiki/Word_wrap#Minimum_raggedness
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#include <limits.h>
const char* pText = "How to do things";
int LineWidth = 6;
int WordCnt;
const char** pWords;
int* pWordLengths;
int* pC;
int* pF;
int* pW;
int* pS;
int CountWords(const char* p)
{
int cnt = 0;
while (*p != '\0')
{
while (*p != '\0' && isspace(*p)) p++;
if (*p != '\0')
{
cnt++;
while (*p != '\0' && !isspace(*p)) p++;
}
}
return cnt;
}
void FindWords(const char* p, int cnt, const char** pWords, int* pWordLengths)
{
while (*p != '\0')
{
while (*p != '\0' && isspace(*p)) p++;
if (*p != '\0')
{
*pWords++ = p;
while (*p != '\0' && !isspace(*p)) p++;
*pWordLengths++ = p - pWords[-1];
}
}
}
void PrintWord(const char* p, int l)
{
int i;
for (i = 0; i < l; i++)
printf("%c", p[i]);
}
// 1st program's argument is the text
// 2nd program's argument is the line width
int main(int argc, char* argv[])
{
int i, j;
if (argc >= 3)
{
pText = argv[1];
LineWidth = atoi(argv[2]);
}
WordCnt = CountWords(pText);
pWords = malloc(WordCnt * sizeof(*pWords));
pWordLengths = malloc(WordCnt * sizeof(*pWordLengths));
FindWords(pText, WordCnt, pWords, pWordLengths);
printf("Input Text: \"%s\"\n", pText);
printf("Line Width: %d\n", LineWidth);
printf("Words : %d\n", WordCnt);
#if 0
for (i = 0; i < WordCnt; i++)
{
printf("\"");
PrintWord(pWords[i], pWordLengths[i]);
printf("\"\n");
}
#endif
// Build c(i,j) in pC[]
pC = malloc(WordCnt * WordCnt * sizeof(int));
for (i = 0; i < WordCnt; i++)
{
for (j = 0; j < WordCnt; j++)
if (j >= i)
{
int k;
int c = LineWidth - (j - i);
for (k = i; k <= j; k++) c -= pWordLengths[k];
c = (c >= 0) ? c * c : INT_MAX;
pC[j * WordCnt + i] = c;
}
else
pC[j * WordCnt + i] = INT_MAX;
}
// Build f(j) in pF[] and store the wrap points in pW[]
pF = malloc(WordCnt * sizeof(int));
pW = malloc(WordCnt * sizeof(int));
for (j = 0; j < WordCnt; j++)
{
pW[j] = 0;
if ((pF[j] = pC[j * WordCnt]) == INT_MAX)
{
int k;
for (k = 0; k < j; k++)
{
int s;
if (pF[k] == INT_MAX || pC[j * WordCnt + k + 1] == INT_MAX)
s = INT_MAX;
else
s = pF[k] + pC[j * WordCnt + k + 1];
if (pF[j] > s)
{
pF[j] = s;
pW[j] = k + 1;
}
}
}
}
// Print the optimal solution cost
printf("f : %d\n", pF[WordCnt - 1]);
// Print the optimal solution, if any
pS = malloc(2 * WordCnt * sizeof(int));
if (pF[WordCnt - 1] != INT_MAX)
{
// Work out the solution's words by back tracking the
// wrap points from pW[] and store them on the pS[] stack
j = WordCnt - 1;
for (;;)
{
*pS++ = j;
*pS++ = pW[j];
if (!pW[j]) break;
j = pW[j] - 1;
}
// Print the solution line by line, word by word
// in direct order
do
{
int k;
i = *--pS;
j = *--pS;
for (k = i; k <= j; k++)
{
PrintWord(pWords[k], pWordLengths[k]);
printf(" ");
}
printf("\n");
} while (j < WordCnt - 1);
}
return 0;
}
Output 1:
ww.exe
Input Text: "How to do things"
Line Width: 6
Words : 4
f : 10
How
to do
things
Output 2:
ww.exe "aaa bb cc ddddd" 6
Input Text: "aaa bb cc ddddd"
Line Width: 6
Words : 4
f : 11
aaa
bb cc
ddddd
Output 3:
ww.exe "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm." 60
Input Text: "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm."
Line Width: 60
Words : 73
f : 241
I started a bounty for this. Is it possible that do not
exist any public implementation of Minimum raggedness
algorithm in any procedural language? I need something
written in a way that can be translated into procedural
instructions. All I can find now is just a bounch of
(generic) equation that however need a optimal searhing
procedure. I will be grateful also for an implementation
that can only approximate that optimal searching algorithm.
I think the simplest way to look at it - is with iteration between limits
E.g.
/**
* balancedWordWrap
*
* #param string $input
* #param int $maxWidth the max chars per line
*/
function balancedWordWrap($input, $maxWidth = null) {
$length = strlen($input);
if (!$maxWidth) {
$maxWidth = min(ceil($length / 2), 75);
}
$minWidth = min(ceil($length / 2), $maxWidth / 2);
$permutations = array();
$scores = array();
$lowestScore = 999;
$lowest = $minWidth;
foreach(range($minWidth, $maxWidth) as $width) {
$permutations[$width] = wordwrap($input, $width);
$lines = explode("\n", $permutations[$width]);
$max = 0;
foreach($lines as $line) {
$lineLength = strlen($line);
if ($lineLength > $max) {
$max = $lineLength;
}
}
$score = 0;
foreach($lines as $line) {
$lineLength = strlen($line);
$score += pow($max - $lineLength, 2);
}
$scores[$width] = $score;
if ($score < $lowestScore) {
$lowestScore = $score;
$lowest = $width;
}
}
return $permutations[$lowest];
}
Given the input "how to do things"
it outputs
How
to do
things
Given the input "Mary had a little lamb"
it outputs
Mary had a
little lamb
Given the input "This extra-long paragraph was writtin to demonstrate how the fmt(1) program handles longer inputs. When testing inputs, you don\'t want them to be too short, nor too long, because the quality of the program can only be determined upon inspection of complex content. The quick brown fox jumps over the lazy dog. Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.", and limited to 75 chars max width, it outputs:
This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them
be too short, nor too long, because the quality of the program can only be
determined upon inspection of complex content. The quick brown fox jumps
over the lazy dog. Congress shall make no law respecting an establishment
of religion, or prohibiting the free exercise thereof; or abridging the
freedom of speech, or of the press; or the right of the people peaceably
to assemble, and to petition the Government for a redress of grievances.
Justin's link to Knuth's Breaking Paragraphs Into Lines is the historically best answer. (Newer systems also apply microtypography techniques such as fiddling with character widths, kerning, and so on, but if you're simply looking for monospaced plain-text, these extra approaches won't help.)
If you just want to solve the problem, the fmt(1) utility supplied on many Linux systems by the Free Software Foundation implements a variant of Knuth's algorithm that also attempts to avoid line breaks at the end of sentences. I wrote your inputs and a larger example, and ran them through fmt -w 20 to force 20-character lines:
$ fmt -w 20 input
Lorem ipsum dolor
sit amet
Supercalifragilisticexpialidocious
and some other
small words
One long
extra-long-word
This extra-long
paragraph
was writtin to
demonstrate how the
`fmt(1)` program
handles longer
inputs. When
testing inputs,
you don't want them
to be too short,
nor too long,
because the quality
of the program can
only be determined
upon inspection
of complex
content. The quick
brown fox jumps
over the lazy
dog. Congress
shall make no
law respecting
an establishment
of religion, or
prohibiting the
free exercise
thereof; or
abridging the
freedom of speech,
or of the press;
or the right of the
people peaceably
to assemble,
and to petition
the Government
for a redress of
grievances.
The output looks much better if you allow it the default 75 characters width for non-trivial input:
$ fmt input
Lorem ipsum dolor sit amet
Supercalifragilisticexpialidocious and some other small words
One long extra-long-word
This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them
to be too short, nor too long, because the quality of the program can
only be determined upon inspection of complex content. The quick brown
fox jumps over the lazy dog. Congress shall make no law respecting an
establishment of religion, or prohibiting the free exercise thereof;
or abridging the freedom of speech, or of the press; or the right of
the people peaceably to assemble, and to petition the Government for a
redress of grievances.
Here is a bash version:
#! /bin/sh
if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
echo "Usage: balance <width> [ <string> ]"
echo " "
echo " if string is not passed as parameter it will be read from STDIN\n"
exit 2
elif [ $# -le 1 ] ; then
LINE=`cat`
else
LINE="$2"
fi
LINES=`echo "$LINE" | fold -s -w $1 | wc -l`
MAX=$1
MIN=0
while [ $MAX -gt $(($MIN+1)) ]
do
TRY=$(( $MAX + $MIN >> 1 ))
NUM=`echo "$LINE" | fold -s -w $TRY | wc -l`
if [ $NUM -le $LINES ] ; then
MAX=$TRY
else
MIN=$TRY
fi
done
echo "$LINE" | fold -s -w $MAX
example:
$ balance 50 "Now is the time for all good men to come to the aid of the party."
Now is the time for all good men
to come to the aid of the party.
Requires 'fold' and 'wc' which are usually available where bash is installed.

function to generate 26 (A-Z) * 26 (A-Z) * 26 (A-Z)

I'd like to create a function which will generate a 3 char key string after each loop.
There are 26 chars in the alphabet and I'd like to generate totally unique 3 char keys (A-Z).
The output would be 17,576 unique 3 char keys (A-Z) - not case sensitive.
Can anyone give me an idea on how to create a more elegant function without randomly generating keys and checking for duplicate keys?
Is it possible
Thank you.
If I read your question correctly you want to have a function that returns a key of the form "ABC" where each of the letters is selected randomly but the same combination of letters is never issued twice.
Do you mean never issued twice per execution of the code or never issued twice "ever"?
Either way I would look at generating all possible combinations, shuffling them and then storing them either in a member array or a file/database depending on your needs. You will also need to store an index which you simply increment each time you issue a key.
<?php
private $keys = array();
private $keyIndex = -1;
function generateKeys()
{
$this->keys = array();
$this->keyIndex = -1;
for ($i=1; $i<=26; $i++)
{
for ($j=1; $j<=26; $j++)
{
for ($k=1; $k<=26; $k++)
{
$this->keys[] = sprintf('%c%c%c', 64+$i, 64+$j, 64+$k);
}
}
}
shuffle($this->keys);
return $this->keys;
}
function getKey()
{
$this->keyIndex++;
return $this->keys[$this->keyIndex];
}
?>
var use any math library to generate a random number from 0 to 26 cubed - 1, then assign the letters based on the value of that random integer...
in the following, the backslash represents integer division, i.e., drop any fractional remainder
first character = ascii(65 + integer modulus 26)
second character = ascii(65 + (integer \ 26) modulus 26
third character = ascii(65 + ((integer \ 26) \ 26) modulus 26
ahh, thx to #Graphain I realize you want to eliminate any chance of picking the same three character combination again... welll then here's a way...
Create a collection (List?) containing 676 (26*26) 32 bit integers, all initialized to 2^26-1 (so bits 0-25 are all set = 1). Put 26 of these integers into each of 26 inner dictionaries so this becomes an dictionary of 26 dictionaries each of which has 26 of these integers. Label the inner dictionaries A-Z. Within each inner array, label the integers A-Z.
Randomly pick one of 26 outer arrays (this sets the first character).
From the array chosen randomly pick one of it's contained inner arrays. This sets the second character.
Then randomly pick a number from 0 to n, (where n is the count of bits in the integer that are still set to 1)... Which bit in the number determines the last character.
Set that bit to zero
If all bits in the integer have been set to zero, remove integer from array
If this inner array is now empty, (all integers are gone) remove it from outer array.
Repeat from step 2 until outer array is empty or you get tired...
Here's some sample code (not tested):
public class RandomAlphaTriplet
{
private static readonly Dictionary<char, Dictionary<char, int>> vals =
new Dictionary<char, Dictionary<char, int>>();
const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random rnd = new Random(DateTime.Now.Millisecond);
private const int initVal = 0x3FFFFFF;
static RandomAlphaTriplet()
{
foreach (var c in chars)
vals.Add(c, chars.ToDictionary(
ic => ic, ic => initVal));
}
public static string FetchNext()
{
var c1 = chars[rnd.Next(vals.Count)];
var inrDict = vals[c1];
var c2 = chars[rnd.Next(inrDict.Count)];
var intVal = inrDict[c2];
var bitNo = rnd.Next(BitCount(intVal));
var bitPos = 0;
while (bitNo > 0)
{
if ((intVal & 0x0001) > 0) bitNo--;
bitPos++;
intVal <<= 1;
}
var c3 = chars[bitPos];
inrDict[c2] &= ~(1 << bitPos);
if (inrDict[c2] == 0) inrDict.Remove(c2);
if (vals[c1].Count == 0) vals.Remove(c1);
return string.Concat(c1, c2, c3);
}
private static int BitCount(int x)
{ return ((x == 0) ? 0 : ((x < 0) ? 1 : 0) + BitCount(x << 1)); }
}
I can make it to only one key if you are interested. (I don't know if it is possible at all to have no key to check for duplication).
function getAlpha($Index) {
$Alphabets = 'abcdefghijklmnopqrstuvwxyz';
$Char = substr($Alphabets, $Index, 1);
return $Char;
}
function getRandom($Index) {
$RandOne = $Index % 26;
$RandTwo = ($Index / 26 ) % 26;
$RandThree = ($Index / 26*26) % 26;
$AlphaOne = getAlpha($RandOne);
$AlphaTwo = getAlpha($RandTwo);
$AlphaThree = getAlpha($RandThree);
$Rand = $AlphaOne.$AlphaTwo.$AlphaThree;
return $Rand;
}
$Rands = array();
function getRandom() {
global $Rands;
while (true) {
$RandNum = rand(0, 26*26*26 - 1);
if (isset($Rands[$RandNum]))
continue;
$Rand = getRandom($RandNum);
$Rands[$RandNum] = $Rand;
return $Rand;
}
}
This only need one key to look up.
Hope this helps.
Ehrm is this not “simple” combinatorial math? (Select all combinations of 3 out of 26? Or permutations of 3 out of 26?)

finding common prefix of array of strings

I have an array like this:
$sports = array(
'Softball - Counties',
'Softball - Eastern',
'Softball - North Harbour',
'Softball - South',
'Softball - Western'
);
I would like to find the longest common prefix of the string. In this instance, it would be 'Softball - '
I am thinking that I would follow this process
$i = 1;
// loop to the length of the first string
while ($i < strlen($sports[0]) {
// grab the left most part up to i in length
$match = substr($sports[0], 0, $i);
// loop through all the values in array, and compare if they match
foreach ($sports as $sport) {
if ($match != substr($sport, 0, $i) {
// didn't match, return the part that did match
return substr($sport, 0, $i-1);
}
} // foreach
// increase string length
$i++;
} // while
// if you got to here, then all of them must be identical
Questions
Is there a built in function or much simpler way of doing this ?
For my 5 line array that is probably fine, but if I were to do several thousand line arrays, there would be a lot of overhead, so I would have to be move calculated with my starting values of $i, eg $i = halfway of string, if it fails, then $i/2 until it works, then increment $i by 1 until we succeed. So that we are doing the least number of comparisons to get a result.
Is there a formula/algorithm out already out there for this kind of problem?
If you can sort your array, then there is a simple and very fast solution.
Simply compare the first item to the last one.
If the strings are sorted, any prefix common to all strings will be common to the sorted first and last strings.
sort($sport);
$s1 = $sport[0]; // First string
$s2 = $sport[count($sport)-1]; // Last string
$len = min(strlen($s1), strlen($s2));
// While we still have string to compare,
// if the indexed character is the same in both strings,
// increment the index.
for ($i=0; $i<$len && $s1[$i]==$s2[$i]; $i++);
$prefix = substr($s1, 0, $i);
I would use this:
$prefix = array_shift($array); // take the first item as initial prefix
$length = strlen($prefix);
// compare the current prefix with the prefix of the same length of the other items
foreach ($array as $item) {
// check if there is a match; if not, decrease the prefix by one character at a time
while ($length && substr($item, 0, $length) !== $prefix) {
$length--;
$prefix = substr($prefix, 0, -1);
}
if (!$length) {
break;
}
}
Update  
Here’s another solution, iteratively comparing each n-th character of the strings until a mismatch is found:
$pl = 0; // common prefix length
$n = count($array);
$l = strlen($array[0]);
while ($pl < $l) {
$c = $array[0][$pl];
for ($i=1; $i<$n; $i++) {
if ($array[$i][$pl] !== $c) break 2;
}
$pl++;
}
$prefix = substr($array[0], 0, $pl);
This is even more efficient as there are only at most numberOfStrings‍·‍commonPrefixLength atomic comparisons.
I implemented #diogoriba algorithm into code, with this result:
Finding the common prefix of the first two strings, and then comparing this with all following strings starting from the 3rd, and trim the common string if nothing common is found, wins in situations where there is more in common in the prefixes than different.
But bumperbox's original algorithm (except the bugfixes) wins where the strings have less in common in their prefix than different. Details in the code comments!
Another idea I implemented:
First check for the shortest string in the array, and use this for comparison rather than simply the first string. In the code, this is implemented with the custom written function arrayStrLenMin().
Can bring down iterations dramatically, but the function arrayStrLenMin() may itself cause ( more or less) iterations.
Simply starting with the length of first string in array seems quite clumsy, but may turn out effective, if arrayStrLenMin() needs many iterations.
Get the maximum common prefix of strings in an array with as little iterations as possible (PHP)
Code + Extensive Testing + Remarks:
function arrayStrLenMin ($arr, $strictMode = false, $forLoop = false) {
$errArrZeroLength = -1; // Return value for error: Array is empty
$errOtherType = -2; // Return value for error: Found other type (than string in array)
$errStrNone = -3; // Return value for error: No strings found (in array)
$arrLength = count($arr);
if ($arrLength <= 0 ) { return $errArrZeroLength; }
$cur = 0;
foreach ($arr as $key => $val) {
if (is_string($val)) {
$min = strlen($val);
$strFirstFound = $key;
// echo("Key\tLength / Notification / Error\n");
// echo("$key\tFound first string member at key with length: $min!\n");
break;
}
else if ($strictMode) { return $errOtherType; } // At least 1 type other than string was found.
}
if (! isset($min)) { return $errStrNone; } // No string was found in array.
// SpeedRatio of foreach/for is approximately 2/1 as dicussed at:
// http://juliusbeckmann.de/blog/php-foreach-vs-while-vs-for-the-loop-battle.html
// If $strFirstFound is found within the first 1/SpeedRatio (=0.5) of the array, "foreach" is faster!
if (! $forLoop) {
foreach ($arr as $key => $val) {
if (is_string($val)) {
$cur = strlen($val);
// echo("$key\t$cur\n");
if ($cur == 0) { return $cur; } // 0 is the shortest possible string, so we can abort here.
if ($cur < $min) { $min = $cur; }
}
// else { echo("$key\tNo string!\n"); }
}
}
// If $strFirstFound is found after the first 1/SpeedRatio (=0.5) of the array, "for" is faster!
else {
for ($i = $strFirstFound + 1; $i < $arrLength; $i++) {
if (is_string($arr[$i])) {
$cur = strlen($arr[$i]);
// echo("$i\t$cur\n");
if ($cur == 0) { return $cur; } // 0 is the shortest possible string, so we can abort here.
if ($cur < $min) { $min = $cur; }
}
// else { echo("$i\tNo string!\n"); }
}
}
return $min;
}
function strCommonPrefixByStr($arr, $strFindShortestFirst = false) {
$arrLength = count($arr);
if ($arrLength < 2) { return false; }
// Determine loop length
/// Find shortest string in array: Can bring down iterations dramatically, but the function arrayStrLenMin() itself can cause ( more or less) iterations.
if ($strFindShortestFirst) { $end = arrayStrLenMin($arr, true); }
/// Simply start with length of first string in array: Seems quite clumsy, but may turn out effective, if arrayStrLenMin() needs many iterations.
else { $end = strlen($arr[0]); }
for ($i = 1; $i <= $end + 1; $i++) {
// Grab the part from 0 up to $i
$commonStrMax = substr($arr[0], 0, $i);
echo("Match: $i\t$commonStrMax\n");
// Loop through all the values in array, and compare if they match
foreach ($arr as $key => $str) {
echo(" Str: $key\t$str\n");
// Didn't match, return the part that did match
if ($commonStrMax != substr($str, 0, $i)) {
return substr($commonStrMax, 0, $i-1);
}
}
}
// Special case: No mismatch (hence no return) happened until loop end!
return $commonStrMax; // Thus entire first common string is the common prefix!
}
function strCommonPrefixByChar($arr, $strFindShortestFirst = false) {
$arrLength = count($arr);
if ($arrLength < 2) { return false; }
// Determine loop length
/// Find shortest string in array: Can bring down iterations dramatically, but the function arrayStrLenMin() itself can cause ( more or less) iterations.
if ($strFindShortestFirst) { $end = arrayStrLenMin($arr, true); }
/// Simply start with length of first string in array: Seems quite clumsy, but may turn out effective, if arrayStrLenMin() needs many iterations.
else { $end = strlen($arr[0]); }
for ($i = 0 ; $i <= $end + 1; $i++) {
// Grab char $i
$char = substr($arr[0], $i, 1);
echo("Match: $i\t"); echo(str_pad($char, $i+1, " ", STR_PAD_LEFT)); echo("\n");
// Loop through all the values in array, and compare if they match
foreach ($arr as $key => $str) {
echo(" Str: $key\t$str\n");
// Didn't match, return the part that did match
if ($char != $str[$i]) { // Same functionality as ($char != substr($str, $i, 1)). Same efficiency?
return substr($arr[0], 0, $i);
}
}
}
// Special case: No mismatch (hence no return) happened until loop end!
return substr($arr[0], 0, $end); // Thus entire first common string is the common prefix!
}
function strCommonPrefixByNeighbour($arr) {
$arrLength = count($arr);
if ($arrLength < 2) { return false; }
/// Get the common string prefix of the first 2 strings
$strCommonMax = strCommonPrefixByChar(array($arr[0], $arr[1]));
if ($strCommonMax === false) { return false; }
if ($strCommonMax == "") { return ""; }
$strCommonMaxLength = strlen($strCommonMax);
/// Now start looping from the 3rd string
echo("-----\n");
for ($i = 2; ($i < $arrLength) && ($strCommonMaxLength >= 1); $i++ ) {
echo(" STR: $i\t{$arr[$i]}\n");
/// Compare the maximum common string with the next neighbour
/*
//// Compare by char: Method unsuitable!
// Iterate from string end to string beginning
for ($ii = $strCommonMaxLength - 1; $ii >= 0; $ii--) {
echo("Match: $ii\t"); echo(str_pad($arr[$i][$ii], $ii+1, " ", STR_PAD_LEFT)); echo("\n");
// If you find the first mismatch from the end, break.
if ($arr[$i][$ii] != $strCommonMax[$ii]) {
$strCommonMaxLength = $ii - 1; break;
// BUT!!! We may falsely assume that the string from the first mismatch until the begining match! This new string neighbour string is completely "unexplored land", there might be differing chars closer to the beginning. This method is not suitable. Better use string comparison than char comparison.
}
}
*/
//// Compare by string
for ($ii = $strCommonMaxLength; $ii > 0; $ii--) {
echo("MATCH: $ii\t$strCommonMax\n");
if (substr($arr[$i],0,$ii) == $strCommonMax) {
break;
}
else {
$strCommonMax = substr($strCommonMax,0,$ii - 1);
$strCommonMaxLength--;
}
}
}
return substr($arr[0], 0, $strCommonMaxLength);
}
// Tests for finding the common prefix
/// Scenarios
$filesLeastInCommon = array (
"/Vol/1/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/a/1",
"/Vol/2/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/a/2",
"/Vol/1/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/b/1",
"/Vol/1/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/b/2",
"/Vol/2/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/b/c/1",
"/Vol/2/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/a/1",
);
$filesLessInCommon = array (
"/Vol/1/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/a/1",
"/Vol/1/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/a/2",
"/Vol/1/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/b/1",
"/Vol/1/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/b/2",
"/Vol/2/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/b/c/1",
"/Vol/2/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/a/1",
);
$filesMoreInCommon = array (
"/Voluuuuuuuuuuuuuumes/1/a/a/1",
"/Voluuuuuuuuuuuuuumes/1/a/a/2",
"/Voluuuuuuuuuuuuuumes/1/a/b/1",
"/Voluuuuuuuuuuuuuumes/1/a/b/2",
"/Voluuuuuuuuuuuuuumes/2/a/b/c/1",
"/Voluuuuuuuuuuuuuumes/2/a/a/1",
);
$sameDir = array (
"/Volumes/1/a/a/",
"/Volumes/1/a/a/aaaaa/2",
);
$sameFile = array (
"/Volumes/1/a/a/1",
"/Volumes/1/a/a/1",
);
$noCommonPrefix = array (
"/Volumes/1/a/a/",
"/Volumes/1/a/a/aaaaa/2",
"Net/1/a/a/aaaaa/2",
);
$longestLast = array (
"/Volumes/1/a/a/1",
"/Volumes/1/a/a/aaaaa/2",
);
$longestFirst = array (
"/Volumes/1/a/a/aaaaa/1",
"/Volumes/1/a/a/2",
);
$one = array ("/Volumes/1/a/a/aaaaa/1");
$empty = array ( );
// Test Results for finding the common prefix
/*
I tested my functions in many possible scenarios.
The results, the common prefixes, were always correct in all scenarios!
Just try a function call with your individual array!
Considering iteration efficiency, I also performed tests:
I put echo functions into the functions where iterations occur, and measured the number of CLI line output via:
php <script with strCommonPrefixByStr or strCommonPrefixByChar> | egrep "^ Str:" | wc -l GIVES TOTAL ITERATION SUM.
php <Script with strCommonPrefixByNeighbour> | egrep "^ Str:" | wc -l PLUS | egrep "^MATCH:" | wc -l GIVES TOTAL ITERATION SUM.
My hypothesis was proven:
strCommonPrefixByChar wins in situations where the strings have less in common in their beginning (=prefix).
strCommonPrefixByNeighbour wins where there is more in common in the prefixes.
*/
// Test Results Table
// Used Functions | Iteration amount | Remarks
// $result = (strCommonPrefixByStr($filesLessInCommon)); // 35
// $result = (strCommonPrefixByChar($filesLessInCommon)); // 35 // Same amount of iterations, but much fewer characters compared because ByChar instead of ByString!
// $result = (strCommonPrefixByNeighbour($filesLessInCommon)); // 88 + 42 = 130 // Loses in this category!
// $result = (strCommonPrefixByStr($filesMoreInCommon)); // 137
// $result = (strCommonPrefixByChar($filesMoreInCommon)); // 137 // Same amount of iterations, but much fewer characters compared because ByChar instead of ByString!
// $result = (strCommonPrefixByNeighbour($filesLeastInCommon)); // 12 + 4 = 16 // Far the winner in this category!
echo("Common prefix of all members:\n");
var_dump($result);
// Tests for finding the shortest string in array
/// Arrays
// $empty = array ();
// $noStrings = array (0,1,2,3.0001,4,false,true,77);
// $stringsOnly = array ("one","two","three","four");
// $mixed = array (0,1,2,3.0001,"four",false,true,"seven", 8888);
/// Scenarios
// I list them from fewest to most iterations, which is not necessarily equivalent to slowest to fastest!
// For speed consider the remarks in the code considering the Speed ratio of foreach/for!
//// Fewest iterations (immediate abort on "Found other type", use "for" loop)
// foreach( array($empty, $noStrings, $stringsOnly, $mixed) as $arr) {
// echo("NEW ANALYSIS:\n");
// echo("Result: " . arrayStrLenMin($arr, true, true) . "\n\n");
// }
/* Results:
NEW ANALYSIS:
Result: Array is empty!
NEW ANALYSIS:
Result: Found other type!
NEW ANALYSIS:
Key Length / Notification / Error
0 Found first string member at key with length: 3!
1 3
2 5
3 4
Result: 3
NEW ANALYSIS:
Result: Found other type!
*/
//// Fewer iterations (immediate abort on "Found other type", use "foreach" loop)
// foreach( array($empty, $noStrings, $stringsOnly, $mixed) as $arr) {
// echo("NEW ANALYSIS:\n");
// echo("Result: " . arrayStrLenMin($arr, true, false) . "\n\n");
// }
/* Results:
NEW ANALYSIS:
Result: Array is empty!
NEW ANALYSIS:
Result: Found other type!
NEW ANALYSIS:
Key Length / Notification / Error
0 Found first string member at key with length: 3!
0 3
1 3
2 5
3 4
Result: 3
NEW ANALYSIS:
Result: Found other type!
*/
//// More iterations (No immediate abort on "Found other type", use "for" loop)
// foreach( array($empty, $noStrings, $stringsOnly, $mixed) as $arr) {
// echo("NEW ANALYSIS:\n");
// echo("Result: " . arrayStrLenMin($arr, false, true) . "\n\n");
// }
/* Results:
NEW ANALYSIS:
Result: Array is empty!
NEW ANALYSIS:
Result: No strings found!
NEW ANALYSIS:
Key Length / Notification / Error
0 Found first string member at key with length: 3!
1 3
2 5
3 4
Result: 3
NEW ANALYSIS:
Key Length / Notification / Error
4 Found first string member at key with length: 4!
5 No string!
6 No string!
7 5
8 No string!
Result: 4
*/
//// Most iterations (No immediate abort on "Found other type", use "foreach" loop)
// foreach( array($empty, $noStrings, $stringsOnly, $mixed) as $arr) {
// echo("NEW ANALYSIS:\n");
// echo("Result: " . arrayStrLenMin($arr, false, false) . "\n\n");
// }
/* Results:
NEW ANALYSIS:
Result: Array is empty!
NEW ANALYSIS:
Result: No strings found!
NEW ANALYSIS:
Key Length / Notification / Error
0 Found first string member at key with length: 3!
0 3
1 3
2 5
3 4
Result: 3
NEW ANALYSIS:
Key Length / Notification / Error
4 Found first string member at key with length: 4!
0 No string!
1 No string!
2 No string!
3 No string!
4 4
5 No string!
6 No string!
7 5
8 No string!
Result: 4
*/
Probably there is some terribly well-regarded algorithm for this, but just off the top of my head, if you know your commonality is going to be on the left-hand side like in your example, you could do way better than your posted methodology by first finding the commonality of the first two strings, and then iterating down the rest of the list, trimming the common string as necessary to achieve commonality or terminating with failure if you trim all the way to nothing.
I think you're on the right way. But instead of incrementing i when all of the string passes, you could do this:
1) Compare the first 2 strings in the array and find out how many common characters they have. Save the common characters in a separate string called maxCommon, for example.
2) Compare the third string w/ maxCommon. If the number of common characters is smaller, trim maxCommon to the characters that match.
3) Repeat and rinse for the rest of the array. At the end of the process, maxCommon will have the string that is common to all of the array elements.
This will add some overhead because you'll need to compare each string w/ maxCommon, but will drastically reduce the number of iterations you'll need to get your results.
I assume that by "common part" you mean "longest common prefix". That is a much simpler to compute than any common substring.
This cannot be done without reading (n+1) * m characters in the worst case and n * m + 1 in the best case, where n is the length of the longest common prefix and m is the number of strings.
Comparing one letter at a time achieves that efficiency (Big Theta (n * m)).
Your proposed algorithm runs in Big Theta(n^2 * m), which is much, much slower for large inputs.
The third proposed algorithm of finding the longest prefix of the first two strings, then comparing that with the third, fourth, etc. also has a running time in Big Theta(n * m), but with a higher constant factor. It will probably only be slightly slower in practice.
Overall, I would recommend just rolling your own function, since the first algorithm is too slow and the two others will be about equally complicated to write anyway.
Check out WikiPedia for a description of Big Theta notation.
Here's an elegant, recursive implementation in JavaScript:
function prefix(strings) {
switch (strings.length) {
case 0:
return "";
case 1:
return strings[0];
case 2:
// compute the prefix between the two strings
var a = strings[0],
b = strings[1],
n = Math.min(a.length, b.length),
i = 0;
while (i < n && a.charAt(i) === b.charAt(i))
++i;
return a.substring(0, i);
default:
// return the common prefix of the first string,
// and the common prefix of the rest of the strings
return prefix([ strings[0], prefix(strings.slice(1)) ]);
}
}
not that I know of
yes: instead of comparing the substring from 0 to length i, you can simply check the ith character (you already know that characters 0 to i-1 match).
Short and sweet version, perhaps not the most efficient:
/// Return length of longest common prefix in an array of strings.
function _commonPrefix($array) {
if(count($array) < 2) {
if(count($array) == 0)
return false; // empty array: undefined prefix
else
return strlen($array[0]); // 1 element: trivial case
}
$len = max(array_map('strlen',$array)); // initial upper limit: max length of all strings.
$prevval = reset($array);
while(($newval = next($array)) !== FALSE) {
for($j = 0 ; $j < $len ; $j += 1)
if($newval[$j] != $prevval[$j])
$len = $j;
$prevval = $newval;
}
return $len;
}
// TEST CASE:
$arr = array('/var/yam/yamyam/','/var/yam/bloorg','/var/yar/sdoo');
print_r($arr);
$plen = _commonprefix($arr);
$pstr = substr($arr[0],0,$plen);
echo "Res: $plen\n";
echo "==> ".$pstr."\n";
echo "dir: ".dirname($pstr.'aaaa')."\n";
Output of the test case:
Array
(
[0] => /var/yam/yamyam/
[1] => /var/yam/bloorg
[2] => /var/yar/sdoo
)
Res: 7
==> /var/ya
dir: /var
#bumperbox
Your basic code needed some correction to work in ALL scenarios!
Your loop only compares until one character before the last character!
The mismatch can possibly occur 1 loop cycle after the latest common character.
Hence you have to at least check until 1 character after your first string's last character.
Hence your comparison operator must be "<= 1" or "< 2".
Currently your algorithm fails
if the first string is completely included in all other strings,
or completely included in all other strings except the last character.
In my next answer/post, I will attach iteration optimized code!
Original Bumperbox code PLUS correction (PHP):
function shortest($sports) {
$i = 1;
// loop to the length of the first string
while ($i < strlen($sports[0])) {
// grab the left most part up to i in length
// REMARK: Culturally biased towards LTR writing systems. Better say: Grab frombeginning...
$match = substr($sports[0], 0, $i);
// loop through all the values in array, and compare if they match
foreach ($sports as $sport) {
if ($match != substr($sport, 0, $i)) {
// didn't match, return the part that did match
return substr($sport, 0, $i-1);
}
}
$i++; // increase string length
}
}
function shortestCorrect($sports) {
$i = 1;
while ($i <= strlen($sports[0]) + 1) {
// Grab the string from its beginning with length $i
$match = substr($sports[0], 0, $i);
foreach ($sports as $sport) {
if ($match != substr($sport, 0, $i)) {
return substr($sport, 0, $i-1);
}
}
$i++;
}
// Special case: No mismatch happened until loop end! Thus entire str1 is common prefix!
return $sports[0];
}
$sports1 = array(
'Softball',
'Softball - Eastern',
'Softball - North Harbour');
$sports2 = array(
'Softball - Wester',
'Softball - Western',
);
$sports3 = array(
'Softball - Western',
'Softball - Western',
);
$sports4 = array(
'Softball - Westerner',
'Softball - Western',
);
echo("Output of the original function:\n"); // Failure scenarios
var_dump(shortest($sports1)); // NULL rather than the correct 'Softball'
var_dump(shortest($sports2)); // NULL rather than the correct 'Softball - Wester'
var_dump(shortest($sports3)); // NULL rather than the correct 'Softball - Western'
var_dump(shortest($sports4)); // Only works if the second string is at least one character longer!
echo("\nOutput of the corrected function:\n"); // All scenarios work
var_dump(shortestCorrect($sports1));
var_dump(shortestCorrect($sports2));
var_dump(shortestCorrect($sports3));
var_dump(shortestCorrect($sports4));
How about something like this? It can be further optimised by not having to check the lengths of the strings if we can use the null terminating character (but I am assuming python strings have length cached somewhere?)
def find_common_prefix_len(strings):
"""
Given a list of strings, finds the length common prefix in all of them.
So
apple
applet
application
would return 3
"""
prefix = 0
curr_index = -1
num_strings = len(strings)
string_lengths = [len(s) for s in strings]
while True:
curr_index += 1
ch_in_si = None
for si in xrange(0, num_strings):
if curr_index >= string_lengths[si]:
return prefix
else:
if si == 0:
ch_in_si = strings[0][curr_index]
elif strings[si][curr_index] != ch_in_si:
return prefix
prefix += 1
I would use a recursive algorithm like this:
1 - get the first string in the array
2 - call the recursive prefix method with the first string as a param
3 - if prefix is empty return no prefix
4 - loop through all the strings in the array
4.1 - if any of the strings does not start with the prefix
4.1.1 - call recursive prefix method with prefix - 1 as a param
4.2 return prefix
// Common prefix
$common = '';
$sports = array(
'Softball T - Counties',
'Softball T - Eastern',
'Softball T - North Harbour',
'Softball T - South',
'Softball T - Western'
);
// find mini string
$minLen = strlen($sports[0]);
foreach ($sports as $s){
if($minLen > strlen($s))
$minLen = strlen($s);
}
// flag to break out of inner loop
$flag = false;
// The possible common string length does not exceed the minimum string length.
// The following solution is O(n^2), this can be improve.
for ($i = 0 ; $i < $minLen; $i++){
$tmp = $sports[0][$i];
foreach ($sports as $s){
if($s[$i] != $tmp)
$flag = true;
}
if($flag)
break;
else
$common .= $sports[0][$i];
}
print $common;
The solutions here work only for finding commonalities at the beginning of strings. Here is a function that looks for the longest common substring anywhere in an array of strings.
http://www.christopherbloom.com/2011/02/24/find-the-longest-common-substring-using-php/
The top answer seemed a bit long, so here's a concise solution with a runtime of O(n2).
function findLongestPrefix($arr) {
return array_reduce($arr, function($prefix, $item) {
$length = min(strlen($prefix), strlen($item));
while (substr($prefix, 0, $length) !== substr($item, 0, $length)) {
$length--;
}
return substr($prefix, 0, $length);
}, $arr[0]);
}
print findLongestPrefix($sports); // Softball -
For what it's worth, here's another alternative I came up with.
I used this for finding the common prefix for a list of products codes (ie. where there are multiple product SKUs that have a common series of characters at the start):
/**
* Try to find a common prefix for a list of strings
*
* #param array $strings
* #return string
*/
function findCommonPrefix(array $strings)
{
$prefix = '';
$chars = array_map("str_split", $strings);
$matches = call_user_func_array("array_intersect_assoc", $chars);
if ($matches) {
$i = 0;
foreach ($matches as $key => $value) {
if ($key != $i) {
unset($matches[$key]);
}
$i++;
}
$prefix = join('', $matches);
}
return $prefix;
}
This is an addition to the #Gumbo answer. If you want to ensure that the chosen, common prefix does not break words, use this. I am just having it look for a blank space at the end of the chosen string. If that exists we know that there was more to all of the phrases, so we truncate it.
function product_name_intersection($array){
$pl = 0; // common prefix length
$n = count($array);
$l = strlen($array[0]);
$first = current($array);
while ($pl < $l) {
$c = $array[0][$pl];
for ($i=1; $i<$n; $i++) {
if (!isset($array[$i][$pl]) || $array[$i][$pl] !== $c) break 2;
}
$pl++;
}
$prefix = substr($array[0], 0, $pl);
if ($pl < strlen($first) && substr($prefix, -1, 1) != ' ') {
$prefix = preg_replace('/\W\w+\s*(\W*)$/', '$1', $prefix);
}
$prefix = preg_replace('/^\W*(.+?)\W*$/', '$1', $prefix);
return $prefix;
}
Sharing a Typescript solution for this question. I split it into 2 methods, just to keep it clean while at it.
function longestCommonPrefix(strs: string[]): string {
let output = '';
if(strs.length > 0) {
output = strs[0];
if(strs.length > 1) {
for(let i=1; i <strs.length; i++) {
output = checkCommonPrefix(output, strs[i]);
}
}
}
return output;
};
function checkCommonPrefix(str1: string, str2: string): string {
let output = '';
let len = Math.min(str1.length, str2.length);
let i = 0;
while(i < len) {
if(str1[i] === str2[i]) {
output += str1[i];
} else {
i = len;
}
i++;
}
return output;
}

Generating Ordered (Weighted) Combinations of Arbitrary Length in PHP

Given a list of common words, sorted in order of prevalence of use, is it possible to form word combinations of an arbitrary length (any desired number of words) in order of the 'most common' sequences. For example,if the most common words are 'a, b, c' then for combinations of length two, the following would be generated:
aa
ab
ba
bb
ac
bc
ca
cb
cc
Here is the correct list for length 3:
aaa
aab
aba
abb
baa
bab
bba
bbb
aac
abc
bac
bbc
aca
acb
bca
bcb
acc
bcc
caa
cab
cba
cbb
cac
cbc
cca
ccb
ccc
This is simple to implement for combinations of 2 or 3 words (set length) for any number of elements, but can this be done for arbitrary lengths? I want to implement this in PHP, but pseudocode or even a summary of the algorithm would be much appreciated!
Here's a recursive function that might be what you need. The idea is, when given a length and a letter, to first generate all sequences that are one letter shorter that don't include that letter. Add the new letter to the end and you have the first part of the sequence that involves that letter. Then move the new letter to the left. Cycle through each sequence of letters including the new one to the right.
So if you had gen(5, d)
It would start with
(aaaa)d
(aaab)d
...
(cccc)d
then when it got done with the a-c combinations it would do
(aaa)d(a)
...
(aaa)d(d)
(aab)d(d)
...
(ccc)d(d)
then when it got done with d as the 4th letter it would move it to the 3rd
(aa)d(aa)
etc., etc.
<?php
/**
* Word Combinations (version c) 6/22/2009 1:20:14 PM
*
* Based on pseudocode in answer provided by Erika:
* http://stackoverflow.com/questions/1024471/generating-ordered-weighted-combinations-of-arbitrary-length-in-php/1028356#1028356
* (direct link to Erika's answer)
*
* To see the results of this script, run it:
* http://stage.dustinfineout.com/stackoverflow/20090622/word_combinations_c.php
**/
init_generator();
function init_generator() {
global $words;
$words = array('a','b','c');
generate_all(5);
}
function generate_all($len){
global $words;
for($i = 0; $i < count($words); $i++){
$res = generate($len, $i);
echo join("<br />", $res);
echo("<br/>");
}
}
function generate($len, $max_index = -1){
global $words;
// WHEN max_index IS NEGATIVE, STARTING POSITION
if ($max_index < 0) {
$max_index = count($words) - 1;
}
$list = array();
if ($len <= 0) {
$list[] = "";
return $list;
}
if ($len == 1) {
if ($max_index >= 1) {
$add = generate(1, ($max_index - 1));
foreach ($add as $addit) {
$list[] = $addit;
}
}
$list[] = $words[$max_index];
return $list;
}
if($max_index == 0) {
$list[] = str_repeat($words[$max_index], $len);
return $list;
}
for ($i = 1; $i <= $len; $i++){
$prefixes = generate(($len - $i), ($max_index - 1));
$postfixes = generate(($i - 1), $max_index);
foreach ($prefixes as $pre){
//print "prefix = $pre<br/>";
foreach ($postfixes as $post){
//print "postfix = $post<br/>";
$list[] = ($pre . $words[$max_index] . $post);
}
}
}
return $list;
}
?>
I googled for php permutations and got: http://www.php.happycodings.com/Algorithms/code21.html
I haven't looked into the code if it is good or not. But it seems to do what you want.
I don't know what the term is for what you're trying to calculate, but it's not combinations or even permutations, it's some sort of permutations-with-repetition.
Below I've enclosed some slightly-adapted code from the nearest thing I have lying around that does something like this, a string permutation generator in LPC. For a, b, c it generates
abc
bac
bca
acb
cab
cba
Probably it can be tweaked to enable the repetition behavior you want.
varargs mixed array permutations(mixed array list, int num) {
mixed array out = ({});
foreach(mixed item : permutations(list[1..], num - 1))
for(int i = 0, int j = sizeof(item); i <= j; i++)
out += ({ implode(item[0 .. i - 1] + ({ list[0] }) + item[i..], "") });
if(num < sizeof(list))
out += permutations(list[1..], num);
return out;
}
FWIW, another way of stating your problem is that, for an input of N elements, you want the set of all paths of length N in a fully-connected, self-connected graph with the input elements as nodes.
I'm assuming that when saying it's easy for fixed length, you're using m nested loops, where m is the lenght of the sequence (2 and 3 in your examples).
You could use recursion like this:
Your words are numbered 0, 1, .. n, you need to generate all sequences of length m:
generate all sequences of length m:
{
start with 0, and generate all sequences of length m-1
start with 1, and generate all sequences of length m-1
...
start with n, and generate all sequences of length m-1
}
generate all sequences of length 0
{
// nothing to do
}
How to implement this? Well, in each call you can push one more element to the end of the array, and when you hit the end of the recursion, print out array's contents:
// m is remaining length of sequence, elements is array with numbers so far
generate(m, elements)
{
if (m == 0)
{
for j = 0 to elements.length print(words[j]);
}
else
{
for i = 0 to n - 1
{
generate(m-1, elements.push(i));
}
}
}
And finally, call it like this: generate(6, array())

Categories