I want to write a word-wrap algorithm in PHP. It should split small chunks of text (short phrases) into n lines of at most m characters each (n is not given, so there will be as many lines as needed). The peculiarity is that the line lengths (in characters) have to be as balanced as possible across lines.
Example of input text:
How to do things
Wrong output (this is the normal word-wrap behavior), m=6:
How to
do
things
Desired output, always m=6:
How
to do
things
Does anyone have suggestions or guidelines on how to implement this function? Basically, I'm looking for a way to pretty-print short phrases on two or three lines of (as nearly as possible) equal length.
Update: It seems I'm looking for exactly a minimum raggedness word wrap algorithm. But I can't find an implementation in a real programming language (any language will do; I can then convert it to PHP).
Update 2: I started a bounty for this. Is it possible that there is no public implementation of the minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find so far is a bunch of (generic) equations that still need an optimal search procedure. I would also be grateful for an implementation that only approximates the optimal search.
I've implemented it along the same lines as Alex, coding the Wikipedia algorithm, but directly in PHP (an interesting exercise for me). Understanding how to use the optimal cost function f(j), i.e. the 'recurrence' part, is not very easy. Thanks to Alex for the well-commented code.
/**
* minimumRaggedness
*
* @param string $input paragraph. Each word separated by 1 space.
* @param int $LineWidth the max chars per line.
* @param string $lineBreak wrapped lines separator.
*
* @return string $output the paragraph wrapped.
*/
function minimumRaggedness($input, $LineWidth, $lineBreak = "\n")
{
$words = explode(" ", $input);
$wsnum = count($words);
$wslen = array_map("strlen", $words);
$inf = 1000000; //PHP_INT_MAX;
// keep Costs
$C = array();
for ($i = 0; $i < $wsnum; ++$i)
{
$C[] = array();
for ($j = $i; $j < $wsnum; ++$j)
{
$l = 0;
for ($k = $i; $k <= $j; ++$k)
$l += $wslen[$k];
$c = $LineWidth - ($j - $i) - $l;
if ($c < 0)
$c = $inf;
else
$c = $c * $c;
$C[$i][$j] = $c;
}
}
// apply recurrence
$F = array();
$W = array();
for ($j = 0; $j < $wsnum; ++$j)
{
$F[$j] = $C[0][$j];
$W[$j] = 0;
if ($F[$j] == $inf)
{
for ($k = 0; $k < $j; ++$k)
{
$t = $F[$k] + $C[$k + 1][$j];
if ($t < $F[$j])
{
$F[$j] = $t;
$W[$j] = $k + 1;
}
}
}
}
// rebuild wrapped paragraph
$output = "";
if ($F[$wsnum - 1] < $inf)
{
$S = array();
$j = $wsnum - 1;
for ( ; ; )
{
$S[] = $j;
$S[] = $W[$j];
if ($W[$j] == 0)
break;
$j = $W[$j] - 1;
}
$pS = count($S) - 1;
do
{
$i = $S[$pS--];
$j = $S[$pS--];
for ($k = $i; $k < $j; $k++)
$output .= $words[$k] . " ";
$output .= $words[$k] . $lineBreak;
}
while ($j < $wsnum - 1);
}
else
$output = $input;
return $output;
}
?>
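For readers who find the recurrence easier to follow in a compact form, here is a minimal Python sketch of the same squared-slack dynamic program. It is not a translation of the PHP above: the names `wrap_min_raggedness`, `best`, and `start` are made up for illustration, and the sketch always takes the minimum over every possible last line instead of using the one-line cost directly when everything fits.

```python
def wrap_min_raggedness(text, width):
    """Wrap text so that the sum of squared trailing slack is minimal."""
    words = text.split()
    n = len(words)
    inf = float("inf")
    # best[j]: minimal total cost of wrapping the first j words
    best = [0] + [inf] * n
    # start[j]: index of the first word on the last line of that wrapping
    start = [0] * (n + 1)
    for j in range(1, n + 1):
        length = -1  # length of words[i:j] joined by single spaces
        for i in range(j - 1, -1, -1):
            length += len(words[i]) + 1
            if length > width:
                break  # adding more words only makes the line longer
            cost = best[i] + (width - length) ** 2
            if cost < best[j]:
                best[j], start[j] = cost, i
    if best[n] == inf:
        return [text]  # some word is longer than width; give up
    # Walk the stored break points backwards, then reverse
    lines, j = [], n
    while j > 0:
        i = start[j]
        lines.append(" ".join(words[i:j]))
        j = i
    return lines[::-1]


print("\n".join(wrap_min_raggedness("How to do things", 6)))
```

With the question's example this prints "How", "to do", "things" on three lines, for a total cost of 9 + 1 + 0 = 10.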
Quick and dirty, in C++:
#include <sstream>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <string>
#include <cstring>
using namespace std;
int cac[1000][1000];
string res[1000][1000];
vector<string> words;
int M;
int go(int a, int b){
if(cac[a][b]>= 0) return cac[a][b];
if(a == b) return 0;
int csum = -1;
for(int i=a; i<b; ++i){
csum += words[i].size() + 1;
}
if(csum <= M || a == b-1){
string sep = "";
for(int i=a; i<b; ++i){
res[a][b].append(sep);
res[a][b].append(words[i]);
sep = " ";
}
return cac[a][b] = (M-csum)*(M-csum);
}
int ret = 1000000000;
int best_sp = -1;
for(int sp=a+1; sp<b; ++sp){
int cur = go(a, sp) + go(sp,b);
if(cur <= ret){
ret = cur;
best_sp = sp;
}
}
res[a][b] = res[a][best_sp] + "\n" + res[best_sp][b];
return cac[a][b] = ret;
}
int main(int argc, char ** argv){
memset(cac, -1, sizeof(cac));
M = atoi(argv[1]);
string word;
while(cin >> word) words.push_back(word);
go(0, words.size());
cout << res[0][words.size()] << endl;
}
Test:
$ echo "The quick brown fox jumps over a lazy dog" |./a.out 10
The quick
brown fox
jumps over
a lazy dog
EDIT: I just looked at the Wikipedia page for minimum raggedness word wrap and changed the algorithm to the one given there (with squared penalties).
A C version:
// This is a direct implementation of the minimum raggedness word wrapping
// algorithm from http://en.wikipedia.org/wiki/Word_wrap#Minimum_raggedness
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#include <limits.h>
const char* pText = "How to do things";
int LineWidth = 6;
int WordCnt;
const char** pWords;
int* pWordLengths;
int* pC;
int* pF;
int* pW;
int* pS;
int CountWords(const char* p)
{
int cnt = 0;
while (*p != '\0')
{
while (*p != '\0' && isspace(*p)) p++;
if (*p != '\0')
{
cnt++;
while (*p != '\0' && !isspace(*p)) p++;
}
}
return cnt;
}
void FindWords(const char* p, int cnt, const char** pWords, int* pWordLengths)
{
while (*p != '\0')
{
while (*p != '\0' && isspace(*p)) p++;
if (*p != '\0')
{
*pWords++ = p;
while (*p != '\0' && !isspace(*p)) p++;
*pWordLengths++ = p - pWords[-1];
}
}
}
void PrintWord(const char* p, int l)
{
int i;
for (i = 0; i < l; i++)
printf("%c", p[i]);
}
// 1st program's argument is the text
// 2nd program's argument is the line width
int main(int argc, char* argv[])
{
int i, j;
if (argc >= 3)
{
pText = argv[1];
LineWidth = atoi(argv[2]);
}
WordCnt = CountWords(pText);
pWords = malloc(WordCnt * sizeof(*pWords));
pWordLengths = malloc(WordCnt * sizeof(*pWordLengths));
FindWords(pText, WordCnt, pWords, pWordLengths);
printf("Input Text: \"%s\"\n", pText);
printf("Line Width: %d\n", LineWidth);
printf("Words : %d\n", WordCnt);
#if 0
for (i = 0; i < WordCnt; i++)
{
printf("\"");
PrintWord(pWords[i], pWordLengths[i]);
printf("\"\n");
}
#endif
// Build c(i,j) in pC[]
pC = malloc(WordCnt * WordCnt * sizeof(int));
for (i = 0; i < WordCnt; i++)
{
for (j = 0; j < WordCnt; j++)
if (j >= i)
{
int k;
int c = LineWidth - (j - i);
for (k = i; k <= j; k++) c -= pWordLengths[k];
c = (c >= 0) ? c * c : INT_MAX;
pC[j * WordCnt + i] = c;
}
else
pC[j * WordCnt + i] = INT_MAX;
}
// Build f(j) in pF[] and store the wrap points in pW[]
pF = malloc(WordCnt * sizeof(int));
pW = malloc(WordCnt * sizeof(int));
for (j = 0; j < WordCnt; j++)
{
pW[j] = 0;
if ((pF[j] = pC[j * WordCnt]) == INT_MAX)
{
int k;
for (k = 0; k < j; k++)
{
int s;
if (pF[k] == INT_MAX || pC[j * WordCnt + k + 1] == INT_MAX)
s = INT_MAX;
else
s = pF[k] + pC[j * WordCnt + k + 1];
if (pF[j] > s)
{
pF[j] = s;
pW[j] = k + 1;
}
}
}
}
// Print the optimal solution cost
printf("f : %d\n", pF[WordCnt - 1]);
// Print the optimal solution, if any
pS = malloc(2 * WordCnt * sizeof(int));
if (pF[WordCnt - 1] != INT_MAX)
{
// Work out the solution's words by back tracking the
// wrap points from pW[] and store them on the pS[] stack
j = WordCnt - 1;
for (;;)
{
*pS++ = j;
*pS++ = pW[j];
if (!pW[j]) break;
j = pW[j] - 1;
}
// Print the solution line by line, word by word
// in direct order
do
{
int k;
i = *--pS;
j = *--pS;
for (k = i; k <= j; k++)
{
PrintWord(pWords[k], pWordLengths[k]);
printf(" ");
}
printf("\n");
} while (j < WordCnt - 1);
}
return 0;
}
Output 1:
ww.exe
Input Text: "How to do things"
Line Width: 6
Words : 4
f : 10
How
to do
things
Output 2:
ww.exe "aaa bb cc ddddd" 6
Input Text: "aaa bb cc ddddd"
Line Width: 6
Words : 4
f : 11
aaa
bb cc
ddddd
Output 3:
ww.exe "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm." 60
Input Text: "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm."
Line Width: 60
Words : 73
f : 241
I started a bounty for this. Is it possible that do not
exist any public implementation of Minimum raggedness
algorithm in any procedural language? I need something
written in a way that can be translated into procedural
instructions. All I can find now is just a bounch of
(generic) equation that however need a optimal searhing
procedure. I will be grateful also for an implementation
that can only approximate that optimal searching algorithm.
I think the simplest way to look at it is to iterate over the candidate widths between two limits and keep the best-scoring result.
E.g.
/**
* balancedWordWrap
*
* @param string $input
* @param int $maxWidth the max chars per line
*/
function balancedWordWrap($input, $maxWidth = null) {
$length = strlen($input);
if (!$maxWidth) {
$maxWidth = min(ceil($length / 2), 75);
}
$minWidth = min(ceil($length / 2), $maxWidth / 2);
$permutations = array();
$scores = array();
$lowestScore = PHP_INT_MAX; // any real score is lower, even for long inputs
$lowest = $minWidth;
foreach(range($minWidth, $maxWidth) as $width) {
$permutations[$width] = wordwrap($input, $width);
$lines = explode("\n", $permutations[$width]);
$max = 0;
foreach($lines as $line) {
$lineLength = strlen($line);
if ($lineLength > $max) {
$max = $lineLength;
}
}
$score = 0;
foreach($lines as $line) {
$lineLength = strlen($line);
$score += pow($max - $lineLength, 2);
}
$scores[$width] = $score;
if ($score < $lowestScore) {
$lowestScore = $score;
$lowest = $width;
}
}
return $permutations[$lowest];
}
Given the input "how to do things"
it outputs
How
to do
things
Given the input "Mary had a little lamb"
it outputs
Mary had a
little lamb
Given the input "This extra-long paragraph was writtin to demonstrate how the fmt(1) program handles longer inputs. When testing inputs, you don\'t want them to be too short, nor too long, because the quality of the program can only be determined upon inspection of complex content. The quick brown fox jumps over the lazy dog. Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.", and limited to 75 chars max width, it outputs:
This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them
be too short, nor too long, because the quality of the program can only be
determined upon inspection of complex content. The quick brown fox jumps
over the lazy dog. Congress shall make no law respecting an establishment
of religion, or prohibiting the free exercise thereof; or abridging the
freedom of speech, or of the press; or the right of the people peaceably
to assemble, and to petition the Government for a redress of grievances.
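The width-scanning idea from `balancedWordWrap` can be sketched in Python as follows. This is only an illustration under a few assumptions: it uses the standard `textwrap.wrap` in place of PHP's `wordwrap`, it scans widths from half of `max_width` up to `max_width`, and the function name `balanced_wrap` is made up.

```python
import textwrap


def balanced_wrap(text, max_width):
    """Try every width in [max_width // 2, max_width]; keep the most even result."""
    best_lines, best_score = [], None
    min_width = max(1, max_width // 2)
    for width in range(min_width, max_width + 1):
        # Greedy wrap at this width; long words are kept whole
        lines = textwrap.wrap(text, width, break_long_words=False)
        if not lines:
            return []
        longest = max(len(line) for line in lines)
        # Penalize each line by its squared distance from the longest line
        score = sum((longest - len(line)) ** 2 for line in lines)
        if best_score is None or score < best_score:
            best_lines, best_score = lines, score
    return best_lines


print("\n".join(balanced_wrap("How to do things", 6)))
```

Because every candidate is still a greedy wrap, this only approximates the minimum raggedness result, but for short phrases it is often good enough.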
Justin's link to Knuth's Breaking Paragraphs Into Lines is the historically best answer. (Newer systems also apply microtypography techniques such as fiddling with character widths, kerning, and so on, but if you're simply looking for monospaced plain-text, these extra approaches won't help.)
If you just want to solve the problem, the fmt(1) utility supplied on many Linux systems by the Free Software Foundation implements a variant of Knuth's algorithm that also attempts to avoid line breaks at the end of sentences. I wrote your inputs and a larger example, and ran them through fmt -w 20 to force 20-character lines:
$ fmt -w 20 input
Lorem ipsum dolor
sit amet
Supercalifragilisticexpialidocious
and some other
small words
One long
extra-long-word
This extra-long
paragraph
was writtin to
demonstrate how the
`fmt(1)` program
handles longer
inputs. When
testing inputs,
you don't want them
to be too short,
nor too long,
because the quality
of the program can
only be determined
upon inspection
of complex
content. The quick
brown fox jumps
over the lazy
dog. Congress
shall make no
law respecting
an establishment
of religion, or
prohibiting the
free exercise
thereof; or
abridging the
freedom of speech,
or of the press;
or the right of the
people peaceably
to assemble,
and to petition
the Government
for a redress of
grievances.
The output looks much better if you allow it the default 75 characters width for non-trivial input:
$ fmt input
Lorem ipsum dolor sit amet
Supercalifragilisticexpialidocious and some other small words
One long extra-long-word
This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them
to be too short, nor too long, because the quality of the program can
only be determined upon inspection of complex content. The quick brown
fox jumps over the lazy dog. Congress shall make no law respecting an
establishment of religion, or prohibiting the free exercise thereof;
or abridging the freedom of speech, or of the press; or the right of
the people peaceably to assemble, and to petition the Government for a
redress of grievances.
Here is a bash version:
#!/bin/bash
if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
echo "Usage: balance <width> [ <string> ]"
echo " "
echo " if string is not passed as parameter it will be read from STDIN"
exit 2
elif [ $# -le 1 ] ; then
LINE=`cat`
else
LINE="$2"
fi
LINES=`echo "$LINE" | fold -s -w $1 | wc -l`
MAX=$1
MIN=0
while [ $MAX -gt $(($MIN+1)) ]
do
TRY=$(( $MAX + $MIN >> 1 ))
NUM=`echo "$LINE" | fold -s -w $TRY | wc -l`
if [ $NUM -le $LINES ] ; then
MAX=$TRY
else
MIN=$TRY
fi
done
echo "$LINE" | fold -s -w $MAX
example:
$ balance 50 "Now is the time for all good men to come to the aid of the party."
Now is the time for all good men
to come to the aid of the party.
Requires 'fold' and 'wc' which are usually available where bash is installed.
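The script above is a binary search for the narrowest width that still produces the same number of lines as the requested maximum. A rough Python equivalent, assuming `textwrap.wrap` as a stand-in for `fold -s` (the two do not break lines identically in every case) and a positive `max_width`, might look like this; the name `balance` mirrors the script but is otherwise arbitrary.

```python
import textwrap


def balance(text, max_width):
    """Shrink the wrap width as far as possible without adding lines."""
    def count(width):
        return len(textwrap.wrap(text, width, break_long_words=False))

    target = count(max_width)
    # Invariant: wrapping at hi needs at most target lines; lo is too narrow
    lo, hi = 0, max_width
    while hi > lo + 1:
        mid = (lo + hi) // 2
        if count(mid) <= target:
            hi = mid
        else:
            lo = mid
    return textwrap.wrap(text, hi, break_long_words=False)


sentence = "Now is the time for all good men to come to the aid of the party."
print("\n".join(balance(sentence, 50)))
```

The search relies on the line count never increasing as the width grows, which holds for a greedy wrap.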
Related
I've written a small PHP function to find the length of the longest palindromic substring of a string. To avoid many loops, I've used recursion.
The idea behind the algorithm is to loop through the string and, for each center (both on a character and between two characters), recursively compare the characters at the left and right carets. The iteration for a particular center ends when the characters are not equal or one of the carets moves out of the string's range.
Questions:
1) Could you please write out the calculation that explains the time complexity of this algorithm? In my understanding it's O(n^2), but I'm struggling to confirm that with a detailed calculation.
2) What do you think about this solution? Any suggestions for improvement (considering it was written in 45 minutes just for practice)? Are there better approaches from a time complexity perspective?
To simplify the example I've dropped some input checks (more in comments).
Thanks guys, cheers.
<?php
/**
* Find length of the longest palindromic substring of a string.
*
* O(n^2)
* questions by developer
* 1) Is the solution meant to be case sensitive? (no)
* 2) Do phrase palindromes need to be taken into account? (no)
* 3) What about punctuation? (no)
*/
$input = 'tttabcbarabb';
$input2 = 'taat';
$input3 = 'aaaaaa';
$input4 = 'ccc';
$input5 = 'bbbb';
$input6 = 'axvfdaaaaagdgre';
$input7 = 'adsasdabcgeeegcbgtrhtyjtj';
function getLenRecursive($l, $r, $word)
{
if ($word === null || strlen($word) === 0) {
return 0;
}
if ($l < 0 || !isset($word[$r]) || $word[$l] != $word[$r]) {
$longest = ($r - 1) - ($l + 1) + 1;
return !$longest ? 1 : $longest;
}
--$l;
++$r;
return getLenRecursive($l, $r, $word);
}
function getLongestPalSubstrLength($inp)
{
if ($inp === null || strlen($inp) === 0) {
return 0;
}
$longestLength = 1;
for ($i = 0; $i <= strlen($inp); $i++) {
$l = $i - 1;
$r = $i + 1;
$length = getLenRecursive($l, $r, $inp); # around char
if ($i > 0) {
$length2 = getLenRecursive($l, $i, $inp); # around center
$longerOne = $length > $length2 ? $length : $length2;
} else {
$longerOne = $length;
}
$longestLength = $longerOne > $longestLength ? $longerOne : $longestLength;
}
return $longestLength;
}
echo 'expected: 5, got: ';
var_dump(getLongestPalSubstrLength($input));
echo 'expected: 4, got: ';
var_dump(getLongestPalSubstrLength($input2));
echo 'expected: 6, got: ';
var_dump(getLongestPalSubstrLength($input3));
echo 'expected: 3, got: ';
var_dump(getLongestPalSubstrLength($input4));
echo 'expected: 4, got: ';
var_dump(getLongestPalSubstrLength($input5));
echo 'expected: 5, got: ';
var_dump(getLongestPalSubstrLength($input6));
echo 'expected: 9, got: ';
var_dump(getLongestPalSubstrLength($input7));
Your code doesn't really need to be recursive. A simple while loop would do just fine.
Yes, complexity is O(N^2). You have N options for selecting the middle point. For each middle point, the number of recursion steps ranges from 1 to N/2. The sum of all that is 2 * (N/2) * (N/2 + 1) / 2, and that is O(N^2).
For code review, I wouldn't do recursion here since it's fairly straightforward and you don't need the stack at all. I would replace it with a while loop (still in a separate function, to make the code more readable).
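As a rough illustration of the suggested while-loop version, here is a short Python sketch of expanding around each center. The function and helper names are made up for this example; it is not a line-by-line port of the PHP.

```python
def longest_palindrome_length(s):
    """Length of the longest palindromic substring, expanding around centers."""
    def expand(left, right):
        # Grow the window while it stays inside s and remains a palindrome
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        # The loop overshoots by one position on each side
        return right - left - 1

    best = 0
    for center in range(len(s)):
        odd = expand(center, center)        # center on a character
        even = expand(center, center + 1)   # center between two characters
        best = max(best, odd, even)
    return best


print(longest_palindrome_length("tttabcbarabb"))
```

Each of the roughly 2N centers does at most N/2 comparisons, which is where the O(N^2) bound comes from.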
I'm adapting VBA code I found in a spreadsheet for an online PHP application. The code uses the bisection method to find the optimal value for a calculation required to price options.
Upper = estimate_upper
Lower = estimate_lower
UUpper = container
Start_Iteration:
IterationCountE = 0.000000001
While (Upper - Lower) > IterationCountE
Mid = (Upper + Lower) / 2
c1 = calculations1...
c2 = calculations2...
If (c2 - c1) > 0 Then
Lower = Mid
Else
Upper = MId
End If
Wend 'Ends the while loop
If (Round(Mid, 4) = Round(UUpper, 4)) Then
Upper = 2 * UUpper
UUpper = Upper
GoTo Start_Iteration
End If
Function = Mid
For the most part, I understand the mechanics of the iteration. My attempted PHP conversion is as follows:
$IterationCountE = 0.00000000001;
while ( ($Upper - $Lower) > $IterationCountE ) {
$Mid = ($Upper + $Lower) / 2;
$c1 = calculation1();
$c2 = calculation2();
if ( ($c2 - $c1) > 0 ) {
$Lower = $Mid;
} else {
$Upper = $Mid;
}
if (round($Mid, 4) == round($UUpper, 4)) {
$Upper = 2 * $UUpper;
$UUpper = $Upper;
}
}
return $Mid;
Is this the best way to approach a similar iteration in PHP? Would it be better to wrap the iteration in a function and refer back to it, like in the VBA code?
I do not get the same value results when comparing the PHP output to the value from a macro.
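For comparison, the control flow of the VBA listing can be sketched in Python as below. This is only a sketch under assumptions: `diff` is a hypothetical callable standing in for `c2 - c1` evaluated at `Mid`, and the function name and the `max_restarts` guard are not part of the original. Two things the sketch keeps from the VBA are the tolerance of 0.000000001 and the fact that the bracket-doubling check runs only after the inner While loop has finished.

```python
def bisect_with_expansion(diff, lower, upper, tol=1e-9, max_restarts=50):
    """Bisection that doubles the upper bound when the result sticks to it."""
    container = upper          # plays the role of UUpper in the VBA
    mid = (lower + upper) / 2
    for _ in range(max_restarts):  # replaces the GoTo; guard is an assumption
        while upper - lower > tol:
            mid = (upper + lower) / 2
            if diff(mid) > 0:      # corresponds to (c2 - c1) > 0
                lower = mid
            else:
                upper = mid
        # Checked once per pass, after the inner loop has converged
        if round(mid, 4) != round(container, 4):
            return mid
        upper = 2 * container      # the bracket was too small: widen and retry
        container = upper
    return mid


# Hypothetical example: the root of 10 - x lies outside the initial bracket [0, 4]
print(round(bisect_with_expansion(lambda x: 10 - x, 0.0, 4.0), 6))
```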
I have recently come across an interesting question on strings. Suppose you are given the following:
Input string1: "this is a test string"
Input string2: "tist"
Output string: "t stri"
So, given the above, how can I go about finding the smallest substring of string1 that contains all the characters from string2?
To see more details including working code, check my blog post at:
http://www.leetcode.com/2010/11/finding-minimum-window-in-s-which.html
To help illustrate this approach, I use an example: string1 = "acbbaca" and string2 = "aba". Here, we also use the term "window", which means a contiguous block of characters from string1 (could be interchanged with the term substring).
i) string1 = "acbbaca" and string2 = "aba".
ii) The first minimum window is found.
Notice that we cannot advance begin
pointer as hasFound['a'] ==
needToFind['a'] == 2. Advancing would
mean breaking the constraint.
iii) The second window is found. begin
pointer still points to the first
element 'a'. hasFound['a'] (3) is
greater than needToFind['a'] (2). We
decrement hasFound['a'] by one and
advance begin pointer to the right.
iv) We skip 'c' since it is not found
in string2. Begin pointer now points to 'b'.
hasFound['b'] (2) is greater than
needToFind['b'] (1). We decrement
hasFound['b'] by one and advance begin
pointer to the right.
v) Begin pointer now points to the
next 'b'. hasFound['b'] (1) is equal
to needToFind['b'] (1). We stop
immediately and this is our newly
found minimum window.
The idea is mainly based on two pointers (the begin and end positions of the window) and two tables (needToFind and hasFound) that are maintained while traversing string1. needToFind stores the total count of each character in string2, and hasFound stores the total count of each character met so far. We also use a count variable to store the total number of characters of string2 met so far (not counting characters where hasFound[x] exceeds needToFind[x]). When count equals string2's length, we know a valid window has been found.
Each time we advance the end pointer (pointing to an element x), we increment hasFound[x] by one. We also increment count by one if hasFound[x] is less than or equal to needToFind[x]. Why? When the constraint is met (that is, count equals string2's size), we immediately advance the begin pointer as far right as possible while maintaining the constraint.
How do we check if it is maintaining the constraint? Assume that begin points to an element x; we check if hasFound[x] is greater than needToFind[x]. If it is, we can decrement hasFound[x] by one and advance the begin pointer without breaking the constraint. On the other hand, if it is not, we stop immediately, as advancing the begin pointer would break the window constraint.
Finally, we check whether the new window's length is less than the current minimum, and update the current minimum if it is.
Essentially, the algorithm finds the first window that satisfies the constraint, then continues to maintain the constraint throughout.
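The two-pointer procedure described above can be sketched in Python roughly as follows. The table names follow the description (`need_to_find`, `has_found`, `count`); the function name `min_window` and the choice to return an empty string when no window exists are assumptions made for this example.

```python
from collections import Counter


def min_window(string1, string2):
    """Smallest substring of string1 containing all characters of string2."""
    need_to_find = Counter(string2)
    has_found = Counter()
    count = 0      # characters of string2 matched so far, excess not counted
    best = ""
    begin = 0
    for end, ch in enumerate(string1):
        if need_to_find[ch] == 0:
            continue                      # character is not in string2
        has_found[ch] += 1
        if has_found[ch] <= need_to_find[ch]:
            count += 1
        if count == len(string2):
            # Advance begin as far right as the constraint allows
            while (need_to_find[string1[begin]] == 0
                   or has_found[string1[begin]] > need_to_find[string1[begin]]):
                if has_found[string1[begin]] > need_to_find[string1[begin]]:
                    has_found[string1[begin]] -= 1
                begin += 1
            if not best or end - begin + 1 < len(best):
                best = string1[begin:end + 1]
    return best


print(min_window("acbbaca", "aba"))
```

For the walkthrough's example this prints `baca`, the window reached in step v).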
You can do a histogram sweep in O(N+M) time and O(1) space where N is the number of characters in the first string and M is the number of characters in the second.
It works like this:
Make a histogram of the second string's characters (key operation is hist2[ s2[i] ]++).
Make a cumulative histogram of the first string's characters until that histogram contains every character that the second string's histogram contains (which I will call "the histogram condition").
Then move forwards on the first string, subtracting from the histogram, until it fails to meet the histogram condition. Mark that bit of the first string (before the final move) as your tentative substring.
Move the front of the substring forwards again until you meet the histogram condition again. Move the end forwards until it fails again. If this is a shorter substring than the first, mark that as your tentative substring.
Repeat until you've passed through the entire first string.
The marked substring is your answer.
Note that by varying the check you use on the histogram condition, you can choose either to have the same set of characters as the second string, or at least as many characters of each type. (It's just the difference between a[i]>0 && b[i]>0 and a[i]>=b[i].)
You can speed up the histogram checks if you keep track of which condition is not satisfied when you're trying to satisfy it, and check only the thing that you decrement when you're trying to break it. (On the initial buildup, you count how many items you've satisfied, and increment that count every time you add a new character that takes the condition from false to true.)
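To make the two variants of the histogram condition concrete, here is a small Python sketch. The helper names `covers_set` and `covers_counts` are invented for this example, and `Counter` is used as the histogram; this is the plain, unoptimized check, without the bookkeeping speed-up described above.

```python
from collections import Counter


def covers_set(window_hist, target_hist):
    """True if every character of the target occurs at least once in the window."""
    return all(window_hist[c] > 0 for c in target_hist)


def covers_counts(window_hist, target_hist):
    """True if the window has at least as many of each character as the target."""
    return all(window_hist[c] >= n for c, n in target_hist.items())


target = Counter("aab")
print(covers_set(Counter("ab"), target), covers_counts(Counter("ab"), target))
print(covers_set(Counter("aba"), target), covers_counts(Counter("aba"), target))
```

The window "ab" satisfies the set condition but not the count condition for the target "aab", while "aba" satisfies both.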
Here's an O(n) solution. The basic idea is simple: for each starting index, find the least ending index such that the substring contains all of the necessary letters. The trick is that the least ending index increases over the course of the function, so with a little data structure support, we consider each character at most twice.
In Python:
from collections import defaultdict
def smallest(s1, s2):
    assert s2 != ''
    d = defaultdict(int)
    nneg = [0]  # number of negative entries in d
    def incr(c):
        d[c] += 1
        if d[c] == 0:
            nneg[0] -= 1
    def decr(c):
        if d[c] == 0:
            nneg[0] += 1
        d[c] -= 1
    for c in s2:
        decr(c)
    minlen = len(s1) + 1
    j = 0
    for i in range(len(s1)):  # xrange in Python 2
        while nneg[0] > 0:
            if j >= len(s1):
                return minlen
            incr(s1[j])
            j += 1
        minlen = min(minlen, j - i)
        decr(s1[i])
    return minlen
I received the same interview question. I am a C++ candidate, but I was in a position to code relatively fast in Java.
Java [Courtesy : Sumod Mathilakath]
import java.io.*;
import java.util.*;
class UserMainCode
{
public String GetSubString(String input1,String input2){
// Write code here...
return find(input1, input2);
}
private static boolean containsPatternChar(int[] sCount, int[] pCount) {
for(int i=0;i<256;i++) {
if(pCount[i]>sCount[i])
return false;
}
return true;
}
public static String find(String s, String p) {
if (p.length() > s.length())
return null;
int[] pCount = new int[256];
int[] sCount = new int[256];
// Time: O(p.length)
for(int i=0;i<p.length();i++) {
pCount[(int)(p.charAt(i))]++;
sCount[(int)(s.charAt(i))]++;
}
int i = 0, j = p.length(), min = Integer.MAX_VALUE;
String res = null;
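// Time: O(s.length)
while (true) {
if (containsPatternChar(sCount, pCount)) {
if ((j - i) < min) {
min = j - i;
res = s.substring(i, j);
// This is the smallest possible substring.
if(min==p.length())
break;
}
// Reduce the window size, even when this window is not a new minimum;
// otherwise the loop would never advance.
sCount[(int)(s.charAt(i))]--;
i++;
} else {
// Stop only once the window can no longer grow, so that a window
// ending at the last character is still checked.
if (j >= s.length())
break;
sCount[(int)(s.charAt(j))]++;
// Increase the window size.
j++;
}
}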
System.out.println(res);
return res;
}
}
C++ [Courtesy : sundeepblue]
#include <iostream>
#include <vector>
#include <string>
#include <climits>
using namespace std;
string find_minimum_window(string s, string t) {
if(s.empty() || t.empty()) return ""; // a string function must return a value
int ns = s.size(), nt = t.size();
vector<int> total(256, 0);
vector<int> sofar(256, 0);
for(int i=0; i<nt; i++)
total[t[i]]++;
int L = 0, R;
int minL = 0; //gist2
int count = 0;
int min_win_len = INT_MAX;
for(R=0; R<ns; R++) { // gist0, a big for loop
if(total[s[R]] == 0) continue;
else sofar[s[R]]++;
if(sofar[s[R]] <= total[s[R]]) // gist1, <= not <
count++;
if(count == nt) { // POS1
while(true) {
char c = s[L];
if(total[c] == 0) { L++; }
else if(sofar[c] > total[c]) {
sofar[c]--;
L++;
}
else break;
}
if(R - L + 1 < min_win_len) { // this judge should be inside POS1
min_win_len = R - L + 1;
minL = L;
}
}
}
string res;
if(count == nt) // gist3, cannot forget this.
res = s.substr(minL, min_win_len); // gist4, start from "minL" not "L"
return res;
}
int main() {
string s = "abdccdedca";
cout << find_minimum_window(s, "acd");
}
Erlang [Courtesy : wardbekker]
-module(leetcode).
-export([min_window/0]).
%% Given a string S and a string T, find the minimum window in S which will contain all the characters in T in complexity O(n).
%% For example,
%% S = "ADOBECODEBANC"
%% T = "ABC"
%% Minimum window is "BANC".
%% Note:
%% If there is no such window in S that covers all characters in T, return the empty string "".
%% If there are multiple such windows, you are guaranteed that there will always be only one unique minimum window in S.
min_window() ->
"eca" = min_window("cabeca", "cae"),
"eca" = min_window("cfabeca", "cae"),
"aec" = min_window("cabefgecdaecf", "cae"),
"cwae" = min_window("cabwefgewcwaefcf", "cae"),
"BANC" = min_window("ADOBECODEBANC", "ABC"),
ok.
min_window(T, S) ->
min_window(T, S, []).
min_window([], _T, MinWindow) ->
MinWindow;
min_window([H | Rest], T, MinWindow) ->
NewMinWindow = case lists:member(H, T) of
true ->
MinWindowFound = fullfill_window(Rest, lists:delete(H, T), [H]),
case length(MinWindow) == 0 orelse (length(MinWindow) > length(MinWindowFound)
andalso length(MinWindowFound) > 0) of
true ->
MinWindowFound;
false ->
MinWindow
end;
false ->
MinWindow
end,
min_window(Rest, T, NewMinWindow).
fullfill_window(_, [], Acc) ->
%% window completed
Acc;
fullfill_window([], _T, _Acc) ->
%% no window found
"";
fullfill_window([H | Rest], T, Acc) ->
%% completing window
case lists:member(H, T) of
true ->
fullfill_window(Rest, lists:delete(H, T), Acc ++ [H]);
false ->
fullfill_window(Rest, T, Acc ++ [H])
end.
REF:
http://articles.leetcode.com/finding-minimum-window-in-s-which/#comment-511216
http://www.mif.vu.lt/~valdas/ALGORITMAI/LITERATURA/Cormen/Cormen.pdf
Please have a look at this as well:
//-----------------------------------------------------------------------
bool IsInSet(char ch, char* cSet)
{
char* cSetptr = cSet;
int index = 0;
while (*(cSet+ index) != '\0')
{
if(ch == *(cSet+ index))
{
return true;
}
++index;
}
return false;
}
void removeChar(char ch, char* cSet)
{
bool bShift = false;
int index = 0;
while (*(cSet + index) != '\0')
{
if( (ch == *(cSet + index)) || bShift)
{
*(cSet + index) = *(cSet + index + 1);
bShift = true;
}
++index;
}
}
typedef struct subStr
{
short iStart;
short iEnd;
short szStr;
}ss;
char* subStringSmallest(char* testStr, char* cSet)
{
char* subString = NULL;
int iSzSet = strlen(cSet) + 1;
int iSzString = strlen(testStr)+ 1;
char* cSetBackUp = new char[iSzSet];
memcpy((void*)cSetBackUp, (void*)cSet, iSzSet);
int iStartIndx = -1;
int iEndIndx = -1;
int iIndexStartNext = -1;
std::vector<ss> subStrVec;
int index = 0;
while( *(testStr+index) != '\0' )
{
if (IsInSet(*(testStr+index), cSetBackUp))
{
removeChar(*(testStr+index), cSetBackUp);
if(iStartIndx < 0)
{
iStartIndx = index;
}
else if( iIndexStartNext < 0)
iIndexStartNext = index;
else
;
if (strlen(cSetBackUp) == 0 )
{
iEndIndx = index;
if( iIndexStartNext == -1)
break;
else
{
index = iIndexStartNext;
ss stemp = {iStartIndx, iEndIndx, (iEndIndx-iStartIndx + 1)};
subStrVec.push_back(stemp);
iStartIndx = iEndIndx = iIndexStartNext = -1;
memcpy((void*)cSetBackUp, (void*)cSet, iSzSet);
continue;
}
}
}
else
{
if (IsInSet(*(testStr+index), cSet))
{
if(iIndexStartNext < 0)
iIndexStartNext = index;
}
}
++index;
}
int indexSmallest = 0;
for(int indexVec = 0; indexVec < subStrVec.size(); ++indexVec)
{
if(subStrVec[indexSmallest].szStr > subStrVec[indexVec].szStr)
indexSmallest = indexVec;
}
subString = new char[(subStrVec[indexSmallest].szStr) + 1];
memcpy((void*)subString, (void*)(testStr+ subStrVec[indexSmallest].iStart), subStrVec[indexSmallest].szStr);
memset((void*)(subString + subStrVec[indexSmallest].szStr), 0, 1);
delete[] cSetBackUp;
return subString;
}
//--------------------------------------------------------------------
Edit: apparently there's an O(n) algorithm (cf. algorithmist's answer). Obviously this will beat the [naive] baseline described below!
Too bad I gotta go... I'm a bit suspicious that we can get O(n). I'll check in tomorrow to see the winner ;-) Have fun!
Tentative algorithm:
The general idea is to sequentially try and use a character from str2 found in str1 as the start of a search (in either/both directions) of all the other letters of str2. By keeping a "length of best match so far" value, we can abort searches when they exceed this. Other heuristics can probably be used to further abort suboptimal (so far) solutions. The choice of the order of the starting letters in str1 matters much; it is suggested to start with the letter(s) of str1 which have the lowest count and to try with the other letters, of an increasing count, in subsequent attempts.
[loose pseudo-code]
- get count for each letter/character in str1 (number of As, Bs etc.)
- get count for each letter in str2
- minLen = length(str1) + 1 (the +1 indicates you're not sure all chars of
str2 are in str1)
- Starting with the letter from string2 which is found the least in string1,
look for other letters of Str2, in either direction of str1, until you've
found them all (or not, in which case response = impossible => done!).
set x = length(corresponding substring of str1).
- if (x < minLen),
set minlen = x,
also memorize the start/len of the str1 substring.
- continue trying with other letters of str1 (going up the frequency
list in str1), but abort search as soon as length(substring of str1)
reaches or exceeds minLen.
We can find a few other heuristics that would allow aborting a
particular search, based on [pre-calculated ?] distance between a given
letter in str1 and some (all?) of the letters in str2.
- the overall search terminates when minLen = length(str2) or when
we've used all letters of str1 (which match one letter of str2)
as a starting point for the search
Here is a Java implementation:
public static String shortestSubstrContainingAllChars(String input, String target) {
int needToFind[] = new int[256];
int hasFound[] = new int[256];
int totalCharCount = 0;
String result = null;
char[] targetCharArray = target.toCharArray();
for (int i = 0; i < targetCharArray.length; i++) {
needToFind[targetCharArray[i]]++;
}
char[] inputCharArray = input.toCharArray();
for (int begin = 0, end = 0; end < inputCharArray.length; end++) {
if (needToFind[inputCharArray[end]] == 0) {
continue;
}
hasFound[inputCharArray[end]]++;
if (hasFound[inputCharArray[end]] <= needToFind[inputCharArray[end]]) {
totalCharCount ++;
}
if (totalCharCount == target.length()) {
while (needToFind[inputCharArray[begin]] == 0
|| hasFound[inputCharArray[begin]] > needToFind[inputCharArray[begin]]) {
if (hasFound[inputCharArray[begin]] > needToFind[inputCharArray[begin]]) {
hasFound[inputCharArray[begin]]--;
}
begin++;
}
String substring = input.substring(begin, end + 1);
if (result == null || result.length() > substring.length()) {
result = substring;
}
}
}
return result;
}
Here is the JUnit test:
@Test
public void shortestSubstringContainingAllCharsTest() {
String result = StringUtil.shortestSubstrContainingAllChars("acbbaca", "aba");
assertThat(result, equalTo("baca"));
result = StringUtil.shortestSubstrContainingAllChars("acbbADOBECODEBANCaca", "ABC");
assertThat(result, equalTo("BANC"));
result = StringUtil.shortestSubstrContainingAllChars("this is a test string", "tist");
assertThat(result, equalTo("t stri"));
}
//[ShortestSubstring.java][1]
public class ShortestSubstring {
public static void main(String[] args) {
String input1 = "My name is Fran";
String input2 = "rim";
System.out.println(getShortestSubstring(input1, input2));
}
private static String getShortestSubstring(String mainString, String toBeSearched) {
int mainStringLength = mainString.length();
int toBeSearchedLength = toBeSearched.length();
if (toBeSearchedLength > mainStringLength) {
throw new IllegalArgumentException("search string cannot be larger than main string");
}
for (int j = 0; j < mainStringLength; j++) {
for (int i = 0; i <= mainStringLength - toBeSearchedLength; i++) {
String substring = mainString.substring(i, i + toBeSearchedLength);
if (checkIfMatchFound(substring, toBeSearched)) {
return substring;
}
}
toBeSearchedLength++;
}
return null;
}
private static boolean checkIfMatchFound(String substring, String toBeSearched) {
char[] charArraySubstring = substring.toCharArray();
char[] charArrayToBeSearched = toBeSearched.toCharArray();
int count = 0;
for (int i = 0; i < charArraySubstring.length; i++) {
for (int j = 0; j < charArrayToBeSearched.length; j++) {
if (String.valueOf(charArraySubstring[i]).equalsIgnoreCase(String.valueOf(charArrayToBeSearched[j]))) {
count++;
}
}
}
return count == charArrayToBeSearched.length;
}
}
This approach uses prime numbers to replace one loop with multiplications. Several other minor optimizations can be made.
Assign a unique prime number to each character that you want to find, and 1 to the uninteresting characters.
Compute the target product of a matching string by raising each character's prime to the number of occurrences it should have and multiplying the results. Because prime factorization is unique, a window has this product only if it contains exactly those characters.
Search the string from the beginning, multiplying each character's prime into a running product as you go.
If the running product is greater than the target product, remove the first character and divide its prime out of the running product.
If the running product is less than the target product, include the next character and multiply its prime into the running product.
If the running product equals the target product, you have found a match; slide the beginning and end to the next character and continue searching for other matches.
Finally, decide which of the matches is the shortest.
Gist
charcount = { 'a': 3, 'b' : 1 };
str = "kjhdfsbabasdadaaaaasdkaaajbajerhhayeom"
def find (c, s):
    Ns = len (s)
    C = list (c.keys ())
    D = list (c.values ())
    # prime numbers assigned to the first 25 chars
    prmsi = [ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89 , 97]
    # primes used in the key, all other set to 1
    prms = []
    Cord = [ord(c) - ord('a') for c in C]
    for e,p in enumerate(prmsi):
        if e in Cord:
            prms.append (p)
        else:
            prms.append (1)
    # Product of match
    T = 1
    for c,d in zip(C,D):
        p = prms[ord (c) - ord('a')]
        T *= p**d
    print ("T=", T)
    t = 1 # product of current string
    f = 0
    i = 0
    matches = []
    mi = 0
    mn = Ns
    mm = 0
    while i < Ns:
        k = prms[ord(s[i]) - ord ('a')]
        t *= k
        print ("testing:", s[f:i+1])
        if (t > T):
            # included too many chars: move start
            # (integer division keeps the running product exact)
            t //= prms[ord(s[f]) - ord('a')] # remove first char, usually division by 1
            f += 1 # increment start position
            t //= k # will be retested, could be replaced with bool
        elif t == T:
            # found match
            print ("FOUND match:", s[f:i+1])
            matches.append (s[f:i+1])
            if (i - f) < mn:
                mm = mi
                mn = i - f
            mi += 1
            t //= prms[ord(s[f]) - ord('a')] # remove first matching char
            # look for next match
            i += 1
            f += 1
        else:
            # no match yet, keep searching
            i += 1
    return (mm, matches)
print (find (charcount, str))
(note: this answer was originally posted to a duplicate question, the original answer is now deleted.)
C# Implementation:
public static Tuple<int, int> FindMinSubstringWindow(string input, string pattern)
{
Tuple<int, int> windowCoords = new Tuple<int, int>(0, input.Length - 1);
int[] patternHist = new int[256];
for (int i = 0; i < pattern.Length; i++)
{
patternHist[pattern[i]]++;
}
int[] inputHist = new int[256];
int minWindowLength = int.MaxValue;
int count = 0;
for (int begin = 0, end = 0; end < input.Length; end++)
{
// Skip what's not in pattern.
if (patternHist[input[end]] == 0)
{
continue;
}
inputHist[input[end]]++;
// Count letters that are in pattern.
if (inputHist[input[end]] <= patternHist[input[end]])
{
count++;
}
// Window found.
if (count == pattern.Length)
{
// Remove extra instances of letters from pattern
// or just letters that aren't part of the pattern
// from the beginning.
while (patternHist[input[begin]] == 0 ||
inputHist[input[begin]] > patternHist[input[begin]])
{
if (inputHist[input[begin]] > patternHist[input[begin]])
{
inputHist[input[begin]]--;
}
begin++;
}
// Current window found.
int windowLength = end - begin + 1;
if (windowLength < minWindowLength)
{
windowCoords = new Tuple<int, int>(begin, end);
minWindowLength = windowLength;
}
}
}
if (count == pattern.Length)
{
return windowCoords;
}
return null;
}
I've implemented it using Python3 at O(N) efficiency:
def get(s, alphabet="abc"):
    seen = {}
    for c in alphabet:
        seen[c] = 0
    seen[s[0]] = 1
    start = 0
    end = 0
    shortest_s = 0
    shortest_e = 99999
    while end + 1 < len(s):
        while seen[s[start]] > 1:
            seen[s[start]] -= 1
            start += 1
        # Constant time check:
        if sum(seen.values()) == len(alphabet) and all(v == 1 for v in seen.values()) and \
                shortest_e - shortest_s > end - start:
            shortest_s = start
            shortest_e = end
        end += 1
        seen[s[end]] += 1
    return s[shortest_s: shortest_e + 1]
print(get("abbcac")) # Expected to return "bca"
String s = "xyyzyzyx";
String s1 = "xyz";
String finalString ="";
Map<Character,Integer> hm = new HashMap<>();
if(s1!=null && s!=null && s.length()>s1.length()){
for(int i =0;i<s1.length();i++){
if(hm.get(s1.charAt(i))!=null){
int k = hm.get(s1.charAt(i))+1;
hm.put(s1.charAt(i), k);
}else
hm.put(s1.charAt(i), 1);
}
Map<Character,Integer> t = new HashMap<>();
int start =-1;
for(int j=0;j<s.length();j++){
if(hm.get(s.charAt(j))!=null){
if(t.get(s.charAt(j))!=null){
if(t.get(s.charAt(j))!=hm.get(s.charAt(j))){
int k = t.get(s.charAt(j))+1;
t.put(s.charAt(j), k);
}
}else{
t.put(s.charAt(j), 1);
if(start==-1){
if(j+s1.length()>s.length()){
break;
}
start = j;
}
}
if(hm.equals(t)){
t = new HashMap<>();
if(finalString.isEmpty() || s.substring(start,j+1).length()<finalString.length())
{
finalString=s.substring(start,j+1);
}
j=start;
start=-1;
}
}
}
A brute-force JavaScript solution:
function shortestSubStringOfUniqueChars(s){
var uniqueArr = [];
for(let i=0; i<s.length; i++){
if(uniqueArr.indexOf(s.charAt(i)) <0){
uniqueArr.push(s.charAt(i));
}
}
let windoww = uniqueArr.length;
while(windoww < s.length){
for(let i=0; i<s.length - windoww; i++){
let match = true;
let tempArr = [];
for(let j=0; j<uniqueArr.length; j++){
if(uniqueArr.indexOf(s.charAt(i+j))<0){
match = false;
break;
}
}
let checkStr
if(match){
checkStr = s.substr(i, windoww);
for(let j=0; j<uniqueArr.length; j++){
if(uniqueArr.indexOf(checkStr.charAt(j))<0){
match = false;
break;
}
}
}
if(match){
return checkStr;
}
}
windoww = windoww + 1;
}
}
console.log(shortestSubStringOfUniqueChars("ABA"));
# Python implementation
s = input('Enter the string : ')
s1 = input('Enter the substring to search : ')
l = [] # List to record all the matching combinations
check = all([char in s for char in s1])
if check == True:
    for i in range(len(s1),len(s)+1) :
        for j in range(0,i+len(s1)+2):
            if (i+j) < len(s)+1:
                cnt = 0
                b = all([char in s[j:i+j] for char in s1])
                if (b == True) :
                    l.append(s[j:i+j])
    print('The smallest substring containing',s1,'is',l[0])
else:
    print('Please enter a valid substring')
Java code for the approach discussed above:
private static Map<Character, Integer> frequency;
private static Set<Character> charsCovered;
private static Map<Character, Integer> encountered;
/**
* To set the first match index as an initial start point
*/
private static boolean hasStarted = false;
private static int currentStartIndex = 0;
private static int finalStartIndex = 0;
private static int finalEndIndex = 0;
private static int minLen = Integer.MAX_VALUE;
private static int currentLen = 0;
/**
* Whether we have already found the match and now looking for other
* alternatives.
*/
private static boolean isFound = false;
private static char currentChar;
public static String findSmallestSubStringWithAllChars(String big, String small) {
if (null == big || null == small || big.isEmpty() || small.isEmpty()) {
return null;
}
frequency = new HashMap<Character, Integer>();
instantiateFrequencyMap(small);
charsCovered = new HashSet<Character>();
int charsToBeCovered = frequency.size();
encountered = new HashMap<Character, Integer>();
for (int i = 0; i < big.length(); i++) {
currentChar = big.charAt(i);
if (frequency.containsKey(currentChar) && !isFound) {
if (!hasStarted && !isFound) {
hasStarted = true;
currentStartIndex = i;
}
updateEncounteredMapAndCharsCoveredSet(currentChar);
if (charsCovered.size() == charsToBeCovered) {
currentLen = i - currentStartIndex;
isFound = true;
updateMinLength(i);
}
} else if (frequency.containsKey(currentChar) && isFound) {
updateEncounteredMapAndCharsCoveredSet(currentChar);
if (currentChar == big.charAt(currentStartIndex)) {
encountered.put(currentChar, encountered.get(currentChar) - 1);
currentStartIndex++;
while (currentStartIndex < i) {
if (encountered.containsKey(big.charAt(currentStartIndex))
&& encountered.get(big.charAt(currentStartIndex)) > frequency.get(big
.charAt(currentStartIndex))) {
encountered.put(big.charAt(currentStartIndex),
encountered.get(big.charAt(currentStartIndex)) - 1);
} else if (encountered.containsKey(big.charAt(currentStartIndex))) {
break;
}
currentStartIndex++;
}
}
currentLen = i - currentStartIndex;
updateMinLength(i);
}
}
System.out.println("start: " + finalStartIndex + " finalEnd : " + finalEndIndex);
return big.substring(finalStartIndex, finalEndIndex + 1);
}
private static void updateMinLength(int index) {
if (minLen > currentLen) {
minLen = currentLen;
finalStartIndex = currentStartIndex;
finalEndIndex = index;
}
}
private static void updateEncounteredMapAndCharsCoveredSet(Character currentChar) {
if (encountered.containsKey(currentChar)) {
encountered.put(currentChar, encountered.get(currentChar) + 1);
} else {
encountered.put(currentChar, 1);
}
if (encountered.get(currentChar) >= frequency.get(currentChar)) {
charsCovered.add(currentChar);
}
}
private static void instantiateFrequencyMap(String str) {
for (char c : str.toCharArray()) {
if (frequency.containsKey(c)) {
frequency.put(c, frequency.get(c) + 1);
} else {
frequency.put(c, 1);
}
}
}
public static void main(String[] args) {
String big = "this is a test string";
String small = "tist";
System.out.println("len: " + big.length());
System.out.println(findSmallestSubStringWithAllChars(big, small));
}
def minimum_window(s, t, min_length = 100000):
    d = {}
    for x in t:
        if x in d:
            d[x]+= 1
        else:
            d[x] = 1
    tot = sum([y for x,y in d.iteritems()])
    l = []
    ind = 0
    for i,x in enumerate(s):
        if ind == 1:
            l = l + [x]
        if x in d:
            tot-=1
            if not l:
                ind = 1
                l = [x]
        if tot == 0:
            if len(l)<min_length:
                min_length = len(l)
                min_length = minimum_window(s[i+1:], t, min_length)
    return min_length
l_s = "ADOBECODEBANC"
t_s = "ABC"
min_length = minimum_window(l_s, t_s)
if min_length == 100000:
    print "Not found"
else:
    print min_length
I am struggling to find/create an algorithm that can determine the pronounceability of random 5 letter combinations.
The closest thing I've found so far is from this 3 year old StackOverflow thread:
Measure the pronounceability of a word?
<?php
// Score: 1
echo pronounceability('namelet') . "\n";
// Score: 0.71428571428571
echo pronounceability('nameoic') . "\n";
function pronounceability($word) {
static $vowels = array
(
'a',
'e',
'i',
'o',
'u',
'y'
);
static $composites = array
(
'mm',
'll',
'th',
'ing'
);
if (!is_string($word)) return false;
// Remove non letters and put in lowercase
$word = preg_replace('/[^a-z]/i', '', $word);
$word = strtolower($word);
// Special case
if ($word == 'a') return 1;
$len = strlen($word);
// Let's not parse an empty string
if ($len == 0) return 0;
$score = 0;
$pos = 0;
while ($pos < $len) {
// Check if is allowed composites
foreach ($composites as $comp) {
$complen = strlen($comp);
if (($pos + $complen) < $len) {
$check = substr($word, $pos, $complen);
if ($check == $comp) {
$score += $complen;
$pos += $complen;
continue 2;
}
}
}
// Is it a vowel? If so, check if previous wasn't a vowel too.
if (in_array($word[$pos], $vowels)) {
if (($pos - 1) >= 0 && !in_array($word[$pos - 1], $vowels)) {
$score += 1;
$pos += 1;
continue;
}
} else { // Not a vowel, check if next one is, or if is end of word
if (($pos + 1) < $len && in_array($word[$pos + 1], $vowels)) {
$score += 2;
$pos += 2;
continue;
} elseif (($pos + 1) == $len) {
$score += 1;
break;
}
}
$pos += 1;
}
return $score / $len;
}
?>
... but it is far from perfect, giving some rather strange false positives:
Using this function, all of the following rate as pronounceable, (above 7/10)
ZTEDA
LLFDA
MMGDA
THHDA
RTHDA
XYHDA
VQIDA
Can someone smarter than me tweak this algorithm, perhaps so that:
'MM', 'LL', and 'TH' are only valid when followed or preceded by a
vowel?
3 or more consonants in a row is a no-no, (except when the first or
last is an 'R' or 'L')
any other refinements you can think of...
(I have done a fair amount of research/googling, and this seems to be the main pronounceability function that everyone has been referencing/using for the last 3 years, so I'm sure an updated, more refined version would be appreciated by the wider community, not just me!).
Based on a suggestion on the linked question to "Use a Markov model on letters"
Use a Markov model (on letters, not words, of course). The probability of a word is a pretty good proxy for ease of pronunciation.
I thought I would try it out and had some success.
My Methodology
I copied a list of real 5-letter words into a file to serve as my dataset (here...um, actually here).
Then I use a Markov-style n-gram model (based on one-grams, bigrams, and trigrams) to predict how likely a target word would be to appear in that dataset.
(Better results could be achieved with some sort of phonetic transcription as one of the steps.)
First, I calculate the probabilities of character sequences in the dataset.
For example, if 'A' occurs 50 times, and there are only 250 characters in the dataset, then 'A' has a 50/250 or .2 probability.
Do the same for the bigrams 'AB', 'AC', ...
Do the same for the trigrams 'ABC', 'ABD', ...
Basically, my score for the word "ABCDE" is composed of:
prob( 'A' )
prob( 'B' )
prob( 'C' )
prob( 'D' )
prob( 'E' )
prob( 'AB' )
prob( 'BC' )
prob( 'CD' )
prob( 'DE' )
prob( 'ABC' )
prob( 'BCD' )
prob( 'CDE' )
You could multiply all of these together to get the estimated probability of the target word appearing in the dataset, (but that is very small).
So instead, we take the logs of each and add them together.
Now we have a score which estimates how likely our target word would appear in the dataset.
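As a rough illustration of the scoring just described, here is a minimal Python sketch. The function names `train` and `score` and the four-word list are assumptions made up for the example; a real run would need a proper word list. It uses add-one counts so that unseen n-grams do not produce log(0), and it does not apply any extra weighting to trigrams.

```python
import math
from collections import Counter

def train(words):
    """Count 1-, 2- and 3-grams over a list of words."""
    counts = {n: Counter() for n in (1, 2, 3)}
    for word in words:
        w = word.upper()
        for n in (1, 2, 3):
            for i in range(len(w) - n + 1):
                counts[n][w[i:i + n]] += 1
    return counts

def score(word, counts):
    """Sum of log-probabilities of every n-gram in the word (add-one counts)."""
    w = word.upper()
    total = 0.0
    for n in (1, 2, 3):
        vocab = 26 ** n  # number of possible n-grams over A-Z
        denom = sum(counts[n].values()) + vocab
        for i in range(len(w) - n + 1):
            total += math.log((counts[n][w[i:i + n]] + 1) / denom)
    return total

# Hypothetical toy dataset, far too small for real use.
counts = train(["HELLO", "WORLD", "APPLE", "TABLE"])
print(score("TABLE", counts), score("XQZJK", counts))
```

Scores are negative; the closer to zero, the more the word resembles the dataset. Any cutoff depends on the dataset and has to be tuned by hand.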
My code
I have coded this in C#, and find that a score greater than negative 160 is pretty good.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace Pronouncability
{
class Program
{
public static char[] alphabet = new char[]{ 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z' };
public static List<string> wordList = loadWordList(); //Dataset of 5-letter words
public static Random rand = new Random();
public const double SCORE_LIMIT = -160.00;
/// <summary>
/// Generates random words, until 100 of them are better than
/// the SCORE_LIMIT based on a statistical score.
/// </summary>
public static void Main(string[] args)
{
Dictionary<Tuple<char, char, char>, int> trigramCounts = new Dictionary<Tuple<char, char, char>, int>();
Dictionary<Tuple<char, char>, int> bigramCounts = new Dictionary<Tuple<char, char>, int>();
Dictionary<char, int> onegramCounts = new Dictionary<char, int>();
calculateProbabilities(onegramCounts, bigramCounts, trigramCounts);
double totalTrigrams = (double)trigramCounts.Values.Sum();
double totalBigrams = (double)bigramCounts.Values.Sum();
double totalOnegrams = (double)onegramCounts.Values.Sum();
SortedList<double, string> randomWordsScores = new SortedList<double, string>();
while( randomWordsScores.Count < 100 )
{
string randStr = getRandomWord();
if (!randomWordsScores.ContainsValue(randStr))
{
double score = getLikelyhood(randStr,trigramCounts, bigramCounts, onegramCounts, totalTrigrams, totalBigrams, totalOnegrams);
if (score > SCORE_LIMIT)
{
randomWordsScores.Add(score, randStr);
}
}
}
//Right now randomWordsScores contains 100 random words which have
//a better score than the SCORE_LIMIT, sorted from worst to best.
}
/// <summary>
/// Generates a random 5-letter word
/// </summary>
public static string getRandomWord()
{
// Random.Next's upper bound is exclusive, so 91 is needed to include 'Z' (90).
char c0 = (char)rand.Next(65, 91);
char c1 = (char)rand.Next(65, 91);
char c2 = (char)rand.Next(65, 91);
char c3 = (char)rand.Next(65, 91);
char c4 = (char)rand.Next(65, 91);
return "" + c0 + c1 + c2 + c3 + c4;
}
/// <summary>
/// Returns a score for how likely a given word is, based on given trigrams, bigrams, and one-grams
/// </summary>
public static double getLikelyhood(string wordToScore, Dictionary<Tuple<char, char,char>, int> trigramCounts, Dictionary<Tuple<char, char>, int> bigramCounts, Dictionary<char, int> onegramCounts, double totalTrigrams, double totalBigrams, double totalOnegrams)
{
wordToScore = wordToScore.ToUpper();
char[] letters = wordToScore.ToCharArray();
Tuple<char, char>[] bigrams = new Tuple<char, char>[]{
new Tuple<char,char>( wordToScore[0], wordToScore[1] ),
new Tuple<char,char>( wordToScore[1], wordToScore[2] ),
new Tuple<char,char>( wordToScore[2], wordToScore[3] ),
new Tuple<char,char>( wordToScore[3], wordToScore[4] )
};
Tuple<char, char, char>[] trigrams = new Tuple<char, char, char>[]{
new Tuple<char,char,char>( wordToScore[0], wordToScore[1], wordToScore[2] ),
new Tuple<char,char,char>( wordToScore[1], wordToScore[2], wordToScore[3] ),
new Tuple<char,char,char>( wordToScore[2], wordToScore[3], wordToScore[4] ),
};
double score = 0;
foreach (char c in letters)
{
score += Math.Log((((double)onegramCounts[c]) / totalOnegrams));
}
foreach (Tuple<char, char> pair in bigrams)
{
score += Math.Log((((double)bigramCounts[pair]) / totalBigrams));
}
foreach (Tuple<char, char, char> trio in trigrams)
{
score += 5.0*Math.Log((((double)trigramCounts[trio]) / totalTrigrams));
}
return score;
}
/// <summary>
/// Build the probability tables based on the dataset (WordList)
/// </summary>
public static void calculateProbabilities(Dictionary<char, int> onegramCounts, Dictionary<Tuple<char, char>, int> bigramCounts, Dictionary<Tuple<char, char, char>, int> trigramCounts)
{
foreach (char c1 in alphabet)
{
foreach (char c2 in alphabet)
{
foreach( char c3 in alphabet)
{
trigramCounts[new Tuple<char, char, char>(c1, c2, c3)] = 1;
}
}
}
foreach( char c1 in alphabet)
{
foreach( char c2 in alphabet)
{
bigramCounts[ new Tuple<char,char>(c1,c2) ] = 1;
}
}
foreach (char c1 in alphabet)
{
onegramCounts[c1] = 1;
}
foreach (string word in wordList)
{
for (int pos = 0; pos < 3; pos++)
{
trigramCounts[new Tuple<char, char, char>(word[pos], word[pos + 1], word[pos + 2])]++;
}
for (int pos = 0; pos < 4; pos++)
{
bigramCounts[new Tuple<char, char>(word[pos], word[pos + 1])]++;
}
for (int pos = 0; pos < 5; pos++)
{
onegramCounts[word[pos]]++;
}
}
}
/// <summary>
/// Get the dataset (WordList) from file.
/// </summary>
public static List<string> loadWordList()
{
string filePath = "WordList.txt";
string text = File.ReadAllText(filePath);
List<string> result = text.Split(' ').ToList();
return result;
}
}
}
In my example, I scale the trigram probabilities by 5.
I also add one to all of the counts, so we don't multiply by zero.
Final notes
I'm not a php programmer, but the technique is pretty easy to implement.
Play around with some scaling factors, try different datasets, or add in some other checks like what you suggested above.
How about generating a reasonably pronounceable combination from the start? I have done something where I generate a random Soundex code, and work back from that to a (usually) pronounceable original.
If anyone's looking for a way to do this with Node.js, I found a module called pronounceable that seems to implement what Xantix's answer describes.
npm i pronounceable
You can test it without installing anything on RunKit.
I am using the following PHP code to calculate a CRN for BPay:
<?php
function LuhnCalc($number) {
$chars = array_reverse(str_split($number, 1));
$odd = array_intersect_key($chars, array_fill_keys(range(1, count($chars), 2), null));
$even = array_intersect_key($chars, array_fill_keys(range(0, count($chars), 2), null));
$even = array_map(function($n) { return ($n >= 5)?2 * $n - 9:2 * $n; }, $even);
$total = array_sum($odd) + array_sum($even);
return ((floor($total / 10) + 1) * 10 - $total) % 10;
}
print LuhnCalc($_GET['num']);
?>
However it seems that BPAY is version 5 of MOD 10, for which I can't find any documentation. It seems to not be the same as MOD10.
The following numbers were tested:
2005,1597,3651,0584,9675
BPAY
2005 = 20052
1597 = 15976
3651 = 36514
0584 = 05840
9675 = 96752
MY CODE
2005 = 20057
1597 = 15974
3651 = 36517
0584 = 05843
9675 = 96750
As you can see, none of them match the BPAY numbers.
This PHP function will generate BPay reference numbers based on the mod10 version 5 algorithm.
Who knows why BPay can't add this to their website. I only found an explanation after discovering, through googling, that the algorithm is called "MOD10V05" rather than "Mod 10 version 5".
function generateBpayRef($number) {
$number = preg_replace("/\D/", "", $number);
// The seed number needs to be numeric
if(!is_numeric($number)) return false;
// Must be a positive number
if($number <= 0) return false;
// Get the length of the seed number
$length = strlen($number);
$total = 0;
// For each character in seed number, sum the character multiplied by its one based array position (instead of normal PHP zero based numbering)
for($i = 0; $i < $length; $i++) $total += (int) $number[$i] * ($i + 1);
// The check digit is the result of the sum total from above mod 10
$checkdigit = fmod($total, 10);
// Return the original seed plus the check digit
return $number . $checkdigit;
}
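The weighting described in the comments above can be checked against the numbers from the question with a few lines of Python. This is only a sketch; the function names are made up for the example, and the input is assumed to be a string of digits (so leading zeros are preserved).

```python
def bpay_check_digit(number: str) -> int:
    """Sum of each digit times its 1-based position, modulo 10."""
    return sum(int(d) * pos for pos, d in enumerate(number, start=1)) % 10

def bpay_reference(number: str) -> str:
    """Append the check digit to the original digit string."""
    return number + str(bpay_check_digit(number))

for seed in ["2005", "1597", "3651", "0584", "9675"]:
    print(seed, "=", bpay_reference(seed))
```

For 2005 the sum is 2*1 + 0*2 + 0*3 + 5*4 = 22, so the check digit is 2 and the reference is 20052, matching the BPAY value listed in the question.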
Here's a way of implementing the "MOD10V5" algorithm (or "mod 10 version 5") using a T-SQL user-defined function in SQL Server. It accepts a Customer ID up to 9 characters long, and returns an 11 character CRN (Customer Reference Number).
I also prepended a version number onto the start of my CustomerID, you could do this too if you think you might end up changing it in the future.
CREATE Function [dbo].[CalculateBPayCRN]
(
@CustomerID nvarchar(9)
)
RETURNS varchar(11)
AS
BEGIN
DECLARE @NewCRN nvarchar(11)
DECLARE @Multiplier TINYINT
DECLARE @Sum int
DECLARE @SubTotal int
DECLARE @CheckDigit int
DECLARE @ReturnVal BIGINT
SELECT @Multiplier = 1
SELECT @SubTotal = 0
-- If it's less than 9 characters, pad it with 0's, then prepend a '1'
SELECT @NewCRN = '1' + right('000000000'+ rtrim(@CustomerID), 9)
-- loop through each digit in the @NewCRN, multiply it by the correct weighting and subtotal it:
WHILE @Multiplier <= LEN(@NewCRN)
BEGIN
SET @Sum = CAST(SUBSTRING(@NewCRN,@Multiplier,1) AS TINYINT) * @Multiplier
SET @SubTotal = @SubTotal + @Sum
SET @Multiplier = @Multiplier + 1
END
-- mod 10 the subtotal and the result is our check digit
SET @CheckDigit = @SubTotal % 10
SELECT @ReturnVal = @NewCRN + cast(@CheckDigit as varchar)
RETURN @ReturnVal
END
GO
Modulus 10 V1 in PHP. Tested against my Windows DataFlex routine, and the results are the same.
function generateBpayRef($number) {
//Mod 10 v1
$number = preg_replace("/\D/", "", $number);
// The seed number needs to be numeric
if(!is_numeric($number)) return false;
// Must be a positive number
if($number <= 0) return false;
$stringMemberNo = "$number";
$stringMemberNo = str_pad($stringMemberNo, 6, "0", STR_PAD_LEFT);
//echo " Padded Number is $stringMemberNo ";
$crn = $stringMemberNo;
for($i=0;$i<7;$i++){
$crnval = substr($crn,(5-$i),1);
$iPartVal = $iWeight * $crnval;
if($iPartVal>9){
//echo " Greater than 9: $iPartVal ";
$firstChar = substr($iPartVal,0,1);
$secondChar = substr($iPartVal,1,1);
$iPartVal=$firstChar+$secondChar;
//$iPartVal -= 9;
}
$iSum+=$iPartVal;
$iWeight++;
if ($iWeight>2){$iWeight=1;}
//echo " CRN: $crnval ] Weight: $iWeight ] Part: $iPartVal ] SUM: $iSum ";
}
$iSum %= 10;
if($iSum==0){
//echo " zero check is $iSum ";
//return $iSum;
}
else{
//return 10-$iSum;
$iSum=(10-$iSum);
}
//echo " Check is a $iSum ";
$BpayMemberNo = $stringMemberNo . $iSum ;
echo " New: $BpayMemberNo ";
return ($BpayMemberNo);
}
Here is a Ruby class I whipped up quickly for Mod 10 v5:
module Bpay
class CRN
attr_accessor :number, :crn
class << self
def calculate_for(number)
new(number).crn
end
end
def initialize(number)
@number = number
calculate
end
def calculate
raise ArgumentError, "The number '#{number}' is not valid" unless valid?
digits = number.to_s.scan(/\d/).map { |x| x.to_i }
raise ArgumentError, "The number '#{number}' must be at least 2 digits in length" if digits.size < 2
check_digit = digits.each_with_index.map { |d, i| d * (i + 1) }.inject(:+) % 10
@crn = "#{number}#{check_digit}"
end
def valid?
return false unless !!Integer(number.to_s) rescue false
return false if number.to_i <= 0
true
end
end
end
This is in C#, but this is what I have so far for BPay check digit generation:
private void btnBPayGenerate_Click(object sender, EventArgs e)
{
var originalChars = txtBPayNumber.Text.ToCharArray();
List<int> oddDigits = new List<int>();
List<int> evenDigits = new List<int>();
int oddTotal = 0, evenTotal = 0, total = 0, checkDigit ;
const int oddMultiplier = 3;
const int modulus = 10;
bool isOdd = true;
for (int x = 0; x < originalChars.Length; x++)
{
if(isOdd)
oddDigits.Add(Int32.Parse(originalChars[x].ToString()));
else
evenDigits.Add(Int32.Parse(originalChars[x].ToString()));
isOdd = !isOdd;
}
foreach (var digit in oddDigits)
oddTotal += digit;
foreach (var digit in evenDigits)
evenTotal += digit;
oddTotal = oddTotal * oddMultiplier;
total = oddTotal + evenTotal;
checkDigit = (modulus - (total % modulus));
lblBPayResult.Text = txtBPayNumber.Text + checkDigit.ToString();
}
I haven't completed testing this yet; I will post back once BPAY gets back to me.
EDIT: try this: https://gist.github.com/1287893
I had to work out a version for JavaScript; this is what I came up with. It correctly generates the expected numbers in the original question.
var startingNumber = 2005;
var reference = startingNumber.toString();
var subTotal = 0;
for (var x = 0; x < reference.length; x++) {
subTotal += (x + 1) * reference.charAt(x);
}
var digit = subTotal % 10;
var bpayReference = reference + digit.toString();
Here is a function I created using VB.NET to calculate a mod 10 version 5 check digit:
Private Function CalcCheckDigit(ByRef psBaseNumber As String) As String
Dim lCheckDigit, iLoop As Integer
Dim dCalcNumber As Double
lCheckDigit = 0
dCalcNumber = 0
For iLoop = 0 To (psBaseNumber.Length - 1)
lCheckDigit = lCheckDigit + (psBaseNumber.Substring(iLoop, 1) * (iLoop + 1))
Next iLoop
lCheckDigit = lCheckDigit Mod 10
CalcCheckDigit = psBaseNumber & CStr(lCheckDigit)
End Function