Longest Common Substring with wrong character tolerance - php

I have a script I found on here that works well when looking for the Longest Common Substring.
However, I need it to tolerate some incorrect/missing characters. I would like to be able to either input a percentage of similarity required, or perhaps specify the number of missing/wrong characters allowable.
For example, I want to find this string:
big yellow school bus
inside of this string:
they rode the bigyellow schook bus that afternoon
This is the code I'm currently using:
function longest_common_substring($words) {
$words = array_map('strtolower', array_map('trim', $words));
// create_function() was removed in PHP 8, so use a closure: sort by length, then alphabetically
$sort_by_strlen = function ($a, $b) { if (strlen($a) == strlen($b)) { return strcmp($a, $b); } return (strlen($a) < strlen($b)) ? -1 : 1; };
usort($words, $sort_by_strlen);
// We have to assume that each string has something in common with the first
// string (post sort), we just need to figure out what the longest common
// string is. If any string DOES NOT have something in common with the first
// string, return false.
$longest_common_substring = array();
$shortest_string = str_split(array_shift($words));
while (sizeof($shortest_string)) {
array_unshift($longest_common_substring, '');
foreach ($shortest_string as $ci => $char) {
foreach ($words as $wi => $word) {
if (!strstr($word, $longest_common_substring[0] . $char)) {
// No match
break 2;
}
}
// we found the current char in each word, so add it to the first longest_common_substring element,
// then start checking again using the next char as well
$longest_common_substring[0].= $char;
}
// We've finished looping through the entire shortest_string.
// Remove the first char and start all over. Do this until there are no more
// chars to search on.
array_shift($shortest_string);
}
// If we made it here then we've run through everything
usort($longest_common_substring, $sort_by_strlen);
return array_pop($longest_common_substring);
}
Any help is much appreciated.
UPDATE
The PHP levenshtein function is limited to 255 characters, and some of the haystacks I'm searching are 1000+ characters.

Writing this as a second answer because it's not based on my previous (bad) one at all.
This code is based on http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm and http://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms
It returns one (of potentially several) minimum-levenshtein substrings of $haystack, given $needle. Now, levenshtein distance is just one measure of edit distance and it may not actually suit your needs. 'hte' is closer on this metric to 'he' than it is to 'the'. Some of the examples I put in show the limitations of this technique. I believe this to be considerably more reliable than the previous answer I gave, but let me know how it works for you.
// utility function - returns the key of the array minimum
function array_min_key($arr)
{
$min_key = null;
$min = PHP_INT_MAX;
foreach($arr as $k => $v) {
if ($v < $min) {
$min = $v;
$min_key = $k;
}
}
return $min_key;
}
// Calculate the edit distance between two strings
function edit_distance($string1, $string2)
{
$m = strlen($string1);
$n = strlen($string2);
$d = array();
// the distance from '' to substr(string,$i)
for($i=0;$i<=$m;$i++) $d[$i][0] = $i;
for($i=0;$i<=$n;$i++) $d[0][$i] = $i;
// fill-in the edit distance matrix
for($j=1; $j<=$n; $j++)
{
for($i=1; $i<=$m; $i++)
{
// Using, for example, the levenshtein distance as edit distance
list($p_i,$p_j,$cost) = levenshtein_weighting($i,$j,$d,$string1,$string2);
$d[$i][$j] = $d[$p_i][$p_j]+$cost;
}
}
return $d[$m][$n];
}
// Helper function for edit_distance()
function levenshtein_weighting($i,$j,$d,$string1,$string2)
{
// if the two letters are equal, cost is 0
if($string1[$i-1] === $string2[$j-1]) {
return array($i-1,$j-1,0);
}
// cost we assign each operation
$cost['delete'] = 1;
$cost['insert'] = 1;
$cost['substitute'] = 1;
// cost of operation + cost to get to the substring we perform it on
$total_cost['delete'] = $d[$i-1][$j] + $cost['delete'];
$total_cost['insert'] = $d[$i][$j-1] + $cost['insert'];
$total_cost['substitute'] = $d[$i-1][$j-1] + $cost['substitute'];
// return the parent array keys of $d and the operation's cost
$min_key = array_min_key($total_cost);
if ($min_key == 'delete') {
return array($i-1,$j,$cost['delete']);
} elseif($min_key == 'insert') {
return array($i,$j-1,$cost['insert']);
} else {
return array($i-1,$j-1,$cost['substitute']);
}
}
// attempt to find the substring of $haystack most closely matching $needle
function shortest_edit_substring($needle, $haystack)
{
// initialize edit distance matrix
$m = strlen($needle);
$n = strlen($haystack);
$d = array();
for($i=0;$i<=$m;$i++) {
$d[$i][0] = $i;
$backtrace[$i][0] = null;
}
// instead of strlen, we initialize the top row to all 0's
for($i=0;$i<=$n;$i++) {
$d[0][$i] = 0;
$backtrace[0][$i] = null;
}
// same as the edit_distance calculation, but keep track of how we got there
for($j=1; $j<=$n; $j++)
{
for($i=1; $i<=$m; $i++)
{
list($p_i,$p_j,$cost) = levenshtein_weighting($i,$j,$d,$needle,$haystack);
$d[$i][$j] = $d[$p_i][$p_j]+$cost;
$backtrace[$i][$j] = array($p_i,$p_j);
}
}
// now find the minimum at the bottom row
$min_key = array_min_key($d[$m]);
$current = array($m,$min_key);
$parent = $backtrace[$m][$min_key];
// trace up path to the top row
while(! is_null($parent)) {
$current = $parent;
$parent = $backtrace[$current[0]][$current[1]];
}
// and take a substring based on those results
$start = $current[1];
$end = $min_key;
return substr($haystack,$start,$end-$start);
}
// some testing
$data = array( array('foo',' foo'), array('fat','far'), array('dat burn','rugburn'));
$data[] = array('big yellow school bus','they rode the bigyellow schook bus that afternoon');
$data[] = array('bus','they rode the bigyellow schook bus that afternoon');
$data[] = array('big','they rode the bigyellow schook bus that afternoon');
$data[] = array('nook','they rode the bigyellow schook bus that afternoon');
$data[] = array('they','console, controller and games are all in very good condition, only played occasionally. includes power cable, controller charge cable and audio cable. smoke free house. pes 2011 super street fighter');
$data[] = array('controker','console, controller and games are all in very good condition, only played occasionally. includes power cable, controller charge cable and audio cable. smoke free house. pes 2011 super street fighter');
foreach($data as $dat) {
$substring = shortest_edit_substring($dat[0],$dat[1]);
$dist = edit_distance($dat[0],$substring);
printf("Found |%s| in |%s|, matching |%s| with edit distance %d\n",$substring,$dat[1],$dat[0],$dist);
}
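The question asked for a tolerance, either a number of wrong characters or a percentage. A minimal sketch of that idea follows, using the same "top row of zeros" trick as shortest_edit_substring() but keeping only two columns in memory. The names min_substring_distance() and fuzzy_contains() are hypothetical and not part of the answer above. It does not call the built-in levenshtein(), so the 255-character limit mentioned in the question does not apply to it.

```php
<?php
// Hypothetical helper: smallest edit distance between $needle and ANY
// substring of $haystack (insert, delete and substitute each cost 1).
function min_substring_distance(string $needle, string $haystack): int
{
    $m = strlen($needle);
    $n = strlen($haystack);
    // Column for the empty haystack: matching i needle characters costs i.
    $prev = range(0, $m);
    $best = $prev[$m];
    for ($j = 1; $j <= $n; $j++) {
        // Row 0 is always 0: a match may start at any haystack position for free.
        $curr = [0];
        for ($i = 1; $i <= $m; $i++) {
            $cost = ($needle[$i - 1] === $haystack[$j - 1]) ? 0 : 1;
            $curr[$i] = min(
                $prev[$i - 1] + $cost, // match or substitute
                $prev[$i] + 1,         // extra character in the haystack
                $curr[$i - 1] + 1      // needle character missing from the haystack
            );
        }
        // The bottom cell is the best match ending at haystack position $j.
        $best = min($best, $curr[$m]);
        $prev = $curr;
    }
    return $best;
}

// Hypothetical wrapper: true if $needle occurs with at most $maxErrors edits.
function fuzzy_contains(string $needle, string $haystack, int $maxErrors): bool
{
    return min_substring_distance($needle, $haystack) <= $maxErrors;
}
```

A percentage can be turned into an error budget first, for example $maxErrors = (int) floor(strlen($needle) * (1 - $similarity)) with $similarity between 0 and 1. That conversion is one possible convention, not the only one.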

Related

Generate List of Unique Four-Digit Numbers Without Repeating Digits and Without Forward-Sequential Digits

I had a need to generate a list of four-digit numbers for use as codes. The digits should not repeat, and each next digit should not be sequential. There were some questions that were similar but not enough for me to answer. I chose to share my function instead. It did not matter if reverse numbers were in the list e.g. 1357 > 7531.
It occurred to me that there may be an opportunity for a recursive function, possibly to return five- or six-digit numbers. Improvements to my function are most welcome.
public function codeList() {
$data = [];
for ($ii=0; $ii < 10; $ii++) {
for ($jj=0; $jj < 10; $jj++) {
for ($kk=0; $kk < 10; $kk++) {
for ($ll=0; $ll < 10; $ll++) {
$str = "{$ii}{$jj}{$kk}{$ll}";
$arr = str_split($str);
if (count($arr) === count(array_unique($arr))) {
if (($arr[0] + 1 != $arr[1]) && ($arr[1] + 1 != $arr[2]) && ($arr[2] + 1 != $arr[3])) {
$data[] = $str;
}
}
}
}
}
}
return $data;
} # END FUNCTION codeList
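The recursive idea mentioned in the question could look roughly like the sketch below. It is written as a free function rather than a method, and code_list() and its $length parameter are hypothetical names. It builds each code one digit at a time and applies the same two rules as codeList(): no repeated digit, and no digit that is exactly one more than the digit before it.

```php
<?php
// Hypothetical recursive variant of codeList(); $length is the number of digits.
function code_list(int $length, string $prefix = ''): array
{
    // A complete code: return it as a one-element list.
    if (strlen($prefix) === $length) {
        return [$prefix];
    }
    $codes = [];
    for ($digit = 0; $digit < 10; $digit++) {
        $d = (string) $digit;
        // Rule 1: skip digits already used in this code.
        if (strpos($prefix, $d) !== false) {
            continue;
        }
        // Rule 2: skip a digit that is exactly one more than the previous digit.
        if ($prefix !== '' && (int) substr($prefix, -1) + 1 === $digit) {
            continue;
        }
        foreach (code_list($length, $prefix . $d) as $code) {
            $codes[] = $code;
        }
    }
    return $codes;
}
```

Calling code_list(4) is intended to produce the same list as the nested loops, and code_list(5) or code_list(6) extends it. Lengths above 10 return an empty array because the digits run out.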

Get lowest price on sum of combinations in given array

This code works fine only when the array length is 8 or 10. When I run the same code on an array with more than 10 elements, it keeps loading and never shows the results.
How can I reduce my code? If you know a suitable algorithm, please share it.
This is how the program works:
$allowed_per_room_accommodation =[2,3,6,5,3,5,2,5,4];
$allowed_per_room_price =[10,30,60,40,30,50,20,60,80];
$search_accommodation = 10;
I get these subsets: [5,5],[5,3,2],[6,4],[6,2,2],[5,2,3],[3,2,5]
It should then show the lowest-priced set of rooms whose accommodation adds up to 10, for example [5,3,2]:
<?php
$dp=array(array());
$GLOBALS['final']=[];
$GLOBALS['room_key']=[];
function display($v,$room_key)
{
$GLOBALS['final'][] = $v;
$GLOBALS['room_key'][] = $room_key;
}
function printSubsetsRec($arr, $i, $sum, $p,$dp,$room_key='')
{
// If we reached end and sum is non-zero. We print
// p[] only if arr[0] is equal to sun OR dp[0][sum]
// is true.
if ($i == 0 && $sum != 0 && $dp[0][$sum]) {
array_push($p,$arr[$i]);
array_push($room_key,$i);
display($p,$room_key);
return $p;
}
// If $sum becomes 0
if ($i == 0 && $sum == 0) {
display($p,$room_key);
return $p;
}
// If given sum can be achieved after ignoring
// current element.
if (isset($dp[$i-1][$sum])) {
// Create a new vector to store path
// if(!is_array(#$b))
// $b = array();
$b = $p;
printSubsetsRec($arr, $i-1, $sum, $b,$dp,$room_key);
}
// If given $sum can be achieved after considering
// current element.
if ($sum >= $arr[$i] && isset($dp[$i-1][$sum-$arr[$i]]))
{
if(!is_array($p))
$p = array();
if(!is_array($room_key))
$room_key = array();
array_push($p,$arr[$i]);
array_push($room_key,$i);
printSubsetsRec($arr, $i-1, $sum-$arr[$i], $p,$dp,$room_key);
}
}
// Prints all subsets of arr[0..n-1] with sum 0.
function printAllSubsets($arr, $n, $sum,$get=[])
{
if ($n == 0 || $sum < 0)
return;
// Sum 0 can always be achieved with 0 elements
// $dp = new bool*[$n];
$dp = array();
for ($i=0; $i<$n; ++$i)
{
// $dp[$i][$sum + 1]=true;
$dp[$i][0] = true;
}
// Sum arr[0] can be achieved with single element
if ($arr[0] <= $sum)
$dp[0][$arr[0]] = true;
// Fill rest of the entries in dp[][]
for ($i = 1; $i < $n; ++$i) {
for ($j = 0; $j < $sum + 1; ++$j) {
// echo $i.'d'.$j.'.ds';
$dp[$i][$j] = ($arr[$i] <= $j) ? (isset($dp[$i-1][$j])?$dp[$i-1][$j]:false) | (isset($dp[$i-1][$j-$arr[$i]])?($dp[$i-1][$j-$arr[$i]]):false) : (isset($dp[$i - 1][$j])?($dp[$i - 1][$j]):false);
}
}
if (isset($dp[$n-1][$sum]) == false) {
return "There are no subsets with";
}
$p;
printSubsetsRec($arr, $n-1, $sum, $p='',$dp);
}
$blockSize = array('2','3','6','5','3','5','2','5','4');
$blockvalue = array('10','30','60','40','30','50','20','60','80');
$blockname = array("map","compass","water","sandwich","glucose","tin","banana","apple","cheese");
$processSize = 10;
$m = count($blockSize);
$n = count($processSize);
// sum of sets in array
printAllSubsets($blockSize, $m, $processSize);
$final_subset_room = '';
$final_set_room_keys = '';
$final_set_room =[];
if($GLOBALS['room_key']){
foreach ($GLOBALS['room_key'] as $set_rooms_key => $set_rooms) {
$tot = 0;
foreach ($set_rooms as $set_rooms) {
$tot += $blockvalue[$set_rooms];
}
$final_set_room[$set_rooms_key] = $tot;
}
asort($final_set_room);
$final_set_room_first_key = key($final_set_room);
$final_all_room['set_room_keys'] = $GLOBALS['room_key'][$final_set_room_first_key];
$final_all_room_price['set_room_price'] = $final_set_room[$final_set_room_first_key];
}
if(isset($final_all_room_price)){
asort($final_all_room_price);
$final_all_room_first_key = key($final_all_room_price);
foreach ($final_all_room['set_room_keys'] as $key_room) {
echo $blockname[$key_room].'---'. $blockvalue[$key_room];
echo '<br>';
}
}
else
echo 'No Results';
?>
I'm assuming your task is, given a list of rooms, each with the number of people it can accommodate and its price, to accommodate 10 people (or any other quantity) at the lowest total price.
This problem is similar to the 0-1 knapsack problem, which dynamic programming solves in pseudo-polynomial time (polynomial in the number of rooms and in the target number of people). In the knapsack problem one aims to maximize the price; here we aim to minimize it. Another difference from the classic knapsack problem is that the full room cost is charged even if the room is not completely occupied. That may reduce the effectiveness of the algorithm proposed on Wikipedia. In any case, the implementation isn't going to be straightforward if you have never worked with dynamic programming before.
If you want to know more, the CLRS book on algorithms discusses dynamic programming in Chapter 15 and the knapsack problem in Chapter 16. In the latter chapter they also show that the 0-1 knapsack problem doesn't have a trivial greedy solution.
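To make the dynamic-programming idea concrete, here is a minimal sketch under one stated assumption: a set of rooms is acceptable when its total capacity is at least the number of people, since a partly empty room is still charged in full. The function name cheapest_rooms() is hypothetical. For every number of people covered so far (capped at the target) it keeps only the cheapest set of rooms, so the work grows with rooms × people instead of with the number of subsets.

```php
<?php
// Hypothetical helper: cheapest set of rooms covering at least $people.
// Returns [total price, list of room indexes], or null if no set is big enough.
function cheapest_rooms(array $capacities, array $prices, int $people): ?array
{
    // $best[$c] = [price, room indexes] of the cheapest set covering $c people,
    // where $c is capped at $people.
    $best = [0 => [0, []]];
    foreach ($capacities as $room => $capacity) {
        // Work on a copy so each room is used at most once (0-1 knapsack).
        $next = $best;
        foreach ($best as $covered => [$price, $rooms]) {
            $c = min($people, $covered + $capacity);
            $newPrice = $price + $prices[$room];
            if (!isset($next[$c]) || $newPrice < $next[$c][0]) {
                $next[$c] = [$newPrice, array_merge($rooms, [$room])];
            }
        }
        $best = $next;
    }
    return $best[$people] ?? null;
}

$result = cheapest_rooms([2, 3, 6, 5, 3, 5, 2, 5, 4], [10, 30, 60, 40, 30, 50, 20, 60, 80], 10);
if ($result !== null) {
    echo 'Total price: ' . $result[0] . ', rooms: ' . implode(',', $result[1]) . "\n";
}
```

With the data from the question the cheapest total is 80, reached by the rooms of size 2, 3 and 5. If the capacities must add up to exactly the target, as in the question's subsets, drop the min() cap and skip any $c larger than $people.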

Rewrite a large number of for loops into something shorter

I have the following code:
for($a=1; $a<strlen($string); $a++){
for($b=1; $a+$b<strlen($string); $b++){
for($c=1; $a+$b+$c<strlen($string); $c++){
for($d=1; $a+$b+$c+$d<strlen($string); $d++){
$tempString = substr_replace($string, ".", $a, 0);
$tempString = substr_replace($tempString, ".", $a+$b+1, 0);
$tempString = substr_replace($tempString, ".", $a+$b+$c+2, 0);
$tempString = substr_replace($tempString, ".", $a+$b+$c+$d+3, 0);
echo $tempString."<br>";
}
}
}
}
What it does is make all possible combinations of a string with several dots.
Example:
t.est123
te.st123
tes.t123
...
test12.3
Then, I add one more dot:
t.e.st123
t.es.t123
...
test1.2.3
The way I'm doing it now, I need to create lots and lots of for loops, one set for each number of dots. I don't know how to turn that example into a function or some other easier way of doing this.
Your problem is a combination problem. Note: I'm not a math freak, I only researched this information because of interest.
http://en.wikipedia.org/wiki/Combination#Number_of_k-combinations
Also known as n choose k. The Binomial coefficient is a function which gives you the number of combinations.
A function I found here: Calculate value of n choose k
function choose($n, $k) {
if ($k == 0) {return 1;}
return($n * choose($n - 1, $k - 1)) / $k;
}
// 6 positions between characters (test123), 4 dots
echo choose(6, 4); // 15 combinations
To get all combinations you also have to choose between different algorithms.
Good post: https://stackoverflow.com/a/127856/1948627
UPDATE:
I found a site with an algorithm in different programming languages. (But not PHP)
I've converted it to PHP:
function bitprint($u){
$s= [];
for($n= 0;$u > 0;++$n, $u>>= 1) {
if(($u & 1) > 0) $s[] = $n;
}
return $s;
}
function bitcount($u){
for($n= 0;$u > 0;++$n, $u&= ($u - 1));
return $n;
}
function comb($c, $n){
$s= [];
for($u= 0;$u < 1 << $n;$u++) {
if(bitcount($u) == $c) $s[] = bitprint($u);
}
return $s;
}
echo '<pre>';
print_r(comb(4, 6));
It outputs an array with all combinations (positions between the chars).
The next step is to replace the string with the dots:
$string = 'test123';
$sign = '.';
$combs = comb(4, 6);
// get all combinations (Th3lmuu90)
/*
$combs = [];
for($i=0; $i<strlen($string); $i++){
$combs = array_merge($combs, comb($i, strlen($string)-1));
}
*/
foreach ($combs as $comb) {
$a = $string;
for ($i = count($comb) - 1; $i >= 0; $i--) {
$a = substr_replace($a, $sign, $comb[$i] + 1, 0);
}
echo $a.'<br>';
}
// output:
t.e.s.t.123
t.e.s.t1.23
t.e.st.1.23
t.es.t.1.23
te.s.t.1.23
t.e.s.t12.3
t.e.st.12.3
t.es.t.12.3
te.s.t.12.3
t.e.st1.2.3
t.es.t1.2.3
te.s.t1.2.3
t.est.1.2.3
te.st.1.2.3
tes.t.1.2.3
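As an alternative to enumerating bit patterns, the nested loops from the question can be folded into one recursive function. This is only a sketch and dot_combinations() is a hypothetical name. Each call places one dot and then recurses on the rest of the string, which plays the role of one more nested for loop.

```php
<?php
// Hypothetical helper: all ways to insert exactly $dots dots into $string,
// never at the start or end and never two dots next to each other.
function dot_combinations(string $string, int $dots, int $from = 1): array
{
    if ($dots === 0) {
        return [$string];
    }
    $results = [];
    // $pos < strlen() keeps at least one character after the dot.
    for ($pos = $from; $pos < strlen($string); $pos++) {
        $withDot = substr_replace($string, '.', $pos, 0);
        // The next dot must come after the inserted dot plus one character.
        foreach (dot_combinations($withDot, $dots - 1, $pos + 2) as $combination) {
            $results[] = $combination;
        }
    }
    return $results;
}

echo implode("\n", dot_combinations('test123', 4)) . "\n";
```

For 'test123' and 4 dots this yields 15 strings, matching choose(6, 4), although in a different order from the list above. Looping $dots from 1 to strlen($string) - 1 collects every dot count.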
This is quite an unusual question, but I can't help but try to wrap my head around what you are trying to do. My guess is that you want to see how many combinations of a string there are with a dot moving between characters, finally coming to rest right before the last character.
My understanding is you want a count and a printout of string similar to what you see here:
t.est
te.st
tes.t
t.es.t
te.s.t
t.e.s.t
count: 6
To facilitate this functionality I came up with a class, this way you could port it to other parts of code and it can handle multiple strings. The caveat here is the strings must be at least two characters and not contain a period. Here is the code for the class:
class DotCombos
{
public $combos;
private function combos($string)
{
$rebuilt = "";
$characters = str_split($string);
foreach($characters as $index => $char) {
if($index == 0 || $index == count($characters)) {
continue;
} else if(isset($characters[$index]) && $characters[$index] == ".") {
break;
} else {
$rebuilt = substr($string, 0, $index) . "." . substr($string, $index);
print("$rebuilt\n");
$this->combos++;
}
}
return $rebuilt;
}
public function allCombos($string)
{
if(strlen($string) < 2) {
return null;
}
$this->combos = 0;
for($i = 0; $i < count(str_split($string)) - 1; $i++) {
$string = $this->combos($string);
}
}
}
To make use of the class you would do this:
$combos = new DotCombos();
$combos->allCombos("test123");
print("Count: $combos->combos");
The output would be:
t.est123
te.st123
tes.t123
test.123
test1.23
test12.3
t.est12.3
te.st12.3
tes.t12.3
test.12.3
test1.2.3
t.est1.2.3
te.st1.2.3
tes.t1.2.3
test.1.2.3
t.est.1.2.3
te.st.1.2.3
tes.t.1.2.3
t.es.t.1.2.3
te.s.t.1.2.3
t.e.s.t.1.2.3
Count: 21
Hope that is what you are looking for (or at least helps)....

Reversed numbers puzzle

I'm trying to teach myself to be better at programming. As part of this, I have been taking puzzles that I find in newspapers and magazines and trying to find programming solutions.
Today I saw a puzzle about numbers that are reversed when multiplied by a number from 2 to 9. The example given was 1089 * 9 = 9801.
I have started to write a program in PHP to find the numbers this applies to and add them to an array.
First I created a loop to cycle through the possible numbers. I then reversed each of the numbers and created a function to compare the number and the reversed number. The function then returns numbers that meet the criteria and adds them to an array.
This is what I have so far...
<?php
function mul(){ // multiply number from loop
for($i=2;$i<=9;$i++){
$new = $num * $i;
if($new == $re){
return $new;
}
else{
return;
}
}
}
$arr = array();
for ($num = 1000; $num <10000; $num++) { //loop through possible numbers
$re = strrev($num); // get reverse of number
func($re,$num); //multiply number and return correct numbers
$arr.push($new); //add to array??
}
?>
I'm still very new to PHP and I find understanding programming difficult. Any pointers on a more logical way of doing this puzzle would be greatly appreciated.
Here's my solution with a nested loop. Quick and dirty.
$result = array();
for ($i = 1000; $i < 5000; $i++) {
for ($m = 2; $m < 10; $m++) {
if ($i*$m == (int)strrev($i)) {
$result[] = array($i, $m);
}
}
}
var_dump($result);
I'd like to expand on this line:
if ($i*$m == (int) strrev($i)) {
One side is $i*$m, which is simply the multiplication.
On the other, we have (int)strrev($i), which means "Take $i, cast it to a string, and reverse that. Then cast it back into an int."
If that evaluates to true, an array containing $i and $m is inserted into the $result array.
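The same comparison can be wrapped in a small function so the range is not hard-coded. This is a sketch only, and reverse_multiples() is a hypothetical name. It makes the string cast explicit and uses a strict comparison between the product and the reversed number.

```php
<?php
// Hypothetical helper: [number, multiplier] pairs with $from <= number <= $to
// where number * multiplier (multiplier 2..9) equals the number reversed.
function reverse_multiples(int $from, int $to): array
{
    $result = [];
    for ($i = $from; $i <= $to; $i++) {
        // strrev() works on strings; casting back to int drops leading zeros.
        $reversed = (int) strrev((string) $i);
        for ($m = 2; $m <= 9; $m++) {
            if ($i * $m === $reversed) {
                $result[] = [$i, $m];
            }
        }
    }
    return $result;
}

foreach (reverse_multiples(1000, 9999) as [$number, $multiplier]) {
    echo "$number * $multiplier = " . ($number * $multiplier) . "\n";
}
```

For four-digit numbers this prints 1089 * 9 = 9801 and 2178 * 4 = 8712.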
I was also looking for logic questions to solve while preparing for my interview. Thanks for sharing this question. I have solved it, but in Java, using a very basic concept. I hope you can understand it and convert it to PHP.
public static boolean processNumber(int number)
{
for(int i=2;i<=9;i++)
{
int reverseNumber=number*i;
boolean status=checkReverse(reverseNumber,number);
if(status)
{
return true;
}
}
return false;
}
public static boolean checkReverse(int reverseNumber,int numberOriginal)
{
int number=reverseNumber;
int reverse=0,digit;
do
{
digit=number%10;
number=number/10;
reverse=reverse*10+digit;
}while(number>0);
if(reverse==numberOriginal)
{
return true;
}
else
{
return false;
}
}
// A small attempt of mine; check whether it satisfies your requirements
$n = 1089;
$temp = $n;
$sum = 0;
// reverse the given number digit by digit
while ($n > 0) {
$rem = $n % 10;
$sum = $sum * 10 + $rem;
$n = intdiv($n, 10); // integer division, so no fractional part is carried over
}
// check for a multiplier from 2 to 9 that satisfies the criteria
for ($i = 2; $i <= 9; $i++) {
if ($i * $temp == $sum) {
echo "$temp * $i = $sum";
}
}

PHP: Caching ordered integer partition algorithm

First: The problem's name in Wikipedia is "ordered partition of a set".
I have an algorithm which counts possible partitions. To speed it up, I use a cache:
function partition($intervalSize, $pieces) {
// special case of integer partitions: ordered integer partitions
// in Wikipedia it is: ordered partition of a set
global $partition_cache;
// CACHE START
$cacheId = $intervalSize.'-'.$pieces;
if (isset($partition_cache[$cacheId])) { return $partition_cache[$cacheId]; }
// CACHE END
if ($pieces == 1) { return 1; }
else {
$sum = 0;
for ($i = 1; $i < $intervalSize; $i++) {
$sum += partition(($intervalSize-$i), ($pieces-1));
}
$partition_cache[$cacheId] = $sum; // insert into cache
return $sum;
}
}
$result = partition(8, 4);
Furthermore, I have another algorithm which shows a list of these possible partitions. But it doesn't use a cache yet and so it's quite slow:
function showPartitions($prefix, $start, $finish, $numLeft) {
global $partitions;
if ($numLeft == 0 && $start == $finish) { // when a partition is complete, write it to the array
$gruppen = explode('|', $prefix); // split() was removed in PHP 7
$partitions[] = $gruppen;
}
else {
if (strlen($prefix) > 0) { // don't put | at the start, only between groups
$prefix .= '|';
}
for ($i = $start + 1; $i <= $finish; $i++) {
$prefix .= chr($i+64);
showPartitions($prefix, $i, $finish, $numLeft - 1);
}
}
}
$result = showPartitions('', 0, 8, 4);
So I have two questions:
1) Is it possible to implement a cache in the second algorithm, too? If yes, could you please help me to do this?
2) Is it possible to write the results of the second algorithm into a structured array instead of a string?
I hope you can help me. Thank you very much in advance!
PS: Thanks for the two functions, simonn and Dan Dyer!
1) No, I don't think a cache will help you here because you're never actually performing the same calculation twice. Each call to showPartitions() has different parameters and generates a different result.
2) Yes, of course. You're basically using another level of nested arrays pointing to integers to replace a string of characters separated by pipe characters. (Instead of "A|B|C" you'll have array(array(1), array(2), array(3)).)
Try changing showPartitions() as such:
if ($numLeft == 0 && $start == $finish) { // when a partition is complete, write it to the array
$partitions[] = $prefix;
}
else {
$prefix[] = array();
for ($i = $start + 1; $i <= $finish; $i++) {
$prefix[count($prefix) - 1][] = $i;
showPartitions($prefix, $i, $finish, $numLeft - 1);
}
}
and instead of calling it with an empty string for $prefix, call it with an empty array:
showPartitions(array(), 0, 8, 4);
Off topic: I rewrote the first function to be a little bit faster.
function partition($intervalSize, $pieces) {
// special case of integer partitions: ordered integer partitions
// in Wikipedia it is: ordered partition of a set
// CACHE START
static $partition_cache = array();
if (isset($partition_cache[$intervalSize][$pieces])) {
return $partition_cache[$intervalSize][$pieces];
}
// CACHE END
if ($pieces === 1) {
return 1;
}
if ($intervalSize === 1) {
return 0;
}
$sum = 0;
$subPieces = $pieces - 1;
$i = $intervalSize;
while (--$i) {
$sum += partition($i, $subPieces);
}
$partition_cache[$intervalSize][$pieces] = $sum; // insert into cache
return $sum;
}
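As a cross-check on either version of partition(): the recursion counts the ways to write $intervalSize as an ordered sum of $pieces positive integers, and that count has the closed form C(n-1, k-1). The sketch below computes it with a short loop and no cache. The name compositions_count() is hypothetical.

```php
<?php
// Hypothetical helper: number of ordered sums of $k positive integers that
// add up to $n, computed as the binomial coefficient C($n - 1, $k - 1).
function compositions_count(int $n, int $k): int
{
    if ($k < 1 || $k > $n) {
        return 0;
    }
    $top = $n - 1;
    $choose = $k - 1;
    $result = 1;
    for ($i = 1; $i <= $choose; $i++) {
        // After step $i, $result equals C($top - $choose + $i, $i), so the
        // division is always exact.
        $result = intdiv($result * ($top - $choose + $i), $i);
    }
    return $result;
}

echo compositions_count(8, 4), "\n"; // 35, the same value as partition(8, 4)
```

For large arguments the intermediate product can exceed PHP's integer range, so this is only suitable for moderate sizes.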
Although this question is a bit old, here is a PHP class that implements various combinatorics/simulation methods, including partitions, permutations and combinations, in an efficient way:
https://github.com/foo123/Simulacra/blob/master/Simulacra.php
PS: I am the author.
