Using levenshtein to match target string + extra text - php

I'm working on a website conversion project, and I need to match inexact strings. I'm looking at using levenshtein(), but I don't know what parameters I should set for my task.
Say I have a target string elephant. The match I would want to pull is elephant mouse, for example:
<?php
$target = "elephant";
$data = array(
'elephant mouse',
'rhinoceros',
'alligator',
'hippopotamus',
'rat',
);
foreach ( $data as $datum ) {
echo "$target >> $datum == " . levenshtein($target, $datum) . "\n";
}
And I get the result
elephant >> elephant mouse == 6
elephant >> rhinoceros == 10
elephant >> alligator == 7
elephant >> hippopotamus == 10
elephant >> rat == 6
So while rhino and hippo are at 10, in my actual data set I couldn't really tell the difference between elephant mouse, rat and alligator, which are neck-and-neck at 6 and 7. This is bogus data, but in my real data set, words that are merely closer in length get a much lower score than strings that are target + extra.
How should I configure the options of levenshtein()? I can set new integer values for the cost of insertion, replacement, and deletion. What weighting will give me what I want?

The weighting levenshtein($target, $datum, 1, 10, 10) gives me
elephant >> elephant mouse == 6
elephant >> rhinoceros == 65
elephant >> alligator == 52
elephant >> hippopotamus == 64
elephant >> rat == 60
Which works very well :) Insertion is a low cost, while both replacement and deletion are high. This means that target + extra has a low score, where strings of equal or shorter length, but different characters, have a high cost.

You should probably try to match individual words with levenshtein() rather than entire phrases, since you apparently want to consider a phrase a good match if it contains something that resembles the word being searched for. In other words, split each string in $datum into individual words, run levenshtein($target, $word) for each word, and pick the lowest number. (If $target also can consist of multiple words, you need to split that one too.)
I strongly doubt that you can achieve the desired effect by tweaking the insertion/deletion/replacement costs, because the Levenshtein distance doesn't consider individual words, only the string as a whole. You could try to make insertion very cheap, but that would also give a good score to e.g. "qwErtyLasdEdgfhdPasdxcHdfjAlkjNlkhTkjh" since it contains all the right letters.
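The word-by-word approach described above can be sketched roughly as follows. The helper name bestWordDistance() is made up for illustration, and splitting on whitespace is an assumption about what counts as a word in the real data.

```php
<?php
// Sketch: score a phrase by its closest single word to the target.
// bestWordDistance() is a hypothetical helper, not a PHP built-in.
function bestWordDistance(string $target, string $phrase): int
{
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        // Nothing to split: fall back to the whole-string distance.
        return levenshtein($target, $phrase);
    }
    $best = PHP_INT_MAX;
    foreach ($words as $word) {
        $best = min($best, levenshtein($target, $word));
    }
    return $best;
}

$target = "elephant";
$data = ['elephant mouse', 'rhinoceros', 'alligator', 'hippopotamus', 'rat'];
foreach ($data as $datum) {
    echo "$target >> $datum == " . bestWordDistance($target, $datum) . "\n";
}
```

With this scoring, elephant mouse gets 0 because one of its words matches the target exactly, while single-word entries keep their ordinary distance.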

Related

Extracting two lowest bits and adding to another word

I've spent the last hour going around in circles on this so it's finally time to ask for help. I have two binary 8-bit words such as these:
$words[0]="00000101";
$words[1]="01111001";
I want to take the two rightmost bits (01) of $words[0] and prepend them to $words[1] to make 0101111001 = 377 in decimal.
The easiest way would be to use PHP's string functions to do a substr(), but I'd rather learn how to do it using bitwise operators, as I need to do this for lots of other examples as well.
What I thought I'd do is AND 00000101 with 0x03 to keep the two low bits (01), then shift them 8 places to the left so that I can combine them with $words[1] using OR.
In code that would be:
($array[0] & 0x03 << 8) | $array[1]
but I just get the value of $array[1] back. It seems that it's not possible to shift a value to left more than 8 bits as it gets set to zero (which makes sense).
So, how can I accomplish what I want to do?
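You didn't use binary literals. Binary integer literals start with 0b and are written without quotes. Also note that << binds more tightly than &, so $array[0] & 0x03 << 8 is evaluated as $array[0] & (0x03 << 8); the mask needs its own parentheses.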
This code works:
$words[0]=0b00000101;
$words[1]=0b01111001;
$last = $words[0] & 0b11;
$shift = $last << 8;
$newWord = $shift | $words[1];
echo $newWord;
Or in short form:
echo ($words[0] & 0b11) << 8 | $words[1];
I use the bitwise AND (& 0b11) to get the last two bits of the first word, shift them left with << 8, and combine the result with the second word using the bitwise OR (|).
For the last | you can use + instead, if you want.
EDIT:
If you want the result as binary just use decbin:
echo decbin($newWord);
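If the words have to stay as strings of 0s and 1s, as in the question, a minimal sketch is to convert them with bindec() first and pad the decbin() output back to a fixed width. The width of 10 is an assumption taken from the 2 + 8 bits in the example.

```php
<?php
// The question stores the words as strings of '0'/'1', so convert them first.
$words = ["00000101", "01111001"];

$high = bindec($words[0]); // 5
$low  = bindec($words[1]); // 121

// Parenthesize the mask: << binds more tightly than & in PHP.
$combined = (($high & 0b11) << 8) | $low;

echo $combined, "\n"; // 377

// decbin() drops leading zeros, so pad back to 10 bits (2 + 8).
$bits = str_pad(decbin($combined), 10, "0", STR_PAD_LEFT);
echo $bits, "\n"; // 0101111001
```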

Two-way hashing of fixed range numbers

I need to create a function which takes a single integer as argument in the range 0-N and returns a seemingly random number in the same range.
Each input number should always have exactly one output and it should always be the same.
Such a function would produce something like this:
f(1) = 4
f(2) = 1
f(3) = 5
f(4) = 2
f(5) = 3
I believe this could be accomplished by some kind of a hashing algorithm? I don't need anything complex, just not something too simple like f(1) = 2, f(2) = 3 etc.
The biggest issue is that I need this to be reversible. E.g. the above table should be true left-to-right as well as right-to-left, using a different function for the right-to-left conversion is fine.
I know the easiest way is to create an array, shuffle it and just store the relations in a db or something, but as I need N to be quite large I'd like to avoid this if possible.
Edit: For my particular case N is a specific number, it's exactly 16777216 (64^4).
If the range is always a power of two -- like [0,16777216) -- then you can use exclusive-or just as @MarkBaker suggested. It just doesn't work so easily if your range is not a power of two.
You can use addition and subtraction modulo N, although these alone are too obvious, so you have to combine it with something else.
You can also do multiplication modulo-N, but reversing that is complicated. To make it simpler, we can isolate the bottom eight bits and multiply those and add them in a way that doesn't interfere with those bits so we can use them again to reverse the operation.
I don't know PHP so I'm going to give an example in C, instead. Maybe it's the same.
int enc(int x) {
x = x + 4799 * 256 * (x % 256);
x = x + 8896843;
x = x ^ 4777277;
return (x + 1073741824) % 16777216;
}
And to decode, play the operations back in reverse order:
int dec(int x) {
x = x + 1073741824;
x = x ^ 4777277;
x = x - 8896843;
x = x - 4799 * 256 * (x % 256);
return x % 16777216;
}
That 1073741824 must be a multiple of N, and 256 must be a factor of N, and if N is not a power of two then you can't (necessarily) use exclusive-or (^ is exclusive-or in C and I assume in PHP too). The other numbers you can fiddle with, and add and remove stages, at your leisure.
The addition of 1073741824 in both functions is to ensure that x stays positive; this is so that the modulo operation doesn't ever give a negative result, even after we've subtracted values from x which might have made it go negative in the interim.
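For reference, a direct PHP port of the two C functions might look like the sketch below. It assumes 64-bit PHP integers (the intermediate values also fit in 32 bits, as in the C version), and the constant name RANGE_N is introduced here only for readability.

```php
<?php
// PHP port of the C enc()/dec() shown above; same constants, same order of steps.
const RANGE_N = 16777216; // 2^24, the size of the range

function enc(int $x): int
{
    $x = $x + 4799 * 256 * ($x % 256); // leaves the low 8 bits untouched
    $x = $x + 8896843;
    $x = $x ^ 4777277;
    return ($x + 1073741824) % RANGE_N;
}

function dec(int $x): int
{
    // Undo the steps of enc() in reverse order.
    $x = $x + 1073741824;               // keep x positive through the subtractions
    $x = $x ^ 4777277;
    $x = $x - 8896843;
    $x = $x - 4799 * 256 * ($x % 256);
    return $x % RANGE_N;
}

foreach ([0, 1, 2, 3, 16777215] as $x) {
    $y = enc($x);
    echo "$x -> $y -> " . dec($y) . "\n";
}
```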
I offered to describe how I "randomly" scramble up 9-digit SSNs when producing research data sets. This does not replace or hash an SSN. It re-orders the digits. It is difficult to put the digits back in the correct order if you don't know the order in which they were scrambled. I have a gut feeling that this is not what the questioner really wants. So, I am happy to delete this answer if it is deemed off-topic.
I know that I have 9 digits. So, I start with an array that has 9 index values in order:
$a = array(0,1,2,3,4,5,6,7,8);
Now, I need to turn a key that I can remember into a way to shuffle the array. The shuffling has to be the same order for the same key every time. I use a couple of tricks. I use crc32 to turn a word into a number. I use srand/rand to get a predictable order of random values. Note: mt_rand no longer produces the same sequence of random digits with the same seed, so I have to use rand. (Since PHP 7.1, srand and rand are aliases of mt_srand and mt_rand, so the distinction no longer applies there.)
srand(crc32("My secret key"));
usort($a, function($a, $b) { return rand(-1,1); });
The array $a still has the digits 0 through 8, but they are shuffled. If I use the same keyword I will get the same shuffled order every time. That lets me repeat this every month and get the same result. Then, with a shuffled array, I can pick the digits off the SSN. First, I ensure it has 9 characters (some SSNs are sent as integers and a leading 0 is omitted). Then, I build a masked SSN by picking the digits using $a.
$ssn = str_pad($ssn, 9, '0', STR_PAD_LEFT);
$masked_ssn = '';
foreach ($a as $i) $masked_ssn .= $ssn[$i]; // the $ssn{$i} syntax was removed in PHP 8
$masked_ssn will now have all the digits in $ssn, but in a different order. Technically, there are keywords that make $a become the original ordered array after shuffling, but that is very very rare.
Hopefully this makes sense. If so, you can do it all much faster. If you turn the original string into an array of characters, you can shuffle the array of characters. You just need to reseed rand every time.
$ssn = "111223333"; // Assume I'm using a proper 9-digit SSN
$a = str_split($ssn);
srand(crc32("My secret key"));
usort($a, function($a, $b) { return rand(-1,1); });
$masked_ssn = implode('', $a);
This is not really faster at runtime, because rand is a rather expensive function and you run rand a hell of a lot more here. If you are masking thousands of values as I do, you will want to use an index array that is shuffled just once, not a shuffling for every value.
Now, how do I undo it? Assume I'm using the first method with the index array. It will be something like $a = {5, 3, 6, 1, 0, 2, 7, 8, 4}. Those are the indexes for the original SSN in the masked order. So, I can easily build the original SSN.
$ssn = '000000000'; // I like to define all 9 characters before I start
foreach ($a as $i => $j) $ssn[$j] = $masked_ssn[$i]; // the {$i} offset syntax was removed in PHP 8
As you can see, $i counts from 0 to 8 across the masked SSN. $j counts 5, 3, 6... and puts each value from the masked SSN in the correct place in the original SSN.
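The index-array method can be collected into one self-contained sketch. Note that it swaps in a seeded Fisher-Yates shuffle instead of usort with a random comparator, because the order produced by a random comparator depends on the sort implementation. The function names are hypothetical, and the same key is only assumed to give the same permutation on the same PHP version.

```php
<?php
// Build a key-dependent permutation of the indexes 0..$length-1.
// keyedPermutation(), maskDigits() and unmaskDigits() are illustrative names.
function keyedPermutation(string $key, int $length): array
{
    mt_srand(crc32($key)); // seed from the key so the sequence is repeatable
    $perm = range(0, $length - 1);
    // Fisher-Yates shuffle driven by the seeded generator.
    for ($i = $length - 1; $i > 0; $i--) {
        $j = mt_rand(0, $i);
        [$perm[$i], $perm[$j]] = [$perm[$j], $perm[$i]];
    }
    return $perm;
}

function maskDigits(string $ssn, array $perm): string
{
    // Restore a leading zero that was lost when the SSN arrived as an integer.
    $ssn = str_pad($ssn, count($perm), '0', STR_PAD_LEFT);
    $masked = '';
    foreach ($perm as $i) {
        $masked .= $ssn[$i];
    }
    return $masked;
}

function unmaskDigits(string $masked, array $perm): string
{
    $ssn = str_repeat('0', count($perm));
    foreach ($perm as $i => $j) {
        $ssn[$j] = $masked[$i]; // position $i of the mask came from position $j
    }
    return $ssn;
}

$perm   = keyedPermutation("My secret key", 9);
$masked = maskDigits("111223333", $perm);
echo $masked, "\n";
echo unmaskDigits($masked, $perm), "\n"; // 111223333
```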
It looks like you've got a good answer, but there is still an alternative. A linear congruential generator (LCG) can provide a 1-to-1 mapping, and it is known to be reversible using Euclid's algorithm. For 24 bits:
X_i = (A * X_{i-1} + C) mod M
where M = 2^24 = 16,777,216
A = 16,598,013
C = 12,820,163
For LCG reversibility, take a look at "Reversible pseudo-random sequence generator".
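A minimal sketch of the LCG mapping and its inverse, using the constants above. The mapping is invertible because A is odd and therefore coprime with M = 2^24; the inverse multiplier is computed at runtime with the extended Euclidean algorithm. The function and constant names are made up for this example, and 64-bit PHP integers are assumed so that A * x does not overflow.

```php
<?php
const LCG_M = 16777216; // 2^24
const LCG_A = 16598013;
const LCG_C = 12820163;

// Extended Euclid: returns t with (a * t) mod m == 1, assuming gcd(a, m) == 1.
function modInverse(int $a, int $m): int
{
    [$oldR, $r] = [$a, $m];
    [$oldT, $t] = [1, 0];
    while ($r !== 0) {
        $q = intdiv($oldR, $r);
        [$oldR, $r] = [$r, $oldR - $q * $r];
        [$oldT, $t] = [$t, $oldT - $q * $t];
    }
    if ($oldR !== 1) {
        throw new InvalidArgumentException("a is not invertible modulo m");
    }
    return (($oldT % $m) + $m) % $m;
}

function lcgForward(int $x): int
{
    return (LCG_A * $x + LCG_C) % LCG_M;
}

function lcgBackward(int $y): int
{
    static $aInv = null;
    if ($aInv === null) {
        $aInv = modInverse(LCG_A, LCG_M);
    }
    $d = ($y - LCG_C) % LCG_M;
    if ($d < 0) {
        $d += LCG_M; // PHP's % keeps the sign of the left operand
    }
    return ($aInv * $d) % LCG_M;
}

foreach ([0, 1, 2, 3, 16777215] as $x) {
    $y = lcgForward($x);
    echo "$x -> $y -> " . lcgBackward($y) . "\n";
}
```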

Similar text percentage in php

It seems so easy to find the percentage between two strings using php code, I just use
int similar_text ( string $first , string $second [, float &$percent ] )
but assume that I have two strings for example:
1- Sponsors back away from Sharapova after failed drug test
2- Maria Sharapova failed drugs test at Australian Open
With similar_text tool I got 53.7% but it doesn't make any sense because the two strings are talking about "failed drug test" for "Sharapova" and the percent should be more than 53.7%.
My question is: is there any way to find the real similarity percent between two strings?
I have implemented several algorithms that will search for duplicates and they can be quite similar.
The approach I am usually using is the following:
1) normalize the strings
2) use a comparison algorithm (e.g. similar_text, levenshtein, etc.)
It appears to me that by implementing step 1) you will be able to improve your results drastically.
Example of normalization algorithm (I use "Sponsors back away from Sharapova after failed drug test" for the details):
1) lowercase the string
-> "sponsors back away from sharapova after failed drug test"
2) explode string in words
-> [sponsors, back, away, from, sharapova, after, failed, drug, test]
3) remove noise words (prepositions and other stop words, e.g. in, for, that, this, etc.). This step can be customized to your needs
-> [sponsors, sharapova, failed, drug, test]
4) sort the array alphabetically (optional, but this can help implementing the algorithm...)
-> [drug, failed, sharapova, sponsors, test]
Applying the very same algorithm to your other string, you would obtain:
[australian, drugs, failed, maria, open, sharapova, test]
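The four normalization steps can be sketched as a small function. The name normalizeTitle() and the stop-word list are assumptions for this example; the list is only long enough to reproduce the two arrays shown above.

```php
<?php
// normalizeTitle() and the stop-word list below are illustrative assumptions.
function normalizeTitle(string $text, array $stopWords): array
{
    // Steps 1 and 2: lowercase, then split into words on whitespace.
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    // Step 3: drop the noise words and reindex the array.
    $words = array_values(array_diff($words, $stopWords));
    // Step 4: sort alphabetically.
    sort($words);
    return $words;
}

$stopWords = ['back', 'away', 'from', 'after', 'at', 'in', 'for', 'that', 'this'];

$words1 = normalizeTitle("Sponsors back away from Sharapova after failed drug test", $stopWords);
$words2 = normalizeTitle("Maria Sharapova failed drugs test at Australian Open", $stopWords);

echo implode(', ', $words1), "\n"; // drug, failed, sharapova, sponsors, test
echo implode(', ', $words2), "\n"; // australian, drugs, failed, maria, open, sharapova, test
```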
This will help you elaborate a clever algorithm. For example:
1) for each word in the first string, search the highest similarity in the words of the second string
2) accumulate the highest similarity
3) divide the accumulated similarity by the number of words
$words1 = ['drug', 'failed', 'sharapova', 'sponsors', 'test'];
$words2 = ['australian', 'drugs', 'failed', 'maria', 'open', 'sharapova', 'test'];
$nbWords1 = count($words1);
$stringSimilarity = 0;
foreach ($words1 as $word1) {
    $max = null;
    $similarity = null;
    foreach ($words2 as $word2) {
        similar_text($word1, $word2, $similarity);
        if ($similarity > $max) { //1)
            $max = $similarity;
        }
    }
    $stringSimilarity += $max; //2)
}
var_dump(($stringSimilarity/$nbWords1)); //3)
Running this code will give you 84.83660130719. Not bad, I think ^^. I am sure this algorithm can be further refined, but this is a good start... Also, here we are basically computing the average of the best similarity percentage for each word; you may want a different final approach... tune for your needs ;-)

Fun Math with php product picker- permutations, combinations, something in between

I have a product configurator for a web store. I must generate a "simple product sku" for every possible combination of items. The item is a box containing bags of potato chips. The box can be divided into 1, 2 or 3 compartments for different flavors.
1 compartment is trivial. Just iterate through the flavors and spit out a sku for each.
2 compartments is still easy: just use a combination (N choose 2) using the PHP Math_Combinatorics library.
3 compartments is hard!
Unlike the 2 compartment option where flavors must be unique, with three compartments you can have say:
BBQ, BBQ, PLAIN
However, we don't want to make a sku for
BBQ, BBQ, PLAIN and PLAIN, BBQ, BBQ
So, this is neither a combination nor a permutation function anymore.
My idea is to generate the permutations, then assign a numeric value to each flavor, add each line up, and if two lines add to the same number, they are a duplicate combination.
Only duplicate combinations should add up to the same values. I'm thinking of this in terms of how Unix File system permissions work- the octal numbers for read, write and execute add up to 7 in octal. Does anyone know how to choose the correct values (1,2,4 in unix perms) to make this work?
Any other approaches come to mind?
Thanks!
I don't know if I totally understand the question but what about this:
<?php
//Generate fake combo
$choices = array('bbq', 'plain', 'one', 'two', 'three');
$combo = array();
$total = rand(1,6);
for ($x = 1; $x <= $total; $x++) {
$combo[] = $choices[array_rand($choices)];
}
var_dump($combo);
//Make SKU
$combo = array_unique($combo);
sort($combo);
$sku = implode('-', $combo);
echo $sku;
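One caveat with the snippet above: array_unique() collapses repeated flavors, so bbq, bbq, plain and bbq, plain, plain would both end up as bbq-plain. A sketch that keeps the repeats is to generate combinations with repetition directly, always picking flavors in sorted order so each multiset appears exactly once. The helper name and the flavor names are made up, and it assumes a box with three compartments of the same flavor is also allowed.

```php
<?php
// skusWithRepetition() is a hypothetical helper: every multiset of $slots
// flavors is emitted once, as a '-'-joined string in sorted order.
function skusWithRepetition(array $flavors, int $slots): array
{
    sort($flavors);
    $skus = [];
    $build = function (int $start, array $picked) use (&$build, &$skus, $flavors, $slots) {
        if (count($picked) === $slots) {
            $skus[] = implode('-', $picked);
            return;
        }
        // Never go back to an earlier flavor, so the order is non-decreasing.
        for ($i = $start; $i < count($flavors); $i++) {
            $build($i, array_merge($picked, [$flavors[$i]]));
        }
    };
    $build(0, []);
    return $skus;
}

$skus = skusWithRepetition(['plain', 'bbq', 'ranch'], 3);
echo count($skus), "\n"; // 10
echo implode("\n", $skus), "\n";
```

For n flavors and 3 compartments this yields (n + 2 choose 3) SKUs, which is 10 for the three example flavors.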

PHP random - excluding/including/floats/negatives and other animals

I need to generate random pairs of numbers (floats) within a certain range.
Basically those are points for [Lat,Lng] pairs (latitude-longitude coordinates).
I thought this would be very straightforward with
<?php echo (rand(0,60).'.'.rand(0,99999999999999).' || '); // yes, a string is ok...?>
but it did not give me control over the number of digits after the decimal point (the resolution), which I need fixed.
So the next phase was:
<?php echo (rand(0,60*pow(10,6)))/pow(10,6).' || '; //pow(10,$f) where $f is the resolution ?>
and this works, sort of.
Sometimes it produces results like
22.212346 || 33.134 || 36.870757 || //(rare , but does happen)
but hey - coordinates run from -90 to 90 (lat) and -180 to 180 (lng) - what about the minus?
echo (rand(0,-180*pow(10,9)))/pow(10,9).' || ';
that should do it, and combining it all together should give me a random string like
23.0239423525 || -135.937419777
So after all this introduction, here are my questions:
Being the newbie that I am, am I missing something? Is there no built-in function to generate random floats within a negative-to-positive range in PHP?
Why does the function above sometimes return only 3, 4 or 5 decimal places when it is told to return 6 (I did not apply any abs or round)? Is there automatic rounding in PHP, and if there is, how do I avoid it?
I have noticed that the "random" is not so random: the generated numbers always fall more or less in a series within a range, close to one another. Is the PHP random a simple very-very-very fast rotating counter?
How do I EXCLUDE a range from this generated range (or actually an array of ranges)?
I know these are a lot of questions, but any help or thought would be great! (And if the first one is answered positively, the rest can almost be ignored :-)
Both rand() and mt_rand() accept a negative min, so maybe you can do this:
$num = mt_rand(-180000000, 180000000)/1000000;
echo number_format($num, 6);
if you want 6 places after decimal point.
In terms of excluding a range, you may have to do it in two steps. First consider the ranges that you do want. Let's say you have 3 ranges from which you want to generate the random number: range 1 = -180 to -10, range 2 = 10 to 100 and range 3 = 120 to 180. Then you can generate a random number from 1 to 3 inclusive, use that to pick one range, and then generate the number in that range.
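A sketch of the two-step idea, with one adjustment: picking each range with equal probability over-represents the narrow ranges, so this version weights each range by its width by drawing a single scaled integer across all the allowed ranges. The function name randomInRanges() is hypothetical, and it assumes integer range bounds and a total width that stays below mt_getrandmax() once scaled.

```php
<?php
// randomInRanges() is an illustrative helper; $ranges is a list of [min, max] pairs.
function randomInRanges(array $ranges, int $decimals = 6): float
{
    if (!$ranges) {
        throw new InvalidArgumentException("at least one range is required");
    }
    $scale = 10 ** $decimals;

    // Total width of all allowed ranges, in units of 10^-$decimals.
    $total = 0;
    foreach ($ranges as [$min, $max]) {
        $total += (int) round(($max - $min) * $scale);
    }

    // One draw over the combined width, then walk the ranges to locate it.
    $pick = mt_rand(0, $total);
    foreach ($ranges as [$min, $max]) {
        $width = (int) round(($max - $min) * $scale);
        if ($pick <= $width) {
            return $min + $pick / $scale;
        }
        $pick -= $width;
    }
    return (float) $max; // not reached: the last range always absorbs the remainder
}

$ranges = [[-180, -10], [10, 100], [120, 180]];
$value  = randomInRanges($ranges);

// number_format() keeps trailing zeros, so the resolution stays fixed at 6 places.
echo number_format($value, 6), "\n";
```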
