PHP - smart, error tolerating string comparison - php

I'm looking for either a routine or a technique for error-tolerant string comparison.
Let's say we have the test string Čakánka (yes, it contains Central European characters).
Now, I want to accept any of the following strings as OK:
cakanka
cákanká
ČaKaNKA
CAKANKA
CAAKNKA
CKAANKA
cakakNa
The problem is that I often swap letters in a word, and I want to minimize the user's frustration at not being able to type a word correctly (for example, when in a rush).
So, I know how to do a case-insensitive comparison (just make it lowercase :]) and I can strip the CE characters; I just can't wrap my head around tolerating a few switched characters.
Also, you often put a character not just one place off (character=>cahracter), but sometimes several places off (character=>carahcter), just because one finger was lazy while typing.
Thank you :]

Not sure (especially about the accents / special characters, which you might have to deal with first), but for characters that are in the wrong place or missing, the levenshtein function, which calculates the Levenshtein distance between two strings, might help you (quoting the manual):
int levenshtein ( string $str1 , string $str2 )
int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2
Other possibly useful functions could be soundex, similar_text, or metaphone.
And some of the user notes on the manual pages of those functions, especially the manual page of levenshtein might bring you some useful stuff too ;-)
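A minimal sketch of how the pieces above might fit together, assuming an explicit accent map and a tolerance of two edits. The helper names normalizeWord and isCloseEnough, the mapping table, and the threshold are illustrative assumptions, not part of PHP. Note that levenshtein counts a swap of two adjacent characters as two edits, which is why the threshold is 2 rather than 1.

```php
<?php
// Hypothetical helper: strip the accents used in the question and fold case.
// The mapping table is an assumption; extend it for your own alphabet.
function normalizeWord(string $word): string
{
    $map = [
        'Č' => 'c', 'č' => 'c', 'Á' => 'a', 'á' => 'a',
        'É' => 'e', 'é' => 'e', 'Í' => 'i', 'í' => 'i',
        'Š' => 's', 'š' => 's', 'Ž' => 'z', 'ž' => 'z',
    ];
    return strtolower(strtr($word, $map));
}

// Hypothetical helper: accept the input if it is within $maxEdits
// insertions, deletions or replacements of the expected word.
function isCloseEnough(string $input, string $expected, int $maxEdits = 2): bool
{
    return levenshtein(normalizeWord($input), normalizeWord($expected)) <= $maxEdits;
}

$expected = 'Čakánka';
foreach (['cakanka', 'cákanká', 'CAAKNKA', 'CKAANKA', 'cakakNa', 'zebra'] as $input) {
    echo $input, ': ', isCloseEnough($input, $expected) ? 'accepted' : 'rejected', "\n";
}
```

A fixed threshold is a blunt tool: for short words two edits can turn one valid word into another, so scaling the tolerance with the word length may be worth considering.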

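You could transliterate the words to Latin characters and use a phonetic algorithm like Soundex to get the essence of your word and compare it to the ones you have. In your case that would be C252 for the variants that keep the consonants in order. Inputs with transposed consonants, such as CKAANKA and cakakNa, come out differently (C520 and C225 with PHP's soundex), so the digest alone will not catch every typo.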
Edit: The problem with comparative functions like levenshtein or similar_text is that you need to call them for each pair of input value and candidate value. That means if you have a database with 1 million entries, you will need to call these functions 1 million times.
But functions like soundex or metaphone, which calculate some kind of digest, can help to reduce the number of actual comparisons. If you store the soundex or metaphone value for each known word in your database, you can narrow down the possible matches very quickly. Once the set of candidate values has been reduced, you can use the comparative functions to find the best match.
Here’s an example:
// building the index that represents your database
$knownWords = array('Čakánka', 'Cakaka');
$index = array();
foreach ($knownWords as $key => $word) {
    $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
    if (!isset($index[$code])) {
        $index[$code] = array();
    }
    $index[$code][] = $key;
}
// test words
$testWords = array('cakanka', 'cákanká', 'ČaKaNKA', 'CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa');
echo '<ul>';
foreach ($testWords as $word) {
    $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
    if (isset($index[$code])) {
        echo '<li> '.$word.' is similar to: ';
        $matches = array();
        foreach ($index[$code] as $key) {
            similar_text(strtolower($word), strtolower($knownWords[$key]), $percentage);
            $matches[$knownWords[$key]] = $percentage;
        }
        arsort($matches);
        echo '<ul>';
        foreach ($matches as $match => $percentage) {
            echo '<li>'.$match.' ('.$percentage.'%)</li>';
        }
        echo '</ul></li>';
    } else {
        echo '<li>no match found for '.$word.'</li>';
    }
}
echo '</ul>';
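Before relying on a digest as an index key, it may help to print the codes for the inputs you care about. The sketch below does that with soundex and metaphone. It folds accents with an explicit map instead of iconv, because the output of //TRANSLIT depends on platform and locale; the foldToAscii helper and its mapping table are assumptions made for this illustration.

```php
<?php
// Hypothetical ASCII folding with an explicit map (an assumption; it only
// covers the accented letters that occur in the question).
function foldToAscii(string $word): string
{
    return strtr($word, ['Č' => 'C', 'č' => 'c', 'Á' => 'A', 'á' => 'a']);
}

$words = ['Čakánka', 'cakanka', 'cákanká', 'CAAKNKA', 'CKAANKA', 'cakakNa'];
foreach ($words as $word) {
    $ascii = foldToAscii($word);
    // Words that print the same code would land in the same index bucket.
    printf("%-8s soundex=%s metaphone=%s\n", $ascii, soundex($ascii), metaphone($ascii));
}
```

Inputs whose consonants are transposed may land in a different bucket than the target word, so a digest index is best treated as a fast first pass, with a fallback to levenshtein or similar_text when the bucket is empty.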

Spelling checkers do something like fuzzy string comparison. Perhaps you can adapt an algorithm based on that reference. Or grab the spell checker guessing code from an open source project like Firefox.

Related

PHP extract comparison operator

I was asked in an interview what would be the fastest way to extract the comparison operator between two statements.
For example, in rate>=4 the comparison operator is '>='. The function should be able to extract '>', '<', '!=', '=', '<=' and '>='.
The function must return the comparison operator.
This is what I wrote, and they marked it as wrong.
function extractcomp($str)
{
    $temp = [];
    $matches = array('>','<','!','=');
    foreach($matches as $match)
    {
        if(strpos($str,$match)!== false)
        {
            $temp[] = $match;
        }
    }
    return implode('',$temp);
}
Does anyone have a better way?
You can read character by character: once you hit the first operator character, you can determine what the next character must be, i.e.:
$ops = ['>','<','!','='];
$str = "rate!=4";
foreach($ops as $op)
{
    if(($c1 = strpos($str, $op)) !== false)
    {
        $c2 = $str[$c1++] . (($str[$c1] == $ops[3]) ? $str[$c1] : "");
        break;
    }
}
echo $c2;
So if the first operator character found is ">", the second one can only be "=" or absent. You take the index of the first character, increment it, and check whether the next character is "=". The loop runs until it finds the first operator character that occurs in the string, then breaks.
EDIT:
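Here's another solution:
$str = "rate!=4";
$arr = array_intersect(str_split($str), ['>','<','=','!']);
// Append the last operator character only when more than one was found,
// so that "rate>4" yields ">" rather than ">>".
echo current($arr) . (count($arr) > 1 ? end($arr) : '');
Not as fast as the loop, but it definitely cuts down on code bloat.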
There's always a better way to optimize the code.
Unless they have some monkeywrenching strings to throw at this custom function, I recommend trim() with a ranged character mask. Something like echo trim('rate>=4',"A..Za..z0..9"); would work for your sample input in roughly half the time.
Code:
function extractcomp($str){
    return trim($str,"A..Za..z0..9");
}
echo extractcomp("rate>=4");
Regarding regex, better efficiency in terms of step count with preg_match() would be to use a character class to match the operators.
Assuming only valid operators will be used, you can use /[><!=]+/ or, if you want to tighten up the length, /[><!=]{1,3}/
Just 8 steps on your sample input string.
This is less strict than Andreas' | based pattern, but takes fewer steps.
It depends on how strict the pattern must be. My pattern will match !==.
If you want to improve your loop method, write a break after you have matched the entire comparison operator.
Actually, you are looping the operators. That would have been their issue (or one of them). Your method will not match ==. I'm not sure if that is a possible comparison (it is not in your list).
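For comparison, here is a sketch of a stricter regex variant that only accepts a fixed set of operators. The function name extractComparisonOperator is hypothetical, and including '==' in the list is an assumption (it is not in the question's list). Two-character operators are listed first so that '>=' is preferred over '>'.

```php
<?php
// Hypothetical helper: return the first comparison operator found in $expr,
// or null when none is present.
function extractComparisonOperator(string $expr): ?string
{
    // Alternatives are tried left to right at each position, so the
    // two-character operators must come before the single-character class.
    if (preg_match('/>=|<=|!=|==|[<>=]/', $expr, $m) === 1) {
        return $m[0];
    }
    return null;
}

foreach (['rate>=4', 'rate<4', 'rate!=4', 'rate=4', 'rate'] as $expr) {
    var_dump(extractComparisonOperator($expr));
}
```

Unlike the character-class patterns above, this will not return invalid runs such as '=!' or '!==' as a whole, at the cost of a few more regex steps.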

filtering words from text with exploits

I have a filter that filters bad words like 'ass', 'fuck', etc. Now I am trying to handle exploits like "f*ck" and "sh/t".
One thing I could do is match each word against a dictionary of bad words that includes such exploits. But this is pretty static and not a good approach.
Another thing I could do is use the Levenshtein distance: words with a Levenshtein distance of 1 would be blocked. But this approach is also prone to false positives.
if(!ctype_alpha($text)&& levenshtein('shit', $text)===1)
{
//match
}
I am looking for some way of using regex. Maybe I can combine the Levenshtein distance with a regex, but I could not figure it out.
Any suggestion would be highly appreciated.
Like stated in the comments, it is hard to get this right. This snippet, far from perfect, will check for matches where letters are substituted for the same number of other characters.
It may give you a general idea of how you could solve this, although much more logic is needed if you want to make it smarter. This filter, for instance will not filter 'fukk', 'f ck', 'f**ck', 'fck', '.fuck' (with leading dot) or 'fück', while it does probably filter out '++++' to replace it with 'beep'. But it also filters 'f*ck', 'f**k', 'f*cking' and 'sh1t', so it could do worse. :)
An easy way to make it better, is to split the string in a smarter way, so punctuation marks aren't glued to the word they are adjacent to. Another improvement could be to remove all non-alphabetic characters from each word, and check if the remaining letters are in the same order in a word. That way, 'f\/ck' would also match 'fuck'. Anyway, let your imagination run wild, but be careful for false positives. And trust me that 'they' will always find a way to express themselves in a way that bypasses your filter.
<?php
$badwords = array('shit', 'fuck');
$text = 'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)';
$words = explode(' ', $text);
// Loop through all words.
foreach ($words as $word)
{
    $naughty = false;
    // Match each bad word against each word.
    foreach ($badwords as $badword)
    {
        // If the word is shorter than the bad word, it's okay.
        // It may be bigger. I've done this mainly, because in the example given,
        // 'f*ck,' will contain the trailing comma. This could be easily solved by
        // splitting the string a bit smarter. But the added benefit, is that it also
        // matches derivatives, like 'f*cking' or 'f*cker', although that could also
        // result in more false positives.
        if (strlen($word) >= strlen($badword))
        {
            $wordOk = false;
            // Check each character in the string.
            for ($i = 0; $i < strlen($badword); $i++)
            {
                // If the letters don't match, and the letter is an actual
                // letter, this is not a bad word.
                if ($badword[$i] !== $word[$i] && ctype_alpha($word[$i]))
                {
                    $wordOk = true;
                    break;
                }
            }
            // If the word is not okay, break the loop.
            if (!$wordOk)
            {
                $naughty = true;
                break;
            }
        }
    }
    // Echo the censored word.
    echo $naughty ? 'beep ' : ($word . ' ');
}
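The same idea can be sketched with a regex built from each bad word, which is closer to what the question asked for. This is an illustration under assumptions, not a complete filter: the helper names looksLike and censor are hypothetical, the bad words are assumed to be plain ASCII, and each letter may be replaced by exactly one non-letter character. It also requires at least one real letter, so '++++' is left alone, and it ignores leading punctuation such as the dot in '.fuck'.

```php
<?php
// Hypothetical helper: true if $token starts like $badword, where any letter
// may have been replaced by a single non-letter, non-whitespace character.
function looksLike(string $token, string $badword): bool
{
    $pattern = '';
    foreach (str_split($badword) as $letter) {
        $pattern .= '(?:' . preg_quote($letter, '/') . '|[^a-z\s])';
    }
    $prefix = substr($token, 0, strlen($badword));
    // Require at least one real letter so that "++++" is not flagged.
    return preg_match('/^' . $pattern . '/i', $token) === 1
        && preg_match('/[a-z]/i', $prefix) === 1;
}

// Hypothetical helper: replace every matching token with "beep".
function censor(string $text, array $badwords): string
{
    // Split on whitespace but keep it, so the text can be rebuilt unchanged.
    $tokens = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($tokens as &$token) {
        // Ignore leading punctuation such as the dot in ".fuck".
        $core = ltrim($token, ".,;:\"'(");
        foreach ($badwords as $badword) {
            if (looksLike($core, $badword)) {
                $token = 'beep';
                break;
            }
        }
    }
    unset($token);
    return implode('', $tokens);
}

echo censor('Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)', ['shit', 'fuck']), "\n";
```

Like the loop above, this still misses 'fukk', 'fck' and 'f**ck', and because it matches prefixes it can flag innocent longer words, so the warning about false positives applies here too.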

PHP Using str_word_count with strsplit to form array after x words

I've got a large string that I want to split into an array after every 50 words. I thought about using str_split, but realised that it won't take the words into consideration; it just splits after x characters.
I've read about str_word_count but can't work out how to put the two together.
What I've got at the moment is:
$outputArr = str_split($output, 250);
foreach($outputArr as $arOut){
echo $arOut;
echo "<br />";
}
But I want to substitute that to form each item of the array at 50 words instead of 250 characters.
Any help will be much appreciated.
Assuming that str_word_count is sufficient for your needs¹, you can simply call it with 1 as the second parameter and then use array_chunk to group the words in groups of 50:
$words = str_word_count($string, 1);
$chunks = array_chunk($words, 50);
You now have an array of arrays; to join every 50 words together and make it an array of strings you can use
foreach ($chunks as &$chunk) { // important: iterate by reference!
    $chunk = implode(' ', $chunk);
}
unset($chunk); // break the reference so later code cannot overwrite the last element
¹ Most probably it is not. If you want to get what most humans consider acceptable results when processing written language you will have to use preg_split with some suitable regular expression instead.
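As a sketch of the preg_split route mentioned in the footnote, the function below splits on whitespace only, so punctuation, digits and accented letters stay attached to their words. The name chunkByWords and the choice of whitespace as the only word boundary are assumptions made for this illustration.

```php
<?php
// Hypothetical helper: split $text into strings of at most $wordsPerChunk words.
function chunkByWords(string $text, int $wordsPerChunk): array
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    // Group the words, then join each group back into one string.
    return array_map(
        function (array $chunk): string {
            return implode(' ', $chunk);
        },
        array_chunk($words, $wordsPerChunk)
    );
}

// Small demonstration with 3 words per chunk instead of 50.
foreach (chunkByWords('one two, three four five. six seven', 3) as $chunk) {
    echo $chunk, "<br />\n";
}
```

Original line breaks and repeated spaces are collapsed into single spaces by this approach; if they matter, capturing the delimiters with PREG_SPLIT_DELIM_CAPTURE is one way to keep them.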
There's another way:
<?php
$someBigString = <<<SAMPLE
This, actually, is a nice' old'er string, as they said, "divided and conquered".
SAMPLE;
// change this to whatever you need to:
$number_of_words = 7;
$arr = preg_split("#([a-z]+[a-z'-]*(?<!['-]))#i",
$someBigString, $number_of_words + 1, PREG_SPLIT_DELIM_CAPTURE);
$res = implode('', array_slice($arr, 0, $number_of_words * 2));
echo $res;
I consider preg_split a better tool (than str_word_count) here. Not because the latter is inflexible (it is not: you can define what symbols can make up a word with its third param), but because preg_split will essentially stop processing the string after getting N items.
The trick, as quite common with this function, is to capture delimiters as well, then use them to reconstruct the string with the first N words (where N is given) AND punctuation marks saved.
(Of course, the regex used in my example does not strictly match the locale-dependent behavior of str_word_count. But it still restricts words to letters, ' and -, with the latter two not allowed at the beginning or the end of a word.)

Find longest repeating strings?

I have some HTML/CSS/JavaScript with painfully long class, id, variable and function names and other, combined strings that get used over and over. I could probably rename or restructure a few of them and cut the text in half.
So I'm looking for a simple algorithm that reports on the longest repeated strings in text. Ideally, it would reverse sort by length times instances, so as to highlight the strings that, if renamed globally, would yield the most savings.
This feels like something I could do painfully in 100 lines of code, for which there's some elegant, 10-line recursive regex. It also sounds like a homework problem, but I assure you it's not.
I work in PHP, but would enjoy seeing something in any language.
NOTE: I'm not looking for HTML/CSS/JavaScript minification per se. I like meaningful text, so I want to do it by hand, and weigh legibility against bloat.
This will find all repeated strings:
(?=((.+)(?:.*?\2)+))
Use that with preg_match_all and select the longest one.
function len_cmp($match1, $match2) {
    return $match2[0] - $match1[0];
}
preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $text, $matches, PREG_SET_ORDER);
// Iterate by reference so the computed score is stored in $matches.
foreach ($matches as &$match) {
    $match[0] = substr_count($match[1], $match[2]) * strlen($match[2]);
}
unset($match);
usort($matches, "len_cmp");
foreach ($matches as $match) {
    echo "($match[2]) $match[1]\n";
}
This method could be quite slow though, as there could be a LOT of strings repeating. You could reduce it somewhat by specifying a minimum length, and a minimum number of repetitions in the pattern.
(?=((.{3,})(?:.*?\2){2,}))
This limits the repeated string to at least three characters and requires at least three occurrences (first + 2).
Edit: Changed to allow characters between the repetitions.
Edit: Changed sorting order to reflect best match.
Seems I'm a little late, but it also does the work:
preg_match_all('/(id|class)="([a-zA-Z0-9_ -]+)"/', $html, $matches);
$result = explode(" ", implode(" ", $matches[2]));
$parsed = array();
foreach($result as $string) {
if(isset($parsed[$string])) {
$parsed[$string]++;
} else {
$parsed[$string] = 1;
}
}
arsort($parsed);
foreach($parsed as $k => $v) {
echo $k . " -> Found " . $v . " times<br/>";
}
The output will be something like:
some_id -> Found 2 times
some_class -> Found 2 times
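To get the "length times instances" ranking the question asks for, the counting idea can be extended to every identifier-like token rather than only id and class values. The sketch below is one possible take; the function name rankRepeatedTokens, the token pattern, and the minimum length of 4 are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: score each repeated identifier-like token by
// strlen(token) * occurrences and return the scores, highest first.
function rankRepeatedTokens(string $source, int $minLength = 4): array
{
    // Tokens start with a letter or underscore, then letters, digits, "_" or "-".
    preg_match_all('/[A-Za-z_][A-Za-z0-9_-]*/', $source, $m);
    $scores = [];
    foreach (array_count_values($m[0]) as $token => $count) {
        if ($count > 1 && strlen($token) >= $minLength) {
            $scores[$token] = strlen($token) * $count;
        }
    }
    arsort($scores);
    return $scores;
}

$html = '<div class="main-navigation-wrapper"><ul class="main-navigation-wrapper list"></ul></div>';
foreach (rankRepeatedTokens($html) as $token => $score) {
    echo $token, ' -> score ', $score, "\n";
}
```

This only sees whole tokens, so it will not notice a long prefix shared by several different names; the lookahead regex above is the tool for that case.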

How to add currency strings (non-standardized input) together in PHP?

I have a form in which people will be entering dollar values.
Possible inputs:
$999,999,999.99
999,999,999.99
999999999
99,999
$99,999
The user can enter a dollar value however they wish. I want to read the inputs as doubles so I can total them.
I tried just typecasting the strings to doubles but that didn't work. Total just equals 50 when it is output:
$string1 = "$50,000";
$string2 = "$50000";
$string3 = "50,000";
$total = (double)$string1 + (double)$string2 + (double)$string3;
echo $total;
A regex won't convert your string into a number. I would suggest that you use a regex to validate the field (confirm that it fits one of your allowed formats), and then just loop over the string, discarding all non-digit and non-period characters. If you don't care about validation, you could skip the first step. The second step will still strip it down to digits and periods only.
By the way, you cannot safely use floats when calculating currency values. You will lose precision, and very possibly end up with totals that do not exactly match the inputs.
Update: Here are two functions you could use to verify your input and to convert it into a decimal-point representation.
function validateCurrency($string)
{
    // The decimal point is escaped so that only a literal "." is accepted.
    return preg_match('/^\$?(\d{1,3})(,\d{3})*(\.\d{2})?$/', $string) ||
           preg_match('/^\$?\d+(\.\d{2})?$/', $string);
}
function makeCurrency($string)
{
    $newstring = "";
    $array = str_split($string);
    foreach($array as $char)
    {
        if (($char >= '0' && $char <= '9') || $char == '.')
        {
            $newstring .= $char;
        }
    }
    return $newstring;
}
The first function will match the bulk of currency formats you can expect "$99", "99,999.00", etc. It will not match ".00" or "99.", nor will it match most European-style numbers (99.999,00). Use this on your original string to verify that it is a valid currency string.
The second function will just strip out everything except digits and decimal points. Note that by itself it may still return invalid strings (e.g. "", "....", and "abc" come out as "", "....", and ""). Use this to eliminate extraneous commas once the string is validated, or possibly use this by itself if you want to skip validation.
You don't ever want to represent monetary values as floats!
For example, take the following (seemingly straightforward) code:
$x = 1.0;
for ($ii=0; $ii < 10; $ii++) {
$x = $x - .1;
}
var_dump($x);
You might assume that it would produce the value zero, but that is not the case. Since $x is a floating-point number, it actually ends up being a tiny bit more than zero (1.38777878078E-16), which isn't a big deal in itself, but it means that comparing the value with another value isn't guaranteed to give the result you expect. For example, $x == 0 would produce false.
http://p2p.wrox.com/topic.asp?TOPIC_ID=3099
goes through it step by step
[edit] typical...the site seems to be down now... :(
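The effect is easy to reproduce side by side. The sketch below repeats the subtraction once with floats and once with integer cents; the variable names and the tolerance of 0.00001 are arbitrary choices for this illustration.

```php
<?php
// Subtract 0.10 ten times from 1.00, once as a float and once as integer cents.
$asFloat = 1.0;
$asCents = 100;
for ($i = 0; $i < 10; $i++) {
    $asFloat -= 0.1;
    $asCents -= 10;
}

var_dump($asFloat == 0);            // comparison on the accumulated float
var_dump($asCents === 0);           // exact comparison on integer cents
var_dump(abs($asFloat) < 0.00001);  // float comparison with a tolerance
```

If floats cannot be avoided, comparing with a tolerance as in the last line is the usual workaround; keeping amounts in integer cents avoids the problem altogether.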
Not a one-liner, but if you strip out the ','s you can do:
if (preg_match('/^\$?(\d+)(?:\.(\d\d))?$/', $string, $m)) {
    $value = $m[1] + (isset($m[2]) ? $m[2] / 100 : 0);
}
That allows $9.99 but not $9. or $9.9, and it fails to complain about misplaced thousands separators (bug or feature?)
There is a potential locale issue here, because you are assuming that thousands are separated with ',' and cents with '.', but in much of Europe it is the opposite (e.g. 1.000,99).
I recommend not to use a float for storing currency values. You can get rounding errors if the sum gets large. (Ok, if it gets very large.)
Better use an integer variable with a large enough range, and store the input in cents, not dollars.
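A minimal sketch of the cents approach, assuming US-style input where ',' is a thousands separator and '.' is the decimal point. The function name toCents is hypothetical, and it deliberately does not check whether the commas are in the right places.

```php
<?php
// Hypothetical helper: parse "$99,999.99"-style input into integer cents.
// Returns null when the input does not look like a dollar amount.
function toCents(string $input): ?int
{
    // Assumption: commas are thousands separators and can simply be dropped.
    $clean = str_replace(',', '', trim($input));
    if (preg_match('/^\$?(\d+)(?:\.(\d{2}))?$/', $clean, $m) !== 1) {
        return null;
    }
    $cents = isset($m[2]) ? (int) $m[2] : 0;
    return (int) $m[1] * 100 + $cents;
}

$total = 0;
foreach (['$50,000', '$50000', '50,000'] as $amount) {
    $total += toCents($amount) ?? 0; // treat unparseable input as zero here
}
// Convert back to dollars only for display.
echo number_format($total / 100, 2), "\n";
```

In a real form you would probably want to report a null result back to the user instead of silently counting it as zero.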
I believe you can accomplish this with printf, which is similar to the C function of the same name. Its parameters can be somewhat esoteric, though. You can also use PHP's number_format function.
Assuming that you are getting real money values, you could simply strip characters that are not digits or the decimal point:
$newnumber = preg_replace('/[^0-9.]/', '', $oldnumber);
Now you can convert using something like
$value = (float) $newnumber;
However, this will not take care of strings such as "5.6.3" and other such non-money strings. Which raises the question, "Do you need to handle badly formatted strings?"
