PHP Regex for similarity check

PHP Regex for similarity check - php

Can you think of any regular expression that resolves these similarities in PHP? The idea is to get a match without considering the last letters.
<?php
$word1 = 'happyness';
$word2 = 'happys';
if (substr($word1, 0, -4) == substr($word2, 0, -1))
{
echo 'same word1';
}
$word1 = 'kisses';
$word2 = 'kiss';
if (substr($word1, 0, -2) == $word2)
{
echo 'same word2';
}
$word1 = 'consonant';
$word2 = 'consonan';
if (substr($word1, 0, -1) == $word2)
{
echo 'same word3';
}

By putting the words together like happys happyness and capturing as many word characters from word 1 as word 2 matches. See this demo at regex101. Use it with the i flag for casless matching.
^(\w+)\w* \1
To use this in PHP with preg_match see this PHP demo at tio.run
preg_match('/^(\w+)\w* \1/i', preg_quote($word1,'/')." ".preg_quote($word2,'/'), $out);
where $out[1] holds the captures or $out would be an empty array if there wasn't a match.

You could use a small helper function, the first function just matches up to the length of the second string, so doesn't care how many characters it truncates. The main code works similar to your code except it uses the length of the second value as the length of the substring to take...
function match( string $a, string $b ) {
return substr($a, 0, strlen($b)) === $b;
}
This function is slightly more complicated as it takes into account a maximum gap length...
function match( string $a, string $b, int $length = 3 ) {
$len = max(strlen($a)-$length, strlen($b));
return substr($a, 0, $len) === $b;
}
So call it something along the lines of
$word1 = 'happyness';
$word2 = 'happys';
if (match($word1,$word2))
{
echo 'same word1';
}

You can use preg_match to match these data with regex as /^word2/ against word1. So regex would check if word1 starts with word2 or not, because of ^ symbol at the start.
It's always better to preg_quote() before matching to escape regex meta characters for accurate results.
<?php
$tests = [
[
'happyness',
'happys'
],
[
'kisses',
'kiss'
],
[
'consonant',
'consonan'
]
];
$filtered = array_filter($tests,function($values){
$values[1] = preg_quote($values[1]);
return preg_match("/^$values[1]/",$values[0]) === 1;
});
print_r($filtered);
Demo: https://3v4l.org/SLf15

You could also do a small function to find the similarity between the given 2 words. It could look like:
function similarity($word1, $word2)
{
$splittedWord1 = str_split($word1);
$splittedWord2 = str_split($word2);
$similarChars = array_intersect_assoc($splittedWord1, $splittedWord2);
return count($similarChars) / max(count($splittedWord1), count($splittedWord2));
}
var_dump(similarity('happyness', 'happys'));
var_dump(similarity('happyness', 'testhappys'));
var_dump(similarity('kisses', 'kiss'));
var_dump(similarity('consonant', 'consonan'));
The result would look like:
float(0.55555555555556)
int(0)
float(0.66666666666667)
float(0.88888888888889)
Based on the resulted percentage you could decide if the given words should be considered the same or not.

I'm not sure regex is the answer here.
You could try similar_text(), which returns the number of similar characters (and optionally sets a percentage value to a variable). Maybe if you consider the last two letters as non-important, you can see if the strlen() - $skippedCharacters is the same as what is matched. For example:
$skippedCharacters = 2;
$word1 = 'kisses';
$word2 = 'kiss';
$match = similar_text($word1, $word2);
if ($match + $skippedCharacters >= strlen($word1))
{
echo 'same word2';
}

You could use the PHP levenshtein function.
The levenshtein() function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2.
$lev = levenshtein($word1, $word2);
The lower the number the bigger the similarity.

Related

Count how many numbers appears in a string

I was wondering if there's a combo of functions or a direct function that can count how many numbers appears in a string, without use a long-way as str_split and check every character in a loop.
From a string like:
fdsji2092mds1039m
It returns that there's 8 numbers inside.

You can use filter_var() with the FILTER_SANITIZE_NUMBER_INT constant, then check the length of the new string. The new string will contain only numbers from that string, and all other characters are filtered away.
$string = "j3987snmj3j";
$numbers = filter_var($string , FILTER_SANITIZE_NUMBER_INT);
$length = strlen($numbers); // 5
echo "There are ".$length." numbers in that string";
Note that each number will be counted individually, so 137 would return 3, as would 1m3j7.
Live demo

Other solution:
function countNumbers(string $string) {
return preg_match_all('/\d/', $string, $m);
}
You can use regular expression

Try like this:
$myString = 'Som3 Charak1ers ar3 N0mberZ h3re ;)';
$countNumbers = strlen((string)filter_var($myString, FILTER_SANITIZE_NUMBER_INT));
echo 'Your input haz ' . $countNumbers . ' digits in it, man';
You can also make a function out of this to return only the number, if you need it.

Following code does what you intend to do:
<?php
$string = 'dsfds98fsdfsdf8sdf908f9dsf809fsd809f8s15d0d';
$splits=str_split($string);
$count=0;
foreach ($splits as $split){
if(is_numeric($split)){
$count++;
}
}
print_r($count);
Output: 17

PHP - Turn this string: "adc 25...123.50 xyz" into 2 variables: "25" and "123.50"?

The title almost much sums what i am trying to accomplish.
I have a string that could consist of letters in the alphabet or, numbers or characters like ")" and "*". It may also include a numeric string separated by three dots "...", e.g. "25...123.50".
An example of this string could be:
peaches* 25...123.50 +("apples") or -(peaches*) apples* 25...123.50
Now, what i would like to do is capture the numbers before and after the three dots, so i end up with 2 variables, 25 and 123.50. I would then like to trim the string so that i end up with a string that excludes the number values:
peaches* +("apples") or -(peaches*) apples*
So essentially:
$string = 'peaches* 25...123.50 +("apples")';
if (preg_match("/\.\.\./", $string ))
{
# How do i get the left value (could or could not be a decimal, using .)
$from = 25;
# How do i get the right value (could or could not be a decimal, using .)
$to = 123.50;
# How do i remove the value "here...here" is this right?
$clean = preg_replace('/'.$from.'\.\.\.'.$to.'/', '', $string);
$clean = preg_replace('/ /', ' ', $string);
}
If anyone could provide me with some input on the best way to go about this complicated task it would be greatly appreciated! Any suggestions, advice, input, feedback or comments are most welcome, Thank you!

This preg_match should work:
$str = 'peaches* 25...123.50 +("apples")';
if (preg_match('~(\d+(?:\.\d+)?)\.{3}(\d+(?:\.\d+)?)~', $str, $arr))
print_r($arr);

Pseudo code
In a loop:
Perform a strpos for "..." and substr at that position. Then go back from the end of that substring (character by character), checking to see if each is_numeric or a period. On the first non-numeric/non-period occurrence, you grab a substring from the beginning of the original string to that point (store it temporarily). Then start checking for is_numeric or period in the other direction. Grab a substring and add it to the other substring you stored. Repeat.
It's not a regex, but it will accomplish the same goal nonetheless.
Some php
$my_string = "blah blah abc25.4...123.50xyz blah blah etc";
$found = 1;
while($found){
$found = $cursor = strpos($my_string , "...");
if(!empty($found)){
//Go left
$char = ".";
while(is_numeric($char) || $char == "."){
$cursor--;
$char = substr($my_string , $cursor, 1);
}
$left_substring = substr($my_string , 1, $cursor);
//Go right
$cursor = $found + 2;
$char = ".";
while(is_numeric($char) || $char == "."){
$cursor++;
$char = substr($my_string , $cursor, 1);
}
$right_substring = substr($my_string , $cursor);
//Combine the left and right
$my_string = $left_substring . $right_substring;
}
}
echo $my_string;

str replace - replace the x-value

Assuming I have a string
$str="0000,1023,1024,1025,1024,1023,1027,1025,1024,1025,0000";
there are three 1024, I want to replace the third with JJJJ, like this :
output :
0000,1023,1024,1025,1024,1023,1027,1025,JJJJ,1025,0000
how to make str_replace can do it
thanks for the help

As your question asks, you want to use str_replace to do this. It's probably not the best option, but here's what you do using that function. Assuming you have no other instances of "JJJJ" throughout the string, you could do this:
$str = "0000,1023,1024,1025,1024,1023,1027,1025,1024,1025,0000";
$str = str_replace('1024','JJJJ',$str,3)
$str = str_replace('JJJJ','1024',$str,2);

Here is what I would do and it should work regardless of values in $str:
function replace_str($str,$search,$replace,$num) {
$pieces = explode(',',$str);
$counter = 0;
foreach($pieces as $key=>$val) {
if($val == $search) {
$counter++;
if($counter == $num) {
$pieces[$key] = $replace;
}
}
}
return implode(',',$pieces);
}
$str="0000,1023,1024,1025,1024,1023,1027,1025,1024,1025,0000";
echo replace_str($str, '1024', 'JJJJ', 3);
I think this is what you are asking in your comment:
function replace_element($str,$search,$replace,$num) {
$num = $num - 1;
$pieces = explode(',',$str);
if($pieces[$num] == $search) {
$pieces[$num] = $replace;
}
return implode(',',$pieces);
}
$str="0000,1023,1024,1025,1024,1023,1027,1025,1024,1025,0000";
echo replace_element($str,'1024','JJJJ',9);

strpos has an offset, detailed here: http://php.net/manual/en/function.strrpos.php
So you want to do the following:
1) strpos with 1024, keep the offset
2) strpos with 1024 starting at offset+1, keep newoffset
3) strpos with 1024 starting at newoffset+1, keep thirdoffset
4) finally, we can use substr to do the replacement - get the string leading up to the third instance of 1024, concatenate it to what you want to replace it with, then get the substr of the rest of the string afterwards and concatenate it to that. http://www.php.net/manual/en/function.substr.php

You can either use strpos() three times to get the position of the third 1024 in your string and then replace it, or you could write a regex to use with preg_replace() that matches the third 1024.

if you want to find the last occurence of your string you can used strrpos

Do it like this:
$newstring = substr_replace($str,'JJJJ', strrpos($str, '1024'), strlen('1024') );
See working demo

Here's a solution with less calls to one and the same function and without having to explode, iterate over the array and implode again.
// replace the first three occurrences
$replaced = str_replace('1024', 'JJJJ', $str, 3);
// now replace the firs two, which you wanted to keep
$final = str_replace('JJJJ', '1024', $replaced, 2);

Get the current + the next word in a string

this is what I try to get:
My longest text to test When I search for e.g. My I should get My longest
I tried it with this function to get first the complete length of the input and then I search for the ' ' to cut it.
$length = strripos($text, $input) + strlen($input)+2;
$stringpos = strripos($text, ' ', $length);
$newstring = substr($text, 0, strpos($text, ' ', $length));
But this only works first time and then it cuts after the current input, means
My lon is My longest and not My longest text.
How I must change this to get the right result, always getting the next word. Maybe I need a break, but I cannot find the right solution.
UPDATE
Here is my workaround till I find a better solution. As I said working with array functions does not work, since part words should work. So I extended my previous idea a bit. Basic idea is to differ between first time and the next. I improved the code a bit.
function get_title($input, $text) {
$length = strripos($text, $input) + strlen($input);
$stringpos = stripos($text, ' ', $length);
// Find next ' '
$stringpos2 = stripos($text, ' ', $stringpos+1);
if (!$stringpos) {
$newstring = $text;
} else if ($stringpos2) {
$newstring = substr($text, 0, $stringpos2);
} }
Not pretty, but hey it seems to work ^^. Anyway maybe someone of you have a better solution.

You can try using explode
$string = explode(" ", "My longest text to test");
$key = array_search("My", $string);
echo $string[$key] , " " , $string[$key + 1] ;
You can take i to the next level using case insensitive with preg_match_all
$string = "My longest text to test in my school that is very close to mY village" ;
var_dump(__search("My",$string));
Output
array
0 => string 'My longest' (length=10)
1 => string 'my school' (length=9)
2 => string 'mY village' (length=10)
Function used
function __search($search,$string)
{
$result = array();
preg_match_all('/' . preg_quote($search) . '\s+\w+/i', $string, $result);
return $result[0];
}

There are simpler ways to do that. String functions are useful if you don't want to look for something specific, but cut out a pre-defined length of something. Else use a regular expression:
preg_match('/My\s+\w+/', $string, $result);
print $result[0];
Here the My looks for the literal first word. And \s+ for some spaces. While \w+ matches word characters.
This adds some new syntax to learn. But less brittle than workarounds and lengthier string function code to accomplish the same.

An easy method would be to split it on whitespace and grab the current array index plus the next one:
// Word to search for:
$findme = "text";
// Using preg_split() to split on any amount of whitespace
// lowercasing the words, to make the search case-insensitive
$words = preg_split('/\s+/', "My longest text to test");
// Find the word in the array with array_search()
// calling strtolower() with array_map() to search case-insensitively
$idx = array_search(strtolower($findme), array_map('strtolower', $words));
if ($idx !== FALSE) {
// If found, print the word and the following word from the array
// as long as the following one exists.
echo $words[$idx];
if (isset($words[$idx + 1])) {
echo " " . $words[$idx + 1];
}
}
// Prints:
// "text to"

Find first character that is different between two strings

Given two equal-length strings, is there an elegant way to get the offset of the first different character?
The obvious solution would be:
for ($offset = 0; $offset < $length; ++$offset) {
if ($str1[$offset] !== $str2[$offset]) {
return $offset;
}
}
But that doesn't look quite right, for such a simple task.

You can use a nice property of bitwise XOR (^) to achieve this: Basically, when you xor two strings together, the characters that are the same will become null bytes ("\0"). So if we xor the two strings, we just need to find the position of the first non-null byte using strspn:
$position = strspn($string1 ^ $string2, "\0");
That's all there is to it. So let's look at an example:
$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$pos = strspn($string1 ^ $string2, "\0");
printf(
'First difference at position %d: "%s" vs "%s"',
$pos, $string1[$pos], $string2[$pos]
);
That will output:
First difference at position 7: "a" vs "i"
So that should do it. It's very efficient since it's only using C functions, and requires only a single copy of memory of the string.
Edit: A MultiByte Solution Along The Same Lines:
function getCharacterOffsetOfDifference($str1, $str2, $encoding = 'UTF-8') {
return mb_strlen(
mb_strcut(
$str1,
0, strspn($str1 ^ $str2, "\0"),
$encoding
),
$encoding
);
}
First the difference at the byte level is found using the above method and then the offset is mapped to the character level. This is done using the mb_strcut function, which is basically substr but honoring multibyte character boundaries.
var_dump(getCharacterOffsetOfDifference('foo', 'foa')); // 2
var_dump(getCharacterOffsetOfDifference('©oo', 'foa')); // 0
var_dump(getCharacterOffsetOfDifference('f©o', 'fªa')); // 1
It's not as elegant as the first solution, but it's still a one-liner (and if you use the default encoding a little bit simpler):
return mb_strlen(mb_strcut($str1, 0, strspn($str1 ^ $str2, "\0")));

If you convert a string to an array of single character one byte values you can use the array comparison functions to compare the strings.
You can achieve a similar result to the XOR method with the following.
$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$array1 = str_split($string1);
$array2 = str_split($string2);
$result = array_diff_assoc($array1, $array2);
$num_diff = count($result);
$first_diff = key($result);
echo "There are " . $num_diff . " differences between the two strings. <br />";
echo "The first difference between the strings is at position " . $first_diff . ". (Zero Index) '$string1[$first_diff]' vs '$string2[$first_diff]'.";
Edit: Multibyte Solution
$string1 = 'foorbarbaz';
$string2 = 'foobarbiz';
$array1 = preg_split('((.))u', $string1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$array2 = preg_split('((.))u', $string2, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$result = array_diff_assoc($array1, $array2);
$num_diff = count($result);
$first_diff = key($result);
echo "There are " . $num_diff . " differences between the two strings.\n";
echo "The first difference between the strings is at position " . $first_diff . ". (Zero Index) '$string1[$first_diff]' vs '$string2[$first_diff]'.\n";

I wanted to add this as as comment to the best answer, but I do not have enough points.
$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$pos = strspn($string1 ^ $string2, "\0");
if ($pos < min(strlen($string1), strlen($string2)){
printf(
'First difference at position %d: "%s" vs "%s"',
$pos, $string1[$pos], $string2[$pos]
);
} else if ($pos < strlen($string1)) {
print 'String1 continues with' . substr($string1, $pos);
} else if ($pos < strlen($string2)) {
print 'String2 continues with' . substr($string2, $pos);
} else {
print 'String1 and String2 are equal';
}

string strpbrk ( string $haystack , string $char_list )
strpbrk() searches the haystack string for a char_list.
The return value is the substring of $haystack which begins at the first matched character.
As an API function it should be zippy. Then loop through once, looking for offset zero of the returned string to obtain your offset.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP Regex for similarity check - php

Related

Count how many numbers appears in a string

PHP - Turn this string: "adc 25...123.50 xyz" into 2 variables: "25" and "123.50"?

str replace - replace the x-value

Get the current + the next word in a string

Find first character that is different between two strings

Categories

Resources