Finding top similar strings in PHP?


I have an array of 17,000 strings. Many of the strings have similar matches, for example:
User Report XYZ123
Bob Smith
User Report YEI723
User Report
User Report
Number of Hits 27
Frank's Weekly Transaction Report
Transaction Report 123
What is the best way to find the top "similar strings"? For instance, using the example above, I would want to see "User Report" and "Transaction Report" as two of the top "similar strings".

Without giving you all the source code to do this, you could go through the array and remove components you consider useless, like any letters with numbers, and so on.
Then you can use array_count_values() and sort that array to see the top ones involved.
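A minimal sketch of that idea, assuming the "useless components" are whitespace-separated tokens that contain a digit. normalize_label() is a hypothetical helper name, and the sample data is the list from the question:

```php
<?php
// Hypothetical helper: drop every whitespace-separated token that contains a digit.
function normalize_label($s) {
    $tokens = preg_split('/\s+/', trim($s), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array();
    foreach ($tokens as $t) {
        if (!preg_match('/\d/', $t)) {
            $kept[] = $t;
        }
    }
    return implode(' ', $kept);
}

$strings = array(
    'User Report XYZ123', 'Bob Smith', 'User Report YEI723',
    'User Report', 'User Report', 'Number of Hits 27',
    "Frank's Weekly Transaction Report", 'Transaction Report 123',
);

// Count how often each normalized string occurs, most frequent first.
$counts = array_count_values(array_map('normalize_label', $strings));
arsort($counts);
print_r(array_slice($counts, 0, 3, true));
```

With this data "User Report" comes out on top with 4 occurrences. "Frank's Weekly Transaction Report" and "Transaction Report" still count as different strings, so grouping those would need a further rule of your own.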

I guess you could do a foreach through each of the strings and eliminate the ones that you don't want for that particular search. Then go through the ones you have left (possibly with another foreach) and keep shrinking the set of strings you are interested in until there are just a few. Then sort those by something like alphabetical order.

You could compute the Levenshtein distance of each string to all the others and then sort them by that value.
$strings = array('str1', 'str2', 'car', 'dog', 'apple', 'house', 'str3');
$len = count($strings);
$distances = array_fill(0, $len, 0);
for($i=0; $i<$len-1; ++$i)
for($j=$i+1; $j<$len; ++$j)
{
$dist = levenshtein($strings[$i], $strings[$j]);
$distances[$i] += $dist;
$distances[$j] += $dist;
}
// $distances[$i] is now the total distance from $strings[$i] to every other string.
// Lower totals mean the string is more "similar" to the rest.
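To actually list the most "similar" strings, the totals can be sorted while keeping track of which string each one belongs to. This is only a sketch: rank_by_total_distance() is a hypothetical helper name, and because every pair is compared the cost grows quadratically, which may be slow for 17,000 strings.

```php
<?php
// Hypothetical helper: total Levenshtein distance of each string to all others,
// returned as [string, total] pairs with the lowest total first.
function rank_by_total_distance(array $strings) {
    $strings = array_values($strings);
    $len = count($strings);
    $totals = array_fill(0, $len, 0);
    for ($i = 0; $i < $len - 1; ++$i) {
        for ($j = $i + 1; $j < $len; ++$j) {
            $d = levenshtein($strings[$i], $strings[$j]);
            $totals[$i] += $d;
            $totals[$j] += $d;
        }
    }
    asort($totals); // sort by total, preserving the index of each string
    $ranked = array();
    foreach ($totals as $idx => $total) {
        $ranked[] = array($strings[$idx], $total);
    }
    return $ranked;
}

$ranked = rank_by_total_distance(array('str1', 'str2', 'car', 'dog', 'apple', 'house', 'str3'));
foreach ($ranked as $row) {
    echo $row[0], ' => ', $row[1], PHP_EOL;
}
```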

If you can get all the strings into an array, you can loop over them with foreach() like this:
$string_array = array('string', 'string1', 'string2', 'does-not-match');
$needle = 'string*'; // fnmatch() takes a shell wildcard pattern; * matches any trailing characters
$results = array();
foreach ($string_array as $key => $val):
    if (fnmatch($needle, $val)):
        $results[] = $val;
    endif;
endforeach;
In the end $results holds the entries that match $needle. As an alternative to fnmatch() you could use preg_match() with the pattern /string/i:
$string_array = array('string', 'string1', 'string2', 'does-not-match');
$needle = '/string/i';
$results = array();
foreach($string_array as $key => $val):
    if (preg_match($needle, $val) === 1):
        $results[] = $val;
    endif;
endforeach;
Note that wrapping the preg_match() call in empty() can cause problems. From the PHP manual:
Prior to PHP 5.5, empty() only supports variables; anything else will result in a parse error. In other words, the following will not work: empty(trim($name)). Instead, use trim($name) == false.
Comparing the return value of preg_match() directly, as in the loop above, avoids that parse error on older PHP versions.
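For the regular-expression variant, the loop can also be replaced by the built-in preg_grep(), which filters an array by a pattern. A short sketch using the same sample array:

```php
<?php
$string_array = array('string', 'string1', 'string2', 'does-not-match');

// preg_grep() returns the elements that match the pattern, keeping their
// original keys; array_values() reindexes the result from 0.
$results = array_values(preg_grep('/string/i', $string_array));

print_r($results);
```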

Related

How to get the number of characters in a one-dimensional array

I'm posting this question & answer because I've searched SO and haven't found a satisfactory answer for this problem and I hope this question & answer will help others in the future. Feel free to edit or add more different solutions to the ones I've included in my answer.
This question is for one-dimensional arrays only.
So let's say that I have this php array of strings:
$stringsArr = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet'];
And I want to know how many total characters are in that array.
I could loop through the array and run a strlen on each element like this:
$stringsTotalLength = 0;
foreach ($stringsArr as $string) {
$stringsTotalLength += strlen($string);
}
echo $stringsTotalLength; // Returns 22 correctly.
But I was wondering if there was any built-in php function or simple one-liner that could do this more elegantly.

So there are a bunch of different ways to accomplish this, some more elegant, some less so. (Also, benchmarks on these solutions are welcome).
In 1st place, the winner is a combination of strlen and implode:
$stringsTotalLength = strlen(implode($stringsArr));
This works by concatenating all of the elements of the array and getting the length of that string, e.g. ['Lorem', 'ipsum'] -> 'Loremipsum' -> 10.
And in a close 2nd, there's a combination of array_sum, array_map, and strlen:
$stringsTotalLength = array_sum(array_map('strlen', $stringsArr));
This replaces the elements of the array with their lengths, and then gets the sum of the whole array, e.g. ['Lorem', 'ipsum'] -> [5, 5] -> 10.
In 3rd place is the plain old foreach loop, though, while it is very simple, it is also kind of verbose:
$stringsTotalLength = 0;
foreach ($stringsArr as $string) {
$stringsTotalLength += strlen($string);
}
Finally, in last place is a solution even worse than a foreach loop (IMHO), array_map:
$stringsTotalLength = 0;
array_map(function ($string) {
global $stringsTotalLength;
$stringsTotalLength += strlen($string);
}, $stringsArr);

Use implode() to join the array into one string, then count its length with strlen():
<?php
$stringsArr = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet'];
echo strlen(implode($stringsArr)); //22
?>

Match at least 5 digits in the same positions in a given mobile number

Given that I have the following phone number:
9904773779
I would like to search my database for other phone numbers which have at least 5 digits in common with the above number, like these:
9933723989
9403793378
My first thought was to use a query like this:
select mobileno from tbl_registration where mobileno like '%MyTextBox Value%'
However, this did not work.

I think a tidier PHP solution here is using similar_text().
It returns the number of matching characters between two strings. Note that it does not require the matching characters to be at the same positions.
Sample demo:
$numbers = array("1234567890", "9933723989", "9403793378");
$key = "9904773779";
foreach ($numbers as $k) {
if (similar_text($key, $k) >= 5) { // There must be 5+ similarities
echo $k . PHP_EOL;
}
}
Output: [ 9933723989, 9403793378]

I think I would go with PHP, although not very tidy and there is probably a better way.
<?php
$strOne = '9904773779';
$strTwo = '9933723989';
$arrOne = str_split($strOne);
$arrTwo = str_split($strTwo);
$arrIntersection = array_intersect($arrOne,$arrTwo);
$count=0;
foreach ($arrIntersection as $key => $value) {
if ($arrOne[$key] === $arrTwo[$key]) {
$count++;
}
}
print_r($count);
?>
In the first stage I split the strings into arrays. I then use array_intersect() to keep only the digits of the first number that also occur somewhere in the second, with their original positions as keys. This saves having to loop through every digit. I then loop through that reduced array and compare both numbers at each remaining position to see whether the digits are identical.
I do however look forward to a cooler answer.
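A more direct sketch of the "same digit at the same position" check, applied to the numbers from the question. count_same_position() is a hypothetical helper name, and the threshold of 5 is taken from the question:

```php
<?php
// Hypothetical helper: count the positions at which both strings hold the same character.
function count_same_position($a, $b) {
    $n = min(strlen($a), strlen($b));
    $count = 0;
    for ($i = 0; $i < $n; $i++) {
        if ($a[$i] === $b[$i]) {
            $count++;
        }
    }
    return $count;
}

$key = '9904773779';
$numbers = array('1234567890', '9933723989', '9403793378');

// Keep only the numbers that share at least 5 positions with $key.
$matches = array();
foreach ($numbers as $number) {
    if (count_same_position($key, $number) >= 5) {
        $matches[] = $number;
    }
}
print_r($matches);
```

Because the comparison happens in PHP, the candidate numbers still have to be fetched from the database first.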

How to count and output variables named in a specific numeric pattern

I have created a few different strings with similar names, and I would like to display them all.
However, I will need to do this dynamically, because I will be adding more of them later on.
These strings are called:
$group1
$group2
$group3
$group4
My idea is to somehow count them all, and then display them with a for loop. I just need help with the counting part.

Though arrays are definitely the superior and appropriate solution in this case, since you insist in the comments that separate variables are required, you can solve this using a while loop to check for the existence of such consecutively named variables, creating the variable names dynamically with {}.
For example:
$group1 = '123';
$group2 = '456';
$group3 = '789';
$i = 1;
while (isset(${'group' . $i})) { // stop at the first name that is not defined
    echo ${'group' . $i};
    $i++;
}
Note how ${'group' . $i} dynamically creates each variable name, and how isset() ends the loop without a notice once the next variable does not exist. Naturally, this approach fails if the variables are not named consecutively (e.g. if you have $group1 followed by $group3). As said, you should definitely use an array for this.

Associative arrays are designed exactly for this:
$arr = array(
"group1" => "string1",
"group2" => "string2",
"group3" => "string3",
"group4" => "string4",
);
Now to get the length of your array:
$num = count($arr);
To access the first element
$firstElement = $arr["group1"];
Alternatively you can use an indexed array (access elements by their position):
$arr = array("string1", "string2", "string3", "string4");
$firstElement = $arr[0];
$num = count($arr);

You can use this to count and loop:
$arr = array();
// array_push() appends to $arr; its return value (the new element count) is not needed here
array_push($arr, $group1);
array_push($arr, $group2);
array_push($arr, $group3);
array_push($arr, $group4);
//count
echo count($arr);
//loop
foreach($arr as $key=>$value){
echo $value;
}

How to change order of substrings inside a larger string?

This is fairly confusing, but I'll try to explain as best I can...
I've got a MYSQL table full of strings like this:
{3}12{2}3{5}52
{3}7{2}44
{3}15{2}2{4}132{5}52{6}22
{3}15{2}3{4}168{5}52
Each string is a combination of product options and option values. The numbers inside the { } are the option, for example {3} = Color. The number immediately following each { } number is that option's value, for example 12 = Blue. I've already got the PHP code that knows how to parse these strings and deliver the information correctly, with one exception: For reasons that are probably too convoluted to get into here, the order of the options needs to be 3,4,2,5,6. (To try to modify the rest of the system to accept the current order would be too monumental a task.) It's fine if a particular combination doesn't have all five options, for instance "{3}7{2}44" delivers the expected result. The problem is just with combinations that include option 2 AND option 4: their order needs to be switched so that, in any combination that includes both options 2 and 4, the {4} and its corresponding value come before the {2} and its corresponding value.
I've tried bringing the column into Excel and using Text to Columns, splitting them up by the "{" and "}" characters and re-ordering the columns, but since not every string yields the same number of columns, the order gets messed up in other ways (like option 5 coming before option 2).
I've also experimented with using PHP to explode each string into an array (which I thought I could then re-sort) using "}" as the delimiter, but I had no luck with that either because then the numbers blend together in other ways that make them unusable.
TL;DR: I have a bunch of strings like the ones quoted above. In every string that contains both a "{2}" and a "{4}", the placement of both of those values needs to be switched, so that the {4} and the number that follows it comes before the {2} and the number that follows it. In other words:
{3}15{2}3{4}168{5}52
needs to become
{3}15{4}168{2}3{5}52
The closest I've been able to come to a solution, in pseudocode, would be something like:
for each string,
if "{4}" is present in this string AND "{2}" is present in this string,
take the "{4}" and every digit that follows it UNTIL you hit another "{" and store that substring as a variable, then remove it from the string.
then, insert that substring back into the string, at a position starting immediately before the "{2}".
I hope that makes some kind of sense...
Is there any way with PHP, Excel, Notepad++, regular expressions, etc., that I can do this? Any help would be insanely appreciated.
EDITED TO ADD: After several people posted solutions, which I tried, I realized that it would be crucial to mention that my host is running PHP 5.2.17, which doesn't seem to allow for usort with custom sorting. If I could upvote everyone's solution (all of which I tried in PHP Sandbox and all of which worked), I would, but my rep is too low.

How would something like this work for you? The first eight lines transform your string into an array in which each element is an array of the option number and its value. $order establishes the order in which the items should appear, and the final usort() call sorts the options by their position in $order.
$str = "{3}15{2}2{4}132{5}52{6}22";
$matches = array();
preg_match_all('/\{([0-9]+)\}([0-9]+)/', $str, $matches);
array_shift($matches);
$options = array();
for($x = 0; $x < count($matches[0]); $x++){
$options[] = array($matches[0][$x], $matches[1][$x]);
}
$order = [3,4,2,5,6];
usort($options, function($a, $b) use ($order) {
return array_search($a[0], $order) - array_search($b[0], $order);
});
To get your data back into the required format you would just rebuild the string:
$str = "";
foreach($options as $opt){
$str.="{".$opt[0]."}".$opt[1];
}
One of the bonuses here is that when you add a new option type, adjusting the order is just a matter of inserting the option number at the correct position in the $order array.

First of all, those options should probably be in a separate table. You're breaking all kinds of normalization rules stuffing those things into a string like that.
But if you really want to parse that out in php, split the string into a key=>value array with something like this:
$options = [];
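$pairs = explode('{', $option_string);
foreach ($pairs as $pair) {
    if ($pair === '') {
        continue; // the string starts with '{', so explode() yields an empty first element
    }
    list($key, $value) = explode('}', $pair);
    $options[$key] = $value;
}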
I think this will give you:
$options[3]=15;
$options[2]=3;
$options[4]=168;
$options[5]=52;
Another option would be to use some sort of existing serialization (either serialize() or json_encode() in php) instead of rolling your own:
$options_string = json_encode($options);
// store $options_string in db
then
// get $options_string from db
$options = json_decode($options_string, true); // true returns an array instead of an object

Here's a neat solution:
$order = array(3, 4, 2, 5, 6);
$string = '{3}15{2}3{4}168{5}52';
$split = preg_split('#\b(?={)#', $string);
usort($split, function($a, $b) use ($order) {
$a = array_search(preg_replace('#^{(\d+)}\d+$#', '$1', $a), $order);
$b = array_search(preg_replace('#^{(\d+)}\d+$#', '$1', $b), $order);
return $a - $b;
});
$split = implode('', $split);
var_dump($split);
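If {2} always sits directly in front of {4}, as it does in the sample strings, the pseudocode from the question can also be sketched with a single preg_replace() call. It needs neither closures nor usort(), so it should also run on PHP 5.2. The adjacency of {2} and {4} is an assumption, and swap_options_2_and_4() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: move "{4}<digits>" in front of a directly preceding "{2}<digits>".
function swap_options_2_and_4($str) {
    // $1 = "{2}" plus its digits, $2 = "{4}" plus its digits; emit them in reverse order.
    return preg_replace('/(\{2\}\d+)(\{4\}\d+)/', '$2$1', $str);
}

echo swap_options_2_and_4('{3}15{2}3{4}168{5}52'), PHP_EOL; // {3}15{4}168{2}3{5}52
echo swap_options_2_and_4('{3}7{2}44'), PHP_EOL;            // unchanged: {3}7{2}44
```

Strings in which another option sits between {2} and {4} are left untouched by this pattern, so they would still need one of the sorting approaches above.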

Remove composed words

I have a list of words in which some are composed words, for example:
palanca
plato
platopalanca
I need to remove "plato" and "palanca" and let only "platopalanca".
Used array_unique to remove duplicates, but those composed words are tricky...
Should I sort the list by word length and compare one by one?
A regular expression is the answer?
update: The list of words is much bigger and mixed, not only related words
update 2: I can safely implode the array into a string.
update 3: I'm trying to avoid doing this as if it were a bubble sort. There must be a more effective way of doing this.
Well, I think that a bubble-sort-like approach is the only possible one :-(
I don't like it, but it's what i have...
Any better approach?
function sortByLengthAsc($a, $b) {
    return strlen($a) - strlen($b);
}
usort($words, 'sortByLengthAsc'); // shortest words first
$count = count($words);
$delete = array();
for ($i = 0; $i < $count; $i++) {
    for ($j = $i + 1; $j < $count; $j++) {
        if (strpos($words[$j], $words[$i]) !== false) {
            $delete[] = $i; // $words[$i] is contained in a longer word
            break;
        }
    }
}
foreach ($delete as $i) {
    unset($words[$i]);
}
update 5: Sorry all, I'm a moron. Jonathan Swift made me realize I was asking the wrong question.
Given x words which START the same, I need to remove the shortest ones.
"hot, dog, stand, hotdogstand" should become "dog, stand, hotdogstand"
"car, pet, carpet" should become "pet, carpet"
"palanca, plato, platopalanca" should become "palanca, platopalanca"
"platoother, other" should be untouched; they start differently

I think you need to define the problem a little more, so that we can give a solid answer. Here are some pathological lists. Which items should get removed?:
hot, dog, hotdogstand.
hot, dog, stand, hotdogstand
hot, dogs, stand, hotdogstand
SOME CODE
This code should be more efficient than the one you have:
$words = array('hatstand','hat','stand','hot','dog','cat','hotdogstand','catbasket');
$count = count($words);
for ($i = 0; $i < $count; $i++) {
    if (isset($words[$i])) {
        $len_i = strlen($words[$i]);
        for ($j = $i + 1; $j < $count; $j++) {
            if (isset($words[$j])) {
                $len_j = strlen($words[$j]);
                if ($len_i <= $len_j) {
                    if (substr($words[$j], 0, $len_i) === $words[$i]) {
                        unset($words[$i]);
                        break; // $words[$i] is gone, so stop comparing against it
                    }
                } else {
                    if (substr($words[$i], 0, $len_j) === $words[$j]) {
                        unset($words[$j]);
                    }
                }
            }
        }
    }
}
foreach ($words as $word) {
echo "$word<br>";
}
You could optimise this by storing word lengths in an array before the loops.

You can take each word and check whether any other word in the array starts or ends with it. If so, that word should be removed with unset().

You could put the words into an array, sort the array alphabetically and then loop through it checking if the next words start with the current index, thus being composed words. If they do, you can remove the word in the current index and the latter parts of the next words...
Something like this:
$array = array('palanca', 'plato', 'platopalanca');
// ok, the example array is already sorted alphabetically, but anyway...
sort($array);
// another array for words to be removed
$removearray = array();
$count = count($array);
// loop through the array, the last index won't have to be checked
for ($i = 0; $i < $count - 1; $i++) {
    $current = $array[$i];
    // after sorting, all words that start with $current follow it directly;
    // if the words are case sensitive, use strpos() instead to compare
    for ($j = $i + 1; $j < $count && stripos($array[$j], $current) === 0; $j++) {
        // the next word starts with the current one, so remove current
        $removearray[] = $current;
        // get the other word to remove (the rest of the composed word)
        $removearray[] = substr($array[$j], strlen($current));
    }
}
// now just get rid of the words to be removed
$result = array_values(array_diff($array, $removearray));

Regex could work. You can define within the regex where the start and end of the string applies.
^ defines the start
$ defines the end
so something like
foreach($array as $value)
{
//$term is the value that you want to remove
if(preg_match('/^' . preg_quote($term, '/') . '$/', $value))
{
//Here you can be confident that $term is $value, and then either remove it from
//$array, or you can add all not-matched values to a new result array
}
}
would avoid your issue
But if you are just checking that two values are equal, == will work just as well as (and possibly faster than) preg_match
In the event that the list of $terms and $values are huge this won't come out to be the most efficient of strategies, but it is a simple solution.
If performance is an issue, sorting (note the provided sort function) the lists and then iterating down the lists side by side might be more useful. I'm going to actually test that idea before I post the code here.
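The sorting idea can be sketched as follows for the problem as restated in the question's last update (drop every word that another word starts with). After a plain string sort, any word that is a prefix of some other word is immediately followed by a word that starts with it, so one pass over neighbouring pairs is enough. remove_prefix_words() is a hypothetical helper name, and the comparison is assumed to be case sensitive:

```php
<?php
// Hypothetical helper: drop every word that is a proper prefix of another word.
function remove_prefix_words(array $words) {
    $words = array_values(array_unique($words));
    sort($words, SORT_STRING); // plain byte-wise string order
    $kept = array();
    $n = count($words);
    for ($i = 0; $i < $n; $i++) {
        // Only the direct successor has to be checked after sorting.
        $isPrefix = $i + 1 < $n
            && strncmp($words[$i + 1], $words[$i], strlen($words[$i])) === 0;
        if (!$isPrefix) {
            $kept[] = $words[$i];
        }
    }
    return $kept;
}

print_r(remove_prefix_words(array('hot', 'dog', 'stand', 'hotdogstand')));
```

The result comes back in sorted order rather than in the original order, and the cost is dominated by the sort instead of comparing every pair of words.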
