Searching an array of different strings inside a single string in PHP - php

I have an array of strings that I want to try and match to the end of a normal string. I'm not sure the best way to do this in PHP.
This is sorta what I am trying to do:
Example:
Input: abcde
Search array: er, wr, de
Match: de
My first thought was to write a loop that goes through the array and builds a regular expression by appending "\b" to each string, then checks whether it is found in the input string. While this would work, it seems inefficient to loop through the entire array. I've been told regular expressions are slow in PHP, and I don't want to implement something that will take me down the wrong path.
Is there a better way to see if one of the strings in my array occurs at the end of the input string?
The preg_filter() function looks like it might do the job but is for PHP 5.3+ and I am still sticking with 5.2.11 stable.

For something this simple, you don't need a regex. You can loop over the array and use strrpos (which finds the last occurrence, unlike strpos) to see if the index is strlen(input) - strlen(test). If every entry in the search array has the same length, you can also speed things up by chopping the end off the input once, then comparing that to each item in the array.
You can't avoid going through the whole array, as in the worst general case, the item that matches will be at the end of the array. However, unless the array is huge, I wouldn't worry too much about performance - it will be much faster than you think.
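As a rough sketch of the loop described above (using substr_compare() with a negative offset rather than an index check, which is a swapped-in but equivalent test), something like the following should run on PHP 5.2. The function name findSuffix is made up for illustration.

```php
<?php
// Hypothetical helper: returns the first needle that $input ends with, or false.
function findSuffix($input, array $needles)
{
    $inputLength = strlen($input);
    foreach ($needles as $needle) {
        $needleLength = strlen($needle);
        // Skip empty needles and needles longer than the input.
        if ($needleLength === 0 || $needleLength > $inputLength) {
            continue;
        }
        // A negative offset makes substr_compare() start counting from the end.
        if (substr_compare($input, $needle, -$needleLength) === 0) {
            return $needle;
        }
    }
    return false;
}
```

With the question's data, findSuffix('abcde', array('er', 'wr', 'de')) returns 'de'.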

Though compiling the regular expression takes some time I wouldn't dismiss using pcre so easily. Unless you find a compare function that takes several needles you need a loop for the needles and executing the loop + calling the compare function for each single needle takes time, too.
Let's take a test script that fetches all the function names from php.net and looks for certain endings. This was only an adhoc script but I suppose no matter which strcmp-ish function + loop you use it will be slower than the simple pcre pattern (in this case).
count($hs)=5549
pcre: 4.377925157547 s
substr_compare: 7.951938867569 s
identical results: bool(true)
This was the result when searching for nine different patterns. With only two ('yadda', 'ge'), both methods took the same time.
Feel free to criticize the test script (aren't there always errors in synthetic tests that are obvious for everyone but oneself? ;-) )
<?php
/* get the test data
All the function names from php.net
*/
$doc = new DOMDocument;
$doc->loadhtmlfile('http://docs.php.net/quickref.php');
$xpath = new DOMXPath($doc);
$hs = array();
foreach( $xpath->query('//a') as $a ) {
$hs[] = $a->textContent;
}
echo 'count($hs)=', count($hs), "\n";
// should find:
// ge, e.g. imagick_adaptiveblurimage
// ing, e.g. m_setblocking
// name, e.g. basename
// ions, e.g. assert_options
$ns = array('yadda', 'ge', 'foo', 'ing', 'bar', 'name', 'abcd', 'ions', 'baz');
sleep(1);
/* test 1: pcre */
$start = microtime(true);
for($run=0; $run<100; $run++) {
$matchesA = array();
$pattern = '/(?:' . join('|', $ns) . ')$/';
foreach($hs as $haystack) {
if ( preg_match($pattern, $haystack, $m) ) {
#$matchesA[$m[0]]+= 1;
}
}
}
echo "pcre: ", microtime(true)-$start, " s\n";
flush();
sleep(1);
/* test 2: loop + substr_compare */
$start = microtime(true);
for($run=0; $run<100; $run++) {
$matchesB = array();
foreach( $hs as $haystack ) {
$hlen = strlen($haystack);
foreach( $ns as $needle ) {
$nlen = strlen($needle);
if ( $hlen >= $nlen && 0===substr_compare($haystack, $needle, -$nlen) ) {
#$matchesB[$needle]+= 1;
}
}
}
}
echo "substr_compare: ", microtime(true)-$start, " s\n";
echo 'identical results: '; var_dump($matchesA===$matchesB);

I might approach this backwards;
if your string-ending list is fixed or varies rarely,
I would start by preprocessing it to make it easy to match against,
then grab the end of your string and see if it matches!
Sample code:
<?php
// Test whether string ends in predetermined list of suffixes
// Input: string to test
// Output: if matching suffix found, returns suffix as string, else boolean false
function findMatch($str) {
$matchTo = array(
2 => array( 'ge' => true, 'de' => true ),
3 => array( 'foo' => true, 'bar' => true, 'baz' => true ),
4 => array( 'abcd' => true, 'efgh' => true )
);
foreach($matchTo as $length => $list) {
$end = substr($str, -$length);
if (isset($list[$end]))
return $end;
}
return false;
}
?>

This might be overkill, but you can try the following.
Create a hash for each entry of your search array and store them as keys in an array (that will be your lookup array).
Then walk back from the end of your input string one character at a time (e, de, cde, etc.) and compute a hash of the substring at each iteration. If a hash is in your lookup array, you have a match.
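Since PHP arrays are already hash tables, one way to sketch this idea is to use the needles themselves as array keys instead of computing a hash by hand. That is an assumption about what the answer intends, and findSuffixByLookup is a made-up name.

```php
<?php
// Hypothetical helper: looks up successively longer endings of $input
// in a key-based table built from the needles.
function findSuffixByLookup($input, array $needles)
{
    // array_flip() turns the needles into keys, so isset() is a hash lookup.
    $lookup = array_flip($needles);

    // No ending longer than the longest needle (or the input) can match.
    $maxLength = 0;
    foreach ($needles as $needle) {
        $maxLength = max($maxLength, strlen($needle));
    }
    $maxLength = min($maxLength, strlen($input));

    // Try "e", then "de", then "cde", and so on.
    for ($length = 1; $length <= $maxLength; $length++) {
        $ending = substr($input, -$length);
        if (isset($lookup[$ending])) {
            return $ending;
        }
    }
    return false;
}
```

The number of lookups depends on the longest needle rather than on the number of needles, which is where this approach could pay off for large search arrays.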

Related

array_filter with element modification [duplicate]

I'm trying to make a clean array from a string that my users will define.
The string can contain non-valid IDs, spaces, etc. I'm checking the elements using a value object in a callback function for array_filter.
$definedIds = "123,1234,1243, 12434 , asdf"; //from users panel
$validIds = array_map(
'trim',
array_filter(
explode(",", $definedIds),
function ($i) {
try {
new Id(trim($i));
return true;
} catch (\Exception $e) {
return false;
}
}
)
);
This works fine, but I'm applying trim twice. Is there a better way to do this or a different PHP function in which I can modify the element before keeping it in the returned array?
NOTE: I also could call array_map in the first parameter of array_filter, but I would be looping through the array twice anyway.
It depends on whether you care about performance. If you do, don't use map+filter, but use a plain for loop and manipulate your array in place:
$arr = explode(',', $input);
for($i=count($arr)-1; $i>=0; $i--) {
// make this return trimmed string, or false,
// and have it trim the input instead of doing
// that upfront before passing it into the function.
$v = $arr[$i] = Id::makeValid($arr[$i]);
// weed out invalid ids
if ($v === false) {
array_splice($arr, $i, 1);
}
}
// at this point, $arr only contains valid, cleaned ids
Of course, if this is inconsequential code, then trimming twice is really not going to make a performance difference, but you can still clean things up:
$arr = explode(',', $input);
$arr = array_filter(
array_map(array('Id', 'isValidId'), $arr),
function ($i) {
return $i !== false;
}
);
In this example we first map using that function, so we get an array of ids and false values, and then we filter that so that everything that's false gets thrown away, rather than first filtering, and then mapping.
(In both cases the code that's responsible for checking validity is in the Id class, and it either returns a cleaned id, or false)
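Another way to read the "plain loop" suggestion is a single foreach pass that trims each element once and keeps it only if the value object accepts it. The Id class below is a hypothetical stand-in (it accepts digit-only strings), since the question does not show the real one.

```php
<?php
// Hypothetical stand-in for the question's value object:
// it accepts only non-empty strings of digits.
class Id
{
    private $value;

    public function __construct($value)
    {
        if (!ctype_digit($value)) {
            throw new InvalidArgumentException("Invalid id: $value");
        }
        $this->value = $value;
    }
}

$definedIds = "123,1234,1243, 12434 , asdf"; // from the users panel
$validIds = array();
foreach (explode(',', $definedIds) as $raw) {
    $candidate = trim($raw); // trimmed exactly once
    try {
        new Id($candidate);  // throws for invalid input
        $validIds[] = $candidate;
    } catch (Exception $e) {
        // invalid id: skip it
    }
}
```

This loops over the exploded array only once and leaves $validIds re-indexed from 0, unlike array_filter(), which preserves the original keys.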
You can do this in several ways, but this is how I would do it. Here I use trim only once:
<?php
$definedIds = "123,1234,1243, 12434 , asdf"; //from users panel
function my_filter($b){
if(is_numeric($b)){
return true;
}
}
print '<pre>';
$trimmed = array_map('trim',explode(',',$definedIds));
print_r(array_filter($trimmed, 'my_filter'));
print '</pre>';
?>
Program Output:
Array
(
[0] => 123
[1] => 1234
[2] => 1243
[3] => 12434
)
DEMO: https://eval.in/997812

php challenge: parse pseudo-regex

I have a challenge that I have not been able to figure out, but it seems like it could be fun and relatively easy for someone who thinks in algorithms...
If my search term has a "?" character in it, it means that it should not care if the preceding character is there (as in regex). But I want my program to print out all the possible results.
A few examples: "tab?le" should print out "table" and "tale". The number of results is always 2 to the power of the number of question marks. As another example: "carn?ati?on" should print out:
caraton
caration
carnaton
carnation
I'm looking for a function that will take in the word with the question marks and output an array with all the results...
Following your example of "carn?ati?on":
You can split the word/string into an array on "?" then the last character of each string in the array will be the optional character:
[0] => carn
[1] => ati
[2] => on
You can then create the two separate possibilities (ie. with and without that last character) for each element in the first array and map these permutations to another array. Note the last element should be ignored for the above transformation since it doesn't apply. I would make it of the form:
[0] => [carn, car]
[1] => [ati, at]
[2] => [on]
Then I would iterate over each element in the sub arrays to compute all the different combinations.
If you get stuck in applying this process just post a comment.
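The split-and-combine steps above could be sketched as follows. The function name expandOptional is made up, and the sketch assumes every "?" directly follows a character.

```php
<?php
// Hypothetical helper implementing the split-and-combine idea.
function expandOptional($term)
{
    $parts = explode('?', $term);
    // The final piece is not followed by "?", so it has no optional character.
    $last = array_pop($parts);

    $results = array('');
    foreach ($parts as $part) {
        if ($part === '') {
            throw new InvalidArgumentException('"?" must follow a character');
        }
        // Variant of this piece without its optional last character.
        $without = substr($part, 0, -1);
        $next = array();
        foreach ($results as $prefix) {
            $next[] = $prefix . $without;
            $next[] = $prefix . $part;
        }
        $results = $next;
    }

    // Append the fixed final piece to every combination.
    foreach ($results as $i => $prefix) {
        $results[$i] = $prefix . $last;
    }
    return $results;
}
```

For "carn?ati?on" this returns caraton, caration, carnaton and carnation, in that order, and the result size doubles with each question mark.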
I think a loop like this should work:
$to_process = array("carn?ati?on");
$results = array();
while (($item = array_shift($to_process)) !== null) { // strict check so a term like "0" is not skipped
$pos = strpos($item,"?");
if( $pos === false) {
$results[] = $item;
}
elseif( $pos === 0) {
throw new Exception("A term (".$item.") cannot begin with ?");
}
else {
$to_process[] = substr($item,0,$pos).substr($item,$pos+1);
$to_process[] = substr($item,0,$pos-1).substr($item,$pos+1);
}
}
var_dump($results);

Finding top similar strings in PHP?

I have an array of 17,000 strings. Many of the strings have similar matches, for example:
User Report XYZ123
Bob Smith
User Report YEI723
User Report
User Report
Number of Hits 27
Frank's Weekly Transaction Report
Transaction Report 123
What is the best way to find the top "similar strings"? For instance, using the example above, I would want to see "User Report" and "Transaction Report" as two of the top "similar strings".
Without giving you all the source code to do this, you could go through the array and remove components you consider useless, like any letters with numbers, and so on.
Then you can use array_count_values() and sort that array to see the top ones involved.
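A minimal sketch of that idea, assuming "useless components" means any whitespace-delimited token that contains a digit. Both that rule and the function name topSimilar are assumptions, not something the answer specifies.

```php
<?php
// Hypothetical helper: strips tokens containing digits, then counts what is left.
function topSimilar(array $strings, $limit)
{
    $normalized = array();
    foreach ($strings as $string) {
        // Drop any whitespace-delimited token that contains a digit, e.g. "XYZ123".
        $clean = preg_replace('/\S*\d\S*/', '', $string);
        // Collapse the leftover whitespace.
        $clean = trim(preg_replace('/\s+/', ' ', $clean));
        if ($clean !== '') {
            $normalized[] = $clean;
        }
    }
    // Count identical normalized strings and sort by count, highest first.
    $counts = array_count_values($normalized);
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}
```

On the question's sample this groups the four "User Report" variants together, but "Frank's Weekly Transaction Report" and "Transaction Report 123" would still be counted separately, so a real solution would likely need a smarter normalization step.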
I guess you could do a foreach through each of the strings and eliminate the ones that you don't want for that particular search. Then go through the ones you have left (possibly with another foreach) and keep shrinking the number of strings that you have an interest in until there are just a few. Then sort those by something like alphabetical order.
You could compute the Levenshtein distance for each string compared with the others and then sort them by that value.
$strings = array('str1', 'str2', 'car', 'dog', 'apple', 'house', 'str3');
$len = count($strings);
$distances = array_fill(0, $len, 0);
for($i=0; $i<$len-1; ++$i)
for($j=$i+1; $j<$len; ++$j)
{
$dist = levenshtein($strings[$i], $strings[$j]);
$distances[$i] += $dist;
$distances[$j] += $dist;
}
// Here $distances indicates how "similar" each string is to the others
// The lower values are more "similar"
If you are able to get all the strings as an array, you can loop over them in a foreach() like this:
$string_array = array('string', 'string1', 'string2', 'does-not-match');
$needle = 'string*'; // fnmatch() matches the whole string, so a wildcard is needed
$results = array();
foreach($string_array as $key => $val):
    if (fnmatch($needle, $val)):
        $results[] = $val;
    endif;
endforeach;
In the end, $results should hold the entries that match $needle. As an alternative to fnmatch(), you could use preg_match() with the pattern /string/i:
$string_array = array('string', 'string1', 'string2', 'does-not-match');
$needle = '/string/i';
$results = array();
foreach($string_array as $key => $val):
    if (preg_match($needle, $val)):
        $results[] = $val;
    endif;
endforeach;
Note that the return value of preg_match() is tested directly rather than wrapped in empty(). The PHP manual explains why:
Prior to PHP 5.5, empty() only supports variables; anything else will result in a parse error. In other words, the following will not work: empty(trim($name)). Instead, use trim($name) == false.
So !empty(preg_match(...)) only parses on PHP 5.5 and later.

Get count of substring occurrence in an array

I would like to get a count of how many times a substring occurs in an array. This is for a Drupal site so I need to use PHP code
$ar_holding = array('usa-ny-nyc','usa-fl-ftl', 'usa-nj-hb',
'usa-ny-wch', 'usa-ny-li');
I need to be able to call a function like foo($ar_holding, 'usa-ny-'); and have it return 3 from the $ar_holding array. I know about the in_array() function but that returns the index of the first occurrence of a string. I need the function to search for substrings and return a count.
You could use preg_grep():
$count = count( preg_grep( "/^usa-ny-/", $ar_holding ) );
This will count the number of values that begin with "usa-ny-". If you want to include values that contain the string at any position, remove the caret (^).
If you want a function that can be used to search for arbitrary strings, you should also use preg_quote():
function foo ( $array, $string ) {
$string = preg_quote( $string, "/" );
return count( preg_grep( "/^$string/", $array ) );
}
If you need to search from the beginning of the string, the following works:
$ar_holding = array('usa-ny-nyc','usa-fl-ftl', 'usa-nj-hb',
'usa-ny-wch', 'usa-ny-li');
$str = '|'.implode('|', $ar_holding);
echo substr_count($str, '|usa-ny-');
It makes use of the implode function to concat all array values with the | character in between (and before the first element), so you can search for this prefix with your search term. substr_count does the dirty work then.
The | acts as a control character, so it can not be part of the values in the array (which is not the case), just saying in case your data changes.
$count = substr_count(implode("\x00",$ar_holding),"usa-ny-");
The \x00 is to be almost-certain that you won't end up causing overlaps that match by joining the array together (the only time it can happen is if you're searching for null bytes)
I don't see any reason to overcomplicate this task.
Iterate the array and add 1 to the count every time a value starts with the search string.
Code: (Demo: https://3v4l.org/5Lq3Y )
function foo($ar_holding, $starts_with) {
$count = 0;
foreach ($ar_holding as $v) {
if (strpos($v, $starts_with)===0) {
++$count;
}
}
return $count;
}
$ar_holding = array('usa-ny-nyc','usa-fl-ftl', 'usa-nj-hb',
'usa-ny-wch', 'usa-ny-li');
echo foo($ar_holding, "usa-ny-"); // 3
Or if you don't wish to declare any temporary variables:
function foo($ar_holding, $starts_with) {
return sizeof(
array_filter($ar_holding, function($v)use($starts_with){
return strpos($v, $starts_with)===0;
})
);
}

How to match rows in array to an array of masks?

I have array like this:
array('1224*', '543*', '321*' ...) which contains about 17,000 "masks" or prefixes.
I have a second array:
array('123456789', '123456788', '987654321' ....) which contain about 250,000 numbers.
Now, how can I efficiently match every number from the second array using the array of masks/prefixes?
[EDIT]
The first array contains only prefixes and every entry has only one * at the end.
Well, here's a solution:
Preliminary steps:
Sort array 1, cutting off the *'s.
Searching:
For each number in array 2 do:
1. Find the first and last entry in array 1 whose first character matches that of the number (binary search).
2. Do the same for the second character, this time searching not the whole array but only between first and last (binary search).
3. Repeat step 2 for the nth character until a string is found.
This should be O(k*n*log(m)), where n is the average number length (in digits), k the number of numbers, and m the number of masks.
Basically this is a 1 dimensional Radix tree, for optimal performance you should implement it, but it can be quite hard.
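As a rough illustration of the tree idea (a plain character trie built from nested arrays, not a compressed radix tree and not the binary-search procedure above), the lookup could be sketched like this. The function names and the 'end' marker key are made up for the example.

```php
<?php
// Hypothetical helper: builds a nested-array trie from masks such as "1224*".
function buildPrefixTrie(array $masks)
{
    $trie = array();
    foreach ($masks as $mask) {
        $prefix = rtrim($mask, '*');
        $node = &$trie;
        for ($i = 0, $n = strlen($prefix); $i < $n; $i++) {
            $char = $prefix[$i];
            if (!isset($node[$char])) {
                $node[$char] = array();
            }
            $node = &$node[$char];
        }
        $node['end'] = true; // marks the end of a complete prefix
        unset($node);        // break the reference before the next mask
    }
    return $trie;
}

// Hypothetical helper: returns the shortest mask prefix that $number
// starts with, or false if no mask matches.
function matchPrefix(array $trie, $number)
{
    $node = $trie;
    for ($i = 0, $n = strlen($number); $i < $n; $i++) {
        if (isset($node['end'])) {
            return substr($number, 0, $i);
        }
        $char = $number[$i];
        if (!isset($node[$char])) {
            return false;
        }
        $node = $node[$char];
    }
    return isset($node['end']) ? $number : false;
}
```

Each lookup walks at most one node per digit of the number, so its cost does not grow with the number of masks. The trade-off is the memory used by the nested arrays.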
My two cents....
$s = array('1234*', '543*', '321*');
$f = array('123456789', '123456788', '987654321');
foreach ($f as $haystack) {
echo $haystack."<br>";
foreach ($s as $needle) {
$needle = str_replace("*","",$needle);
echo $haystack."- ".$needle.": ".startsWith($haystack, $needle)."<br>";
}
}
function startsWith($haystack, $needle) {
$length = strlen($needle);
return (substr($haystack, 0, $length) === $needle);
}
To improve performance it might be a good idea to sort both arrays first and to add an exit clause in the inner foreach loop.
By the way, the startsWith function is from this great solution on SO: startsWith() and endsWith() functions in PHP
Another option would be to use preg_grep in a loop:
$masks = array('1224*', '543*', '321*' ...);
$data = array('123456789', '123456788', '987654321' ....);
$matches = array();
foreach($masks as $mask) {
$mask = rtrim($mask, '*'); // strip off trailing *
$matches[$mask] = preg_grep("/^$mask/", $data);
}
No idea how efficient this would be, just offering it up as an alternative.
Although regex is not famous for being fast, I'd like to know how well preg_grep() can perform if the pattern is boiled down to its leanest form and only called once (not in a loop).
By removing longer masks that are already covered by shorter masks, the pattern will be greatly reduced. How much will the reduction be? Of course, I cannot say for sure, but with 17,000 masks, there is sure to be a fair amount of redundancy.
Code: (Demo)
$masks = ['1224*', '543*', '321*', '12245*', '5*', '122488*'];
sort($masks);
$needle = rtrim(array_shift($masks), '*');
$keep[] = $needle;
foreach ($masks as $mask) {
if (strpos($mask, $needle) !== 0) {
$needle = rtrim($mask, '*');
$keep[] = $needle;
}
}
// now $keep only contains: ['1224', '321', '5']
$numbers = ['122456789', '123456788', '321876543234567', '55555555555555555', '987654321'];
var_export(
preg_grep('~^(?:' . implode('|', $keep) . ')~', $numbers)
);
Output:
array (
0 => '122456789',
2 => '321876543234567',
3 => '55555555555555555',
)
Check out the PHP function array_intersect_key.
