Check / match string against partial strings in array

I am trying to match a full UK postcode against a partial postcode.
Take a user's postcode, e.g. g22 1pf, and see if there is a match or partial match in the array / database.
//Sample data
$postcode_to_check= 'g401pf';
//$postcode_to_check= 'g651qr';
//$postcode_to_check= 'g51rq';
//$postcode_to_check= 'g659rs';
//$postcode_to_check= 'g40';
$postcodes = array('g657','g658','g659','g659pf','g40','g5');
$counter=0;
foreach($postcodes as $postcode){
$postcode_data[] = array('id' =>$counter++ , 'postcode' => $postcode, 'charge' => '20.00');
}
I do have some code, but it only compared strings of a fixed length from the database. I need the strings in the array / database to be dynamic in length.
The database may contain "g22", which would be a match. It could also contain more or less of the postcode, e.g. "g221" or "g221p", which would also match. It could contain "g221q" or "g221qr"; these would not match.
Help appreciated, thank you.
Edit:
I was possibly overthinking this. The following pseudo code seems to function as expected.
check_delivery('g401pf');
//this would match because g40 is in the database.
check_delivery('g651dt');
// g651dt this would NOT match because g651dt is not in the database.
check_delivery('g524pq');
//g524pq this would match because g5 is in the database.
check_delivery('g659pf');
//g659pf this would match because g659 is in the database.
check_delivery('g655pf');
//g655pf this would not match, g655 is not in the database
//expected output, 3 matches
function check_delivery($postcode_to_check){
$postcodes = array('g657','g658','g659','g659pf','g40','g5');
$counter=0;
foreach($postcodes as $postcode){
$stripped_postcode = substr($postcode_to_check,0, strlen($postcode));
if($postcode==$stripped_postcode){
echo "Matched<br><br>";
break;
}
}
}

<?php
$postcode_to_check= 'g401pf';
$arr = preg_split("/\d+/",$postcode_to_check,-1, PREG_SPLIT_NO_EMPTY);
preg_match_all('/\d+/', $postcode_to_check, $postcode);
$out = implode("",array_map(function($postcode) {return implode("",$postcode);},$postcode));
$first_char = mb_substr($arr[1], 0, 1);
$MatchingPostcode=$arr[0].''.$out.''.$first_char;
echo $MatchingPostcode;
SELECT * FROM my_table WHERE column_name LIKE '%$MatchingPostcode%';
It's a dirty solution, but it will solve your problem. Things like this should be handled in the front end or in the database, but if you must do it in PHP then this is one way.
This code matches anything that includes g401p. If you don't want to match at the start or at the end, remove the % from that side of the pattern. In the example above, it returns every row whose column contains g401p.

Take the length of each stored postcode and trim the postcode you are checking to the same length before comparing.
function check_delivery($postcode_to_check){
$postcodes = array('g657','g658','g659','g659pf','g40','g5');
$counter=0;
foreach($postcodes as $postcode){
$stripped_postcode = substr($postcode_to_check,0, strlen($postcode));
if($postcode==$stripped_postcode){
echo "Matched<br><br>";
break;
}
}
}
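A variation on the function above, as a minimal sketch: it returns the matching row instead of echoing, normalises the input (lower case, spaces removed), and prefers the longest matching prefix, so g659pf is matched by the g659pf row rather than by g659. The name find_delivery_match is made up for illustration, and the row layout is assumed to be the one from the sample data in the question.

```php
<?php
// Hypothetical helper: returns the row whose 'postcode' is the longest
// prefix of the customer's postcode, or null when nothing matches.
function find_delivery_match(string $postcode_to_check, array $postcode_data): ?array
{
    // Normalise "G22 1PF" to "g221pf" so spacing and case do not matter.
    $needle = strtolower(str_replace(' ', '', $postcode_to_check));
    $best = null;
    foreach ($postcode_data as $row) {
        $prefix = strtolower($row['postcode']);
        // strncmp() compares at most strlen($prefix) bytes of both strings.
        if ($prefix !== '' && strncmp($needle, $prefix, strlen($prefix)) === 0) {
            if ($best === null || strlen($prefix) > strlen($best['postcode'])) {
                $best = $row;
            }
        }
    }
    return $best;
}

// Sample data built the same way as in the question.
$postcode_data = [];
foreach (['g657', 'g658', 'g659', 'g659pf', 'g40', 'g5'] as $id => $postcode) {
    $postcode_data[] = ['id' => $id, 'postcode' => $postcode, 'charge' => '20.00'];
}

$match = find_delivery_match('G40 1PF', $postcode_data);
echo $match === null ? "No delivery\n" : "Matched {$match['postcode']}\n";
```

Returning the row (or null) lets the caller read the charge for the matched area instead of parsing echoed output.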

Related

PHP - get total number of array items with a specific number sequence in value

So, basically I'm trying to count the number of landline phone numbers in a list of both landlines and mobile phone numbers $mobile_list (071234567890,02039989435,0781...)
$mobile_array = explode(",",$mobile_list); // turn into an array
$landlines = array_count_values($mobile_array); // create count variable
echo $landlines["020..."]; // print the number of numbers
So, I get the basic count specific elements function, but I don't see where I can specify if an element 'starts with' or 'contains' a sequence. With the above you can only specify an exact phone number (obviously not useful).
Any help would be great!

I don't see any reason to first explode the string to an array, and then check each array item.
That is a complete waste of performance!
I suggest using preg_match_all() with a pattern that puts a word boundary before "020".
That means each matched number has to start with 020.
$mobile_list = "071234567890,02039989435,0781,020122,123020";
preg_match_all("/\b020\d+\b/", $mobile_list, $m);
var_dump($m);
echo count($m[0]); // 2
https://3v4l.org/ucSDm
The lightest and fastest method I have found is to explode on ",020".
The first item of the returned array is ambiguous: it was not preceded by ",020", so we don't know whether it is an 020 number and have to check it separately.
$temp = explode(",020", $mobile_list);
$cnt = count($temp);
if(substr($temp[0],0,3) != "020") $cnt--;
echo $cnt;
A small scale test shows this as the fastest method.
https://3v4l.org/rD54d

You can use array_reduce() to count the occurrences of strings beginning with '020'
$mobile_list = "02039619491,07143502893,02088024526,07351261813,02095694897";
$mobile_array = explode(',', $mobile_list);
function landlineCount($carry, $item)
{
    if (substr($item, 0, 3) === '020') {
        return $carry + 1;
    }
    return $carry;
}
// Start from 0 so that a list without landlines yields 0 rather than null.
$count = array_reduce($mobile_array, 'landlineCount', 0);
echo $count;
prints 3
I'm sure the OP finished what they needed to do hours ago, but for fun here is a faster way to count the landlines.
I hadn't spotted that the original code in the question was exploding the string.
That isn't necessary: you can count the substrings with substr_count(). This would miss the first number, which has no comma before it, so I check for that too with substr().
If you need the total count of all numbers, count the commas with substr_count() again and add one.
$count = substr($mobile_list, 0, 3) === '020' ? 1 : 0;
$count += substr_count($mobile_list, ",020");
$totalCount = substr_count($mobile_list, ",") + 1;
echo $count;
echo $totalCount;
Here is the benchmark, run 1000 times to get an average.
https://3v4l.org/Sma66

Use the array_filter() or preg_grep() functions to find all numbers that contain or start with a given number sequence.
Note: other answers offer an easier and better solution for finding values that start with a given number sequence.
Because you mentioned "but I don't see where I can specify if an element 'starts with' or 'contains' a sequence", my code assumes that you want to find any occurrence of the sequence, not only at the start of each item.
$mobile_list = '02000, 02032435, 039002300, 00305600';
$mobile_array = array_map('trim', explode(",", $mobile_list)); // turn into an array, dropping the spaces
$sequence = '020'; // the sequence to look for
function filter_phone_numbers($mobile_array, $sequence){
return array_filter($mobile_array, function ($item) use ($sequence) {
if (stripos($item, $sequence) !== false) {
return true;
}
return false;
});
}
$filtered_items = array_unique(filter_phone_numbers($mobile_array, $sequence)); // array_unique() removes duplicate numbers from the result
echo count($filtered_items);
Or with preg_grep():
$mobile_list = '02000, 02032435, 039002300, 00305600';
$mobile_array = array_map('trim', explode(",", $mobile_list)); // turn into an array, dropping the spaces
$sequence = preg_quote('020', '~'); // escape the sequence for use inside the ~...~ pattern
function grep_phone_numbers($mobile_array, $sequence){
return preg_grep('~' . $sequence . '~', $mobile_array);
}
// array_unique() removes duplicate numbers from the result
$filtered_items = array_unique(grep_phone_numbers($mobile_array, $sequence));
echo count($filtered_items);

I recommend doing this in the database. The database is designed to manage data and can do this much more efficiently than PHP. You can put the condition into a query and get the result you want in one go:
SELECT * FROM phone_numbers WHERE number LIKE '020%'
If you get the data from the database anyway, the LIKE adds a little time to the query, but less than it takes PHP to loop, call strpos() and store the results. Also, because a smaller dataset is returned, fewer resources are used.

Is it possible to use Knuth-Morris-Pratt Algorithm for string matching on text to text?

I have KMP code in PHP which can do string matching between a word and a text. I wonder if I can use the KMP algorithm for string matching between one text and another. Is it possible or not? And how can I use it to find the matching strings between two texts?
Here's the core of the KMP algorithm:
<?php
class KMP{
function KMPSearch($p,$t){
$result = array();
$pattern = str_split($p);
$text = str_split($t);
$prefix = $this->preKMP($pattern);
// print_r($prefix);
// KMP String Matching
$i = $j = 0;
$num=0;
while($j<count($text)){
while($i>-1 && $pattern[$i]!=$text[$j]){
// on a mismatch, fall back using the prefix table
$i = $prefix[$i];
}
$i++;
$j++;
if($i>=count($pattern)){
// on a full match, record the position where the match starts,
// then use the prefix table to shift the pattern to the right
$result[$num++]=$j-count($pattern);
$i = $prefix[$i];
}
}
return $result;
}
// Making Prefix table with preKMP function
function preKMP($pattern){
$i = 0;
$j = $prefix[0] = -1;
while($i<count($pattern)){
while($j>-1 && $pattern[$i]!=$pattern[$j]){
$j = $prefix[$j];
}
$i++;
$j++;
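// compare the characters themselves; comparing the two isset() results skips valid matches (e.g. "ab" in "aab")
if(isset($pattern[$i]) && $pattern[$i]==$pattern[$j]){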
$prefix[$i]=$prefix[$j];
}else{
$prefix[$i]=$j;
}
}
return $prefix;
}
}
?>
I call this class from my index.php when I want to find a word in the text.
These are the steps I want my code to perform:
(1). I input text 1.
(2). I input text 2.
(3). Text 1 becomes the pattern (every single word in text 1 is treated as a pattern).
(4). My code finds every pattern from text 1 in text 2.
(5). Finally, my code shows me the percentage of similarity.
I hope you can help me or teach me. I've been searching for the answer everywhere but haven't found it yet.

If you just need to find all words that are present in both texts, you don't need any string search algorithm to do it. You can add all words from the first text to a hash table, iterate over the second text, and add the words that are in the hash table to the output list.
You can use a trie instead of a hash table if you want a linear time complexity in the worst case, but I'd get started with a hash table because it's easy to use and is likely to be good enough for practical purposes.
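The hash-table idea can be sketched in PHP as follows, since PHP arrays are hash tables. This is only an illustration under stated assumptions: the function name wordSimilarity is made up, words are taken to be runs of ASCII letters and digits, and "percentage of similarity" is interpreted as the share of distinct words from text 1 that also appear in text 2.

```php
<?php
// Hypothetical helper: finds the distinct words of $text1 that also occur
// in $text2 and expresses them as a percentage of $text1's distinct words.
function wordSimilarity(string $text1, string $text2): array
{
    // Assumption: a word is a run of ASCII letters/digits, compared case-insensitively.
    $tokenize = function (string $text): array {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    };

    // array_flip() turns the word list into a hash table keyed by word.
    $patterns = array_flip($tokenize($text1));

    $common = [];
    foreach ($tokenize($text2) as $word) {
        if (isset($patterns[$word])) {
            $common[$word] = true; // keyed by word, so repeats are counted once
        }
    }

    $percent = count($patterns) > 0 ? 100.0 * count($common) / count($patterns) : 0.0;
    return ['common' => array_keys($common), 'percent' => $percent];
}

$result = wordSimilarity('the quick brown fox', 'The lazy fox jumps over the dog');
echo implode(', ', $result['common']), "\n";
printf("%.1f%%\n", $result['percent']);
```

Each lookup with isset() is a constant-time hash probe on average, so the whole comparison is linear in the combined length of the two texts.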

Parsing a mixed-delimiter data set

I've got a source file that contains some data in a few formats that I need to parse. I'm writing an ETL process that will have to match other data.
Most of the data is in the format city, state (US standard, more or less). Some cities are grouped across heavier population areas with multiple cities combined.
Most of the data looks like this (call this 1):
Elkhart, IN
Some places have multiple cities, delimited by a dash (call this 2):
Hickory-Lenoir-Morganton, NC
It's still not too complicated when the cities are in different states (call this 3):
Steubenville, OH-Weirton, WV
This one threw me for a loop; it makes sense but it flushes the previous formats (call this 4):
Kingsport, TN-Johnson City, TN-Bristol, VA-TN
In that example, Bristol is in both VA and TN. Then there's this (call this 5):
Mayagüez/Aguadilla-Ponce, PR
I'm okay with replacing the slash with a dash and processing the same as a previous example. That contains a diacritic as well and the rest of my data are diacritic-free. I'm okay with stripping the diacritic off, that seems to be somewhat straightforward in PHP.
Then there's my final example (call this 6):
Scranton--Wilkes-Barre--Hazleton, PA
The city name contains a dash so the delimiter between city names is a double dash.
What I'd like to produce is, given any of the above examples and a few hundred other lines that follow the same format, an array of [[city, state],...] for each so I can turn them into SQL. For example, parsing 4 would yield:
[
['Kingsport', 'TN'],
['Johnson City', 'TN'],
['Bristol', 'VA'],
['Bristol', 'TN']
]
I'm using a standard PHP install, I've got preg_match and so on but no PECL libraries. Order is unimportant.
Any thoughts on a good way to do this without a big pile of if-then statements?

I would split the input with '-'s and ','s, then delete empty elements in the array. str_replace followed by explode and array_diff (, array ()) should do the trick.
Then identify states, either by searching a list or by working on the principle that cities don't tend to have names made of two upper-case letters.
Now work through the array. If it's a city, save the name, if it's a state, apply it to the saved cities. Clear the list of cities when you get a city immediately following a state.
Note any exceptions and reformat by hand into a different input.
Hope this helps.

For anyone who's interested, I took the answer from @mike and came up with this:
function SplitLine($line) {
// This is over-simplified, just to cover the given case.
$line = str_replace('ü', 'u', $line);
// Cover case 6.
$delimiter = '-';
if (false !== strpos($line, '--'))
$delimiter = '--';
$line = str_replace('/', $delimiter, $line);
// Case 5 looks like case 2 now.
$parts = explode($delimiter, $line);
$table = array_map(function($part) { return array_map('trim', explode(',', $part)); }, $parts);
// At this point, table contains a grid with missing values.
for ($i = 0; $i < count($table); $i++) {
$row = $table[$i];
// Trivial case (case 1 and 3), go on.
if (2 == count($row))
continue;
if (preg_match('/^[A-Z]{2}$/', $row[0])) {
// Missing city; seek backwards.
$find = $i;
while (2 != count($table[$find]))
$find--;
$table[$i] = [$table[$find][0], $row[0]];
} else {
// Missing state; seek forwards.
$find = $i;
while (2 != count($table[$find]))
$find++;
$table[$i][] = $table[$find][1];
}
}
return $table;
}
It's not pretty and it's slow. It does cover all my cases and since I'm doing an ETL process the speed isn't paramount. There's also no error detection, which works in my particular case.

Analysing string in php

For a project, I have an input which consists of numbers and letters in a specific order, sent from another web page. E.g. 7 numbers for an ID, a number followed by 2 letters for groups, and 1, 2, or 3 numbers for a room number.
To separate them, I think I have to iterate through the whole string, check for each char whether it is a number or a letter, and then use a lot of if/then statements to get the correct type.
Is there a better way to do this, or is this a good way to do it?

A regular expression should be the best solution here: it tells you whether the input matches the expected syntax and, at the same time, returns the separate parts in an array.
For the example you gave:
$idList = [
'1AB12', // OK
'1AB123', // OK
'1AB1234', // KO
'AB1234', //KO
'12AB12', //KO
];
foreach ($idList as $id) {
$isOk = preg_match('/^([0-9])([a-zA-Z]{2})([0-9]{1,3})$/', $id, $match);
if ($isOk) {
echo 'OK : ' . $id;
var_dump($match);
} else {
echo 'KO : ' . $id;
}
}
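If the three kinds of value arrive as separate tokens, one anchored pattern per kind is enough to tell them apart. This is only a sketch: the patterns follow the description in the question (7 digits for an ID, a digit plus 2 letters for a group, 1 to 3 digits for a room), and the function name classifyToken is made up for illustration.

```php
<?php
// Hypothetical helper: returns 'id', 'group' or 'room', or null when the
// token fits none of the formats described in the question.
function classifyToken(string $token): ?string
{
    $patterns = [
        'id'    => '/^[0-9]{7}$/',          // exactly 7 digits
        'group' => '/^[0-9][a-zA-Z]{2}$/',  // one digit followed by two letters
        'room'  => '/^[0-9]{1,3}$/',        // one to three digits
    ];
    foreach ($patterns as $type => $pattern) {
        if (preg_match($pattern, $token) === 1) {
            return $type;
        }
    }
    return null;
}

foreach (['1234567', '1AB', '12', '1ABC'] as $token) {
    echo $token, ' => ', classifyToken($token) ?? 'unknown', "\n";
}
```

Keeping the patterns in one array means a new format only needs a new entry, not another branch of if/then logic.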

Split strings into Dictionary words

I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.
Example:
thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz
Output:
thissentencewasmadefromenglishwords
pure
a.txt
contains the dictionary words
b.txt
contains the strings: one in every line, without spaces made from a..z chars only

Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.
The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.
Output from the search function will be a list of string matches, telling you which words match at what positions.
The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.
So, for example, given your first string:
thissentencewasmadefromenglishwords
You would get (in part):
this, 0
his, 1
sent, 4
ten, 7
etc.
Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.
Once the list is sorted by position, the first thing you do is make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), you take the length of the matched string and see if there's a string match at that position. Add that match's length and see if there's a match at the next position, etc.
If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.
It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are PHP implementations available. How well they work, I don't know.
In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.
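The position-chaining step described above can be sketched independently of the matcher itself. The sketch assumes the Aho-Corasick search has already produced a list of [word, position] pairs; the function name coversWholeString and the hand-written match list are hypothetical, and no particular Aho-Corasick library is implied.

```php
<?php
// Hypothetical helper: $matches is a list of [word, position] pairs, as a
// multi-pattern matcher such as Aho-Corasick would report them.
function coversWholeString(int $length, array $matches): bool
{
    // Index the matches by start position: position => list of match lengths.
    $byPosition = [];
    foreach ($matches as [$word, $position]) {
        $byPosition[$position][] = strlen($word);
    }

    // $reachable[$p] is set when the first $p characters are covered
    // by a gap-free chain of matches starting at position 0.
    $reachable = [0 => true];
    for ($p = 0; $p < $length; $p++) {
        if (!isset($reachable[$p])) {
            continue;
        }
        foreach ($byPosition[$p] ?? [] as $matchLength) {
            $reachable[$p + $matchLength] = true;
        }
    }
    return isset($reachable[$length]);
}

// Hand-written matches for the string "thisisit".
$matches = [['this', 0], ['his', 1], ['is', 2], ['is', 4], ['sit', 5], ['it', 6]];
var_dump(coversWholeString(8, $matches));
```

Because the matches are indexed by position, there is no need to sort them first, and several matches starting at the same position are handled naturally.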

This is a problem that can be solved using dynamic programming, based on the following formulas:
f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contains(s.substring(i-j, i)) } for each j = 1,...,i
First, load your file into a dictionary, then use the DP solution for the above formula.
Pseudo code is something like this, where substring(a, b) covers the characters from index a up to but not including index b:
check(word):
    f = new boolean[word.length() + 1]
    f[0] = true
    for i from 1 to word.length():
        f[i] = false
        for j from 0 to i-1:
            if f[j] AND dictionary.contains(word.substring(j, i)):
                f[i] = true
    return f[word.length()]
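A possible PHP rendering of the recurrence above, as a sketch rather than a tuned implementation. The function name isMadeOfWords, the hard-coded word list standing in for a.txt, and the $maxWordLength shortcut (which only limits how far back the inner loop looks) are assumptions added for illustration.

```php
<?php
// Hypothetical helper: true when $word can be split entirely into dictionary words.
// $dictionary is used as a hash set: the words are the array keys.
function isMadeOfWords(string $word, array $dictionary, int $maxWordLength): bool
{
    $n = strlen($word);
    // $f[$i] is true when the first $i characters can be segmented.
    $f = array_fill(0, $n + 1, false);
    $f[0] = true;
    for ($i = 1; $i <= $n; $i++) {
        // Only look back as far as the longest dictionary word.
        for ($j = max(0, $i - $maxWordLength); $j < $i; $j++) {
            if ($f[$j] && isset($dictionary[substr($word, $j, $i - $j)])) {
                $f[$i] = true;
                break;
            }
        }
    }
    return $f[$n];
}

// Stand-in for the contents of a.txt.
$words = ['this', 'sentence', 'was', 'made', 'from', 'english', 'words', 'pure'];
$dictionary = array_fill_keys($words, true);
$maxWordLength = max(array_map('strlen', $words));

// Stand-in for the lines of b.txt; only the fully segmentable lines are printed.
foreach (['thissentencewasmadefromenglishwords', 'thisisalsobadxyyyaazzz', 'pure'] as $line) {
    if (isMadeOfWords($line, $dictionary, $maxWordLength)) {
        echo $line, "\n";
    }
}
```

With the look-back bounded by the longest word, the work per string is roughly its length times the longest word length, and each dictionary check is a hash lookup.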

I recommend a recursive approach. Something like this:
<?php
$wordsToCheck = array(
'otherword',
'word1andother',
'word1',
'word1word2',
'word1word3',
'word1word2word3'
);
$wordList = array(
'word1',
'word2',
'word3'
);
$results = array();
function onlyListedWords($word, $wordList) {
    // The whole (remaining) string is itself a listed word.
    if (in_array($word, $wordList, true)) {
        return true;
    }
    $length = strlen($word);
    for ($i = 1; $i < $length; $i++) {
        // Try every listed prefix; if the remainder cannot be split,
        // carry on with a longer prefix instead of giving up.
        if (in_array(substr($word, 0, $i), $wordList, true)
            && onlyListedWords(substr($word, $i), $wordList)) {
            return true;
        }
    }
    // No split works, so return an explicit false rather than null.
    return false;
}
foreach ($wordsToCheck as $word) {
if (onlyListedWords($word, $wordList))
$results[] = $word;
}
var_dump($results);
?>
