Find circular replacement patterns - php

Suppose I would like to expand a string by replacing placeholders from a dictionary. The replacement strings can also contain placeholders:
$pattern = "#a# #b#";
$dict = array("a" => "foo", "b" => "bar #c#", "c" => "baz");
while($match_count = preg_match_all('/#([^#])+#/', $pattern, $matches)) {
for($i=0; $i<$match_count; $i++) {
$key = $matches[1][$i];
if(!isset($dict[$key])) { throw new Exception("'$key' not found!"); }
$pattern = str_replace($matches[0][$i], $dict[$key], $pattern);
}
}
echo $pattern;
This works fine, as long as there are no circular replacement patterns, for example "c" => "#b#". Then the program will be thrown into an endless loop until the memory is exhausted.
Is there an easy way to detect such patterns? I'm looking for a solution where the distance between the replacements can be arbitrarily long, eg. a->b->c->d->f->a
Ideally, the solution would also happen while in the loop and not with a separate analysis.

Single character keys
This is quite easy if the keys are single characters: simply check if a string at the value side contains a character that is a key.
foreach ($your_array as $key => $value) {
foreach(str_split($value) as $ch) {
if(array_key_exists ($ch,$your_array) {
#Problem, cycle is possible
}
}
}
#We're fine
Now even if there is cycle, that does not mean it is fired on every string (for instance in the empty string, no patterns will be fired thus no cycles). In that case you can incorporate it into your checker: if a rule is fired for the second time, there is a problem. Simply because if this is the case, the previous pattern has generated an occasion for this, thus the ocasion will be generated over and over again.
String keys
In case the keys are strings as well, this is probably the Post Correspondence Problem which is undecidable...

Thanks to the comment by georg and this post, I came up with a solution that converts the patterns into a graph and uses topological sort to check for circular replacement.
Here is my solution:
$dict = array("a" => "foo", "b" => "bar #c#", "c" => "baz #b#");
# Store incoming and outgoing "connections" for each key => pattern replacement
$nodes = array();
foreach($dict as $patternName => $pattern) {
if (!isset($nodes[$patternName])) {
$nodes[$patternName] = array("in" => array(), "out" => array());
}
$match_count = preg_match_all('/#([^#])+#/', $pattern, $matches);
for ($i=0; $i<$match_count; $i++) {
$key = $matches[1][$i];
if (!isset($dict[$key])) { throw new Exception("'$key' not found!"); }
if (!isset($nodes[$key])) {
$nodes[$key] = array("in" => array(), "out" => array());
}
$nodes[$key]["in"][] = $patternName;
$nodes[$patternName]["out"][] = $key;
}
}
# collect leaf nodes (no incoming connections)
$leafNodes = array();
foreach ($nodes as $key => $connections) {
if (empty($connections["in"])) {
$leafNodes[] = $key;
}
}
# Remove leaf nodes until none are left
while (!empty($leafNodes)) {
$nodeID = array_shift($leafNodes);
foreach ($nodes[$nodeID]["out"] as $outNode) {
$nodes[$outNode]['in'] = array_diff($nodes[$outNode]['in'], array($nodeID));
if (empty($nodes[$outNode]['in'])) {
$leafNodes[] = $outNode;
}
}
$nodes[$nodeID]['out'] = array();
}
# Check for non-leaf nodes. If any are left, there is a circular pattern
foreach ($nodes as $key => $node) {
if (!empty($node["in"]) || !empty($node["out"]) ) {
throw new Exception("Circular replacement pattern for '$key'!");
}
}
# Now we can safely do replacement
$pattern = "#a# #b#";
while ($match_count = preg_match_all('/#([^#])+#/', $pattern, $matches)) {
$key = $matches[1][$i];
$pattern = str_replace($matches[0][$i], $dict[$key], $pattern);
}
echo $pattern;

Related

Generate all possible matches for regex pattern in PHP

There are quite a few questions on SO asking about how to parse a regex pattern and output all possible matches to that pattern. For some reason, though, every single one of them I can find (1, 2, 3, 4, 5, 6, 7, probably more) are either for Java or some variety of C (and just one for JavaScript), and I currently need to do this in PHP.
I’ve Googled to my heart’s (dis)content, but whatever I do, pretty much the only thing that Google gives me is links to the docs for preg_match() and pages about how to use regex, which is the opposite of what I want here.
My regex patterns are all very simple and guaranteed to be finite; the only syntax used is:
[] for character classes
() for subgroups (capturing not required)
| (pipe) for alternative matches within subgroups
? for zero-or-one matches
So an example might be [ct]hun(k|der)(s|ed|ing)? to match all possible forms of the verbs chunk, thunk, chunder and thunder, for a total of sixteen permutations.
Ideally, there’d be a library or tool for PHP which will iterate through (finite) regex patterns and output all possible matches, all ready to go. Does anyone know if such a library/tool already exists?
If not, what is an optimised way to approach making one? This answer for JavaScript is the closest I’ve been able to find to something I should be able to adapt, but unfortunately I just can’t wrap my head around how it actually works, which makes adapting it more tricky. Plus there may well be better ways of doing it in PHP anyway. Some logical pointers as to how the task would best be broken down would be greatly appreciated.
Edit: Since apparently it wasn’t clear how this would look in practice, I am looking for something that will allow this type of input:
$possibleMatches = parseRegexPattern('[ct]hun(k|der)(s|ed|ing)?');
– and printing $possibleMatches should then give something like this (the order of the elements is not important in my case):
Array
(
[0] => chunk
[1] => thunk
[2] => chunks
[3] => thunks
[4] => chunked
[5] => thunked
[6] => chunking
[7] => thunking
[8] => chunder
[9] => thunder
[10] => chunders
[11] => thunders
[12] => chundered
[13] => thundered
[14] => chundering
[15] => thundering
)
Method
You need to strip out the variable patterns; you can use preg_match_all to do this
preg_match_all("/(\[\w+\]|\([\w|]+\))/", '[ct]hun(k|der)(s|ed|ing)?', $matches);
/* Regex:
/(\[\w+\]|\([\w|]+\))/
/ : Pattern delimiter
( : Start of capture group
\[\w+\] : Character class pattern
| : OR operator
\([\w|]+\) : Capture group pattern
) : End of capture group
/ : Pattern delimiter
*/
You can then expand the capture groups to letters or words (depending on type)
$array = str_split($cleanString, 1); // For a character class
$array = explode("|", $cleanString); // For a capture group
Recursively work your way through each $array
Code
function printMatches($pattern, $array, $matchPattern)
{
$currentArray = array_shift($array);
foreach ($currentArray as $option) {
$patternModified = preg_replace($matchPattern, $option, $pattern, 1);
if (!count($array)) {
echo $patternModified, PHP_EOL;
} else {
printMatches($patternModified, $array, $matchPattern);
}
}
}
function prepOptions($matches)
{
foreach ($matches as $match) {
$cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
if ($match[0] === "[") {
$array = str_split($cleanString, 1);
} elseif ($match[0] === "(") {
$array = explode("|", $cleanString);
}
if ($match[-1] === "?") {
$array[] = "";
}
$possibilites[] = $array;
}
return $possibilites;
}
$regex = '[ct]hun(k|der)(s|ed|ing)?';
$matchPattern = "/(\[\w+\]|\([\w|]+\))\??/";
preg_match_all($matchPattern, $regex, $matches);
printMatches(
$regex,
prepOptions($matches[0]),
$matchPattern
);
Additional functionality
Expanding nested groups
In use you would put this before the "preg_match_all".
$regex = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
echo preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
$output = explode("|", $array[3]);
if ($array[0][-1] === "?") {
$output[] = "";
}
foreach ($output as &$option) {
$option = $array[2] . $option;
}
return $array[1] . implode("|", $output);
}, $regex), PHP_EOL;
Output:
This happen(s|ed) to (become|be|have|having) test case 1?
Matching single letters
The bones of this would be to update the regex:
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";
and add an else to the prepOptions function:
} else {
$array = [$cleanString];
}
Full working example
function printMatches($pattern, $array, $matchPattern)
{
$currentArray = array_shift($array);
foreach ($currentArray as $option) {
$patternModified = preg_replace($matchPattern, $option, $pattern, 1);
if (!count($array)) {
echo $patternModified, PHP_EOL;
} else {
printMatches($patternModified, $array, $matchPattern);
}
}
}
function prepOptions($matches)
{
foreach ($matches as $match) {
$cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
if ($match[0] === "[") {
$array = str_split($cleanString, 1);
} elseif ($match[0] === "(") {
$array = explode("|", $cleanString);
} else {
$array = [$cleanString];
}
if ($match[-1] === "?") {
$array[] = "";
}
$possibilites[] = $array;
}
return $possibilites;
}
$regex = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";
$regex = preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
$output = explode("|", $array[3]);
if ($array[0][-1] === "?") {
$output[] = "";
}
foreach ($output as &$option) {
$option = $array[2] . $option;
}
return $array[1] . implode("|", $output);
}, $regex);
preg_match_all($matchPattern, $regex, $matches);
printMatches(
$regex,
prepOptions($matches[0]),
$matchPattern
);
Output:
This happens to become test case 1
This happens to become test case
This happens to be test case 1
This happens to be test case
This happens to have test case 1
This happens to have test case
This happens to having test case 1
This happens to having test case
This happened to become test case 1
This happened to become test case
This happened to be test case 1
This happened to be test case
This happened to have test case 1
This happened to have test case
This happened to having test case 1
This happened to having test case

Multi-dimensional array search for part of word or phrase and remove key

Before I start, I am learning and I am not going to claim to be a PHP expert. I've tried several different things but this method has gotten me closest to what I am looking for.
I have a JSON array that I am looking to search through and if part of the text matches any part of a line (Alerts) in the array, remove the whole key from the array. (If possible, I just want this to match the latest key and not remove all keys that have a match)
The code below is working on the latest item in the array but can't search an older record.
For example,
[8] => Array
(
[Code] => 9
[Alerts] => bob went away
)
[9] => Array
(
[Code] => 9
[Alerts] => randy jumped in the air
)
)
If I call the script, with the term of 'bob' it will find nothing. If I call the script with the term 'randy' it will work perfectly deleting key 9. I can then search for a term of 'bob' and it will remove key 8.
Here is what I have so far. (Again there might be a better way)
<?php
$jsondata = file_get_contents('myfile.json');
$json = json_decode($jsondata, true);
$done = 'term';
$pattern = preg_quote($done, '/');
$pattern = "/^.*$pattern.*\$/m";
$arr_index = array();
foreach ($json as $key => $value)
$contents = $value['Alerts'];
{
if(preg_match($pattern, $contents, $matches))
{
$trial = implode($matches);
}
if ($contents == $trial)
{
$arr_index[] = $key;
}
}
foreach ($arr_index as $i)
{
unset($json[$i]);
}
$json = array_values($json);
file_put_contents('myfile-test.json', json_encode($json));
echo $trial; //What did our search come up with?
die;
}
Thanks again!
The problem is that the code that uses $contents is not inside the foreach loop. The loop just has one statement in its body:
$contents = $value['Alerts'];
When the loop ends, $contents contains the last alert value, and then that's used in the code block after it.
You need to put that statement inside the braces.
<?php
$jsondata = file_get_contents('myfile.json');
$json = json_decode($jsondata, true);
$done = 'term';
$pattern = preg_quote($done, '/');
$pattern = "/^.*$pattern.*\$/m";
$arr_index = array();
foreach ($json as $key => $value)
{
$contents = $value['Alerts'];
if(preg_match($pattern, $contents, $matches))
{
$trial = implode($matches);
}
if ($contents == $trial)
{
$arr_index[] = $key;
}
}
foreach ($arr_index as $i)
{
unset($json[$i]);
}
$json = array_values($json);
file_put_contents('myfile-test.json', json_encode($json));
echo $trial; //What did our search come up with?
die;
}
You should use your editor's feature to automatically indent code, it will make problems like this more obvious.
I was actually able to get it to work by using the following if anyone ever needs it. It's a little more simple than I was originally going about it. This will stop on the first key of found text. To remove all records, simply remove the "break" and it will remove all keys containing said text or phrase.
$pattern = preg_quote($done, '/');
$pattern = "/^.*$pattern.*\$/m";
$arr_index = array();
foreach ($json as $key => $contents)
{
if(preg_match($pattern, $contents['Alerts'], $matches)) {
unset($json[$key]);
break;
}
}
$json = array_values($json);
file_put_contents('myfile-test.json', json_encode($json));

PHP regex capture template tags and if statement tags

I have tags in a html file like this, placed throughout;
*|SUBJECT|*
*|SUBJECT|*
*|IFNOT:ARCHIVE_PAGE|*
*|ARCHIVE|*
*|END:IF|*
*|FACEBOOK:PROFILEURL|*
*|TWITTER:PROFILEURL|*
*|FORWARD|*
*|IF:REWARDS*
*|REWARDS|*
*|END:IF|*
Using this PHP function and regex i can get the results of all the tags
preg_match_all("/\*\|(.*?)\|\*/", $this->template, $elements);
$this->elements["Tags"] = $elements[0];
$this->elements["TagNames"] = $elements[1];
What i want, is to find a way to capture the IF:(TAG) statements and IFNOT:(TAG) statements as well and the content.
What i have so far is
ergex=> /\*\|IF(([A-Z{0-3}]):([A-Z_]+))\|\*(.*?)\*\|END:IF\|\*|\*\|(.*?)\|\*/g
But it only catches the tags them self as a whole, can anyone point me in the right direction or help me out.
As I mentioned in the comments you approach is to simplistic, I can get you started using the method I use for these things. It's more a tokenizer/lexer/parser methodology.
That sounds big and scary but it actually makes it simpler
<?php
function parse($subject, $tokens)
{
$types = array_keys($tokens);
$patterns = [];
$lexer_stream = [];
$result = false;
foreach ($tokens as $k=>$v){
$patterns[] = "(?P<$k>$v)";
}
$pattern = "/".implode('|', $patterns)."/i";
if (preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
//print_r($matches);
foreach ($matches[0] as $key => $value) {
$match = [];
foreach ($types as $type) {
$match = $matches[$type][$key];
if (is_array($match) && $match[1] != -1) {
break;
}
}
$tok = [
'content' => $match[0],
'type' => $type,
'offset' => $match[1]
];
$lexer_stream[] = $tok;
}
$result = parseTokens( $lexer_stream );
}
return $result;
}
function parseTokens( array &$lexer_stream ){
$result = [];
$mode = 'none';
while($current = current($lexer_stream)){
$content = $current['content'];
$type = $current['type'];
switch($type){
case 'T_WHITESPACE':
next($lexer_stream);
break;
case 'T_TAG_START':
$mode = 'start';
next($lexer_stream);
break;
case 'T_WORD':
if($mode == 'start') echo "Tag $content\n";
if($mode == 'ifnot') echo "IfNot $content\n";
next($lexer_stream);
break;
case 'T_TAG_END':
$mode = 'none';
next($lexer_stream);
break;
case 'T_IFNOT':
$mode = 'ifnot';
next($lexer_stream);
break;
case 'T_EOF': return;
case 'T_UNKNOWN':
default:
print_r($current);
trigger_error("Unknown token $type value $content", E_USER_ERROR);
}
}
if( !$current ) return;
print_r($current);
trigger_error("Unclosed item $mode for $type value $content", E_USER_ERROR);
}
$subject = '*|SUBJECT|*
*|SUBJECT|*
*|IFNOT:ARCHIVE_PAGE|*
*|ARCHIVE|*
*|END:IF|*
*|FACEBOOK:PROFILEURL|*
*|TWITTER:PROFILEURL|*
*|FORWARD|*
*|IF:REWARDS*
*|REWARDS|*
*|END:IF|*';
$tokens = [
'T_WHITESPACE' => '[\r\n\s\t]+',
'T_TAG_START' => '\*\|',
'T_TAG_END' => '\|\*',
'T_IF' => 'IF:',
'T_IFNOT' => 'IFNOT:',
'T_ENDIF' => 'END:IF',
'T_WORD' => '\w+',
'T_EOF' => '\Z',
'T_UNKNOWN' => '.+?'
];
parse($subject,$tokens);
So this you can see here
And it outputs:
Tag SUBJECT
Tag SUBJECT
IfNot ARCHIVE_PAGE
Tag ARCHIVE
Array
(
[content] => END:IF
[type] => T_ENDIF
[offset] => 69
)
<br />
<b>Fatal error</b>: Unknown token T_ENDIF value END:IF in <b>[...][...]</b> on line <b>67</b><br />
The error is because I only worked it out to the End if tag (have to leave something for you to do).
For a parser that I did using this for another question you can find that on my github
https://github.com/ArtisticPhoenix/MISC/blob/master/JasonDecoder.php
It should give you some ideas on how to handle nested array like structures etc..
The basic idea is you can just add one tag at a time, and then do the parsing for that one tag. You can do as much or as little error checking on different mode types in the wrong place, and so on. It just gives everything a nice structure to work from. Essential it works the same basic way you method works, using preg_match_all and a string of regx's. The main difference is that it builds the full regx from an array, and then using the array keys and named capture groups (and a bit of code magic) it lets you reference them in a more intuitive way. It also uses the PREG_OFFSET_CAPTURE flag, which I found to be faster then the other flags.
One note is that the order of the tags is important, if you put the T_UNKNOWN tag first it matches everything so it won't go to the tags below it. Therefor they should be the more specific the match the higher in the list. For example you could do a tag like this
'T_IFNOT' => '\*\|IFNOT:',
Instead of the one I have, but it would likely have to go before the:
'T_TAG_START' => '\*\|',
Because that tag will match it first.
Also don't forget to put next($lexer_stream); or it will be an infinite loop. It's necessary to use while and next to control the array pointer when in nested structures like arrays.
Good Luck and happy parsing!

Parse string containing dots in php

I would parse the following string:
$str = 'ProceduresCustomer.tipi_id=10&ProceduresCustomer.id=1';
parse_str($str,$f);
I wish that $f be parsed into:
array(
'ProceduresCustomer.tipi_id' => '10',
'ProceduresCustomer.id' => '1'
)
Actually, the parse_str returns
array(
'ProceduresCustomer_tipi_id' => '10',
'ProceduresCustomer_id' => '1'
)
Beside writing my own function, does anybody know if there is a php function for that?
From the PHP Manual:
Dots and spaces in variable names are converted to underscores. For example <input name="a.b" /> becomes $_REQUEST["a_b"].
So, it is not possible. parse_str() will convert all periods to underscores. If you really can't avoid using periods in your query variable names, you will have to write custom function to achieve this.
The following function (taken from this answer) converts the names of each key-value pair in the query string to their corresponding hexadecimal form and then does a parse_str() on it. Then, they're reverted back to their original form. This way, the periods aren't touched:
function parse_qs($data)
{
$data = preg_replace_callback('/(?:^|(?<=&))[^=[]+/', function($match) {
return bin2hex(urldecode($match[0]));
}, $data);
parse_str($data, $values);
return array_combine(array_map('hex2bin', array_keys($values)), $values);
}
Example usage:
$data = parse_qs($_SERVER['QUERY_STRING']);
Quick 'n' dirty.
$str = "ProceduresCustomer.tipi_id=10&ProceduresCustomer.id=1";
function my_func($str){
$expl = explode("&", $str);
foreach($expl as $r){
$tmp = explode("=", $r);
$out[$tmp[0]] = $tmp[1];
}
return $out;
}
var_dump(my_func($str));
array(2) {
["ProceduresCustomer.tipi_id"]=> string(2) "10"
["ProceduresCustomer.id"]=>string(1) "1"
}
This quick-made function attempts to properly parse the query string and returns an array.
The second (optional) parameter $break_dots tells the parser to create a sub-array when encountering a dot (this goes beyond the question, but I included it anyway).
/**
* parse_name -- Parses a string and returns an array of the key path
* if the string is malformed, only return the original string as a key
*
* $str The string to parse
* $break_dot Whether or not to break on dots (default: false)
*
* Examples :
* + parse_name("var[hello][world]") = array("var", "hello", "world")
* + parse_name("var[hello[world]]") = array("var[hello[world]]") // Malformed
* + parse_name("var.hello.world", true) = array("var", "hello", "world")
* + parse_name("var.hello.world") = array("var.hello.world")
* + parse_name("var[hello][world") = array("var[hello][world") // Malformed
*/
function parse_name ($str, $break_dot = false) {
// Output array
$out = array();
// Name buffer
$buf = '';
// Array counter
$acount = 0;
// Whether or not was a closing bracket, in order to avoid empty indexes
$lastbroke = false;
// Loop on chars
foreach (str_split($str) as $c) {
switch ($c) {
// Encountering '[' flushes the buffer to $out and increments the
// array counter
case '[':
if ($acount == 0) {
if (!$lastbroke) $out[] = $buf;
$buf = "";
$acount++;
$lastbroke = false;
// In this case, the name is malformed. Return it as-is
} else return array($str);
break;
// Encountering ']' flushes rge buffer to $out and decrements the
// array counter
case ']':
if ($acount == 1) {
if (!$lastbroke) $out[] = $buf;
$buf = '';
$acount--;
$lastbroke = true;
// In this case, the name is malformed. Return it as-is
} else return array($str);
break;
// If $break_dot is set to true, flush the buffer to $out.
// Otherwise, treat it as a normal char.
case '.':
if ($break_dot) {
if (!$lastbroke) $out[] = $buf;
$buf = '';
$lastbroke = false;
break;
}
// Add every other char to the buffer
default:
$buf .= $c;
$lastbroke = false;
}
}
// If the counter isn't back to 0 then the string is malformed. Return it as-is
if ($acount > 0) return array($str);
// Otherwise, flush the buffer to $out and return it.
if (!$lastbroke) $out[] = $buf;
return $out;
}
/**
* decode_qstr -- Take a query string and decode it to an array
*
* $str The query string
* $break_dot Whether or not to break field names on dots (default: false)
*/
function decode_qstr ($str, $break_dots = false) {
$out = array();
// '&' is the field separator
$a = explode('&', $str);
// For each field=value pair:
foreach ($a as $param) {
// Break on the first equal sign.
$param = explode('=', $param, 2);
// Parse the field name
$key = parse_name($param[0], $break_dots);
// This piece of code creates the array structure according to th
// decomposition given by parse_name()
$array = &$out; // Reference to the last object. Starts to $out
$append = false; // If an empty key is given, treat it like $array[] = 'value'
foreach ($key as $k) {
// If the current ref isn't an array, make it one
if (!is_array($array)) $array = array();
// If the current key is empty, break the loop and append to current ref
if (empty($k)) {
$append = true;
break;
}
// If the key isn't set, set it :)
if (!isset($array[$k])) $array[$k] = NULL;
// In order to walk down the array, we need to first save the ref in
// $array to $tmp
$tmp = &$array;
// Deletes the ref from $array
unset($array);
// Create a new ref to the next item
$array =& $tmp[$k];
// Delete the save
unset($tmp);
}
// If instructed to append, do that
if ($append) $array[] = $param[1];
// Otherwise, just set the value
else $array = $param[1];
// Destroy the ref for good
unset($array);
}
// Return the result
return $out;
}
I tried to correctly handle multi-level keys. The code is a bit hacky, but it should work. I tried to comment the code, comment if you have any question.
Test case:
var_dump(decode_qstr("ProceduresCustomer.tipi_id=10&ProceduresCustomer.id=1"));
// array(2) {
// ["ProceduresCustomer.tipi_id"]=>
// string(2) "10"
// ["ProceduresCustomer.id"]=>
// string(1) "1"
// }
var_dump(decode_qstr("ProceduresCustomer.tipi_id=10&ProceduresCustomer.id=1", true));
// array(1) {
// ["ProceduresCustomer"]=>
// array(2) {
// ["tipi_id"]=>
// string(2) "10"
// ["id"]=>
// string(1) "1"
// }
// }
I would like to add my solution as well, because I had trouble finding one that did all I needed and would handle all circumstances. I tested it quite thoroughly. It keeps dots and spaces and unmatched square brackets (normally changed to underscores), plus it handles arrays in the input well. Tested in PHP 8.0.0 and 8.0.14.
const periodPlaceholder = 'QQleQPunT';
const spacePlaceholder = 'QQleQSpaTIE';
function parse_str_clean($querystr): array {
// without the converting of spaces and dots etc to underscores.
$qquerystr = str_ireplace(['.','%2E','+',' ','%20'], [periodPlaceholder,periodPlaceholder,spacePlaceholder,spacePlaceholder,spacePlaceholder], $querystr);
$arr = null ; parse_str($qquerystr, $arr);
sanitizeArr($arr, $querystr);
return $arr;
}
function sanitizeArr(&$arr, $querystr) {
foreach($arr as $key=>$val) {
// restore values to original
if ( is_string($val)) {
$newval = str_replace([periodPlaceholder,spacePlaceholder], ["."," "], $val);
if ( $val != $newval) $arr[$key]=$newval;
}
}
unset($val);
foreach($arr as $key=>$val) {
$newkey = str_replace([periodPlaceholder,spacePlaceholder], ["."," "], $key);
if ( str_contains($newkey, '_') ) {
// periode of space or [ or ] converted to _. Restore with querystring
$regex = '/&('.str_replace('_', '[ \.\[\]]', preg_quote($newkey, '/')).')=/';
$matches = null ;
if ( preg_match_all($regex, "&".urldecode($querystr), $matches) ) {
if ( count(array_unique($matches[1])) === 1 && $key != $matches[1][0] ) {
$newkey = $matches[1][0] ;
}
}
}
if ( $newkey != $key ) $arr = array_replace_key($arr,$key, $newkey);
if ( is_array($val)) {
sanitizeArr($arr[$newkey], $querystr);
}
}
}
function array_replace_key($array, $oldKey, $newKey): array {
// preserves order of the array
if( ! array_key_exists( $oldKey, $array ) ) return $array;
$keys = array_keys( $array );
$keys[ array_search( $oldKey, $keys ) ] = $newKey;
return array_combine( $keys, $array );
}
First replaces spaces and . by placeholders in querystring before coding before parsing, later undoes that within the array keys and values. This way we can use the normal parse_str.
Unmatched [ and ] are also replaced by underscores by parse_str, but these cannot be reliably replaced by a placeholder. And we definitely don't want to replace matched []. Hence we don't replace [ and ], en let them be replaced by underscores by parse_str. Then we restore the _ in the resulting keys and seeing in the original querystring if there was a [ or ] there.
Known bug: keys 'something]something' and almost identical 'something[something' may be confused. It's occurrence will be zero, so I left it.
Test:
var_dump(parse_str_clean("code.1=printr%28hahaha&code 1=448044&test.mijn%5B%5D%5B2%5D=test%20Roemer&test%20mijn%5B=test%202e%20Roemer"));
yields correctly
array(4) {
["code.1"]=>
string(13) "printr(hahaha"
["code 1"]=>
string(6) "448044"
["test.mijn"]=>
array(1) {
[0]=>
array(1) {
[2]=>
string(11) "test Roemer"
}
}
["test[mijn"]=>
string(14) "test 2e Roemer"
}
whereas the original parse_str only yields, with the same string:
array(2) {
["code_1"]=>
string(6) "448044"
["test_mijn"]=>
string(14) "test 2e Roemer"
}

php match string to multiple array of keywords

I'm writing a basic categorization tool that will take a title and then compare it to an array of keywords. Example:
$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
Are there creative ways to loop through these categories or to see which category has the most matches? Note that in the 'dining' array, I have regex to match variations on the word candy. I tried the following, but with these category lists getting pretty long, I'm wondering if this is the best way:
$keywordRegex = implode("|",$cat['dining']);
preg_match_all("/(\b{$keywordRegex}\b)/i",$string,$matches]);
Thanks,
Steve
EDIT:
Thanks to #jmathai, I was able to add ranking:
$matches = array();
foreach($keywords as $k => $v) {
str_replace($v, '#####', $masterString,$count);
if($count > 0){
$matches[$k] = $count;
}
}
arsort($matches);
This can be done with a single loop.
I would split candy and candies into separate entries for efficiency. A clever trick would be to replace matches with some token. Let's use 10 #'s.
$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$max = array(null, 0); // category, occurences
foreach($cat as $k => $v) {
$replaced = str_replace($v, '##########', $string);
preg_match_all('/##########/i', $replaced, $matches);
if(count($matches[0]) > $max[1]) {
$max[0] = $k;
$max[1] = count($matches[0]);
}
}
echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";
$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_intersect($string,$val));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
Providing the number of words is not too great, then creating a reverse lookup table might be an idea, then run the title against it.
// One-time reverse category creation
$reverseCat = array();
foreach ($cat as $cCategory => $cWordList) {
foreach ($cWordList as $cWord) {
if (!array_key_exists($cWord, $reverseCat)) {
$reverseCat[$cWord] = array($cCategory);
} else if (!in_array($cCategory, $reverseCat[$cWord])) {
$reverseCat[$cWord][] = $cCategory;
}
}
}
// Processing a title
$stringWords = preg_split("/\b/", $string);
$matchingCategories = array();
foreach ($stringWords as $cWord) {
if (array_key_exists($cWord, $reverseCat)) {
$matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
}
}
$matchingCategories = array_unique($matchingCategories);
You are performing O(n*m) lookup on n being the size of your categories and m being the size of a title. You could try organizing them like this:
const $DINING = 0;
const $SERVICES = 1;
$categories = array(
"food" => $DINING,
"restaurant" => $DINING,
"service" => $SERVICES,
);
Then for each word in a title, check $categories[$word] to find the category - this gets you O(m).
Okay here's my new answer that lets you use regex in $cat[n] values...there's only one caveat about this code that I can't figure out...for some reason, it fails if you have any kind of metacharacter or character class at the beginning of your $cat[n] value.
Example: .*food will not work. But s.afood or sea.* etc... or your example of cand(y|ies) will work. I sort of figured this would be good enough for you since I figured the point of the regex was to handle different tenses of words, and the beginnings of words rarely change in that case.
function rMatch ($a,$b) {
if (preg_match('~^'.$b.'$~i',$a)) return 0;
if ($a>$b) return 1;
return -1;
}
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_uintersect($string,$val,'rMatch'));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);

Categories