Regex with possible empty matches and multi-line match - php

I've been trying to "parse" some data using a regex, and I feel as if I'm close, but I just can't seem to bring it all home.
The data that needs parsing generally looks like this: <param>: <value>\n. The number of params can vary, just as the value can. Still, here's an example:
FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of \n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_ using `\`), and even basic markup!
To push this text into an object, I put together this little expression:
if (preg_match_all('/^([^:\n\\]+):\s*(.+)/m', $this->structuredMessage, $data))
{
$data = array_combine($data[1], $data[2]);
//$data is assoc array FooID => 123456, Name => Chuck, ...
$report = new Report($data);
}
Now, this works all right most of the time, except for the User Message bit: . doesn't match newlines, and if I were to use the s flag, the second group would match everything after FooID: to the very end of the string.
I'm having to use a dirty workaround for that:
$msg = explode(end($data[1]), $string);
$data[2][count($data[2])-1] = array_pop($msg);
After some testing, I've come to understand that sometimes, one or two of the parameters aren't filled in (for example the InternalID can be empty). In that case, my expression doesn't fail, but rather results in:
[1] => Array
(
[0] => FooID
[1] => Name
[2] => When
[3] => InternalID
)
[2] => Array
(
[0] => 123456
[1] => Chuck
[2] => 01/02/2013 01:23:45
[3] => User Message: Hello,
)
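A minimal sketch of why the empty value swallows the next line, assuming a shortened sample input (the variable names are made up for illustration). \s* also matches the newline after an empty value, and .+ insists on at least one character, so the value group ends up on the following line. Restricting the whitespace to \h (horizontal only) and allowing an empty value with .* seems to avoid that:

```php
<?php
// Shortened, hypothetical sample: InternalID has no value.
$text = "FooID: 123456\nInternalID:\nUser Message: Hello,";

// \s* can cross the line break, so InternalID captures the next line.
preg_match_all('/^([^:\n\\\\]+):\s*(.+)/m', $text, $m);
$broken = array_combine($m[1], $m[2]);

// \h* stays on the same line and .* may match nothing.
preg_match_all('/^([^:\n\\\\]+):\h*(.*)/m', $text, $m);
$parsed = array_combine($m[1], $m[2]);

print_r($broken);
print_r($parsed);
```

This only addresses the empty-value case; the multi-line User Message is a separate problem.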
I've been trying various other expressions, and came up with this:
/^([^:\n\\]++)\s{0,}:(.*+)(?!^[^:\n\\]++\s{0,}:)/m
//or:
/^([^:\n\\]+)\s{0,}:(.*)(?!^[^:\\\n]+\s{0,}:)/m
The second version being slightly slower.
That solves the issues I had with InternalID: <void>, but still leaves me with the final obstacle: User Message: <multi-line>. Using the s flag doesn't do the trick with my expression ATM.
I can only think of this:
^([^:\n\\]++)\s{0,}:((\n(?![^\n:\\]++\s{0,}:)|.)*+)
Which is, to my eye at least, too complex to be the only option. Ideas, suggestions, links, ... anything would be greatly appreciated

The following regex should work, but I'm not so sure anymore if it is the right tool for this:
preg_match_all(
'%^ # Start of line
([^:]*) # Match anything until a colon, capture in group 1
:\s* # Match a colon plus optional whitespace
( # Match and capture in group 2:
(?: # Start of non-capturing group (used for alternation)
.*$ # Either match the rest of the line
(?= # only if one of the following follows here:
\Z # The end of the string
| # or
\r?\n # a newline
[^:\n\\\\]* # followed by anything except colon, backslash or newline
: # then a colon
) # End of lookahead
| # or match
(?: # Start of non-capturing group (used for alternation/repetition)
[^:\\\\] # Either match a character except colon or backslash
| # or
\\\\. # match any escaped character
)* # Repeat as needed (end of inner non-capturing group)
) # End of outer non-capturing group
) # End of capturing group 2
$ # Match the end of the line%mx',
$subject, $result, PREG_PATTERN_ORDER);
See it live on regex101.

I'm pretty new to PHP, so maybe this is totally out of whack, but maybe you could use something like:
$data = <<<EOT
FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of \n's. It can be empty, too
EOT;
if ($key = preg_match_all('~^[^:\n]+?:~m', $data, $match)) {
$val = explode('¬', preg_filter('~^[^:\n]+?:~m', '¬', $data));
array_shift($val);
$res = array_combine($match[0], $val);
}
print_r($res);
yields
Array
(
[FooID:] => 123456
[Name:] => Chuck
[When:] => 01/02/2013 01:23:45
[InternalID:] => 789654
[User Message:] => Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of
's. It can be empty, too
)

So here's what I came up with using a tricky preg_replace_callback():
$string ='FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of \n\'s. It can be empty, too
Yellow:cool';
$array = array();
preg_replace_callback('#^(.*?):(.*)|.*$#m', function($m)use(&$array){
static $last_key = ''; // We are going to use this as a reference
if(isset($m[1])){// If there is a normal match (key : value)
$array[$m[1]] = $m[2]; // Then add to array
$last_key = $m[1]; // define the new last key
}else{ // else
$array[$last_key] .= PHP_EOL . $m[0]; // add the whole line to the last entry
}
}, $string); // Anonymous function used thus PHP 5.3+ is required
print_r($array); // print
Online demo
Downside: I'm using PHP_EOL to add newlines, which is OS-dependent.

I think I'd avoid using regex to do this task, instead split it into sub-tasks.
Basic algorithm outline
1. Split the string on \n using explode.
2. Loop over the resulting array.
3. Split each resulting string on :, also using explode, with a limit of 2.
4. If the produced array's length is less than 2, add the entirety of the data to the previous key's value.
5. Otherwise, use the first array index as your key and the second as the value, unless the split colon was escaped (in which case, add the key + split + value to the previous key's value instead).
This algorithm does assume there are no keys with escaped colons. Escaped colons in values will be dealt with just fine (i.e. user input).
Code
$str = <<<EOT
FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID:
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
This\: works too. And can start with any number of \\n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_
using `\`) like so `\:`, and even basic markup!
EOT;
$arr = explode("\n", $str);
$prevKey = '';
$split = ': ';
$output = array();
for ($i = 0, $arrlen = sizeof($arr); $i < $arrlen; $i++) {
$keyValuePair = explode($split, $arr[$i], 2);
// ?: Is this a valid key/value pair
if (sizeof($keyValuePair) < 2 && $i > 0) {
// -> Nope, append the value to the previous key's value
$output[$prevKey] .= "\n" . $keyValuePair[0];
}
else {
// -> Maybe
// ?: Did we miss an escaped colon
if (substr($keyValuePair[0], -1) === '\\') {
// -> Yep, this means this is a value, not a key/value pair append both key and
// value (including the split between) to the previous key's value ignoring
// any colons in the rest of the string (allowing dates to pass through)
$output[$prevKey] .= "\n" . $keyValuePair[0] . $split . $keyValuePair[1];
}
else {
// -> Nope, create a new key with a value
$output[$keyValuePair[0]] = $keyValuePair[1];
$prevKey = $keyValuePair[0];
}
}
}
var_dump($output);
Output
array(5) {
["FooID"]=>
string(6) "123456"
["Name"]=>
string(5) "Chuck"
["When"]=>
string(19) "01/02/2013 01:23:45"
["InternalID"]=>
string(0) ""
["User Message"]=>
string(293) "Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
This\: works too. And can start with any number of \n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_
using `\`) like so `\:`, and even basic markup!"
}
Online demo

Related

Get numbers after dash "-" from an array and preserve dash in his original position. Regex

A user from another thread helped me figure out how to get the numbers from an array, but now I can't get the numbers after the "-" dash. Let me show you what I have and put you in the picture.
I've got an array with the following content:
Array(
[0] => <tr><td>29/06/2015</td><td>19:35</td><td>12345 Column information</td><td>67899 Column information - 12</td><td>Information</td><td>More information</td></tr>
[1] => <tr><td>12/03/2015</td><td>10:12</td><td>98545 Column information</td><td>67659 Column information - 32</td><td>Information</td><td>More information</td></tr>
[2] => <tr><td>11/02/2015</td><td>12:40</td><td>59675 Column information</td><td>94859 Column information - 11</td><td>Information</td><td>More information</td></tr>
[3] => <tr><td>01/01/2015</td><td>20:12</td><td>69365 Column information</td><td>78464 Column information - 63</td><td>Information</td><td>More information</td></tr>
)
Finally I know how to get every number (except the number after dash "-"):
$re = "/.*?(\\d+)\\s.*?(\\d+)\\s.*/m";
$str = "<tr><td>29/06/2015</td><td>19:35</td><td>12345 Column information</td><td>67899 Column information - 12</td><td>Information</td><td>More information</td></tr>";
$subst = "$1, $2";
$result = preg_replace($re, $subst, $str);
Here's the $result output:
foreach($result as $finalresult) echo $finalresult.'<br>';
12345,67899
98545,67659
59675,94859
69365,78464
What I expected from all this process and cannot figure out is to get the number after dash "-" too:
12345,67899-12
98545,67659-32
59675,94859-11
69365,78464-63
But it doesn't end here... when the number after the dash "-" is lower than 50, I need to transform the $result output. See the example below.
If the number after "-" is < 50, it needs to be transformed by taking its first digit and putting it in the units position. The tens position then becomes zero.
When it is 50 or above, the number remains as it is. Example:
12345,67899-12 ------> 12345,67899-01
98545,67659-32 ------> 98545,67659-03
59675,94859-11 ------> 59675,94859-01
52375,53259-49 ------> 52375,53259-04
69365,73464-63 ------> 69365,73464-63
89765,12332-51 ------> 89765,12332-51
38545,54213-70 ------> 38545,54213-70
And now is when my head explodes!
Thanks a lot in advance for your help.
This may be what you are looking for. I modified your regular expression slightly. The (.*?<td>){3} will match anything up to the third <td>. The ?P<first> in the subpattern (?P<first>\d+) etc. is called a named subpattern, which makes their value easy to access from the $matches array.
$a = [
'<tr><td>29/06/2015</td><td>19:35</td><td>12345 Column information</td><td>67899 Column information - 12</td><td>Information</td><td>More information</td></tr>',
'<tr><td>12/03/2015</td><td>10:12</td><td>98545 Column information</td><td>67659 Column information - 32</td><td>Information</td><td>More information</td></tr>',
'<tr><td>11/02/2015</td><td>12:40</td><td>59675 Column information</td><td>94859 Column information - 11</td><td>Information</td><td>More information</td></tr>',
'<tr><td>01/01/2015</td><td>20:12</td><td>69365 Column information</td><td>78464 Column information - 63</td><td>Information</td><td>More information</td></tr>',
];
$result = [];
foreach ($a as $row) {
$p = '#(.*?<td>){3}(?P<first>\d+).*?</td><td>(?P<second>\d+).*?(?P<third>\d+)#';
if (preg_match($p, $row, $matches)) {
if ($matches['third'] < 50) {
$matches['third'] = '0'.$matches['third'][0];
}
$result[] =
$matches['first'] . ',' .
$matches['second'] . '-' .
$matches['third'];
}
}
print_r($result);
Output:
Array
(
[0] => 12345,67899-01
[1] => 98545,67659-03
[2] => 59675,94859-01
[3] => 69365,78464-63
)
This will do the trick for you:
$re = '/.*?(\d+)\s.*?(\d+)\s.*?-\s(\d+).*/';
$str = "<tr><td>29/06/2015</td><td>19:35</td><td>12345 Column information</td><td>67899 Column information - 12</td><td>Information</td><td>More information</td></tr>";
preg_match($re, $str, $matches);
if ($matches[3]<50) $matches[3] = floor($matches[3]/10);
$format = '%d,%d-%02d';
$result = sprintf($format, $matches[1], $matches[2], $matches[3]);
echo $result;
Note that I changed your $re to be single quoted instead of double quoted for readability, and I'm using preg_match instead of preg_replace so I can work with the matched patterns.
To explain the regex to you, there are a few things going on:
/ is the regex delimiter.
.*?: The . tells the regex to match any character. The * says to do it zero or more times, and the ? says to do it in a "lazy" fashion. The plain .* at the end of $re matches the whole rest of the string.
(\d+): The \d is a wildcard telling the regex to match any digit. The + says "one or more times", and the () says to capture this. The first () surrounded group is $matches[1].
\s: Is a wildcard for any space character.
-: Is the literal - character.
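To make the lazy/greedy difference described above a bit more concrete, here is a small sketch on a shortened, made-up row (the variable names are just for illustration). The lazy .*? stops at the first run of digits, while a greedy .* runs to the end of the string and only gives back as little as it has to:

```php
<?php
// Shortened, hypothetical row for illustration.
$str = '<td>12345 Column information</td><td>67899 Column information - 12</td>';

// Lazy: .*? expands one character at a time until \d+ can match.
preg_match('/.*?(\d+)/', $str, $lazy);

// Greedy: .* grabs everything, then backtracks to the last digit.
preg_match('/.*(\d+)/', $str, $greedy);

echo $lazy[1], "\n";   // 12345
echo $greedy[1], "\n"; // 2
```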
Well... I don't know if it will help, but I made this with RegExr and it fits properly:
(([0-9]+){5})|(- [0-9]{2})
I hope you might find it some use!

PHP Regular Expression to Match Function Name and Parameters with string like Needle(needle|needle)

I am filtering database results with a query string that looks like this:
attribute=operator(value|optional value)
I'll use
$_GET['attribute'];
to get the value.
I believe the right approach is using regex to get matches on the rest.
The preferred output would be
print_r($matches);
array(
1 => operator
2 => value
3 => optional value
)
The operator will always be one word and consist of letters: like(), between(), in().
The values can be many different things including letters, numbers, spaces commas, quotation marks, etc...
I was asked where my code was failing and didn't include much code because of how poorly it worked. Based on the accepted answer, I was able to whip up a regex that almost works.
EDIT 1
$pattern = "^([^\|(]+)\(([^\|()]+)(\|*)([^\|()]*)";
Edit 2
$pattern = "^([^\|(]+)\(([^\|()]+)(\|*)([^\|()]*)"; // I thought this would work.
Edit 3
$pattern = "^([^\|(]+)\(([^\|()]+)(\|+)?([^\|()]+)?"; // this does work!
Edit 4
$pattern = "^([^\|(]+)\(([^\|()]+)(?:\|)?([^\|()]+)?"; // this gets rid of the middle matching group.
The only remaining problem is that when the 2nd optional parameter does not exist, there is still an empty element in the $matches array.
This script, with the input "operator(value|optional value)", returns the array you expect:
<?php
$attribute = $_GET['attribute'];
$result = preg_match("/^([\w ]+)\(([\w ]+)\|([\w ]*)\)$/", $attribute, $matches);
print($matches[1] . "\n");
print($matches[2] . "\n");
print($matches[3] . "\n");
?>
This assumes your "values" match [\w ] regexp (all word characters plus space), and that the | you specify is a literal |...
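If the values can also contain commas and quotation marks, and the second value is optional, a variation along these lines might work. This is only a sketch under my own assumptions: parse_filter is a hypothetical helper name, the operator is assumed to be letters only, and values are assumed never to contain |, ( or ).

```php
<?php
// Hypothetical helper: returns [operator, value, optional value] or null.
function parse_filter(string $input): ?array
{
    // Operator: letters only. Values: anything except | ( ).
    // The whole "|second value" part is optional.
    $pattern = '/^([a-z]+)\(([^|()]+)(?:\|([^|()]+))?\)$/i';
    if (!preg_match($pattern, $input, $m)) {
        return null;
    }
    // Drop the full match; preg_match omits a trailing group that did not match.
    return array_slice($m, 1);
}

print_r(parse_filter('between(10|20)'));
print_r(parse_filter('like("foo, bar")'));
```

Note that the returned array is 0-indexed, and when the optional value is missing there is simply no third element rather than an empty one.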

Efficient way to parse this string into array in PHP?

Background
I have an array which I create by splitting a string based on every occurrence of 0d0a using preg_split('/(?<=0d0a)(?!$)/').
For example:
$string = "78781110d0a78782220d0a";
will be split into:
Array ( [0] => 78781110d0a [1] => 78782220d0a )
A valid array element has to start with 7878 and end with 0d0a.
The Problem
But sometimes, there's an additional 0d0a in the string which splits into an extra and invalid array element, i.e., that doesn't begin with 7878.
Take this string for example:
$string = "78781110d0a2220d0a78783330d0a";
This is split into:
Array ( [0] => 78781110d0a [1] => 2220d0a [2] => 78783330d0a )
But it should actually be:
Array ( [0] => 78781110d0a2220d0a [1] => 78783330d0a)
My Solution
I've written the following (messy) code to get around this:
$data = Array('78781110d0a','2220d0a','78783330d0a');
$i = 0; //count for $data array;
$j = 0; //count for $dataFixed array;
$dataFixed = $data;
foreach($data as $packet) {
if (substr($packet,0,4) != "7878") { //if packet doesn't start with 7878, do some fixing
if ($i != 0) { //its the first packet, can't help it!
$j++;
if ((substr(strtolower($packet), -4, 4) == "0d0a")) { //if the packet doesn't end with 0d0a, its 'mostly' not valid, so discard it
$dataFixed[$i-$j] = $dataFixed[$i-$j] . $packet;
}
unset($dataFixed[$i-$j+1]);
$dataFixed = array_values($dataFixed);
}
}
$i++;
}
Description
I first copy the array to another array $dataFixed. In a foreach loop of the $data array, I check whether it starts with 7878. If it doesn't, I join it with the previous array in $data. I then unset the current array in $dataFixed and reset the array elements with array_values.
But I'm not very confident about this solution.. Is there a better, more efficient way?
UPDATE
What if the input string doesn't end in 0d0a like it's supposed to? It will stick to the previous array element.
For e.g.: in the string 78781110d0a2220d0a78783330d0a0000, 0000 should be separated as another array element.
Use another positive lookahead (?=7878) to form:
preg_split('/(?<=0d0a)(?=7878)/',$string)
Note: I removed (?!$) because I wasn't sure what that was for, based on your example data.
For example, this code:
$string = "78781110d0a2220d0a78783330d0a";
$array = preg_split('/(?<=0d0a)(?=7878)/',$string);
print_r($array);
Results in:
Array ( [0] => 78781110d0a2220d0a [1] => 78783330d0a )
UPDATE:
Based on your revised question of having possible random characters at the end of the input string, you can add three lines to make a complete program of:
$string = "78781110d0a2220d0a787830d0a330d0a0000";
$array = preg_split('/(?<=0d0a)(?=7878)/',$string);
$temp = preg_split('/(7878.*0d0a)/',$array[count($array)-1],-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
$array[count($array)-1] = $temp[0];
if(count($temp)>1) { $array[] = $temp[1]; }
print_r($array);
We basically do the initial splitting, then split the last element of the resulting array by the expected data format, keeping the delimiter using PREG_SPLIT_DELIM_CAPTURE. The PREG_SPLIT_NO_EMPTY ensures we won't get an empty array element if the input string doesn't end in random characters.
UPDATE 2:
Based on your comment below where it seems you're implying there might be random characters between any of the desired matches, and you want these random characters preserved, you could do this:
$string = "0078781110d0a2220d0a2220d0a0000787830d0a330d0a000078781110d0a2220d0a0000787830d0a330d0a0000";
$split1 = preg_split('/(7878.*?0d0a)/',$string,-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
$result = array();
foreach($split1 as $e){
$split2 = preg_split('/(.*0d0a)/',$e,-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
foreach($split2 as $el){
// test if $el doesn't start with 7878 and ends with 0d0a
if(strpos($el,'7878') !== 0 && substr($el,-4) == '0d0a'){
//if(preg_match('/^(?!7878).*0d0a$/',$el) === 1){
$result[ count($result)-1 ] = $result[ count($result)-1 ] . $el;
} else {
$result[] = $el;
}
}
}
print_r($result);
The strategy employed here is different than above. First we split the input string based on the delimiter that matches your desired data, using the nongreedy regex .*?. At this point we have some strings that contain the ending of a desired value and some garbage at the end, so we split again based on the last occurrence of "0d0a" with the greedy regex .*0d0a. We then append any of those resulting values that don't start with "7878" but end with "0d0a" to the previous value, as this should repair the first and second halves that got split because it contained an extra "0d0a".
I provided two methods for the innermost if statement, one using regular expressions. The regex one is marginally slower in my testing, so I've left that one commented out.
I might still not have your full requirements, so you'll have to let me know if it works and perhaps provide your full dataset.
I think you are using a delimiter, "0d0a", which also happens to be part of the content! It's not possible to avoid getting junk data as long as the delimiter can also be part of the content. Somehow the delimiter must be unique.
Possible solutions:
Change the delimiter to something else that doesn't occur as part of your data (000000, #!.;).
If you know for certain how long each array item can be, use that. Judging by the examples, that isn't possible.
The solutions given in the other answers only consider the sample data you have shared. If you are confident about what the string will contain, those solutions are pretty good to use. Otherwise, they give you no guarantee!
Best solution: fix the delimiter first, then use regex or explode, whichever you prefer.
Why don't you use preg_match_all instead? You can avoid all of the non-capturing groups (the look aheads, look behinds) in order to split the string (which without the non-capturing groups removes the matches), and just find the matches you're looking for:
Updated
<?php
$string = "00787817878110d0a22278780d0a78783330d0a00";
preg_match_all('/7878.*?0d0a(?=7878|[^(7878)]*?$)/', $string, $arr);
print_r($arr);
?>
Gives an array $arr[0] => ( [0] => 787817878110d0a22278780d0a, [1] => 78783330d0a ). Strips leading and trailing garbage characters (whatever doesn't start with 7878 or end with 7878 or 0d0a).
So $arr[0] would be the array of values that you are looking for.
See example on ideone
Works with multiple 7878 values and multiple 0d0a values (even though that's ridiculous).
Update
If splitting is more your style, why not avoid regular expressions altogether?
<?php
$string = "787817878110d0a22278780d0a78783330d0a";
$arr = explode('0d0a7878', $string);
$string = implode('0d0a,7878', $arr);
$arr = explode(',', $string);
print_r($arr);
?>
Here we split the string by the delimiter 0d0a7878, which is what @CharlieGorichanaz's solution is doing, and props to him for the quick, accurate solution. We then add a comma, because who doesn't love comma separated values? And we explode again on the commas for an array of desired values. Performance-wise, this ought to be faster than using regular expressions. See example.

regex assistance

I am trying to match a semi-dynamically generated string, so I can check that it's in the correct format and then extract the information I need from it. My problem is that no matter how hard I try to grasp regex, I can't fathom it for the life of me, even with the help of so-called generators.
What I have is a couple of different strings like the following: [#img:1234567890], [#user:1234567890] and [#file:file_name-with.ext]. Strings like these are intended to pass through a filter so they can be replaced with links and/or more readable names. But try as I might, I can't come up with a regex for any of them.
I am looking for the format [#word:], from which I will strip the [, ], #, and word so I can then query my DB for whatever it is and work with it accordingly. Just the regex bit is holding me back.
Not sure what you mean by generators. I always use online matchers to see that my test cases work. @Virendra almost had it, except they forgot to escape the [] characters.
/\[#(\w+):(.*)\]/
You need to start and end with a regex delimiter, in this case the '/' character.
Then we escape the '[', which regex otherwise uses to open a character class (for matching sets or ranges of characters), hence the '\['.
Next we match a literal '#' symbol.
Now we want to save this next match so we can use it later so we surround it with ().
\w matches a single word character: a letter, a digit, or an underscore, so not spaces, punctuation, or line breaks. With the +, \w+ matches one or more of them.
Again match a literal :.
Maybe useful to have the second part in a match group as well so (.*) will match any character any number of times, and save it for you.
Then we escape the closing ] as we did earlier.
Since it sounds like you want to use the matches later in a query we can use preg_match to save the matches to an array.
$pattern = '/\[#(\w+):(.*)\]/';
$subject = '[#user:1234567890]';
preg_match($pattern, $subject, $matches);
print_r($matches);
Would output
array(
[0] => '[#user:1234567890]', // Full match
[1] => 'user', // First match
[2] => '1234567890' // Second match
)
An especially helpful tool I've found is txt2re
Here's what I would do.
<pre>
<?php
$subj = 'An image:[#img:1234567890], a user:[#user:1234567890] and a file:[#file:file_name-with.ext]';
preg_match_all('~(?<match>\[#(?<type>[^:]+):(?<value>[^\]]+)\])~',$subj,$matches,PREG_SET_ORDER);
foreach ($matches as &$arr) unset($arr[0],$arr[1],$arr[2],$arr[3]);
print_r($matches);
?>
</pre>
This will output
Array
(
[0] => Array
(
[match] => [#img:1234567890]
[type] => img
[value] => 1234567890
)
[1] => Array
(
[match] => [#user:1234567890]
[type] => user
[value] => 1234567890
)
[2] => Array
(
[match] => [#file:file_name-with.ext]
[type] => file
[value] => file_name-with.ext
)
)
And here's a pseudo version of how I would use the preg_replace_callback() function:
function replace_shortcut($matches) {
global $users;
switch (strtolower($matches['type'])) {
case 'img' : return '<img src="images/img_'.$matches['value'].'.jpg" />';
case 'file' : return ''.$matches['value'].'';
// add id of each user in array
case 'user' : $users[] = (int) $matches['value']; return '%s';
default : return $matches['match'];
}
}
$users = array();
$replaceArr = array();
$subj = 'An image:[#img:1234567890], a user:[#user:1234567890] and a file:[#file:file_name-with.ext]';
// escape percentage signs to avoid complications in the vsprintf function call later
$subj = strtr($subj,array('%'=>'%%'));
$subj = preg_replace_callback('~(?<match>\[#(?<type>[^:]+):(?<value>[^\]]+)\])~','replace_shortcut',$subj);
if (!empty($users)) {
// connect to DB and check users
$query = " SELECT `id`,`nick`,`date_deleted` IS NOT NULL AS 'deleted'
FROM `users` WHERE `id` IN ('".implode("','",$users)."')";
// query
// ...
// and catch results
while ($row = $con->fetch_array()) {
// position of this id in users array:
$idx = array_search($row['id'],$users);
$nick = htmlspecialchars($row['nick']);
$replaceArr[$idx] = $row['deleted'] ?
"<span class=\"user_deleted\">{$nick}</span>" :
"{$nick}";
// delete this key so that we can check id's not found later...
unset($users[$idx]);
}
// in here:
foreach ($users as $key => $value) {
$replaceArr[$key] = '<span class="user_unknown">User'.$value.'</span>';
}
// replace each user reference marked with %s in $subj
$subj = vsprintf($subj,$replaceArr);
} else {
// remove extra percentage signs we added for vsprintf function
$subj = preg_replace('~%{2}~','%',$subj);
}
unset($query,$row,$nick,$idx,$key,$value,$users,$replaceArr);
echo $subj;
You can try something like this:
/\[#(\w+):([^]]*)\]/
\[ escapes the [ character (otherwise interpreted as a character set); \w means any "word" character, and [^]]* means any non-] character (to avoid matching past the end of the tag, as .* might). The parens group the various matched parts so that you can use $1 and $2 in preg_replace to generate the replacement text:
echo preg_replace('/\[#(\w+):([^]]*)\]/', '$1 $2', '[#link:abcdef]');
prints link abcdef
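A quick sketch of why [^]]* is safer than .* here, using a made-up subject with two tags on one line (variable names are just for illustration):

```php
<?php
// Hypothetical subject containing two tags.
$subject = 'See [#img:123] and [#user:456]';

// Greedy .* runs to the last ] in the string.
preg_match('/\[#(\w+):(.*)\]/', $subject, $greedy);

// [^]]* cannot cross a ], so it stops at the end of the first tag.
preg_match('/\[#(\w+):([^]]*)\]/', $subject, $bounded);

echo $greedy[2], "\n";  // 123] and [#user:456
echo $bounded[2], "\n"; // 123
```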

Parsing plain text in such a way that will recognise a custom if statement

I have the following string:
$string = "The man has {NUM_DOGS} dogs.";
I'm parsing this by running it through the following function:
function parse_text($string)
{
global $num_dogs;
$string = str_replace('{NUM_DOGS}', $num_dogs, $string);
return $string;
}
parse_text($string);
Where $num_dogs is a preset variable. Depending on $num_dogs, this could return any of the following strings:
The man has 1 dogs.
The man has 2 dogs.
The man has 500 dogs.
The problem is that in the case that "the man has 1 dogs", dog is pluralised, which is undesired. I know that this could be solved simply by not using the parse_text function and instead doing something like:
if($num_dogs == 1){
$string = "The man has 1 dog.";
}else{
$string = "The man has $num_dogs dogs.";
}
But in my application I'm parsing more than just {NUM_DOGS} and it'd take a lot of lines to write all the conditions.
I need a shorthand way which I can write into the initial $string which I can run through a parser, which ideally wouldn't limit me to just two true/false possibilities.
For example, let
$string = 'The man has {NUM_DOGS} [{NUM_DOGS}|0=>"dogs",1=>"dog called fred",2=>"dogs called fred and harry",3=>"dogs called fred, harry and buster"].';
Is it clear what's happened at the end? I've attempted to initiate the creation of an array using the part inside the square brackets that's after the vertical bar, then compare the key of the new array with the parsed value of {NUM_DOGS} (which by now will be the $num_dogs variable at the left of the vertical bar), and return the value of the array entry with that key.
If that's not totally confusing, is it possible using the preg_* functions?
The premise of your question is that you want to match a specific pattern and then replace it after performing additional processing on the matched text.
Seems like an ideal candidate for preg_replace_callback
The regular expressions for capturing matched parenthesis, quotes, braces etc. can become quite complicated, and to do it all with a regular expression is in fact quite inefficient. In fact you'd need to write a proper parser if that's what you require.
For this question I'm going to assume a limited level of complexity, and tackle it with a two stage parse using regex.
First of all, the simplest regex I can think of for capturing tokens between curly braces.
/{([^}]+)}/
Let's break that down.
{ # A literal opening brace
( # Begin capture
[^}]+ # Everything that's not a closing brace (one or more times)
) # End capture
} # Literal closing brace
When applied to a string with preg_match_all the results look something like:
array (
0 => array (
0 => 'A string {TOK_ONE}',
1 => ' with {TOK_TWO|0=>"no", 1=>"one", 2=>"two"}',
),
1 => array (
0 => 'TOK_ONE',
1 => 'TOK_TWO|0=>"no", 1=>"one", 2=>"two"',
),
)
Looks good so far.
Please note that if you have nested braces in your strings, i.e. {TOK_TWO|0=>"hi {x} y"}, this regex will not work. If this won't be a problem, skip down to the next section.
It is possible to do top-level matching, but the only way I have ever been able to do it is via recursion. Most regex veterans will tell you that as soon as you add recursion to a regex, it stops being a regex.
This is where the additional processing complexity kicks in, and with long complicated strings it's very easy to run out of stack space and crash your program. Use it carefully if you need to use it at all.
The recursive regex taken from one of my other answers and modified a little.
/{((?:[^{}]*|(?R))*)}/
Broken down.
{ # literal brace
( # begin capture
(?: # don't create another capture set
[^{}]* # everything not a brace
|(?R) # OR recurse
)* # none or more times
) # end capture
} # literal brace
And this time the output only matches top-level braces
array (
0 => array (
0 => '{TOK_ONE|0=>"a {nested} brace"}',
),
1 => array (
0 => 'TOK_ONE|0=>"a {nested} brace"',
),
)
Again, don't use the recursive regex unless you have to. (Your system may not even support them if it has an old PCRE library)
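For reference, a small sketch of the recursive approach run through preg_match_all. It is a variation, not the exact pattern above: I've assumed a possessive [^{}]++ instead of [^{}]* to cut down on backtracking, and the sample string is made up.

```php
<?php
// Hypothetical sample with one nested and one plain token.
$subject = 'a {TOK_ONE|0=>"a {nested} brace"} b {X}';

// (?R) recurses into the whole pattern, so nested braces are consumed
// as part of the enclosing top-level match.
preg_match_all('/\{((?:[^{}]++|(?R))*)\}/', $subject, $m);

print_r($m[1]);
```

Only the two top-level tokens should come back in $m[1]; the nested {nested} stays inside the first one.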
With that out of the way we need to work out if the token has options associated with it. Instead of having two fragments to be matched as per your question, I'd recommend keeping the options with the token as per my examples. {TOKEN|0=>"option"}
Let's assume $match contains a matched token. If we check for a pipe | and take the substring of everything after it, we'll be left with your list of options, which we can again parse out with a regex. (Don't worry, I'll bring everything together at the end.)
/(\d+)\s*=>\s*"([^"]*)",?/
Broken down.
(\d+) # Capture one or more decimal digits
\s* # Any amount of whitespace (allows you to do 0 => "")
=> # Literal pointy arrow
\s* # Any amount of whitespace
" # Literal quote
([^"]*) # Capture anything that isn't a quote
" # Literal quote
,? # Maybe followed by a comma
And an example match
array (
0 => array (
0 => '0=>"no",',
1 => '1 => "one",',
2 => '2=>"two"',
),
1 => array (
0 => '0',
1 => '1',
2 => '2',
),
2 => array (
0 => 'no',
1 => 'one',
2 => 'two',
),
)
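As a rough sketch of turning such a match into a lookup table, assuming an option list whose keys may have more than one digit (which is why the digits are captured together as (\d+) here; the sample string and variable names are made up):

```php
<?php
// Hypothetical option list, including a two-digit key.
$optionList = '0=>"no", 1 => "one", 12=>"a dozen"';

// Group 1: the numeric key, group 2: the quoted text.
preg_match_all('/(\d+)\s*=>\s*"([^"]*)",?/', $optionList, $m);

// Pair keys with values; numeric string keys become integer keys.
$map = array_combine($m[1], $m[2]);

print_r($map);
```

Looking up $map[12] should then give "a dozen".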
If you want to use quotes inside your quotes, you'll have to make your own recursive regex for it.
Wrapping up, here's a working example.
Some initialisation code.
$options = array(
'WERE' => 1,
'TYPE' => 'cat',
'PLURAL' => 1,
'NAME' => 2
);
$string = 'There {WERE|0=>"was a",1=>"were"} ' .
'{TYPE}{PLURAL|1=>"s"} named bob' .
'{NAME|1=>" and bib",2=>" and alice"}';
And everything together.
$string = preg_replace_callback('/{([^}]+)}/', function($match) use ($options) {
$match = $match[1];
if (false !== $pipe = strpos($match, '|')) {
$tokens = substr($match, $pipe + 1);
$match = substr($match, 0, $pipe);
} else {
$tokens = array();
}
if (isset($options[$match])) {
if ($tokens) {
preg_match_all('/(\d+)\s*=>\s*"([^"]*)",?/', $tokens, $tokens);
$tokens = array_combine($tokens[1], $tokens[2]);
return $tokens[$options[$match]];
}
return $options[$match];
}
return '';
}, $string);
Please note the error checking is minimal, there will be unexpected results if you pick options that don't exist.
There's probably a lot simpler way to do all of this, but I just took the idea and ran with it.
First of all, it is a bit debatable, but if you can easily avoid it, just pass $num_dogs as an argument to the function as most people believe global variables are evil!
Next, for the getting the "s", I generally do something like this:
$dogs_plural = ($num_dogs == 1) ? '' : 's';
Then just do something like this:
$your_string = "The man has $num_dogs dog$dogs_plural";
It's essentially the same thing as doing an if/else block, but less lines of code and you only have to write the text once.
As for the other part, I am STILL confused about what you're trying to do, but I believe you are looking for some sort of way to convert
[{NUM_DOGS}|0=>"dogs",1=>"dog called fred",2=>"dogs called fred and harry",3=>"dogs called fred, harry and buster"]
into:
switch ($num_dogs) {
case 0:
return 'dogs';
break;
case 1:
return 'dog called fred';
break;
case 2:
return 'dogs called fred and harry';
break;
case 3:
return 'dogs called fred, harry and buster';
break;
}
The easiest way is to try to use a combination of explode() and regex to then get it to do something like I have above.
In a pinch, I have done something similar to what you're asking with an implementation vaguely like the code below.
This is nowhere near as feature rich as @Mike's answer, but it has done the trick in the past.
/**
* This function pluralizes words, as appropriate.
*
* It is a completely naive, example-only implementation.
* There are existing "inflector" implementations that do this
* quite well for many/most *English* words.
*/
function pluralize($count, $word)
{
if ($count === 1)
{
return $word;
}
return $word . 's';
}
/**
* Matches template patterns in the following forms:
* {NAME} - Replaces {NAME} with value from $values['NAME']
* {NAME:word} - Replaces {NAME:word} with 'word', pluralized using the pluralize() function above.
*/
function parse($template, array $values)
{
$callback = function ($matches) use ($values) {
$number = $values[$matches['name']];
if (array_key_exists('word', $matches)) {
return pluralize($number, $matches['word']);
}
return $number;
};
$pattern = '/\{(?<name>.+?)(:(?<word>.+?))?\}/i';
return preg_replace_callback($pattern, $callback, $template);
}
Here are some examples similar to your original question...
echo parse(
'The man has {NUM_DOGS} {NUM_DOGS:dog}.' . PHP_EOL,
array('NUM_DOGS' => 2)
);
echo parse(
'The man has {NUM_DOGS} {NUM_DOGS:dog}.' . PHP_EOL,
array('NUM_DOGS' => 1)
);
The output is:
The man has 2 dogs.
The man has 1 dog.
It may be worth mentioning that in larger projects I've invariably ended up ditching any custom rolled inflection in favour of GNU gettext which seems to be the most sane way forward once multi-lingual is a requirement.
This was copied from an answer posted by flussence back in 2009 in response to this question:
You might want to look at the gettext extension. More specifically, it sounds like ngettext() will do what you want: it pluralises words correctly as long as you have a number to count from.
print ngettext('odor', 'odors', 1); // prints "odor"
print ngettext('odor', 'odors', 4); // prints "odors"
print ngettext('%d cat', '%d cats', 4); // prints "4 cats"
You can also make it handle translated plural forms correctly, which is its main purpose, though it's quite a lot of extra work to do.
