PCRE conditional subpatterns: (R) as the condition - php

This is from the PHP manual regarding PCRE conditional subpatterns:
The two possible forms of conditional subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
That's fine as long as the condition is a digit or an assertion, but I don't quite understand the following:
If the condition is the string (R), it is satisfied if a recursive
call to the pattern or subpattern has been made. At "top level", the
condition is false. (...) If the condition is not a sequence of digits
or (R), it must be an assertion.
I would be grateful if someone could explain with an example what (R) means in a conditional subpattern and how to use it. Thanks in advance.

As an additional and clearer answer…
Two days ago I was writing a pattern to match an IPv4 address and found myself using recursion as the condition, so I thought I should share it (a real case makes more sense than a contrived example).
~
(?:(?:f|ht)tps?://)? # possibly a protocol
(
(?(R)\.) # if it\'s a recursion, require a dot
(?: # this part basically looks for 0-255
2(?:[0-4]\d|5[0-5])
| 1\d\d
| \d\d?
)
)(?1){3} # go into recursion 3 times
# for clarity I\'m not including the remaining part
~xi
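To make the idea easy to try out, here is a minimal sketch that keeps only the IPv4 part of the pattern above and anchors it. The function name isIpv4 and the ^ and $ anchors are my own additions for illustration, not part of the original pattern.

```php
<?php
// Hypothetical helper: anchored, IPv4-only variant of the pattern above.
function isIpv4(string $s): bool
{
    $pattern = '~
        ^
        (
            (?(R)\.)                  # inside a call to group 1, require a leading dot
            (?: 2(?:[0-4]\d|5[0-5])   # 200-255
              | 1\d\d                 # 100-199
              | \d\d?                 # 0-99
            )
        )
        (?1){3}                       # call group 1 three more times
        $
    ~x';

    return preg_match($pattern, $s) === 1;
}

var_dump(isIpv4('192.168.0.1')); // bool(true)
var_dump(isIpv4('256.1.1.1'));   // bool(false)
var_dump(isIpv4('1.2.3'));       // bool(false)
```

At the top level the (R) condition is false, so the first octet needs no dot; each of the three (?1) calls runs with (R) true, so every later octet must start with one.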

From what I understand about using recursion as the condition in a subpattern, here is a very basic example.
$str = 'ds1aadfg346fgf gd4th9u6eth0';
preg_match_all('~(?(R).(?(?=[^\d])(?R))|\d(?R)?)~'
/*
(? # [begin outer cond.subpat.]
(R) # if this is a recursion ------> IF
. # match the first char
(? # [begin inner cond.subpat.]
(?=[^\d]) # if the next char is not a digit
(?R) # reenter recursion
) # [end inner cond.subpat.]
| # otherwise -----> ELSE
\d(?R)? # match a digit and enter recursion (note the ?)
) # [end outer cond.subpat.]
*/
,$str,$m);
print_r($m[0]);
And the output:
Array
(
[0] => 1aadfg
[1] => 34
[2] => 6fgf gd
[3] => 4th
[4] => 9u
[5] => 6eth
[6] => 0
)
I know this is a silly example but I hope it makes sense.

The (R) stands for recursion. Here is a good example of using it.
Recursive patterns
Not sure I have ever seen (R) used as the condition, or even a situation where it would be useful, at least not as far as I understand it. But you learn new stuff every day in programming.
It can be used very easily as the condition that selects the true or false branch,
as in this pattern:
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
Here (R) is the condition, whereas (?R) is the recursive call in the second alternative of the group.
This matches text in angle brackets, allowing for arbitrary nesting. Only digits are allowed in nested brackets (that is, when recursing), whereas any characters are permitted at the outer level.
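A small sketch of how the angle-bracket pattern behaves from PHP. The helper name firstBracketMatch and the two test strings are made up for illustration.

```php
<?php
// Hypothetical helper: returns the first substring the pattern matches, or null.
function firstBracketMatch(string $s): ?string
{
    // (R) is false at top level, so [^<>]*+ is used there;
    // inside a (?R) call it is true, so only digits (\d++) are accepted.
    $pattern = '~< (?: (?(R) \d++ | [^<>]*+ ) | (?R) )* >~x';

    return preg_match($pattern, $s, $m) === 1 ? $m[0] : null;
}

var_dump(firstBracketMatch('<abc<123>def>')); // string(13) "<abc<123>def>"
var_dump(firstBracketMatch('<abc<xyz>def>')); // string(5) "<xyz>"
```

In the second string the nested part contains letters, so the outer bracket cannot match as a whole; the engine moves on and matches the inner <xyz> on its own, where it counts as top level.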
I know this is not the answer you are looking for.... You have now sent me on a quest to research this.

Related

RegExp in PHP. Get text between first level parentheses

I have two type of strings in one text:
a(bc)de(fg)h
a(bcd(ef)g)h
I need to get text between first level parentheses. In my example this is:
bc
fg
bcd(ef)g
I tried to use the following regular expression, /\((.+)\)/, with the ungreedy (U) flag:
bc
fg
bcd(ef
And without it:
bc)de(fg
bcd(ef)g
Neither variant does what I need. Does anyone know how to solve this?
Use a PCRE recursive pattern to match substrings in nested parentheses:
$str = "a(bc)de(fg)h some text a(bcd(ef)g)h ";
preg_match_all("/\((((?>[^()]+)|(?R))*)\)/", $str, $m);
print_r($m[1]);
The output:
Array
(
[0] => bc
[1] => fg
[2] => bcd(ef)g
)
\( ( (?>[^()]+) | (?R) )* \)
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (i.e. a correctly parenthesized
substring). Finally, there is a closing parenthesis.
Technical cautions:
If there are more than 15 capturing parentheses in a pattern, PCRE has
to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no
memory can be obtained, it saves data for the first 15 capturing
parentheses only, as there is no way to give an out-of-memory error
from within a recursion.
This question pretty much has the answer, but the implementations are a little ambiguous. You can use the logic in the accepted answer without the ~s to get this regex:
\(((?:[^\(\)]++|(?R))*)\)
Tested with this output:
Please can you try that:
preg_match("/\((.+)\)/", $input_line, $output_array);
Test this code in http://www.phpliveregex.com/
Regex: \((.+)\)
Input: a(bcd(eaerga(er)gaergf)g)h
Output: array(2
0 => (bcd(eaerga(er)gaergf)g)
1 => bcd(eaerga(er)gaergf)g
)

An easier way to match a pattern either x or y times

I am trying to write a regex to validate a numerical code. The code can be one of a few valid lengths, but not all of the lengths between. I know I can do something like
preg_match("/^([0-9]{".$x."}|[0-9]{".$y."})$/", $string)
but would rather not have to repeat the subpattern for each valid length. Especially as my actual regular expression is already going to be on the complex side.
preg_match("/^[0-9]{".$x.",".$y."}$/", $string)
obviously won't work for me, as it would also match any number of digits between $x and $y.
Is there an easier way to use a regex to match a pattern either x or y times?
Edit:
While my complete regex is a bit complex, the portion that can be repeated x or y times is very simple [0-9], so answers like the ones given by sln and barmar, while interesting, will not solve the problem in this particular case.
Especially as my actual regular expression is already going to be on the complex side.
Then this is the only alternative available in all of Regular Expression land.
I guess since this is PHP, you can always put a singular unit in a function, then call the function using a range quantifier in a series of alternations...
You can give the functions shorter, less descriptive names, like A, B, or C ...
The big advantage is that you can inject other separator code, for instance
(?:(?&digit)\s*){4} or whatever you want.
# \b(?:(?&digit){4}|(?&digit){7}|(?&digit){9}|(?&digit){11})\b(?(DEFINE)(?<digit>[0-9]))
\b # add a boundary here
(?:
(?&digit){4} # match 4 times
| (?&digit){7} # or, match 7 times
| (?&digit){9} # or, match 9 times
| (?&digit){11} # or, match 11 times
)
\b # add a boundary here
(?(DEFINE) # Add complex expressions here
(?<digit> [0-9] ) # (1)
)
Input: 1234 1234567 123456789
Output:
** Grp 0 - ( pos 0 , len 4 )
1234
** Grp 0 - ( pos 5 , len 7 )
1234567
** Grp 0 - ( pos 13 , len 9 )
123456789
Put the complex pattern in a variable, then use that in the regexp.
$pat = 'complex pattern';
// wrap the pattern in a non-capturing group so the quantifier applies to all of it
preg_match("/^((?:$pat){" . $x . "}|(?:$pat){" . $y . "})$/", $string);
Note, by the way, that I had to use concatenation rather than substitution to get $x and $y inside the {} in the regexp. This is because {$var} is the syntax for "complex variable substitution" in PHP strings, so the {} will not appear in the resulting string. See the PHP documentation on double quoted strings.
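If there are more than two valid lengths, the alternation can be generated instead of typed out. This is only a sketch: the function name repeatEither is hypothetical, and it assumes the unit pattern contains no "/" (the delimiter used here).

```php
<?php
// Hypothetical helper: builds /^(?:(?:unit){a}|(?:unit){b}|...)$/ from a list of counts.
function repeatEither(string $unit, array $counts): string
{
    $alternatives = [];
    foreach ($counts as $n) {
        // Concatenation keeps the braces literal; the non-capturing group
        // makes the quantifier apply to the whole unit.
        $alternatives[] = '(?:' . $unit . '){' . (int) $n . '}';
    }

    return '/^(?:' . implode('|', $alternatives) . ')$/';
}

$regex = repeatEither('[0-9]', [4, 7]);
echo $regex, "\n";                          // /^(?:(?:[0-9]){4}|(?:[0-9]){7})$/
var_dump(preg_match($regex, '1234'));       // int(1)
var_dump(preg_match($regex, '12345'));      // int(0)
var_dump(preg_match($regex, '1234567'));    // int(1)
```

The subpattern is still repeated in the final regex, but it is written only once in the source.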

PHP Regex preg_match extraction

Although I know regex well enough in pseudocode, I'm having trouble translating what I want to do into PHP's Perl-compatible regex.
I'm trying to use preg_match to extract part of my expression.
I have the following string ${classA.methodA.methodB(classB.methodC(classB.methodD)))} and I need to do two things:
a. validate the syntax
${classA.methodA.methodB(classB.methodC(classB.methodD)))} valid
${classA.methodA.methodB} valid
${classA.methodA.methodB()} not valid
${methodB(methodC(classB.methodD)))} not valid
b. extract the following information
${classA.methodA.methodB(classB.methodC(classB.methodD)))} should return
1. classA
2. methodA
3. methodB(classB.methodC(classB.methodD)))
I've created this code
$expression = '${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}';
$pattern = '/\$\{(?:([a-zA-Z0-9]+)\.)(?:([a-zA-Z\d]+)\.)*([a-zA-Z\d.()]+)\}/';
if(preg_match($pattern, $expression, $matches))
{
echo 'found'.'<br/>';
for($i = 0; $i < count($matches); $i++)
echo $i." ".$matches[$i].'<br/>';
}
The result is :
found
0 ${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}
1 myvalue
2 fsdf
3 blo(fsdf.fsfds(fsfs.fs))
Obviously I'm having difficulty extracting the repeated methods, and the pattern does not validate properly (honestly, I left that for last, until I solve the other problem): empty parentheses are allowed, and it does not check that every opened parenthesis is closed.
Thanks all
UPDATE
X m.buettner
Thanks for your help. I gave your code a quick try and it has one very small issue, although I can work around it. It is the same issue I had with one of my earlier attempts that I didn't post here, and it shows up when I try this string:
$expression = '${myvalue.fdsfs}';
with your pattern definition it shows :
found
0 ${myvalue.fdsfs}
1 myvalue.fdsfs
2 myvalue
3
4 fdsfs
As you can see, the third capture comes back as an empty string that is not present in the input. I couldn't understand why that happens, so can you suggest how to avoid it, or do I have to live with it because of the limits of PHP regex?
That said, I just want to say thank you. Not only did you answer my question, you also included as much information as possible, with many suggestions on the right path to follow when developing patterns.
One last thing: I (stupidly) forgot to add one small but important case, multiple parameters separated by commas, so
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB)}';
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB,classD.mehtodDA)}';
must be valid.
I edited the pattern to this:
$expressionPattern =
'/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
(?: # non-capturing group for multiple arguments
[,] # literal ,
(?1) # recursively apply the pattern inside group 1 on parameters
)* # end of multiple arguments group; repeat 0 or more times
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
X Casimir et Hippolyte
Your suggestion is also good, but using this code makes things a little more complex. I mean, the code itself is easy to understand, but it is less flexible. That said, it also gave me a lot of information that will surely be helpful in the future.
X Denomales
Thanks for your support, but your code fails when I try this:
$sourcestring='${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}';
the result is :
Array
(
[0] => Array
(
[0] => ${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}
)
[1] => Array
(
[0] => classA1
)
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1.methodB1(classB.methodC(classB.methodD))
)
)
It should be
[2] => Array
(
[0] => methodA0.methodA1
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
or
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1
)
[4] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
This is a tough one. Recursive patterns are often beyond what's possible with regular expressions, and even when it is possible, it can lead to expressions that are very hard to understand and maintain.
You are using PHP and therefore PCRE, which indeed supports the recursive regex constructs (?n). As your recursive pattern is quite regular it is possible to find a somewhat practical solution using regex.
One caveat I should mention right away: since you allow an arbitrary number of "intermediate" method calls per level (in your snippet fdsfs and fsdf), you cannot get all of these in separate captures. That is simply impossible with PCRE. Each match will always yield the same finite number of captures, determined by the number of opening parentheses your pattern contains. If a capturing group is used repeatedly (e.g. using something like ([a-z]+\.)+) then every time the group is used the previous capture will be overwritten and you only get the last instance. Therefore, I recommend that you capture all the "intermediate" method calls together, and then simply explode that result.
Likewise you couldn't (if you wanted to) get the captures of multiple nesting levels at once. Hence, your desired captures (where the last one includes all nesting levels) are the only option - you can then apply the pattern again to that last match to go a level further down.
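A quick way to see the overwriting of a repeated capturing group, and the capture-then-explode workaround, is the following sketch. The input string and variable names are made up for illustration.

```php
<?php
$input = 'abc.def.ghi.';

// A repeated capturing group keeps only its last iteration.
preg_match('/^([a-z]+\.)+$/', $input, $last);
var_dump($last[1]); // string(4) "ghi."

// Capturing the whole repetition in one group keeps everything ...
preg_match('/^((?:[a-z]+\.)+)$/', $input, $all);
var_dump($all[1]);  // string(12) "abc.def.ghi."

// ... and explode() splits it afterwards; the limit -1 drops the
// empty element that follows the trailing period.
$names = explode('.', $all[1], -1);
print_r($names);    // abc, def, ghi
```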
Now for the actual expression:
$pattern = '/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
A few general notes: for complicated expressions (and in regex flavors that support it), always use the free-spacing x modifier which allows you to introduce whitespace and comments to format the expression to your desires. Without them, the pattern looks like this:
'/^[$][{](([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))[}]$/ix'
Even if you've written the regex yourself and you are the only one who ever works on the project - try understanding this a month from now.
Second, I've slightly simplified the pattern by using the case-insensitive i modifier. It simply removes some clutter, because you can omit the upper-case variants of your letters.
Third, note that I use single-character classes like [$] and [.] to escape characters where this is possible. That is simply a matter of taste, and you are free to use the backslash variants. I just personally prefer the readability of the character classes (and I know others here disagree), so I wanted to present you this option as well.
Fourth, I've added anchors around your pattern, so that there can be no invalid syntax outside of the ${...}.
Finally, how does the recursion work? (?n) is similar to a backreference \n, in that it refers to capturing group n (counted by opening parentheses from left to right). The difference is that a backreference tries to match again what was matched by group n, whereas (?n) applies the pattern again. That is, (.)\1 matches the same character twice in a row, whereas (.)(?1) matches any character and then applies the pattern again, hence matching another arbitrary character. If you use one of those (?n) constructs within the nth group, you get recursion. (?0) or (?R) refers to the entire pattern. That is all the magic there is.
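The difference between \1 and (?1) is easy to check with a tiny sketch. The anchors and test strings here are just for illustration.

```php
<?php
$backref = '/^(.)\1$/';   // second character must equal the captured one
$subcall = '/^(.)(?1)$/'; // second character only has to match the pattern "."

var_dump(preg_match($backref, 'aa')); // int(1)
var_dump(preg_match($backref, 'ab')); // int(0)
var_dump(preg_match($subcall, 'aa')); // int(1)
var_dump(preg_match($subcall, 'ab')); // int(1)
```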
The above pattern applied to the input
'${abc.def.ghi.jkl(mno.pqr(stu.vwx))}'
will result in the captures
0 ${abc.def.ghi.jkl(mno.pqr(stu.vwx))}
1 abc.def.ghi.jkl(mno.pqr(stu.vwx))
2 abc
3 def.ghi.
4 jkl(mno.pqr(stu.vwx))
Note that there are a few differences to the outputs you actually expected:
0 is the entire match (and in this case just the input string again). PHP will always report this first, so you cannot get rid of it.
1 is the first capturing group which encloses the recursive part. You don't need this in the output, but (?n) unfortunately cannot refer to non-capturing groups, so you need this as well.
2 is the class name as desired.
3 is the list of intermediate method names, plus a trailing period. Using explode it's easy to extract all the method names from this.
4 is the final method name, with the optional (recursive) argument list. Now you could take this, and apply the pattern again if necessary. Note that for a completely recursive approach you might want to modify the pattern slightly. That is: strip off the ${ and } in a separate first step, so that the entire pattern has the exact same (recursive) pattern as the final capture, and you can use (?0) instead of (?1). Then match, remove method name, and parentheses, and repeat, until you get no more parentheses in the last capture.
For more information on recursion, have a look at PHP's PCRE documentation.
To illustrate my last point, here is a snippet that extracts all elements recursively:
if(!preg_match('/^[$][{](.*)[}]$/', $expression, $matches))
echo 'Invalid syntax.';
else
traverseExpression($matches[1]);
function traverseExpression($expression, $level = 0) {
$pattern = '/^(([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))$/i';
if(preg_match($pattern, $expression, $matches)) {
$indent = str_repeat(" ", 4*$level);
echo $indent, "Class name: ", $matches[2], "<br />";
foreach(explode(".", $matches[3], -1) as $method)
echo $indent, "Method name: ", $method, "<br />";
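// split at the first "(" only, so that nested argument lists stay intact
$parts = preg_split('/[(]/', $matches[4], 2);
echo $indent, "Method name: ", $parts[0], "<br />";
if(count($parts) > 1) {
echo $indent, "With arguments:<br />";
// drop the closing ")" of this call before descending one level
traverseExpression(substr($parts[1], 0, -1), $level+1);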
}
}
else
{
echo 'Invalid syntax.';
}
}
Note again, that I do not recommend using the pattern as a one-liner, but this answer is already long enough.
You can do validation and extraction with the same pattern. Example:
$subjects = array(
'${classA.methodA.methodB(classB.methodC(classB.methodD))}',
'${classA.methodA.methodB}',
'${classA.methodA.methodB()}',
'${methodB(methodC(classB.methodD))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE)))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE())))}'
);
$pattern = <<<'LOD'
~
# definitions
(?(DEFINE)(?<vn>[a-z]\w*+))
# pattern
^\$\{
(?<classA>\g<vn>)\.
(?<methodA>\g<vn>)\.
(?<methodB>
\g<vn> (
\( \g<vn> \. \g<vn> (?-1)?+ \)
)?+
)
}$
~x
LOD;
foreach($subjects as $subject) {
echo "\n\nsubject: $subject";
if (preg_match($pattern, $subject, $m))
printf("\nclassA: %s\nmethodA: %s\nmethodB: %s",
$m['classA'], $m['methodA'], $m['methodB']);
else
echo "\ninvalid string";
}
Regex explanation:
At the end of the pattern you can see the x modifier, which allows spaces, newlines and comments inside the pattern.
First, the pattern begins with the definition of a named group vn (variable name); here you can define what classA or methodB looks like for the whole pattern. You can then refer to this definition anywhere in the pattern with \g<vn>.
Note that you can define different kinds of names for classes and methods by adding other definitions. Example:
(?(DEFINE)(?<cn>....)) # for class name
(?(DEFINE)(?<mn>....)) # for method name
The pattern itself:
(?<classA>\g<vn>) captures into the named group classA, using the pattern defined in vn
the same goes for methodA
methodB is different because it can contain nested parentheses, which is why I use a recursive pattern for this part.
Detail:
\g<vn> # the method name (methodB)
( # open a capture group
\( # literal opening parenthesis
\g<vn> \. \g<vn> # for classB.methodC⑴
(?-1)?+ # refer to the last opened capture group (the current one),
# zero or one time (possessive), so that the recursion stops
# when there are no more levels of parentheses
\) # literal closing parenthesis
)?+ # close the capture group,
# zero or one time (possessive),
# to allow methods without parameters
⑴you can replace it by \g<vn>(?>\.\g<vn>)+ if you want to allow more than one method.
About possessive quantifiers:
You can add + after a quantifier ( * + ? ) to make it possessive. The advantage is that the regex engine knows it doesn't have to backtrack into the subpattern to test other ways to match. The regex is then more efficient.
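Possessive quantifiers also change what can match, not only how fast. A minimal sketch with made-up patterns:

```php
<?php
$greedy     = '/^\d+\d$/';  // \d+ can give back one digit for the final \d
$possessive = '/^\d++\d$/'; // \d++ keeps every digit, so the final \d never matches

var_dump(preg_match($greedy, '123'));     // int(1)
var_dump(preg_match($possessive, '123')); // int(0)
```

So a possessive quantifier is safe only where what follows can never need characters the quantified part has already taken, as with \w*+ followed by a literal dot or parenthesis in the pattern above.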
Description
This expression will match and capture only ${classA.methodA.methodB(classB.methodC(classB.methodD)))} or ${classA.methodA.methodB} formats.
(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)
Groups
Group 0 gets the entire match from the start dollar sign to the close squiggly bracket
Group 1 gets the Class
Group 2 gets the first method
Group 3 gets the second method followed by all the text up to but not including the close squiggly bracket. If this group has open round brackets which are empty () then this match will fail
PHP Code Example:
<?php
$sourcestring='${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
${classA2.methodA2.methodB2}
${classA3.methodA3.methodB3()}
${methodB4(methodC4(classB4.methodD)))}
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}';
preg_match_all('/(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => ${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
[1] =>
${classA2.methodA2.methodB2}
[2] =>
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}
)
[1] => Array
(
[0] => classA1
[1] => classA2
[2] => classA5
)
[2] => Array
(
[0] => methodA1
[1] => methodA2
[2] => methodA5
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD)))
[1] => methodB2
[2] => methodB5(classB.methodC(classB.methodD)))
)
)
Disclaimers
I added a number to the end of the class and method names to help illustrate what's happening in the groups
The sample text provided in the OP does not have balanced open and close round brackets.
Although () will be disallowed, (()) will be allowed

Combine multiple match regular expression into one and get the matching ones

I have a list of regular expressions:
suresnes|suresne|surenes|surene
pommier|pommiers
^musique$
^(faq|aide)$
^(file )?loss( )?less$
paris
faq <<< this match twice
My use case is that each pattern that matches displays a link to my user,
so I can have multiple patterns matching.
I test those patterns against a simple string of text such as "live in paris" / "faq" / "pom"...
The simple way to do it is to loop over all the patterns with preg_match, but I will do that a lot on a performance-critical page, so this looks bad to me.
Here is what I have tried: combining all those expressions into one with group names:
preg_match("#(?P<group1>^(faq|aide|todo|paris)$)|(?P<group2>(paris)$)#im", "paris", $groups);
As you can see, each pattern is grouped: (?P<GROUPNAME>PATTERN) and they are all separated by a pipe |.
The result is not what I expect, as only the first matching group is returned. It looks like the matching stops as soon as a match occurs.
What I want is the list of all the matching groups. preg_match_all does not help either.
Thanks!
How about:
preg_match("#(?=(?P<group1>^(faq|aide|todo|paris)$))(?=(?P<group2>(paris)$))#im", "paris", $groups);
print_r($groups);
output:
Array
(
[0] =>
[group1] => paris
[1] => paris
[2] => paris
[group2] => paris
[3] => paris
[4] => paris
)
The (?= ) is called lookahead
Explanation of the regex:
(?= # start lookahead
(?P<group1> # start named group group1
^ # start of string
( # start capture group #2
faq|aide|todo|paris # match any of faq, aide, todo or paris
) # end capture group #2
$ # end of string
) # end of named group group1
) # end of lookahead
(?= # start lookahead
(?P<group2> # start named group group2
( # start capture group #4
paris # paris
) # end capture group #4
$ # end of string
) # end of named group group2
) # end of lookahead
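Two properties of lookaheads explain the output above: a lookahead consumes no text (so [0] is empty), and consecutive lookaheads must all succeed at the same position. A short sketch; the first pattern and both subject strings other than "paris" are made up for illustration.

```php
<?php
// A lookahead matches without consuming: $m[0] is empty, the group is filled.
preg_match('/(?=(\d+))/', 'abc123', $m);
var_dump($m[0]); // string(0) ""
var_dump($m[1]); // string(3) "123"

// With two lookaheads in a row, BOTH must succeed, so "faq"
// (which only fits group1) gives no match at all.
$both = '#(?=(?P<group1>^(faq|aide|todo|paris)$))(?=(?P<group2>(paris)$))#im';
var_dump(preg_match($both, 'paris')); // int(1)
var_dump(preg_match($both, 'faq'));   // int(0)
```

So as written, this combined pattern reports a string only when every lookahead matches; it does not return "whichever groups matched".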
Try this approach:
#/ define input string
$str_1 = "{STRING HERE}";
#/ Define regex array
$reg_arr = array(
'suresnes|suresne|surenes|surene',
'pommier|pommiers',
'^musique$',
'^(faq|aide)$',
'^(file )?loss( )?less$',
'paris',
'faq'
);
#/ define a callback function to process Regex array
function cb_reg($reg_t)
{
global $str_1;
if(preg_match("/{$reg_t}/ims", $str_1, $matches)){
return $matches[0]; // replacing the regex pattern with the result of matching is the key trick here
// or return $matches[1]; if the pattern has a capture group and you only want that part
// or you could return an array of both. It's up to you how to do it.
}else{
return '';
}
}
#/ Apply array Regex via much faster function (instead of a loop)
$results = array_map('cb_reg', $reg_arr); //returns regex results
$results = array_diff($results, array('')); //remove empty values returned
Basically, this is the fastest way I could think of.
You can't combine, say, hundreds of regexes into one call, as it would be a very complex regex to build and there would be many ways for the matching to go wrong. This is one of the best ways to do it.
In my opinion, combining a large number of regexes into one regex (if that can be achieved at all) will be slower to execute with preg_match than this approach of a callback on arrays. Just remember, the key here is the callback function on array member values, which is the fastest way to handle arrays for your situation and similar ones in PHP.
Also note:
A callback on an array is not the same as looping over the array. Looping is slower and has an n from algorithm analysis, whereas the callback on array elements runs internally and is very fast in comparison.
You can combine all of your regexes with "|" in between them. Then apply this: http://www.rexegg.com/regex-optimizations.html, which will optimize it, collapse common expressions, etc.
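For reference, the plain per-pattern check that the question starts from can be sketched as follows. The function name matchingPatterns is hypothetical, and the sketch assumes that "#" never occurs inside a pattern, since it is used as the delimiter. It makes no claim about speed; it only shows which patterns hit.

```php
<?php
// Hypothetical helper: returns the keys of all patterns that match $text.
function matchingPatterns(array $patterns, string $text): array
{
    $hits = [];
    foreach ($patterns as $key => $pattern) {
        if (preg_match('#' . $pattern . '#i', $text) === 1) {
            $hits[] = $key;
        }
    }

    return $hits;
}

$patterns = [
    'suresnes|suresne|surenes|surene',
    'pommier|pommiers',
    '^musique$',
    '^(faq|aide)$',
    '^(file )?loss( )?less$',
    'paris',
    'faq',
];

print_r(matchingPatterns($patterns, 'faq'));           // keys 3 and 6
print_r(matchingPatterns($patterns, 'live in paris')); // key 5
print_r(matchingPatterns($patterns, 'pom'));           // empty
```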

Regex issue with named captured pairs

I have the following value:
start=2011-03-10T13:00:00Z;end=2011-03-30T13:00:00Z;scheme=W3C-DTF
I use the following regular expression to strip out the 'start' and 'end' dates and assign them to their own named capture pair:
#^start=(?P<publishDate>.+);end=(?P<expirationDate>.+);#ix
Probably not the absolute best REGEX, but it works well enough if both 'start' and 'end' values are present.
Now, what I need to do is still match 'publishDate' if 'expirationDate' is missing and vice versa.
How can I do this using a single expression? I'm not the greatest at regular expressions and I'm starting to wander off into the more advanced stuff, so any help with this would be greatly appreciated.
Thanks!
UPDATE:
Thanks to Mr. Chung, I have resolved this issue with the following expression:
#^(start=(?P<publishDate>.*?);)?(end=(?P<expirationDate>.*?);)?#xi
As always, thank you so much for all of your help, everyone. :)
Use (...)? for an optional section
^(start=(?P<publishDate>.+);)?(end=(?P<expirationDate>.+);)?
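Here is a minimal sketch of the optional-section idea in PHP, using the lazy .*? form from the question's update so that each date stops at the first semicolon. The function name parseValidity and the choice to return null for a missing part are my own assumptions.

```php
<?php
// Hypothetical helper: returns both dates, with null for a part that is absent.
function parseValidity(string $value): array
{
    $pattern = '#^(start=(?P<publishDate>.*?);)?(end=(?P<expirationDate>.*?);)?#i';
    preg_match($pattern, $value, $m);

    // An unmatched group is either missing from $m or an empty string.
    return [
        'publishDate'    => ($m['publishDate'] ?? '') !== '' ? $m['publishDate'] : null,
        'expirationDate' => ($m['expirationDate'] ?? '') !== '' ? $m['expirationDate'] : null,
    ];
}

print_r(parseValidity('start=2011-03-10T13:00:00Z;end=2011-03-30T13:00:00Z;scheme=W3C-DTF'));
print_r(parseValidity('end=2011-03-30T13:00:00Z;scheme=W3C-DTF'));
print_r(parseValidity('start=2011-03-10T13:00:00Z;scheme=W3C-DTF'));
```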
These both set the named buffer to a value (instead of null or undefined)
I recommend the first one.
1. To find either/both in any order:
/^(?=.*\bstart=(?P<publishDate>.*?);|(?P<publishDate>))(?=.*\bend=(?P<expirationDate>.*?);|(?P<expirationDate>))/ix
/^(?= # from beginning, look ahead for start
.*\b # any character 0 or more times (backtrack to match 'start')
start=(?P<publishDate>.*?); # put start date in publish
| (?P<publishDate>) # OR, put empty string publish
)
(?= # from beginning, look ahead for end
.*\b # same criteria as above ...
end=(?P<expirationDate>.*?);
| (?P<expirationDate>)
)
/ix
2. To find either/both in start/end order:
/^(?:.*\bstart=(?P<publishDate>.*?);|(?P<publishDate>))(?:.*\bend=(?P<expirationDate>.*?);|(?P<expirationDate>))/ix
Edit -
#Josh Davis - I had to go searching PCRE.org, some great stuff there.
With Perl there is no problem with duplicate names.
Docs: "If multiple groups have the same name then it refers to the leftmost defined group in the current match."
This is never a problem when used in an alternation.
With PCRE ..
Duplicate names will work properly with PHP if they are used with the branch reset.
Branch reset ensures duplicate names will occupy the same capture group.
After that, using the dup names constant, $match['name'] will either contain a value
or an empty string, but it will exist.
ie:
(?J) = PCRE_DUPNAMES
(?| ... | ...) = Branch reset
This works:
/(?Ji)^
(?= (?| .* end = (?P<expirationDate> .*? ); | (?P<expirationDate>)) )
(?= (?| .* start = (?P<publishDate> .*? ); | (?P<publishDate>)) )
/x
Try it here: http://www.ideone.com/zYd24
<?php
$string = "start=2011-03-(start)10T13:00:00Z;end=2011-03-(end)30T13:00:00Z;scheme=W3C-DTF";
preg_match('/(?Ji)^
(?= (?| .* end = (?P<expirationDate> .*? ); | (?P<expirationDate>)) )
(?= (?| .* start = (?P<publishDate> .*? ); | (?P<publishDate>)) )
/x', $string, $matches);
echo "Published = ",$matches['publishDate'],"\n";
echo "Expires = ",$matches['expirationDate'],"\n";
print_r($matches);
?>
Output
Published = 2011-03-(start)10T13:00:00Z
Expires = 2011-03-(end)30T13:00:00Z
Array
(
[0] =>
[expirationDate] => 2011-03-(end)30T13:00:00Z
[1] => 2011-03-(end)30T13:00:00Z
[publishDate] => 2011-03-(start)10T13:00:00Z
[2] => 2011-03-(start)10T13:00:00Z
)
If 'start=;' isn't present when the corresponding date is absent, Stephen Chung's code is OK.
Otherwise I think that replacing '+' with '*' is enough:
#^start=(?P<publishDate>.*?);end=(?P<expirationDate>.*?);#ix
By the way, the '?' is necessary to make the dot ungreedy in each of these patterns.
