Regex issue with named captured pairs - php

I have the following value:
start=2011-03-10T13:00:00Z;end=2011-03-30T13:00:00Z;scheme=W3C-DTF
I use the following regular expression to strip out the 'start' and 'end' dates and assign them to their own named capture pair:
#^start=(?P<publishDate>.+);end=(?P<expirationDate>.+);#ix
Probably not the absolute best REGEX, but it works well enough if both 'start' and 'end' values are present.
Now I need to still match 'publishDate' if 'expirationDate' is missing, and vice versa.
How can I do this using a single expression? I'm not the greatest at regular expressions and I'm starting to wander off into the more advanced stuff, so any help with this would be greatly appreciated.
Thanks!
UPDATE:
Thanks to Mr. Chung, I have resolved this issue with the following expression:
#^(start=(?P<publishDate>.*?);)?(end=(?P<expirationDate>.*?);)?#xi
As always, thank you so much for all of your help, everyone. :)

Use (...)? for an optional section
^(start=(?P<publishDate>.+?);)?(end=(?P<expirationDate>.+?);)?
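A minimal sketch of how the optional-section idea can be used from PHP. The helper name parseDateRange is hypothetical, and the pattern uses the lazy .*? quantifier from the question's update. A group that did not take part in the match is either missing from $matches or set to an empty string, so both cases are normalised to null here.

```php
<?php
// Hypothetical helper: extracts the optional start/end dates from the value.
function parseDateRange(string $value): array
{
    // Each "name=value;" section sits inside an optional non-capturing group.
    $pattern = '#^(?:start=(?P<publishDate>.*?);)?(?:end=(?P<expirationDate>.*?);)?#i';
    preg_match($pattern, $value, $m);

    $result = [];
    foreach (['publishDate', 'expirationDate'] as $name) {
        // An unmatched group is either absent or an empty string.
        $result[$name] = (isset($m[$name]) && $m[$name] !== '') ? $m[$name] : null;
    }
    return $result;
}

print_r(parseDateRange('start=2011-03-10T13:00:00Z;end=2011-03-30T13:00:00Z;scheme=W3C-DTF'));
print_r(parseDateRange('end=2011-03-30T13:00:00Z;scheme=W3C-DTF'));
print_r(parseDateRange('start=2011-03-10T13:00:00Z;scheme=W3C-DTF'));
```

Returning null for a missing date keeps the calling code from having to distinguish between "key not set" and "empty string".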

These both set the named buffer to a value (instead of null or undefined)
I recommend the first one.
1. To find either/both in any order:
/^(?=.*\bstart=(?P<publishDate>.*?);|(?P<publishDate>))(?=.*\bend=(?P<expirationDate>.*?);|(?P<expirationDate>))/ix
/^(?= # from beginning, look ahead for start
.*\b # any character 0 or more times (backtrack to match 'start')
start=(?P<publishDate>.*?); # put start date in publish
| (?P<publishDate>) # OR, put empty string publish
)
(?= # from beginning, look ahead for end
.*\b # same criteria as above ...
end=(?P<expirationDate>.*?);
| (?P<expirationDate>)
)
/ix
2. To find either/both in start/end order:
/^(?:.*\bstart=(?P<publishDate>.*?);|(?P<publishDate>))(?:.*\bend=(?P<expirationDate>.*?);|(?P<expirationDate>))/ix
Edit -
@Josh Davis - I had to go searching PCRE.org; there is some great stuff there.
With Perl there is no problem with duplicate names.
Docs: "If multiple groups have the same name then it refers to the leftmost defined group in the current match."
This is never a problem when used in an alternation.
With PCRE ..
Duplicate names will work properly with PHP if they are used with a branch reset.
Branch reset ensures duplicate names will occupy the same capture group.
After that, using the dup names constant, $match['name'] will either contain a value
or an empty string, but it will exist.
ie:
(?J) = PCRE_DUPNAMES
(?| ... | ...) = Branch reset
This works:
/(?Ji)^
(?= (?| .* end = (?P<expirationDate> .*? ); | (?P<expirationDate>)) )
(?= (?| .* start = (?P<publishDate> .*? ); | (?P<publishDate>)) )
/x
Try it here: http://www.ideone.com/zYd24
<?php
$string = "start=2011-03-(start)10T13:00:00Z;end=2011-03-(end)30T13:00:00Z;scheme=W3C-DTF";
preg_match('/(?Ji)^
(?= (?| .* end = (?P<expirationDate> .*? ); | (?P<expirationDate>)) )
(?= (?| .* start = (?P<publishDate> .*? ); | (?P<publishDate>)) )
/x', $string, $matches);
echo "Published = ",$matches['publishDate'],"\n";
echo "Expires = ",$matches['expirationDate'],"\n";
print_r($matches);
?>
Output
Published = 2011-03-(start)10T13:00:00Z
Expires = 2011-03-(end)30T13:00:00Z
Array
(
[0] =>
[expirationDate] => 2011-03-(end)30T13:00:00Z
[1] => 2011-03-(end)30T13:00:00Z
[publishDate] => 2011-03-(start)10T13:00:00Z
[2] => 2011-03-(start)10T13:00:00Z
)

If 'start=;' isn't present when the corresponding date is absent, then Stephen Chung's code is OK.
Otherwise I think that replacing '+' with '*' is enough:
#^start=(?P<publishDate>.*?);end=(?P<expirationDate>.*?);#ix
By the way, the '?' is necessary to make the dot ungreedy in every version of the pattern.

Related

Regexp for handling "test-12-1"-like strings (php)

I need some help writing a regexp to parse input strings like these:
test-12-1
blabla12412-5
t-dsf-gsdg-x-10
into the following matches:
test and 1
blabla12412 and 5
t-dsf-gsdg-x and 10
I tried to achieve this with something like
$matches = [];
preg_match('/^[a-zA-Z0-9]+(-\d+)+$/', 'test-12-1', $matches);
But I received an unexpected result:
array (
0 => 'test-12-1',
1 => '-1',
)
You can move forward with help on this playground: https://ru.functions-online.com/preg_match.html?command={"pattern":"/^[a-zA-Z0-9]+(-\d+)+$/","subject":"test-12-1"}
Thanks a lot!
You may use
'~^(.*?)(?:-(\d+))+$~'
See the regex demo
Details
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
(?:-(\d+))+ - 1 or more occurrences of
- - a hyphen
(\d+) - Group 2: one or more digits (the last occurrence is kept in the group value since it is located in a repeated non-capturing group)
$ - end of string.
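As a small sketch, the pattern above can be run against the three inputs from the question. The $results array is only an illustrative name; group 1 holds the prefix and group 2 holds the last "-digits" repetition.

```php
<?php
$pattern = '~^(.*?)(?:-(\d+))+$~';
$results = [];
foreach (['test-12-1', 'blabla12412-5', 't-dsf-gsdg-x-10'] as $input) {
    if (preg_match($pattern, $input, $m)) {
        // $m[1] is the lazily matched prefix, $m[2] the final numeric part.
        $results[$input] = [$m[1], $m[2]];
    }
}
print_r($results);
```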

Find/Replace array() with Regular Expression

I'm trying to search through my code, replacing all old-style PHP array()s with the shorthand [] style. However, I'm having some trouble creating a working/reliable regex...
What I currently have: (^|[\s])array\((['"](\s\S)['"]|[^)])*\) (View on Regex101)
// Match All
array('array()')
array('key' => 'value');
array(
'key' => 'value',
'key2' => '(value2)'
);
array()
array()
array()
// Match Specific Parts
function (array $var = array()) {}
$this->in_array(array('something', 'something'));
// Don't match
toArray()
array_merge()
in_array();
I've created a Regex101 for it...
EDIT: This isn't the answer to the question, but one alternative is to use PHPStorm's Traditional syntax array literal detected inspection...
How to:
Open the Code menu
Click Run inspection by name... (Ctrl + Alt + Shift + I)
Type Traditional syntax array literal detected
Press <Enter>
Specify the scope you wish to run it on
Press <Enter>
Review/Apply the changes in the Inspection window.
It is possible but not trivial, since you need to fully describe two parts of the PHP syntax (strings and comments) to prevent parentheses inside them from being interpreted. Here is a way to do it with PHP itself:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<quotes> (["']) (?: [^"'\\]+ | \\. | (?!\g{-1})["'] )*+ (?:\g{-1}|\z) )
(?<heredoc> <<< (["']?) ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*) \g{-2}\R
(?>\N*\R)*?
(?:\g{-1} ;? (?:\R | \z) | \N*\z)
)
(?<string> \g<quotes> | \g<heredoc> )
(?<inlinecom> (?:// |\# ) \N* $ )
(?<multicom> /\*+ (?:[^*]+|\*+(?!/))*+ (?:\*/|\z))
(?<com> \g<multicom> | \g<inlinecom> )
(?<nestedpar> \( (?: [^()"'<]+ | \g<com> | \g<string> | < | \g<nestedpar>)*+ \) )
)
(?:\g<com> | \g<string> ) (*SKIP)(*FAIL)
|
(?<![-$])\barray\s*\( ((?:[^"'()/\#]+|\g<com>|/|\g<string>|\g<nestedpar>)*+) \)
~xsm
EOD;
do {
$code = preg_replace($pattern, '[${11}]', $code, -1, $count);
} while ($count);
The pattern contains two parts, the first is a definition part and the second is the main pattern.
The definition part is enclosed between (?(DEFINE)...) and contains named subpattern definitions for different useful elements (in particular "string" "com" and "nestedpar"). These subpatterns would be used later in the main pattern.
The idea is to never search for a parenthesis inside a comment, a string, or among nested parentheses.
The first line: (?:\g<com> | \g<string> ) (*SKIP)(*FAIL) will skip all comments and strings until the next array declaration (or until the end of the string).
The last line describes the array declaration itself, details:
(?<![-$])\b # check if "array" is not a part of a variable or function name
array \s*\(
( # capture group 11
(?: # describe the possible content
[^"'()/\#]+ # all that is not a quote, a round bracket, a slash, a sharp
| # OR
\g<com> # a comment
|
/ # a slash that is not a part of a comment
|
\g<string> # a string
|
\g<nestedpar> # nested round brackets
)*+
)
\)
pattern demo
code demo
About nested array declarations:
The present pattern is only able to find the outermost array declaration when a block of nested array declarations is found.
The do...while loop is used to deal with nested array declarations, because it is not possible to replace several nesting levels in one pass (there is a way with preg_replace_callback, but it isn't very handy). To stop the loop, the last parameter of preg_replace is used. This parameter receives the number of replacements performed in the target string.
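To make the loop-until-no-replacements idea easier to follow, here is a deliberately simplified sketch. It is not the full pattern above: it assumes the code contains no parentheses inside strings or comments, and only keeps the balanced-parentheses recursion and the $count-driven loop.

```php
<?php
// Simplified pattern (assumption: no parentheses inside strings or comments).
// Group 1 captures the array body; (?1) recurses into balanced parentheses.
$pattern = '~(?<![-$])\barray\s*\(((?:[^()]++|\((?1)\))*+)\)~';

$code = '$a = array(1, array(2, 3), foo(4));';
do {
    // Each pass rewrites only the outermost array(...) calls it finds.
    $code = preg_replace($pattern, '[$1]', $code, -1, $count);
} while ($count);

echo $code, "\n"; // $a = [1, [2, 3], foo(4)];
```

The first pass converts the outer declaration, the second pass converts the inner one, and the third pass finds nothing and ends the loop.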

PHP Regex preg_match extraction

Although I have enough knowledge of regex in pseudocode, I'm having trouble translating what I want to do into PHP's Perl-compatible regex.
I'm trying to use preg_match to extract part of my expression.
I have the following string ${classA.methodA.methodB(classB.methodC(classB.methodD)))} and I need to do two things:
a. validate the syntax
${classA.methodA.methodB(classB.methodC(classB.methodD)))} valid
${classA.methodA.methodB} valid
${classA.methodA.methodB()} not valid
${methodB(methodC(classB.methodD)))} not valid
b. I need to extract this information
${classA.methodA.methodB(classB.methodC(classB.methodD)))} should return
1. classA
2. methodA
3. methodB(classB.methodC(classB.methodD)))
I've created this code
$expression = '${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}';
$pattern = '/\$\{(?:([a-zA-Z0-9]+)\.)(?:([a-zA-Z\d]+)\.)*([a-zA-Z\d.()]+)\}/';
if(preg_match($pattern, $expression, $matches))
{
echo 'found'.'<br/>';
for($i = 0; $i < count($matches); $i++)
echo $i." ".$matches[$i].'<br/>';
}
The result is :
found
0 ${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}
1 myvalue
2 fsdf
3 blo(fsdf.fsfds(fsfs.fs))
Obviously I'm having difficulty extracting the repeated methods, and it is not validating properly (honestly, I left that for last, once I solve the other problem), so empty parentheses are allowed and it does not check that every opened parenthesis is closed.
Thanks all
UPDATE
@m.buettner
Thanks for your help. I gave your code a quick try, and it has one very small issue, although I can bypass it. It is the same issue as in one of my earlier attempts that I didn't post here, and it shows up when I try this string:
$expression = '${myvalue.fdsfs}';
with your pattern definition it shows :
found
0 ${myvalue.fdsfs}
1 myvalue.fdsfs
2 myvalue
3
4 fdsfs
As you can see, the third capture comes back as an empty string, which is not present in the input. I couldn't understand why it does that, so can you suggest how to avoid it, or do I have to live with it because of PHP regex limits?
That said, I can only say thank you. Not only did you answer my problem, but you also included as much information as possible, with many suggestions on the proper path to follow when developing patterns.
One last thing: I (stupidly) forgot to add one important case, which is multiple parameters separated by commas, so
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB)}';
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB,classD.mehtodDA)}';
must be valid.
I edited to this
$expressionPattern =
'/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
(?: # non-capturing group for multiple arguments
[,] # literal ,
(?1) # recursively apply the pattern inside group 1 on parameters
)* # end of multiple arguments group; repeat 0 or more times
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
@Casimir et Hippolyte
Your suggestion is also good, but it makes things a little complex when using this code. I mean the code itself is easy to understand, but it gets less flexible. That said, it also gave me a lot of information that will surely be helpful in the future.
@Denomales
Thanks for your support, but your code fails when I try this:
$sourcestring='${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}';
the result is :
Array
(
[0] => Array
(
[0] => ${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}
)
[1] => Array
(
[0] => classA1
)
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1.methodB1(classB.methodC(classB.methodD))
)
)
It should be
[2] => Array
(
[0] => methodA0.methodA1
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
or
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1
)
[4] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
This is a tough one. Recursive patterns are often beyond what's possible with regular expressions, and even when it is possible, it can lead to expressions that are very hard to understand and maintain.
You are using PHP and therefore PCRE, which indeed supports the recursive regex constructs (?n). As your recursive pattern is quite regular it is possible to find a somewhat practical solution using regex.
One caveat I should mention right away: since you allow an arbitrary number of "intermediate" method calls per level (in your snippet fdsfs and fsdf), you cannot get all of these in separate captures. That is simply impossible with PCRE. Each match will always yield the same finite number of captures, determined by the number of opening parentheses your pattern contains. If a capturing group is used repeatedly (e.g. using something like ([a-z]+\.)+) then every time the group is used the previous capture will be overwritten and you only get the last instance. Therefore, I recommend that you capture all the "intermediate" method calls together, and then simply explode that result.
Likewise you couldn't (if you wanted to) get the captures of multiple nesting levels at once. Hence, your desired captures (where the last one includes all nesting levels) are the only option - you can then apply the pattern again to that last match to go a level further down.
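The overwriting behaviour described above can be seen in a short sketch. The input string and variable names are made up for illustration.

```php
<?php
$input = 'abc.def.ghi.';

// A repeated capturing group keeps only its last iteration.
preg_match('/^([a-z]+\.)+$/', $input, $repeated);
echo $repeated[1], "\n"; // ghi.

// Capturing the whole repetition at once keeps everything ...
preg_match('/^((?:[a-z]+\.)+)$/', $input, $all);
// ... and explode() with a limit of -1 drops the empty piece after the last dot.
$methods = explode('.', $all[1], -1);
print_r($methods);
```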
Now for the actual expression:
$pattern = '/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
A few general notes: for complicated expressions (and in regex flavors that support it), always use the free-spacing x modifier which allows you to introduce whitespace and comments to format the expression to your desires. Without them, the pattern looks like this:
'/^[$][{](([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))[}]$/ix'
Even if you've written the regex yourself and you are the only one who ever works on the project - try understanding this a month from now.
Second, I've slightly simplified the pattern by using the case-insensitive i modifier. It simply removes some clutter, because you can omit the upper-case variants of your letters.
Third, note that I use single-character classes like [$] and [.] to escape characters where this is possible. That is simply a matter of taste, and you are free to use the backslash variants. I just personally prefer the readability of the character classes (and I know others here disagree), so I wanted to present you this option as well.
Fourth, I've added anchors around your pattern, so that there can be no invalid syntax outside of the ${...}.
Finally, how does the recursion work? (?n) is similar to a backreference \n, in that it refers to capturing group n (counted by opening parentheses from left to right). The difference is that a backreference tries to match again what was matched by group n, whereas (?n) applies the pattern again. That is, (.)\1 matches the same character twice in a row, whereas (.)(?1) matches any character and then applies the pattern again, hence matching another arbitrary character. If you use one of those (?n) constructs within the nth group, you get recursion. (?0) or (?R) refers to the entire pattern. That is all the magic there is.
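The difference between \1 and (?1) is easy to check with a few throwaway calls. The last two lines are an extra illustration of (?1) used inside its own group, with a balanced-parentheses pattern that is not part of the answer above.

```php
<?php
// \1 must repeat the exact text captured by group 1.
var_dump(preg_match('/^(.)\1$/', 'aa'));   // int(1)
var_dump(preg_match('/^(.)\1$/', 'ab'));   // int(0)

// (?1) re-applies the pattern of group 1, so any second character is accepted.
var_dump(preg_match('/^(.)(?1)$/', 'ab')); // int(1)

// Used inside group 1 itself, (?1) becomes recursion: balanced parentheses.
var_dump(preg_match('/^(\((?1)*\))$/', '(()())')); // int(1)
var_dump(preg_match('/^(\((?1)*\))$/', '(()'));    // int(0)
```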
The above pattern applied to the input
'${abc.def.ghi.jkl(mno.pqr(stu.vwx))}'
will result in the captures
0 ${abc.def.ghi.jkl(mno.pqr(stu.vwx))}
1 abc.def.ghi.jkl(mno.pqr(stu.vwx))
2 abc
3 def.ghi.
4 jkl(mno.pqr(stu.vwx))
Note that there are a few differences to the outputs you actually expected:
0 is the entire match (and in this case just the input string again). PHP will always report this first, so you cannot get rid of it.
1 is the first capturing group which encloses the recursive part. You don't need this in the output, but (?n) unfortunately cannot refer to non-capturing groups, so you need this as well.
2 is the class name as desired.
3 is the list of intermediate method names, plus a trailing period. Using explode it's easy to extract all the method names from this.
4 is the final method name, with the optional (recursive) argument list. Now you could take this, and apply the pattern again if necessary. Note that for a completely recursive approach you might want to modify the pattern slightly. That is: strip off the ${ and } in a separate first step, so that the entire pattern has the exact same (recursive) pattern as the final capture, and you can use (?0) instead of (?1). Then match, remove method name, and parentheses, and repeat, until you get no more parentheses in the last capture.
For more information on recursion, have a look at PHP's PCRE documentation.
To illustrate my last point, here is a snippet that extracts all elements recursively:
if(!preg_match('/^[$][{](.*)[}]$/', $expression, $matches))
echo 'Invalid syntax.';
else
traverseExpression($matches[1]);
function traverseExpression($expression, $level = 0) {
$pattern = '/^(([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))$/i';
if(preg_match($pattern, $expression, $matches)) {
$indent = str_repeat(" ", 4*$level);
echo $indent, "Class name: ", $matches[2], "<br />";
foreach(explode(".", $matches[3], -1) as $method)
echo $indent, "Method name: ", $method, "<br />";
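$open = strpos($matches[4], '(');
// Split only at the first "(" so that nested argument lists stay intact.
$name = ($open === false) ? $matches[4] : substr($matches[4], 0, $open);
echo $indent, "Method name: ", $name, "<br />";
if($open !== false) {
echo $indent, "With arguments:<br />";
// Drop the outer parentheses and recurse into the argument expression.
traverseExpression(substr($matches[4], $open + 1, -1), $level+1);
}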
}
else
{
echo 'Invalid syntax.';
}
}
Note again, that I do not recommend using the pattern as a one-liner, but this answer is already long enough.
You can do validation and extraction with the same pattern. Example:
$subjects = array(
'${classA.methodA.methodB(classB.methodC(classB.methodD))}',
'${classA.methodA.methodB}',
'${classA.methodA.methodB()}',
'${methodB(methodC(classB.methodD))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE)))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE())))}'
);
$pattern = <<<'LOD'
~
# definitions
(?(DEFINE)(?<vn>[a-z]\w*+))
# pattern
^\$\{
(?<classA>\g<vn>)\.
(?<methodA>\g<vn>)\.
(?<methodB>
\g<vn> (
\( \g<vn> \. \g<vn> (?-1)?+ \)
)?+
)
}$
~x
LOD;
foreach($subjects as $subject) {
echo "\n\nsubject: $subject";
if (preg_match($pattern, $subject, $m))
printf("\nclassA: %s\nmethodA: %s\nmethodB: %s",
$m['classA'], $m['methodA'], $m['methodB']);
else
echo "\ninvalid string";
}
Regex explanation:
At the end of the pattern you can see the x modifier, which allows spaces, newlines and comments inside the pattern.
The pattern begins with the definition of a named group vn (variable name); this is where you define what a class or method name looks like for the whole pattern. You can then refer to this definition anywhere in the pattern with \g<vn>.
Note that you can define different kinds of names for classes and methods by adding other definitions. Example:
(?(DEFINE)(?<cn>....)) # for class name
(?(DEFINE)(?<mn>....)) # for method name
The pattern itself:
(?<classA>\g<vn>) captures into the named group classA using the pattern defined in vn.
The same goes for methodA.
methodB is different because it can contain nested parentheses, which is why I use a recursive pattern for this part.
Detail:
\g<vn> # the method name (methodB)
( # open a capture group
\( # literal opening parenthesis
\g<vn> \. \g<vn> # for classB.methodC⑴
(?-1)?+ # refer to the last opened capture group (the current one)
# one or zero time (possessive) to allow the recursion stop
# when there is no more level of parenthesis
\) # literal closing parenthesis
)?+ # close the capture group
# one or zero time (possessive)
# to allow method without parameters
⑴ You can replace it with \g<vn>(?>\.\g<vn>)+ if you want to allow more than one method.
About possessive quantifiers:
You can add + after a quantifier ( * + ? ) to make it possessive. The advantage is that the regex engine knows it doesn't have to backtrack into the subpattern to try other ways of matching, which makes the regex more efficient.
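A tiny sketch of what "no backtracking" means in practice. The two patterns are made up for illustration and differ only in the extra +.

```php
<?php
// Greedy: \d+ first takes "123", then gives one digit back so the final \d can match.
var_dump(preg_match('/^\d+\d$/', '123'));  // int(1)

// Possessive: \d++ keeps all three digits, so the final \d has nothing left to match.
var_dump(preg_match('/^\d++\d$/', '123')); // int(0)
```

This also shows the trade-off: a possessive quantifier is only safe where giving characters back could never be needed for a valid match.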
Description
This expression will match and capture only ${classA.methodA.methodB(classB.methodC(classB.methodD)))} or ${classA.methodA.methodB} formats.
(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)
Groups
Group 0 gets the entire match, from the opening dollar sign to the closing curly bracket.
Group 1 gets the class.
Group 2 gets the first method.
Group 3 gets the second method followed by all the text up to, but not including, the closing curly bracket. If this group contains empty round brackets (), the match will fail.
PHP Code Example:
<?php
$sourcestring='${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
${classA2.methodA2.methodB2}
${classA3.methodA3.methodB3()}
${methodB4(methodC4(classB4.methodD)))}
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}';
preg_match_all('/(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => ${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
[1] =>
${classA2.methodA2.methodB2}
[2] =>
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}
)
[1] => Array
(
[0] => classA1
[1] => classA2
[2] => classA5
)
[2] => Array
(
[0] => methodA1
[1] => methodA2
[2] => methodA5
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD)))
[1] => methodB2
[2] => methodB5(classB.methodC(classB.methodD)))
)
)
Disclaimers
I added a number to the end of the class and method names to help illustrate what's happening in the groups
The sample text provided in the OP does not have balanced open and close round brackets.
Although () will be disallowed (()) will be allowed

Combine multiple match regular expression into one and get the matching ones

I have a list of regular expressions:
suresnes|suresne|surenes|surene
pommier|pommiers
^musique$
^(faq|aide)$
^(file )?loss( )?less$
paris
faq <<< this match twice
My use case is that each pattern that matches displays a link to my user,
so several patterns can match at once.
I test those patterns against a simple string of text: "live in paris" / "faq" / "pom"...
The simple way to do it is to loop over all the patterns with preg_match, but I will be doing that a lot on a performance-critical page, so this looks bad to me.
Here is what I have tried: combining all those expressions into one with group names:
preg_match("#(?P<group1>^(faq|aide|todo|paris)$)|(?P<group2>(paris)$)#im", "paris", $groups);
As you can see, each pattern is grouped: (?P<GROUPNAME>PATTERN) and they are all separated by a pipe |.
The result is not what I expect, as only the first matching group is returned. It looks like parsing stops as soon as a match occurs.
What I want is the list of all the matching groups. preg_match_all does not help either.
Thanks!
How about:
preg_match("#(?=(?P<group1>^(faq|aide|todo|paris)$))(?=(?P<group2>(paris)$))#im", "paris", $groups);
print_r($groups);
output:
Array
(
[0] =>
[group1] => paris
[1] => paris
[2] => paris
[group2] => paris
[3] => paris
[4] => paris
)
The (?= ) is called lookahead
Explanation of the regex:
(?= # start lookahead
(?P<group1> # start named group group1
^ # start of string
( # start capture group #1
faq|aide|todo|paris # match any of faq, aide, todo or paris
) # end capture group #1
$ # end of string
) # end of named group group1
) # end of lookahead
(?= # start lookahead
(?P<group2> # start named group group2
( # start capture group #2
paris # paris
) # end capture group #2
$ # end of string
) # end of named group group2
) # end of lookahead
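Note that consecutive lookaheads all have to succeed, so the expression above only matches when every pattern matches. A possible variation, sketched below under stated assumptions, gives each lookahead an empty alternative so that a non-matching pattern simply leaves its group unset. The function name matchingPatterns and the group names p0, p1, ... are hypothetical; the sketch assumes that no pattern contains the # delimiter or a backreference, and that no pattern can match an empty string. Whether this is actually faster than a plain loop has not been measured here.

```php
<?php
// Hypothetical helper: returns the indexes of all patterns that match $subject.
function matchingPatterns(array $patterns, string $subject): array
{
    $parts = [];
    foreach (array_values($patterns) as $i => $p) {
        // ".*?" lets an unanchored pattern match anywhere; the empty
        // alternative after "|" keeps the lookahead from failing.
        $parts[] = "(?=.*?(?P<p{$i}>{$p})|)";
    }
    preg_match('#^' . implode('', $parts) . '#is', $subject, $m);

    $hits = [];
    foreach (array_keys($parts) as $i) {
        // An unmatched group is either absent or an empty string.
        if (isset($m["p{$i}"]) && $m["p{$i}"] !== '') {
            $hits[] = $i;
        }
    }
    return $hits;
}

$patterns = [
    'suresnes|suresne|surenes|surene',
    'pommier|pommiers',
    '^musique$',
    '^(faq|aide)$',
    '^(file )?loss( )?less$',
    'paris',
    'faq',
];
print_r(matchingPatterns($patterns, 'faq'));
print_r(matchingPatterns($patterns, 'live in paris'));
```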
Try this approach:
#/ define input string
$str_1 = "{STRING HERE}";
#/ Define regex array
$reg_arr = array(
'suresnes|suresne|surenes|surene',
'pommier|pommiers',
'^musique$',
'^(faq|aide)$',
'^(file )?loss( )?less$',
'paris',
'faq'
);
#/ define a callback function to process Regex array
function cb_reg($reg_t)
{
global $str_1;
if(preg_match("/{$reg_t}/ims", $str_1, $matches)){
return $matches[1]; //replace regex pattern with the result of matching is the key trick here
//or return $matches[0]; if you dont want to get captured parenthesized subpatterns
//or you could return an array of both. its up to you how to do it.
}else{
return '';
}
}
#/ Apply array Regex via much faster function (instead of a loop)
$results = array_map('cb_reg', $reg_arr); //returns regex results
$results = array_diff($results, array('')); //remove empty values returned
Basically, this is the fastest way I could think of.
You can't practically combine hundreds of regexes into one call: the combined regex would be very complex to build and would have many opportunities to fail to match. This is one of the better ways to do it.
In my opinion, combining a large number of regexes into one (if it can be achieved at all) will be slower to execute with preg_match than this approach of a callback on arrays. Just remember, the key here is the callback function applied to the array values.
Also note,
array_map still calls the callback once per element, so it is O(n) just like an explicit loop. Any speed difference comes from the iteration happening inside the engine, so measure it on your own data rather than assuming it.
You can combine all of your regexes with "|" in between them. Then apply this: http://www.rexegg.com/regex-optimizations.html, which will optimize it, collapse common expressions, etc.

PCRE conditional subpatterns: (R) as the condition

This is from the PHP manual regarding PCRE conditional subpatterns:
The two possible forms of conditional subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
That's OK as long as the condition is a digit or an assertion. But I don't quite understand the following
If the condition is the string (R), it is satisfied if a recursive
call to the pattern or subpattern has been made. At "top level", the
condition is false. (...) If the condition is not a sequence of digits
or (R), it must be an assertion.
I would be grateful if someone could explain on an example what (R) is in conditional subpattern and how to use it. Thanks in advance.
As an additional and clearer answer…
2 days ago I was writing a pattern to match an IPv4 address and I found myself using the recursion in condition so I thought I should share (because it makes more sense than imaginative examples).
~
(?:(?:f|ht)tps?://)? # possibly a protocol
(
(?(R)\.) # if it\'s a recursion, require a dot
(?: # this part basically looks for 0-255
2(?:[0-4]\d|5[0-5])
| 1\d\d
| \d\d?
)
)(?1){3} # go into recursion 3 times
# for clarity I\'m not including the remaining part
~xi
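A cut-down, runnable sketch of the same idea, without the protocol part and with anchors added so that it validates a whole string. The variable names are illustrative only. Inside the three (?1) calls the (R) condition is true, so the dot is required; for the first octet at top level it is false, so no dot is expected.

```php
<?php
// First octet: no leading dot. Each (?1) call re-runs group 1 with (R) true.
$ipv4 = '~^((?(R)\.)(?:2(?:[0-4]\d|5[0-5])|1\d\d|\d\d?))(?1){3}$~';

foreach (['192.168.0.1', '10.0.0', '256.1.1.1'] as $candidate) {
    echo $candidate, ' => ', preg_match($ipv4, $candidate), "\n";
}
```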
From what I understand (from the recursion as the condition in a subpattern) here's a very basic example.
$str = 'ds1aadfg346fgf gd4th9u6eth0';
preg_match_all('~(?(R).(?(?=[^\d])(?R))|\d(?R)?)~'
/*
(? # [begin outer cond.subpat.]
(R) # if this is a recursion ------> IF
. # match the first char
(? # [begin inner cond.subpat.]
(?=[^\d]) # if the next char is not a digit
(?R) # reenter recursion
) # [end inner cond.subpat.]
| # otherwise -----> ELSE
\d(?R)? # match a digit and enter recursion (note the ?)
) # [end outer cond.subpat.]
*/
,$str,$m);
print_r($m[0]);
And the output:
Array
(
[0] => 1aadfg
[1] => 34
[2] => 6fgf gd
[3] => 4th
[4] => 9u
[5] => 6eth
[6] => 0
)
I know this is a silly example but I hope it makes sense.
The (R) stands for recursion. Here is a good example of using it.
Recursive patterns
I'm not sure I have ever seen (R) used as the condition, or even a situation where that would be usable, at least not to my understanding. But you learn new stuff every day in programming.
It can be used very easily as a true-or-false test, as in this pattern:
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
Here (?R) appears as the second alternative of the outer group, outside the conditional.
This matches text in angle brackets, allowing for arbitrary nesting. Only digits are allowed in nested brackets (that is, when recursing), whereas any characters are permitted at the outer level.
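To see the condition at work, the angle-bracket pattern can be tried on two made-up subjects. This is only a sketch; the x modifier is needed because the pattern is written with spaces.

```php
<?php
$pattern = '~< (?: (?(R) \d++ | [^<>]*+ ) | (?R) )* >~x';

// Digits inside the nested pair: the whole string matches.
preg_match($pattern, '<abc<123>def>', $m);
echo $m[0], "\n"; // <abc<123>def>

// Letters inside the nested pair: the outer match fails, and only the inner
// pair matches, because it is then matched at top level, not via recursion.
preg_match($pattern, '<abc<def>ghi>', $n);
echo $n[0], "\n"; // <def>
```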
I know this is not the answer you are looking for.... You have now sent me on a quest to research this.
