Find/Replace array() with Regular Expression - php

I'm trying to search though my code replacing all old style PHP array()s with the shorthand [] style. However, I'm having some trouble creating a working/reliable regex...
What I currently have: (^|[\s])array\((['"](\s\S)['"]|[^)])*\) (View on Regex101)
// Match All
array('array()')
array('key' => 'value');
array(
'key' => 'value',
'key2' => '(value2)'
);
array()
array()
array()
// Match Specific Parts
function (array $var = array()) {}
$this->in_array(array('something', 'something'));
// Don't match
toArray()
array_merge()
in_array();
I've created a Regex101 for it...
EDIT: This isn't the answer to the question, but one alternative is to use PHPStorm's Traditional syntax array literal detected inspection...
How to:
Open the Code menu
Click Run inspection by name... (Ctrl + Alt + Shift + I)
Type Traditional syntax array literal detected
Press <Enter>
Specify the scope you wish to run it on
Press <Enter>
Review/Apply the changes in the Inspection window.

It is possible but not trivial since you need to fully describe two parts of the PHP syntax (that are strings and comments) to prevent parenthesis to be interpreted inside them. Here is a way to do it with PHP itself:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<quotes> (["']) (?: [^"'\\]+ | \\. | (?!\g{-1})["'] )*+ (?:\g{-1}|\z) )
(?<heredoc> <<< (["']?) ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*) \g{-2}\R
(?>\N*\R)*?
(?:\g{-1} ;? (?:\R | \z) | \N*\z)
)
(?<string> \g<quotes> | \g<heredoc> )
(?<inlinecom> (?:// |\# ) \N* $ )
(?<multicom> /\*+ (?:[^*]+|\*+(?!/))*+ (?:\*/|\z))
(?<com> \g<multicom> | \g<inlinecom> )
(?<nestedpar> \( (?: [^()"'<]+ | \g<com> | \g<string> | < | \g<nestedpar>)*+ \) )
)
(?:\g<com> | \g<string> ) (*SKIP)(*FAIL)
|
(?<![-$])\barray\s*\( ((?:[^"'()/\#]+|\g<com>|/|\g<string>|\g<nestedpar>)*+) \)
~xsm
EOD;
do {
$code = preg_replace($pattern, '[${11}]', $code, -1, $count);
} while ($count);
The pattern contains two parts, the first is a definition part and the second is the main pattern.
The definition part is enclosed between (?(DEFINE)...) and contains named subpattern definitions for different useful elements (in particular "string" "com" and "nestedpar"). These subpatterns would be used later in the main pattern.
The idea is to never search a parenthese inside a comment, a string or among nested parentheses.
The first line: (?:\g<com> | \g<string> ) (*SKIP)(*FAIL) will skip all comments and strings until the next array declaration (or until the end of the string).
The last line describes the array declaration itself, details:
(?<![-$])\b # check if "array" is not a part of a variable or function name
array \s*\(
( # capture group 11
(?: # describe the possible content
[^"'()/\#]+ # all that is not a quote, a round bracket, a slash, a sharp
| # OR
\g<com> # a comment
|
/ # a slash that is not a part of a comment
|
\g<string> # a string
|
\g<nestedpar> # nested round brackets
)*+
)
\)
pattern demo
code demo
about nested array declarations:
The present pattern is only able to find the outermost array declaration when a block of nested array declarations is found.
The do...while loop is used to deal with nested array declarations, because it is not possible to perform a replacement of several nesting level in one pass (however, there is a way with preg_replace_callback but it isn't very handy). To stop the loop, the last parameter of preg_replace is used. This parameter contains the number of replacements performed in the target string.

Related

How do I locate and replace text with a common element using regex?

I'm pretty lousy at regex, and need help with the following scenario. I need to locate and replace text that has a common structure, but one aspect will be different:
here is a string (with 3 values)
here is another string (with 5 values)
In the above examples, I need to locate and then replace the value in parenthesis. I can't search by parens alone, as the string may contain other parens. But the value in the parens that needs to be replaced is consistently constructed: (with # values) -- the only difference will be the number.
So ideally the regex returns (with 3 values) and (with 5 values) so I can use a simple str_replace to change the text.
This is regex in a PHP script.
Try with this regex :
\(with\s+\d+\s+values\)
Demo here
The following regex should work for you:
/\(with (\d+) values\)/g
This matches strings of the specified format and gives the value in a capture group so it may be used in the replace. The g flag at the end is only needed if you have multiple of these in one string.
Demo here
If, however, there can only be one digit, then the following will work:
/\(with (\d) values\)/g
Or, if the number can only be a digit greater than 1, for example, then the following:
/\(with ([2-9]) values\)/g
If I got you right, you are looking for exactly three or five items within parentheses (comma separated).
This could be accomplished by
\( # "(" literally
(?:[^,()]+,){2} # not , or ( or ) exactly two times
(?:(?:[^,()]+,){2})? # repeated
[^,()]+ # without the comma in the end
\) # the closing parenthesis
See a demo on regex101.com.
If you're really looking only for two variant of strings, you could very easily do
\(with (?:3|5) values\)
In general
\(with \d+ values\)
as proposed by #SchoolBoy.
Something like this maybe
$str ="here is another string (with 5 values)";
preg_match_all("/\(with (\d+) values\)/", $str, $out );
print_r( $out );
Output:
Array
(
[0] => Array
(
[0] => (with 5 values)
)
[1] => Array
(
[0] => 5
)
)
Here at ideone...
It uses the regex
\(with (\d+) values\)
that matches the literal opening parentheses followed by the string with # values, capturing the actual number #, and finally the closing parentheses.
It returns the complete match (the parenthesized string) in the first dimension and the actual number in the second.

PHP Regex preg_match extraction

Although I have enough knowledge of regex in pseudocode, I'm having trouble to translate what I want to do in php regex perl.
I'm trying to use preg_match to extract part of my expression.
I have the following string ${classA.methodA.methodB(classB.methodC(classB.methodD)))} and i need to do 2 things:
a. validate the syntax
${classA.methodA.methodB(classB.methodC(classB.methodD)))} valid
${classA.methodA.methodB} valid
${classA.methodA.methodB()} not valid
${methodB(methodC(classB.methodD)))} not valid
b. I need to extract those information
${classA.methodA.methodB(classB.methodC(classB.methodD)))} should return
1. classA
2. methodA
3. methodB(classB.methodC(classB.methodD)))
I've created this code
$expression = '${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}';
$pattern = '/\$\{(?:([a-zA-Z0-9]+)\.)(?:([a-zA-Z\d]+)\.)*([a-zA-Z\d.()]+)\}/';
if(preg_match($pattern, $expression, $matches))
{
echo 'found'.'<br/>';
for($i = 0; $i < count($matches); $i++)
echo $i." ".$matches[$i].'<br/>';
}
The result is :
found
0 ${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}
1 myvalue
2 fsdf
3 blo(fsdf.fsfds(fsfs.fs))
Obviously I'm having difficult to extract repetitive methods and it is not validating it properly (honestly I left it for last once i solve the other problem) so empty parenthesis are allowed and it is not checking whether or not that once a parenthesis is opened it must be closed.
Thanks all
UPDATE
X m.buettner
Thanks for your help. I did a fast try to your code but it gives a very small issue, although i can by pass it. The issue is the same of one of my prior codes that i didn't post here which is when i try this string :
$expression = '${myvalue.fdsfs}';
with your pattern definition it shows :
found
0 ${myvalue.fdsfs}
1 myvalue.fdsfs
2 myvalue
3
4 fdsfs
As you can see the third line is catched as a white space which is not present. I couldn't understand why it was doing that so can you suggest me how to or i do have to live with it due to php regex limits?
That said i just can tell you thank you. Not only you answered to my problem but also you tried to input as much as information as possible with many suggestion on proper path to follow when developing patterns.
One last thing i (stupid) forgot to add one little important case which is multiple parameters divided by a comma so
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB)}';
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB,classD.mehtodDA)}';
must be valid.
I edited to this
$expressionPattern =
'/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
(?: # non-capturing group for multiple arguments
[,] # literal ,
(?1) # recursively apply the pattern inside group 1 on parameters
)* # end of multiple arguments group; repeat 0 or more times
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
X Casimir et Hippolyte
Your suggestion also is good but it implies a little complex situation when using this code. I mean the code itself is easy to understand but it get less flexible. That said it also gave me a lot of information that surely can be helpful in the future.
X Denomales
Thanks for your support but your code falls when i try this :
$sourcestring='${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}';
the result is :
Array
(
[0] => Array
(
[0] => ${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}
)
[1] => Array
(
[0] => classA1
)
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1.methodB1(classB.methodC(classB.methodD))
)
)
It should be
[2] => Array
(
[0] => methodA0.methodA1
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
or
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1
)
[4] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
This is a tough one. Recursive patterns are often beyond what's possible with regular expressions and even if it is possible, it can lead to very hard to expressions that are very hard to understand and maintain.
You are using PHP and therefore PCRE, which indeed supports the recursive regex constructs (?n). As your recursive pattern is quite regular it is possible to find a somewhat practical solution using regex.
One caveat I should mention right away: since you allow and arbitrary number of "intermediate" method calls per level (in your snippet fdsfs and fsdf), you can not get all of these in separate captures. That is simply impossible with PCRE. Each match will always yield the same finite number of captures, determined by the amount of opening parentheses your pattern contains. If a capturing group is used repeatedly (e.g. using something like ([a-z]+\.)+) then every time the group is used the previous capture will be overwritten and you only get the last instance. Therefore, I recommend that you capture all the "intermediate" method calls together, and then simply explode that result.
Likewise you couldn't (if you wanted to) get the captures of multiple nesting levels at once. Hence, your desired captures (where the last one includes all nesting levels) are the only option - you can then apply the pattern again to that last match to go a level further down.
Now for the actual expression:
$pattern = '/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
A few general notes: for complicated expressions (and in regex flavors that support it), always use the free-spacing x modifier which allows you to introduce whitespace and comments to format the expression to your desires. Without them, the pattern looks like this:
'/^[$][{](([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))[}]$/ix'
Even if you've written the regex yourself and you are the only one who ever works on the project - try understanding this a month from now.
Second, I've slightly simplified the pattern by using the case-insenstive i modifier. It simply removes some clutter, because you can omit the upper-case variants of your letters.
Third, note that I use single-character classes like [$] and [.] to escape characters where this is possible. That is simply a matter of taste, and you are free to use the backslash variants. I just personally prefer the readability of the character classes (and I know others here disagree), so I wanted to present you this option as well.
Fourth, I've added anchors around your pattern, so that there can be no invalid syntax outside of the ${...}.
Finally, how does the recursion work? (?n) is similar to a backreference \n, in that it refers to capturing group n (counted by opening parentheses from left to right). The difference is that a backreference tries to match again what was matched by group n, whereas (?n) applies the pattern again. That is (.)\1 matches any characters twice in a row, whereas (.)(?1) matches any character and then applies the pattern again, hence matching another arbitrary character. If you use one of those (?n) constructs within the nth group, you get recursion. (?0) or (?R) refers to the entire pattern. That is all the magic there is.
The above pattern applied to the input
'${abc.def.ghi.jkl(mno.pqr(stu.vwx))}'
will result in the captures
0 ${abc.def.ghi.jkl(mno.pqr(stu.vwx))}
1 abc.def.ghi.jkl(mno.pqr(stu.vwx))
2 abc
3 def.ghi.
4 jkl(mno.pqr(stu.vwx))
Note that there are a few differences to the outputs you actually expected:
0 is the entire match (and in this case just the input string again). PHP will always report this first, so you cannot get rid of it.
1 is the first capturing group which encloses the recursive part. You don't need this in the output, but (?n) unfortunately cannot refer to non-capturing groups, so you need this as well.
2 is the class name as desired.
3 is the list of intermediate method names, plus a trailing period. Using explode it's easy to extract all the method names from this.
4 is the final method name, with the optional (recursive) argument list. Now you could take this, and apply the pattern again if necessary. Note that for a completely recursive approach you might want to modify the pattern slightly. That is: strip off the ${ and } in a separate first step, so that the entire pattern has the exact same (recursive) pattern as the final capture, and you can use (?0) instead of (?1). Then match, remove method name, and parentheses, and repeat, until you get no more parentheses in the last capture.
For more information on recursion, have a look at PHP's PCRE documentation.
To illustrate my last point, here is a snippet that extracts all elements recursively:
if(!preg_match('/^[$][{](.*)[}]$/', $expression, $matches))
echo 'Invalid syntax.';
else
traverseExpression($matches[1]);
function traverseExpression($expression, $level = 0) {
$pattern = '/^(([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))$/i';
if(preg_match($pattern, $expression, $matches)) {
$indent = str_repeat(" ", 4*$level);
echo $indent, "Class name: ", $matches[2], "<br />";
foreach(explode(".", $matches[3], -1) as $method)
echo $indent, "Method name: ", $method, "<br />";
$parts = preg_split('/[()]/', $matches[4]);
echo $indent, "Method name: ", $parts[0], "<br />";
if(count($parts) > 1) {
echo $indent, "With arguments:<br />";
traverseExpression($parts[1], $level+1);
}
}
else
{
echo 'Invalid syntax.';
}
}
Note again, that I do not recommend using the pattern as a one-liner, but this answer is already long enough.
you can do validation and extraction with the same pattern, example:
$subjects = array(
'${classA.methodA.methodB(classB.methodC(classB.methodD))}',
'${classA.methodA.methodB}',
'${classA.methodA.methodB()}',
'${methodB(methodC(classB.methodD))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE)))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE())))}'
);
$pattern = <<<'LOD'
~
# definitions
(?(DEFINE)(?<vn>[a-z]\w*+))
# pattern
^\$\{
(?<classA>\g<vn>)\.
(?<methodA>\g<vn>)\.
(?<methodB>
\g<vn> (
\( \g<vn> \. \g<vn> (?-1)?+ \)
)?+
)
}$
~x
LOD;
foreach($subjects as $subject) {
echo "\n\nsubject: $subject";
if (preg_match($pattern, $subject, $m))
printf("\nclassA: %s\nmethodA: %s\nmethodB: %s",
$m['classA'], $m['methodA'], $m['methodB']);
else
echo "\ninvalid string";
}
Regex explanation:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
At the end of the pattern you can see the modifier x that allow spaces, newlines and commentary inside the pattern.
First the pattern begin with the definition of a named group vn (variable name), here you can define how classA or methodB looks like for all the pattern. Then you can refer to this definition in all the pattern with \g<vn>
Note that you can define if you want different type of name for classes and method adding other definitions. Example:
(?(DEFINE)(?<cn>....)) # for class name
(?(DEFINE)(?<mn>....)) # for method name
The pattern itself:
(?<classA>\g<vn>) capture in the named group classA with the pattern defined in vn
same thing for methodA
methodB is different cause it can contain nested parenthesis, it's the reason why i use a recursive pattern for this part.
Detail:
\g<vn> # the method name (methodB)
( # open a capture group
\( # literal opening parenthesis
\g<vn> \. \g<vn> # for classB.methodC⑴
(?-1)?+ # refer the last capture group (the actual capture group)
# one or zero time (possessive) to allow the recursion stop
# when there is no more level of parenthesis
\) # literal closing parenthesis
)?+ # close the capture group
# one or zero time (possessive)
# to allow method without parameters
⑴you can replace it by \g<vn>(?>\.\g<vn>)+ if you want to allow more than one method.
About possessive quantifiers:
You can add + after a quantifier ( * + ? ) to make it possessive, the advantage is that the regex engine know that it don't have to backtrack to test other ways to match with a subpattern. The regex is then more efficient.
Description
This expression will match and capture only ${classA.methodA.methodB(classB.methodC(classB.methodD)))} or ${classA.methodA.methodB} formats.
(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)
Groups
Group 0 gets the entire match from the start dollar sign to the close squiggly bracket
gets the Class
gets the first method
gets the second method followed by all the text upto but not including the close squiggly bracket. If this group has open round brackets which are empty () then this match will fail
PHP Code Example:
<?php
$sourcestring="${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
${classA2.methodA2.methodB2}
${classA3.methodA3.methodB3()}
${methodB4(methodC4(classB4.methodD)))}
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}";
preg_match_all('/(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => ${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
[1] =>
${classA2.methodA2.methodB2}
[2] =>
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}
)
[1] => Array
(
[0] => classA1
[1] => classA2
[2] => classA5
)
[2] => Array
(
[0] => methodA1
[1] => methodA2
[2] => methodA5
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD)))
[1] => methodB2
[2] => methodB5(classB.methodC(classB.methodD)))
)
)
Disclaimers
I added a number to the end of the class and method names to help illistrate what's happening in the groups
The sample text provided in the OP does not have balanced open and close round brackets.
Although () will be disallowed (()) will be allowed

Combine multiple match regular expression into one and get the matching ones

I have a list of regular expressions:
suresnes|suresne|surenes|surene
pommier|pommiers
^musique$
^(faq|aide)$
^(file )?loss( )?less$
paris
faq <<< this match twice
My use case is that each pattern which got a match display a link to my user,
so I can have multiple pattern matching.
I test thoses patterns against a simple string of text "live in paris" / "faq" / "pom"...
The simple way to do it is to loop over all the patterns with a preg_match, but I'm will do that a lot on a performance critical page, so this look bad to me.
Here is what I have tried: combining all thoses expressions into one with group names:
preg_match("#(?P<group1>^(faq|aide|todo|paris)$)|(?P<group2>(paris)$)#im", "paris", $groups);
As you can see, each pattern is grouped: (?P<GROUPNAME>PATTERN) and they are all separated by a pipe |.
The result is not what I expect, as only the first group matching is returned. Look like when a match occurs the parsing is stopped.
What I want is the list of all the matching groups. preg_match_all does not help neither.
Thanks!
How about:
preg_match("#(?=(?P<group1>^(faq|aide|todo|paris)$))(?=(?P<group2>(paris)$))#im", "paris", $groups);
print_r($groups);
output:
Array
(
[0] =>
[group1] => paris
[1] => paris
[2] => paris
[group2] => paris
[3] => paris
[4] => paris
)
The (?= ) is called lookahead
Explanation of the regex:
(?= # start lookahead
(?P<group1> # start named group group1
^ # start of string
( # start catpure group #1
faq|aide|todo|paris # match any of faq, aide, todo or paris
) # end capture group #1
$ # end of string
) # end of named group group1
) # end of lookahead
(?= # start lookahead
(?P<group2> # start named group group2
( # start catpure group #2
paris # paris
) # end capture group #2
$ # end of string
) # end of named group group2
) # end of lookahead
Try this approach:
#/ define input string
$str_1 = "{STRING HERE}";
#/ Define regex array
$reg_arr = array(
'suresnes|suresne|surenes|surene',
'pommier|pommiers',
'^musique$',
'^(faq|aide)$',
'^(file )?loss( )?less$',
'paris',
'faq'
);
#/ define a callback function to process Regex array
function cb_reg($reg_t)
{
global $str_1;
if(preg_match("/{$reg_t}/ims", $str_1, $matches)){
return $matches[1]; //replace regex pattern with the result of matching is the key trick here
//or return $matches[0]; if you dont want to get captured parenthesized subpatterns
//or you could return an array of both. its up to you how to do it.
}else{
return '';
}
}
#/ Apply array Regex via much faster function (instead of a loop)
$results = array_map('cb_reg', $reg_arr); //returns regex results
$results = array_diff($results, array('')); //remove empty values returned
Basically, this is the fastest way I could think of.
You can't combine say 100s of Regex into one call, as it would be very complex regex to build and will have several chances to fail matching. This is one of the best way to do it.
In my opinion, combining large number of Regex into 1 regex (if possibly achieved) will be slower to execute with preg_match, as compared to this approach of Callback on Arrays. Just remember, the key here is Callback function on array member values, which is fastest way to handle array for your and similar situation in php.
Also note,
The callback on Array is not equal to looping the Array. Looping is slower and has an n from algorithm analysis. But callback on array elements is internal and is very fast as compared.
You can combine all of your regexes with "|" in between them. Then apply this: http://www.rexegg.com/regex-optimizations.html, which will optimize it, collapse common expressions, etc.

Regex to match expression with multiple parentheses, one within each other

I'm building a task (in PHP) that reads all the files of my project in search for i18n messages. I want to detect messages like these:
// Basic example
__('Show in English') => Show in English
// Get the message and the name of the i18n file
__("Show in English", array(), 'page') => Show in English, page
// Be careful of quotes
__("View Mary's Car", array()) => View Mary's Car
// Be careful of strings after the __() expression
__('at').' '.function($param) => at
The regex expression that works for those cases (there are some other cases taken into account) is:
__\(.*?['|\"](.*?)(?:['|\"][\.|,|\)])(?: *?array\(.*?\),.*?['|\"](.*?)['|\"]\)[^\)])?
However if the expression is in multiple lines it doesn't work. I have to include dotail /s, but it breaks the previous regex expresion as it doesn't control well when to stop looking ahead:
// Detect with multiple lines
echo __('title_in_place', array(
'%title%' => $place['title']
), 'welcome-user'); ?>
There is one thing that will solve the problem and simplify the regex expression that it's matching open-close parentheses. So no matter what's inside __() or how many parentheses there are, it "counts" the number of openings and expects that number of closings.
Is it possible? How? Thanks a lot!
Yes. First, here is the classic example for simple nested brackets (parentheses):
\(([^()]|(?R))*\)
or faster versions which use a possesive quantifier:
\(([^()]++|(?R))*\)
or (equivalent) atomic grouping:
\((?>[^()]+|(?R))*\)
But you can't use the: (?R) "match whole expression" expression here because the outermost brackets are special (with two leading underscores). Here is a tested script which matches (what I think) you want...
Solution: Use group $1 (recursive) subroutine call: (?1)
<?php // test.php Rev:20120625_2200
$re_message = '/
# match __(...(...)...) message lines (having arbitrary nesting depth).
__\( # Outermost opening bracket (with leading __().
( # Group $1: Bracket contents (subroutine).
(?: # Group of bracket contents alternatives.
[^()"\']++ # Either one or more non-brackets, non-quotes,
| "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*" # or a double quoted string,
| \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\' # or a single quoted string,
| \( (?1) \) # or a nested bracket (repeat group 1 here!).
)* # Zero or more bracket contents alternatives.
) # End $1: recursed subroutine.
\) # Outermost closing bracket.
.* # Match remainder of line following __()
/mx';
$data = file_get_contents('testdata.txt');
$count = preg_match_all($re_message, $data, $matches);
printf("There were %d __(...) messages found.\n", $count);
for ($i = 0; $i < $count; ++$i) {
printf(" message[%d]: %s\n", $i + 1, $matches[0][$i]);
}
?>
Note that this solution handles balanced parentheses (inside the "__(...)" construct) to any arbitrary depth (limited only by host memory). It also correctly handles quoted strings inside the "__(...)" and ignores any parentheses that may appear inside these quoted strings. Good luck.
*
Matching balanced parentheses is not possible with regular expressions (unless you use an engine with non-standard non-regular extensions, but even then it's still a bad idea and will be hard to maintain).
You could use a regular expression to find lines containing potential matches, then iterate over the string character by character counting the number of open and close parentheses until you find the index of the matching closing parenthesis.
for me use such expression
(\(([^()]+)\))
i try find it
* 1) (1+2)
* 2) (1+2)+(3+2)
* 3) (IF 1 THEN 1 ELSE 0) > (IF 2 THEN 1 ELSE 1)
* 4) (1+2) -(4+ (3+2))
* 5) (1+2) -((4+ (3+2)-(6-7)))
The only way I'm aware of pulling this off is with balanced group definitions. That's a feature in the .NET flavor of regular expressions, and is explained very well in this article.
And as Qtax noted, this can be done in PCRE with (?R) as decribed in their documentation.
Or this could also be accomplished by writing a custom parser. Basically the idea would be to maintain a variable called ParenthesesCount as you're parsing from left to right. You'd increment ParenthesesCount every time you see ( and decrement for every ). I've written a parser recently that handles nested parentheses this way.

Regex issue with named captured pairs

I have the following value:
start=2011-03-10T13:00:00Z;end=2011-03-30T13:00:00Z;scheme=W3C-DTF
I use the following regular expression to strip out the 'start' and 'end' dates and assign them to their own named capture pair:
#^start=(?P<publishDate>.+);end=(?P<expirationDate>.+);#ix'
Probably not the absolute best REGEX, but it works well enough if both 'start' and 'end' values are present.
Now, what I need to do is still match 'publishDate' if 'expirationDate' is missing and vise-versa.
How can I do this using a single expression? I'm not the greatest at regular expressions and I'm starting to wander off into the more advanced stuff, so any help with this would be greatly appreciated.
Thanks!
UPDATE:
Thanks to Mr. Chung, I have resolved this issue with the following expression:
#^(start=(?P<publishDate>.*?);)?(end=(?P<expirationDate>.*?);)?#xi
As always, thank you so much for all of your help, everyone. :)
Use (...)? for an optional section
^(start=(?P<publishDate>.+);)?(end=(?P<expirationDate>.+));)?
These both set the named buffer to a value (instead of null or undefined)
I recommend the first one.
1. To find either/both in any order:
/^(?=.*\bstart=(?P<publishDate>.*?);|(?P<publishDate>))(?=.*\bend=(?P<expirationDate>.*?);|(?P<expirationDate>))/ix
/^(?= # from beginning, look ahead for start
.*\b # any character 0 or more times (backtrack to match 'start')
start=(?P<publishDate>.*?); # put start date in publish
| (?P<publishDate>) # OR, put empty string publish
)
(?= # from beginning, look ahead for end
.*\b # same criteria as above ...
end=(?P<expirationDate>.*?);
| (?P<expirationDate>)
)
/ix
2. To find either/both in start/end order:
/^(?:.*\bstart=(?P<publishDate>.*?);|(?P<publishDate>))(?:.*\bend=(?P<expirationDate>.*?);|(?P<expirationDate>))/ix
Edit -
#Josh Davis - I had to go searching PCRE.org, some great stuff there.
With Perl there is no problem with duplicate names.
Docs: "If multiple groups have the same name then it refers to the leftmost defined group in the current match."
The is never a problem when used in an alternation.
With PCRE ..
Duplicate names will work properly with PHP if its used with the branch reset.
Branch reset insures duplicate names will occupy the same capture group.
After that, using the dup names constant, $match['name'] will either contain a value
or an empty string, but it will exist.
ie:
(?J) = PCRE_DUPNAMES
(?| ... | ...) = Branch reset
This works:
/(?Ji)^
(?= (?| .* end = (?P<expirationDate> .*? ); | (?P<expirationDate>)) )
(?= (?| .* start = (?P<publishDate> .*? ); | (?P<publishDate>)) )
/x
Try it here: http://www.ideone.com/zYd24
<?php
$string = "start=2011-03-(start)10T13:00:00Z;end=2011-03-(end)30T13:00:00Z;scheme=W3C-DTF";
preg_match('/(?Ji)^
(?= (?| .* end = (?P<expirationDate> .*? ); | (?P<expirationDate>)) )
(?= (?| .* start = (?P<publishDate> .*? ); | (?P<publishDate>)) )
/x', $string, $matches);
echo "Published = ",$matches['publishDate'],"\n";
echo "Expires = ",$matches['expirationDate'],"\n";
print_r($matches);
?>
Output
Published = 2011-03-(start)10T13:00:00Z
Expires = 2011-03-(end)30T13:00:00Z
Array
(
[0] =>
[expirationDate] => 2011-03-(end)30T13:00:00Z
[1] => 2011-03-(end)30T13:00:00Z
[publishDate] => 2011-03-(start)10T13:00:00Z
[2] => 2011-03-(start)10T13:00:00Z
)
If 'start=;' isn't present when the corresponding date is absent, the Stephen Chung's code is OK
Otherwise I think that replacing '+' with '*' is enough:
#^start=(?P<publishDate>.*?);end=(?P<expirationDate>.*?);#ix'
By the way, the '?' is necessary to make the point ungreedy in every code

Categories