I'm building a task (in PHP) that reads all the files of my project in search of i18n messages. I want to detect messages like these:
// Basic example
__('Show in English') => Show in English
// Get the message and the name of the i18n file
__("Show in English", array(), 'page') => Show in English, page
// Be careful of quotes
__("View Mary's Car", array()) => View Mary's Car
// Be careful of strings after the __() expression
__('at').' '.function($param) => at
The regex expression that works for those cases (there are some other cases taken into account) is:
__\(.*?['|\"](.*?)(?:['|\"][\.|,|\)])(?: *?array\(.*?\),.*?['|\"](.*?)['|\"]\)[^\)])?
However, if the expression spans multiple lines, it doesn't work. I have to add the dotall modifier /s, but that breaks the previous regex, because it no longer controls where to stop matching:
// Detect with multiple lines
echo __('title_in_place', array(
'%title%' => $place['title']
), 'welcome-user'); ?>
One thing would solve the problem and simplify the regex: matching opening and closing parentheses. Then, no matter what's inside __() or how many parentheses there are, it "counts" the number of opening parentheses and expects the same number of closing ones.
Is it possible? How? Thanks a lot!
Yes. First, here is the classic example for simple nested brackets (parentheses):
\(([^()]|(?R))*\)
or a faster version which uses a possessive quantifier:
\(([^()]++|(?R))*\)
or (equivalent) atomic grouping:
\((?>[^()]+|(?R))*\)
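As a quick check of the atomic-group variant, here is a minimal sketch run against one of the arithmetic-style strings; the input string and variable names are only illustrative assumptions.

```php
<?php
// Atomic-group variant of the classic balanced-parentheses pattern.
$pattern = '/\((?>[^()]+|(?R))*\)/';
// Illustrative input (an assumption, not taken from the project files).
$input = '(1+2) -((4+ (3+2)-(6-7)))';

// Each match is one complete, balanced top-level group.
preg_match_all($pattern, $input, $m);
print_r($m[0]); // "(1+2)" and "((4+ (3+2)-(6-7)))"
```

Each top-level group comes back as a single match, however deeply it is nested.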
But you can't use the (?R) "match the whole expression" construct here, because the outermost brackets are special (they have two leading underscores). Here is a tested script which matches what (I think) you want...
Solution: Use group $1 (recursive) subroutine call: (?1)
<?php // test.php Rev:20120625_2200
$re_message = '/
# match __(...(...)...) message lines (having arbitrary nesting depth).
__\( # Outermost opening bracket (with leading __().
( # Group $1: Bracket contents (subroutine).
(?: # Group of bracket contents alternatives.
[^()"\']++ # Either one or more non-brackets, non-quotes,
| "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*" # or a double quoted string,
| \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\' # or a single quoted string,
| \( (?1) \) # or a nested bracket (repeat group 1 here!).
)* # Zero or more bracket contents alternatives.
) # End $1: recursed subroutine.
\) # Outermost closing bracket.
.* # Match remainder of line following __()
/mx';
$data = file_get_contents('testdata.txt');
$count = preg_match_all($re_message, $data, $matches);
printf("There were %d __(...) messages found.\n", $count);
for ($i = 0; $i < $count; ++$i) {
printf(" message[%d]: %s\n", $i + 1, $matches[0][$i]);
}
?>
Note that this solution handles balanced parentheses (inside the "__(...)" construct) to any arbitrary depth (limited only by host memory). It also correctly handles quoted strings inside the "__(...)" and ignores any parentheses that may appear inside these quoted strings. Good luck.
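To see the same idea on the multi-line case from the question without a testdata.txt file, here is a self-contained sketch. It reuses the group-1 subroutine pattern but drops the trailing .* so that only the __(...) call is matched; the inline sample source is an assumption made for illustration.

```php
<?php
// Same recursive structure as the script above, written as a nowdoc so the
// backslashes need no double escaping. Comments must not contain the delimiter.
$pattern = <<<'REGEX'
/
__\(                                   # outermost opening bracket
(                                      # group 1: bracket contents
  (?: [^()"']++                        # non-brackets, non-quotes
    | "[^"\\]*(?:\\[\S\s][^"\\]*)*"    # double-quoted string
    | '[^'\\]*(?:\\[\S\s][^'\\]*)*'    # single-quoted string
    | \( (?1) \)                       # nested bracket, re-using group 1
  )*
)
\)                                     # outermost closing bracket
/x
REGEX;

// Illustrative sample source (assumed), including a call that spans lines.
$source = <<<'SRC'
echo __('title_in_place', array(
    '%title%' => $place['title']
), 'welcome-user');
echo __("View Mary's Car", array());
SRC;

preg_match_all($pattern, $source, $matches);
foreach ($matches[1] as $i => $contents) {
    printf("call %d contents: %s\n", $i + 1, $contents);
}
```

Because the character class and the quoted-string alternatives can all consume newlines, no /s modifier is needed; group 1 holds the argument list of each call, ready for a second, simpler pass that picks out the message and domain.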
Matching balanced parentheses is not possible with regular expressions (unless you use an engine with non-standard non-regular extensions, but even then it's still a bad idea and will be hard to maintain).
You could use a regular expression to find lines containing potential matches, then iterate over the string character by character counting the number of open and close parentheses until you find the index of the matching closing parenthesis.
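A minimal sketch of that counting approach follows. The helper name findClosingParen and the sample line are hypothetical, and quoted strings are deliberately not special-cased, so a parenthesis inside a message string would confuse it.

```php
<?php
// Hypothetical helper: returns the index of the ")" that closes the "(" found
// at $open, or -1 if the parentheses never balance.
function findClosingParen(string $text, int $open): int
{
    $depth = 0;
    $length = strlen($text);
    for ($i = $open; $i < $length; $i++) {
        if ($text[$i] === '(') {
            $depth++;                 // one more level to close
        } elseif ($text[$i] === ')') {
            $depth--;
            if ($depth === 0) {
                return $i;            // this ")" pairs with the "(" at $open
            }
        }
    }
    return -1;
}

// Illustrative line (assumed).
$line = "__('title', array('%t%' => f(1)), 'page') . g()";
$open = strpos($line, '__(') + 2;     // index of the "(" after the underscores
$close = findClosingParen($line, $open);
$inside = substr($line, $open + 1, $close - $open - 1);
echo $inside, "\n";
```

The substring between the two indexes is the full argument list of the __() call, regardless of nesting depth.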
For my case I use this expression:
(\(([^()]+)\))
I tried it on these inputs:
* 1) (1+2)
* 2) (1+2)+(3+2)
* 3) (IF 1 THEN 1 ELSE 0) > (IF 2 THEN 1 ELSE 1)
* 4) (1+2) -(4+ (3+2))
* 5) (1+2) -((4+ (3+2)-(6-7)))
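A small sketch of what that expression returns on input 4 may help when judging whether it fits. Since [^()]+ cannot cross a parenthesis, only the innermost groups are reported.

```php
<?php
// The non-recursive expression from this answer.
$pattern = '/(\(([^()]+)\))/';

// Input 4 from the list above.
preg_match_all($pattern, '(1+2) -(4+ (3+2))', $m);
print_r($m[0]); // "(1+2)" and "(3+2)"; the enclosing "(4+ (3+2))" is not matched
```

This is fine for flat expressions, but for nested ones the outer group has to be found some other way, for example with a recursive pattern.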
The only way I'm aware of pulling this off is with balanced group definitions. That's a feature in the .NET flavor of regular expressions, and is explained very well in this article.
And as Qtax noted, this can be done in PCRE with (?R) as described in their documentation.
Or this could also be accomplished by writing a custom parser. Basically the idea would be to maintain a variable called ParenthesesCount as you're parsing from left to right. You'd increment ParenthesesCount every time you see ( and decrement for every ). I've written a parser recently that handles nested parentheses this way.
This is an example input string:
((#1662# - #[Réz-de-chaussée][Thermostate][Temperature Actuel]#) > 4) && #1304# == 1 and #[Aucun][Template][ReviseConfort#templateSuffix#]#
and these are the required output strings:
#1662#
#[Réz-de-chaussée][Thermostate][Temperature Actuel]#
#1304#
#[Aucun][Template][ReviseConfort#templateSuffix#]#
I tried this regex, but it doesn't work:
~("|\').*?\1(*SKIP)(*FAIL)|\#(?:[^##]|(?R))*\#~
preg_match_all( '/\#((\d{1,4})|(\[[^0-9]+\]))[\#$]/'
, '((#1662# - #[Réz-de-chaussée][Thermostate][Temperature Actuel]$) > 4) && #1304$ == 1 and #[Aucun][Template][ReviseConfort#templateSuffix#]#'
, $matches
);
foreach($matches[0] as $match)
echo $match.PHP_EOL;
This situation is not particularly suited for recursion. It would be better to use a normal regex.
It's hard to tell for certain if the following will work for all other possible inputs, as you've only supplied two limited examples.
At least in those examples, the required closing #'s are followed either by a ), a space, or the end of the line. Using a negative look-ahead for these values allows us to capture the internally nested #'s:
#(?:[^#]|#(?![\s)]|$))+#
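Here is a short sketch that runs this pattern over the example input from the question. The ~ delimiter is an arbitrary choice (the pattern itself contains no ~).

```php
<?php
// Look-ahead based pattern from this answer.
$pattern = '~#(?:[^#]|#(?![\s)]|$))+#~';
$input = '((#1662# - #[Réz-de-chaussée][Thermostate][Temperature Actuel]#) > 4) && #1304# == 1 and #[Aucun][Template][ReviseConfort#templateSuffix#]#';

preg_match_all($pattern, $input, $m);
foreach ($m[0] as $token) {
    echo $token, PHP_EOL;
}
```

A # is only treated as the closing delimiter when it is followed by whitespace, a ")" or the end of the string, which is why the inner #templateSuffix# stays inside the last token.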
Demo
Try this one (?:#[^#[]+#|##(?:.+?]#){2}|#(?:.+?]#){1})
Explanation:
(?:
// Grabs everything between 1 opening # and 1 closing # tag that contains no # or [ characters
#[^#[]+#|
// Grabs everything between 2 opening # and 2 closing ]# tags
##(?:.+?]#){2}|
// Grabs everything between 1 opening # and 1 closing ]# tag
#(?:.+?]#){1}
)
I'm pretty lousy at regex, and need help with the following scenario. I need to locate and replace text that has a common structure, but one aspect will be different:
here is a string (with 3 values)
here is another string (with 5 values)
In the above examples, I need to locate and then replace the value in parentheses. I can't search by parens alone, as the string may contain other parens. But the value in the parens that needs to be replaced is consistently constructed: (with # values) -- the only difference will be the number.
So ideally the regex returns (with 3 values) and (with 5 values) so I can use a simple str_replace to change the text.
This is regex in a PHP script.
Try with this regex :
\(with\s+\d+\s+values\)
Demo here
The following regex should work for you:
/\(with (\d+) values\)/g
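This matches strings of the specified format and gives the value in a capture group so it can be used in the replacement. The g flag at the end is only needed if there are several of these in one string. Note that PHP's preg_* functions do not accept a g modifier: drop it and use preg_match_all or preg_replace, which already process every occurrence.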
Demo here
If, however, there can only be one digit, then the following will work:
/\(with (\d) values\)/g
Or, if the number can only be a digit greater than 1, for example, then the following:
/\(with ([2-9]) values\)/g
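Since the question is about PHP, here is a minimal sketch of the replacement itself using preg_replace. The sample text and the replacement wording "items" are assumptions for illustration.

```php
<?php
// Same pattern without the g flag, which PHP does not support.
$pattern = '/\(with (\d+) values\)/';
// Illustrative text (assumed) containing an unrelated parenthesized part.
$text = 'here is a string (with 3 values) and (another one) (with 5 values)';

// $1 re-inserts the captured number; other parentheses are left untouched.
$result = preg_replace($pattern, '(with $1 items)', $text);
echo $result, "\n";
```

This avoids the separate str_replace step, because the match and the replacement happen in one call.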
If I got you right, you are looking for exactly three or five items within parentheses (comma separated).
This could be accomplished by
\( # "(" literally
(?:[^,()]+,){2} # not , or ( or ) exactly two times
(?:(?:[^,()]+,){2})? # repeated
[^,()]+ # without the comma in the end
\) # the closing parenthesis
See a demo on regex101.com.
If you're really looking only for two variant of strings, you could very easily do
\(with (?:3|5) values\)
In general
\(with \d+ values\)
as proposed by @SchoolBoy.
Something like this maybe
$str ="here is another string (with 5 values)";
preg_match_all("/\(with (\d+) values\)/", $str, $out );
print_r( $out );
Output:
Array
(
[0] => Array
(
[0] => (with 5 values)
)
[1] => Array
(
[0] => 5
)
)
Here at ideone...
It uses the regex
\(with (\d+) values\)
that matches the literal opening parentheses followed by the string with # values, capturing the actual number #, and finally the closing parentheses.
It returns the complete match (the parenthesized string) in the first dimension and the actual number in the second.
I need to parse and process input data pushed into our webservice as (UTF-8) text files from a 3rd party. The input files contain tags in the general form
<{ _N ('some_domain_id','this can be an arbitary string',{'a':'b','c':'d'}) }>
-- --   --------------   ------------------------------  -----------------  --
              ^                        ^                         ^
          I need to                    |                   this part is
         extract this         and this (payload)             optional
These tags can appear anywhere in the text file; no assumptions can be made about their distribution or about what is between the tags. <{, _N and }> are always present in a valid tag, but there may be spaces between them without affecting the values (e.g. between <{ and _N).
With that information and a limited initial test set, my current implementation is a regex followed by a split of the result at the commas:
Regex /<{\s*_N\s*\(([^\)]*)\)\s*}>?/g (Example: https://regex101.com/r/NuJD2V/1)
Then split the resulting match 'some_domain_id' , 'this can be an arbitary string',{'a':'b','c':'d'} with str_getcsv($match,',','\'','\\')
Use the first two segments of the str_getcsv result and discard the rest, as they are optional
After that, some_domain_id and this can be an arbitary string can be trimmed and processed as needed
The service has been running for a while now, and I have realized that although the vast majority of tags are caught correctly, a small number of tags contain anomalies and are not recognized by this implementation.
Caveats (things that can happen in the payload part):
brackets in payload
escaped quotes in payload \'
optional modifiers after the outermost brackets of the _N call (see below)
Here are some sample tags I identified that cannot be parsed or that produce wrong results (which is even worse than not being recognized).
<{_N( 'some_domain_id' , 'this can ( be an arbitary ) string',{'a':'b','c':'d'})}>
- Not recognized, note the brackets, they can occur anywhere in the data string, they don't even need to be balanced (Example: https://regex101.com/r/BCiaaj/1)
<{_N( 'some_domain_id' , 'this can be an arbitary string' {'a':'b','c':'d'})|e('modifier')}>
<{_N( 'some_domain_id' , 'this can be an arbitary string')|e('modifier')}>
Not recognized; note the extra (optional) modifier after the outermost brackets of the _N element. The modifier can consist of different letters (e, r, w) and an arbitrary string argument, and there may be spaces around the chain operator | (Example: https://regex101.com/r/XmR2uO/1)
Experimentally, i tried a few other regexes already, but they always fail on one or more of the tags in my extended testset, e.g.
/_N\s*(\(\s*(?:\(??[^(]*?\s*\)))+/ - catches the modifier case, but fails on brackets in the relevant string
So my questions, as I am not a real regex expert:
Is this solvable with a regex, and if so, can anyone point me in the right direction?
Is there a better solution in vanilla PHP 7+ that does not require installing an external library?
Any help is highly appreciated!
You may use
<{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)}>
See the regex demo
Details:
<{\s* - a <{ plus 0+ whitespaces
_N - tag start
\s*\(\s* - a ( enclosed with 0+ whitespaces
'([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into a capturing group #1)
\s*,\s* - a , enclosed with 0+ whitespaces
'([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into a capturing group #2)
\s* - 0+ whitespaces
(.*?) - any 0+ chars as few as possible up to the first
}> - literal char sequence }>.
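The breakdown above can be tried out with a short sketch like the following. The braces are escaped here purely as a precaution, and the second sample tag (with an escaped quote and a modifier) is made up for illustration; neither detail comes from the original answer.

```php
<?php
// Pattern from this answer in a nowdoc, so backslashes are taken literally.
$pattern = <<<'REGEX'
~<\{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)\}>~
REGEX;

// First tag is from the question; the second one is a hypothetical example.
$input = <<<'TEXT'
<{_N( 'some_domain_id' , 'this can ( be an arbitary ) string',{'a':'b','c':'d'})}>
<{ _N ('other_id', 'it\'s here')|e('modifier') }>
TEXT;

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // $m[1] is the domain id, $m[2] the payload (escapes still in place).
    printf("domain: %s | payload: %s\n", $m[1], $m[2]);
}
```

Group 3 swallows everything between the second string and }>, so the optional object and any |e('...') modifier no longer affect the two values of interest. The payload still contains its backslash escapes and may need unescaping afterwards.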
Why could you not just retrieve the parts you need (that being in the single quotes);
//example 1
$str = '<{_N( \'some_domain_id\' , \'this can ( be an arbitary ) string\',{\'a\':\'b\',\'c\':\'d\'})}>';
test_pregex($str);
//example 2
$str = '<{_N(\'some_domain_id\' , \'this can ( be an arbitary ) string\' ,{\'a\':\'b\',\'c\':\'d\'})} >';
test_pregex($str);
//example 3
$str = '<{_N( \'some_domain_id\' , \'this can ( be an arbitary ) string\')|e(\'modifier\')}>';
test_pregex($str, '\'modifier\'');
function test_pregex($str, $optional = "{'a':'b','c':'d'}") {
$re = '/\'([^\']*?)\'|(\{\'[^\']*?\'.+?})/m';
preg_match_all($re, $str, $matches);
$matches = $matches[0];
var_export($matches);
assert($matches[0] == "'some_domain_id'");
assert($matches[1] == "'this can ( be an arbitary ) string'");
assert($matches[2] == $optional);
}
Output will be all three cases with no assertion warnings. You can then process what you require further.
I want to match nested wiki functions (wiki parser functions) that start with a function name followed by a colon, but I already fail to construct a recursive PCRE pattern for a first-level test. The match must start with {{aFunctionName: (in regex, {{[\w\d]+:). The test text can look like this:
1 {{DEFAULTSORT: shall be matched {{PAGENAME}} }}
2 {{DEFAULTSORT: shall be matched }}
3 {{DEFAULTSORT: shall be matched {{PAGENAMEE: some text}} }}
4 Lorem ipsum {{VARIABLE shall not be matched}}
5 {{Some template|param={{VARIABLE}} shall not be matched }}
I'm able to
get any nested curly braces using {{(?:(?:(?!{{|}}).)++|(?R))*}}, which matches lines 1, 2, 3 and, partially, 4 and 5
get any nested wiki function using ({{(?:[\w\d]+:)(?:(?:(?!{{|}}).)++|(?1))*}}), which only matches line 3, but I also want to match lines 1 and 2.
But I have no idea how to construct a regex pattern that tests something like (written as pseudo code):
{{match1st-level-Function: then anything {{nested}} or not nested }}
{{do not match simple {{nested}} things}}
Any help from a pcre regex expert? Thank you!
Use something like this:
{{\w+:([^{}]*+(?:{{(?1)}}[^{}]*)*+)}}
To obtain a recursive pattern, the use of (?R) isn't mandatory, you can also refer to any capture group opened before with its number, its relative position (from the current position), or its name (when you use named captures).
Other possible syntaxes are:
{{\w+:([^{}]*+(?:{{(?-1)}}[^{}]*)*+)}}
# ^------ relative reference: the last group on the left
{{\w+:([^{}]*+(?:{{\g<1>}}[^{}]*)*+)}}
# ^----- oniguruma syntax
{{\w+:([^{}]*+(?:{{\g<-1>}}[^{}]*)*+)}}
# ^----- relative with oniguruma syntax
{{\w+:(?<name>[^{}]*+(?:{{\g<name>}}[^{}]*)*+)}}
# ^---- named capture (oniguruma)
{{\w+:(?<name>[^{}]*+(?:{{(?&name)}}[^{}]*)*+)}}
# ^---- named capture (perl syntax)
All these syntaxes can be used with pcre.
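As a rough check, the numbered-group variant can be run against the five test lines from the question. In this sketch the braces are escaped only to be explicit, which is an assumption on my part and not required by the answer.

```php
<?php
// First variant from this answer: (?1) recurses into group 1 only, so the
// "\w+:" prefix is required at the top level but not for nested {{...}}.
$pattern = '/\{\{\w+:([^{}]*+(?:\{\{(?1)\}\}[^{}]*)*+)\}\}/';

$lines = [
    '{{DEFAULTSORT: shall be matched {{PAGENAME}} }}',
    '{{DEFAULTSORT: shall be matched }}',
    '{{DEFAULTSORT: shall be matched {{PAGENAMEE: some text}} }}',
    'Lorem ipsum {{VARIABLE shall not be matched}}',
    '{{Some template|param={{VARIABLE}} shall not be matched }}',
];

$results = [];
foreach ($lines as $n => $line) {
    $results[] = preg_match($pattern, $line);
    printf("line %d: %s\n", $n + 1, end($results) ? 'match' : 'no match');
}
```

Lines 1 to 3 match and lines 4 and 5 do not, which is the behaviour asked for.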
If you absolutely want to use the whole pattern for your recursion, you can use a conditional statement to test whether you are in a nested part or not:
{{(?(R)|\w+:)[^{}]*+(?:(?R)[^{}]*)*+}}
The conditional is (?(R)|\w+:) and follows this schema: (?(condition) True | False)
Although I know regex well enough in pseudocode, I'm having trouble translating what I want to do into PHP's Perl-compatible regex.
I'm trying to use preg_match to extract part of my expression.
I have the following string ${classA.methodA.methodB(classB.methodC(classB.methodD)))} and i need to do 2 things:
a. validate the syntax
${classA.methodA.methodB(classB.methodC(classB.methodD)))} valid
${classA.methodA.methodB} valid
${classA.methodA.methodB()} not valid
${methodB(methodC(classB.methodD)))} not valid
b. I need to extract those information
${classA.methodA.methodB(classB.methodC(classB.methodD)))} should return
1. classA
2. methodA
3. methodB(classB.methodC(classB.methodD)))
I've created this code
$expression = '${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}';
$pattern = '/\$\{(?:([a-zA-Z0-9]+)\.)(?:([a-zA-Z\d]+)\.)*([a-zA-Z\d.()]+)\}/';
if(preg_match($pattern, $expression, $matches))
{
echo 'found'.'<br/>';
for($i = 0; $i < count($matches); $i++)
echo $i." ".$matches[$i].'<br/>';
}
The result is :
found
0 ${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}
1 myvalue
2 fsdf
3 blo(fsdf.fsfds(fsfs.fs))
Obviously I'm having difficulty extracting the repeated methods, and it does not validate properly (honestly, I left that for last, once I solve the other problem): empty parentheses are allowed, and it does not check that every opened parenthesis is closed.
Thanks all
UPDATE
X m.buettner
Thanks for your help. I did a fast try to your code but it gives a very small issue, although i can by pass it. The issue is the same of one of my prior codes that i didn't post here which is when i try this string :
$expression = '${myvalue.fdsfs}';
with your pattern definition it shows :
found
0 ${myvalue.fdsfs}
1 myvalue.fdsfs
2 myvalue
3
4 fdsfs
As you can see, capture 3 comes back as an empty string, which is not present in the input. I couldn't understand why it does that, so can you suggest how to avoid it, or do I have to live with it due to PHP regex limits?
That said, I can only say thank you. Not only did you answer my problem, you also gave as much information as possible, with many suggestions on the proper path to follow when developing patterns.
One last thing: I (stupidly) forgot to add one important case, which is multiple parameters separated by commas, so
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB)}';
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB,classD.mehtodDA)}';
must be valid.
I edited to this
$expressionPattern =
'/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
(?: # non-capturing group for multiple arguments
[,] # literal ,
(?1) # recursively apply the pattern inside group 1 on parameters
)* # end of multiple arguments group; repeat 0 or more times
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
X Casimir et Hippolyte
Your suggestion also is good but it implies a little complex situation when using this code. I mean the code itself is easy to understand but it get less flexible. That said it also gave me a lot of information that surely can be helpful in the future.
X Denomales
Thanks for your support but your code falls when i try this :
$sourcestring='${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}';
the result is :
Array
(
[0] => Array
(
[0] => ${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}
)
[1] => Array
(
[0] => classA1
)
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1.methodB1(classB.methodC(classB.methodD))
)
)
It should be
[2] => Array
(
[0] => methodA0.methodA1
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
or
[2] => Array
(
[0] => methodA0
)
[3] => Array
(
[0] => methodA1
)
[4] => Array
(
[0] => methodB1(classB.methodC(classB.methodD))
)
)
This is a tough one. Recursive patterns are often beyond what's possible with regular expressions, and even when it is possible, it can lead to expressions that are very hard to understand and maintain.
You are using PHP and therefore PCRE, which indeed supports the recursive regex constructs (?n). As your recursive pattern is quite regular it is possible to find a somewhat practical solution using regex.
One caveat I should mention right away: since you allow an arbitrary number of "intermediate" method calls per level (in your snippet fdsfs and fsdf), you cannot get all of these in separate captures. That is simply impossible with PCRE. Each match will always yield the same finite number of captures, determined by the number of opening parentheses your pattern contains. If a capturing group is used repeatedly (e.g. using something like ([a-z]+\.)+) then every time the group is used the previous capture will be overwritten and you only get the last instance. Therefore, I recommend that you capture all the "intermediate" method calls together, and then simply explode that result.
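The overwriting behaviour is easy to see in a small sketch; the input string and variable names here are illustrative only.

```php
<?php
$input = 'abc.def.ghi.';

// A repeated capturing group keeps only its last iteration.
preg_match('/^(?:([a-z]+)[.])+$/', $input, $repeated);
echo $repeated[1], "\n"; // ghi

// Capturing the whole run once and splitting it afterwards keeps every name.
preg_match('/^((?:[a-z]+[.])+)$/', $input, $whole);
$names = explode('.', $whole[1], -1); // limit -1 drops the empty piece after the final "."
print_r($names);
```

The second form is the one the pattern uses for group 3: one capture for all intermediate names, split with explode afterwards.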
Likewise you couldn't (if you wanted to) get the captures of multiple nesting levels at once. Hence, your desired captures (where the last one includes all nesting levels) are the only option - you can then apply the pattern again to that last match to go a level further down.
Now for the actual expression:
$pattern = '/
^ # beginning of the string
[$][{] # literal ${
( # group 1, used for recursion
( # group 2 (class name)
[a-z\d]+ # one or more alphanumeric characters
) # end of group 2 (class name)
[.] # literal .
( # group 3 (all intermediate method names)
(?: # non-capturing group that matches a single method name
[a-z\d]+ # one or more alphanumeric characters
[.] # literal .
)* # end of method name, repeat 0 or more times
) # end of group 3 (intermediate method names);
( # group 4 (final method name and arguments)
[a-z\d]+ # one or more alphanumeric characters
(?: # non-capturing group for arguments
[(] # literal (
(?1) # recursively apply the pattern inside group 1
[)] # literal )
)? # end of argument-group; make optional
) # end of group 4 (method name and arguments)
) # end of group 1 (recursion group)
[}] # literal }
$ # end of the string
/ix';
A few general notes: for complicated expressions (and in regex flavors that support it), always use the free-spacing x modifier which allows you to introduce whitespace and comments to format the expression to your desires. Without them, the pattern looks like this:
'/^[$][{](([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))[}]$/ix'
Even if you've written the regex yourself and you are the only one who ever works on the project - try understanding this a month from now.
Second, I've slightly simplified the pattern by using the case-insensitive i modifier. It simply removes some clutter, because you can omit the upper-case variants of your letters.
Third, note that I use single-character classes like [$] and [.] to escape characters where this is possible. That is simply a matter of taste, and you are free to use the backslash variants. I just personally prefer the readability of the character classes (and I know others here disagree), so I wanted to present you this option as well.
Fourth, I've added anchors around your pattern, so that there can be no invalid syntax outside of the ${...}.
Finally, how does the recursion work? (?n) is similar to a backreference \n, in that it refers to capturing group n (counted by opening parentheses from left to right). The difference is that a backreference tries to match again what was matched by group n, whereas (?n) applies the pattern again. That is, (.)\1 matches the same character twice in a row, whereas (.)(?1) matches any character and then applies the pattern again, hence matching another arbitrary character. If you use one of those (?n) constructs within the nth group, you get recursion. (?0) or (?R) refers to the entire pattern. That is all the magic there is.
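The difference between the two constructs can be checked with a tiny sketch (the anchors and test strings are added here only for illustration):

```php
<?php
// Backreference: group 1 matched "a", so \1 must match the text "a" again.
$backrefSame = preg_match('/^(.)\1$/', 'aa');   // 1
$backrefDiff = preg_match('/^(.)\1$/', 'ab');   // 0

// Subroutine call: (?1) re-applies the sub-pattern ".", so any second
// character is accepted.
$subroutine = preg_match('/^(.)(?1)$/', 'ab');  // 1

var_dump($backrefSame, $backrefDiff, $subroutine);
```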
The above pattern applied to the input
'${abc.def.ghi.jkl(mno.pqr(stu.vwx))}'
will result in the captures
0 ${abc.def.ghi.jkl(mno.pqr(stu.vwx))}
1 abc.def.ghi.jkl(mno.pqr(stu.vwx))
2 abc
3 def.ghi.
4 jkl(mno.pqr(stu.vwx))
Note that there are a few differences to the outputs you actually expected:
0 is the entire match (and in this case just the input string again). PHP will always report this first, so you cannot get rid of it.
1 is the first capturing group which encloses the recursive part. You don't need this in the output, but (?n) unfortunately cannot refer to non-capturing groups, so you need this as well.
2 is the class name as desired.
3 is the list of intermediate method names, plus a trailing period. Using explode it's easy to extract all the method names from this.
4 is the final method name, with the optional (recursive) argument list. Now you could take this, and apply the pattern again if necessary. Note that for a completely recursive approach you might want to modify the pattern slightly. That is: strip off the ${ and } in a separate first step, so that the entire pattern has the exact same (recursive) pattern as the final capture, and you can use (?0) instead of (?1). Then match, remove method name, and parentheses, and repeat, until you get no more parentheses in the last capture.
For more information on recursion, have a look at PHP's PCRE documentation.
To illustrate my last point, here is a snippet that extracts all elements recursively:
if(!preg_match('/^[$][{](.*)[}]$/', $expression, $matches))
echo 'Invalid syntax.';
else
traverseExpression($matches[1]);
function traverseExpression($expression, $level = 0) {
$pattern = '/^(([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))$/i';
if(preg_match($pattern, $expression, $matches)) {
$indent = str_repeat(" ", 4*$level);
echo $indent, "Class name: ", $matches[2], "<br />";
foreach(explode(".", $matches[3], -1) as $method)
echo $indent, "Method name: ", $method, "<br />";
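// Split only at the first "(" so that nested argument lists stay intact.
$parts = explode("(", $matches[4], 2);
echo $indent, "Method name: ", $parts[0], "<br />";
if(count($parts) > 1) {
echo $indent, "With arguments:<br />";
// Drop the final ")" that closes this argument list before recursing.
traverseExpression(substr($parts[1], 0, -1), $level+1);
}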
}
else
{
echo 'Invalid syntax.';
}
}
Note again, that I do not recommend using the pattern as a one-liner, but this answer is already long enough.
You can do validation and extraction with the same pattern. Example:
$subjects = array(
'${classA.methodA.methodB(classB.methodC(classB.methodD))}',
'${classA.methodA.methodB}',
'${classA.methodA.methodB()}',
'${methodB(methodC(classB.methodD))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE)))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE())))}'
);
$pattern = <<<'LOD'
~
# definitions
(?(DEFINE)(?<vn>[a-z]\w*+))
# pattern
^\$\{
(?<classA>\g<vn>)\.
(?<methodA>\g<vn>)\.
(?<methodB>
\g<vn> (
\( \g<vn> \. \g<vn> (?-1)?+ \)
)?+
)
}$
~x
LOD;
foreach($subjects as $subject) {
echo "\n\nsubject: $subject";
if (preg_match($pattern, $subject, $m))
printf("\nclassA: %s\nmethodA: %s\nmethodB: %s",
$m['classA'], $m['methodA'], $m['methodB']);
else
echo "\ninvalid string";
}
Regex explanation:
At the end of the pattern you can see the x modifier, which allows spaces, newlines and comments inside the pattern.
The pattern begins with the definition of a named group vn (variable name); here you define what a class or method name looks like for the whole pattern. You can then refer to this definition anywhere in the pattern with \g<vn>.
Note that you can define different kinds of names for classes and methods by adding other definitions. Example:
(?(DEFINE)(?<cn>....)) # for class name
(?(DEFINE)(?<mn>....)) # for method name
The pattern itself:
(?<classA>\g<vn>) captures into the named group classA, using the pattern defined in vn
the same goes for methodA
methodB is different because it can contain nested parentheses, which is why I use a recursive pattern for this part.
Detail:
\g<vn> # the method name (methodB)
( # open a capture group
\( # literal opening parenthesis
\g<vn> \. \g<vn> # for classB.methodC⑴
(?-1)?+ # refer the last capture group (the actual capture group)
# one or zero time (possessive) to allow the recursion stop
# when there is no more level of parenthesis
\) # literal closing parenthesis
)?+ # close the capture group
# one or zero time (possessive)
# to allow method without parameters
⑴ You can replace it with \g<vn>(?>\.\g<vn>)+ if you want to allow more than one method.
About possessive quantifiers:
You can add + after a quantifier ( * + ? ) to make it possessive. The advantage is that the regex engine knows it doesn't have to backtrack to try other ways of matching the subpattern, so the regex is more efficient.
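A possessive quantifier also changes what can match, not only the speed, as this small sketch with made-up patterns shows:

```php
<?php
// Greedy: a* first takes all three "a"s, then gives one back so that the
// final "a" in the pattern can match.
$greedy = preg_match('/^a*a$/', 'aaa');

// Possessive: a*+ keeps all three "a"s and never gives any back, so the
// final "a" has nothing left to match.
$possessive = preg_match('/^a*+a$/', 'aaa');

var_dump($greedy, $possessive); // int(1), int(0)
```

So a possessive quantifier is safe only where what follows can never need characters that the quantified part has already consumed, as is the case for \g<vn> followed by a literal dot or parenthesis.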
Description
This expression will match and capture only ${classA.methodA.methodB(classB.methodC(classB.methodD)))} or ${classA.methodA.methodB} formats.
(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)
Groups
Group 0 gets the entire match, from the opening dollar sign to the closing curly bracket
Group 1 gets the class
Group 2 gets the first method
Group 3 gets the second method followed by all the text up to, but not including, the closing curly bracket. If this group has open round brackets which are empty, (), then the match will fail
PHP Code Example:
<?php
// Single quotes are required: inside double quotes PHP would try to interpolate ${...}.
$sourcestring='${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
${classA2.methodA2.methodB2}
${classA3.methodA3.methodB3()}
${methodB4(methodC4(classB4.methodD)))}
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}';
preg_match_all('/(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => ${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
[1] =>
${classA2.methodA2.methodB2}
[2] =>
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}
)
[1] => Array
(
[0] => classA1
[1] => classA2
[2] => classA5
)
[2] => Array
(
[0] => methodA1
[1] => methodA2
[2] => methodA5
)
[3] => Array
(
[0] => methodB1(classB.methodC(classB.methodD)))
[1] => methodB2
[2] => methodB5(classB.methodC(classB.methodD)))
)
)
Disclaimers
I added a number to the end of the class and method names to help illustrate what's happening in the groups
The sample text provided in the OP does not have balanced open and close round brackets.
Although () will be disallowed, (()) will be allowed