Regex pattern construction dealing with recursion - php

I'm not sure if recursion is the correct way to characterize what's occurring in this pattern, but unfortunately I'm too new with regex to build something that will conform to how this pattern can vary and avoid nested groups.
So the pattern is basically defined as:
#param {item} {label}:{text} {labeln}:{textn}
where {labeln}:{textn} is the Nth instance of the label:text group.
So an example would be
/**
*
* #param name1 test1:this is text for test1 test2:this is text for test2
* #param name2 test3:this is text for test3 test4:this is text for test4 test5:this is text for test5
*
*/
Now ideally I'm trying to capture name1, test1:this is text for test1, and test2:this is text for test2 as matching groups. Same goes for the name2 line. Of course there can be many more lines like name1, and the pseudo "named parameters" can vary from none to many. Edit: Colons would not be permitted within the label text since they're reserved as delimiters. The label is strictly alphanumeric; the text would probably be restricted to a-zA-Z0-9_,'"-
First question is... is this a recursion problem or did I mischaracterize this?
Second question is... is it possible and if so, how can I achieve this?

Preface:
For the sake of explanation, I decided to clarify your "labels" by preceding them with a %. This can be any reserved symbol or other pattern that helps clear up what is a label/text:
/**
* #param variable_a %label:This is variable: a %required:true
* #param variable_b %required:false %pattern:/[a-zA-Z:]/
*/
Problem:
The problem with capturing repetitive patterns in regular expressions is that you can't have an unknown number of capture groups (i.e. you either need a variable number of global matches, or a fixed number of groups captured in each match):
#param (?# find a param)
\s* (?# whitespace)
(\w+) (?# capture the variable)
\s* (?# whitespace)
(?: (?# start non capturing group)
%(\w+): (?# capture the label)
([^%\n]+) (?# capture the text)
)+ (?# repeat the non-capturing group)
In this example, I put the label/text capturing code in a non-capturing group that is repeated (1+ times). This allows us to match the whole string; however, only the last label/text pair is captured (since we only have 3 groups: variable, label, and text).
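A minimal sketch of that limitation in PHP, assuming the sample line from the preface. The pattern is the one above collapsed onto a single line and used without the x flag, because in extended mode an unescaped # would start a comment; the variable names are only illustrative.

```php
<?php
// Sample line taken from the preface above.
$line = '#param variable_a %label:This is variable: a %required:true';

// Same pattern as above on one line; no x flag, so "#" stays a literal character.
$pattern = '/#param\s*(\w+)\s*(?:%(\w+):([^%\n]+))+/';

preg_match($pattern, $line, $m);

// $m[1] is the variable. $m[2] and $m[3] hold only the LAST label/text pair,
// because each repetition of the group overwrites the previous capture.
print_r(array_slice($m, 1));
```

The first pair (label) is matched but its captures are overwritten by the second pair (required).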
Straightforward Solution:
Instead of this, you can just match the whole string and then parse the label/text string after-the-fact:
(?# match the whole string)
#param (?# find a param)
\s* (?# whitespace)
(\w+) (?# capture the variable)
\s* (?# whitespace)
(.*) (?# capture the labels/texts)
(?# parse the label/text string)
% (?# the start of a label)
(\w+) (?# capture label)
: (?# end of label)
([^%]+) (?# capture text)
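The two-step approach could look roughly like this in PHP. This is a sketch under the assumptions of the preface (labels prefixed with %); the nested $params array layout is my own choice, not something the question prescribes.

```php
<?php
$doc = <<<'DOC'
/**
 * #param variable_a %label:This is variable: a %required:true
 * #param variable_b %required:false %pattern:/[a-zA-Z:]/
 */
DOC;

$params = [];

// Step 1: one match per #param line; group 2 is the raw labels/texts string.
// Without the s flag "." stops at the line break, so (.*) stays on one line.
preg_match_all('/#param\s*(\w+)\s*(.*)/', $doc, $lines, PREG_SET_ORDER);

foreach ($lines as $line) {
    // Step 2: split the raw string into label/text pairs.
    preg_match_all('/%(\w+):([^%]+)/', $line[2], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $params[$line[1]][$pair[1]] = trim($pair[2]);
    }
}

print_r($params);
```

A #param line without any labels simply produces no entry in $params here; adjust that if you need to keep such variables.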
Awesome Solution:
Finally, we can use some regular expression magic to do a global match of all label/text combinations. This means we will have a defined set of 3 capture groups (variable, label, text) and we'll have a variable amount of matches. I think this one is best to show and then explain, so here is the crazy awesome regex magic:
(?: (?# start non-capturing group)
#param (?# find a param)
\s* (?# whitespace)
(\w+) (?# capture the variable)
\s* (?# whitespace)
| (?# OR)
\G (?# start back over from our last match)
) (?# end non-capturing group)
%(\w+): (?# capture the label)
([^%\n]+) (?# capture the text)
This one revolves around the PCRE anchor \G, which matches at the position where the previous match ended (or at the start of the subject on the first attempt). So we start a non-capturing group that contains the "prefix" of a #param definition. It will either match and capture the variable OR resume from the end of our last match. Then we match/capture one label/text group. On the next match attempt we start where we left off: the variable capture group is empty (the variable isn't present that far into the string, so you'll need your own logic to track which variable you are on) and another label/text group is captured. This continues until we hit a new line, since the text can't contain % or \n. The following match attempt then finds a new variable defined by #param. I think this will be your best option; it just takes some more logic on your end.
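The "logic on your end" might be sketched like this in PHP. The pattern is the one above on a single line without the x flag (so # stays literal); the $current variable and the $params layout are assumptions made for illustration.

```php
<?php
$doc = <<<'DOC'
/**
 * #param variable_a %label:This is variable: a %required:true
 * #param variable_b %required:false %pattern:/[a-zA-Z:]/
 */
DOC;

// Either a new "#param name" prefix, or \G to continue right after the previous match.
$pattern = '/(?:#param\s*(\w+)\s*|\G)%(\w+):([^%\n]+)/';
preg_match_all($pattern, $doc, $matches, PREG_SET_ORDER);

$params  = [];
$current = null; // the variable the following label/text pairs belong to

foreach ($matches as $m) {
    if ($m[1] !== '') {
        // Group 1 is only filled when the "#param" branch matched.
        $current = $m[1];
    }
    $params[$current][$m[2]] = trim($m[3]);
}

print_r($params);
```

Each element of $matches carries the same three groups; only the first pair of every line has a non-empty variable in group 1.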

Well, if you allow your middle label to contain a : but you don't allow it in your end label, I believe the below RegEx should work well enough:
#param\s+(.+?)\s+(.+:.+)\s+([^:]+:[^:]+)$
However, it won't work if your pattern spans multiple lines.
Also, if you're trying to parse PHPDoc or some variant thereof, you should write your own parser rather than using RegEx.

Related

Why is non-greedy match consuming entire pattern even when followed by another non-greedy match

Using PHP8, I'm struggling to figure out how to conditionally match some key that may or may not appear in a string.
I would like to match both
-----------key=xyz---------------
AND
--------------------------
The dashes ("-") could be any non-space character; they are only used here to make the example easier to read.
The regex matches "key=..." if its containing group is greedy, like below.
But this isn't adequate, because the full match will fail if "key=xyz" is missing from the subject string.
/
(\S*)?
(key\=(?<foundkey>[[:alnum:]-]*))
\S*
/x
If that capture group is non-greedy, then the regex just ignores the key group and never matches any "key=xyz":
/
(\S*)?
(key\=(?<foundkey>[[:alnum:]-]*))?
\S*
/x
I tried debugging in this regex101 example but couldn't figure it out.
I sorted this out using multiple regexes, but I'm hoping someone can help address my misunderstandings so I learn how to make this work as a single regex.
Thanks
You may use:
/
^
\S*?
(?:
key=(?<foundkey>\w+)
\S*
)?
$
/xm
RegEx Breakdown:
^: Start of line
\S*?: Match 0 or more non-whitespace characters, non-greedy
(?:: Start non-capturing group
key=(?<foundkey>\w+): Match key= followed by 1+ word characters, captured as group foundkey
\S*: Match 0 or more non-whitespace characters
)?: End non-capturing group. ? makes it an optional match
$: End of line
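A quick way to check this pattern from PHP against both sample subjects from the question. It is a sketch only; the pattern is written on one line without the x flag, and the $found array is just for illustration.

```php
<?php
$pattern = '/^\S*?(?:key=(?<foundkey>\w+)\S*)?$/m';

$subjects = [
    '-----------key=xyz---------------',
    '--------------------------',
];

$found = [];
foreach ($subjects as $subject) {
    $ok = preg_match($pattern, $subject, $m);
    // When the optional group does not take part in the match,
    // the 'foundkey' entry is simply absent from $m.
    $found[] = [$ok, $m['foundkey'] ?? null];
}

var_dump($found);
```

Both subjects match; only the first one yields a value for foundkey.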

Regex pattern for splitting BEM string into parts (PHP)

I would like to isolate the block, element and modifier parts of a string via PHP regex. The flavour of BEM I'm using is lowercase and hyphenated. For example:
this-defines-a-block__this-defines-an-element--this-defines-a-modifier
My string is always formatted as the above, so the regex does not need to filter out any invalid BEM, for example, I will never have dirty strings such as:
This.defines-a-block__this-Defines-an-ELEMENT--090283
Block, Element and Modifier names could contain numbers, so we could have any combination of the following:
this-is-block-001__this-is-element-001--modifier-002
Finally a modifier is optional so not every string will have one for example:
this-is-a-block-001__this-is-an-element
this-is-a-block-002__this-is-an-element--this-is-an-optional-modifier
I am looking for some regex to return each section of the BEM markup. Each string will be isolated and sent to the regex individually, not as a group or as multiline strings. The following sent individually:
# String 1
block__element--modifier
# String 2
block-one__element-one--modifier-one
# String 3
block-one-big__element-one-big--modifier-one-big
# String 4
block-one-001__element-one-001
Would return:
# String 1
block
element
modifier
# String 2
block-one
element-one
modifier-one
# String 3
block-one-big
element-one-big
modifier-one-big
# String 4
block-one-001
element-one-001
You could use 3 capturing groups and make the third one optional using the ?
As all 3 groups are lowercase, can contain numbers and use the hyphen as a delimiter you might use a character class [a-z0-9].
You could reuse the pattern for group 1 using (?1)
\b([a-z0-9]+(?:-[a-z0-9]+)*)__((?1))(?:--((?1)))?\b
Explanation
\b Word boundary
( First capturing group
[a-z0-9]+ Repeat 1+ times what is listed in the character class
(?:-[a-z0-9]+)* Repeat 0+ times matching - and 1+ times what is in the character class
) Close group 1
__ Match literally
((?1)) Capturing group 2, recurse group 1
(?: Non capturing group
-- Match literally
((?1)) Capture group 3, recurse group 1
)? Close non capturing group and make it optional
\b Word boundary
Or using named groups:
\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b
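From PHP, the named-group variant could be wrapped like this. The helper name splitBem and the choice to return null for a missing modifier are my own assumptions for the sketch.

```php
<?php
// Hypothetical helper: returns the BEM parts, or null if the string does not match.
function splitBem(string $bem): ?array
{
    $pattern = '/\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b/';

    if (!preg_match($pattern, $bem, $m)) {
        return null;
    }

    return [
        'block'    => $m['block'],
        'element'  => $m['element'],
        // The modifier group is optional, so it may be missing from $m.
        'modifier' => $m['modifier'] ?? null,
    ];
}

print_r(splitBem('block-one-big__element-one-big--modifier-one-big'));
print_r(splitBem('block-one-001__element-one-001'));
```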

Regexp for matching all #property in PHPDoc

I am currently writing a regexp to extract all #property content from the PHPDoc above a class. I have no fields defined in the class because I am using the magic __get() method.
I have made the regexp (?<=#property)(.+(?=))((?<= ).+)
But I can't cut out the spaces around the first group, so if I have something like:
* #property string external_order_id
With regex i presented above i will get:
Full match string external_order_id
Group 1. string (notice spaces around word)
Group 2. external_order_id (there is no unnecessary spaces)
I tried to use (?= ) inside the first group (.+(?=)), but the regex stops after that match.
Can you help me with that? I would be thankful.
Since properties with no type may exist you have to do this:
(?m)#property *\K(?>(\S+) *)??(\S+)$
                            ^^
(?m) sets multiline flag
(?>...) constructs an atomic group that's used for its non-capturing purpose.
?? ungreedy optional quantifier (you may use ? alone however)
$ asserts end of line
Now that you are familiar with the end-of-line assertion, you can use \s with no worries:
(?m)#property\s*\K(?>(\S+)\s*)?(\S+)$
The whitespace lands in Group 1 since the .+ captures those spaces that appear right after the #property substring (the lookbehind (?<=#property) finds a location right after #property, and . matches any chars but line break chars).
Besides, (?=) is redundant as it requires an empty string right after the current location, i.e. "does nothing".
You may use
(?m)#property\h*\K(?:(\S+)\h+)?(\S+)$
Pattern details
(?m) - a multiline modifier making $ match the end of a line rather than a whole string
#property - a literal substring
\h* - 0+ horizontal whitespaces
\K - a match reset operator that discards all text matched so far from the Group 0 buffer
(?: - matches a sequence of patterns:
(\S+) - Group 1: one or more non-whitespace chars
\h+ - 1+ horizontal whitespaces
)? - 1 or 0 times
(\S+) - Group 2: one or more non-whitespace chars
$ - end of a line.
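As a rough usage sketch in PHP: the docblock below is made up for illustration (untyped_field is a hypothetical property without a type), and the name-to-type map is just one possible way to collect the results.

```php
<?php
$doc = <<<'DOC'
/**
 * #property string external_order_id
 * #property untyped_field
 */
DOC;

preg_match_all('/#property\h*\K(?:(\S+)\h+)?(\S+)$/m', $doc, $matches, PREG_SET_ORDER);

$props = [];
foreach ($matches as $m) {
    // Group 2 is the property name; group 1 is the type, empty when omitted.
    $props[$m[2]] = $m[1] !== '' ? $m[1] : null;
}

print_r($props);
```

Thanks to \K, the full match ($m[0]) starts at the type (or at the name if there is no type) and carries no leading spaces.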

regex matches numbers, but not letters

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (ie. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This matches a single character that is none of [, i, or f, in any order or combination. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: is just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next characters do not mark
# the start of either [if or [/if (if its contents match, the lookahead
# fails at this position and the repetition stops here)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
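Run against the string from the question, the pattern behaves as sketched below. The variable names are illustrative only.

```php
<?php
$input   = '[if-abc] 12345 [if-def] abcdef [/if][/if]';
$pattern = '~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s';

preg_match($pattern, $input, $m);

// The outer [if-abc] cannot match, because its content would have to run
// past the inner "[if-", so the first match is the innermost pair.
var_dump($m);
```

To process outer tags as well, one option is to replace the innermost match and run the pattern again until nothing matches.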
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters, I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
Use a period to match any character.

Parsing balanced nested wiki templates and extract a single line parameter's content by a regexp

I know parsing nested strings or HTML is better done by a real parser, but in my case I have simple templates and wanted to extract the content of the Wiki parameter 'title' from a template. It took me a while to achieve this, but thanks to the regex tool of Lars Olav Torvik (http://regex.larsolavtorvik.com/) and this user forum I got there. Maybe someone finds it useful. (We all want to contribute, eh, don't we? ;-) The following code, annotated with comments, does the trick. I had to use lookaround assertions so that two templates don't get mixed together when there is no title in one of them.
I'm not sure yet about the two questions in the regex comments (see (?# Question: …)), i.e. whether I understood the recursive part at (?R). Does it take the content to check from the outermost defined level, i.e. from the second regexp line \{\{ to the last regexp line \}\}? Would that be correct? And what is the difference between ++ and + before the alternative of (?R)? Both seem to work equally when tested.
The original wiki templates on a page (most simple):
$wikiTemplate = "
{{Templ1
| title = (1. template) title
}}
{{Templ2
| any parameter = something {{template}}
}}
{{Templ1
| title = (3. template) title
}}
";
The replacement:
$wikiTemplate = preg_replace(
array(
// tag all templates with START … END and add a TITLE-placeholder before
// and take care of balanced {{ … }} recursiveness
"~(?s) (?# switch to dotall match, i.e. also linebreaks )
\{\{ (?# find two {{ )
(?: (?# group 1 as a non-backreferenced match )
(?: (?# group 2 as a non-backreferenced match )
(?! (?# in group 1 anything but not {{ or }} )
\{\{
| (?# or )
\}\}
)
.
)++ (?# Question: what is the difference between ++ and + here? )
| (?# or )
(?R) (?# is it recursive of what is defined in the outermost,
i.e. 2nd regexp line with \{\{ and last line with \}\}
Question: is that here understood correctly? )
)
* (?# zero or many times of the inner regexp definitions )
\}\} (?# find two }} )
~x", // x-extended → ignore white space in the pattern
// replace TITLE by single line content of title parameter
"~
(?<=TITLE) (?# TITLE must precede the following linebreak but is not
backreferenced within \\0, i.e. the whole returned match)
([\n\r]+) (?#linebr in 1 may also described as . because of
s-modifier dotall)
(?: (?# start non-backreferenced match )
. (?# any character but not followed by START)
(?!START)
)+ (?# multiple times)
(?: (?# start non-backreferenced match )
\|\s*title\s*=\s* (?#find the parameter '| title = ')
)
([^\r\n]+) (?#get title now to \\2 but exclude the line break.
Note it is buggy when there is no line break )
(?: (?# start non-backreferenced match )
. (?# any character but not followed by END)
(?!END)
)
+ (?# multiple times)
. (?# any single character, e.g. the last because as all
stuff before captures anything not followed by END)
(?:END) (?#a not backreferenced END)
~msx", // m-multiline, s-dotall match also linebreaks,
// x-extended → ignore white space in the pattern
),
array(
"TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
# replace the TITLE to TITLEtitle contentTITLE…
"\\2TITLE\\0",
),
$wikiTemplate
);
print_r($wikiTemplate);
The output is then with the titles tagged by TITLE above each template but only if there was a title:
TITLE(1. template) titleTITLE
START{{Templ1
| title = (1. template) title
}}END
TITLE
START{{Templ2
| any parameter = something {{template}}
}}END
TITLE(3. template) titleTITLE
START{{Templ1
| title = (3. template) title
}}END
Any insight into my questions regarding regexp understanding, or some improvements?
Thanks, Andreas.
++ is a possessive quantifier. If you follow any repetition quantifier (+, *, {...}) with a +, it becomes possessive. That means that the regex engine will not backtrack and try fewer repetitions once it has left the repetition for the first time. So a possessive quantifier basically turns the repetition into an atomic group. Sometimes this is only an optimization and sometimes it actually changes what matches.
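A case where the possessive version actually makes a difference, as a minimal sketch. The subject and the two patterns are made up for illustration and are unrelated to the wiki regex.

```php
<?php
$subject = '12345';

// Greedy: \d+ first takes all five digits, then gives one back so the final \d can match.
$greedy = preg_match('/^\d+\d$/', $subject);

// Possessive: \d++ takes all five digits and never gives any back, so the final \d fails.
$possessive = preg_match('/^\d++\d$/', $subject);

var_dump($greedy, $possessive);
```

In the wiki pattern, whatever follows the ++ group (a nested {{ or the closing }}) can never be satisfied by giving back characters from that group, which would explain why + and ++ appear to behave the same there.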
And about your second question: yes, (?R) will simply try to match the full pattern again. For this there is a good article to be found in the PHP documentation of PCRE.
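A compact sketch of that recursion, using the first pattern from the question collapsed onto one line with ~ as delimiter. The sample text is invented for illustration.

```php
<?php
$text = 'a {{Templ2 | x = {{template}} }} b {{Templ1}}';

// {{ ... }} where the body is either a run of characters that does not start
// "{{" or "}}", or a complete nested {{ ... }} matched by recursing into the
// whole pattern via (?R).
$pattern = '~\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}~s';

preg_match_all($pattern, $text, $m);

print_r($m[0]);
```

The nested {{template}} is consumed by the recursive call, so only the two top-level templates are reported as matches.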
For your other questions, a better place to ask this might be on Code Review.
