Regexp for matching all #property in PHPDoc - php

I am writing a regex to extract the content of every #property tag in the PHPDoc block above a class. The class has no declared fields because I am using the magic __get() method.
I have come up with the regex (?<=#property)(.+(?=))((?<= ).+)
But I can't strip the spaces around the first group, so if I have something like:
* #property string external_order_id
With the regex presented above I get:
Full match string external_order_id
Group 1. string (notice spaces around word)
Group 2. external_order_id (there are no unnecessary spaces)
I tried to use (?= ) inside first group (.+(?=)) but it stops regex after that match.
Can you help me with that? I would be thankful.

Since properties with no type may exist you have to do this:
(?m)#property *\K(?>(\S+) *)??(\S+)$
(?m) sets multiline flag
(?>...) constructs an atomic group that's used for its non-capturing purpose.
?? ungreedy optional quantifier (you may use ? alone however)
$ asserts end of line
Live demo
Now that you are familiar with the end-of-line assertion, you can use \s without worry:
(?m)#property\s*\K(?>(\S+)\s*)?(\S+)$

The whitespace lands in Group 1 since the .+ captures those spaces that appear right after the #property substring (the lookbehind (?<=#property) finds a location right after #property, and . matches any chars but line break chars).
Besides, (?=) is redundant as it requires an empty string right after the current location, i.e. "does nothing".
You may use
(?m)#property\h*\K(?:(\S+)\h+)?(\S+)$
See the regex demo.
Pattern details
(?m) - a multiline modifier making $ match the end of a line rather than a whole string
#property - a literal substring
\h* - 0+ horizontal whitespaces
\K - a match reset operator that discards all text matched so far from the Group 0 buffer
(?: - matches a sequence of patterns:
(\S+) - Group 1: one or more non-whitespace chars
\h+ - 1+ horizontal whitespaces
)? - 1 or 0 times
(\S+) - Group 2: one or more non-whitespace chars
$ - end of a line.
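As an illustration of the breakdown above, here is a minimal PHP sketch that collects property names and their optional types. The docblock contents and the variable names are hypothetical, and the sample assumes \n line endings:

```php
<?php
// Hypothetical docblock; the "#property" tag spelling follows the question.
$doc = implode("\n", [
    '/**',
    ' * #property string external_order_id',
    ' * #property int    quantity',
    ' * #property untyped_field',
    ' */',
]);

// "~" is used as the delimiter; "#" is a literal character here
// because the x (extended) flag is not set.
$pattern = '~(?m)#property\h*\K(?:(\S+)\h+)?(\S+)$~';

preg_match_all($pattern, $doc, $matches, PREG_SET_ORDER);

$properties = [];
foreach ($matches as $m) {
    // Group 1 is an empty string when the optional type is absent.
    $properties[$m[2]] = $m[1] !== '' ? $m[1] : null;
}

print_r($properties);
```

Keying the result by property name is only one possible design; it makes a lookup from inside __get() straightforward.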


Regex pattern for splitting BEM string into parts (PHP)

I would like to isolate the block, element and modifier parts of a string via PHP regex. The flavour of BEM I'm using is lowercase and hyphenated. For example:
this-defines-a-block__this-defines-an-element--this-defines-a-modifier
My string is always formatted as the above, so the regex does not need to filter out any invalid BEM, for example, I will never have dirty strings such as:
This.defines-a-block__this-Defines-an-ELEMENT--090283
Block, Element and Modifier names could contain numbers, so we could have any combination of the following:
this-is-block-001__this-is-element-001--modifier-002
Finally a modifier is optional so not every string will have one for example:
this-is-a-block-001__this-is-an-element
this-is-a-block-002__this-is-an-element--this-is-an-optional-modifier
I am looking for some regex to return each section of the BEM markup. Each string will be isolated and sent to the regex individually, not as a group or as multiline strings. The following sent individually:
# String 1
block__element--modifier
# String 2
block-one__element-one--modifier-one
# String 3
block-one-big__element-one-big--modifier-one-big
# String 4
block-one-001__element-one-001
Would return:
# String 1
block
element
modifier
# String 2
block-one
element-one
modifier-one
# String 3
block-one-big
element-one-big
modifier-one-big
# String 4
block-one-001
element-one-001
You could use 3 capturing groups and make the third one optional using the ? quantifier.
As all 3 groups are lowercase, can contain numbers and use the hyphen as a delimiter you might use a character class [a-z0-9].
You could reuse the pattern for group 1 using (?1)
\b([a-z0-9]+(?:-[a-z0-9]+)*)__((?1))(?:--((?1)))?\b
Explanation
\b Word boundary
( First capturing group
[a-z0-9]+ Repeat 1+ times what is listed in the character class
(?:-[a-z0-9]+)* Repeat 0+ times matching - and 1+ times what is in the character class
) Close group 1
__ Match literally
((?1)) Capturing group 2, recurse group 1
(?: Non capturing group
-- Match literally
((?1)) Capture group 3, recurse group 1
)? Close non capturing group and make it optional
\b Word boundary
Regex demo
Or using named groups:
\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b
Regex demo
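A short PHP sketch of how the named-group variant might be used. The function name splitBem and the shape of the returned array are assumptions made for illustration:

```php
<?php
// Splits one BEM string into its parts; returns null when it does not match.
// splitBem is a hypothetical helper name, not part of any library.
function splitBem(string $bem): ?array
{
    $pattern = '/\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b/';

    if (preg_match($pattern, $bem, $m) !== 1) {
        return null;
    }

    return [
        'block'    => $m['block'],
        'element'  => $m['element'],
        // The modifier key is missing from $m when the optional group did not match.
        'modifier' => $m['modifier'] ?? null,
    ];
}

print_r(splitBem('block-one__element-one--modifier-one'));
print_r(splitBem('block-one-001__element-one-001'));
```

Because an unmatched trailing group is left out of the match array, the null-coalescing operator is used for the modifier.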

PHP regex: zero or more whitespace not working

I'm trying to apply a regex constraint to a Symfony form input. The requirement for the input is that the start of the string and all commas must be followed by zero or more whitespace, then a # or # symbol, except when it's the empty string.
As far as I can tell, there is no way to tell the constraint to use preg_match_all instead of just preg_match, but it does have the ability to negate the match. So, I need a regular expression that preg_match will NOT MATCH for the given scenario: any string containing the start of the string or a comma, followed by zero or more whitespace, followed by any character that is not a # or # and is not the end of the string, but will match for everything else. Here are a few examples:
preg_match(..., ''); // No match
preg_match(..., '#yolo'); // No match
preg_match(..., '#yolo, #swag'); // No match
preg_match(..., '#yolo,#swag'); // No match
preg_match(..., '#yolo, #swag,'); // No match
preg_match(..., 'yolo'); // Match
preg_match(..., 'swag,#yolo'); // Match
preg_match(..., '#swag, yolo'); // Match
I would've thought for sure that /(^|,)\s*[^##]/ would work, but it's failing in every case with 1 or more spaces and it appears to be because of the asterisk. If I get rid of the asterisk, preg_match('/(^|,)\s[^##]/', '#yolo, #swag') does not match (as desired) when there's exactly one space, but as soon as I reintroduce the asterisk it breaks for any quantity of spaces > 0.
My theory is that the regex engine is interpreting the second space as a character that is not in the character set [##], but that's just a theory and I don't know what to do about it. I know that I could create a custom constraint to use preg_match_all instead to get around this, but I'd like to avoid that if possible.
You may use
'~(?:^|,)\s*+[^##]~'
Here, the + after * forms the *+ possessive quantifier, which matches 0 or more whitespace chars and prevents the regex engine from backtracking into the \s* pattern if [^##] cannot match the subsequent char.
See the regex demo.
Details
(?:^|,) - either start of string or ,
\s*+ - zero or more whitespace chars, possessively matched (i.e. if the next char is not matched with [^##] pattern, the whole pattern match will fail)
[^##] - a negated character class matching any char but # and #.
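To make the difference concrete, the following PHP sketch runs both the greedy and the possessive pattern over the sample inputs from the question. The variable names are made up for illustration, and the character class is copied as written above:

```php
<?php
$greedy     = '~(?:^|,)\s*[^##]~';   // \s* may give a space back to [^##]
$possessive = '~(?:^|,)\s*+[^##]~';  // \s*+ never gives anything back

$subjects = [
    '', '#yolo', '#yolo, #swag', '#yolo,#swag',
    '#yolo, #swag,', 'yolo', 'swag,#yolo', '#swag, yolo',
];

$results = [];
foreach ($subjects as $s) {
    // Each entry is [greedy result, possessive result]; 1 = match, 0 = no match.
    $results[$s] = [preg_match($greedy, $s), preg_match($possessive, $s)];
}

foreach ($results as $s => [$g, $p]) {
    printf("%-16s greedy=%d possessive=%d\n", "'$s'", $g, $p);
}
```

With the greedy version, the engine backtracks out of \s* on '#yolo, #swag' and lets the negated class match the space itself, which is the behaviour described in the question.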

PHP Regex display either abc or abc xyz format

I am trying to build a regex that matches either Boost Mobile or BoostMobile, whichever is present.
Any suggestions, please?
In NFA regexes, in unanchored alternation groups, the first branch that matches stops the group processing; the other branches further to the right are not tried against the string. You may read more on that at Alternation with The Vertical Bar or Pipe Symbol.
So, swapping the values and simplifying the pattern you could use
/\b(Boost\s*Mobile|Boost)\b/i
However, the most effective way here is through using an optional group:
/\bBoost(?:\s*Mobile)?\b/i
See the regex demo
The i case insensitive modifier is set on the whole regex. You need not switch it on and off at the beginning/end of the pattern. Also, \W* can match an empty string, so your way of checking a word boundary may fail here when \b will work.
Pattern details:
\b - leading word boundary
Boost - a literal substring
(?:\s*Mobile)? - an optional group matching 1 or 0 sequences of
\s* - 0+ whitespaces
Mobile - a literal substring
\b - trailing word boundary
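A quick PHP check of the optional-group pattern against a made-up sample sentence (the text and variable names are hypothetical):

```php
<?php
$pattern = '/\bBoost(?:\s*Mobile)?\b/i';
$text    = 'Plans: Boost Mobile, BoostMobile, boost, and Booster.';

// $m[0] receives every full match found in $text.
preg_match_all($pattern, $text, $m);

print_r($m[0]);
```

"Booster" is skipped because the trailing \b cannot match between "t" and "e".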

Regex pattern construction dealing with recursion

I'm not sure if recursion is the correct way to characterize what's occurring in this pattern, but unfortunately I'm too new with regex to build something that will conform to how this pattern can vary and avoid nested groups.
So the pattern is basically defined as:
#param {item} {label}:{text} {labeln}:{textn}
where labeln and textn is some N instance of the label:text group.
So an example would be
/**
*
* #param name1 test1:this is text for test1 test2:this is text for test2
* #param name2 test3:this is text for test3 test4:this is text for test4 test5:this is text for test5
*
*/
Now ideally I'm trying to capture name1, test1:this is text for test1, and test2:this is text for test2 as matching groups. Same goes for the name2 line. Of course there can be many more examples of name1 and the pseudo "named parameters" can be varied, from none to many. Edit: Colons would not be permitted within the label text since they're reserved as delimiters. Label is strictly alphanumeric, label would probably be restricted to a-zA-Z0-9_,'"-
First question is... is this a recursion problem or did I mischaracterize this?
Second question is... is it possible and if so, how can I achieve this?
Preface:
For the sake of explanation, I decided to clarify your "labels" by preceding them with a %. This can be any reserved symbol or other pattern that helps clear up what is a label/text:
/**
* #param variable_a %label:This is variable: a %required:true
* #param variable_b %required:false %pattern:/[a-zA-Z:]/
*/
Problem:
The problem with capturing repetitive patterns in regular expressions is that you can't have an unknown number of capture groups (i.e. you either need to run a global match that yields a variable number of matches, or capture a fixed number of groups in each match):
#param (?# find a param)
\s* (?# whitespace)
(\w+) (?# capture the variable)
\s* (?# whitespace)
(?: (?# start non capturing group)
%(\w+): (?# capture the label)
([^%\n]+) (?# capture the text)
)+ (?# repeat the non-capturing group)
In this example, I put the label/text capturing code in a non-capturing and repeated (1+ times) group. This allows us to match the whole string, however only the last set of labels/texts are captured (since we only have 3 groups: variable, label, and text).
Straightforward Solution:
Instead of this, you can just match the whole string and then parse the label/text string after-the-fact:
(?# match the whole string)
#param (?# find a param)
\s* (?# whitespace)
(\w+) (?# capture the variable)
\s* (?# whitespace)
(.*) (?# capture the labels/texts)
(?# parse the label/text string)
% (?# the start of a label)
(\w+) (?# capture label)
: (?# end of label)
([^%]+) (?# capture text)
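The two-step approach could look like this in PHP. This is a sketch under stated assumptions: the function name parseParamsTwoStep is hypothetical, the patterns are written on one line without the x flag (with that flag a literal # would start a comment), and every #param line is assumed to carry at least one label:

```php
<?php
// Hypothetical helper: returns [variable => [label => text, ...], ...].
function parseParamsTwoStep(string $doc): array
{
    $params = [];

    // Step 1: one match per #param line; group 2 is the raw label/text tail.
    // Assumes each #param has at least one label on the same line.
    preg_match_all('~#param\s*(\w+)\s*(.*)~', $doc, $lines, PREG_SET_ORDER);

    foreach ($lines as $line) {
        // Step 2: split the tail into label/text pairs.
        preg_match_all('~%(\w+):([^%]+)~', $line[2], $pairs, PREG_SET_ORDER);

        $params[$line[1]] = [];
        foreach ($pairs as $pair) {
            $params[$line[1]][$pair[1]] = trim($pair[2]);
        }
    }

    return $params;
}

$doc = implode("\n", [
    '/**',
    ' * #param variable_a %label:This is variable: a %required:true',
    ' * #param variable_b %required:false %pattern:/[a-zA-Z:]/',
    ' */',
]);

print_r(parseParamsTwoStep($doc));
```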
Awesome Solution:
Finally, we can use some regular expression magic to do a global match of all label/text combinations. This means we will have a defined set of 3 capture groups (variable, label, text) and we'll have a variable amount of matches. I think this one is best to show and then explain, so here is the crazy awesome regex magic:
(?: (?# start non-capturing group)
#param (?# find a param)
\s* (?# whitespace)
(\w+) (?# capture the variable)
\s* (?# whitespace)
| (?# OR)
\G (?# start back over from our last match)
) (?# end non-capturing group)
%(\w+): (?# capture the label)
([^%\n]+) (?# capture the text)
This one revolves around the PCRE \G anchor, which matches at the position where the previous match ended. We start a non-capturing group that contains the "prefix" of a #param definition. It will either match and capture the variable OR resume from the end of the last match. Then we match and capture one label/text group. On the next iteration we start where we left off: the variable capture group will be empty (since no #param appears that far into the string, you'll need your own logic to track which variable you are on), and another label/text group is captured. This continues until we hit a new line, since the text cannot contain % or \n. The next match attempt then finds a new variable defined by #param. I think this will be your best option; it just takes some more logic on your end.
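The "logic on your end" might be sketched in PHP as follows. The pattern is the one above collapsed onto a single line without the x flag, and the variable names and sample docblock are assumptions for illustration:

```php
<?php
$doc = implode("\n", [
    '/**',
    ' * #param variable_a %label:This is variable: a %required:true',
    ' * #param variable_b %required:false %pattern:/[a-zA-Z:]/',
    ' */',
]);

// Either a fresh "#param name" prefix, or \G to continue right after the previous match.
$pattern = '~(?:#param\s*(\w+)\s*|\G)%(\w+):([^%\n]+)~';

preg_match_all($pattern, $doc, $matches, PREG_SET_ORDER);

$params  = [];
$current = null; // the variable the following label/text pairs belong to

foreach ($matches as $m) {
    if ($m[1] !== '') {
        $current = $m[1]; // a new #param started in this match
    }
    if ($current === null) {
        continue; // a label before any #param; ignored in this sketch
    }
    $params[$current][$m[2]] = trim($m[3]);
}

print_r($params);
```

Group 1 is an empty string in every match that came in through the \G branch, which is why the last seen variable name is carried over in $current.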
Well, if you allow your middle label to contain a : but you don't allow it in your end label, I believe the below RegEx should work well enough:
#param\s+(.+?)\s+(.+:.+)\s+([^:]+:[^:]+)$
However, it won't work if your pattern spans multiple lines.
Also, if you're trying to parse PHPDoc or some variant thereof, you should write your own parser rather than using regex.

regex matches numbers, but not letters

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (ie. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This will match a single character that is none of [, i, or f, regardless of the order in which they appear. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (the negative lookahead will cause
# the entire pattern to fail if its contents match)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
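For comparison, this PHP sketch runs the original pattern and the lookahead-based one on the sample string from the question (the variable names are made up):

```php
<?php
$subject = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

// Original attempt: [^\[if] rejects every "[", "i" and "f", so "abcdef" cannot match.
$original = '/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s';

// Lookahead version: each consumed character must not start "[if" or "[/if".
$tempered = '~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s';

$originalResult = preg_match($original, $subject);
$temperedResult = preg_match($tempered, $subject, $m);

var_dump($originalResult, $temperedResult);
print_r($m);
```

Only the innermost tag pair matches, because the content of the outer pair contains another [if.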
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
Use a period to match any character.
