Regex pattern for splitting BEM string into parts (PHP) - php

I would like to isolate the block, element and modifier parts of a string via PHP regex. The flavour of BEM I'm using is lowercase and hyphenated. For example:
this-defines-a-block__this-defines-an-element--this-defines-a-modifier
My string is always formatted as above, so the regex does not need to filter out any invalid BEM. For example, I will never have dirty strings such as:
This.defines-a-block__this-Defines-an-ELEMENT--090283
Block, Element and Modifier names could contain numbers, so we could have any combination of the following:
this-is-block-001__this-is-element-001--modifier-002
Finally, a modifier is optional, so not every string will have one. For example:
this-is-a-block-001__this-is-an-element
this-is-a-block-002__this-is-an-element--this-is-an-optional-modifier
I am looking for some regex to return each section of the BEM markup. Each string will be isolated and sent to the regex individually, not as a group or as multiline strings. The following sent individually:
# String 1
block__element--modifier
# String 2
block-one__element-one--modifier-one
# String 3
block-one-big__element-one-big--modifier-one-big
# String 4
block-one-001__element-one-001
Would return:
# String 1
block
element
modifier
# String 2
block-one
element-one
modifier-one
# String 3
block-one-big
element-one-big
modifier-one-big
# String 4
block-one-001
element-one-001

You could use 3 capturing groups and make the third one optional using the ? quantifier.
As all 3 parts are lowercase, can contain digits and use the hyphen as a delimiter, you might use the character class [a-z0-9] for the chars between the hyphens.
You could reuse the pattern of group 1 with the subroutine call (?1):
\b([a-z0-9]+(?:-[a-z0-9]+)*)__((?1))(?:--((?1)))?\b
Explanation
\b Word boundary
( First capturing group
[a-z0-9]+ Repeat 1+ times what is listed in the character class
(?:-[a-z0-9]+)* Repeat 0+ times matching - and 1+ times what is in the character class
) Close group 1
__ Match literally
((?1)) Capturing group 2, recurse group 1
(?: Non capturing group
-- Match literally
((?1)) Capture group 3, recurse group 1
)? Close non capturing group and make it optional
\b Word boundary
Or using named groups:
\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b
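As a minimal sketch of how the named-group pattern could be used from PHP (the helper name splitBem() and the returned array shape are assumptions made for this example, not part of the answer):

```php
<?php
// splitBem() is a hypothetical helper name used only for this sketch.
function splitBem(string $bem): ?array
{
    // Named-group variant of the pattern; (?&block) reuses the block subpattern.
    $re = '/\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b/';

    if (preg_match($re, $bem, $m) !== 1) {
        return null; // no BEM-shaped substring found
    }

    return [
        'block'    => $m['block'],
        'element'  => $m['element'],
        // preg_match() omits a trailing group that did not take part in the match.
        'modifier' => $m['modifier'] ?? null,
    ];
}

print_r(splitBem('block-one-big__element-one-big--modifier-one-big'));
print_r(splitBem('block-one-001__element-one-001'));
```

Because the modifier group is optional, the sketch falls back to null when the string has no modifier part.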

Related

regex (regular expression): parse the optional end of the string if it exists

Please help me find the optional end of a string and parse it.
I have strings like "Some words" and "Some another words||MyTag". Please help me with a regex that matches either of these strings and, in the second case, extracts "MyTag". So there should be two groups: "Some words" or "Some another words", and "MyTag" or an empty string.
I tried "^([\W\w]+)(?:||(.*))?" but without any success.
You can create 2 capture groups:
^(.+?)(?:\|\|(.+))?$
^ Start of string
(.+?) Capture group 1, match any character except newlines, as few as possible
(?: Non capture group
\|\|(.+) Match || and capture 1+ chars in group 2
)? Close non capture group and make it optional
$ End of string
If the second part is not present, then there will be no group 2 value.
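A small sketch of the pattern in PHP, assuming an empty string is wanted when there is no tag (splitTag() is a hypothetical helper name):

```php
<?php
// splitTag() is a hypothetical helper; it returns [text, tag].
function splitTag(string $s): array
{
    if (preg_match('/^(.+?)(?:\|\|(.+))?$/', $s, $m) !== 1) {
        return ['', '']; // e.g. an empty input string
    }

    // Group 2 is absent from $m when the "||tag" part is missing.
    return [$m[1], $m[2] ?? ''];
}

print_r(splitTag('Some words'));
print_r(splitTag('Some another words||MyTag'));
```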
You might also consider splitting on ||
Another option, without a non greedy dot, is to match any char except |, and to allow a single | only when it is not directly followed by another | using a negative lookahead.
^([^\r\n|]*(?:\|(?!\|)[^\r\n|]*)*)(?:\|\|(.+))?$
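The two alternatives can be compared side by side. This is only a sketch: both function names are hypothetical, and the explode() variant assumes that splitting on the first || is what is wanted.

```php
<?php
// Variant 1: the pattern without a non greedy dot.
function splitTagNoLazy(string $s): array
{
    $re = '/^([^\r\n|]*(?:\|(?!\|)[^\r\n|]*)*)(?:\|\|(.+))?$/';
    if (preg_match($re, $s, $m) !== 1) {
        return ['', ''];
    }
    return [$m[1], $m[2] ?? ''];
}

// Variant 2: no regex at all, split once on the first "||".
function splitTagExplode(string $s): array
{
    // A limit of 2 keeps any later "||" inside the tag part.
    return array_pad(explode('||', $s, 2), 2, '');
}

print_r(splitTagNoLazy('a|b||MyTag'));  // a single | stays in the first part
print_r(splitTagExplode('a|b||MyTag'));
```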

Maximum character length for PHP multiline regular expressions?

I'm trying to evaluate a multiline RegExp with preg_match_all.
Unfortunately there seems to be a character limit around 24,000 characters (24,577 to be specific).
Does anyone know how to get this to work?
Pseudo-code:
<?php
$data = 'TRACE: ' . str_repeat('a', 24577) . "\n"; // a payload of 24,577 characters
preg_match_all('/([A-Z]+): ((?:(?![A-Z]+:).)*)\n/s', $data, $matches);
var_dump($matches);
?>
Working example (with < 24,577 characters): https://3v4l.org/8iRCc
Example that's NOT working (with > 24,577 characters): https://3v4l.org/ceKn6
You might rewrite the pattern using a negated character class instead of the tempered greedy token approach with the negative lookahead:
([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
([A-Z]+): Capture group 1, match 1+ uppercase chars : and a space
( Capture group 2
[^A-Z\r\n]* Match 0+ times any char except A-Z or a newline
(?> Atomic group
(?: Non capture group
\r?\n Match a newline
| Or
[A-Z] Match a single char A-Z
(?![A-Z]*:) Negative lookahead, assert that what follows is not optional chars A-Z and then :
) Close non capture group
[^A-Z\r\n]* Optionally match any char except A-Z or a newline
)* Close atomic group and optionally repeat
)\r?\n Close group 2 and match a newline
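To try the rewritten pattern against a long value, something like the following could be used. The payload length of 30,000 and the sample entries are assumptions picked for this sketch; the only requirement is that the value is longer than the limit mentioned in the question.

```php
<?php
// Sample data (an assumption for this sketch): one long multi-line entry, one short entry.
$data = 'TRACE: ' . str_repeat('a', 30000) . "\nsecond line\nDEBUG: short\n";

// Negated character class version of the pattern, no tempered greedy token.
$re = '/([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n/';

$count = preg_match_all($re, $data, $matches, PREG_SET_ORDER);

if ($count === false) {
    // preg_last_error() reports e.g. a backtrack or JIT stack limit error.
    echo 'PCRE error code: ', preg_last_error(), PHP_EOL;
} else {
    foreach ($matches as $m) {
        echo $m[1], ' => ', strlen($m[2]), ' characters', PHP_EOL;
    }
}
```

Checking for a false return value is worthwhile here, since a failure inside PCRE is otherwise easy to confuse with "no matches".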
If the TRACE: is at the start of the string, you can also add an anchor:
^([A-Z]+): ([^A-Z\r\n]*(?>(?:\r?\n|[A-Z](?![A-Z]*:))[^A-Z\r\n]*)*)\r?\n
Edit
If every entry starts with the same format, you can match the first line and then all following lines that do not start with that opening format.
^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)
The pattern matches:
^ Start of string
([A-Z]+): Capture group 1
( Capture group 2
.* Match the rest of the line
(?:\r?\n(?![A-Z]+: ).*)* Repeat matching all lines that do not start with the pattern [A-Z]+:
) Close group 2
In PHP you can use the pattern with the m flag, so that ^ matches at the start of every line:
$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';
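A short sketch of how that pattern might be applied with preg_match_all. The sample data and the $entries array are assumptions made for illustration.

```php
<?php
// Sample data (an assumption): the TRACE entry spans two lines.
$data = "TRACE: first line\ncontinued line\nDEBUG: second entry\n";

$re = '/^([A-Z]+): (.*(?:\r?\n(?![A-Z]+: ).*)*)/m';

$entries = [];
if (preg_match_all($re, $data, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // The last capture can include the final newline of $data, so trim it.
        $entries[$m[1]] = rtrim($m[2], "\r\n");
    }
}

print_r($entries);
```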
Try this
preg_match('/\A(?>[^\r\n]*(?>\r\n?|\n)){0,4}[^\r\n]*\z/',$data)

I'm trying to capture data in a web url with regex

I'm trying to build my regex to match my urls
Here are 2 example urls
category/sorganiser/bouger/escalade/offre/78934/
category/sorganiser/savourer/offre/8040/
I would like to get the number just after offre (78934 and 8040)
as well as the word just before the word offre (escalade and savourer)
I tried several patterns, but none of them matched:
^category/(((\w)+/){1,3})(\d+)/?$
^category/(((\w)+/){1,3})/offre/(\d+)/?$
https://regex101.com/r/S4MTvK/1
Thank you
Instead of repeating a group that holds a single word char, (\w)+, you can repeat 1+ word chars inside a single group: (\w+)
Note to not match an extra / before offre, as the / is already matched in each iteration of ^category/(?:(\w+)/){1,3}
You can put the capture group inside a repeated non capture group (?:...) so that group 1 holds the value of the last iteration.
^category/(?:(\w+)/){1,3}offre/(\d+)
The pattern matches
^ Start of string
category/ Match literally
(?: Non capture group
(\w+)/ Capture group 1, match 1+ word chars and match /
){1,3} Close the non capture group and repeat it 1-3 times; capture group 1 keeps the last iteration of 1+ word chars, which is escalade or savourer for the example urls
offre/ Match literally
(\d+) Capture group 2, match 1+ digits
To also match an optional / before the end of the string:
^category/(?:(\w+)/){1,3}offre/(\d+)/?$
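In PHP this could look roughly as follows. parseOfferUrl() and the array keys are hypothetical names chosen for the sketch; the ~ delimiter is used so the slashes in the pattern need no escaping.

```php
<?php
// parseOfferUrl() is a hypothetical helper name for this sketch.
function parseOfferUrl(string $path): ?array
{
    if (preg_match('~^category/(?:(\w+)/){1,3}offre/(\d+)/?$~', $path, $m) !== 1) {
        return null;
    }

    // Group 1 holds the last word matched by the repeated group.
    return ['word' => $m[1], 'id' => (int) $m[2]];
}

print_r(parseOfferUrl('category/sorganiser/bouger/escalade/offre/78934/'));
print_r(parseOfferUrl('category/sorganiser/savourer/offre/8040/'));
```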

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks
One option could be using lookaheads to assert at least 3 digits, no second backslash and no second hyphen.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert what is on the left is not a non whitespace char
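(?=(?:[^\d\s]*\d){3}) Assert that what is on the right contains 3 digits: 3 times 0+ chars other than a digit or whitespace char, followed by a digit
(?!(?:[^\s-]*-){2}) Assert that what is on the right does not contain 2 hyphens: 2 times 0+ chars other than a whitespace char or hyphen, followed by a hyphen
(?!(?:[^\s\\]*\\){2}) Assert that what is on the right does not contain 2 backslashes: 2 times 0+ chars other than a whitespace char or backslash, followed by a backslash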
[A-Z0-9/\\-]+ Match any of the listed 1+ times
(?!\S) Assert what is on the right is not a non whitespace char
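A sketch of this pattern applied to a whitespace-separated list in PHP. The sample input is an assumption. The pattern is kept in a nowdoc so the backslashes reach PCRE unchanged; in a single-quoted PHP string, \\ would be collapsed to a single backslash. Note that, as written, the pattern limits backslashes rather than forward slashes, so a value such as 01/01/2018 would still be accepted; the sample therefore uses a date with two hyphens as the rejected case.

```php
<?php
// Nowdoc: no escape processing, so the pattern is passed to PCRE verbatim.
$re = <<<'RE'
~(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)~
RE;

// Sample input (an assumption): AA-AA has no digits, 01-01-2018 has two hyphens.
$input = '02ABU-D9435 013DFC AA-AA 01-01-2018 03SDFA9433/0';

preg_match_all($re, $input, $m);
print_r($m[0]);
```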
Your conditions can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Remark
This solution assumes that each string tested resides on its own line, i.e. the strings are not just separated by whitespace.
In case the strings are separated by whitespace, choose the solution of user @TheFourthBird (which is essentially the same as this one but caters for the whitespace separation).
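For strings that are tested one at a time, the line-anchored pattern could be wrapped like this (a sketch only; hasValidShape() and the candidate list are assumptions):

```php
<?php
// hasValidShape() is a hypothetical helper name for this sketch.
function hasValidShape(string $s): bool
{
    // At least 3 digits, fewer than 2 hyphens, fewer than 2 slashes.
    return preg_match('~^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$~', $s) === 1;
}

$candidates = ['02ABU-D9435', '1123451', '03SDFA9433/0', '01/01/2018', 'AA-AA'];

// array_values() reindexes the result after filtering.
$valid = array_values(array_filter($candidates, 'hasValidShape'));
print_r($valid);
```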
You can test the condition for both the hyphen and the slash into a same lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
Detailed:
~ # using the tilde as the pattern delimiter avoids escaping the slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that neither the hyphen nor the slash occurs twice
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # any remaining allowed characters until the end of the string
\z # end of the string.
~x
To better understand this (in case you are not familiar with it): these three subpatterns all start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the two first are zero-width assertions that are simple tests and don't consume any character.
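A quick way to exercise the backreference pattern from PHP, as a sketch (isValidCode() is a hypothetical name, and the sample values come from the question plus one made-up value with both a hyphen and a slash):

```php
<?php
// isValidCode() is a hypothetical helper name for this sketch.
function isValidCode(string $s): bool
{
    // ([-/]) captures a hyphen or slash; \1 then looks for the same char again.
    return preg_match('~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~', $s) === 1;
}

foreach (['02ABU-D9435', '03SDFA9433/0', '1-2/3', '01/01/2018', 'AA-AA'] as $s) {
    echo $s, ': ', isValidCode($s) ? 'match' : 'no match', PHP_EOL;
}
```

One hyphen together with one slash is accepted, because the backreference only rejects the same character occurring twice.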

Regexp for matching all #property in PHPDoc

I am currently writing a regexp to extract all #property entries from the PHPDoc above a class. The class has no fields defined because I am using the magic __get() method.
I have come up with the regexp (?<=#property)(.+(?=))((?<= ).+)
But I can't cut out the spaces around the first group, so if I have something like:
* #property string external_order_id
With the regex I presented above I will get:
Full match string external_order_id
Group 1. string (notice spaces around word)
Group 2. external_order_id (there is no unnecessary spaces)
I tried to use (?= ) inside the first group (.+(?=)) but the regex stops matching after that.
Can you help me with that? I would be thankful.
Since properties with no type may exist you have to do this:
(?m)#property *\K(?>(\S+) *)??(\S+)$
                            ^^
(?m) sets multiline flag
(?>...) constructs an atomic group that's used for its non-capturing purpose.
?? ungreedy optional quantifier (you may use ? alone however)
$ asserts end of line
Now that you are familiar with the end of line assertion, you can use \s instead of a literal space:
(?m)#property\s*\K(?>(\S+)\s*)?(\S+)$
The whitespace lands in Group 1 since the .+ captures those spaces that appear right after the #property substring (the lookbehind (?<=#property) finds a location right after #property, and . matches any chars but line break chars).
Besides, (?=) is redundant as it requires an empty string right after the current location, i.e. "does nothing".
You may use
(?m)#property\h*\K(?:(\S+)\h+)?(\S+)$
Pattern details
(?m) - a multiline modifier making $ match the end of a line rather than a whole string
#property - a literal substring
\h* - 0+ horizontal whitespaces
\K - a match reset operator that discards all text matched so far from the Group 0 buffer
(?: - matches a sequence of patterns:
(\S+) - Group 1: one or more non-whitespace chars
\h+ - 1+ horizontal whitespaces
)? - 1 or 0 times
(\S+) - Group 2: one or more non-whitespace chars
$ - end of a line.
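A minimal sketch of this pattern run over a docblock with preg_match_all. The sample docblock and the 'type'/'name' keys are assumptions; the marker is written exactly as it appears in the question.

```php
<?php
// Sample docblock (an assumption); the second property has no type.
$doc = <<<'DOC'
/**
 * #property string external_order_id
 * #property id
 */
DOC;

$re = '/(?m)#property\h*\K(?:(\S+)\h+)?(\S+)$/';

$properties = [];
if (preg_match_all($re, $doc, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $properties[] = [
            // Group 1 is an empty string when the optional type is missing.
            'type' => $m[1] !== '' ? $m[1] : null,
            'name' => $m[2],
        ];
    }
}

print_r($properties);
```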
