Regex two separate nested capturing groups

I have a regex and test case on
https://regex101.com/r/5Z5Lop/1
^(?<KEY>CONF|ESD|TRACKING)[:;'\s]\s*(?<DATA>.*?)\s*(?:L[:;'\s]\s*\K(?<LINE_DATA>.*?))?(?<INITIALS>\*[a-zA-Z]+)?\s*$
See the LINE_DATA named group.
Is it possible to split that group up into two separate groups?
I want one group LINE_NUMBERS to hold all integers not contained in parentheses.
Then, 1 group called QTYS to hold all integers that are contained in parentheses.
So currently LINE_DATA yields "1,2,3(4),5(12) "
Is it possible to have LINE_NUMBERS be [1,2,3,5] (either an array or some kind of string)
and then QTYS to be [(4),(12)] Note: I do still want to capture the parentheses.
I would like to do this in the current regex if it's possible and doesn't overly complicate what I currently have.
Right now, I'm obtaining this data through post-processing with separate regexes. I'm using PHP:
preg_match_all('/\d+(?!\s*\))/i', $ret_data['LINE_DATA'], $ret_data['LINE_NUMBERS']);
preg_match_all('/\(\s*\d+\s*\)/i', $ret_data['LINE_DATA'], $ret_data['QUANTITIES']);
Thanks!

You can use a single pattern in the post-processing for the QUANTITIES and the LINE_NUMBERS using an alternation | and removing the empty entries from the result.
$re = '/^(?<KEY>CONF|ESD|TRACKING)[:;\'\s]\s*(?<DATA>.*?)\s*(?:L[:;\'\s]\s*\K(?<LINE_DATA>.*?))?(?<INITIALS>\*[a-zA-Z]+)?\s*$/i';
$str = 'esd: here is my data L: 1,2,3(4),5(12) *sm ';
preg_match($re, $str, $matches);
preg_match_all('/(?<QUANTITIES>\(\d+\))|(?<LINE_NUMBERS>\d+)/', $matches["LINE_DATA"], $numbers);
print_r(array_filter($numbers["QUANTITIES"]));
print_r(array_filter($numbers["LINE_NUMBERS"]));
Output
Array
(
[3] => (4)
[5] => (12)
)
Array
(
[0] => 1
[1] => 2
[2] => 3
[4] => 5
)
There could be an option to use the \G anchor to get 2 separate groups for the given example data, but it will make the INITIALS part after it optional:
^(?<KEY>CONF|ESD|TRACKING)[:;'\s]\s*(?<DATA>.*?)\s*L[:;'\s]\s*|\G(?!^)(?:(?<QUANTITIES>\(\d+\))|(?<LINE_NUMBERS>\d+)),?(?:\s*(?<INITIALS>\*[a-zA-Z]+)\s*$)?
^ Start of string
(?<KEY>CONF|ESD|TRACKING)[:;'\s]\s* The KEY group with alternatives, then a single char listed in the character class and optional whitespace chars
(?<DATA>.*?)\s* Match the DATA group, any char non-greedy, followed by optional whitespace chars
L[:;'\s]\s* Match L, then any of the listed chars and optional whitespace chars
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
(?: Non capture group
(?<QUANTITIES>\(\d+\)) Group QUANTITIES, match 1+ digits between parentheses
| Or
(?<LINE_NUMBERS>\d+) Group LINE_NUMBERS, match 1+ digits
) Close non capture group
,? Match an optional comma
(?:\s*(?<INITIALS>\*[a-zA-Z]+)\s*$)? Optional non capture group with group INITIALS
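As a rough sketch of how the \G pattern could be consumed from PHP: the snippet below reuses the sample string and the /i flag from the first example (both are assumptions carried over, since the \G pattern was given without delimiters or flags). The $nonEmpty helper is hypothetical and only strips the empty slots that preg_match_all leaves for groups that did not take part in a given match.

```php
<?php
// Sketch: run the \G-based pattern with preg_match_all and collect the groups.
// Delimiters, the /i flag and the sample string are assumptions taken from
// the earlier example, not part of the pattern as posted.
$re = '/^(?<KEY>CONF|ESD|TRACKING)[:;\'\s]\s*(?<DATA>.*?)\s*L[:;\'\s]\s*'
    . '|\G(?!^)(?:(?<QUANTITIES>\(\d+\))|(?<LINE_NUMBERS>\d+)),?'
    . '(?:\s*(?<INITIALS>\*[a-zA-Z]+)\s*$)?/i';
$str = 'esd: here is my data L: 1,2,3(4),5(12) *sm ';

preg_match_all($re, $str, $m);

// Each named group gets one slot per overall match; slots where the group
// did not participate hold an empty string, so drop those and reindex.
$nonEmpty = fn(array $a): array => array_values(array_filter($a, fn($v) => $v !== ''));

$lineNumbers = $nonEmpty($m['LINE_NUMBERS']); // digits outside parentheses
$quantities  = $nonEmpty($m['QUANTITIES']);   // digits including their parentheses
$initials    = $nonEmpty($m['INITIALS']);     // trailing *initials, if present

print_r($lineNumbers);
print_r($quantities);
print_r($initials);
```

Filtering on `!== ''` rather than calling array_filter without a callback keeps a line number of 0 in the result, which the default truthiness check would drop.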

Related

Regex to get all numbers after a character

I have strings that are expected to be in the format of something like
"C 1,13,7,2,55"
I would expect matches to be [1,13,7,2,55].
I want to match all numbers in that "csv" portion of the string, but only if it comes after "C " (note the space after the 'C').
This comes from user input, so I want to account for case, multiple spaces between tokens, accidental double commas, etc.
I.e.
"c 1 , 12,15 , 8 , 9,10,11"
I want matches to be [1,12,15,8,9,10,11]
But I only want to attempt to match on numbers after the "C" char (case-insensitive).
So "1,2 , 4,5" and "d 12456, 9890" should fail .
Here's the regex I have half-baked so far.
Note: This will ultimately get ported over to PHP and so I will be using preg_match_all
/(?<=C)*\d+/gim
I use a positive lookbehind (but match as many times as needed) for the "C" char. Then match on 1 or more digits globally.
I haven't created all my unit tests yet, but I think this may work.
Is there a better way to do this?
Is matching on 1 or more positive lookbehinds standard?
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
When would including the 'm' multi-line flag even make a difference here?
Thanks!

The simplest option is probably to first test for the "C" using the case-insensitive stripos before matching the digits with \d+. For example:
$input = "c 1 , 12,15 , 8 , 9,10,11";
if (stripos($input, "C") === 0) {
preg_match_all("/\d+/", $input, $matches);
print_r($matches);
}
The condition could be for example stripos($input, "C") !== false if the "C" does not have to be the first character.
To validate that the string starts with "C" (possibly after horizontal whitespace), and contains only horizontal whitespace, commas and digits then the test could instead be
if (preg_match("/^\h*C[\h\d,]+$/i", $input)) {
The lookbehind in your regex /(?<=C)*\d+/gim is made optional by the *, so the regex does not require that a "C" is present for the digits to be matched. It is functionally equivalent to just /\d+/g.
Is matching on 1 or more positive lookbehinds standard?
In this case, the lookbehind would need to be variable-width, (?<=C.*), and PHP does not support variable-width lookbehinds.
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
PHP does not support the use of the * quantifier inside a lookbehind.
When would including the 'm' multi-line flag even make a difference here?
You would only need the m flag if your input spans multiple lines and you use ^ or $, which then assert the start or end of each line rather than only of the whole string.
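Putting the validation and the extraction together could look roughly like this. The function name numbersAfterC is made up for illustration; the two patterns are the ones from this answer.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answer):
// validate the whole string first, then pull out the digit runs.
function numbersAfterC(string $input): array
{
    // Optional leading horizontal whitespace, a case-insensitive "C",
    // then only horizontal whitespace, digits and commas up to the end.
    if (!preg_match('/^\h*C[\h\d,]+$/i', $input)) {
        return [];
    }
    preg_match_all('/\d+/', $input, $matches);
    return $matches[0];
}

print_r(numbersAfterC("c 1 , 12,15 , 8 , 9,10,11"));
print_r(numbersAfterC("1,2 , 4,5"));      // no leading "C": empty array
print_r(numbersAfterC("d 12456, 9890"));  // wrong letter: empty array
```

Note that this validation does not insist on a space after the "C", so "C1,2" would also pass. If the space is required, \h+ could be placed between C and the character class.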

The pattern /(?<=C)*\d+/gim is not valid in JavaScript, for example, because of the quantifier after the lookbehind assertion.
If you want to write it in JavaScript, getting all the digits after a C at the start of the string, and a quantifier inside the lookbehind is supported:
(?<=^C [\d, ]*)\d+
Using (?<=C)*\d+ in PHP, the quantifier makes the lookbehind optional, so it would also match 8 and 9 in, for example, the string 8,9 C 1,13,7,2,55
Using a quantifier with infinite length in a lookbehind assertion is not supported in PHP so you can not use (?<=C\h+)\d+ where \h+ would match 1+ spaces due to S
If you are using PHP, you can make use of the \G anchor to match only consecutive numbers after the first C character.
For a single line, you don't need the multi line flag. You do need it for multiple lines due to the anchor.
(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+
The pattern matches:
(?: Non capture group
^ Start of string
\h*C\h+ Match optional spaces, then C and 1+ spaces
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
) Close the non capture group
\h*,*\h*\K Match optional commas between optional spaces, then forget what has been matched so far
\d+ Match 1 or more digits
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$strings = [
"C 1,13,7,2,55",
"c 1 , 12,15 , 8 , 9,10,11",
"1,2 , 4,5",
"d 12456, 9890"
];
foreach ($strings as $s) {
if (preg_match_all($regex, $s, $matches)) {
print_r($matches[0]);
}
}
Output
Array
(
[0] => 1
[1] => 13
[2] => 7
[3] => 2
[4] => 55
)
Array
(
[0] => 1
[1] => 12
[2] => 15
[3] => 8
[4] => 9
[5] => 10
[6] => 11
)

How can I extract values that have opening and closing brackets with regular expression?

I am trying to extract [[String]] with regular expression. Notice how a bracket opens [ and it needs to close ]. So you would receive the following matches:
[[String]]
[String]
String
If I use \[[^\]]+\] it will just find the first closing bracket it comes across without taking into consideration that a new one has opened in between and it needs the second close. Is this at all possible with regular expression?
Note: This type can either be String, [String] or [[String]] so you don't know upfront how many brackets there will be.

You can use the following PCRE compliant regex:
(?=((\[(?:\w++|(?2))*])|\b\w+))
Details:
(?= - start of a positive lookahead (necessary to match overlapping strings):
( - start of Capturing group 1 (it will hold the "matches"):
(\[(?:\w++|(?2))*]) - Group 2 (technical, used for recursing): [, then zero or more occurrences of one or more word chars or the whole Group 2 pattern recursed, and then a ] char
| - or
\b\w+ - a word boundary (necessary since all overlapping matches are being searched for) and one or more word chars
) - end of Group 1
) - end of the lookahead.
See the PHP demo:
$s = "[[String]]";
if (preg_match_all('~(?=((\[(?:\w++|(?2))*])|\b\w+))~', $s, $m)){
print_r($m[1]);
}
Output:
Array
(
[0] => [[String]]
[1] => [String]
[2] => String
)

Regex pattern for splitting BEM string into parts (PHP)

I would like to isolate the block, element and modifier parts of a string via PHP regex. The flavour of BEM I'm using is lowercase and hyphenated. For example:
this-defines-a-block__this-defines-an-element--this-defines-a-modifier
My string is always formatted as the above, so the regex does not need to filter out any invalid BEM, for example, I will never have dirty strings such as:
This.defines-a-block__this-Defines-an-ELEMENT--090283
Block, Element and Modifier names could contain numbers, so we could have any combination of the following:
this-is-block-001__this-is-element-001--modifier-002
Finally a modifier is optional so not every string will have one for example:
this-is-a-block-001__this-is-an-element
this-is-a-block-002__this-is-an-element--this-is-an-optional-modifier
I am looking for some regex to return each section of the BEM markup. Each string will be isolated and sent to the regex individually, not as a group or as multiline strings. The following sent individually:
# String 1
block__element--modifier
# String 2
block-one__element-one--modifier-one
# String 3
block-one-big__element-one-big--modifier-one-big
# String 4
block-one-001__element-one-001
Would return:
# String 1
block
element
modifier
# String 2
block-one
element-one
modifier-one
# String 3
block-one-big
element-one-big
modifier-one-big
# String 4
block-one-001
element-one-001

You could use 3 capturing groups and make the third one optional using the ? quantifier.
As all 3 groups are lowercase, can contain numbers and use the hyphen as a delimiter you might use a character class [a-z0-9].
You could reuse the pattern for group 1 using (?1)
\b([a-z0-9]+(?:-[a-z0-9]+)*)__((?1))(?:--((?1)))?\b
Explanation
\b Word boundary
( First capturing group
[a-z0-9]+ Repeat 1+ times what is listed in the character class
(?:-[a-z0-9]+)* Repeat 0+ times matching - and 1+ times what is in the character class
) Close group 1
__ Match literally
((?1)) Capturing group 2, recurse group 1
(?: Non capturing group
-- Match literally
((?1)) Capture group 3, recurse group 1
)? Close non capturing group and make it optional
\b Word boundary
Or using named groups:
\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b
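A minimal sketch of using the named-group pattern from PHP might look like this. The splitBem function name and the returned array shape are assumptions for illustration only.

```php
<?php
// Hypothetical helper name; the pattern is the named-group version above.
function splitBem(string $class): ?array
{
    $re = '/\b(?<block>[a-z0-9]+(?:-[a-z0-9]+)*)__'
        . '(?<element>(?&block))(?:--(?<modifier>(?&block)))?\b/';

    if (!preg_match($re, $class, $m)) {
        return null; // not a block__element[--modifier] string
    }

    // The modifier group is optional, so it may be missing from $m
    // (or empty) when the string has no "--modifier" part.
    $modifier = $m['modifier'] ?? '';

    return [
        'block'    => $m['block'],
        'element'  => $m['element'],
        'modifier' => $modifier !== '' ? $modifier : null,
    ];
}

print_r(splitBem('block-one-big__element-one-big--modifier-one-big'));
print_r(splitBem('block-one-001__element-one-001'));
```

Since each string is sent to the regex on its own, the \b boundaries could be swapped for ^ and $ to reject anything with extra text around it.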

Double regex matches

I'm preg_match_all looping through a string using different patterns. Sometimes these patterns look a lot like each other, but differ slightly.
Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.
The problem I'm running into is that pattern A also matches pattern B:
A:
(\d{4})(A|B)?(C|D)?
... matches 1234, 1234A, 1234AD, etc.
B:
I also have another pattern:
T(\d{4})\/(\d{4})
... which matches strings like: T7878/6767
The result
When running a preg_match_all on "T7878/6767 1234AD", A will give the following matches:
7878, 6767, 1234AD
Does anyone have a suggestion how to prevent A from matching B in a string like "Some text T7878/6767 1234AD and some more text"?
Your help is greatly appreciated!

Scenario with boundaries
If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.
If you expect a whitespace boundary before each match, then add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. If there can be any chars (as is in your original question), then SKIP-FAIL is the only way (see below).
So, in this first case, you may use
(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)
and
(?<!\S)T(\d{4})\/(\d{4})(?!\S)
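To see what the whitespace boundaries change, here is a small sketch that runs both patterns against the sentence from the question. The variable names are arbitrary.

```php
<?php
$text = 'Some text T7878/6767 1234AD and some more text';

// Pattern A with whitespace boundaries: the digits inside T7878/6767 are
// preceded by "T" or "/", so the (?<!\S) lookbehind rejects them.
preg_match_all('/(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)/', $text, $a);

// Pattern B with the same boundaries.
preg_match_all('/(?<!\S)T(\d{4})\/(\d{4})(?!\S)/', $text, $b);

print_r($a[0]); // full matches of pattern A
print_r($b[0]); // full matches of pattern B
```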
Scenario with no specific boundaries
You need to make sure the second pattern is skipped when you parse the string with the first one. Use SKIP-FAIL technique for this:
'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'
If you do not need the capturing groups, you may simplify it to
'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'
Details
T\d{4}/\d{4} - T followed with 4 digits, / and another 4 digits
(*SKIP)(*F) - the matched text is discarded and the next match is searched from the matched text end
| - or
\d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
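For comparison, a quick sketch that runs pattern A with and without the SKIP-FAIL prefix on the sentence from the question (variable names are arbitrary):

```php
<?php
$text = 'Some text T7878/6767 1234AD and some more text';

// Without SKIP-FAIL: the digits inside T7878/6767 are picked up too.
preg_match_all('~\d{4}[AB]?[CD]?~', $text, $plain);

// With SKIP-FAIL: T7878/6767 is consumed and thrown away first, so only
// the stand-alone code remains.
preg_match_all('~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~', $text, $skipped);

print_r($plain[0]);
print_r($skipped[0]);
```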

From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]
Here's your new regex:
T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)
For the string input:
T7878/6767 1234AB
We get the groups:
Match 1
Full match 0-17 `T7878/6767 1234AB`
Group 1. 1-5 `7878`
Group 2. 6-10 `6767`
Group 3. 11-17 `1234AB`

Your syntax is pretty specific, so your regex needs to be just as specific. Get rid of all your capture groups because they are getting in the way. You only need two groups which match your string syntax exactly.
The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and a word boundary.
The second group matches 4 digits and then between 0 and 2 of the letters A-D. It is preceded by a negative lookbehind, so it will only match if the 4 digits come directly after whitespace or at the start of the string.
(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})
Preg match all output:
Array
(
[0] => Array
(
[0] => T7878/6767
[1] => 1234AB
)
[1] => Array
(
[0] => T7878/6767
[1] =>
)
[2] => Array
(
[0] =>
[1] => 1234AB
)
)

Capture two regular expression from string

I have strings containing this (where the number are integers representing the user id)
#[calbert](3)
#[username](684684)
I figured I need the following to get the username and user id
\((.*?)\)
and
\[(.*?)])
But is there a way to get both at once?
And PHP returns, is it possible to only get the result without the parenthesis (and brackets in the username case)
Array
(
[0] => (3)
[1] => 3
)

\[([^\]]*)\]|\(([^)]*)\)
Try this. You need the | (or) operator, which lets the regex engine try the second alternative, with its own capturing group, when the first one fails. See the demo:
https://regex101.com/r/tX2bH4/31
$re = "/\\[([^\\]]*)\\]|\\(([^)]*)\\)/im";
$str = " #[calbert](3)\n #[username](684684)";
preg_match_all($re, $str, $matches);
Or you can use your own regex: \((.*?)\)|\[(.*?)\]

Through positive lookbehind and lookahead assertion.
(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))
(?<=\[) Asserts that the match must be preceded by a [ character.
[^\]]* Matches any char except ], zero or more times.
(?=]) Asserts that the match must be followed by a ] character.
| Logical OR operator, used to combine the two regexes.
(?<=\()\d+(?=\)) Matches one or more digits that are preceded by ( and followed by ).
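Run through preg_match_all, the lookaround pattern returns the bare values with no brackets or parentheses. The sketch below uses the sample input from the other answer. Pairing the results with array_chunk is an added assumption that every username is immediately followed by its id.

```php
<?php
$str = " #[calbert](3)\n #[username](684684)";

// The lookarounds only assert the surrounding brackets and parentheses,
// so they are not part of the matched text.
preg_match_all('/(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))/', $str, $m);

// $m[0] is a flat list: name, id, name, id, ...
// Group it into [name, id] pairs (assumes the input always alternates).
$pairs = array_chunk($m[0], 2);

print_r($m[0]);
print_r($pairs);
```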
