I'm looping through a string with preg_match_all, using several different patterns. Some of these patterns look a lot like each other and differ only slightly.
Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.
The problem I'm running into is that pattern A also matches pattern B:
A:
(\d{4})(A|B)?(C|D)?
... matches 1234, 1234A, 1234AD, etc.
B:
I also have another pattern:
T(\d{4})\/(\d{4})
... which matches strings like: T7878/6767
The result
When running a preg_match_all on "T7878/6767 1234AD", A will give the following matches:
7878, 6767, 1234AD
Does anyone have a suggestion how to prevent A from matching B in a string like "Some text T7878/6767 1234AD and some more text"?
Your help is greatly appreciated!
Scenario with boundaries
If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.
If you expect a whitespace boundary before each match, add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. Both also succeed at the start and end of the string, respectively. If any characters can surround the match (as in your original question), use the SKIP-FAIL technique instead (see below).
So, in this first case, you may use
(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)
and
(?<!\S)T(\d{4})\/(\d{4})(?!\S)
See Pattern 1 demo and Pattern 2 demo.
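As a minimal sketch of the whitespace-boundary variant, the snippet below runs both patterns over the sample sentence from the question. The variable names ($text, $patternA, $patternB) are illustrative only and not part of the original answer.

```php
<?php
// Sample sentence taken from the question.
$text = 'Some text T7878/6767 1234AD and some more text';

// Pattern A: 4 digits plus optional suffix letters, bounded by whitespace
// or by the start/end of the string.
$patternA = '~(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)~';
// Pattern B: T, 4 digits, a slash and 4 digits, with the same boundaries.
$patternB = '~(?<!\S)T(\d{4})/(\d{4})(?!\S)~';

preg_match_all($patternA, $text, $matchesA);
preg_match_all($patternB, $text, $matchesB);

print_r($matchesA[0]); // full matches of pattern A
print_r($matchesB[0]); // full matches of pattern B
```

With these boundaries, pattern A no longer picks up the digits inside T7878/6767, because they are preceded by T and / rather than by whitespace.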
Scenario with no specific boundaries
You need to make sure the second pattern is skipped when you parse the string with the first one. Use SKIP-FAIL technique for this:
'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'
See the regex demo.
If you do not need the capturing groups, you may simplify it to
'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'
See another demo
Details
T\d{4}/\d{4} - T followed with 4 digits, / and another 4 digits
(*SKIP)(*F) - the matched text is discarded and the next match is searched from the matched text end
| - or
\d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
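To make the effect of (*SKIP)(*F) concrete, here is a small sketch that compares the plain pattern A with the SKIP-FAIL version on the sentence from the question. The variable names are illustrative.

```php
<?php
$text = 'Some text T7878/6767 1234AD and some more text';

// Plain pattern A: also matches the digit runs inside "T7878/6767".
preg_match_all('~(\d{4})(A|B)?(C|D)?~', $text, $plain);

// SKIP-FAIL: the left alternative consumes every "Txxxx/xxxx" token and then
// forces a failure, so only the right alternative can produce a match.
$pattern = '~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~';
preg_match_all($pattern, $text, $skipped, PREG_SET_ORDER);

print_r($plain[0]);
print_r($skipped);
```

PREG_SET_ORDER is used here only so that each match arrives with its own capture groups; the default ordering works as well.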
From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]
Here's your new regex:
T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)
For the string input:
T7878/6767 1234AB
We get the groups:
Match 1
Full match 0-17 `T7878/6767 1234AB`
Group 1. 1-5 `7878`
Group 2. 6-10 `6767`
Group 3. 11-17 `1234AB`
Your syntax is pretty specific, so your regex needs to be just as specific. Get rid of all your capture groups because they are getting in the way. You only need two groups that match your string syntax exactly.
The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and a word boundary.
The second group matches 4 digits and then between 0 and 2 of the letters A-D. It is preceded by a negative lookbehind, (?<!\S), so it only matches if the 4 digits are at the start of the string or come right after a whitespace character.
(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})
Preg match all output:
Array
(
[0] => Array
(
[0] => T7878/6767
[1] => 1234AB
)
[1] => Array
(
[0] => T7878/6767
[1] =>
)
[2] => Array
(
[0] =>
[1] => 1234AB
)
)
Related
I have strings that are expected to be in the format of something like
"C 1,13,7,2,55"
I would expect matches to be [1,13,7,2,55].
I want to match all numbers in that "csv" portion of the string, but only if it comes after "C " (note the space after the 'C').
This comes from user input, so I want to account for case, multiple spaces between tokens, accidental double commas, etc.
I.e.
"c 1 , 12,15 , 8 , 9,10,11"
I want matches to be [1,12,15,8,9,10,11]
But I only want to attempt to match on numbers after the "C" char (case-insensitive).
So "1,2 , 4,5" and "d 12456, 9890" should fail.
Here's the regex I have half-baked so far.
Note: This will ultimately get ported over to PHP and so I will be using preg_match_all
/(?<=C)*\d+/gim
I use a positive lookbehind (but match as many times as needed) for the "C" char. Then match on 1 or more digits globally.
I haven't created all my unit tests yet, but I think this may work.
Is there a better way to do this?
Is matching on 1 or more positive lookbehinds standard?
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
When would including the 'm' multi-line flag even make a difference here?
Thanks!
The simplest option is probably to first test for the "C" using the case-insensitive stripos before matching the digits with \d+. For example:
$input = "c 1 , 12,15 , 8 , 9,10,11";
if (stripos($input, "C") === 0) {
    preg_match_all("/\d+/", $input, $matches);
    print_r($matches);
}
The condition could be for example stripos($input, "C") !== false if the "C" does not have to be the first character.
To validate that the string starts with "C" (possibly after horizontal whitespace), and contains only horizontal whitespace, commas and digits then the test could instead be
if (preg_match("/^\h*C[\h\d,]+$/i", $input)) {
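A possible way to combine that validation with the extraction is sketched below. The helper name extractNumbersAfterC is made up for this illustration; only the two patterns come from the answer.

```php
<?php
// Hypothetical helper (the name is illustrative): returns the numbers that
// follow a leading "C", or an empty array when the input does not validate.
function extractNumbersAfterC(string $input): array
{
    // Validation pattern from the answer: optional horizontal whitespace, "C"
    // (case-insensitive), then only horizontal whitespace, digits and commas.
    if (!preg_match('/^\h*C[\h\d,]+$/i', $input)) {
        return [];
    }
    // The input is known to be well-formed, so every digit run is wanted.
    preg_match_all('/\d+/', $input, $matches);
    return $matches[0];
}

print_r(extractNumbersAfterC('c 1 , 12,15 , 8 , 9,10,11'));
print_r(extractNumbersAfterC('d 12456, 9890'));
```

Note that this check does not require a space after the "C", so an input such as "C1,2" would also be accepted.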
The lookbehind in your regex /(?<=C)*\d+/gim is made optional by the *, so the regex does not require that a "C" is present for the digits to be matched. It is functionally equivalent to just /\d+/g.
Is matching on 1 or more positive lookbehinds standard?
In this case, the lookbehind would need to be variable-width, (?<=C.*), and PHP does not support variable-width lookbehinds.
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
PHP does not support the use of the * quantifier inside a lookbehind.
When would including the 'm' multi-line flag even make a difference here?
You would only want the m flag if your input spans multiple lines and you use ^ or $; with m they assert the start or end of each line instead of the start or end of the whole string.
The pattern /(?<=C)*\d+/gim is not valid in, for example, JavaScript, because a lookbehind assertion cannot be followed by a quantifier.
In JavaScript, where a quantifier inside the lookbehind is supported, you can get all the digits after a C at the start of the string with:
(?<=^C [\d, ]*)\d+
In PHP, the * in (?<=C)*\d+ makes the lookbehind optional, so the pattern would also match 8 and 9 in, for example, the string 8,9 C 1,13,7,2,55
A quantifier of unbounded length inside a lookbehind assertion is not supported in PHP, so you cannot use (?<=C\h+)\d+, where \h+ would match 1 or more horizontal whitespace characters.
If you are using PHP, you can use the \G anchor to match only the consecutive numbers after the first C character.
For a single line, you don't need the multiline flag. You do need it for multiple lines because of the ^ anchor.
(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+
The pattern matches:
(?: Non capture group
^ Start of string
\h*C\h+ Match optional spaces, then C and 1+ spaces
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
) Close the non capture group
\h*,*\h*\K Match optional commas between optional horizontal whitespace, then \K resets the start of the reported match
\d+ Match 1 or more digits
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$strings = [
    "C 1,13,7,2,55",
    "c 1 , 12,15 , 8 , 9,10,11",
    "1,2 , 4,5",
    "d 12456, 9890"
];
// Print the matched numbers for each string; strings without a leading "C" print nothing.
foreach ($strings as $s) {
    if (preg_match_all($regex, $s, $matches)) {
        print_r($matches[0]);
    }
}
Output
Array
(
[0] => 1
[1] => 13
[2] => 7
[3] => 2
[4] => 55
)
Array
(
[0] => 1
[1] => 12
[2] => 15
[3] => 8
[4] => 9
[5] => 10
[6] => 11
)
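The role of the ^ anchor in the first alternative can be seen with the 8,9 C 1,13,7,2,55 string mentioned earlier. The following sketch compares an anchored and an unanchored variant of the \G pattern; the variable names are illustrative.

```php
<?php
// Input with stray numbers before the "C".
$input = '8,9 C 1,13,7,2,55';

// Anchored: the first alternative may only start at the beginning of the string.
$anchored = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
// Unanchored: the first alternative may start at any "C" in the string.
$unanchored = '/(?:\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';

preg_match_all($anchored, $input, $a);
preg_match_all($unanchored, $input, $u);

print_r($a[0]); // nothing: the string does not start with "C"
print_r($u[0]); // the numbers that follow the "C"
```

In both variants the 8 and 9 are left out, because \G only continues from the end of a previous match and never starts a chain on its own.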
I have a regex and test case on
https://regex101.com/r/5Z5Lop/1
^(?<KEY>CONF|ESD|TRACKING)[:;'\s]\s*(?<DATA>.*?)\s*(?:L[:;'\s]\s*\K(?<LINE_DATA>.*?))?(?<INITIALS>\*[a-zA-Z]+)?\s*$
See the LINE_DATA named group.
Is it possible to split that group up into two separate groups?
I want one group LINE_NUMBERS to hold all integers not contained in parentheses.
Then, 1 group called QTYS to hold all integers that are contained in parentheses.
So currently LINE_DATA yields "1,2,3(4),5(12) "
Is it possible to have LINE_NUMBERS be [1,2,3,5] (either an array or some kind of string)
and then QTYS be [(4),(12)]? Note: I do still want to capture the parentheses.
I would like to do this in the current regex if it's possible and doesn't overly complicate what I currently have.
Right now, I'm obtaining this data through post-processing with separate regexes. I'm using php
preg_match_all('/\d+(?!\s*\))/i', $ret_data['LINE_DATA'], $ret_data['LINE_NUMBERS']);
preg_match_all('/\(\s*\d\s*\)/i', $ret_data['LINE_DATA'], $ret_data['QUANTITIES']);
Thanks!
You can use a single pattern in the post-processing for the QUANTITIES and the LINE_NUMBERS using an alternation | and removing the empty entries from the result.
$re = '/^(?<KEY>CONF|ESD|TRACKING)[:;\'\s]\s*(?<DATA>.*?)\s*(?:L[:;\'\s]\s*\K(?<LINE_DATA>.*?))?(?<INITIALS>\*[a-zA-Z]+)?\s*$/i';
$str = 'esd: here is my data L: 1,2,3(4),5(12) *sm ';
preg_match($re, $str, $matches);
preg_match_all('/(?<QUANTITIES>\(\d+\))|(?<LINE_NUMBERS>\d+)/', $matches["LINE_DATA"], $numbers);
print_r(array_filter($numbers["QUANTITIES"]));
print_r(array_filter($numbers["LINE_NUMBERS"]));
Output
Array
(
[3] => (4)
[5] => (12)
)
Array
(
[0] => 1
[1] => 2
[2] => 3
[4] => 5
)
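One caveat, offered as a side note rather than something the answer states: array_filter without a callback drops every falsy value, so a line number of 0 would be removed along with the empty strings. The sketch below uses a made-up input, 0,1(2), and a callback that only removes empty strings.

```php
<?php
// Made-up LINE_DATA value containing a "0" to show the difference.
$lineData = '0,1(2)';
preg_match_all('/(?<QUANTITIES>\(\d+\))|(?<LINE_NUMBERS>\d+)/', $lineData, $numbers);

// Keep everything except the empty strings left by the group that did not match.
$notEmpty = function ($value) {
    return $value !== '';
};

$lineNumbers = array_values(array_filter($numbers['LINE_NUMBERS'], $notEmpty));
$quantities  = array_values(array_filter($numbers['QUANTITIES'], $notEmpty));

// For comparison: the default filter silently loses the "0".
$lossy = array_values(array_filter($numbers['LINE_NUMBERS']));

print_r($lineNumbers);
print_r($quantities);
print_r($lossy);
```

array_values is only there to reindex the results from 0; leave it out if the original match positions matter.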
There could be an option to use the \G anchor to get 2 separate groups for the given example data, but it will make the INITIALS part after it optional:
^(?<KEY>CONF|ESD|TRACKING)[:;'\s]\s*(?<DATA>.*?)\s*L[:;'\s]\s*|\G(?!^)(?:(?<QUANTITIES>\(\d+\))|(?<LINE_NUMBERS>\d+)),?(?:\s*(?<INITIALS>\*[a-zA-Z]+)\s*$)?
^ Start of string
(?<KEY>CONF|ESD|TRACKING)[:;'\s]\s* The KEY group with its alternatives, then a single char listed in the character class and optional whitespace chars
(?<DATA>.*?)\s* The DATA group, matching any chars non-greedily, followed by optional whitespace chars
L[:;'\s]\s* Match L, then any one of the listed chars and optional whitespace chars
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
(?: Non capture group
(?<QUANTITIES>\(\d+\)) Group QUANTITIES, match 1+ digits between parentheses
| Or
(?<LINE_NUMBERS>\d+) Group LINE_NUMBERS, match 1+ digits
) Close non capture group
,? Match an optional comma
(?:\s*(?<INITIALS>\*[a-zA-Z]+)\s*$)? Optional non capture group with group INITIALS
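As a rough sketch of how the \G pattern could be used from PHP, the snippet below runs it with preg_match_all on the sample string from the first code example and collects the non-empty entries of each named group. The i flag, the $groups array and the $notEmpty helper are assumptions made for this illustration.

```php
<?php
// The \G pattern from above, split over three lines for readability.
$re = '/^(?<KEY>CONF|ESD|TRACKING)[:;\'\s]\s*(?<DATA>.*?)\s*L[:;\'\s]\s*'
    . '|\G(?!^)(?:(?<QUANTITIES>\(\d+\))|(?<LINE_NUMBERS>\d+)),?'
    . '(?:\s*(?<INITIALS>\*[a-zA-Z]+)\s*$)?/i';
$str = 'esd: here is my data L: 1,2,3(4),5(12) *sm ';

preg_match_all($re, $str, $m);

// Every named group has one entry per match; most of them are empty strings.
$notEmpty = function ($value) {
    return $value !== '';
};

$groups = [];
foreach (['KEY', 'DATA', 'LINE_NUMBERS', 'QUANTITIES', 'INITIALS'] as $name) {
    $groups[$name] = array_values(array_filter($m[$name], $notEmpty));
}
print_r($groups);
```

The first match fills KEY and DATA, and each later match contributes one line number or one quantity, which is why the results have to be gathered across matches.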
I have strings containing this (where the number are integers representing the user id)
#[calbert](3)
#[username](684684)
I figured I need the following to get the username and user id
\((.*?)\)
and
\[(.*?)\]
But is there a way to get both at once?
Also, PHP returns the following. Is it possible to get only the result without the parentheses (and without the brackets in the username case)?
Array
(
[0] => (3)
[1] => 3
)
\[([^\]]*)\]|\(([^)]*)\)
Try this and see the demo. You need the | (OR) operator: it lets the regex engine try the second alternative, with its own capturing group, if the first one fails.
https://regex101.com/r/tX2bH4/31
$re = "/\\[([^\\]]*)\\]|\\(([^)]*)\\)/im";
$str = " #[calbert](3)\n #[username](684684)";
preg_match_all($re, $str, $matches);
Or you can use your own regex: \((.*?)\)|\[(.*?)\]
You can do this through positive lookbehind and lookahead assertions.
(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))
(?<=\[) Asserts that the match must be preceded by a [ character.
[^\]]* Matches any character except ], zero or more times.
(?=]) Asserts that the match must be followed by a ] character.
| Logical OR operator, used to combine the two regexes.
(?<=\()\d+(?=\)) Matches one or more digits that are preceded by ( and followed by ).
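A minimal sketch of the lookaround pattern in PHP, using the sample string from the other answer. Pairing the results with array_chunk is an assumption of this example: it only works if every username is immediately followed by its id.

```php
<?php
$str = " #[calbert](3)\n #[username](684684)";

// The lookarounds keep the brackets and parentheses out of the match itself.
preg_match_all('/(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))/', $str, $matches);

// Assumption: usernames and ids strictly alternate, so the flat list of
// matches can be split into [username, id] pairs.
$pairs = array_chunk($matches[0], 2);
print_r($pairs);
```

Because nothing is captured in groups, $matches[0] already holds the bare values without the surrounding delimiters.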
I am trying to match words containing the following: eph gro iss
I have eph|gro|iss which will match eph gro iss in this example: new grow miss eph.
However I need to match the whole word. For example it should match all of the miss not just iss and grow not just gro
Thanks
You can do it like this:
\b(\w*(eph|gro|iss)\w*)\b
How it works:
The expression is bracketed with word-boundary anchors \b, so it only matches whole words. These words must contain one of the literals eph, gro or iss somewhere, but the \w* parts allow the literals to appear anywhere within the whole word.
The important thing here is that you need to adopt some specific definition for "words". If you are OK with the regex definition that words are sequences that match [a-zA-Z0-9_]+ then you can use the above verbatim.
If your definition of word is something else, you will need to replace the \b anchors and \w classes appropriately.
Try this:
\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b
Breakdown:
\b - word boundary
( - start capture
[a-zA-Z]* - zero or more letters
(?:eph|gro|iss) - your original regex, non-capturing
[a-zA-Z]* - zero or more letters
) - end capture
\b - word boundary
Example output:
php > $string = "new grow miss eph";
php > preg_match_all("/\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b/", $string, $matches);
php > print_r($matches);
Array
(
[0] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
[1] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
)
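The two patterns above differ in what they treat as a word, which shows up as soon as the input contains digits or underscores. The sample string in this sketch is made up for the comparison.

```php
<?php
// Made-up sample: "regroup2" contains a digit and "miss_x" an underscore.
$string = 'grow regroup2 miss_x';

// \w version: letters, digits and underscores all count as word characters.
preg_match_all('/\b(\w*(eph|gro|iss)\w*)\b/', $string, $wordChars);

// Letters-only version: the closing \b cannot fall between a letter and a
// digit or underscore, so those two words are not matched at all.
preg_match_all('/\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b/', $string, $lettersOnly);

print_r($wordChars[0]);
print_r($lettersOnly[0]);
```

Which behaviour is right depends on the definition of "word" you settle on.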
Is it possible to use quantifiers with groups?
For example. I want to match something like:
11%
09%
aa%
zy%
g1%
8b%
...
The pattern is: 2 letters or numbers (mixed, or not) and a % ending the string ...
<?php
echo preg_match('~^([a-z]+[0-9]+){2}%$~', 'a1%'); // 0, I expect 1.
I know this example doesn't make much sense; a simple [list]{m,n} would solve it. It's as simple as possible just to get an answer.
You sure can apply quantifiers to groups. For example, I have the string:
HouseCatMouseDog
And I have the regex:
(Mouse|Cat|Dog){n}
Where n is any number. You can play around changing the value of n here.
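As for your example (yes, [list]{m,n} would be simpler), the group [a-z]+[0-9]+ matches one or more letters followed by one or more digits, and {2} requires that whole sequence twice. Of your samples, only g1 fits the group at all, and even then only once, so g1% does not match either; a string like a1b2% would.
You don't need to use 2 character classes; a single one will do the job.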
echo preg_match('~^([a-z0-9]{2})%$~', 'a1%');
RegExp meaning
^ => Matches at the beginning of the string/line
( => Start of the capture group
[a-z0-9] => Matches a single character from the a-z (abcdefghijklmnopqrstuvwxyz) or 0-9 (0123456789) range
{2} => The rule above must match exactly 2 times
) => End of the capture group
% => This character must follow the two a-z/0-9 characters
$ => End of the string/line
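To tie the answers together, here is a short sketch of quantifiers applied to groups, using the patterns and sample strings from this thread. The variable names are illustrative.

```php
<?php
// {2} repeats the whole group: two consecutive animal names are required.
preg_match('~(Mouse|Cat|Dog){2}~', 'HouseCatMouseDog', $m);
// $m[0] is the full match; $m[1] holds only the last repetition of the group.

// The question's pattern needs "letters then digits" twice in a row.
$once  = preg_match('~^([a-z]+[0-9]+){2}%$~', 'a1%');
$twice = preg_match('~^([a-z]+[0-9]+){2}%$~', 'a1b2%');

// The single-class version accepts every sample from the question.
$samples = ['11%', '09%', 'aa%', 'zy%', 'g1%', '8b%'];
$allMatch = true;
foreach ($samples as $sample) {
    if (preg_match('~^([a-z0-9]{2})%$~', $sample) !== 1) {
        $allMatch = false;
    }
}

var_dump($m, $once, $twice, $allMatch);
```

Note that a quantified capture group keeps only its last repetition, so quantify the character class rather than the group when you need the whole repeated text in one capture.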