PHP preg_match_all subpattern names in a pattern - php

The task is pretty clear. In the input we have a variable regex pattern, which supposedly contains named subpatterns, and in the output we need to get an array of subpattern names:
function get_subpattern_names($any_input_pattern) {
// What pattern to use here?
$pattern_to_get_names = '/.../';
preg_match_all($pattern_to_get_names, $any_input_pattern, $matches);
return $matches;
}
So the question is what to use as $pattern_to_get_names in the function above?
For example:
get_subpattern_names('/(?P<name>\w+): (?P<digit>\d+)/');
should return:
array('name', 'digit');
P.S.: According to PCRE documentation subpattern names consist of up to 32 alphanumeric characters and underscores.
As we don't control the input pattern, we need to take into account all possible syntaxes of naming. According to PHP documentation they are:
(?P<name>pattern), (?<name>pattern) and (?'name'pattern).
We also need to take into account nested subpatterns, for example:
(?<name1>.*(?<name2>pattern).*).
There's no need to count duplicating names, to preserve the appearance order, or to get numerical, non-capturing or other types of subpatterns. Just list of names if present.

You may get a list of all valid named capture group names using
"~(?<!\\\\)(?:\\\\{2})*\(\?(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})')~"
See the regex and an online PHP demo.
The point is to match an unescaped ( followed by ?, and then either P< or < plus a group name pattern ending with >, or ' plus a group name pattern ending with '.
$rx = "~(?<!\\\\)(?:\\\\{2})*\(\?(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})')~";
$s = "(?P<name>\w+): (?<name2>\w+): (?'digit'\d+)";
preg_match_all($rx, $s, $res);
print_r($res[1]);
yields
Array
(
[0] => name
[1] => name2
[2] => digit
)
Pattern details
(?<!\\) - no \ immediately to the left of the current location
(?:\\{2})* - 0+ double backslashes (to allow any number of escaped backslashes before the ()
\( - a (
\? - a ?
(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})') - a branch reset group:
P?<([_A-Za-z]\w{0,31})> - an optional P, <, a _ or an ASCII letter, 0 to 31 word chars (digits/letters/_) (captured into Group 1), and >
| - or
'([_A-Za-z]\w{0,31})' - ', a _ or an ASCII letter, 0 to 31 word chars (digits/letters/_) (also captured into Group 1), and then '
The group names are all captured into Group 1, so you just need to read $res[1].
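To tie this back to the function in the question, here is a minimal sketch that wraps the pattern above. The call to array_unique is my own addition (the question says duplicates need not be counted), and the sketch makes no attempt to ignore names that happen to appear inside a character class.

```php
<?php
// Sketch: the answer's pattern wrapped in the function the question asks for.
function get_subpattern_names(string $any_input_pattern): array
{
    $rx = "~(?<!\\\\)(?:\\\\{2})*\(\?(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})')~";
    preg_match_all($rx, $any_input_pattern, $matches);
    // Group 1 holds the names; dropping duplicates is an assumption, not a requirement.
    return array_values(array_unique($matches[1]));
}

print_r(get_subpattern_names('/(?P<name>\w+): (?P<digit>\d+)/'));
print_r(get_subpattern_names('(?<name1>.*(?<name2>pattern).*)'));
```

The second call covers the nested case from the question; an escaped parenthesis such as \(?<x>\) should yield an empty array.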

Wiktor's solution does seem quite thorough, but here's what I came up with.
print_r(get_subpattern_names('/(?P<name>\w+): (?P<digit>\d+)/'));
function get_subpattern_names($input_pattern){
preg_match_all('/\?P\<(.+?)\>/i', $input_pattern, $matches);
return $matches[1];
}
This works for the (?P<name>...) syntax used in the example, but it does not pick up (?<name>...) or (?'name'...). In exchange, it is much more readable and self-explanatory.
Basically, I search for ?P< followed by (.+?), a non-greedy match of whatever sits between the angle brackets. The function then returns index 1 of the $matches array, which holds the text captured by the first set of parentheses.
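A quick check of this shorter pattern against the three naming syntaxes listed in the question; the sample string is the one used in the first answer.

```php
<?php
// One sample pattern containing all three naming syntaxes from the question.
$s = "(?P<name>\w+): (?<name2>\w+): (?'digit'\d+)";
preg_match_all('/\?P\<(.+?)\>/i', $s, $matches);
// Only the (?P<...>) form is picked up.
print_r($matches[1]);
```

Here only name is returned, so this version fits best when the input patterns are known to use the (?P<name>...) form.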

Related

Validate string to contain only qualifying characters and a specific optional substring in the middle

I'm trying to make a regular expression in PHP. I can get it working in other languages but not working with PHP.
I want to validate item names in an array
They can contain upper and lower case letters, numbers, underscores, and hyphens.
They can contain => as an exact string, not separate characters.
They cannot start with =>.
They cannot finish with =>.
My current code:
$regex = '/^[a-zA-Z0-9-_]+$/'; // contains A-Z a-z 0-9 - _
//$regex = '([^=>]$)'; // doesn't end with =>
//$regex = '~.=>~'; // doesn't start with =>
if (preg_match($regex, 'Field_name_true2')) {
echo 'true';
} else {
echo 'false';
};
// Field=>Value-True
// =>False_name
//Bad_name_2=>
Use negative lookarounds. Negative lookahead (?!=>) at the beginning to prohibit beginning with =>, and negative lookbehind (?<!=>) at the end to prohibit ending with =>.
^(?!=>)(?:[a-zA-Z0-9-_]+(=>)?)+(?<!=>)$
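A small sketch of how this pattern could be dropped into the question's code. The ~ delimiters are my choice, and the sample names are the ones from the question's comments.

```php
<?php
// Lookaround version: must not start with => and must not end with =>.
$regex = '~^(?!=>)(?:[a-zA-Z0-9-_]+(=>)?)+(?<!=>)$~';
$names = ['Field_name_true2', 'Field=>Value-True', '=>False_name', 'Bad_name_2=>'];
foreach ($names as $name) {
    echo $name, ': ', preg_match($regex, $name) ? 'true' : 'false', PHP_EOL;
}
```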
There is absolutely no requirement for lookarounds here.
Anchors and an optional group will suffice.
/^[\w-]+(?:=>[\w-]+)?$/
        ^^^^^^^^^^^^^-- this whole non-capturing group is optional
This allows full strings consisting exclusively of [0-9a-zA-Z_-], optionally split ONCE by =>.
The non-capturing group may occur zero or one time.
In other words, => may occur after one or more [\w-] characters, but if it does occur, it MUST be immediately followed by one or more [\w-] characters until the end of the string.
To cover some of the ambiguity in the question requirements:
If foo=>bar=>bam is valid, then use /^[\w-]+(?:=>[\w-]+)*$/ which replaces ? (zero or one) with * (zero or more).
If foo=>=>bar is valid then use /^[\w-]+(?:(?:=>)+[\w-]+)*$/ which replaces => (must occur once) with (?:=>)+ (substring must occur one or more times).
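To see how the three variants differ, here is a short comparison sketch. The variable names are made up for illustration, and the test strings come from the question and from the two ambiguous cases above.

```php
<?php
$once     = '/^[\w-]+(?:=>[\w-]+)?$/';          // at most one =>
$repeated = '/^[\w-]+(?:=>[\w-]+)*$/';          // any number of => separators
$doubled  = '/^[\w-]+(?:(?:=>)+[\w-]+)*$/';     // consecutive => also allowed
$tests = ['Field=>Value-True', '=>False_name', 'Bad_name_2=>', 'foo=>bar=>bam', 'foo=>=>bar'];
foreach ($tests as $t) {
    // Columns: input, then the result of each pattern (1 = match, 0 = no match).
    printf("%-18s %d %d %d\n", $t, preg_match($once, $t), preg_match($repeated, $t), preg_match($doubled, $t));
}
```

Only the last two inputs separate the variants: foo=>bar=>bam passes the second and third patterns, and foo=>=>bar passes only the third.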
Well, your character class is equivalent to \w plus the hyphen, so you could use
^(?!=>)(?:(?!=>$)(?:[-\w]|=>))+$
This construct uses a "tempered greedy token", see a demo on regex101.com.
More shiny, complicated and surely over the top, you could use subroutines as in:
(?(DEFINE)
    (?<chars>[-\w]) # equals to A-Z, a-z, 0-9, _, -
    (?<af>=>) # "arrow function"
    (?<item>
        (?!(?&af)) # no af at the beginning
        (?:(?&af)?(?&chars)++)+
        (?!(?&af)) # no af at the end
    )
)
^(?&item)$
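Because this pattern spans several lines and carries # comments, it needs the x (extended) modifier when used from PHP. A minimal sketch, with ~ as an arbitrary delimiter and slightly reworded comments:

```php
<?php
// The x modifier makes PCRE ignore the whitespace and # comments inside the pattern.
$regex = '~
(?(DEFINE)
    (?<chars>[-\w])   # A-Z, a-z, 0-9, _ and -
    (?<af>=>)         # the => separator
    (?<item>
        (?!(?&af))                # no => at the beginning
        (?:(?&af)?(?&chars)++)+
        (?!(?&af))                # no => at the end
    )
)
^(?&item)$
~x';
foreach (['Field=>Value-True', '=>False_name', 'Bad_name_2=>'] as $name) {
    var_dump(preg_match($regex, $name));
}
```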
See another demo on regex101.com
For the example data, you can use
^[a-zA-Z0-9_-]+=>[a-zA-Z0-9_-]+$
The pattern matches:
^ Start of string
[a-zA-Z0-9_-]+ Match 1+ times any of the listed ranges or characters (can not start with =>)
=> Match literally
[a-zA-Z0-9_-]+ Match again 1+ times any of the listed ranges or characters
$ End of string
If you want to allow for optional spaces:
^\h*[a-zA-Z0-9_-]+\h*=>\h*[a-zA-Z0-9_-]+\h*$
Note that [a-zA-Z0-9_-] can be written as [\w-]

Regex Preg_match for licence key 25 alphanumeric and 4 hyphens

I'm still trying to get to grips with regex patterns and just after a little double-checking if someone wouldn't mind obliging!
I have a string which should either contain:
A 10 digit (numbers and letters) licence key, for example: 1234567890 OR
A 25 digit (numbers and letters) licence key, for example: ABCD1EFGH2IJKL3MNOP4QRST5 OR
A 29 digit licence number (25 numbers and letters, separated into 5 groups by hyphens), for example: ABCD1-EFGH2-IJKL3-MNOP4-QRST5
I can match the first two fine, using ctype_alnum and strlen functions. However, for the last one I think I'll need to use regex and preg_match.
I had a go over at regex101.com and came up with the following:
preg_match('^([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})', $str);
Which seems to match what I'm looking for.
I want the string to only contain an exact match for a string beginning with the licence number, and contain nothing other than mixed upper/lower case letters and numbers in any order and hyphens between each group of 5 characters (so a total of 29 characters - I don't want any further matches). No white space, no other characters and nothing else before or after the 29 digit key.
Will the above work, without allowing any other combinations? Will it stop checking at 29 characters? I'm not sure if there is a simpler way to express this in regex?
Thanks for your time!
The main point is that you need to use both ^ (start of string) and $ (end of string) anchors. Also, when you use + after (...), you allow 1 or more repetitions of the whole subpattern inside the (...). So, you need to remove the +s and add the $ anchor. Also, you need regex delimiters for your regex to work in PHP preg_match. I prefer ~ so as not to escape /. Maybe it is not the case here, but this is a habit.
So, the regex can look like
'~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~'
See the regex demo
The (?:-[A-Za-z0-9]{5}){4} matches 4 occurrences of -[A-Za-z0-9]{5} subpattern. The (?:...) is a non-capturing group whose matched text does not get stored in any buffer (unlike the capturing group).
See the IDEONE demo:
$re = '~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~';
$str = "ABCD1-EFGH2-IJKL3-MNOP4-QRST5";
if (preg_match($re, $str, $matches)) {
echo "Matched!";
}
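Putting the three accepted formats from the question together could look like the sketch below. The function name is hypothetical, and I swapped $ for \z so that a key followed by a trailing newline is rejected too; that is a deviation from the pattern above, not part of it.

```php
<?php
// Hypothetical helper; the three accepted formats come from the question.
function is_valid_licence_key(string $key): bool
{
    // 10 or 25 letters/digits with no separators.
    if (ctype_alnum($key) && in_array(strlen($key), [10, 25], true)) {
        return true;
    }
    // Five groups of five letters/digits joined by hyphens (29 characters).
    // \z (absolute end of string) is used instead of $ in this sketch.
    return preg_match('~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}\z~', $key) === 1;
}

var_dump(is_valid_licence_key('1234567890'));
var_dump(is_valid_licence_key('ABCD1-EFGH2-IJKL3-MNOP4-QRST5'));
var_dump(is_valid_licence_key('ABCD1-EFGH2-IJKL3-MNOP4-QRST51'));
```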
How about:
preg_match('/^([a-z0-9]{5})(?:-(?1)){4}$/i', $str);
Explanation:
/ : regex delimiter
^ : beginning of string
( : begin group 1
[a-z0-9]{5} : exactly 5 alphanum.
) : end of group 1
(?: : begin NON capture group
- : a dash
(?1) : same as definition in group 1 (ie. [a-z0-9]{5})
){4} : this group must be repeated 4 times
$ : end of string
/i : regex delimiter with case insensitive modifier
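The difference between the subroutine call (?1) and a backreference \1 is easy to miss: (?1) re-runs the pattern of group 1, while \1 requires the same text that group 1 matched. A short sketch to illustrate (the second key is a made-up example):

```php
<?php
$key = 'ABCD1-EFGH2-IJKL3-MNOP4-QRST5';
$subroutine    = '/^([a-z0-9]{5})(?:-(?1)){4}$/i'; // re-applies the pattern of group 1
$backreference = '/^([a-z0-9]{5})(?:-\1){4}$/i';   // repeats the text matched by group 1

var_dump(preg_match($subroutine, $key));                                 // int(1)
var_dump(preg_match($backreference, $key));                              // int(0)
var_dump(preg_match($backreference, 'ABCD1-ABCD1-ABCD1-ABCD1-ABCD1'));   // int(1)
```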

Regular expressions, allow specific format only. "John-doe"

I've researched a little, but I found nothing that relates exactly to what I need and whenever tried to create the expression it is always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to only allow text-text, where either text can be alphabetical or numeric; the only symbol allowed is -, and only between the two texts.
Each text must be at least three characters long ({3,8}?), then separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%
You need to add anchors, and to put - between the endpoints inside each character class so that they are read as ranges (A-Z, a-z, 0-9) rather than as individual characters.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You could also simplify it a bit with the i modifier and the \d metacharacter.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
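As a quick check, the pattern can be run over the valid and invalid samples from the question with preg_grep, which returns only the entries that match. The / delimiters and variable names are my own.

```php
<?php
$regex   = '/(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$/';
$valid   = ['Text-Text', 'Abc-123', '123-Abc', 'A2C-def4gk'];
$invalid = ['Ab-3', 'Abc!-ajr4', 'a-bc3-25aj', 'a?c-b%'];

// preg_grep keeps the array entries that match the pattern.
print_r(preg_grep($regex, array_merge($valid, $invalid)));
```

Only the four entries from $valid should be printed.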
If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.
You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at least three word characters at the beginning of the line
- # a dash
\w{3,}$ # at least three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As @chris85 pointed out, \w will match an underscore as well. Trincot also made a good point about matching accented characters; to achieve this, simply use the u modifier.
See a demo on regex101.com and a complete code on ideone.com.
You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // matches "a" to "z", "A" to "Z", 0 to 9 and the underscore; requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string
And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as short for alnum. It subtracts the underscore from \w
(?1) is a subroutine call to the pattern in first group
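To make the underscore difference concrete, here is a small comparison between this pattern and the \w-based one from the previous answer. The second sample string is made up for illustration.

```php
<?php
$alnum_only = '/^([^\W_]{3,8})-(?1)$/'; // letters and digits only
$word_chars = '/^\w{3,}-\w{3,}$/';      // \w also accepts the underscore

foreach (['Abc-123', 'a_c-d_f'] as $s) {
    // Columns: input, result with [^\W_], result with \w
    printf("%s: %d %d\n", $s, preg_match($alnum_only, $s), preg_match($word_chars, $s));
}
```

Abc-123 passes both patterns, while a_c-d_f passes only the \w version.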
Demo at regex101
My vote goes to @chris85's answer, which is the most obvious and performant.
This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1

Capture two regular expression from string

I have strings containing this (where the number are integers representing the user id)
#[calbert](3)
#[username](684684)
I figured I need the following to get the username and user id
\((.*?)\)
and
\[(.*?)\]
But is there a way to get both at once?
Also, PHP returns the following. Is it possible to get only the result, without the parentheses (and the brackets in the username case)?
Array
(
[0] => (3)
[1] => 3
)
\[([^\]]*)\]|\(([^)]*)\)
Try this. You need the | (alternation) operator, which lets the regex engine fall back to the second alternative, with its own capturing group, when the first one fails.
https://regex101.com/r/tX2bH4/31
$re = "/\\[([^\\]]*)\\]|\\(([^)]*)\\)/im";
$str = " #[calbert](3)\n #[username](684684)";
preg_match_all($re, $str, $matches);
Or you can combine your own two patterns: \((.*?)\)|\[(.*?)\]
Through positive lookbehind and lookahead assertion.
(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))
(?<=\[) asserts that the match must be preceded by a [ character,
[^\]]* matches any character other than ], zero or more times,
(?=]) asserts that the match must be followed by a ] character.
| is the alternation (logical OR) operator used to combine the two regexes; the second one does the same for digits between ( and ).
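In PHP, the lookaround pattern can be used with preg_match_all so that the full matches already come without brackets or parentheses. The sketch below pairs them up with array_chunk, which assumes every username is immediately followed by its id, as in the sample strings.

```php
<?php
$str = "#[calbert](3)\n#[username](684684)";
preg_match_all('/(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))/', $str, $matches);

// $matches[0] alternates username, id, username, id, ... so group them in pairs.
$pairs = array_chunk($matches[0], 2);
print_r($pairs);
```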

parse search string for phrases and keywords

i need to parse a search string for keywords and phrases in php, for example
string 1: value of "measured response" detect goal "method valuation" study
will yield: value,of,measured response,detect,goal,method valuation,study
i also need it to work if the string has:
no phrases enclosed in quotes,
any number of phrases enclosed in quotes with any number of keywords outside the quotes,
only phrases in quotes,
only space-separated keywords.
i'm leaning towards using preg_match with the pattern '/(\".*\")/' to get the phrases into an array, then remove the phrases from the string, then finally work the keywords into the array. i just can't pull everything together!
i'm also thinking of replacing spaces outside quotes with commas. then explode them to an array. if that's a better option, how do i do that with preg_replace?
is there a better way to go about this? help! thanks much, everyone
preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
# Matched text = $result[0][$i];
}
This should yield the results you are looking for.
Explanation :
# (?<!")\b\w+\b|(?<=")\b[^"]+
#
# Match either the regular expression below (attempting the next alternative only if this one fails) «(?<!")\b\w+\b»
# Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!")»
# Match the character “"” literally «"»
# Assert position at a word boundary «\b»
# Match a single character that is a “word character” (letters, digits, etc.) «\w+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at a word boundary «\b»
# Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «(?<=")\b[^"]+»
# Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=")»
# Match the character “"” literally «"»
# Assert position at a word boundary «\b»
# Match any character that is NOT a “"” «[^"]+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
There is no need to use a regular expression, the built in function str_getcsv can be used to explode a string with any given delimiter, enclosure and escape characters.
It really is as simple as this:
// where $string is the string to parse
$array = str_getcsv($string, ' ', '"');
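Applied to the sample string from the question, a minimal sketch looks like this. The escape character is passed explicitly here (backslash, the historical default), which is my addition to the call above.

```php
<?php
$s = 'value of "measured response" detect goal "method valuation" study';

// Space as delimiter, double quote as enclosure, backslash as escape character.
$terms = str_getcsv($s, ' ', '"', '\\');
print_r($terms);
```

One thing to watch for: consecutive spaces in the input are treated as empty fields, so they may need to be filtered out afterwards.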
$s = 'value of "measured response" detect goal "method valuation" study';
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $matches);
print_r($matches[1]);
output:
Array
(
[0] => value
[1] => of
[2] => measured response
[3] => detect
[4] => goal
[5] => method valuation
[6] => study
)
The trick here is to use a branch-reset group: (?|...|...). It's just like an alternation contained in a non-capturing group - (?:...|...) - except that within each branch the capturing-group numbers start at the same number. (For more info, see the PCRE docs and search for DUPLICATE SUBPATTERN NUMBERS.)
Thus, the text we're interested in is always captured group #1. You can retrieve the contents of group #1 for all matches via $matches[1]. (That's assuming the PREG_PATTERN_ORDER flag is set; I didn't specify it like @FailedDev did because it's the default. See the PHP docs for details.)
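To show what the branch-reset group buys you, the sketch below runs the same alternation once with (?|...) and once with a plain non-capturing group (?:...). The variable names are arbitrary.

```php
<?php
$s = 'value of "measured response" detect goal "method valuation" study';

preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $reset);  // both branches capture into group 1
preg_match_all('~(?:"([^"]+)"|(\S+))~', $s, $plain);  // phrases in group 1, keywords in group 2

print_r($reset[1]); // every term, in order
print_r($plain[1]); // phrases only, empty strings elsewhere
print_r($plain[2]); // keywords only, empty strings elsewhere
```

Without the branch reset, the results are spread over two arrays padded with empty strings and would have to be merged by hand.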
