I've been trying to use a regular expression to match and extract parts of a URL.
The URL pattern looks like:
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
I intend to capture the following groups:
match and capture xyz (optional, but specific value)
match and capture fe/fi/fo5/fu2m (must exist, arbitrary value)
match and capture 123 (optional numeric value, which must appear at the end)
Here are expressions I have tried and problem encountered:
string1: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
string2: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))\/$
makes number at end mandatory
matches and captures all groups as required in string1 even when xyz is not included
no match in string2 because there's no number at the end
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))?\/$
makes number at end optional
captures only groups 1 and 2 in string1 and string2. In string1 the number is matched as part of group 2, giving fe/fi/fo5/fu2m/123
My problem is how to capture groups 1, 2 and 3 in all scenarios incl. string1 and string2 (note: I am using PHP's preg_match function)
I would use parse_url first to extract the path from the URL. Then all you have to do is use a non-greedy quantifier in the second group:
$path = parse_url($url, PHP_URL_PATH);
if ( preg_match('~\A/([^/]+)/(.*?)/(?:(\d+)/)?\z~', $path, $m) )
    var_dump($m);
This way, if the number at the end is missing, the non-greedy quantifier (from the second group) is forced to reach the end of the string.
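As a minimal sketch of the approach above, the two sample URLs from the question can be run through the same pattern. The helper name splitPath and the array keys are made up for illustration. Note that this pattern captures whatever first path segment is present rather than checking for xyz specifically.

```php
<?php
// Hypothetical helper (name and return shape are illustrative only).
function splitPath(string $url): ?array
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no usable path component
    }
    // Group 2 is lazy, so at every step the optional trailing number is
    // tried before group 2 is allowed to grow any further.
    if (!preg_match('~\A/([^/]+)/(.*?)/(?:(\d+)/)?\z~', $path, $m)) {
        return null;
    }
    return [
        'first'  => $m[1],
        'middle' => $m[2],
        'number' => $m[3] ?? null, // group 3 is absent when there is no number
    ];
}

$with    = splitPath('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/');
$without = splitPath('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/');
var_dump($with, $without);
```

In both cases the middle part comes back as fe/fi/fo5/fu2m; only the number differs.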
Use a modified URL validator.
'~^(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))|localhost)(?::\d{2,5})?(?:/(xyz))?((?:/(?!\d+/?$)[^/]*)+)(?:/(\d+))?/?\s*$~u'
Group 1 is optional xyz
Group 2 is required middle
Group 3 is optional number at the end
Readable version
^
(?! mailto: )
(?:
     (?: https? | ftp )
     ://
)?
(?:
     \S+
     (?: : \S* )?
     @
)?
(?:
     (?:
          (?:
               [1-9] \d?
            |  1 \d\d
            |  2 [01] \d
            |  22 [0-3]
          )
          (?:
               \.
               (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
          ){2}
          (?:
               \.
               (?:
                    [1-9] \d?
                 |  1 \d\d
                 |  2 [0-4] \d
                 |  25 [0-4]
               )
          )
       |  (?:
               (?: [a-z\x{00a1}-\x{ffff}0-9]+ -? )*
               [a-z\x{00a1}-\x{ffff}0-9]+
          )
          (?:
               \.
               (?: [a-z\x{00a1}-\x{ffff}0-9]+ -? )*
               [a-z\x{00a1}-\x{ffff}0-9]+
          )*
          (?:
               \.
               (?: [a-z\x{00a1}-\x{ffff}]{2,} )
          )
     )
  |  localhost
)
(?: : \d{2,5} )?
(?:
     /
     ( xyz )                # Optional specific value
)?
(                           # Must exist, arbitrary value
     (?:
          /
          (?! \d+ /? $ )    # Not a numeric value at the end
          [^/]*
     )+
)
(?:
     /
     ( \d+ )                # Optional numeric value, which must appear at the end
)?
/?
\s*
$
Output
** Grp 0 - ( pos 0 : len 46 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
** Grp 1 - ( pos 21 : len 3 )
xyz
** Grp 2 - ( pos 24 : len 15 )
/fe/fi/fo5/fu2m
** Grp 3 - ( pos 40 : len 3 )
123
** Grp 0 - ( pos 48 : len 42 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
** Grp 1 - ( pos 69 : len 3 )
xyz
** Grp 2 - ( pos 72 : len 18 )
/fe/fi/fo5/fu2m/
** Grp 3 - NULL
Related
I have data in this format:
Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 Randomtext4 (random5,random7,random8) Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()
with this:
preg_match_all("/\b\w+\b(?:\s*\(.*?\)|)/",$text,$matches);
I obtain:
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)',
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8)',
5 => 'random10',
6 => 'Randomtext11()',
but I want
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)'
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8), random10)'
5 => 'Randomtext11()'
Any ideas?
You need a recursive pattern to handle nested parentheses:
if ( preg_match_all('~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~', $text, $matches) )
print_r($matches[0]);
details:
~                   # delimiter
\w+
(?:
    \s*
    (               # capture group 1
        \(
        [^()]*+     # all that isn't a round bracket
                    # (possessive quantifier *+ to prevent too many backtracking
                    # steps in case of badly formatted string)
        (?:
            (?1)    # recursion in the capture group 1
            [^()]*
        )*+
        \)
    )               # close the capture group 1
)?                  # to make the group optional (instead of "|)")
~
Note that you don't need to add word-boundaries around \w+
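A minimal sketch that runs the recursive pattern above against the sample data from the question. The variable names are illustrative; the pattern is the one given in this answer.

```php
<?php
$text = 'Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 '
      . 'Randomtext4 (random5,random7,random8) '
      . 'Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()';

// (?1) re-enters capture group 1, so every nested "(...)" is consumed as a
// unit before the outer closing parenthesis is matched.
$pattern = '~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

The fifth entry keeps the nested call in one piece instead of splitting at the first closing parenthesis.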
I'm trying to build a function that matches the math expression between two greater (or equal) or smaller (or equal) symbols.
I have the following preg_match function:
preg_match("/(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)/", "2<(2+2)<8", $matches);
When I read the $matches array I get:
Array
(
[0] => <(2+2)<
[1] => <
[2] => (2+2)
[3] => )
[4] => <
)
Can anyone explain why the closing ) gets matched as part of the (2+2) and on its own? I would like it to only match the whole (2+2).
Because you've got two capturing groups for the expression between comparison signs:
(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)
         ^^               ^ ^
         |`----- $3 -----' |
         `------- $2 ------'
Change it to
(<=?|>=?)((?:[0-9]|\+|\(|\))+)(<=?|>=?)
           ^^
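To see the difference side by side, here is a small sketch that runs both patterns on the string from the question. The variable names $before and $after are only for illustration.

```php
<?php
$subject = '2<(2+2)<8';

// Original: the inner group is capturing and repeated, so group 3 keeps
// only the text from its last iteration, the closing ")".
preg_match('/(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)/', $subject, $before);

// Changed: (?: ... ) groups without capturing, so no extra entry appears
// and the second comparison sign moves from index 4 to index 3.
preg_match('/(<=?|>=?)((?:[0-9]|\+|\(|\))+)(<=?|>=?)/', $subject, $after);

print_r($before);
print_r($after);
```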
Because you have a quantified capture group, (...)+.
Each pass through the group overwrites what it captured on the previous pass.
The result is that you only see the last capture, here the closing ).
You can see the group below as (3 start) / (3 end).
( <=? | >=? )       # (1)
(                   # (2 start)
     (              # (3 start)
          [0-9]
       |  \+
       |  \(
       |  \)
     )+             # (3 end)
)                   # (2 end)
( <=? | >=? )       # (4)
The individual pieces are of no use in this case. Changing the inner group to a non-capturing (cluster) group, (?:...), excludes it from the output array.
( <=? | >=? )       # (1)
(                   # (2 start)
     (?:
          [0-9]
       |  \+
       |  \(
       |  \)
     )+
)                   # (2 end)
( <=? | >=? )       # (3)
Output
** Grp 0 - ( pos 0 , len 7 )
<(2+2)<
** Grp 1 - ( pos 0 , len 1 )
<
** Grp 2 - ( pos 1 , len 5 )
(2+2)
** Grp 3 - ( pos 6 , len 1 )
<
Regarding my previous post I'm trying to match with regular expressions all use statements in a class file.
<?php
use Vendor\ProjectArticle\Model\Peer,
Vendor\Library\Template;
use Vendor\Blablabla;
$file = file_get_contents($class_path);
$a = preg_match_all('#use (?:(?<ns>[^,;]+),?)+;#mi', $file, $use);
var_dump(array('$a' => $a, '$use' => $use));
Unfortunately, when a single use statement lists several class names, I don't get all of the namespaces. Only the last one matched is stored.
Array
(
[$a] => 2
[$use] => Array
(
[0] => Array
(
[0] => use Vendor\ProjectArticle\Model\Peer,
Vendor\Library\Template;
[1] => use Vendor\Blablabla;
)
[ns] => Array
(
[0] =>
Vendor\Library\Template
[1] => Vendor\Blablabla
)
[1] => Array
(
[0] =>
Vendor\Library\Template
[1] => Vendor\Blablabla
)
)
)
Can this be accomplished with some pattern modifier or something?
~Thanks
You should be able to use the \G anchor for this.
# '~(?:(?!\A)\G|^Use\s+),?\s*(?<ns>[^,;]+)(?=(?:,|[^,;]*)*;)~mi'
(?xmi-)                   # Inline modifier = expanded, multiline, case insensitive
(?:
     (?! \A )             # Not beginning of string
     \G                   # If matched before, start at end of last match
  |                       # or,
     ^ Use \s+            # Beginning of line then 'Use' + whitespace
)
,? \s*                    # Whitespace trim
(?<ns> [^,;]+ )           # (1), A namespace value
(?=                       # Lookahead, each match validates a final ';'
     (?: , | [^,;]* )*
     ;
)
Output:
** Grp 0 - ( pos 0 , len 36 )
use Vendor\ProjectArticle\Model\Peer
** Grp 1 - ( pos 4 , len 32 )
Vendor\ProjectArticle\Model\Peer
---------------------
** Grp 0 - ( pos 36 , len 30 )
,
Vendor\Library\Template
** Grp 1 - ( pos 43 , len 23 )
Vendor\Library\Template
---------------------
** Grp 0 - ( pos 69 , len 20 )
use Vendor\Blablabla
** Grp 1 - ( pos 73 , len 16 )
Vendor\Blablabla
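For reference, a minimal sketch of how the \G pattern could be used with preg_match_all. The sample source is the snippet from the question, embedded as a nowdoc instead of being read with file_get_contents; the variable names are illustrative.

```php
<?php
$source = <<<'PHP'
<?php
use Vendor\ProjectArticle\Model\Peer,
    Vendor\Library\Template;
use Vendor\Blablabla;
PHP;

// Each match is either the first name after "use" or, via \G, the next
// comma-separated name directly after the previous match.
$pattern = '~(?:(?!\A)\G|^use\s+),?\s*(?<ns>[^,;]+)(?=(?:,|[^,;]*)*;)~mi';

preg_match_all($pattern, $source, $matches);
print_r($matches['ns']);
```

Because every name is its own match, the named group ns collects all three namespaces rather than only the last one per statement.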
I have two conditions in my regex (regex used on php)
(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))
When I test the 1st condition with the following I get 4 match groups 1 2 3 and 4
BIOLOGIQUES 47 131002 / 4302
Please see the 1st condition here http://www.rubular.com/r/a6zQS8Wth6
But when I test with the second condition the groups match are 5 6 7 and 8
Dossier N° : 47 131002 / 4302
The second condition here : http://www.rubular.com/r/eYzBJq1rIW
Is there a way to always have 1 2 3 and 4 match groups in the second condition too?
Since the parts of both regexps that match the numbers are the same, you can do the alternation just for the beginning, instead of around the entire regexp:
preg_match('/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u', $content, $match);
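Use the u modifier so that the pattern and the subject are treated as UTF-8. Without it, . matches a single byte, and ° is two bytes in UTF-8, so N.\s+ would fail on "N° ".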
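A short sketch of the factored pattern applied to both sample lines from the question. It assumes PHP 7 or later for the "\u{00B0}" escape, which is used here only so the degree sign is guaranteed to be valid UTF-8 regardless of how the script file is saved.

```php
<?php
$pattern = '/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u';

$lines = [
    'BIOLOGIQUES 47 131002 / 4302',
    "Dossier N\u{00B0} : 47 131002 / 4302", // \u{00B0} is the degree sign
];

$results = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        // Groups 2, 3 and 4 hold the three numbers for either prefix.
        $results[] = array_slice($m, 2);
    }
}
print_r($results);
```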
I assume your regex was written compactly. If the dot in "N." is meant as a literal abbreviation dot, it should be escaped. The suggestion below factors out the common part, as Barmar's answer does. If you don't want to capture the two prefixes, remove the parentheses around them.
Correction: it looks like you intend the dot to be a metacharacter. In that case, just remove the \ in front of it.
# (?:(BIOLOGIQUES)|(Dossier\ N\.\s+:))\s+((\d+)\s+(\d+)\s+\/\s+(\d+))
(?:
     ( BIOLOGIQUES )               # (1)
  |  ( Dossier\ N \. \s+ : )       # (2)
)
\s+
(                                  # (3 start)
     ( \d+ )                       # (4)
     \s+
     ( \d+ )                       # (5)
     \s+ \/ \s+
     ( \d+ )                       # (6)
)                                  # (3 end)
Edit: the regex should be factored as above, but if the branches differ too much for that, another way to reuse the same capture group numbers is a branch reset group, (?|...).
Here is your original regex wrapped in a branch reset, with some annotations.
(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))
(?|
br 1 ( # (1 start)
BIOLOGIQUES \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
|
br 1 ( # (1 start)
Dossier\ N . \s+ : \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
)
Or, you could factor it AND use branch reset.
# (?|(BIOLOGIQUES\s+)|(Dossier\ N.\s+:\s+))(?:(\d+)\s+(\d+)\s+\/\s+(\d+))
(?|
br 1 ( BIOLOGIQUES \s+ ) # (1)
|
br 1 ( Dossier\ N . \s+ : \s+ ) # (1)
)
(?:
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
)
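The branch reset variant can be checked with a sketch like the following. It uses the unfactored form from above with the u modifier added, and assumes PHP 7 or later for the "\u{00B0}" escape; the variable names are illustrative.

```php
<?php
// (?| ... ) restarts group numbering in each branch, so both branches
// fill groups 1 to 4.
$pattern = '/(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))'
         . '|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))/u';

preg_match($pattern, 'BIOLOGIQUES 47 131002 / 4302', $first);
preg_match($pattern, "Dossier N\u{00B0} : 47 131002 / 4302", $second);

print_r($first);
print_r($second);
```

Both result arrays have the same shape: index 1 is the whole branch and indexes 2 to 4 are the numbers.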
Take a look at this regular expression:
(?:\(?")(.+)(?:"\)?)
This regex would match e.g
"a"
("a")
but also
"a)
How can I say that the starting delimiter (in this case " or (") must correspond to the ending one? There must be a simpler solution than this, right?
"(.+)"|(?:\(")(.+)(?:"\))
I don't think there's a neat way to do this with regex alone, so you are stuck doing something like this:
/(?:
    "(.+)"
  |
    \( "(.+)" \)
)/x
how about:
(\(?)(")(.+)\2\1
explanation:
(?-imsx:(\(?)(")(.+)\2\1)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\2 what was matched by capture \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
) end of grouping
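Note that \1 in the pattern above repeats the text captured by group 1, so it expects another "(" (or nothing) at the end, not a ")". A different technique, not the one this answer uses, is a PCRE conditional subpattern: require the closing parenthesis only if the opening one was captured. A hedged sketch, with anchors added on the assumption that the whole string is being validated:

```php
<?php
// (\()?       optionally capture an opening parenthesis as group 1
// "(.+)"      the quoted content (group 2)
// (?(1)\))    conditional: require ")" only if group 1 took part in the match
$pattern = '~^(\()?"(.+)"(?(1)\))$~';

$results = [];
foreach (['"a"', '("a")', '"a)', '("a"'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate) === 1;
}
var_dump($results);
```

The first two candidates are accepted and the two unbalanced ones are rejected.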
You can use backreferences. They are not PHP-specific: PCRE, which PHP's preg_* functions use, supports them, as do most other regex flavors.
preg_match('/<([^>]+)>(.+)<\/\1>/', $subject, $matches) (the \1 refers to whatever the first capture group matched; use single quotes, because inside double quotes PHP reads \1 as an octal escape)
This uses the first capture as the condition for the closing tag. It matches <a>something</a> but not <h2>something</a>.
However, in your case you would need to turn the "(" matched by the first group into a ")", which a backreference can't do.
Update: you could first replace ( and ) with <BRACE> and <END_BRACE>. Then you can match using /<([^>]+)>(.+)<END_\1>/. Do this for all the delimiters you need: ()[]{}<> and so on.
For example, "(a) is as nice as [f]" becomes "<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>", and the regex will capture both pairs if you use preg_match_all:
$returnValue = preg_match_all('/<([^>]+)>(.+)<END_\\1>/', '<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>', $matches);
leads to
array (
0 =>
array (
0 => '<BRACE>a<END_BRACE>',
1 => '<BRACKET>f<END_BRACKET>',
),
1 =>
array (
0 => 'BRACE',
1 => 'BRACKET',
),
2 =>
array (
0 => 'a',
1 => 'f',
),
)
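The replacement step described above is not shown in the answer; one way it might look is sketched here with strtr. The token table is an assumption built from the names the answer uses, and only two delimiter pairs are included.

```php
<?php
// Hypothetical token table: each opening delimiter gets a tag name and
// each closing delimiter gets the same name prefixed with END_.
$tokens = [
    '(' => '<BRACE>',   ')' => '<END_BRACE>',
    '[' => '<BRACKET>', ']' => '<END_BRACKET>',
];

$text   = '(a) is as nice as [f]';
$tagged = strtr($text, $tokens);

// \1 forces the closing tag name to repeat the opening tag name.
preg_match_all('/<([^>]+)>(.+)<END_\1>/', $tagged, $matches);
print_r($matches);
```

Keep in mind that (.+) is greedy, so two pairs of the same kind in one string would be matched as a single span; (.+?) is a possible alternative if that matters.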