I'm trying to build a function that matches the math expression between two greater-than (or equal) or less-than (or equal) symbols.
I have the following preg_match function:
preg_match("/(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)/", "2<(2+2)<8", $matches);
When I read the $matches array I get:
Array
(
[0] => <(2+2)<
[1] => <
[2] => (2+2)
[3] => )
[4] => <
)
Can anyone explain why the closing ) gets matched both as part of (2+2) and on its own? I would like it to match only the whole (2+2).
Because you've got two capturing groups for the expression between comparison signs:
(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)
         ^^              ^ ^
         |`----- $3 -----' |
         `------- $2 ------'
Change it to
(<=?|>=?)((?:[0-9]|\+|\(|\))+)(<=?|>=?)
           ^^
Because you have a quantified capture group, (...)+.
Each pass through the quantified group overwrites what the group captured on the previous pass.
The result is that you only see the last capture.
You can see it below as group 3 (3 start / 3 end).
( <=? | >=? ) # (1)
( # (2 start)
( # (3 start)
[0-9]
| \+
| \(
| \)
)+ # (3 end)
) # (2 end)
( <=? | >=? ) # (4)
The individual pieces are of no use in this case,
so changing the inner group to a cluster (non-capturing) group will exclude it from the output array.
( <=? | >=? ) # (1)
( # (2 start)
(?:
[0-9]
| \+
| \(
| \)
)+
) # (2 end)
( <=? | >=? ) # (3)
Output
** Grp 0 - ( pos 0 , len 7 )
<(2+2)<
** Grp 1 - ( pos 0 , len 1 )
<
** Grp 2 - ( pos 1 , len 5 )
(2+2)
** Grp 3 - ( pos 6 , len 1 )
<
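To see the difference directly in PHP, here is a minimal sketch that runs the original pattern and the non-capturing variant against the same input. The third pattern, which uses a single character class, is not from the answers above; it is an assumption that it suits your input, since it accepts the same characters without any inner group.

```php
<?php
$subject = '2<(2+2)<8';

// Original pattern: the quantified inner group (group 3) keeps only its last pass, ")".
preg_match('/(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)/', $subject, $withCapture);

// Inner group made non-capturing with (?:...): it no longer appears in the result.
preg_match('/(<=?|>=?)((?:[0-9]|\+|\(|\))+)(<=?|>=?)/', $subject, $withoutCapture);

// Alternative (an assumption, not from the answers): a character class instead of alternation.
preg_match('/(<=?|>=?)([0-9+()]+)(<=?|>=?)/', $subject, $withClass);

print_r($withCapture);
print_r($withoutCapture);
print_r($withClass);
```

With the non-capturing group, the second comparison sign moves from index 4 to index 3.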
I've been trying to use a regular expression to match and extract parts of a URL.
The URL pattern looks like:
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
I intend to capture the following groups:
match and capture xyz (optional, but specific value)
match and capture fe/fi/fo5/fu2m (must exist, arbitrary value)
match and capture 123 (optional numeric value, which must appear at the end)
Here are the expressions I have tried and the problems I encountered:
string1: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
string2: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))\/$
makes number at end mandatory
matches and captures all groups as required in string1 even when xyz is not included
no match in string2 because there's no number at the end
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))?\/$
makes number at end optional
captures only groups 1 and 2 in string1 and string2. The number is matched as part of group 2 in string1, as fe/fi/fo5/fu2m/123
My problem is how to capture groups 1, 2 and 3 in all scenarios incl. string1 and string2 (note: I am using PHP's preg_match function)
I would use parse_url first to extract the path from the URL. Then all you have to do is use a non-greedy quantifier in the second group:
$path = parse_url($url, PHP_URL_PATH);
// \A already anchors at the start of the string, so the extra ^ is not needed
if ( preg_match('~\A/([^/]+)/(.*?)/(?:(\d+)/)?\z~', $path, $m) ) {
    var_dump($m);
}
This way, if the number at the end is missing, the non-greedy quantifier (from the second group) is forced to reach the end of the string.
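As a rough sketch of how this behaves on both sample URLs, the snippet below wraps the idea in a helper. The function name splitPath and the array keys are hypothetical. Note that the first group here captures whatever the first path segment is, so the sketch assumes the xyz segment is always present.

```php
<?php
// Hypothetical helper: splits a URL path into first segment, middle part and optional trailing number.
function splitPath(string $url): ?array
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path component, or the URL could not be parsed
    }
    // (.*?) is lazy: it grows only as far as needed for the optional number and \z to match.
    if (!preg_match('~\A/([^/]+)/(.*?)/(?:(\d+)/)?\z~', $path, $m)) {
        return null;
    }
    return [
        'first'  => $m[1],
        'middle' => $m[2],
        'number' => $m[3] ?? null, // index 3 is absent when there is no trailing number
    ];
}

print_r(splitPath('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/'));
print_r(splitPath('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/'));
```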
Use a modified URL validator.
'~^(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?#)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:/(xyz))?((?:/(?!\d+/?$)[^/]*)+)(?:/(\d+))?/?\s*$~'
Group 1 is optional xyz
Group 2 is required middle
Group 3 is optional number at the end
Readable version
^
(?! mailto: )
(?:
(?: https? | ftp )
://
)?
(?:
\S+
(?: : \S* )?
#
)?
(?:
(?:
(?:
[1-9] \d?
| 1 \d\d
| 2 [01] \d
| 22 [0-3]
)
(?:
\.
(?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
){2}
(?:
\.
(?:
[1-9] \d?
| 1 \d\d
| 2 [0-4] \d
| 25 [0-4]
)
)
| (?:
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)
(?:
\.
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)*
(?:
\.
(?: [a-z\u00a1-\uffff]{2,} )
)
)
| localhost
)
(?: : \d{2,5} )?
(?:
/
( xyz ) # Optional specific value
)?
( # Must exist, arbitrary value
(?:
/
(?! \d+ /? $ ) # Not a numeric value at the end
[^/]*
)+
)
(?:
/
( \d+ ) # Optional numeric value, which must appear at the end
)?
/?
\s*
$
Output
** Grp 0 - ( pos 0 : len 46 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
** Grp 1 - ( pos 21 : len 3 )
xyz
** Grp 2 - ( pos 24 : len 15 )
/fe/fi/fo5/fu2m
** Grp 3 - ( pos 40 : len 3 )
123
** Grp 0 - ( pos 48 : len 42 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
** Grp 1 - ( pos 69 : len 3 )
xyz
** Grp 2 - ( pos 72 : len 18 )
/fe/fi/fo5/fu2m/
** Grp 3 - NULL
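If full URL validation is not needed, the path-related tail of this pattern can be tried on its own. The sketch below is a simplification, not the validator above: it assumes the path has already been extracted with parse_url, and the helper name splitUrlTail and the array keys are made up for illustration.

```php
<?php
// Hypothetical helper: applies only the path part of the pattern to an already extracted path.
function splitUrlTail(string $url): ?array
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    // (?!\d+/?$) stops the middle group before a purely numeric final segment.
    $pattern = '~^(?:/(xyz))?((?:/(?!\d+/?$)[^/]*)+)(?:/(\d+))?/?$~';
    if (!preg_match($pattern, $path, $m, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }
    return [
        'prefix' => $m[1] ?? null,
        'middle' => rtrim($m[2], '/'), // group 2 keeps the trailing slash when no number follows
        'number' => $m[3] ?? null,
    ];
}

print_r(splitUrlTail('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/'));
print_r(splitUrlTail('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/'));
print_r(splitUrlTail('http://domain.abcdef/fe/fi/123/'));
```

Unlike the lazy-quantifier approach, this version also copes with a missing xyz segment, because group 1 is optional and tied to a specific value.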
How can I parse strings with a regex to calculate the total seconds?
The strings will look like this, for example:
40s
11m1s
1h47m3s
I started with the following regex
((\d+)h)((\d+)m)((\d+)s)
But this regex will only match the last example.
How can I make the parts optional?
Is there a better regex?
The format that you are using is very similar to the one that is used by java.time.Duration:
https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html#parse-java.lang.CharSequence-
Maybe you can use it instead of writing something custom?
Duration uses a format like this:
PT1H47M3S
Maybe you can add the leading "PT" and parse it? Duration.parse accepts the letters in upper or lower case, so uppercasing should not be required.
The format is called "ISO-8601":
https://en.wikipedia.org/wiki/ISO_8601
For example,
$set = array(
'40s',
'11m1s',
'1h47m3s'
);
$date = new DateTime();
$date2 = clone $date; // start from the same instant so the difference is exact
foreach ($set as $value) {
$date2->add(new DateInterval('PT'.strtoupper($value)));
}
echo $date2->getTimestamp() - $date->getTimestamp(); // 7124 = 1hour 58mins 44secs.
You could use an optional non-capture group for each part (\d+h, \d+m, \d+s):
$strs = ['40s', '11m1s', '1h47m3s'];
foreach ($strs as $str) {
if (preg_match('~(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?~', $str, $matches)) {
print_r($matches);
}
}
Outputs:
Array
(
[0] => 40s
[1] => // h
[2] => // m
[3] => 40 // s
)
Array
(
[0] => 11m1s
[1] => // h
[2] => 11 // m
[3] => 1 // s
)
Array
(
[0] => 1h47m3s
[1] => 1 // h
[2] => 47 // m
[3] => 3 // s
)
Regex:
(?:          # non-capture group 1
  (          # capture group 1
    \d+      # one or more digits
  )          # end capture group 1
  h          # letter 'h'
)            # end non-capture group 1
?            # optional
(?:          # non-capture group 2
  (          # capture group 2
    \d+      # one or more digits
  )          # end capture group 2
  m          # letter 'm'
)            # end non-capture group 2
?            # optional
(?:          # non-capture group 3
  (          # capture group 3
    \d+      # one or more digits
  )          # end capture group 3
  s          # letter 's'
)            # end non-capture group 3
?            # optional
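To get from the captured groups to the total seconds the question asks for, something along these lines should work. It is a minimal sketch: the function name durationToSeconds is hypothetical, and the \A and \z anchors plus the empty-string check are additions (assumptions about what counts as valid input), since the unanchored pattern also matches an empty string.

```php
<?php
// Hypothetical helper: converts strings such as "1h47m3s" into a number of seconds.
function durationToSeconds(string $value): ?int
{
    // Anchored so the whole string must consist of the optional h/m/s parts.
    if ($value === '' || !preg_match('~\A(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?\z~', $value, $m)) {
        return null;
    }
    // Unmatched groups are either empty strings or missing; both become 0.
    $hours   = (int) ($m[1] ?? 0);
    $minutes = (int) ($m[2] ?? 0);
    $seconds = (int) ($m[3] ?? 0);

    return $hours * 3600 + $minutes * 60 + $seconds;
}

foreach (['40s', '11m1s', '1h47m3s'] as $value) {
    echo $value, ' = ', durationToSeconds($value), " seconds\n";
}
```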
This expression:
/(\d*?)s|(\d*?)m(\d*?)s|(\d*?)h(\d*?)m(\d*?)s/gm
returns 3 matches, one for each line. Each match is separated into the salient groups of only numbers.
The gist is that this will match either any number of digits before an 's' or that plus any number of digits before an 'm' or that plus any number of digits before an 'h'.
I have data in this format:
Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 Randomtext4 (random5,random7,random8) Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()
with this:
preg_match_all("/\b\w+\b(?:\s*\(.*?\)|)/",$text,$matches);
I obtain:
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)',
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8)',
5 => 'random10',
6 => 'Randomtext11()',
but I want
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)'
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8), random10)'
5 => 'Randomtext11()'
Any ideas?
You need a recursive pattern to handle nested parentheses:
if ( preg_match_all('~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~', $text, $matches) )
print_r($matches[0]);
details:
~ # delimiter
\w+
(?:
\s*
( # capture group 1
\(
[^()]*+ # all that isn't a round bracket
# (possessive quantifier *+ to prevent too many backtracking
# steps in case of badly formatted string)
(?:
(?1) # recursion in the capture group 1
[^()]*
)*+
\)
) # close the capture group 1
)? # to make the group optional (instead of "|)")
~
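Note that you don't need to add word boundaries around \w+, since \w+ is greedy and a match attempt never starts in the middle of a word that was already consumed.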
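A small sketch of how the recursive pattern behaves with deeper nesting and with unbalanced input. The sample string is made up for illustration; when a closing bracket is missing, the optional group simply does not match and the words are returned individually.

```php
<?php
$pattern = '~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~';

// Made-up input: one deeply nested, balanced item followed by an unbalanced one.
$text = 'f(a(b(c))) g(a(b';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

The balanced item comes back as one match, f(a(b(c))), while the unbalanced tail yields g, a and b as separate matches.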
Following up on my previous post, I'm trying to use a regular expression to match all use statements in a class file.
<?php
use Vendor\ProjectArticle\Model\Peer,
Vendor\Library\Template;
use Vendor\Blablabla;
$file = file_get_contents($class_path);
$a = preg_match_all('#use (?:(?<ns>[^,;]+),?)+;#mi', $file, $use);
var_dump(array('$a' => $a, '$use' => $use));
Unfortunately, I don't get all the namespaces when a single use statement lists several class names. Only the last one matched is stored.
Array
(
[$a] => 2
[$use] => Array
(
[0] => Array
(
[0] => use Vendor\ProjectArticle\Model\Peer,
Vendor\Library\Template;
[1] => use Vendor\Blablabla;
)
[ns] => Array
(
[0] =>
Vendor\Library\Template
[1] => Vendor\Blablabla
)
[1] => Array
(
[0] =>
Vendor\Library\Template
[1] => Vendor\Blablabla
)
)
)
Can this be accomplished with some pattern modifier or something?
~Thanks
You should be able to use the \G anchor for this.
# '~(?:(?!\A)\G|^Use\s+),?\s*(?<ns>[^,;]+)(?=(?:,|[^,;]*)*;)~mi'
(?xmi-) # Inline modifier = expanded, multiline, case insensitive
(?:
(?! \A ) # Not beginning of string
\G # If matched before, start at end of last match
| # or,
^ Use \s+ # Beginning of line then 'Use' + whitespace
)
,? \s* # Whitespace trim
(?<ns> [^,;]+ ) # (1), A namespace value
(?= # Lookahead, each match validates a final ';'
(?: , | [^,;]* )*
;
)
Output:
** Grp 0 - ( pos 0 , len 36 )
use Vendor\ProjectArticle\Model\Peer
** Grp 1 - ( pos 4 , len 32 )
Vendor\ProjectArticle\Model\Peer
---------------------
** Grp 0 - ( pos 36 , len 30 )
,
Vendor\Library\Template
** Grp 1 - ( pos 43 , len 23 )
Vendor\Library\Template
---------------------
** Grp 0 - ( pos 69 , len 20 )
use Vendor\Blablabla
** Grp 1 - ( pos 73 , len 16 )
Vendor\Blablabla
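In PHP this would be used with preg_match_all, so that each namespace becomes its own match instead of a repeated capture inside one match. A minimal sketch, with the class-file contents inlined as a nowdoc purely for illustration (in the question they come from file_get_contents):

```php
<?php
// Sample contents (made up for illustration); a nowdoc keeps the backslashes literal.
$file = <<<'SRC'
<?php
use Vendor\ProjectArticle\Model\Peer,
    Vendor\Library\Template;
use Vendor\Blablabla;
SRC;

// \G continues exactly where the previous match ended, so the namespaces
// after the first one are picked up as separate matches.
$pattern = '~(?:(?!\A)\G|^use\s+),?\s*(?<ns>[^,;]+)(?=(?:,|[^,;]*)*;)~mi';

$count = preg_match_all($pattern, $file, $matches);
$namespaces = $matches['ns'];

print_r($namespaces);
```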
I have two conditions in my regex (the regex is used in PHP)
(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))
When I test the 1st condition with the following input, I get match groups 1, 2, 3 and 4
BIOLOGIQUES 47 131002 / 4302
Please see the 1st condition here http://www.rubular.com/r/a6zQS8Wth6
But when I test with the second condition, the matched groups are 5, 6, 7 and 8
Dossier N° : 47 131002 / 4302
The second condition here : http://www.rubular.com/r/eYzBJq1rIW
Is there a way to always have 1 2 3 and 4 match groups in the second condition too?
Since the parts of both regexps that match the numbers are the same, you can do the alternation just for the beginning, instead of around the entire regexp:
preg_match('/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u', $content, $match);
Use the u modifier to match UTF-8 characters correctly.
I assume your regex is compressed. If the dot is meant to abbreviate the middle initial, it should be escaped. The suggestion below factors the pattern the same way Barmar's does. If you don't want to capture the different names, remove the parentheses around them.
Sorry, it looks like you intend it to be a dot metacharacter. In that case, just remove the \ before the dot.
# (?:(BIOLOGIQUES)|(Dossier\ N\.\s+:))\s+((\d+)\s+(\d+)\s+\/\s+(\d+))
(?:
( BIOLOGIQUES ) # (1)
| ( Dossier\ N \. \s+ : ) # (2)
)
\s+
( # (3 start)
( \d+ ) # (4)
\s+
( \d+ ) # (5)
\s+ \/ \s+
( \d+ ) # (6)
) # (3 end)
Edit: the regex should be factored, but if the branches become too different, a way to reuse the same capture group numbers is to use a branch reset group, (?|...).
Here is your original code with some annotations using branch reset.
(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))
(?|
br 1 ( # (1 start)
BIOLOGIQUES \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
|
br 1 ( # (1 start)
Dossier\ N . \s+ : \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
)
Or, you could factor it AND use branch reset.
# (?|(BIOLOGIQUES\s+)|(Dossier\ N.\s+:\s+))(?:(\d+)\s+(\d+)\s+\/\s+(\d+))
(?|
br 1 ( BIOLOGIQUES \s+ ) # (1)
|
br 1 ( Dossier\ N . \s+ : \s+ ) # (1)
)
(?:
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
)
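For reference, here is a minimal PHP sketch of the first branch reset variant applied to both sample lines. The u modifier is added on the assumption that the input is UTF-8, so that the dot matches the whole ° character; the degree sign is written as an escape only to keep the snippet independent of the source file encoding.

```php
<?php
$lines = [
    'BIOLOGIQUES 47 131002 / 4302',
    "Dossier N\u{00B0} : 47 131002 / 4302", // "Dossier N° : ..."
];

// (?|...) restarts group numbering in each branch, so both branches fill groups 1 to 4.
$pattern = '/(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))/u';

$results = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        // Group 1 is the whole matched record; groups 2 to 4 are the three numbers.
        $results[] = [$m[2], $m[3], $m[4]];
    }
}

print_r($results);
```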