I have data in this format:
Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 Randomtext4 (random5,random7,random8) Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()
with this:
preg_match_all("/\b\w+\b(?:\s*\(.*?\)|)/",$text,$matches);
I obtain:
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)',
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8)',
5 => 'random10',
6 => 'Randomtext11()',
but I want
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)'
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8), random10)'
5 => 'Randomtext11()'
Any ideas?
You need a recursive pattern to handle nested parentheses:
if ( preg_match_all('~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~', $text, $matches) )
print_r($matches[0]);
details:
~ # delimiter
\w+
(?:
\s*
( # capture group 1
\(
[^()]*+ # all that isn't a round bracket
# (possessive quantifier *+ to prevent too many backtracking
# steps in case of badly formatted string)
(?:
(?1) # recursion in the capture group 1
[^()]*
)*+
\)
) # close the capture group 1
)? # to make the group optional (instead of "|)")
~
Note that you don't need to add word-boundaries around \w+
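As a quick check, the sketch below runs the recursive pattern from this answer on the sample string from the question. The helper name extract_calls is only an illustrative assumption; the pattern is unchanged.

```php
<?php
// Hypothetical helper (the name is an assumption): returns each word together
// with its optional, possibly nested, parenthesised part.
function extract_calls(string $text): array
{
    // (?1) recurses into capture group 1, so nested (...) pairs stay balanced.
    $pattern = '~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text = 'Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 '
      . 'Randomtext4 (random5,random7,random8) '
      . 'Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()';

$calls = extract_calls($text);
print_r($calls);
```

The nested entry comes back as a single item because the recursion consumes the inner parentheses before the outer closing bracket is matched.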
I've been trying to use a regular expression to match and extract parts of a URL.
The URL pattern looks like:
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
I intend to capture the following groups:
match and capture xyz (optional, but specific value)
match and capture fe/fi/fo5/fu2m (must exist, arbitrary value)
match and capture 123 (optional numeric value, which must appear at the end)
Here are expressions I have tried and problem encountered:
string1: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
string2: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))\/$
makes number at end mandatory
matches and captures all groups as required in string1 even when xyz is not included
no match in string2 because there's no number at the end
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))?\/$
makes number at end optional
captures only groups 1 and 2 in string1 and string2. The number is matched as part of group 2 in string1, as fe/fi/fo5/fu2m/123
My problem is how to capture groups 1, 2 and 3 in all scenarios incl. string1 and string2 (note: I am using PHP's preg_match function)
I would use parse_url first to extract the path from the URL. Then all you have to do is use a non-greedy quantifier in the second group:
$path = parse_url($url, PHP_URL_PATH);
if ( preg_match('~\A/([^/]+)/(.*?)/(?:(\d+)/)?\z~', $path, $m) )
var_dump($m);
This way, if the number at the end is missing, the non-greedy quantifier (from the second group) is forced to reach the end of the string.
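A minimal sketch of this approach applied to both strings from the question. Two things here are assumptions rather than part of the answer: the helper name split_path, and the variant `(?:/(xyz))?` for the first group, which makes xyz optional and specific as the question asked.

```php
<?php
// Hypothetical helper (the name and the optional-xyz variant are assumptions).
function split_path(string $url): ?array
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path component, or an unparsable URL
    }
    // The lazy (.+?) stops as soon as the optional "/digits/" tail can finish the match.
    if (!preg_match('~\A(?:/(xyz))?/(.+?)/(?:(\d+)/)?\z~', $path, $m)) {
        return null;
    }
    return [
        'prefix' => $m[1] !== '' ? $m[1] : null, // unmatched leading group is ''
        'middle' => $m[2],
        'number' => $m[3] ?? null,               // unmatched trailing group is absent
    ];
}

$a = split_path('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/');
$b = split_path('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/');
$c = split_path('http://domain.abcdef/fe/fi/fo5/fu2m/123/');
var_dump($a, $b, $c);
```

Note that preg_match drops trailing groups that did not participate in the match, which is why the number is read with `??`.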
Use a modified URL validator.
'~^(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))|localhost)(?::\d{2,5})?(?:/(xyz))?((?:/(?!\d+/?$)[^/]*)+)(?:/(\d+))?/?\s*$~u'
Group 1 is optional xyz
Group 2 is required middle
Group 3 is optional number at the end
Readable version
^
(?! mailto: )
(?:
(?: https? | ftp )
://
)?
(?:
\S+
(?: : \S* )?
@
)?
(?:
(?:
(?:
[1-9] \d?
| 1 \d\d
| 2 [01] \d
| 22 [0-3]
)
(?:
\.
(?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
){2}
(?:
\.
(?:
[1-9] \d?
| 1 \d\d
| 2 [0-4] \d
| 25 [0-4]
)
)
| (?:
(?: [a-z\x{00a1}-\x{ffff}0-9]+ -? )*
[a-z\x{00a1}-\x{ffff}0-9]+
)
(?:
\.
(?: [a-z\x{00a1}-\x{ffff}0-9]+ -? )*
[a-z\x{00a1}-\x{ffff}0-9]+
)*
(?:
\.
(?: [a-z\x{00a1}-\x{ffff}]{2,} )
)
)
| localhost
)
(?: : \d{2,5} )?
(?:
/
( xyz ) # Optional specific value
)?
( # Must exist, arbitrary value
(?:
/
(?! \d+ /? $ ) # Not a numeric value at the end
[^/]*
)+
)
(?:
/
( \d+ ) # Optional numeric value, which must appear at the end
)?
/?
\s*
$
Output
** Grp 0 - ( pos 0 : len 46 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
** Grp 1 - ( pos 21 : len 3 )
xyz
** Grp 2 - ( pos 24 : len 15 )
/fe/fi/fo5/fu2m
** Grp 3 - ( pos 40 : len 3 )
123
** Grp 0 - ( pos 48 : len 42 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
** Grp 1 - ( pos 69 : len 3 )
xyz
** Grp 2 - ( pos 72 : len 18 )
/fe/fi/fo5/fu2m/
** Grp 3 - NULL
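To see how the negative lookahead keeps a trailing number out of group 2, the tail of the pattern can be tried on its own against just the path. This is a reduced sketch, not the full validator, and the helper name match_tail is an assumption.

```php
<?php
// Hypothetical helper: applies only the path-related tail of the validator.
function match_tail(string $path): array
{
    // (?!\d+/?$) refuses a segment that is a number sitting at the very end,
    // leaving that segment for the optional group 3.
    $tail = '~^(?:/(xyz))?((?:/(?!\d+/?$)[^/]*)+)(?:/(\d+))?/?$~';
    preg_match($tail, $path, $m);
    return $m;
}

$with    = match_tail('/xyz/fe/fi/fo5/fu2m/123/');
$without = match_tail('/xyz/fe/fi/fo5/fu2m/');
print_r($with);
print_r($without);
```

As in the output above, when there is no number group 2 also swallows the trailing slash, so it may need an rtrim($m[2], '/') afterwards.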
How can I parse strings with a regex to calculate the total seconds?
The strings will look like these examples:
40s
11m1s
1h47m3s
I started with the following regex
((\d+)h)((\d+)m)((\d+)s)
But this regex will only match the last example.
How can I make the parts optional?
Is there a better regex?
The format that you are using is very similar to the one that is used by java.time.Duration:
https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html#parse-java.lang.CharSequence-
Maybe you can use it instead of writing something custom?
Duration uses a format like this:
PT1H47M3S
Maybe you can add the leading "PT" and parse it (Duration.parse accepts the letters in upper or lower case)?
The format is called "ISO-8601":
https://en.wikipedia.org/wiki/ISO_8601
For example,
$set = array(
'40s',
'11m1s',
'1h47m3s'
);
$date = new DateTime();
$date2 = clone $date; // start from the same instant so the difference is exact
foreach ($set as $value) {
$date2->add(new DateInterval('PT'.strtoupper($value)));
}
echo $date2->getTimestamp() - $date->getTimestamp(); // 7124 = 1hour 58mins 44secs.
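If only hours, minutes and seconds ever occur, the DateTime arithmetic can be skipped by reading the parsed fields of the DateInterval directly. This is a sketch under that assumption; the function name interval_seconds is hypothetical, and an invalid string makes the DateInterval constructor throw.

```php
<?php
// Hypothetical helper: converts strings such as "1h47m3s" to seconds.
// Assumes the input contains only h, m and s units (no days, months, ...).
function interval_seconds(string $value): int
{
    // "1h47m3s" becomes the ISO 8601 duration "PT1H47M3S".
    $interval = new DateInterval('PT' . strtoupper($value));
    return $interval->h * 3600 + $interval->i * 60 + $interval->s;
}

$total = 0;
foreach (['40s', '11m1s', '1h47m3s'] as $value) {
    $total += interval_seconds($value);
}
echo $total, PHP_EOL;
```

Because no calendar date is involved, the result does not depend on the time zone or on daylight saving transitions.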
You could use optional non-capturing groups, one for each unit (\d+h, \d+m, \d+s):
$strs = ['40s', '11m1s', '1h47m3s'];
foreach ($strs as $str) {
if (preg_match('~(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?~', $str, $matches)) {
print_r($matches);
}
}
Outputs:
Array
(
[0] => 40s
[1] => // h
[2] => // m
[3] => 40 // s
)
Array
(
[0] => 11m1s
[1] => // h
[2] => 11 // m
[3] => 1 // s
)
Array
(
[0] => 1h47m3s
[1] => 1 // h
[2] => 47 // m
[3] => 3 // s
)
Regex:
(?: # non-capture group 1
( # capture group 1
\d+ # 1 or more digits
) # end capture group 1
h # letter 'h'
) # end non-capture group 1
? # optional
(?: # non-capture group 2
( # capture group 2
\d+ # 1 or more digits
) # end capture group 2
m # letter 'm'
) # end non-capture group 2
? # optional
(?: # non-capture group 3
( # capture group 3
\d+ # 1 or more digits
) # end capture group 3
s # letter 's'
) # end non-capture group 3
? # optional
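The captured groups still have to be turned into a number of seconds. The sketch below does that. The function name duration_to_seconds and the added \A and \z anchors are assumptions; the anchors are there because a pattern made only of optional parts otherwise matches any string, including an empty one.

```php
<?php
// Hypothetical helper: returns the total seconds, or null for invalid input.
function duration_to_seconds(string $str): ?int
{
    $pattern = '~\A(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?\z~';
    if ($str === '' || !preg_match($pattern, $str, $m)) {
        return null;
    }
    // Unmatched groups are either '' or missing from $m; both count as 0.
    return (int) ($m[1] ?? 0) * 3600
         + (int) ($m[2] ?? 0) * 60
         + (int) ($m[3] ?? 0);
}

foreach (['40s', '11m1s', '1h47m3s'] as $str) {
    echo $str, ' => ', duration_to_seconds($str), PHP_EOL;
}
```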
This expression:
/(\d*?)s|(\d*?)m(\d*?)s|(\d*?)h(\d*?)m(\d*?)s/gm
returns 3 matches, one for each line, and each match is split into groups that contain only the digits. (The trailing g is regex-tester notation; PHP has no g modifier, so use preg_match_all there.)
The gist is that it matches either digits before an 's', or digits before an 'm' followed by digits before an 's', or digits before an 'h' followed by both of those.
https://www.tehplayground.com/KWmxySzbC9VoDvP9
Why is the first string matched?
$list = [
'3928.3939392', // Should not be matched
'4.239,99',
'39',
'3929',
'2993.39',
'393993.999'
];
foreach($list as $str){
preg_match('/^(?<![\d.,])-?\d{1,3}(?:[,. ]?\d{3})*(?:[^.,%]|[.,]\d{1,2})-?(?![\d.,%]|(?: %))$/', $str, $matches);
print_r($matches);
}
output
Array
(
[0] => 3928.3939392
)
Array
(
[0] => 4.239,99
)
Array
(
[0] => 39
)
Array
(
[0] => 3929
)
Array
(
[0] => 2993.39
)
Array
(
)
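The first string matches because the separator in (?:[,. ]?\d{3})* is optional and [^.,%] accepts any other character, including a digit: the engine reads 3928.3939392 as 3, 928, .393, 939 and a final 2. You seem to want to match the numbers as standalone strings, so you do not need the lookarounds; anchors are enough.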
You may use
^-?(?:\d{1,3}(?:[,. ]\d{3})*|\d*)(?:[.,]\d{1,2})?$
Details
^ - start of string
-? - an optional -
(?: - start of a non-capturing alternation group:
\d{1,3}(?:[,. ]\d{3})* - 1 to 3 digits, followed by zero or more sequences of a comma, dot or space and then 3 digits
| - or
\d* - zero or more digits
) - end of the group
(?:[.,]\d{1,2})? - an optional sequence of . or , followed by 1 or 2 digits
$ - end of string.
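A short check of the suggested pattern against the list from the question. The wrapper name is_formatted_number is an assumption for illustration.

```php
<?php
// Hypothetical wrapper around the anchored pattern suggested above.
function is_formatted_number(string $str): bool
{
    $pattern = '/^-?(?:\d{1,3}(?:[,. ]\d{3})*|\d*)(?:[.,]\d{1,2})?$/';
    return preg_match($pattern, $str) === 1;
}

$list = [
    '3928.3939392', // should not be matched
    '4.239,99',
    '39',
    '3929',
    '2993.39',
    '393993.999',
];

$results = array_map('is_formatted_number', $list);
var_dump($results);
```

Since every part of the pattern is optional, it also accepts an empty string; if that matters, an extra `$str !== ''` check is probably the simplest guard.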
I have two conditions in my regex (regex used on php)
(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))
When I test the 1st condition with the following, I get match groups 1, 2, 3 and 4:
BIOLOGIQUES 47 131002 / 4302
Please see the 1st condition here http://www.rubular.com/r/a6zQS8Wth6
But when I test the second condition, the matched groups are 5, 6, 7 and 8:
Dossier N° : 47 131002 / 4302
The second condition here : http://www.rubular.com/r/eYzBJq1rIW
Is there a way to always have 1 2 3 and 4 match groups in the second condition too?
Since the parts of both regexps that match the numbers are the same, you can do the alternation just for the beginning, instead of around the entire regexp:
preg_match('/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u', $content, $match);
Use the u modifier to match UTF-8 characters correctly.
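A small sketch that runs this pattern on both sample lines and collects the three numbers. The degree sign is written as \u{00B0} so the example does not depend on the encoding of the source file; the variable names are illustrative.

```php
<?php
$pattern = '/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u';

$inputs = [
    'BIOLOGIQUES 47 131002 / 4302',
    "Dossier N\u{00B0} : 47 131002 / 4302",
];

$numbers = [];
foreach ($inputs as $content) {
    if (preg_match($pattern, $content, $match)) {
        // $match[1] is the whole record; groups 2 to 4 are the numbers.
        $numbers[] = array_slice($match, 2);
    }
}
print_r($numbers);
```

Both inputs now fill the same group numbers, so the calling code no longer needs to know which prefix matched.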
I assume your regex is compressed. If the dot is meant to abbreviate the middle initial, it should be escaped. The suggestion below factors the pattern the way Barmar's does. If you don't want to capture the different names, remove the parentheses around them.
Sorry, it looks like you intend it to be a dot metacharacter. In that case, just remove the \ before it.
# (?:(BIOLOGIQUES)|(Dossier\ N\.\s+:))\s+((\d+)\s+(\d+)\s+\/\s+(\d+))
(?:
( BIOLOGIQUES ) # (1)
| ( Dossier\ N \. \s+ : ) # (2)
)
\s+
( # (3 start)
( \d+ ) # (4)
\s+
( \d+ ) # (5)
\s+ \/ \s+
( \d+ ) # (6)
) # (3 end)
Edit: the regex should be factored, but if the branches differ too much, a way to reuse the same capture group numbers is a branch reset group, (?|...).
Here is your original pattern with some annotations, using branch reset.
(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))
(?|
br 1 ( # (1 start)
BIOLOGIQUES \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
|
br 1 ( # (1 start)
Dossier\ N . \s+ : \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
)
Or, you could factor it AND use branch reset.
# (?|(BIOLOGIQUES\s+)|(Dossier\ N.\s+:\s+))(?:(\d+)\s+(\d+)\s+\/\s+(\d+))
(?|
br 1 ( BIOLOGIQUES \s+ ) # (1)
|
br 1 ( Dossier\ N . \s+ : \s+ ) # (1)
)
(?:
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
)
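For reference, a minimal PHP sketch of the unfactored branch reset variant. The function name match_record is hypothetical, and the pattern is written without free-spacing mode, so the space in "Dossier N" is a plain literal.

```php
<?php
// Hypothetical helper: returns the match array, or null when nothing matches.
function match_record(string $content): ?array
{
    // Inside (?| ... ) each alternative restarts its group numbering at 1.
    $pattern = '~(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+/\s+(\d+))'
             . '|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+/\s+(\d+)))~u';
    return preg_match($pattern, $content, $m) ? $m : null;
}

$first  = match_record('BIOLOGIQUES 47 131002 / 4302');
$second = match_record("Dossier N\u{00B0} : 47 131002 / 4302");
print_r($first);
print_r($second);
```

Either branch produces exactly groups 1 to 4, which is the behaviour the question asked for.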
I have a string like
"first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth"
I want to explode it to array
Array (
0 => "first",
1 => "second[,b]",
2 => "third[a,b[1,2,3]]",
3 => "fourth[a[1,2]]",
4 => "sixth"
)
I tried to remove brackets:
preg_replace("/[ ( (?>[^[]]+) | (?R) )* ]/xis",
"",
"first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth"
);
But I got stuck on the next step.
PHP's regex flavor supports recursive patterns, so something like this would work:
$text = "first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth";
preg_match_all('/[^,\[\]]+(\[([^\[\]]|(?1))*])?/', $text, $matches);
print_r($matches[0]);
which will print:
Array
(
[0] => first
[1] => second[,b]
[2] => third[a,b[1,2,3]]
[3] => fourth[a[1,2]]
[4] => sixth
)
The key here is not to split, but match.
Whether you want to add such a cryptic regex to your code base, is up to you :)
EDIT
I just realized that my suggestion above will not match entries starting with [. To do that, do it like this:
$text = "first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth,[s,[,e,[,v,],e,],n]";
preg_match_all("/
( # start match group 1
[^,\[\]] # any char other than a comma or square bracket
| # OR
\[ # an opening square bracket
( # start match group 2
[^\[\]] # any char other than a square bracket
| # OR
(?R) # recursively match the entire pattern
)* # end match group 2, and repeat it zero or more times
] # a closing square bracket
)+ # end match group 1, and repeat it once or more times
/x",
$text,
$matches
);
print_r($matches[0]);
which prints:
Array
(
[0] => first
[1] => second[,b]
[2] => third[a,b[1,2,3]]
[3] => fourth[a[1,2]]
[4] => sixth
[5] => [s,[,e,[,v,],e,],n]
)
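If the recursive regex feels too cryptic, a plain loop that tracks the bracket depth gives the same result on these inputs. This is a different technique from the one in the answer (a depth counter instead of a recursive pattern), and the function name split_top_level is an assumption.

```php
<?php
// Hypothetical helper: splits on commas that are not inside [ ... ].
function split_top_level(string $text): array
{
    $parts = [];
    $current = '';
    $depth = 0;
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];
        if ($ch === '[') {
            $depth++;
        } elseif ($ch === ']') {
            $depth = max(0, $depth - 1); // tolerate a stray closing bracket
        }
        if ($ch === ',' && $depth === 0) {
            $parts[] = $current; // a top-level comma ends the current item
            $current = '';
            continue;
        }
        $current .= $ch;
    }
    $parts[] = $current;
    return $parts;
}

$text = 'first,second[,b],third[a,b[1,2,3]],fourth[a[1,2]],sixth,[s,[,e,[,v,],e,],n]';
$items = split_top_level($text);
print_r($items);
```

Unlike the regex, this version keeps empty items (for example between two consecutive top-level commas), which may or may not be what you want.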