Regex ignore if empty - php

I have two alternatives in my regex (used in PHP):
(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))
When I test the first alternative with the following input, I get four match groups: 1, 2, 3 and 4.
BIOLOGIQUES 47 131002 / 4302
Please see the 1st condition here http://www.rubular.com/r/a6zQS8Wth6
But when I test the second alternative, the matched groups are 5, 6, 7 and 8:
Dossier N° : 47 131002 / 4302
The second condition here : http://www.rubular.com/r/eYzBJq1rIW
Is there a way to always get match groups 1, 2, 3 and 4 for the second alternative too?

Since the parts of both regexps that match the numbers are the same, you can do the alternation just for the beginning, instead of around the entire regexp:
preg_match('/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u', $content, $match);
Use the u modifier to match UTF-8 characters correctly.
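As a minimal sketch of how the factored pattern behaves, the snippet below runs it against both sample lines. The `$samples` and `$results` names are only illustrative, and the degree sign is written as `\u{B0}` (PHP 7+) so the example does not depend on the encoding of the source file. Without the `u` modifier the `.` would consume only the first byte of the two-byte `°`, so the second line would most likely fail to match.

```php
<?php
// Factored pattern: the alternation covers only the differing prefix,
// so the numbers always land in groups 2, 3 and 4 (group 1 is the whole match).
$pattern = '/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u';

$samples = [
    'BIOLOGIQUES 47 131002 / 4302',
    "Dossier N\u{B0} : 47 131002 / 4302", // "Dossier N° : ..." in UTF-8
];

$results = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $match) === 1) {
        // Keep only the three numeric groups (indexes 2, 3, 4).
        $results[] = array_slice($match, 2);
    }
}

print_r($results);
```

Both lines yield the same three numbers at the same group indexes, which is what the question asks for.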

I assume your regex is written in compressed (non-free-spacing) form. If the dot is meant as the literal period of an abbreviation, it should be escaped. The suggestion below factors out the common part, as Barmar's does. If you don't want to capture the two different prefixes, remove the parentheses around them.
Edit: sorry, it looks like you intend the dot to be a metacharacter. In that case, just remove the \ in front of it.
# (?:(BIOLOGIQUES)|(Dossier\ N\.\s+:))\s+((\d+)\s+(\d+)\s+\/\s+(\d+))
(?:
( BIOLOGIQUES ) # (1)
| ( Dossier\ N \. \s+ : ) # (2)
)
\s+
( # (3 start)
( \d+ ) # (4)
\s+
( \d+ ) # (5)
\s+ \/ \s+
( \d+ ) # (6)
) # (3 end)
Edit: the regex should be factored, but if the alternatives become too different to factor, a way to reuse the same capture group numbers is a branch reset group, (?|...).
Here is your original regex with branch reset and some annotations.
(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))
(?|
br 1 ( # (1 start)
BIOLOGIQUES \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
|
br 1 ( # (1 start)
Dossier\ N . \s+ : \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
)
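A short sketch of the branch reset pattern above used from PHP; `extractNumbers` is a hypothetical helper name, and the `u` modifier is added on the assumption that the input is UTF-8 (so that `.` matches the whole `°` character).

```php
<?php
// Branch reset (?|...): group numbering restarts at 1 in every alternative,
// so both branches fill groups 1 to 4.
$pattern = '/(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))'
         . '|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))/u';

// Hypothetical helper: returns the three numbers, or null if nothing matched.
function extractNumbers(string $pattern, string $line): ?array
{
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }
    // $m[1] is the whole matched alternative; $m[2..4] are the numbers.
    return [$m[2], $m[3], $m[4]];
}

var_dump(extractNumbers($pattern, 'BIOLOGIQUES 47 131002 / 4302'));
var_dump(extractNumbers($pattern, "Dossier N\u{B0} : 47 131002 / 4302"));
```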
Or, you could factor it AND use branch reset.
# (?|(BIOLOGIQUES\s+)|(Dossier\ N.\s+:\s+))(?:(\d+)\s+(\d+)\s+\/\s+(\d+))
(?|
br 1 ( BIOLOGIQUES \s+ ) # (1)
|
br 1 ( Dossier\ N . \s+ : \s+ ) # (1)
)
(?:
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
)

Related

Regular expression that matches given pattern and ends with optional number

I've been trying to use a regular expression to match and extract parts of a URL.
The URL pattern looks like:
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
I intend to capture the following groups:
match and capture xyz (optional, but specific value)
match and capture fe/fi/fo5/fu2m (must exist, arbitrary value)
match and capture 123 (optional numeric value, which must appear at the end)
Here are expressions I have tried and problem encountered:
string1: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
string2: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))\/$
makes number at end mandatory
matches and captures all groups as required in string1 even when xyz is not included
no match in string2 because there's no number at the end
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))?\/$
makes number at end optional
captures only groups 1 and 2 in string1 and string2. The number is matched as part of group 2 in string1, as fe/fi/fo5/fu2m/123
My problem is how to capture groups 1, 2 and 3 in all scenarios incl. string1 and string2 (note: I am using PHP's preg_match function)
I would use parse_url first to extract the path from the URL. Then all you have to do is use a non-greedy quantifier in the second group:
$path = parse_url($url, PHP_URL_PATH);
if ( preg_match('~^\A/([^/]+)/(.*?)/(?:(\d+)/)?\z~', $path, $m) )
var_dump($m);
This way, if the number at the end is missing, the non-greedy quantifier (from the second group) is forced to reach the end of the string.
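The following is a sketch of the same idea applied to both test strings. It is a variation on the pattern above, not the exact one: the first group is restricted to the literal `xyz` and made optional, as the question describes. `splitPath` and the array keys are hypothetical names, and `PREG_UNMATCHED_AS_NULL` assumes PHP 7.2 or later.

```php
<?php
// Hypothetical helper: split a URL path into (xyz?, middle, number?).
function splitPath(string $url): ?array
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }

    // (?:/(xyz))?  optional, specific first segment
    // /(.*?)       lazy middle part, grows only as far as it has to
    // (?:/(\d+))?  optional trailing number
    // /\z          the path must end with a slash
    $pattern = '~\A(?:/(xyz))?/(.*?)(?:/(\d+))?/\z~';

    if (preg_match($pattern, $path, $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }

    return [
        'prefix' => $m[1],
        'middle' => $m[2],
        'number' => $m[3] ?? null, // trailing unmatched group may be absent
    ];
}

var_dump(splitPath('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/'));
var_dump(splitPath('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/'));
```

Because the middle group is lazy, it stops as soon as the rest of the pattern can match, which leaves a trailing number (when there is one) for the third group.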
Use a modified URL validator.
'~^(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?#)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:/(xyz))?((?:/(?!\d+/?$)[^/]*)+)(?:/(\d+))?/?\s*$~'
Group 1 is optional xyz
Group 2 is required middle
Group 3 is optional number at the end
Readable version
^
(?! mailto: )
(?:
(?: https? | ftp )
://
)?
(?:
\S+
(?: : \S* )?
#
)?
(?:
(?:
(?:
[1-9] \d?
| 1 \d\d
| 2 [01] \d
| 22 [0-3]
)
(?:
\.
(?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
){2}
(?:
\.
(?:
[1-9] \d?
| 1 \d\d
| 2 [0-4] \d
| 25 [0-4]
)
)
| (?:
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)
(?:
\.
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)*
(?:
\.
(?: [a-z\u00a1-\uffff]{2,} )
)
)
| localhost
)
(?: : \d{2,5} )?
(?:
/
( xyz ) # Optional specific value
)?
( # Must exist, arbitrary value
(?:
/
(?! \d+ /? $ ) # Not a numeric value at the end
[^/]*
)+
)
(?:
/
( \d+ ) # Optional numeric value, which must appear at the end
)?
/?
\s*
$
Output
** Grp 0 - ( pos 0 : len 46 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
** Grp 1 - ( pos 21 : len 3 )
xyz
** Grp 2 - ( pos 24 : len 15 )
/fe/fi/fo5/fu2m
** Grp 3 - ( pos 40 : len 3 )
123
** Grp 0 - ( pos 48 : len 42 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
** Grp 1 - ( pos 69 : len 3 )
xyz
** Grp 2 - ( pos 72 : len 18 )
/fe/fi/fo5/fu2m/
** Grp 3 - NULL

preg_match_all split conditional expression

I have data in this format:
Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 Randomtext4 (random5,random7,random8) Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()
with this:
preg_match_all("/\b\w+\b(?:\s*\(.*?\)|)/",$text,$matches);
I obtain:
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)',
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8)',
5 => 'random10',
6 => 'Randomtext11()',
but I want
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)'
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8), random10)'
5 => 'Randomtext11()'
Any ideas?
You need a recursive pattern to handle nested parentheses:
if ( preg_match_all('~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~', $text, $matches) )
print_r($matches[0]);
demo
details:
~ # delimiter
\w+
(?:
\s*
( # capture group 1
\(
[^()]*+ # all that isn't a round bracket
# (possessive quantifier *+ to prevent too many backtracking
# steps in case of badly formatted string)
(?:
(?1) # recursion in the capture group 1
[^()]*
)*+
\)
) # close the capture group 1
)? # to make the group optional (instead of "|)")
~
Note that you don't need to add word-boundaries around \w+
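For reference, here is a self-contained run of the recursive pattern against the sample data from the question (the `$text` value is copied from the question; nothing else is assumed):

```php
<?php
$text = 'Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 '
      . 'Randomtext4 (random5,random7,random8) '
      . 'Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()';

// (?1) re-enters capture group 1, so each nested "(...)" is consumed as a unit
// instead of stopping at the first closing parenthesis.
$pattern = '~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

The fifth element of `$matches[0]` now contains the whole nested call, and `random10` no longer appears as a separate match.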

Matching math between < > symbols

I'm trying to build a function that matches the math expression between two greater (or equal) or smaller (or equal) symbols.
I have the following preg_match function:
preg_match("/(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)/", "2<(2+2)<8", $matches);
When I read the $matches array I get:
Array
(
[0] => <(2+2)<
[1] => <
[2] => (2+2)
[3] => )
[4] => <
)
Can anyone explain why the closing ) gets matched as part of the (2+2) and on its own? I would like it to only match the whole (2+2).
Because you've got two capturing groups for the expression between comparison signs:
(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)
         ^^              ^ ^
         |`----- $3 -----' |
         `------- $2 ------'
Change it to
(<=?|>=?)((?:[0-9]|\+|\(|\))+)(<=?|>=?)
           ^^
Because you have a quantified capture group, (...)+.
Each pass through the capture group overwrites what it captured on the previous pass.
The result is that you only see the last capture.
You can see it below as (3 start) / (3 end).
( <=? | >=? ) # (1)
( # (2 start)
( # (3 start)
[0-9]
| \+
| \(
| \)
)+ # (3 end)
) # (2 end)
( <=? | >=? ) # (4)
The individual pieces are of no use in this case;
changing the inner group to a cluster (non-capturing) group will exclude it from the output array.
( <=? | >=? ) # (1)
( # (2 start)
(?:
[0-9]
| \+
| \(
| \)
)+
) # (2 end)
( <=? | >=? ) # (3)
Output
** Grp 0 - ( pos 0 , len 7 )
<(2+2)<
** Grp 1 - ( pos 0 , len 1 )
<
** Grp 2 - ( pos 1 , len 5 )
(2+2)
** Grp 3 - ( pos 6 , len 1 )
<
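The difference between the two patterns above can be checked directly in PHP. This is only a small demonstration; `$withCapture` and `$withCluster` are illustrative variable names.

```php
<?php
$subject = '2<(2+2)<8';

// Capturing inner group: group 3 keeps only its last iteration, the ")".
preg_match('/(<=?|>=?)(([0-9]|\+|\(|\))+)(<=?|>=?)/', $subject, $withCapture);

// Non-capturing inner group: the extra entry disappears and the
// closing comparison sign moves from index 4 to index 3.
preg_match('/(<=?|>=?)((?:[0-9]|\+|\(|\))+)(<=?|>=?)/', $subject, $withCluster);

print_r($withCapture);
print_r($withCluster);
```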

Getting part of string after space

I'm receiving a string from the Wikipedia API which looks like this:
{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]
I have to use both the actual URLs and the description of each URL. So, for example, for
[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]
I need to have "http://www.bbc.co.uk/news/world-europe-17298730" and also "France] from the [[BBC News]] " but without the brackets, like so "France from the BBC News".
I managed to get the first parts, by doing the following:
if(preg_match_all('/\[http(.*?)\s/',$result,$extmatch)) {
$mt= str_replace("[[","",$extmatch[1]);
But I don't know how to go around getting the second part (I'm quite weak at regex unfortunately :-( ).
Any ideas?
A solution not using regex:
Explode the string at '*'.
Ditch the parts that do not start with '[' (for example, those starting with '{').
Remove all the brackets.
Explode each remaining part at the space character.
The first piece is the link.
Glue the rest back together for the description.
The code:
$parts=explode('*',$str);
$links=array();
foreach($parts as $k=>$v){
$parts[$k]=ltrim($v);
if(substr($parts[$k],0,1)!=='['){
unset($parts[$k]);
continue;
}
$parts[$k]=preg_replace('/\[|\]/','',$parts[$k]);
$subparts=explode(' ',$parts[$k]);
$links[$k][0]=$subparts[0];
unset($subparts[0]);
$links[$k][1]=implode(' ',$subparts);
}
echo '<pre>'.print_r($links,true).'</pre>';
The result:
Array
(
[1] => Array
(
[0] => http://www.bbc.co.uk/news/world-europe-17298730
[1] => France from the BBC News
)
[2] => Array
(
[0] => http://ucblibraries.colorado.edu/govpubs/for/france.htm
[1] => France at ''UCB Libraries GovPubs''
)
[4] => Array
(
[0] => http://www.britannica.com/EBchecked/topic/215768/France
[1] => France ''Encyclopædia Britannica'' entry
)
[5] => Array
(
[0] => http://europa.eu/about-eu/countries/member-countries/france/index_en.htm
[1] => France at the European Union|EU
)
[8] => Array
(
[0] => http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR
[1] => Key Development Forecasts for France from International Futures ;Economy
)
[10] => Array
(
[0] => http://stats.oecd.org/Index.aspx?QueryId=14594
[1] => OECD France statistics
)
)
PHP:
$input = "{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
$regex = '/\[(http\S+)\s+([^\]]+)\](?:\s+from(?:\s+the)?\s+\[\[(.*?)\]\])?/';
preg_match_all($regex, $input, $matches, PREG_SET_ORDER);
var_dump($matches);
Output:
array(6) {
[0]=>
array(4) {
[0]=>
string(78) "[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]"
[1]=>
string(47) "http://www.bbc.co.uk/news/world-europe-17298730"
[2]=>
string(6) "France"
[3]=>
string(8) "BBC News"
}
...
...
...
...
...
}
Explanation:
\[ (?# match [ literally)
( (?# start capture group)
http (?# match http literally)
\S+ (?# match 1+ non-whitespace characters)
) (?# end capture group)
\s+ (?# match 1+ whitespace characters)
( (?# start capture group)
[^\]]+ (?# match 1+ non-] characters)
) (?# end capture group)
\] (?# match ] literally)
(?: (?# start non-capturing group)
\s+ (?# match 1+ whitespace characters)
from (?# match from literally)
(?: (?# start non-capturing group)
\s+ (?# match 1+ whitespace characters)
the (?# match the literally)
)? (?# end optional non-capturing group)
\s+ (?# match 1+ whitespace characters)
\[\[ (?# match [[ literally)
( (?# start capturing group)
.*? (?# lazily match 0+ characters)
) (?# end capturing group)
\]\] (?# match ]] literally)
)? (?# end optional non-capturing group)
Let me know if you need a more thorough explanation, but my comments above should help. If you have any specific questions I'd be more than happy to help. Link below will help you visualize what the expression is doing.
Regex101

regular expression end tag = start tag

Take a look at this regular expression:
(?:\(?")(.+)(?:"\)?)
This regex would match e.g
"a"
("a")
but also
"a)
How can I say that the starting delimiter (in this case " or (") has to correspond to the ending delimiter? There must be a simpler solution than this, right?
"(.+)"|(?:\(")(.+)(?:"\))
I don't think there's a good way to do this specifically with regex, so you are stuck doing something like this:
/(?:
"(.+)"
|
\( (.+) \)
)/x
how about:
(\(?)(")(.+)\2\1
explanation:
(?-imsx:(\(?)(")(.+)\2\1)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\2 what was matched by capture \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
) end of grouping
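Note that `\1` repeats the exact text captured by group 1, so when group 1 holds `(` the pattern expects a second `(` at the end, not `)`. One possible alternative is a PCRE conditional subpattern, which requires the closing parenthesis only when the opening one took part in the match. The sketch below makes a few assumptions that are not in the question: the whole string is anchored with `^` and `$`, the content is matched with `[^"]+` instead of `.+`, and `isWellDelimited` is a hypothetical helper name.

```php
<?php
// (\()?      group 1: optional opening parenthesis
// "([^"]+)"  group 2: the quoted content
// (?(1)\))   conditional: demand ")" only if group 1 matched
const DELIMITED = '/^(\()?"([^"]+)"(?(1)\))$/';

// Hypothetical helper: true when the delimiters are consistent.
function isWellDelimited(string $s): bool
{
    return preg_match(DELIMITED, $s) === 1;
}

foreach (['"a"', '("a")', '"a)', '("a"'] as $candidate) {
    printf("%-6s %s\n", $candidate, isWellDelimited($candidate) ? 'ok' : 'rejected');
}
```

With this pattern the quoted content is always in group 2, whichever form matched.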
You can use backreferences. Note that these are not special to PHP; PCRE and most other regex engines support them:
preg_match('/<([^>]+)>(.+)<\/\1>/', $subject, $matches); (the \1 refers to whatever the first group matched; the pattern is single-quoted so that PHP does not turn \1 into an octal escape)
This will use the first match as the condition for the closing match. It matches <a>something</a> but not <h2>something</a>.
However, in your case you would need to turn the "(" matched by the first group into a ")", which won't work.
Update: a workaround is to replace ( and ) with <BRACE> and <END_BRACE> first. Then you can match using /<([^>]+)>(.+)<END_\1>/. Do this for all the delimiter pairs you need: (), [], {}, <> and so on.
For example, "(a) is as nice as [f]" becomes "<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>", and the regex will capture both if you use preg_match_all:
$returnValue = preg_match_all('/<([^>]+)>(.+)<END_\\1>/', '<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>', $matches);
leads to
array (
0 =>
array (
0 => '<BRACE>a<END_BRACE>',
1 => '<BRACKET>f<END_BRACKET>',
),
1 =>
array (
0 => 'BRACE',
1 => 'BRACKET',
),
2 =>
array (
0 => 'a',
1 => 'f',
),
)
