Little help with regex - php

how can I match these:
(1, 'asd', 'asd2')
but not match this:
(1, '(data)', 0)
I want to match the outer ( and ), but not a ( or ) that appears inside them.
These are actually queries, and I want to split them with preg_split.
/[\(*\)]+/
splits them, but it also splits on the ( and ) inside them. How can I fix this?
Example:
The data is:
(1, 'user1', 1, 0, 0, 0)(2, 'user(2)', 1, 0, 0, 1)
I want to split them as:
Array(
0 => (1, 'user1', 1, 0, 0, 0)
1 => (2, 'user(2)', 1, 0, 0, 1)
);
Instead, it is split as:
Array(
0 => (1, 'user1', 1, 0, 0, 0)
1 => (2, 'user
2 => 2
3 => ', 1, 0, 0, 1)
);

A regex for this would be a little nasty. Instead, you can iterate over the entire string and decide where to split:
If it's a ), split there. (I'm assuming the brackets are balanced in the string and can't be nested)
If it's a ', ignore any ) until a closing ' (If it can be escaped, you can look at the previous characters for an odd number of \).
I think this is a more straightforward solution than a regex.
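The scanning approach described above could be sketched like this. The function name split_tuples is made up for illustration, and the backslash handling assumes quotes are escaped as \' (it skips the escaped character instead of counting backslashes); adjust it if your data escapes quotes differently.

```php
<?php
// Hypothetical helper: splits "(...)(...)" into separate tuples and ignores
// any parenthesis that appears inside a single-quoted string.
function split_tuples(string $input): array
{
    $tuples  = [];
    $current = '';
    $inQuote = false;
    $length  = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];
        $current .= $char;

        if ($inQuote) {
            if ($char === '\\' && $i + 1 < $length) {
                // Assumed escape style: copy the escaped character untouched.
                $current .= $input[++$i];
            } elseif ($char === "'") {
                $inQuote = false;   // closing quote
            }
        } elseif ($char === "'") {
            $inQuote = true;        // opening quote
        } elseif ($char === ')') {
            // A ")" outside quotes ends the current tuple.
            $tuples[] = ltrim($current);
            $current  = '';
        }
    }

    return $tuples;
}

$data = "(1, 'user1', 1, 0, 0, 0)(2, 'user(2)', 1, 0, 0, 1)";
print_r(split_tuples($data));
```

Unlike the regex attempts, this does not care whether the parentheses inside the quotes are balanced.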

You can't use preg_split for that (as you don't match borders, but lengthier patterns). But it might be possible with a preg_match_all:
preg_match_all(':\( ((?R) | .)*? \):x', $source, $matches);
print_r($matches[0]);
Instead of the recursive (?R) version, you could also write the pattern for just a single level of inner parentheses. But that doesn't actually look much simpler:
:\( ( [^()]* | \( [^()]* \) )+ \):x
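To see the recursive pattern at work on the sample data from the question, here is a minimal sketch. The ~ delimiter and the non-capturing group are my changes (a : delimiter would clash with (?: ). Note that it only works as long as any parentheses inside the quoted strings are balanced.

```php
<?php
$source = "(1, 'user1', 1, 0, 0, 0)(2, 'user(2)', 1, 0, 0, 1)";

// \(  ...  \)   one outer pair of parentheses
// (?R)          recurse into the whole pattern for a nested pair
// .             otherwise consume a single character (lazily)
preg_match_all('~\( (?: (?R) | . )*? \)~x', $source, $matches);

print_r($matches[0]);
```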

Your grammar appears to be
list: '(' num ( ',' term )(s?) ')'
term: num | str
num: /[0-9]+/
str: /'[^']*'/
So the pattern is
/ \G \s* \( \s* [0-9]+ (?: \s* , \s* (?: [0-9]+ | '[^']*' ) )* \s* \) /x
Well, that's just for matching. Extraction is trickier if PHP works like Perl. If you want to do it with regex matching, you have to do it in two passes.
First you extract the list:
/ \G \s* \( \s* ( [0-9]+ (?: \s* , \s* (?: [0-9]+ | '[^']*' ) )* ) \s* \) /x
Then you extract the terms from the list:
/ \G \s* ( [0-9]+ | '[^']*' ) (?: \s* , )? /x
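In PHP the two passes could look roughly like the sketch below. It relies on preg_match's offset argument so that \G anchors each attempt where the previous match ended; the variable names are only for illustration.

```php
<?php
$data = "(1, 'user1', 1, 0, 0, 0)(2, 'user(2)', 1, 0, 0, 1)";

// Pass 1: one whole list, capturing everything between the parentheses.
$listRe = <<<'RE'
/ \G \s* \( \s* ( [0-9]+ (?: \s* , \s* (?: [0-9]+ | '[^']*' ) )* ) \s* \) /x
RE;

// Pass 2: one term (number or quoted string) plus an optional trailing comma.
$termRe = <<<'RE'
/ \G \s* ( [0-9]+ | '[^']*' ) (?: \s* , )? /x
RE;

$rows   = [];
$offset = 0;
while (preg_match($listRe, $data, $list, 0, $offset)) {
    $offset += strlen($list[0]);   // \G forces the match to start at $offset

    $terms = [];
    $pos   = 0;
    while (preg_match($termRe, $list[1], $term, 0, $pos)) {
        $pos    += strlen($term[0]);
        $terms[] = $term[1];
    }
    $rows[] = $terms;
}

print_r($rows);
```

The quoted strings keep their surrounding quotes; strip them with trim($value, "'") if you only want the contents.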

Related

Regular expression that matches given pattern and ends with optional number

I've been trying to use a regular expression to match and extract parts of a URL.
The URL pattern looks like:
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
I intend to capture the following groups:
match and capture xyz (optional, but specific value)
match and capture fe/fi/fo5/fu2m (must exist, arbitrary value)
match and capture 123 (optional numeric value, which must appear at the end)
Here are expressions I have tried and problem encountered:
string1: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
string2: http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))\/$
makes number at end mandatory
matches and captures all groups as required in string1 even when xyz is not included
no match in string2 because there's no number at the end
^(?:https?:\/\/)?(?:[\da-z\.-]+)\.(?:[a-z\.]{2,6})(?:\/(xyz))?\/([\/\w]+)+(?:\/([\d]+))?\/$
makes number at end optional
captures only groups 1 and 2 in string1 and string2. The number is matched as part of group 2 in string1, as fe/fi/fo5/fu2m/123
My problem is how to capture groups 1, 2 and 3 in all scenarios incl. string1 and string2 (note: I am using PHP's preg_match function)
I would use parse_url first to extract the path from the URL. Then all you have to do is use a non-greedy quantifier in the second group:
$path = parse_url($url, PHP_URL_PATH);
if ( preg_match('~\A/([^/]+)/(.*?)/(?:(\d+)/)?\z~', $path, $m) )
    var_dump($m);
This way, if the number at the end is missing, the non-greedy quantifier (from the second group) is forced to reach the end of the string.
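Here is a small sketch of that idea applied to both sample URLs. The function name split_path is made up, and I changed the first group to an optional literal xyz (my reading of the question's requirements, not part of the pattern above).

```php
<?php
// Hypothetical helper: returns [xyz-or-null, middle part, number-or-null],
// or null when the path does not match at all.
function split_path(string $url): ?array
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }

    // (?:/(xyz))?    optional, specific first segment
    // /(.*?)         required middle part, non-greedy
    // /(?:(\d+)/)?   optional numeric segment at the very end
    if (!preg_match('~\A(?:/(xyz))?/(.*?)/(?:(\d+)/)?\z~', $path, $m)) {
        return null;
    }

    return [
        $m[1] !== '' ? $m[1] : null,  // an unmatched group 1 comes back as ''
        $m[2],
        $m[3] ?? null,                // a trailing unmatched group is omitted
    ];
}

var_dump(split_path('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/'));
var_dump(split_path('http://domain.abcdef/xyz/fe/fi/fo5/fu2m/'));
```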
Use a modified URL validator. PCRE has no \uXXXX escape, so the Unicode ranges are written as \x{...} and the pattern needs the u modifier.
'~^(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))|localhost)(?::\d{2,5})?(?:/(xyz))?((?:/(?!\d+/?$)[^/]*)+)(?:/(\d+))?/?\s*$~u'
Group 1 is optional xyz
Group 2 is required middle
Group 3 is optional number at the end
Readable version
^
(?! mailto: )
(?:
(?: https? | ftp )
://
)?
(?:
\S+
(?: : \S* )?
@
)?
(?:
(?:
(?:
[1-9] \d?
| 1 \d\d
| 2 [01] \d
| 22 [0-3]
)
(?:
\.
(?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
){2}
(?:
\.
(?:
[1-9] \d?
| 1 \d\d
| 2 [0-4] \d
| 25 [0-4]
)
)
| (?:
(?: [a-z\x{00a1}-\x{ffff}0-9]+ -? )*
[a-z\x{00a1}-\x{ffff}0-9]+
)
(?:
\.
(?: [a-z\x{00a1}-\x{ffff}0-9]+ -? )*
[a-z\x{00a1}-\x{ffff}0-9]+
)*
(?:
\.
(?: [a-z\x{00a1}-\x{ffff}]{2,} )
)
)
| localhost
)
(?: : \d{2,5} )?
(?:
/
( xyz ) # Optional specific value
)?
( # Must exist, arbitrary value
(?:
/
(?! \d+ /? $ ) # Not a numeric value at the end
[^/]*
)+
)
(?:
/
( \d+ ) # Optional numeric value, which must appear at the end
)?
/?
\s*
$
Output
** Grp 0 - ( pos 0 : len 46 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/123/
** Grp 1 - ( pos 21 : len 3 )
xyz
** Grp 2 - ( pos 24 : len 15 )
/fe/fi/fo5/fu2m
** Grp 3 - ( pos 40 : len 3 )
123
** Grp 0 - ( pos 48 : len 42 )
http://domain.abcdef/xyz/fe/fi/fo5/fu2m/
** Grp 1 - ( pos 69 : len 3 )
xyz
** Grp 2 - ( pos 72 : len 18 )
/fe/fi/fo5/fu2m/
** Grp 3 - NULL

preg_match_all split conditional expression

I have data in this format:
Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 Randomtext4 (random5,random7,random8) Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()
with this:
preg_match_all("/\b\w+\b(?:\s*\(.*?\)|)/",$text,$matches);
I obtain:
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)',
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8)',
5 => 'random10',
6 => 'Randomtext11()',
but I want
0 => 'Randomtext1(random2, random4)',
1 => 'Randomtext2 (ran dom)',
2 => 'Randomtext3',
3 => 'Randomtext4 (random5,random7,random8)'
4 => 'Randomtext5 (Randomtext4 (random5,random7,random8), random10)'
5 => 'Randomtext11()'
Any ideas?
You need a recursive pattern to handle nested parentheses:
if ( preg_match_all('~\w+(?:\s*(\([^()]*+(?:(?1)[^()]*)*+\)))?~', $text, $matches) )
print_r($matches[0]);
details:
~ # delimiter
\w+
(?:
\s*
( # capture group 1
\(
[^()]*+ # all that isn't a round bracket
# (possessive quantifier *+ to prevent too many backtracking
# steps in case of badly formatted string)
(?:
(?1) # recursion in the capture group 1
[^()]*
)*+
\)
) # close the capture group 1
)? # to make the group optional (instead of "|)")
~
Note that you don't need to add word-boundaries around \w+
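If the numbered recursion (?1) is hard to follow, the same pattern can be written with a named group and the x modifier. This is only a sketch; the group name parens is my own choice, not something from the pattern above.

```php
<?php
$text = 'Randomtext1(random2, random4) Randomtext2 (ran dom) Randomtext3 '
      . 'Randomtext4 (random5,random7,random8) '
      . 'Randomtext5 (Randomtext4 (random5,random7,random8), random10) Randomtext11()';

$pattern = '~
    \w+
    (?:
        \s*
        (?<parens>                       # named group instead of group 1
            \(
            [^()]*+                      # anything that is not a round bracket
            (?: (?&parens) [^()]* )*+    # nested pair, then more plain text
            \)
        )
    )?
~x';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

The nested call on Randomtext5 now comes back as one element instead of being cut at the first closing bracket.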

php: brackets/contents from a array?

If I have a string like this:
$str = '[tr]Kapadokya[/tr][en]Cappadocia[/en][de]Test[/de]';
I want that
$array = array(
'tr' => 'Kapadokya',
'en' => 'Cappadocia',
'de' => 'Test');
How do I do this?
With a few assumptions about the actual syntax of your BBCode-ish string, the following (PCRE) regular expression might suffice.
<?php
$str = '[tr]Kapadokya[/tr][en]Cappadocia[/en][de]Test[/de]';
$pattern = '!
\[
([^\]]+)
\]
(.+)
\[
/
\\1
\]
!x';
/* alternative, probably better expression (see comments)
$pattern = '!
\[ (?# pattern start with a literal [ )
([^\]]+) (?# is followed by one or more characters other than ] - those characters are grouped as subcapture #1, see below )
\] (?# is followed by one literal ] )
( (?# capture all following characters )
[^[]+ (?# as long as not a literal [ is encountered - there must be at least one such character )
)
\[ (?# pattern ends with a literal [ and )
/ (?# literal / )
\1 (?# the same characters as between the opening [...] - that's subcapture #1 )
\] (?# and finally a literal ] )
!x'; // the x modifier allows us to make the pattern easier to read because literal white spaces are ignored
*/
preg_match_all($pattern, $str, $matches);
var_export($matches);
prints
array (
0 =>
array (
0 => '[tr]Kapadokya[/tr]',
1 => '[en]Cappadocia[/en]',
2 => '[de]Test[/de]',
),
1 =>
array (
0 => 'tr',
1 => 'en',
2 => 'de',
),
2 =>
array (
0 => 'Kapadokya',
1 => 'Cappadocia',
2 => 'Test',
),
)
see also: http://docs.php.net/pcre
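To get from $matches to the associative array asked for in the question, array_combine can pair the tag names with the contents. A minimal sketch, using a compact non-greedy variant of the pattern (my simplification):

```php
<?php
$str = '[tr]Kapadokya[/tr][en]Cappadocia[/en][de]Test[/de]';

// Group 1 is the tag name, group 2 the text up to the matching closing tag.
preg_match_all('!\[([^\]]+)\](.+?)\[/\1\]!', $str, $matches);

// Tag names become the keys, contents become the values.
$array = array_combine($matches[1], $matches[2]);

var_export($array);
```

If the same tag appears twice in the string, the later value overwrites the earlier one.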

regular expression end tag = start tag

Take a look at this regular expression:
(?:\(?")(.+)(?:"\)?)
This regex would match e.g
"a"
("a")
but also
"a)
How can I require that the ending delimiter (in this case " or ")) corresponds to the starting delimiter (" or (")? There must be a simpler solution than this, right?
"(.+)"|(?:\(")(.+)(?:"\))
I don't think there's a good way to do this specifically with regex, so you are stuck doing something like this:
/(?:
"(.+)"
|
\( (.+) \)
)/x
How about:
(\(?)(")(.+)\2\1
explanation:
(?-imsx:(\(?)(")(.+)\2\1)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\2 what was matched by capture \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
) end of grouping
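One caveat with the backreference version: \1 repeats the text that group 1 captured, i.e. a literal (, so it cannot stand in for the matching ). In PHP (PCRE) a conditional subpattern is one way to say "require ) only if ( was matched". A sketch, with a made-up function name and anchored to the whole string for simplicity:

```php
<?php
// Hypothetical helper: returns the quoted content, or null if the
// delimiters do not pair up.
function unwrap_quoted(string $s): ?string
{
    // (\()?      group 1: optional opening parenthesis
    // "(.+)"     group 2: the quoted content
    // (?(1)\))   conditional: if group 1 matched, a ")" must follow
    if (preg_match('/^(\()?"(.+)"(?(1)\))$/', $s, $m)) {
        return $m[2];
    }
    return null;
}

var_dump(unwrap_quoted('"a"'));    // matches
var_dump(unwrap_quoted('("a")'));  // matches
var_dump(unwrap_quoted('"a)'));    // does not match
```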
You can use backreferences in PHP. They are not part of regular expressions in the strict formal sense, but PCRE and most other regex engines support them:
preg_match('/<([^>]+)>(.+)<\/\1>/', $subject, $matches) (the \1 refers to whatever the first group matched)
This uses the first group's match as the condition for the closing tag. It matches <a>something</a> but not <h2>something</a>.
However, in your case you would need to turn the "(" matched by the first group into a ")", which won't work.
Update: replace ( and ) with <BRACE> and <END_BRACE>. Then you can match using /<([^>]+)>(.+)<END_\1>/. Do this for every delimiter pair you need: (), [], {}, <> and so on.
"(a) is as nice as [f]" becomes "<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>", and the regex captures both if you use preg_match_all:
$returnValue = preg_match_all('/<([^>]+)>(.+)<END_\\1>/', '<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>', $matches);
leads to
array (
0 =>
array (
0 => '<BRACE>a<END_BRACE>',
1 => '<BRACKET>f<END_BRACKET>',
),
1 =>
array (
0 => 'BRACE',
1 => 'BRACKET',
),
2 =>
array (
0 => 'a',
1 => 'f',
),
)

Regex to remove outer brackets

I have been using this /\(\s*([^)]+?)\s*\)/ regex to remove outer brackets with PHP's preg_replace function (read more in my previous question, Regex to match any character except trailing spaces).
This works fine when there is only one pair of brackets, but the problem is when there are more. For example, ( test1 t3() test2) becomes test1 t3( test2) instead of test1 t3() test2.
I am aware of regex limitations, but it would be nice if I could just make it not match anything when there is more than one pair of brackets.
So, example behavior is good enough:
( test1 test2 ) => test1 test2
( test1 t3() test2 ) => (test1 t3() test2)
EDIT:
I would like to keep trimming trailing white spaces inside removed brackets.
You can use this recursive regex, which also works with nested brackets. The only condition is that the brackets are balanced.
$arr = array('Foo ( test1 test2 )', 'Bar ( test1 t3() test2 )', 'Baz ((("Fdsfds")))');
foreach ($arr as $str) {
    // $1 is the content of the outermost pair; (?R) recurses into nested pairs
    echo "'$str' => '" .
        preg_replace('/ \( \s* ( ( [^()]*? | (?R) )* ) \s* \) /x', '$1', $str) . "'\n";
}
OUTPUT:
'Foo ( test1 test2 )' => 'Foo test1 test2'
'Bar ( test1 t3() test2 )' => 'Bar test1 t3() test2'
'Baz ((("Fdsfds")))' => 'Baz (("Fdsfds"))'
Try this
$result = preg_replace('/\(([^)(]+)\)/', '$1', $subject);
Update
\(([^\)\(]+)\)(?=[^\(]+\()
RegEx explanation
"
\( # Match the character “(” literally
( # Match the regular expression below and capture its match into backreference number 1
[^\)\(] # Match a single character NOT present in the list below
# A ) character
# A ( character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\) # Match the character “)” literally
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
[^\(] # Match any character that is NOT a ( character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\( # Match the character “(” literally
)
"
You may want this (I guess it is what you originally wanted):
$result = preg_replace('/\(\s*(.+)\s*\)/', '$1', $subject);
This gives:
"(test1 test2)" => "test1 test2"
"(test1 t3() test2)" => "test1 t3() test2"
"( test1 t3(t4) test2)" => "test1 t3(t4) test2"

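The greedy version above leaves any whitespace before the closing bracket in place. If that should be trimmed as well, one possible tweak is to force the captured text to end on a non-space character. This is a sketch with a made-up function name, and it assumes each string contains a single outer pair of brackets:

```php
<?php
// Hypothetical helper: removes the outermost brackets and trims the
// whitespace just inside them.
function strip_outer_brackets(string $subject): string
{
    // (.*\S) is greedy but must end on a non-space character, so any
    // whitespace before the last ")" is consumed by \s* and dropped.
    return preg_replace('/\(\s*(.*\S)\s*\)/', '$1', $subject) ?? $subject;
}

echo strip_outer_brackets('( test1 test2 )'), "\n";
echo strip_outer_brackets('( test1 t3() test2 )'), "\n";
echo strip_outer_brackets('( test1 t3(t4) test2)'), "\n";
```

Because the match runs from the first ( to the last ), a string with two separate bracket pairs next to each other would be mangled; the recursive pattern handles that case better.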