Getting contents of square brackets with regex, including nested ones

Is there any way to have this:
[one[two]][three]
And extract this with a regex?
Array (
[0] => one[two]
[1] => two
[2] => three
)

For PHP you can use recursion in regular expressions that nearly gives you what you want:
$s = 'abc [one[two]][three] def';
$matches = array();
preg_match_all('/\[(?:[^][]|(?R))*\]/', $s, $matches);
print_r($matches);
Result:
Array
(
[0] => Array
(
[0] => [one[two]]
[1] => [three]
)
)
For something more advanced than this, it's probably best not to use regular expressions.
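If only the bracket contents are wanted, one possible variation is to wrap the recursive pattern in a lookahead so that nested brackets are visited as well. This is a sketch rather than part of the answer above: the group layout (group 1 as the recursion target, group 2 as the contents) is an assumption made for illustration.

```php
<?php
$s = 'abc [one[two]][three] def';

// Group 1 is a whole balanced [...] and the target of the (?1) recursion;
// group 2 is the text between its outer brackets.
// The lookahead consumes nothing, so the scan also stops at every nested '['.
preg_match_all('/(?=(\[((?:[^\[\]]++|(?1))*)\]))/', $s, $matches);

print_r($matches[2]);
```

Unbalanced input is simply skipped, so this does not report malformed strings.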

You can apply the regex in a loop, for example:
1. Match all \[([^\[\]]*)\].
2. For each match, replace \x01 with [ and \x02 with ] and output the result.
3. Replace all of \[([^\[\]]*)\] with \x01$1\x02 (warning: this assumes \x01 and \x02 do not occur in the string).
4. Repeat from step 1 until there is no match.
But I'd write a string scanner for this problem :).
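A string scanner for this could look roughly like the following. It is a minimal sketch, not the answerer's code: the function name bracket_contents is hypothetical, and sorting by opening offset is just one way to get the order asked for in the question.

```php
<?php
// Hypothetical helper: collect the contents of every balanced [...] pair.
function bracket_contents(string $s): array
{
    $stack = [];   // offsets of '[' characters that are still open
    $found = [];   // opening offset => text between the brackets
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ($s[$i] === '[') {
            $stack[] = $i;
        } elseif ($s[$i] === ']' && $stack) {
            $open = array_pop($stack);
            $found[$open] = substr($s, $open + 1, $i - $open - 1);
        }
    }
    ksort($found);   // order results by where each bracket opened
    return array_values($found);
}

print_r(bracket_contents('[one[two]][three]'));
```

A stray ] with no matching [ is ignored, and an unclosed [ produces no entry.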

#!/usr/bin/perl
use Data::Dumper;
@a = ();
$re = qr/\[((?:[^][]|(??{$re}))*)\](?{push@a,$^N})/;
'[one[two]][three]' =~ /$re*/;
print Dumper \@a;
# $VAR1 = [
# 'two',
# 'one[two]',
# 'three'
# ];
Not exactly what you asked for, but it's kinda doable with (ir)regular expression extensions. (Perl 5.10's (?PARNO) can replace the usage of (??{CODE}).)

In Perl 5.10 regex, you can use named backtracking and a recursive subroutine to do that:
#!/usr/bin/perl
$re = qr /
( # start capture buffer 1
\[ # match an opening bracket
( # capture buffer 2
(?: # match one of:
(?> # don't backtrack over the inside of this group
[^\[\]]+ # one or more non-bracket characters
) # end non backtracking group
| # ... or ...
(?1) # recurse to capture buffer 1 and try it again
)* # 0 or more times.
) # end buffer 2
\] # match a closing bracket
) # end capture buffer 1
/x;
print "\n\n";
sub strip {
my ($str) = @_;
while ($str=~/$re/g) {
$match=$1; $striped=$2;
print "$striped\n";
strip($striped) if $striped=~/\[/;
return $striped;
}
}
$str="[one[two]][three][[four]five][[[six]seven]eight]";
print "start=$str\n";
while ($str=~/$re/g) {
strip($1) ;
}
Output:
start=[one[two]][three][[four]five][[[six]seven]eight]
one[two]
two
three
[four]five
four
[[six]seven]eight
[six]seven
six

Related

How to get the text between any number of parenthesis?

Suppose I have a document where I want to capture the strings that have parenthesis before or after.
Example: This [is] a {{test}} sentence. The (((end))).
So basically I want to get the words is, test and end.
Thanks in advance.

According to your condition "strings that have parenthesis before or after", any word could be preceded by OR only followed by some type of parentheses:
$text = 'This [is] a {{test}} sentence. The (((end))). Some word))';
preg_match_all('/(?:\[+|\{+|\(+)(\w+)|(\w+)(?:\]+|\}+|\)+)/', $text, $m);
$result = array_filter(array_merge($m[1],$m[2]));
print_r($result);
The output:
Array
(
[0] => is
[1] => test
[2] => end
[7] => word
)

The below code works for me.
<?php
$in = "This [is] a {{test}} sentence. The (((end))).";
preg_match_all('/(?<=\(|\[|{)[^()\[\]{}]+/', $in, $out);
echo $out[0][0]."<br>".$out[0][1]."<br>".$out[0][2];
?>

Your regex could be:
[\[{(]((?(?<=\[)[^\[\]]+|(?(?<={)[^{}]+|[^()]+)))
Explanation: the if-then-else construction is needed to make sure that an opening '{' is matched against a closing '}', etc.
[\[{(] # Read [, { or (
((?(?<=\[) # Lookbehind: IF preceding char is [
[^\[\]]+ # THEN read all chars unequal to [ and ]
| # ELSE
(?(?<={) # IF preceding char is {
[^{}]+ # THEN read all chars unequal to { and }
| # ELSE
[^()]+))) # read all chars unequal to ( and )
See regex101.com

Try this Regex:
(?<=\(|\[|{)[^()\[\]{}]+
OR this one:
(?<=\(|{|\[)(?!\(|{|\[)[^)\]}]+
Explanation (for the 1st regex):
(?<=\(|\[|{) - Positive lookbehind - looks for a zero-length match just preceded by a { or [ or a (
[^()\[\]{}]+ - one or more occurrences of any character which is not among the following characters: [, (, {, }, ), ]
Explanation (for the 2nd regex):
(?<=\(|{|\[) - Positive lookbehind - looks for a zero-length match just preceded by a { or [ or a (
(?!\(|{|\[) - Negative lookahead - The previous step found a position just preceded by an opening bracket. This piece of the regex verifies that the position is not followed by another opening bracket, so the match starts just after the innermost opening bracket - (, { or [.
[^)\]}]+ - One or more occurrences of characters which are not among these closing brackets - ], }, )

PHP: Parse comma-delimited string outside single and double quotes and parentheses

I've found several partial answers to this question, but none that cover all my needs...
I am trying to parse a user generated string as if it were a series of php function arguments to determine the number of arguments:
This string:
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
will be inserted as the arguments of a function:
function my_function( [insert string here] ){ ... }
I need to parse the string on the commas, taking into account single- and double-quotes, parentheses, and escaped quotes and parentheses to create an array:
array(4) {
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
}
Any help with a regular expression or parser function to accomplish this is appreciated!

This problem can't be solved with a classical CSV tool, since more than one character can protect parts of the string.
Using preg_split is possible but results in a very complicated and inefficient pattern, so the best way is to use preg_match_all. There are, however, several problems to solve:
as required, a comma enclosed in quotes or parentheses must be ignored (treated as an ordinary character, not as a delimiter)
you need to extract the params, but you also need to check that the string is well formed, otherwise the match results may be completely wrong!
For the first point, you can define subpatterns to describe each case: the quoted parts, the parts enclosed in parentheses, and a more general subpattern that matches a complete param and uses the two previous subpatterns when needed.
Note that the parentheses subpattern needs to refer to the general subpattern too, since it can contain anything (including commas).
The second point can be solved using the \G anchor, which ensures that all matches are contiguous. But you also need to be sure that the end of the string has been reached. To do that, you can add an optional empty capture group at the end of the main pattern that is created only if the end-of-string anchor \z succeeds.
$subject = <<<'EOD'
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
EOD;
$pattern = <<<'EOD'
~
# named groups definitions
(?(DEFINE) # this definition group allows to define the subpatterns you want
# without matching anything
(?<quotes>
' [^'\\]*+ (?s:\\.[^'\\]*)*+ ' | " [^"\\]*+ (?s:\\.[^"\\]*)*+ "
)
(?<brackets> \( \g<content> (?: ,+ \g<content> )*+ \) )
(?<content> [^,'"()]*+ # ' # (<-- comment for SO syntax highlighting)
(?:
(?: \g<brackets> | \g<quotes> )
[^,'"()]* # ' #
)*+
)
)
# the main pattern
(?: # two possible beginnings
\G(?!\A) , # a comma contiguous to a previous match
| # OR
\A # the start of the string
)
(?<param> \g<content> )
(?: \z (?<check>) )? # create an item "check" when the end is reached
~x
EOD;
$result = false;
if ( preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER) &&
isset(end($matches)['check']) )
$result = array_map(function ($i) { return $i['param']; }, $matches);
else
echo 'bad format' . PHP_EOL;
var_dump($result);
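Since the question also allows a parser function, the two requirements above (ignore protected commas, cope with escaped quotes) can be sketched without a regex. This is an illustration under stated assumptions: the name split_args is hypothetical, a backslash is only treated as an escape inside quotes, and unlike the pattern above it does not validate the format.

```php
<?php
// Hypothetical helper: split on commas that are outside quotes and parentheses.
function split_args(string $s): array
{
    $args = [];
    $current = '';
    $depth = 0;      // parenthesis nesting level
    $quote = null;   // the currently open quote character, or null
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $c = $s[$i];
        if ($quote !== null) {
            if ($c === '\\' && $i + 1 < $len) {
                $current .= $c . $s[++$i];   // keep the escaped character verbatim
                continue;
            }
            if ($c === $quote) {
                $quote = null;               // closing quote
            }
        } elseif ($c === "'" || $c === '"') {
            $quote = $c;
        } elseif ($c === '(') {
            $depth++;
        } elseif ($c === ')') {
            $depth--;
        } elseif ($c === ',' && $depth === 0) {
            $args[] = $current;              // top-level comma: end of one argument
            $current = '';
            continue;
        }
        $current .= $c;
    }
    $args[] = $current;
    return $args;
}

$subject = <<<'EOD'
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
EOD;
print_r(split_args($subject));
```

Counting the returned array then gives the number of arguments.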

You could split the argument string at ,$ and then prepend the $ back onto each array value (this assumes every argument starts with $ and that ,$ never occurs inside a value):
$args_array = explode(',$', $arg_str);
foreach($args_array as $key => $arg_raw) {
$args_array[$key] = '$'.ltrim($arg_raw, '$');
}
print_r($args_array);
Output:
(
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
)

If you want to use a regex, you can use something like this:
(.+?)(?:,(?=\$)|$)
Php code:
$re = '/(.+?)(?:,(?=\$)|$)/';
$str = "\$arg1,\$arg2='ABC,DEF',\$arg3=\"GHI\",JKL\",\$arg4=array(1,'2)',\"3\"),\")\n";
preg_match_all($re, $str, $matches);
Match information:
MATCH 1
1. [0-5] `$arg1`
MATCH 2
1. [6-21] `$arg2='ABC,DEF'`
MATCH 3
1. [22-39] `$arg3="GHI\",JKL"`
MATCH 4
1. [40-67] `$arg4=array(1,'2)',"3\"),")`

Negation of a string in a regex

I realise that something similar has been asked before, but I can't seem to fit the solution to what I am trying to do, so please don't just think this is a dupe.
I have a string in the style {block:string}contents{/block:string}, which can be matched fairly easily with {block:([a-z_-\s]+)}.*{/block:\1}
What I want to do is modify the inner .* part so that it does not match any string that has a {block:[a-z_-\s]+} between it, that is all {block}{/block} that have a {block} inside them should not be matched.
Thanks!

Try
{block:([a-z_-\s]+)}[^{]*(?!{block:([a-z_-\s]+)}.*{/block:\2})[^}]*{/block:\1}
I am pretty mediocre at regex, but the negative lookahead bounded by the [^{]* and [^}]* statements should keep your matches tag-free.

Compressed: m~\{block:([a-z\s_-]+)\}(?:(?!\{/?block:\1\}).)*\{/block:\1\}~xs
Example in Perl:
$_ = '{block:string}conte{block:string}nts{/block:string}{/block:string}';
if ( m~ # match operator
\{block: ([a-z\s_-]+) \} # opening block structure and capt grp 1
(?: # begin non capt grp
(?! \{/?block: \1 \} ) # negative lookahead, don't want backreffed
# open or closed block struct
. # ok, grab this character
)* # end group, do 0 or more times (greedy)
\{/block: \1 \} # closing block structure matching grp 1
~xs ) # modifiers: expanded, include newlines
{
print "matched '$&'\n";
}
Output:
matched '{block:string}nts{/block:string}'
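The same pattern can be tried from PHP. This is a minimal sketch using the same test string; the variable names are only illustrative, and the /x modifier is dropped because the pattern is written on one line.

```php
<?php
$str = '{block:string}conte{block:string}nts{/block:string}{/block:string}';

// Each "." is consumed only when no opening or closing tag with the
// captured name starts at that position, so the match cannot span a nested block.
$pattern = '~\{block:([a-z\s_-]+)\}(?:(?!\{/?block:\1\}).)*\{/block:\1\}~s';

if (preg_match($pattern, $str, $m)) {
    echo "matched '{$m[0]}'\n";
}
```

Because the lookahead uses \1, only nested blocks with the same name are excluded; blocks with a different name inside are still allowed.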

<?php
$ptn = "%(?:{block:[a-z_\s-]+})(?![^}]*?{block:).*?{/block:[a-z_\s-]+}%";
$str = "... your content here ...";
preg_match_all($ptn, $str, $matches);
print_r($matches);
?>
For example:
$str = "{block:string}test2{/block:string} {block:string}contents{block:string}{block:string}test3{/block:string}{/block:string}{/block:string} sdf ";
Would produce:
Array
(
[0] => Array
(
[0] => {block:string}test2{/block:string}
[1] => {block:string}test3{/block:string}
)
)

How do I match nested braces using regular expressions in PHP?

I have a LaTeX document I want to match, and I need a regex that matches the following:
\ # the backslash in the beginning
[a-zA-Z]+ #a word
(\{.+\})* # any amount of {something}
However, and here is the catch:
the last line 1. needs to be greedy and 2. needs to have a matching number of { and } inside itself.
Meaning if I have the string \test{something\somthing{9}}
it would match the whole. And it needs to be in that order ({}). So that it doesn't match the following:
\LaTeX{} is a document preparation system for the \TeX{}
just
\LaTeX{}
and
\TeX{}
Can anyone help? Does someone have a better idea for matching? Should I not use regular expressions?

This can be done with recursion:
$input = "\LaTeX{} is a document preparation system for the \TeX{}
\latex{something\somthing{9}}";
preg_match_all('~(?<token>
\\\\ # the slash in the beginning
[a-zA-Z]+ #a word
(\{[^{}]*((?P>token)[^{}]*)?\}) # {something}
)~x', $input, $matches);
This correctly matches \LaTeX{}, \TeX{}, and \latex{something\somthing{9}}

PHP could be used since it supports recursive regex-matching. But, as I said, if you have comments in your LaTeX-like strings that can have { or } in them, this will fail.
A demo:
$text = 'This is a \LaTeX{ foo { bar { ... } baz test {} done } } document
preparation system for the \TeX{a{b{c}d}e{f}g{h}i}-y people out there';
preg_match_all('/\\\\[A-Za-z]+(\{(?:[^{}]|(?1))*})/', $text, $matches, PREG_SET_ORDER);
print_r($matches);
which produces:
Array
(
[0] => Array
(
[0] => \LaTeX{ foo { bar { ... } baz test {} done } }
[1] => { foo { bar { ... } baz test {} done } }
)
[1] => Array
(
[0] => \TeX{a{b{c}d}e{f}g{h}i}
[1] => {a{b{c}d}e{f}g{h}i}
)
)
A quick explanation:
\\\\ # the literal '\'
[A-Za-z]+ # one or more letters
( # start capture group 1 <-----------------+
\{ # the literal '{' |
(?: # start non-capture group A |
[^{}] # any character other than '{' and '}' |
| # OR |
(?1) # recursively match capture group 1 ---+
) # end non-capture group A
* # non-capture group A zero or more times
} # the literal '}'
) # end capture group 1

Unfortunately, I believe this is impossible. Bracket matching (detecting properly paired, nested brackets) is commonly used as an example of a problem that cannot be solved with a finite state machine, such as a regular expression parser. You could do it with a context free grammar, but that's just not how regex works. Your best solution is to use a regex like {*[^{}]*}* for the initial check, and then another short script to check whether it's an even number.
In conclusion: don't try and do it with only regex. This is not a problem that can be solved with regex alone.

Regex Help with manipulating string

I am seriously struggling to get my head around regex.
I have a string with "iPhone: 52.973053,-0.021447"
I want to extract the two numbers after the colon, which are delimited by the comma, into two separate strings.
Can anyone help me? Cheers

Try:
preg_match_all('/\w+:\s*(-?\d+\.\d+),(-?\d+\.\d+)/',
"iPhone: 52.973053,-0.021447 FOO: -1.0,-1.0",
$matches, PREG_SET_ORDER);
print_r($matches);
which produces:
Array
(
[0] => Array
(
[0] => iPhone: 52.973053,-0.021447
[1] => 52.973053
[2] => -0.021447
)
[1] => Array
(
[0] => FOO: -1.0,-1.0
[1] => -1.0
[2] => -1.0
)
)
Or just:
preg_match('/\w+:\s*(-?\d+\.\d+),(-?\d+\.\d+)/',
"iPhone: 52.973053,-0.021447",
$match);
print_r($match);
if the string only contains one coordinate.
A small explanation:
\w+ # match a word character: [a-zA-Z_0-9] and repeat it one or more times
: # match the character ':'
\s* # match a whitespace character: [ \t\n\x0B\f\r] and repeat it zero or more times
( # start capture group 1
-? # match the character '-' and match it once or none at all
\d+ # match a digit: [0-9] and repeat it one or more times
\. # match the character '.'
\d+ # match a digit: [0-9] and repeat it one or more times
) # end capture group 1
, # match the character ','
( # start capture group 2
-? # match the character '-' and match it once or none at all
\d+ # match a digit: [0-9] and repeat it one or more times
\. # match the character '.'
\d+ # match a digit: [0-9] and repeat it one or more times
) # end capture group 2

A solution without using regular expressions, using explode() and stripos() :) :
$string = "iPhone: 52.973053,-0.021447";
$coordinates = explode(',', $string);
// $coordinates[0] = "iPhone: 52.973053"
// $coordinates[1] = "-0.021447"
$coordinates[0] = trim(substr($coordinates[0], stripos($coordinates[0], ':') +1));
Assuming that the string always contains a colon.
Or if the identifier before the colon only contains characters (not numbers) you can do also this:
$string = "iPhone: 52.973053,-0.021447";
$string = trim($string, "a..zA..Z: ");
//$string = "52.973053,-0.021447"
$coordinates = explode(',', $string);

Try:
$string = "iPhone: 52.973053,-0.021447";
preg_match_all( "/-?\d+\.\d+/", $string, $result );
print_r( $result );

I like @Felix's non-regex solution; I think it is clearer and more readable than using a regex.
Don't forget that you can use constants/variables to change the splitting by comma or colon if the original string format is changed.
Something like
define('COORDINATE_SEPARATOR',',');
define('DEVICE_AND_COORDINATES_SEPARATOR',':');
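As a rough sketch of how those constants might be used (the variable names are illustrative, and the string is assumed to always contain both separators):

```php
<?php
define('COORDINATE_SEPARATOR', ',');
define('DEVICE_AND_COORDINATES_SEPARATOR', ':');

$string = 'iPhone: 52.973053,-0.021447';

// Split off the device name first, then split the remainder into coordinates.
list($device, $rest) = explode(DEVICE_AND_COORDINATES_SEPARATOR, $string, 2);
$coordinates = array_map('trim', explode(COORDINATE_SEPARATOR, $rest));

print_r($coordinates);
```

If the input format changes, only the two constants need to be updated.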

$str="iPhone: 52.973053,-0.021447";
$s = array_filter(preg_split("/[a-zA-Z:,]/",$str) );
print_r($s);

An even simpler solution is to use preg_split() with a much simpler regex, e.g.
$str = 'iPhone: 52.973053,-0.021447';
$parts = preg_split('/[ ,]/', $str);
print_r($parts);
which will give you
Array
(
[0] => iPhone:
[1] => 52.973053
[2] => -0.021447
)
