Regular expression text between brackets - php

I have a problem that I have no idea how to solve, and I am not sure whether regular expressions are the best approach.
My goal is to extract the name, parameters and body of each function in a file.
This is my regular expression:
preg_match_all('/function (.*?)\((.*?)\)(.*?)\{(.*?)\}/s',$content,$funcs,PREG_SET_ORDER);
And I have this testfile:
function testfunc($text)
{
if ($text)
{
return 1;
}
return 0;
}
Of course, this only captures everything up to the first "}", the one before return 0;.
Is there a way to capture the whole function body, that is, to find the matching "}"?

Contrary to popular belief, PHP (PCRE) supports recursive patterns, which let you match nested brackets. Consider this code:
$str = <<<'EOF'
function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
EOF;
if ( preg_match('/ \{ ( (?: [^{}]* | (?0) )+ ) \} /x', $str, $m) )
echo $m[0];
OUTPUT:
{
if ($text) {
return 1;
}
return 0;
}
UPDATE: To capture function name and arguments as well try this code:
$str = <<<'EOF'
function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
EOF;
if ( preg_match('/ (function [^{]+ ) ( \{ (?: [^{}]* | (?-1) )* \} ) /x', $str, $m) )
print_r ($m);
OUTPUT
Array
(
[0] => function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
[1] => function testfunc($text)
[2] => {
if ($text) {
return 1;
}
return 0;
}
)
Working Online Demo: http://ideone.com/duQw9c

Regular expressions are not the best tool for that job. Parsers are.
No doubt you can use regexp callbacks to eventually manage what you intend, but this would be ungodly obfuscated and fragile.
A parser can easily do the same job. Better still, if you are planning on parsing PHP with PHP, you can use the Zend parser that does the job for you.
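As a rough illustration of the tokenizer-based route, the sketch below uses PHP's built-in token_get_all() to pull out each named function and its body. The helper name extract_functions() is made up for this example, and the code assumes the source starts with "<?php" and is syntactically valid; it is a sketch, not a complete PHP parser.

```php
<?php
// Hypothetical helper: maps each named function in $source to its body text.
// $source must begin with "<?php", otherwise the tokenizer sees inline HTML.
function extract_functions(string $source): array
{
    $tokens = token_get_all($source);
    $count = count($tokens);
    $functions = [];

    for ($i = 0; $i < $count; $i++) {
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_FUNCTION) {
            continue;
        }
        // The name is the first T_STRING before the parameter list.
        $name = null;
        for ($j = $i + 1; $j < $count && $tokens[$j] !== '('; $j++) {
            if (is_array($tokens[$j]) && $tokens[$j][0] === T_STRING) {
                $name = $tokens[$j][1];
                break;
            }
        }
        if ($name === null) {
            continue; // closure, or end of input
        }
        // Skip the signature up to the opening brace of the body.
        while ($j < $count && $tokens[$j] !== '{' && $tokens[$j] !== ';') {
            $j++;
        }
        if ($j >= $count || $tokens[$j] === ';') {
            continue; // abstract or interface method: no body
        }
        // Collect tokens until the brace depth returns to zero.
        $depth = 0;
        $body = '';
        for (; $j < $count; $j++) {
            $token = $tokens[$j];
            $body .= is_array($token) ? $token[1] : $token;
            // "{$x}" and "${x}" inside strings open a brace via special tokens.
            $opens = $token === '{' || (is_array($token)
                && in_array($token[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true));
            if ($opens) {
                $depth++;
            } elseif ($token === '}' && --$depth === 0) {
                break;
            }
        }
        $functions[$name] = $body;
    }
    return $functions;
}

$source = <<<'PHP'
<?php
function testfunc($text)
{
    if ($text)
    {
        return "}";
    }
    return 0;
}
PHP;

$functions = extract_functions($source);
echo $functions['testfunc'], "\n";
```

Because braces inside plain string literals arrive as part of a single string token, the "}" in the sample does not end the body early, which is the case a simple regex gets wrong.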

Not in general. You can of course write a regex that handles two levels of nesting, something like function (.*)\((.*)\)(.*)\{([^}]*(\{[^}]*\})*)\}, but since such structures can nest arbitrarily deep, you will eventually run out of regex :D. One needs a context-free grammar to do this.
You can generate parsers for such grammars with tools such as Yacc, Bison, Gppg, ...
Also note that .*? is not the same as .*: both match zero or more characters, but .*? is lazy (as few as possible) while .* is greedy. .+ means one or more.

Is there a way to get everything in the function so find the right "}".
Short Answer: no.
Long Answer:
This cannot be handled with a single expression. { and } can also appear inside a method body, making it hard to find the correct closing }. You would need to process (iteratively or recursively) ALL pairs of {} and manually pick out the pairs that have a "method name" in front of them.
This, however, isn't simple either, because you need to exclude all the statements that look like a function but are valid inside the method body.
I don't think regex is the way to go for such a task. EVEN if you managed to create all the required regex patterns, performance would be worse than with any dedicated parser.

Related

PHP regex pattern match recursive

I have this in a function that is supposed to replace every parenthesized sequence with its contents, so (abc) becomes abc wherever it appears, including recursively, because parentheses can be nested.
$return = preg_replace_callback(
'|(\((.+)\))+|',
function ($matches) {
return $matches[2];
},
$s
);
When the above regex is fed the string "a(bcdefghijkl(mno)p)q" as input, it returns "ap)onm(lkjihgfedcbq". This shows the regex is matched only once. What can I do to make it keep matching inside matches it has already made and produce "abcdefghijklmnopq"?
To match balanced parenthetical substrings, you can use the well-known \((?:[^()]++|(?R))*\) pattern (described in Matching Balanced Constructs) inside preg_replace_callback, where the match value can be manipulated further (here, just remove all ( and ) symbols from the match, which is easy to do even without a regex):
$re = '/\((?:[^()]++|(?R))*\)/';
$str = 'a(bcdefghijkl(mno)p)q((('; // Added three ( at the end
$result = preg_replace_callback($re, function($m) {
return str_replace(array('(',')'), '', $m[0]);
}, $str);
echo $result; // => abcdefghijklmnopq(((
See the PHP demo
To get overlapping matches, you need a known technique: capturing inside a positive lookahead. You won't be able to perform both operations (matching and replacing) at once, but you can run the matching first and then replace:
$re = '/(?=(\((?:[^()]++|(?1))*\)))/';
$str = 'a(bcdefghijkl(mno)p)q(((';
preg_match_all($re, $str, $m);
print_r($m[1]);
// => Array ( [0] => (bcdefghijkl(mno)p) [1] => (mno) )
See the PHP demo.
Try this one,
preg_match('/\((?:[^\(\)]*+|(?0))*\)/', $str )
https://regex101.com/r/NsQSla/1
It will match everything inside of the ( ) as long as they are matched pairs.
Example
(abc) (abc (abc))
will have the following matches
Match 1
Full match 0-5 `(abc)`
Match 2
Full match 6-17 `(abc (abc))`
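The match list above can be reproduced from PHP with preg_match_all() and the PREG_OFFSET_CAPTURE flag. This is only a small check of the same pattern; the variable names are illustrative.

```php
<?php
$str = '(abc) (abc (abc))';

// PREG_OFFSET_CAPTURE turns each match into a [text, byte offset] pair.
preg_match_all('/\((?:[^\(\)]*+|(?0))*\)/', $str, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$text, $offset]) {
    printf("%d-%d %s\n", $offset, $offset + strlen($text), $text);
}
// prints:
// 0-5 (abc)
// 6-17 (abc (abc))
```

Note that the nested (abc) is consumed as part of the second match and is not reported separately.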
It is slightly unclear exactly what the postcondition of the algorithm is supposed to be. It seems to me that you are wanting to strip out matching pairs of ( ). The assumption here is that unmatched parentheses are left alone (otherwise you just strip out all of the ('s and )'s).
So I guess this means the input string a(bcdefghijkl(mno)p)q becomes abcdefghijklmnopq but the input string a(bcdefghijkl(mno)pq becomes a(bcdefghijklmnopq. Likewise an input string (a)) would become a).
It may be possible to do this using PCRE, since it does provide some non-regular features, but I'm doubtful about it. The language of the input strings is not regular; it's context-free. What @ArtisticPhoenix's answer does is match complete pairs of matched parentheses. What it does not do is match all nested pairs. This nested matching is inherently non-regular in my humble understanding of language theory.
I suggest writing a parser to strip out the matching pairs of parentheses. It gets a little wordy having to account for productions that fail to match:
<?php
// Parse the punctuator sub-expression (i.e. anything within ( ... ) ).
function parse_punc(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
// A '(' at the very end of the input is unmatched; leave it alone.
return '(';
}
$inner = parse_punc_seq($tokens,$iter);
if (!isset($tokens[$iter]) || $tokens[$iter] != ')') {
// Leave unmatched open parentheses alone.
$inner = "($inner";
}
$iter += 1;
return $inner;
}
// Parse a sequence (inside punctuators).
function parse_punc_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter];
if ($tok == ')') {
return;
}
$iter += 1;
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_punc_seq($tokens,$iter);
return $tok;
}
// Parse a sequence (outside punctuators).
function parse_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter++];
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_seq($tokens,$iter);
return $tok;
}
// Wrapper for parser.
function parse(array $tokens) {
$iter = 0;
return strval(parse_seq($tokens,$iter));
}
// Grab input from stdin and run it through the parser.
$str = trim(stream_get_contents(STDIN));
$tokens = preg_split('/([\(\)])/',$str,-1,PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
var_dump(parse($tokens));
I know this is a lot more code than a regex one-liner but it does solve the problem as I understand it. I'd be interested to know if anyone can solve this problem with a regular expression.

Regex: select between matched brackets : templating engine

I was playing with PHP today creating a small language (just for fun), but I encountered a problem:
How can I select between matching brackets?
My template string:
for(items as item){ // this bracket
if(some_condition){
// do stuff
} // my regex stops here
} // and this bracket
I used this regex [\w]+\([ \w]+\){([\s\n\r\t/\w(){}]+?)}, but it stops at the first closing bracket.
How can I make it select everything between the matching brackets, like this?
for(items as item){ // this bracket
if(some_condition){
// do stuff
} // my regex stops here
} // and this bracket
Then I will compile what's in the for separately.
PS: Please don't post comments like "don't bother doing this" or "don't reinvent the wheel". It is just for learning purposes.
You can use recursion:
$code = '
for(items as item) {
if(some_condition) {
while stuff {
hi
}
}
done
}
';
$re = '/{ ( ( [^{}] | (?R) ) * ) }/x';
preg_match_all($re, $code, $m);
print_r($m[1][0]);
This prints
if(some_condition) {
while stuff {
hi
}
}
done
that is, the inner block has been detected correctly.
That said, regular expressions are the wrong tool for parsing formal languages (they are fine for tokenizing, though). For example, the above will break hopelessly once you add a string literal containing "{":
for(items as item){
echo "hi there :{ ";
}
What you actually need is a parser, either crafted manually (good learning exercise!) or generated (see here for options).
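As a starting point for the hand-crafted route, here is a minimal sketch of a scanner that finds the brace matching a given opening brace while skipping string literals. The function name find_matching_brace() is invented for this example, and it assumes the template language only has double-quoted strings with backslash escapes.

```php
<?php
// Hypothetical helper: given the offset of a '{' in $code, returns the offset
// of its matching '}', or null if the block is never closed.
function find_matching_brace(string $code, int $open): ?int
{
    $depth = 0;
    $length = strlen($code);
    for ($i = $open; $i < $length; $i++) {
        $ch = $code[$i];
        if ($ch === '"') {
            // Skip a double-quoted literal, honouring backslash escapes.
            for ($i++; $i < $length && $code[$i] !== '"'; $i++) {
                if ($code[$i] === '\\') {
                    $i++;
                }
            }
        } elseif ($ch === '{') {
            $depth++;
        } elseif ($ch === '}' && --$depth === 0) {
            return $i;
        }
    }
    return null;
}

$code = 'for(items as item){ echo "hi there :{ "; if(x){ y } }';
$open = strpos($code, '{');
$close = find_matching_brace($code, $open);
$body = substr($code, $open + 1, $close - $open - 1);
echo $body, "\n"; // prints:  echo "hi there :{ "; if(x){ y } 
```

The brace inside the string literal no longer affects the depth count, which is exactly the case where the recursive regex goes wrong.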
You could try the below regex which allows another } bracket to be matched.
[\w]+\([ \w]+\){([\s\n\r\t\/\w(){}]+?}[\s\n\r\t\/\w(){}]+?)}
DEMO

regular expression to extract a part of string

I have the following transaction format from a core banking system:
This is a <test> and only <test> hope <u> understand
from which I want
<test><test><u> (along with the <>)
I can do that with simple substring operations, but it would be too slow. Is there any way to capture the text between < and > using regex functions?
The easiest I can think of is to use preg_match_all() and then join() the results together to form the final string:
function get_bracketed_words($str)
{
if (preg_match_all('/<[a-z]+>/', $str, $matches)) {
return join('', $matches[0]);
}
return '';
}
If you use this, it should not be too slow (Perl code as an example here):
while (my $line = <FILE>) {
my ($request) = ($line =~ /RequestArray:(.*)/);
next unless $request;
# here, you can split $requests to sub-pieces using another regex
# ...
}

Get more backreferences from regexp than parenthesis

Ok this is really difficult to explain in English, so I'll just give an example.
I am going to have strings in the following format:
key-value;key1-value;key2-...
and I need to extract the data to be an array
array('key'=>'value','key1'=>'value1', ... )
I was planning to use regexp to achieve (most of) this functionality, and wrote this regular expression:
/^(\w+)-([^-;]+)(?:;(\w+)-([^-;]+))*;?$/
to work with preg_match and this code:
for ($l = count($matches),$i = 1;$i<$l;$i+=2) {
$parameters[$matches[$i]] = $matches[$i+1];
}
However the regexp obviously returns only 4 backreferences - first and last key-value pairs of the input string. Is there a way around this? I know I can use regex just to test the correctness of the string and use PHP's explode in loops with perfect results, but I'm really curious whether it's possible with regular expressions.
In short, I need to capture an arbitrary number of these key-value; pairs in a string by means of regular expressions.
You can use a lookahead to validate the input while you extract the matches:
/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/
(?=(?:\w++-[^;-]++;?)++$) is the validation part. If the input is invalid, matching will fail immediately, but the lookahead still gets evaluated every time the regex is applied. In order to keep it (along with the rest of the regex) in sync with the key-value pairs, I used \G to anchor each match to the spot where the previous match ended.
This way, if the lookahead succeeds the first time, it's guaranteed to succeed every subsequent time. Obviously it's not as efficient as it could be, but that probably won't be a problem--only your testing can tell for sure.
If the lookahead fails, preg_match_all() will return 0 (no matches). If it succeeds, the matches will be returned in an array of arrays: one for the full key-value pairs, one for the keys, one for the values.
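A short sketch of how that pattern might be used to build the array from the question. The helper name parse_pairs() is made up for this example, and PREG_SET_ORDER is used so that each match arrives as one key-value row rather than as the three parallel arrays described above.

```php
<?php
// Hypothetical helper: returns [] when the input as a whole is not valid.
function parse_pairs(string $input): array
{
    $re = '/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/';
    $pairs = [];
    if (preg_match_all($re, $input, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $pairs[$match[1]] = $match[2]; // group 1 = key, group 2 = value
        }
    }
    return $pairs;
}

print_r(parse_pairs('key-value;key1-value1;key2-value2'));
// keys key, key1, key2 mapped to value, value1, value2

var_dump(parse_pairs('key-value;broken;key2-value2'));
// empty array: the lookahead rejects the input at the first position
```

Since \G pins every attempt to the end of the previous match, one malformed pair anywhere in the string means nothing is extracted at all.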
Regex is a powerful tool, but sometimes it's not the best approach.
$string = "key-value;key1-value";
$s = explode(";",$string);
foreach($s as $k){
$e = explode("-",$k);
$array[$e[0]]=$e[1];
}
print_r($array);
Use preg_match_all() instead. Maybe something like:
$matches = $parameters = array();
$input = 'key-value;key1-value1;key2-value2;key123-value123;';
preg_match_all("/(\w+)-([^-;]+)/", $input, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$parameters[$match[1]] = $match[2];
}
print_r($parameters);
EDIT:
to first validate if the input string conforms to the pattern, then just use:
if (preg_match("/^((\w+)-([^-;]+);)+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
EDIT2: the final semicolon is optional
if (preg_match("/^(\w+-[^-;]+;)*\w+-[^-;]+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
No. Newer matches overwrite older matches. Perhaps the limit argument of explode() would be helpful when exploding.
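To illustrate both points, the sketch below first shows a repeated group keeping only its last iteration, then splits the string with explode(). Using a limit of 2 so that each pair is cut at the first "-" only is one possible reading of the hint above, and the code assumes every pair does contain a "-".

```php
<?php
$input = 'key-value;key1-value1;key2-value2';

// A repeated capture group keeps only its last iteration,
// so groups 3 and 4 end up holding the final pair.
preg_match('/^(\w+)-([^-;]+)(?:;(\w+)-([^-;]+))*;?$/', $input, $m);
echo $m[3], ' => ', $m[4], "\n"; // prints: key2 => value2

// explode() with a limit of 2 splits each pair at the first '-' only.
$parameters = [];
foreach (explode(';', rtrim($input, ';')) as $pair) {
    [$key, $value] = explode('-', $pair, 2);
    $parameters[$key] = $value;
}
print_r($parameters);
```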
what about this solution:
$samples = array(
"good" => "key-value;key1-value;key2-value;key5-value;key-value;",
"bad1" => "key-value-value;key1-value;key2-value;key5-value;key-value;",
"bad2" => "key;key1-value;key2-value;key5-value;key-value;",
"bad3" => "k%ey;key1-value;key2-value;key5-value;key-value;"
);
foreach($samples as $name => $value) {
if (preg_match("/^(\w+-\w+;)+$/", $value)) {
printf("'%s' matches\n", $name);
} else {
printf("'%s' not matches\n", $name);
}
}
I don't think you can do both validation and extraction of data with one single regexp, as you need anchors (^ and $) for validation and preg_match_all() for the data, but if you use anchors with preg_match_all() it will only return the last set matched.

Regular Expression Help - Brackets within brackets

I'm trying to develop a function that can sort through a string that looks like this:
Donny went to the {park|store|{beach with friends|beach alone}} so he could get a breath of fresh air.
What I intend to do is search the text recursively for {} patterns that contain no { or } inside, so that only the innermost text is selected. I will then split the contents into a PHP array and pick one at random, repeating the process until the whole string has been parsed, producing a complete sentence.
I just cannot wrap my head around regular expressions though.
Appreciate any help!
Don't know about maths theory behind this ;-/ but in practice that's quite easy. Try
$text = "Donny went to the {park|store|{beach with friends|beach alone}} so he could get a breath of fresh air. ";
function rnd($matches) {
$words = explode('|', $matches[1]);
return $words[rand() % count($words)];
}
do {
$text = preg_replace_callback('~{([^{}]+)}~', 'rnd', $text, -1, $count);
} while($count > 0);
echo $text;
Regexes are not capable of counting and therefore cannot find matching brackets reliably.
What you need is a grammar.
See this related question.
$str="Donny went to the {park|store|{beach {with friends}|beach alone}} so he could get a breath of fresh air. ";
$s = explode("}",$str);
foreach($s as $v){
if(strpos($v,"{")!==FALSE){
$t=explode("{",$v);
print end($t)."\n";
}
}
output
$ php test.php
with friends
Regular expressions don't deal well with recursive stuff, but PHP does:
$str = 'Donny went to the {park|store|{beach with friends|beach alone}} so he could get a breath of fresh air.';
echo parse_string($str), "\n";
function parse_string($string) {
if ( preg_match('/\{([^{}]+)\}/', $string, $matches) ) {
$inner_elements = explode('|', $matches[1]);
$random_element = $inner_elements[array_rand($inner_elements)];
$string = str_replace($matches[0], $random_element, $string);
$string = parse_string($string);
}
return $string;
}
You could do this with a lexer/parser. I don't know of any options in PHP (but since there are XML parsers in PHP, there are no doubt generic parsers). On the other hand, what you're asking to do is not too complicated. Using strings in PHP (substring, etc.) you could probably do this in a few recursive functions.
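A minimal sketch of the "few recursive functions" idea, using only string indexing and no regex. The function names and the $pick callback are invented for this example; the callback receives the list of alternatives and returns the chosen one, which keeps the demo deterministic. An unclosed group is treated as if it ended at the end of the input.

```php
<?php
// Reads literal text and nested groups until the end of input, or until a
// '|' or '}' that belongs to the enclosing group.
function spin_sequence(string $text, int &$pos, callable $pick, bool $inGroup): string
{
    $out = '';
    $length = strlen($text);
    while ($pos < $length) {
        $ch = $text[$pos];
        if ($ch === '{') {
            $pos++;
            $out .= spin_group($text, $pos, $pick);
        } elseif ($inGroup && ($ch === '|' || $ch === '}')) {
            break; // let the enclosing group handle the delimiter
        } else {
            $out .= $ch;
            $pos++;
        }
    }
    return $out;
}

// Called just after a '{': collects the '|'-separated alternatives, picks one.
function spin_group(string $text, int &$pos, callable $pick): string
{
    $options = [];
    do {
        $options[] = spin_sequence($text, $pos, $pick, true);
        $delimiter = $text[$pos] ?? '}'; // treat end of input as a closing brace
        $pos++;
    } while ($delimiter === '|');
    return $pick($options);
}

function spin(string $text, callable $pick): string
{
    $pos = 0;
    return spin_sequence($text, $pos, $pick, false);
}

$str = 'Donny went to the {park|store|{beach with friends|beach alone}} so he could get a breath of fresh air.';

echo spin($str, fn(array $o) => $o[0]), "\n";
// Donny went to the park so he could get a breath of fresh air.
echo spin($str, fn(array $o) => end($o)), "\n";
// Donny went to the beach alone so he could get a breath of fresh air.
```

For random output, pass a callback such as fn(array $o) => $o[array_rand($o)].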
You will then finally have created a MadLibz generator in PHP with a simple grammar. Pretty cool.
