Regex: select between matched brackets : templating engine - php

I was playing with PHP today creating a small language (just for fun), but I encountered a problem:
How can I select between matching brackets?
My template string:
for(items as item){ // this bracket
if(some_condition){
// do stuff
} // my regex stops here
} // and this bracket
I used this regex [\w]+\([ \w]+\){([\s\n\r\t/\w(){}]+?)}, but it stops when it finds the first closing bracket.
How can I make it select everything between the matching brackets?
for(items as item){ // this bracket
if(some_condition){
// do stuff
} // my regex stops here
} // and this bracket
Then I will compile what's in the for separately.
PS: Please don't post comments like "don't bother doing this" or "don't reinvent the wheel". It is just for learning purposes.

You can use recursion:
$code = '
for(items as item) {
if(some_condition) {
while stuff {
hi
}
}
done
}
';
$re = '/{ ( ( [^{}] | (?R) ) * ) }/x';
preg_match_all($re, $code, $m);
print_r($m[1][0]);
This prints
if(some_condition) {
while stuff {
hi
}
}
done
that is, the inner block has been detected correctly.
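If the loop header is needed along with the body, the same recursion idea can be extended. This is only a sketch: the group layout and the use of (?3) instead of (?R) are my own choices, not part of the answer above, and the header part reuses the character classes from the question.

```php
<?php
$code = 'for(items as item){
if(some_condition){
// do stuff
}
}';

// Group 1: keyword, group 2: header, group 3: the braces, group 4: the body.
// (?3) re-enters group 3 whenever a nested "{" is met.
$re = '/(\w+)\(([ \w]+)\)\s*(\{((?:[^{}]++|(?3))*)\})/';

if (preg_match($re, $code, $m)) {
    echo $m[1], "\n"; // keyword of the outer statement
    echo $m[2], "\n"; // text between the parentheses
    echo $m[4], "\n"; // everything between the outer braces
}
```

Recursing into a numbered group rather than the whole pattern keeps the header out of the recursion, so nested blocks do not need a header of their own.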
That said, regular expressions are the wrong tool for parsing formal languages (they are fine for tokenizing, though). For example, the above will break hopelessly once you add a string literal containing "{":
for(items as item){
echo "hi there :{ ";
}
What you actually need is a parser, either crafted manually (good learning exercise!) or generated (see here for options).
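As a starting point for the hand-crafted route, here is a minimal scanner sketch that finds the matching closing brace while skipping quoted strings. The function name block_body is made up for illustration, and the sketch assumes backslash escapes inside strings and ignores comments, which a real template language would also have to handle.

```php
<?php
/**
 * Return the text between the "{" at offset $open and its matching "}",
 * or null if the block is never closed. (Hypothetical helper.)
 */
function block_body(string $src, int $open): ?string
{
    $depth = 0;
    $quote = null; // the quote character we are currently inside, if any
    $len = strlen($src);

    for ($i = $open; $i < $len; $i++) {
        $c = $src[$i];
        if ($quote !== null) {
            if ($c === '\\') {
                $i++;             // skip the escaped character
            } elseif ($c === $quote) {
                $quote = null;    // end of the string literal
            }
        } elseif ($c === '"' || $c === "'") {
            $quote = $c;          // start of a string literal
        } elseif ($c === '{') {
            $depth++;
        } elseif ($c === '}') {
            if (--$depth === 0) {
                return substr($src, $open + 1, $i - $open - 1);
            }
        }
    }
    return null;
}

$code = 'for(items as item){ echo "hi there :{ "; }';
echo block_body($code, strpos($code, '{'));
```

Because the brace counter is only touched outside of string literals, the "{" inside "hi there :{ " no longer confuses the match.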

You could try the regex below, which allows one additional } bracket to be matched inside the block.
[\w]+\([ \w]+\){([\s\n\r\t\/\w(){}]+?}[\s\n\r\t\/\w(){}]+?)}

Related

Regex to match blocks of text containing substrings wrapped in triple square brackets

Given a whole block of text:
Welcome to [[[RegExr v2.0 by gskinner.com]]]
Edit the Expression & Text to see matches. Roll over matches or the
expression for details. Undo mistakes with ctrl-z. Save & Share
expressions with friends or the Community. [[[A full Reference & Help is
available in the Library, or watch the video Tutorial.
Sample text for]]] testing: abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789 _+-.,!##$%^&*();/|<>"' 12345
-98.7 3.141 .6180 9,000 +42
555.123.4567 +1-(800)-555-2468 foo#demo.net bar.ba#test.co.uk www.demo.com
I need a regex that can validate that all open triple square brackets '[[[' in the string are paired up and closed ']]]'. Nested brackets and strings that begin with ']]]' or end with '[[[' should return false.
I know there are ways to loop through the string and evaluate this, but I might be potentially dealing with very large strings of text and was hoping a regex would be faster/better for performance.
Thanks.
I've come up with the following solution using the pattern: /[\[]{3}[^\[\]]*[\]]{3}/. Unfortunately the third $text case will still return false, so I'm working on that. The regex pattern can be seen in action here.
$text = 'Some [[[default]]] [[[text]]] here'; //valid
//$text = 'Some [[[default text [[[here]]]'; //invalid
//$text = 'Some [[[default text [here]]]'; //invalid
// Get the number of opening and closing brackets
$open_bracket_count = substr_count($text, '[[[');
$close_bracket_count = substr_count($text, ']]]');
// Check if number of '[[[' is same as ']]]'
if ($open_bracket_count === $close_bracket_count)
{
// Match valid bracketed substrings in the text
$validation_pattern = '/[\[]{3}[^\[\]]*[\]]{3}/';
$valid_match_count = preg_match_all($validation_pattern, $text, $valid_matches);
// Valid matches should equal the number of substrings attempting to be wrapped in brackets
if ($valid_match_count === $open_bracket_count)
{
return true;
}
else
{
return false;
}
}
// If not equal, we know right away the string contains invalid brackets
else
{
return false;
}
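An alternative sketch that avoids comparing counts: pull out every '[[[' and ']]]' token in order and check that they strictly alternate. The function name is hypothetical, and it assumes that single or double brackets inside a pair are ordinary text, so under that assumption the third $text case above would count as valid.

```php
<?php
// Hypothetical helper: true when every "[[[" is closed by a "]]]" and none are nested.
function triple_brackets_balanced(string $text): bool
{
    // Collect the triple-bracket tokens in the order they appear.
    preg_match_all('/\[\[\[|\]\]\]/', $text, $m);

    $open = false;
    foreach ($m[0] as $token) {
        if ($token === '[[[') {
            if ($open) {
                return false; // nested opening
            }
            $open = true;
        } else {
            if (!$open) {
                return false; // closing without a matching opening
            }
            $open = false;
        }
    }
    return !$open; // a trailing "[[[" must not be left unclosed
}

var_dump(triple_brackets_balanced('Some [[[default]]] [[[text]]] here'));
var_dump(triple_brackets_balanced('Some [[[default text [[[here]]]'));
```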

Regular expression to check that every opening brace has a closing brace

I need a regular expression that checks whether every opening bracket has a matching closing bracket.
For example,
([(([[(([]))]]))]) -- this one should return true, but
[](()()[[]])[][[([]) -- this one should return false.
For this, I've tried the following:
function check($str) {
$output = "";
$pattern = "/(\{[^}]*)([^{]*\})/im";
preg_match($pattern, $str, $match);
print_r($match[0]);
}
assert(check("[](()()[[]])[][[([])") === FALSE);
any help please...
The easiest way to do this (in my opinion) would be to implement a stack data structure and pass through your string. Essentially something like so:
Traverse the string left to right
If you find an opening parenthesis, add it to the stack
else (you find a closing parenthesis) make sure that the top most item in the stack is the same type of parenthesis (so make sure that if you found a }, the top most item in the stack is {). This should help scenarios where you have something like so: ({)}. If it matches, pop from the stack.
If you repeat the above operation throughout the entire string, you should end up with an empty stack. This would mean that you have managed to match every opening parenthesis with a closing parenthesis.
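The stack approach described above could look roughly like this in PHP. It is a sketch only: the function name brackets_balanced is made up, and any character that is not a bracket is simply ignored.

```php
<?php
// Hypothetical helper implementing the stack idea for (), [] and {}.
function brackets_balanced(string $str): bool
{
    $pairs = [')' => '(', ']' => '[', '}' => '{']; // closing => opening
    $stack = [];

    foreach (str_split($str) as $char) {
        if (in_array($char, $pairs, true)) {
            // Opening bracket: remember it.
            $stack[] = $char;
        } elseif (isset($pairs[$char])) {
            // Closing bracket: the top of the stack must be its partner.
            if (array_pop($stack) !== $pairs[$char]) {
                return false;
            }
        }
    }
    // Anything left on the stack was never closed.
    return count($stack) === 0;
}

var_dump(brackets_balanced('([(([[(([]))]]))])'));
var_dump(brackets_balanced('[](()()[[]])[][[([])'));
```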
You can use this:
$pattern = '~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~';
(?1) is a reference to the subpattern (not the matched content) in the capture group 1. Since I put it in the capture group 1 itself, I obtain a recursion.
I added anchors for the start ^ and the end $ of the string to be sure the whole string is checked.
Note: If you need to check a string that contains not only brackets, you can replace each (?1)* with:
(?>[^][}{)(]++|(?1))*
Note 2: If you want an empty string to return true, you must replace the last quantifier + with *
Working example:
function check($str, $display = false) {
if (preg_match('~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~', $str, $match)) {
if ($display) echo $match[0];
return true;
}
elseif (preg_last_error() == PREG_RECURSION_LIMIT_ERROR) {
if ($display) echo "The recursion limit has been reached\n";
return -1;
} else return false;
}
assert(check(')))') === false);
check('{[[()()]]}', true);
var_dump(check('{[[()()]]}'));
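For the variant from the first note (other characters allowed between the brackets), the substitution can be assembled like this. The function name is hypothetical; the pattern is the one above with each (?1)* replaced as described.

```php
<?php
// Hypothetical helper: bracket check that tolerates other characters inside the brackets.
function balanced_with_text(string $str): bool
{
    // Replacement for each (?1)*: runs of non-bracket characters or a nested group.
    $inner = '(?>[^][}{)(]++|(?1))*';
    $pattern = '~^(\(' . $inner . '\)|\[' . $inner . ']|{' . $inner . '})+$~';

    return preg_match($pattern, $str) === 1;
}

var_dump(balanced_with_text('(a[b]{c})'));
var_dump(balanced_with_text('(a[b)]'));
```

Note that the string still has to start and end with a bracket: text outside the outermost brackets, as in foo(bar), is not accepted by this pattern.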

Regular expression text between brackets

I have a problem that I have no idea how to solve, and I'm not sure whether regular expressions are the best way.
My idea is to get the name, variables, and content of the functions in a file.
This is my regular expression:
preg_match_all('/function (.*?)\((.*?)\)(.*?)\{(.*?)\}/s',$content,$funcs,PREG_SET_ORDER);
And I have this testfile:
function testfunc($text)
{
if ($text)
{
return 1;
}
return 0;
}
Of course, I only get everything up to the "}" before return 0;
Is there a way to get everything in the function, i.e. to find the right "}"?
Contrary to popular belief, PHP (PCRE) supports recursive patterns, which let you match nested brackets. Consider this code:
$str = <<<'EOF'
function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
EOF;
if ( preg_match('/ \{ ( (?: [^{}]* | (?0) )+ ) \} /x', $str, $m) )
echo $m[0];
OUTPUT:
{
if ($text) {
return 1;
}
return 0;
}
UPDATE: To capture function name and arguments as well try this code:
$str = <<<'EOF'
function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
EOF;
if ( preg_match('/ (function [^{]+ ) ( \{ (?: [^{}]* | (?-1) )* \} ) /x', $str, $m) )
print_r ($m);
OUTPUT
Array
(
[0] => function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
[1] => function testfunc($text)
[2] => {
if ($text) {
return 1;
}
return 0;
}
)
Working Online Demo: http://ideone.com/duQw9c
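To get name, arguments and body for every function in a file, which is what the question asked for, the recursive idea can be combined with preg_match_all and PREG_SET_ORDER. This is a sketch with my own group layout; it shares the weakness of all brace-counting regexes, namely that a brace inside a string literal or comment will throw it off.

```php
<?php
$content = <<<'EOF'
function testfunc($text)
{
    if ($text)
    {
        return 1;
    }
    return 0;
}
function other()
{
    return 2;
}
EOF;

// Group 1: name, group 2: arguments, group 3: braces, group 4: body.
// (?3) recurses into the brace group for nested blocks.
$re = '/function\s+(\w+)\s*\(([^)]*)\)\s*(\{((?:[^{}]++|(?3))*)\})/';

preg_match_all($re, $content, $funcs, PREG_SET_ORDER);
foreach ($funcs as $f) {
    echo $f[1], '(', $f[2], ') => ', strlen($f[4]), " bytes of body\n";
}
```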
Regular expressions are not the best tool for that job. Parsers are.
No doubt you can use regexp callbacks to eventually manage what you intend, but this would be ungodly obfuscated and fragile.
A parser can easily do the same job. Better still, if you are planning on parsing PHP with PHP, you can use the Zend parser that does the job for you.
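PHP exposes its own lexer through token_get_all(), so a rough sketch of that idea is to count braces on tokens instead of characters. The function name function_bodies is made up, and the sketch assumes plain named functions: closures, methods without a body and nested function declarations would need extra handling.

```php
<?php
// Hypothetical helper: map function name => body text, using PHP's tokenizer.
function function_bodies(string $php): array
{
    $bodies = [];
    $name = null;        // function whose body we are waiting for or collecting
    $expectName = false; // true right after the "function" keyword
    $depth = 0;          // brace depth inside the current body
    $body = '';

    foreach (token_get_all($php) as $token) {
        // Multi-character tokens are arrays [id, text, line]; single characters are strings.
        [$id, $text] = is_array($token) ? $token : [null, $token];

        if ($depth > 0) {
            $opens = ($id === null && $text === '{')
                || $id === T_CURLY_OPEN
                || $id === T_DOLLAR_OPEN_CURLY_BRACES;
            $closes = ($id === null && $text === '}');
            if ($opens) {
                $depth++;
            } elseif ($closes && --$depth === 0) {
                $bodies[$name] = $body; // matching "}" of the function found
                $name = null;
                $body = '';
                continue;
            }
            $body .= $text;
        } elseif ($id === T_FUNCTION) {
            $expectName = true;
        } elseif ($expectName && $id === T_STRING) {
            $name = $text;
            $expectName = false;
        } elseif ($name !== null && $id === null && $text === '{') {
            $depth = 1; // body starts here
        }
    }
    return $bodies;
}

$source = <<<'PHP'
<?php
function testfunc($text)
{
    if ($text)
    {
        return 1;
    }
    echo "}";
    return 0;
}
PHP;

print_r(function_bodies($source));
```

Since the "}" inside the string literal arrives as part of a string token, it is never mistaken for the end of the function.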
Not in general. You can of course define a regex for parsing two levels deep, something like function (.*)\((.*)\)(.*)\{([^}]*(\{[^}]*\})*)\}, but since such structures can be nested arbitrarily deep, you will eventually run out of regex :D. One needs a context-free grammar to do this.
You can generate such grammar parsers for instance with Yacc, Bison, Gppg,...
Furthermore, note that .*? is the lazy form of .*: both match zero or more characters, but .*? matches as few as possible, whereas .+ requires at least one character.
Is there a way to get everything in the function, i.e. to find the right "}"?
Short Answer: no.
Long Answer:
This cannot be handled with a single expression. { and } can also appear inside a method body, which makes it hard to find the correct closing }. You would need to process (iteratively or recursively) all pairs of {} and manually pick out the pairs that have a "method name" in front of them.
This, however, isn't simple either, because you need to exclude all the statements that look like a function but are valid inside the method body.
I don't think regex is the way to go for such a task. Even if you managed to create all the required regex patterns, performance would likely be worse than with any dedicated parser.

PHP Regular expression to capture code

I have been trying to capture code blocks in a similar fashion to wiki tags:
{{code:
code goes here
}}
Example code is shown below,
$strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
if (ctype_alnum($testcase)) {
echo "It is The string $testcase consists of all letters or digits.\n";
} else {
echo "The string $testcase does not consist of all letters or digits.\n";
}
}
Essentially I want to capture anything between the {{..}}. There are multiple blocks like this embedded in an HTML page.
I would appreciate any help.
Well to start off, regex is not a good way to solve this problem. The right approach is to write a parser that understands language semantics and can tease out the subtleties. Having said that, if you still want a quick and dirty regex based approach that will work 99.99% of the time but has a couple of acknowledged bugs (see end of answer), Here you go:
You can use preg_match_all(). Here is a proof of concept:
$input = "
<html>
<head>
<title>{{code:echo 'Hello World';}}</title>
</head>
<body>
<h1>{{code:\$strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach (\$strings as \$testcase) {
if (ctype_alnum(\$testcase)) {
echo \"It is The string \$testcase consists of all letters or digits.\\n\";
} else {
echo \"The string $testcase does not consist of all letters or digits.\\n\";
}
}
}}</h1>
</body>
</html>
";
$matches = array();
preg_match_all('/{{code:([^\x00]*?)}}/', $input, $matches);
print_r($matches[1]);
Outputs the following:
Array
(
[0] => echo 'Hello World';
[1] => $strings = array('AbCd1zyZ9', 'foo!#');
foreach ($strings as $testcase) {
if (ctype_alnum($testcase)) {
echo "It is The string $testcase consists of all letters or digits.\n";
} else {
echo "The string does not consist of all letters or digits.\n";
}
}
)
Be careful. There are some edge case bugs involving early termination by encountering }} within a "code" block:
If }} appears in a quoted string, the regex matches too early
If } is the last character of your "code" block and it's immediately followed by }}, you'll lose the closing } from your code block.
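Both edge cases are easy to reproduce with the pattern from above. The two input strings here are made up for illustration.

```php
<?php
$pattern = '/{{code:([^\x00]*?)}}/';

// Edge case 1: "}}" inside a quoted string ends the match too early.
preg_match($pattern, '<div>{{code:$myvar = \'}}\';}}</div>', $m1);
var_dump($m1[1]); // stops right after the opening quote

// Edge case 2: a code block ending in "}" loses that brace to the terminator.
preg_match($pattern, '{{code:if ($a) { b(); }}}', $m2);
var_dump($m2[1]); // the closing "}" of the if block is missing
```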
As I've said in the comments, Asaph's answer is a good solid regex, but breaks down when }} is contained within the code block. Hopefully this won't be a problem, but as there is a possibility of it, it would be best to make your regex a little more expansive. If we can assume that any }} appearing between two single-quotes does not signify the end of the code, as in Asaph's example of <div>{{code:$myvar = '}}';}}</div>, we can expand our regex a bit:
{{code:((?:[^']*?'[^']*?')*?[^']*?)}}
[^']*?' looks for a set of non-' characters, followed by a single quote, and [^']*?'[^']*?' looks for two of them in succession. This "swallows" strings like '}}'. We lazily look for any number of these strings, then the rest of any non-string code with [^']*?, and finally our ending }}.
This allows us to match the entire string {{code:$myvar = '}}';}} rather than just {{code:$myvar = '}}.
There are still problems with this method, however. Escaping a quote within a string, such as in {{code:$myvar = '\'}}\'';}} will not work, as we will "swallow" '\' first, and end with the }} immediately following. It may be possible to determine these escaped single-quotes as well, or to add in support for double-quoted strings, but you need to ask yourself at what point using a code-parser is a better idea.
See the entire Regex in action here. (If it doesn't match anything at first, just click the window.)
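In case the linked demo is not available, a small PHP check of the quote-aware pattern might look like this. The ~ delimiters are not needed here, so plain slashes are used, and the pattern is put in a double-quoted string only to avoid escaping the single quotes.

```php
<?php
// Quote-aware pattern from above, wrapped in / delimiters.
$expression = "/{{code:((?:[^']*?'[^']*?')*?[^']*?)}}/";
$input = '<div>{{code:$myvar = \'}}\';}}</div>';

preg_match($expression, $input, $m);
echo $m[0], "\n"; // the whole {{code:...}} block
echo $m[1], "\n"; // just the code inside it
```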
how can I use the result to, say, place it in a new <div>?
Use the replace function:
preg_replace($expression, "<div>$0</div>", $input)
$0 inserts the entire match, and will place it between a new <div> block. Alternatively, if you just want the actual source code, use $1, as we captured the source code in a separate capture group.
Again, see the replacement here.
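If the captured code is meant to be displayed on the page, it probably also needs escaping, which a plain replacement string cannot do. A sketch using preg_replace_callback, under the assumption that the code should be shown as text rather than executed:

```php
<?php
$expression = "/{{code:((?:[^']*?'[^']*?')*?[^']*?)}}/";
$input = "<p>{{code:echo '<b>hi</b>';}}</p>";

// Wrap the captured source in a <div> and escape it so the browser shows it literally.
$output = preg_replace_callback($expression, function (array $m): string {
    return '<div>' . htmlspecialchars($m[1], ENT_QUOTES) . '</div>';
}, $input);

echo $output, "\n";
```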
I went deeper down the rabbit hole...
{{code:((?:(?:[^']|\\')*?(?<!\\)'(?:[^']|\\')*?(?<!\\)')*?(?:[^']|\\')*?)}}
This won't break with escaped single-quotes, and correctly matches {{code:$myvar = '\'}}\'';}}.
Ta-da.
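Use
preg_match_all("/{{(.*?)}}/s", $text, $match)
where $text is the text that might contain code.
This captures anything between {{ and }}: the lazy .*? stops at the first }}, and the s modifier lets . match newlines. The captured contents end up in $match[1].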

Can Regex/preg_replace be used to add an anchor tag to a keyword?

I would like to be able to switch this...
My sample [a id="keyword" href="someURLkeyword"] test keyword test[/a] link this keyword here.
To...
My sample [a id="keyword" href="someURLkeyword"] test keyword test[/a] link this [a href="url"]keyword[/a] here.
I can't simply replace all instances of "keyword" because some are used in or within an existing anchor tag.
Note: Using PHP5 preg_replace on Linux.
Using regular expressions may not be the best way to solve this problem, but here is a quick solution:
function link_keywords($str, $keyword, $url) {
$keyword = preg_quote($keyword, '/');
$url = htmlspecialchars($url);
// Split the string on all <a> tags, keeping the matched delimiters:
$split_str = preg_split('#(<a\s.*?</a>)#i', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
// loop through the results and process the sections between <a> tags
$result = '';
foreach ($split_str as $sub_str) {
if (preg_match('#^<a\s.*?</a>$#i', $sub_str)) {
$result .= $sub_str;
} else {
// split on all remaining tags
$split_sub_str = preg_split('/(<.+?>)/', $sub_str, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($split_sub_str as $sub_sub_str) {
if (preg_match('/^<.+>$/', $sub_sub_str)) {
$result .= $sub_sub_str;
} else {
$result .= preg_replace('/'.$keyword.'/', '<a href="'.$url.'">$0</a>', $sub_sub_str);
}
}
}
}
return $result;
}
The general idea is to split the string into links and everything else. Then split everything outside of a link tag into tags and plain text and insert links into the plain text. That will prevent [p class="keyword"] from being expanded to [p class="[a href="url"]keyword[/a]"].
Again, I would try to find a simpler solution that does not involve regular expressions.
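For what it's worth, the same "skip links, skip tags, link the rest" idea can be written more compactly with PCRE's backtracking verbs (*SKIP)(*FAIL). This is a sketch, not a drop-in replacement: the function name is made up, it assumes real <a> tags as in the code above, and it is just as fragile as any other regex run over HTML.

```php
<?php
// Hypothetical helper: link $keyword everywhere except inside tags and existing links.
function link_keyword(string $html, string $keyword, string $url): string
{
    $pattern = '~(?i:<a\s.*?</a>)(*SKIP)(*FAIL)' // whole existing links: skip over them
             . '|<[^>]+>(*SKIP)(*FAIL)'          // any other tag: skip over it
             . '|' . preg_quote($keyword, '~')   // what is left is plain text
             . '~s';
    $href = htmlspecialchars($url, ENT_QUOTES);

    return preg_replace_callback($pattern, function (array $m) use ($href): string {
        return '<a href="' . $href . '">' . $m[0] . '</a>';
    }, $html);
}

echo link_keyword(
    'My sample <a id="keyword" href="someURLkeyword"> test keyword test</a> link this keyword here.',
    'keyword',
    'url'
), "\n";
```

Using a callback instead of a replacement string avoids surprises if the URL happens to contain something that looks like a backreference.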
You can't do this with regular expressions alone. Regular expressions have no notion of context: they simply match a pattern, without regard to the surroundings. To do what you want, you need to parse the source into an abstract representation, and then transform it into your target output.
