Using regex variables in preg_match_all - php

I'm building a pseudo-variable parser in PHP to allow clean and simple syntax in views, and have implemented an engine for if-statements. Now I want to be able to use nested if-statements, and the simplest, most elegant solution I thought of was to use indentation as the marker for a block.
So this is basically the layout I'm looking for in the view:
{if x is empty}
    {if y is array}
        Hello World
    {endif}
{endif}
The script would find the first if-statement and match it with the endif on the same depth. If it evaluates to true the inside block will be parsed as well.
Now, I'm having trouble setting up the regular expression to use depth in the following code:
preg_match_all('|([\t ]?){if (.+?) is (.+?)}(.+?){endif}|s', $template, $match);
Basically I want the first match in ([\t ]?) to be placed before {endif} using some kind of variable, and make sure that the statement won't be complete if there is no matching {endif} on the same depth as the {if ...}.
Can you help me complete it?

You cannot in general use regular expressions for this problem, because the language you've defined is not regular (since it requires counting occurrences of {if} and {endif}).
What you've got is a variant of the classic matching parentheses problem.
You'd be better off using a small parser that keeps a counter or a stack to track occurrences of {if} and {endif} (in effect a pushdown automaton, since a plain finite-state machine cannot count).
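A minimal sketch of the stack idea, in case it helps. The function name pairIfBlocks and the tag regex are my own assumptions, not part of the question's engine; it only pairs each {if ...} with its {endif} by offset and does not evaluate anything.

```php
<?php
// Hypothetical helper: pair every {if ...} with its matching {endif}
// by pushing the offsets of open tags onto a stack.
function pairIfBlocks(string $template): array
{
    // Find all {if ...} and {endif} tags together with their byte offsets.
    preg_match_all(
        '/\{(if\b[^}]*|endif)\}/',
        $template,
        $tags,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $open  = []; // stack of offsets of still-unclosed {if ...} tags
    $pairs = []; // list of [offset of {if}, offset of its {endif}]

    foreach ($tags as $tag) {
        [$text, $offset] = $tag[0];
        if ($text === '{endif}') {
            if (!$open) {
                throw new RuntimeException('Unmatched {endif} at offset ' . $offset);
            }
            // The most recently opened {if} is the one this {endif} closes.
            $pairs[] = [array_pop($open), $offset];
        } else {
            $open[] = $offset;
        }
    }

    if ($open) {
        throw new RuntimeException('Unclosed {if} at offset ' . end($open));
    }
    return $pairs;
}

print_r(pairIfBlocks('{if x is empty}{if y is array}Hello World{endif}{endif}'));
```

Inner blocks come out first because they close first, which is convenient if you want to evaluate from the inside out.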

\1 will contain your first match, so you could do this:
preg_match_all('|([\t ]?){if (.+?) is (.+?)}(.+?)\1{endif}|s', $template, $match);
However, this won't necessarily match the correct occurrence of {endif}, only the next occurrence that is preceded by the correct number of tabs.
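One way to tighten this, as a sketch only: capture the whole indentation with [\t ]* instead of [\t ]?, and anchor both the {if} and the {endif} to the start of a line with the m flag. This assumes every tag sits on its own line and that indentation is consistent, which the question does not guarantee. The sample template below is made up.

```php
<?php
// Assumed sample template: one tab of indentation per nesting level.
$template = "{if x is empty}\n"
          . "\t{if y is array}\n"
          . "\t\tHello World\n"
          . "\t{endif}\n"
          . "{endif}\n";

// ^ with the m flag matches at every line start; \1 repeats the exact
// indentation captured in front of {if ...}, so an {endif} with a
// different indentation is skipped.
$pattern = '/^([\t ]*)\{if (.+?) is (.+?)\}(.*?)^\1\{endif\}/ms';

preg_match_all($pattern, $template, $match, PREG_SET_ORDER);

foreach ($match as $m) {
    echo "variable: {$m[2]}, test: {$m[3]}\n";
    echo "body:{$m[4]}";
}
```

Only the outer block is returned as a match here; the nested block is inside group 4 and would have to be run through the same pattern again.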

If you're using an explicit "endif" statement, your blocks are already closed and there's no need for any special indentation; just match '/{if.+?}(.+?){endif}/' (see note 1). If, on the other hand, you want Python-like indented blocks, you can get rid of {endif} (and the braces) and match only indent levels:
    if condition
        this line is in the if block
        this one too
    this one not
Your expression will look like this:
/([\t]+)if(.+)\n((\1[\t]+.+\n)+)/
"condition" will be match 2 and statement body match 3
Note 1: actually, this should be something like /{if}((.(?!{if))+?){endif}/s to handle nested ifs properly.
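For what it's worth, here is the indent-level expression run against the sample above. The tab-indented input string is my reconstruction (the pattern requires at least one tab before "if"), so treat it as an assumption.

```php
<?php
// Assumed input: the "if" line has one tab, its body has two.
$source = "\tif condition\n"
        . "\t\tthis line is in the if block\n"
        . "\t\tthis one too\n"
        . "\tthis one not\n";

// Group 1: indentation of the "if" line.
// Group 2: the condition.
// Group 3: all following lines indented deeper than group 1.
$pattern = '/([\t]+)if(.+)\n((\1[\t]+.+\n)+)/';

if (preg_match($pattern, $source, $m)) {
    echo 'condition: ', trim($m[2]), "\n";
    echo "body:\n", $m[3];
}
```

The last line is left out of the body because it is not indented deeper than the "if" line.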

Related

Regex match if statements that contain assignment operator

My requirement is to match all if statements that erroneously contain assignment operator instead of an equality operator (==).
I am sure my regex lacks a lot but the first problem that I notice is that I am having trouble containing the regex to stop searching after an if statement closes e.g. ') {'.
if(.*?)(\w+)\s*=\s*(\w+)\)
Note: lines containing >= or != must not be matched
Try this:
if\s*\([^{]*\s*\w+\s*=\s*\w+\s*[^{]*\)
How it's made:
You want to identify something like num=3 inside the parentheses of an if-statement. You can do this by using if\(\w+=\w+\). However, there's a small problem with this: it fails when there is white space, so it won't recognize num = 3 or apples = 8.
In order to make sure that our regex doesn't fail in case of white space, we modify it to: if\s*\(\s*\w+\s*=\s*\w+\s*\). Now, it can work even with white spaces. That's good; but there's still a small problem that needs to be addressed (like you mentioned in the comments). What if we have an if-statement like if ( apples==3 and mangoes=5 or oranges==4 )? Well, our regex will fail.
To address this issue, modify your regex to if\s*\([^{]*\s*\w+\s*=\s*\w+\s*[^{]*\). I have only added [^{]* to our regex. It tells the regex engine that there could be any number of characters (except {) before and after \s*\w+\s*=\s*\w+\s*.
That's it. You have a simple regex that matches all the if-statements which erroneously contain assignment operator instead of an equality operator.
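A quick way to try the final pattern from PHP, as a sketch. The sample lines are made up for illustration, and the pattern will still be fooled by an = inside a string literal.

```php
<?php
// The pattern from the answer above, wrapped in / delimiters.
$pattern = '/if\s*\([^{]*\s*\w+\s*=\s*\w+\s*[^{]*\)/';

// Made-up sample lines: only the first and the last contain a bare "=".
$lines = [
    'if (num = 3) {',
    'if (num == 3) {',
    'if (a >= b) {',
    'if (a != b) {',
    'if ( apples==3 and mangoes=5 or oranges==4 ) {',
];

// preg_grep keeps the original array keys of the matching lines.
$suspicious = preg_grep($pattern, $lines);

foreach ($suspicious as $index => $line) {
    echo "line $index looks like an assignment: $line\n";
}
```

The ==, >= and != lines are skipped because \w+\s*= needs a word character (plus optional white space) directly before the = sign, and \s*\w+ needs one directly after it.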
Just my five cents:
if(\s)*(\(){1}(.)+[^!=<>](=){1,1}?[^=]
This expression doesn't match "!=", "<=" or ">=", which are valid comparison operators.

Regular expression matching unescaped paired brackets and nested function calls

Here's the problem: given a string like
"<p>The price for vehicles {capitalize(pluralize(vehicle))} is {format_number(value, language)}</p><span>{employee_name}</span><span>\{do not parse me}</span>"
I need (1) a regex pattern in PHP that matches all values between un-escaped pairs of curly brackets and (2) another regex pattern that matches function calls and nested function calls (once the first pattern is matched). Of course, if I could use one regex only for both tasks that would be awesome.
By the way, I can't use Smarty, Twig or any other library - that's the only reason I have to build a parsing mechanism myself.
Thanks a ton!
Solution
(1) A partial solution for the first problem can be found here. Basically, we use the regex (?={((?:[^{}]++|{(?1)})++)}) and find the matches at index 1 of the resulting array.
It's partial because I still need to find a way of ignoring escaped braces, though.
(2) I'm considering the use of recursive regex, as suggested by Mario. Will post result here.
Thanks, guys!
Copying the answer from the comments in order to remove this question from the "Unanswered" filter:
This appears to be what you're looking for:
(?<!\\){([^(){}]+)\(((?:[^(){}]+|\(((?2))\))*)\)}
Link: http://www.regex101.com/r/uI4qN0
~ answer per Jerry
Note: For comparison, simply ignoring escaped braces is accomplished by adding a negative lookbehind (?<!\\) to the beginning of the expression, like so:
(?<!\\)(?={((?:[^{}]++|{(?1)})++)})
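Here is how that last expression could be used from PHP on the string in the question. This is only a sketch: I escaped the literal braces and chose ~ as the delimiter, both of which are my own choices. Note the four backslashes needed in a single-quoted PHP string to get the two-backslash \\ in the regex.

```php
<?php
$subject = '<p>The price for vehicles {capitalize(pluralize(vehicle))} is '
         . '{format_number(value, language)}</p><span>{employee_name}</span>'
         . '<span>\{do not parse me}</span>';

// (?<!\\)  : the opening brace must not be preceded by a backslash
// (?=...)  : zero-width lookahead, so overlapping/nested braces are found too
// group 1  : everything between a balanced pair of braces
$pattern = '~(?<!\\\\)(?=\{((?:[^{}]++|\{(?1)\})++)\})~';

preg_match_all($pattern, $subject, $m);

// $m[1] holds the contents of each un-escaped {...} pair.
print_r($m[1]);
```

The escaped \{do not parse me} is left alone because its opening brace fails the lookbehind.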

Replacing unique nested statement with regex or alternatives

From the countless questions posted I know it's not possible/advisable to use regex to replace nested statements.
I'm wondering if it makes any difference in a case where statements are unique:
[if #test]TEST[if #second]SECOND[/if][/if]
I've gotten it to work when the end blocks are also unique, which I know is a clumsy workaround:
[if #test]TEST[if #second]SECOND[/if #second][/if #test]
$pattern = '%\[if #'.$dynamic.'.*?\](.*?)\[/if #'.$dynamic.'\]%s'; //Works with above
Is it possible to use regex without the end block being unique? Are there alternatives to regex that would accomplish this?
I would like to parse something like: [if #test]TEST[if #second]SECOND[/if][/if] with arbitrary nesting levels. If regex is not practical, can anyone suggest viable alternative in PHP?
In a proper solution you should tokenize the string into its basic components, such as tags, comments, text and whatever else you have there. This step can be done with regex and produces a flat list of tokens. Next you go through the tokens, building a parse tree with all the structure and details needed. (Both steps can be combined and done in one pass as well.)
That way everything is under your control and you don't need to reparse any part of the code.
On the other hand, it can be done with regex, but then you are more limited, and you need to reparse the nested parts of the code for every added level of depth.
Since you asked for a regex, here is one to match such nested ifs:
~
  \[if\ \#(\w++)]
  (
    (?>
      (?: (?!\[if\ \#\w++]|\[/if]) . )++
    |
      (?R)
    )*+
  )
  \[/if]
~xs
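A sketch of how such a recursive pattern could drive the replacement. It assumes the # is escaped as \# (under the x flag an unescaped # outside a character class starts a comment), and the render function and the $flags array are hypothetical names of mine, not something from the question.

```php
<?php
// Recursive pattern for [if #name]...[/if]; nowdoc avoids PHP-level escaping.
// Group 1: the name after "#". Group 2: the body, including nested ifs.
const NESTED_IF = <<<'RE'
~
  \[if\ \#(\w++)]
  (
    (?>
      (?: (?!\[if\ \#\w++]|\[/if]) . )++
    |
      (?R)
    )*+
  )
  \[/if]
~xs
RE;

// Hypothetical renderer: keep a block's body when its flag is truthy,
// drop it otherwise. Each kept body is rendered again, so the nested
// parts are reparsed once per level of depth.
function render(string $tpl, array $flags): string
{
    $result = preg_replace_callback(
        NESTED_IF,
        function (array $m) use ($flags): string {
            return !empty($flags[$m[1]]) ? render($m[2], $flags) : '';
        },
        $tpl
    );
    return $result ?? $tpl; // preg_replace_callback returns null on error
}

echo render(
    'A[if #test]TEST[if #second]SECOND[/if][/if]B',
    ['test' => true, 'second' => false]
), "\n";
```

This is exactly the "reparse per depth" trade-off mentioned above; a tokenizer plus parse tree avoids it.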

Matching Parentheses with Division Operator - Regex

Examples:
input: (n!/(1+n))
output: frac{n!}{1+n}
input: ((n+11)!/(n-k)^(-1))
output: frac{(n+11)!}{(n-k)^(-1)}
input: (9/10)
output: frac{9}{10}
input: ((n+11)!/(n-k)^(-1))+(11)/(2)
output: frac{(n+11)!}{(n-k)^(-1)}+(11)/(2)
The following regex works if there are no sub parentheses.
\(([^\/\)]*)\/([^\)]*)\)
The following matches balanced parentheses:
#\((([^()]++|\((?:[^()]++|(?R))+\))+)\)#
I just cannot figure out how to "combine" them, that is, write a single regex that handles both division and balanced parentheses.
I think something like this should work:
((?:\w+|\((?1)\))(?:[+*^-](?1)|!)?)\/((?1))
Now, this probably isn't perfect, but here's the idea:
The first group, $1, is ((?:\w+|\((?1)\))(?:[+*^-](?1)|!)?), which is:
(A literal) or (a balanced expression wrapped in parentheses), followed by an optional operator and another balanced expression if needed.
Writing it that way, we can use (?1) anywhere in the regex to refer to another balanced expression.
Working example: http://ideone.com/PNLOD
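To see what the two groups capture, here is a small sketch in PHP (the ~ delimiter is my choice). It only inspects the numerator and denominator for two of the simpler inputs; note that the surrounding parentheses of the whole fraction are not part of the match, so a replacement would still have to deal with them.

```php
<?php
// Group 1: numerator (a balanced expression), group 2: denominator.
// (?1) re-enters group 1, so both sides may contain nested parentheses.
$pattern = '~((?:\w+|\((?1)\))(?:[+*^-](?1)|!)?)/((?1))~';

foreach (['(9/10)', '(n!/(1+n))'] as $input) {
    if (preg_match($pattern, $input, $m)) {
        printf("%s: numerator %s, denominator %s\n", $input, $m[1], $m[2]);
    }
}
```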
I know you've already accepted a regexp, but the real answer to what you're trying to do is a proper parser... and no, you don't need to code this from scratch or reinvent the wheel. While a lot of people hate phpclasses, the evalMath class that you can find there is incredibly useful for parsing and evaluating mathematical formulae. See my answer to a similar question for details of how it can be used and extended.

Parsing nested structures in PHP with preg_match

Hello, I want to make something like a meta language which gets parsed and cached for better performance. So I need to be able to parse the meta code into objects or arrays.
Startidentifier: {
Endidentifier: }
You can navigate through objects with a dot(.) but you can also do arithmetic/logic/relational operations.
Here is an example of what the meta language looks like:
{mySelf.mother.job.jobName}
or nested
{mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}
or with operations
{obj.val * (otherObj.intVal + myObj.longVal) == 1200}
or more logical
{obj.condition == !myObj.otherCondition}
I think most of you already understand what I want. At the moment I can only do simple operations (without nesting and with only 2 values), but nesting for getting values with dynamic property names works fine. Text concatenation also works fine,
e.g. "Hello {myObj.name}! How are you {myObj.type}?".
Also the possibility to make short if like (condition) ? (true-case) : (false-case) would be nice but I have no idea how to parse all that stuff. I am working with loops with some regex at the moment but it would be probably faster and even more maintainable if I had more in regex.
So could anyone give me some hints or want to help me? Maybe visit the project site to understand what I need that for: http://sourceforge.net/projects/blazeframework/
Thanks in advance!
It is non-trivial to parse an indeterminate number of matching braces using regular expressions, because in general you will match either too much or too little.
For instance, consider Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}? to use two examples from your input in the same string:
If you use the first regular expression that probably comes to mind, \{.*\}, to match braces, you will get one match: {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}. This is because by default, regular expressions are greedy and will match as much as possible.
From there, we can try to use a non-greedy pattern \{.*?\}, which will match as little as possible between the opening and closing brace. Using the same string, this pattern will result in two matches: {myObj.name} and {mySelf.{myObj.{keys["ObjProps"][0]}. Obviously the second is not a full expression, but a non-greedy pattern will match as little as possible, and that is the smallest match that satisfies the pattern.
PCRE does allow recursive regular expressions, but you're going to end up with a very complex pattern if you go down that route.
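The three behaviours described above can be checked side by side. This is just a sketch; the recursive pattern at the end is a minimal example of my own, not something from the question.

```php
<?php
$text = 'Hello {myObj.name}! '
      . '{mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}?';

// Greedy: runs from the first "{" to the last "}".
preg_match_all('/\{.*\}/', $text, $greedy);

// Non-greedy: stops at the first "}" it can find, cutting nested braces short.
preg_match_all('/\{.*?\}/', $text, $lazy);

// Recursive: (?R) re-enters the whole pattern for every nested "{...}".
preg_match_all('/\{(?:[^{}]++|(?R))*\}/', $text, $recursive);

print_r($greedy[0]);
print_r($lazy[0]);
print_r($recursive[0]);
```

Even the small recursive version only finds the outermost expressions; you would still have to descend into each one to get at the nested parts.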
The best solution, in my opinion, would be to construct a tokenizer (which could be powered by regex) to turn your text into an array of tokens which can then be parsed.
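A very small sketch of that approach, to show the shape of it. The function names and the ['expr' => ...] node layout are hypothetical, and it only handles brace nesting; operators, quotes and brackets inside an expression would need further token types.

```php
<?php
// Step 1: split the text into tokens, keeping "{" and "}" as their own tokens.
function tokenize(string $text): array
{
    return preg_split(
        '/([{}])/',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Step 2: walk the tokens and build a nested array. Each "{" starts a child
// node, each "}" finishes the current one. $pos is shared across recursion.
function parseTokens(array $tokens, int &$pos, int $depth = 0): array
{
    $nodes = [];
    while ($pos < count($tokens)) {
        $token = $tokens[$pos++];
        if ($token === '{') {
            $nodes[] = ['expr' => parseTokens($tokens, $pos, $depth + 1)];
        } elseif ($token === '}') {
            if ($depth === 0) {
                throw new RuntimeException('Unexpected closing brace');
            }
            return $nodes;
        } else {
            $nodes[] = $token; // plain text or part of an expression
        }
    }
    if ($depth > 0) {
        throw new RuntimeException('Missing closing brace');
    }
    return $nodes;
}

function parse(string $text): array
{
    $pos = 0;
    return parseTokens(tokenize($text), $pos);
}

print_r(parse('Hello {mySelf.{myObj.name}.size}!'));
```

Because the tree is built in one pass, nothing has to be reparsed per nesting level, and unbalanced braces are reported instead of silently mismatching.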
Maybe have a look at the PREG_OFFSET_CAPTURE flag?
