Parsing nested structures in PHP with preg_match - php

Hello I want to make something like a meta language which gets parsed and cached to be more performant. So I need to be able to parse the meta code into objects or arrays.
Startidentifier: {
Endidentifier: }
You can navigate through objects with a dot(.) but you can also do arithmetic/logic/relational operations.
Here is an example of what the meta language looks like:
{mySelf.mother.job.jobName}
or nested
{mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}
or with operations
{obj.val * (otherObj.intVal + myObj.longVal) == 1200}
or more logical
{obj.condition == !myObj.otherCondition}
I think most of you already understood what i want. At the moment I can do only simple operations(without nesting and with only 2 values) but nesting for getting values with dynamic property names works fine. also the text concatination works fine
e.g. "Hello {myObj.name}! How are you {myObj.type}?".
Also the possibility to make short if like (condition) ? (true-case) : (false-case) would be nice but I have no idea how to parse all that stuff. I am working with loops with some regex at the moment but it would be probably faster and even more maintainable if I had more in regex.
So could anyone give me some hints or want to help me? Maybe visit the project site to understand what I need that for: http://sourceforge.net/projects/blazeframework/
Thanks in advance!

It is non-trivial to parse a indeterminate number of matching braces using regular expressions, because in general, either you will match too much or too little.
For instance, consider Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}? to use two examples from your input in the same string:
If you use the first regular expression that probably comes to mind \{.*\} to match braces, you will get one match: {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size} This is because by default, regular expressions are greedy and will match as much as possible.
From there, we can try to use a non-greedy pattern \{.*?\}, which will match as little as possible between the opening and closing brace. Using the same string, this pattern will result in two matches: {myObj.name} and {mySelf.{myObj.{keys["ObjProps"][0]}. Obviously the second is not a full expression, but a non-greedy pattern will match as little as possible, and that is the smallest match that satisfies the pattern.
PCRE does allow recursive regular expressions, but you're going to end up with a very complex pattern if you go down that route.
The best solution, in my opinion, would be to construct a tokenizer (which could be powered by regex) to turn your text into an array of tokens which can then be parsed.

maybe have a look at the PREG_OFFSET_CAPTURE flag!?

Related

Regular expression matching unescaped paired brackets and nested function calls

Here's the problem: given a string like
"<p>The price for vehicles {capitalize(pluralize(vehicle))} is {format_number(value, language)}</p><span>{employee_name}</span><span>\{do not parse me}</span>"
I need (1) a regex pattern in PHP that matches all values between un-escaped pairs of curly brackets and (2) another regex pattern that matches function calls and nested function calls (once the first pattern is matched). Of course, if I could use one regex only for both tasks that would be awesome.
By the way, I can't use Smarty, Twig or any other library - that's the only reason I have to build a parsing mechanism myself.
Thanks a ton!
Solution
(1) A partial solution for the first problem can be found here. Basically, we use the regex (?={((?:[^{}]++|{(?1)})++)}) and find the matches at index 1 of the resulting array.
It's partial because I still need to find a way of ignoring escaped braces, though.
(2) I'm considering the use of recursive regex, as suggested by Mario. Will post result here.
Thanks, guys!
Copying the answer from the comments in order to remove this question from the "Unanswered" filter:
This appears to be what you're looking for:
(?<!\\){([^(){}]+)\(((?:[^(){}]+|\(((?2))\))*)\)}
Link: http://www.regex101.com/r/uI4qN0
~ answer per Jerry
Note: For comparison - simply ignoring escaped braces is accomplished by adding a negative lookahead (?<!\\) to the beginning of the expression, like so:
(?<!\\)(?={((?:[^{}]++|{(?1)})++)})

Replacing unique nested statement with regex or alternatives

From the countless questions posted I know it's not possible/advisable to use regex to replace nested statements.
I'm wondering if it makes any difference in a case where statements are unique:
[if #test]TEST[if #second]SECOND[/if][/if]
I've gotten it work when the end blocks are also unique, which I know is clumsy workaround:
[if #test]TEST[if #second]SECOND[/if #second][/if #test]
$pattern = '%\[if #'.$dynamic.'.*?\](.*?)\[/if #'.$dynamic.'\]%s'; //Works with above
Is it possible to use regex without the end block being unique? Are there alternatives to regex that would accomplish this?
I would like to parse something like: [if #test]TEST[if #second]SECOND[/if][/if] with arbitrary nesting levels. If regex is not practical, can anyone suggest viable alternative in PHP?
In a proper solution you should tokenize the string in to its basic components such as tags, comments, text and whatever else you have there. This step can be done with regex, and produces a flat list of tokens. Next you go trough the tokens building a parse tree with all the structure and details needed. (Both steps can be combined and done in one pass as well.)
That way everything is under your control and you don't need to reparse any part of the code.
On the other hand it can be done with regex, but then you are more limited, and you need to reparse the nested parts of the code for every added depth.
Since you asked for a regex, here is one to match such nested ifs:
~
\[if\ #(\w++)]
(
(?>
(?: (?!\[if\ #\w++]|\[/if]) . )++
|
(?R)
)*+
)
\[/if]
~xs

Matching Parentheses with Division Operator - Regex

Examples:
input: (n!/(1+n))
output: frac{n!}{1+n}
input: ((n+11)!/(n-k)^(-1))
output: frac{(n+11)!}{(n-k)^(-1)}
input: (9/10)
output: frac{9}{10}
input: ((n+11)!/(n-k)^(-1))+(11)/(2)
output: frac{(n+11)!}{(n-k)^(-1)}+(11)/(2)
The following regex works if there are no sub parentheses.
\(([^\/\)]*)\/([^\)]*)\)
The following does matching parentheses
#\((([^()]++|\((?:[^()]++|(?R))+\))+)\)#
I just can not figure out how to "combine" them - write a single regex to handle division and balanced parentheses.
I think something like this should work:
((?:\w+|\((?1)\))(?:[+*^-](?1)|!)?)\/((?1))
Now, this probably isn't perfect, but here's the idea:
The first group, $1, is ((?:\w+|\((?1)\))(?:[+*^-](?1)|!)?), which is:
(A literal) or (a balanced expression wrapped in parentheses), followed by an optional operator and another balanced expression if needed.
Writing it that way, we can use (?1) anywhere in the regex to refer to another balanced expression.
Working example: http://ideone.com/PNLOD
I know you've already accepted a regexp, but the real answer to what you're trying to do is a proper parser... and no, you don't need to code this from scratch or reinvent the wheel. While a lot of people hate phpclasses, the evalMath class that you can find there is incredibly useful for parsing and evaluating mathematical formulae. See my answer to a similar question for details of how it can be used and extended.

REGEX (PCRE) matching only if zero or once

I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex has just to stop before any double symbol or punctuation in unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.
return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
take the 2nd capture group when it is matched. Example: http://www.ideone.com/ErTVA
But you could just consume ([\pS\pP])\\1+ and discard, or if doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record, since your lexer is going to use more than 1 regex anyway?
Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.

Regular expression to match a certain HTML element

I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.
You need to use a non-greedy selection by adding ? after .* :
$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
Note : If you need to match generic HTML, you should use a XML parser like DOM.
You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
How to parse HTML with PHP?
Best methods to parse HTML
$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
I got it. ;)
Chances are that you have multiple spans, and the regexp you're using will default to greedy mode
It's a lot easier using PHP's DOM Parser to extract content from HTML
I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"
...and this one:
'~<span class="hidden_text">[^><]++</span>~'
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and evaluation of source code wrapped in braces ({return "foo"}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.

Categories