Regular expression for template engine?

I'm learning about regular expressions and want to write a templating engine in PHP.
Consider the following "template":
<!DOCTYPE html>
<html lang="{{print("{hey}")}}" dir="{{$dir}}">
<head>
<meta charset="{{$charset}}">
</head>
<body>
{{$body}}
{{}}
</body>
</html>
I managed to create a regex that will find anything except for {{}}.
Here's my regex:
{{[^}]+([^{])*}}
There's just one problem. How do I allow the literal { and } to be used within {{}} tags?
It will not find {{print("{hey}")}}.
Thanks in advance.

This is a pattern to match the content inside double curly brackets:
$pattern = <<<'LOD'
~
(?(DEFINE)
(?<quoted>
' (?: [^'\\]+ | (?:\\.)+ )++ ' |
" (?: [^"\\]+ | (?:\\.)+ )++ "
)
(?<nested>
{ (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+ }
)
)
{{
(?<content>
(?:
[^"'{}]+
| \g<quoted>
| \g<nested>
)*+
)
}}
~xs
LOD;
Compact version:
$pattern = '~{{((?>[^"\'{}]+|((["\'])(?:[^"\'\\\]+|(?:\\.)+|(?:(?!\3)["\'])+)++\3)|({(?:[^"\'{}]+|\g<2>|(?4))*+}))*+)}}~s';
The content is in the first capturing group, but you can use the named capture 'content' with the detailed version.
Although this pattern is longer, it allows anything you want inside quoted parts, including escaped quotes, and it is faster than a simple lazy quantifier in many cases.
Nested curly brackets are allowed too; you can write {{ doThat(){ doThis(){ }}}} without problems.
The subpattern for quotes can also be written like this, which avoids repeating the same thing for single and double quotes (I use it in the compact version):
(["']) # the quote type is captured (single or double)
(?: # open a group (for the various alternatives)
[^"'\\]+ # all characters that are not a quote or a backslash
| # OR
(?:\\.)+ # escaped characters (the dot needs the s modifier to match newlines)
| #
(?!\g{-1})["'] # a quote that is not the captured quote
)++ # repeat one or more times
\g{-1} # the captured quote (-1 refers to the last capturing group)
Note: a literal backslash in the regex must be written \\ in nowdoc syntax, but \\\ or \\\\ inside a single-quoted PHP string.
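To make the escaping rule concrete, here is a minimal sketch (the variable names are only illustrative) that spells the same regex fragment three ways and checks that PHP hands the identical five characters to PCRE each time:

```php
<?php
// Three ways to write the regex fragment [^\\] (a class that excludes the backslash).

// Nowdoc: nothing is unescaped, so two backslashes in the source stay two.
$fromNowdoc = <<<'RE'
[^\\]
RE;

// Single quotes: \\ collapses to one backslash, while a backslash before
// any other character (here "]") is kept as it is.
$threeSlashes = '[^\\\]';

// Single quotes again, with each regex backslash escaped explicitly.
$fourSlashes = '[^\\\\]';

var_dump($fromNowdoc === $threeSlashes); // bool(true)
var_dump($fromNowdoc === $fourSlashes);  // bool(true)
var_dump(strlen($fromNowdoc));           // int(5)
```

The three-backslash form only works because the character after it is not a backslash or a quote; the four-backslash form is the safer habit.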
Explanations for the detailed pattern:
The pattern is divided into two parts:
the definitions, where I define named subpatterns
the whole pattern itself
The definition section avoids repeating the same subpattern several times in the main pattern and makes the pattern clearer. You can define subpatterns for later use inside (?(DEFINE)...).
This section contains 2 named subpatterns:
quoted : that contains the description of quoted parts
nested : that describes nested curly brackets parts
detail of nested
(?<nested> # open the named group "nested"
{ # literal {
## what can contain curly brackets? ##
(?> # open an atomic* group
[^"'{}]+ # all characters one or more times, except "'{}
| # OR
\g<quoted> # quoted content, to avoid curly brackets inside quoted parts
# (I call the subpattern I defined before, instead of rewriting it)
| \g<nested> # OR curly parts. This is a recursion
)*+ # repeat the atomic group zero or more times (possessive *)
} # literal }
) # close the named group
(* more information about atomic groups and possessive quantifiers)
But all of this is only definitions; the pattern really begins with: {{
Then I open a named capture group (content) and describe what can be found inside (nothing new here).
I use two modifiers, x and s. x activates verbose mode, which lets you put whitespace freely in the pattern (useful for indentation). s is the single-line mode: in this mode the dot can match newlines (it can't by default). I use it because there is a dot in the quoted subpattern.
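As a usage sketch, the detailed pattern can be run against the template from the question with preg_match_all. The $template and $matches variable names are assumptions made for this illustration; the pattern itself is copied unchanged apart from indentation, which the x modifier ignores.

```php
<?php
// The detailed pattern from above (indentation is ignored in x mode).
$pattern = <<<'LOD'
~
(?(DEFINE)
  (?<quoted>
    ' (?: [^'\\]+ | (?:\\.)+ )++ ' |
    " (?: [^"\\]+ | (?:\\.)+ )++ "
  )
  (?<nested>
    { (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+ }
  )
)
{{
  (?<content>
    (?:
        [^"'{}]+
      | \g<quoted>
      | \g<nested>
    )*+
  )
}}
~xs
LOD;

// The template from the question, kept in a nowdoc so that $dir etc. stay literal.
$template = <<<'HTML'
<!DOCTYPE html>
<html lang="{{print("{hey}")}}" dir="{{$dir}}">
<head>
<meta charset="{{$charset}}">
</head>
<body>
{{$body}}
{{}}
</body>
</html>
HTML;

preg_match_all($pattern, $template, $matches);

// The named group holds what is between {{ and }} for each tag:
// print("{hey}"), $dir, $charset, $body and an empty string for {{}}.
print_r($matches['content']);
```

Note that {{}} is reported as a match with empty content, because the content group may repeat zero times.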

You can just use "." instead of the character classes. But you then have to make use of non-greedy quantifiers:
\{\{(.+?)\}\}
The quantifier "+?" means it will consume the least necessary number of characters.
Consider this example:
<table>
<tr>
<td>{{print("{first name}")}}</td><td>{{print("{last name}")}}</td>
</tr>
</table>
With a greedy quantifier (+ or *), you'd only get one result, because it sees the first {{ and then the .+ consumes as many characters as it can as long as the pattern is matched:
{{print("{first name}")}}</td><td>{{print("{last name}")}}
With a non-greedy one (+? or *?) you'll get the two as separate results:
{{print("{first name}")}}
{{print("{last name}")}}
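The difference can be checked with a short sketch; the $html, $greedy and $lazy names are only illustrative:

```php
<?php
$html = '<td>{{print("{first name}")}}</td><td>{{print("{last name}")}}</td>';

// Greedy: .+ runs to the last }} on the line, so there is a single match.
preg_match_all('~\{\{(.+)\}\}~', $html, $greedy);

// Lazy: .+? stops at the first }} it can reach, so each tag is matched separately.
preg_match_all('~\{\{(.+?)\}\}~', $html, $lazy);

print_r($greedy[0]); // one element spanning both tags
print_r($lazy[0]);   // two elements, one per tag
```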

Make your regex less greedy using {{(.*?)}}.

I figured it out. Don't ask me how.
{{[^{}]*("[^"]*"\))?(}})
This will match pretty much anything.. like for example:
{{print("{{}}}{{{}}}}{}}{}{hey}}{}}}{}7")}}

Related

How to simplify this regex to avoid recursion?

Regex:
(?|`(?>[^`\\]|\\.|``)*`|'(?>[^'\\]|\\.|'')*'|"(?>[^"\\]|\\.|"")*"|(\?{1,2})|(:{1,2})([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*))
Example input:
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
https://regex101.com/r/HnTVXx/6
It should match ?, ??, :xyz and ::xyz but not if they are inside of a backquoted-string, double-quoted string, or single-quoted string.
When I try running this in PHP with a very large input I get PREG_RECURSION_LIMIT_ERROR from preg_last_error().
How can I simplify this regex pattern so that it doesn't do so much recursion?
Here's some test code that shows the error in PHP using Niet's optimized regex: https://3v4l.org/GdtmP Error code 6 is PREG_JIT_STACKLIMIT_ERROR. The other one I've seen is 3=PREG_RECURSION_LIMIT_ERROR

The general idea of "match this thing, but not in this condition" can be achieved with this pattern:
(don't match this(*SKIP)(*FAIL)|match this)
In your case, you'd want something like...
(
(['"`]) # capture this quote character
(?:\\.|(?!\1).)*+ # any escaped character, or
# any character that isn't the captured one
\1 # the captured quote again
(*SKIP)(*FAIL) # ignore this
|
\?\?? # one or two question marks
|
::?\w+ # word characters marked with one or two colons
)x
https://regex101.com/r/HnTVXx/7
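A minimal sketch of how this could be run in PHP against the example input. It assumes ~ as the delimiter (so group 1 is still the quote character) and uses illustrative variable names:

```php
<?php
$pattern = <<<'RE'
~
  (['"`])              # group 1: the opening quote character
  (?:\\.|(?!\1).)*+    # escaped characters, or anything that is not that quote
  \1                   # the matching closing quote
  (*SKIP)(*FAIL)       # discard the quoted string and resume after it
|
  \?\??                # one or two question marks
|
  ::?\w+               # a name introduced by one or two colons
~x
RE;

$sql = <<<'SQL'
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
SQL;

preg_match_all($pattern, $sql, $m);

// Placeholders found outside quoted strings: ?, ??, ::col, :least
print_r($m[0]);
```

The question marks inside "what?" and `col?` are not reported, because the quoted branch consumes them before failing.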

This uses the same idea to skip quoted parts (the (*SKIP)(*F) combo), plus two techniques that reduce the regex engine's work:
the first character discrimination
the unrolled pattern
These 2 techniques have something in common: limiting the cost of alternations.
The first character discrimination is useful when your pattern starts with an alternation. The problem with an alternation at the beginning is that every branch must be tested before a position can be ruled out as a failure. Since most positions in a string are failing positions, discarding them quickly is a significant improvement.
For instance, something like: "...|'...|`...|:... can also be written like this:
(?=["'`:])(?:"...|'...|`...|:...)
or
["'`:](?:(?<=")...|(?<=')...|(?<=`)...|(?<=:)...)
This way, each position that doesn't start with one of the characters ["'`:] is rejected immediately by the first token, without testing each branch.
The unrolled pattern consists of rewriting something like " (?:[^"\\]|\\.)* " into:
" [^"\\]* (?: \\. [^"\\]* )* "
Note that this design eliminates the alternation and drastically reduces the number of steps (compare the basic and unrolled versions).
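As a quick check that the two forms are interchangeable, this sketch (illustrative variable names, x modifier added so the spaces are ignored) runs both on the same subject and compares the results. It does not measure steps; it only shows that the matches are identical.

```php
<?php
// Alternation form: every character inside the quotes goes through the group.
$basic = <<<'RE'
~" (?: [^"\\] | \\. )* "~xs
RE;

// Unrolled form: runs of ordinary characters are consumed in one go,
// and the group is entered only when a backslash is met.
$unrolled = <<<'RE'
~" [^"\\]* (?: \\. [^"\\]* )* "~xs
RE;

$subject = <<<'TXT'
say "a \"quoted\" word" and "plain"
TXT;

preg_match_all($basic, $subject, $a);
preg_match_all($unrolled, $subject, $b);

var_dump($a[0] === $b[0]); // bool(true)
print_r($b[0]);            // the two quoted strings, escaped quotes included
```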
Using these 2 techniques, your pattern can be written like this:
~
[`'"?:]
(?:
(?<=`) [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` (*SKIP) (*F)
|
(?<=') [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' (*SKIP) (*F)
|
(?<=") [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " (*SKIP) (*F)
|
(?<=\?) \??
|
(?<=:) :? ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~x
Another way: instead of using an alternation (improved or not) at the beginning, you can build a pattern that matches the whole string with contiguous results. The general design is:
\G (all I don't want) (*SKIP) \K (what I am looking for)
\G is an anchor that matches either the position after the previous result or the start of the string. Starting a pattern with it ensures that all the matches are contiguous. In this situation (at the very beginning of the pattern and applying to the whole pattern), you can also replace it with the A modifier.
That gives:
~
[^`'"?:]*
(?:
` [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` [^`'"?:]*
|
' [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' [^`'"?:]*
|
" [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " [^`'"?:]*
)*
\K # only the part of the match after this position is returned
(*SKIP) # if the next subpattern fails, the contiguity is broken at this position
(?:
\?{1,2}
|
:{1,2} ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~Ax

PHP PCRE - Regex upgraded now failing (Catastrophic backtracking) + optimization

I first posted this question :
Regex matching nested beginning and ending tags
It was answered perfectly by Wiktor Stribiżew. Now, I wanted to upgrade my regex so that my parameters support a JSON object (or almost, because lonely '{' and '[' aren't supported).
I have two expressions: one for paired tags, one for lonely tags. I use the paired one first and, when all replacements are done, I execute the lonely one. The modified lonely one works fine on regex101.com (https://www.regex101.com/r/HIEQZk/9), but the paired one tells me "catastrophic backtracking" (https://www.regex101.com/r/HIEQZk/8) even though in PHP it doesn't crash.
So could anyone help me optimize/fix this fairly huge regex?
Even though there seems to be useless escaping, it is because begin/end markers and the splitter can be customized and thus have to be escaped. (The paired one is not as escaped because it is not the one generated by PHP, but the one made by Wiktor Stribiżew with the modifications I did.)
The only part that I think that shall be optimized/fixed is the "parameters" group which I just modified to support JSON objects. (Tests of these can be seen in the earlier versions of the same regex101 url. The ones here are with a real HTML to parse.)
Lonely expression
~
\{\{ #Instruction start
([^\^\{\}]+) # (Group 1) Instruction name OR variable to reach if nothing else after then
(?:
\^
(?:([^\\^\{\}]*)\^)? #(Group 2) Specific delimiter
([^\{\}]*{(?:[^{}\[\]]+|(?3))+}[^\{\}]*|[^\{\}]*\[(?:[^{}\[\]]+|(?3))+\][^\{\}]*|[^\{\}]+) # (Group 3) Parameters
)?
\}\} #Instruction end
~xg
Paired expression
~{{ # Opening tag start
(\w+) # (Group 1) Tag name
(?: # Not captured group for optional parameters
(?: # Not captured group for optional delimiter
\^ # Aux delimiter
([^^\{\}]?) # (Group 2) Specific delimiter
)?
\^ # Aux delimiter
([^\{\}]*{(?:[^{}\[\]]+|(?3))+}[^\{\}]*|[^\{\}]*\[(?:[^{}\[\]]+|(?3))+\][^\{\}]*|[^\{\}]+) # (Group 3) Parameters
)?
}} # Opening tag end
( # (Group 4)
(?>
(?R) # Repeat the whole pattern
| # or match all that is not the opening/closing tag
[^{]*(?:\{(?!{/?\1[^\{\}]*}})[^{]*)*
)* # Zero or more times
)
{{/\1}} # Closing tag
~ix

Try replacing your (?: non-capturing groups with (?> atomic groups to prevent or reduce backtracking wherever possible; those are non-capturing as well. And/or experiment with possessive quantifiers while watching the step counter/debugger in regex101.
Do this wherever you don't want the engine to go back and try other ways to match.
This is your updated demo, where I just changed the first (?: to (?>
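To illustrate what "not going back" means, here is a deliberately tiny sketch, unrelated to the pattern in the question, in which the atomic group changes the result. The variable names are only illustrative.

```php
<?php
// Non-capturing group: a+ first takes "aaa", then gives one "a" back
// so that the following "ab" can match.
$backtracking = preg_match('~(?:a+)ab~', 'aaab');

// Atomic group: whatever a+ took is locked in, so the engine never
// retries with fewer characters and no match is found.
$atomic = preg_match('~(?>a+)ab~', 'aaab');

// The possessive quantifier a++ is shorthand for the same behaviour.
$possessive = preg_match('~a++ab~', 'aaab');

var_dump($backtracking); // int(1)
var_dump($atomic);       // int(0)
var_dump($possessive);   // int(0)
```

Because an atomic group can turn a match into a failure, it is only a safe replacement where the group never needs to give characters back.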

If-else in recursive regex not working as expected

I am using a regex to parse some BBCode, so the regex has to work recursively to also match tags inside others. Most of the BBCode has an argument, and sometimes it's quoted, though not always.
A simplified equivalent of the regex I'm using (with html style tags to reduce the escaping needed) is this:
'~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
([^<]+ | (?R))* #Match the contents of the tag, including recursively
</a>~x'
However, if I have a test string that looks like this:
<"a">Content<a>Also Content</a></a>
it only matches the <a>Also Content</a> because when it tries to match from the first tag, the first matching group, \1, is set to ", and this is not overwritten when the regex is run recursively to match the inner tag, which means that because it isn't quoted, it doesn't match and that regex fails.
If instead I consistently either use or don't use quotes, it works fine, but I can't be sure that that will be the case with the content that I have to parse. Is there any way to work around this?
The full regex that I'm using, to match [spoiler]content[/spoiler], [spoiler=option]content[/spoiler] and [spoiler="option"]content[/spoiler], is
"~\[spoiler\s*+ #Match the opening tag
(?:=\s*+(\"|\')?((?(1)(?!\\1).|[^\]]){0,100})(?(1)\\1))?+\s*\] #If an option exists, match that
(?:\ *(?:\n|<br />))?+ #Get rid of an extra new line before the start of the content if necessary
((?:[^\[\n]++ #Capture all characters until the closing tag
|\n(?!\[spoiler]) #Capture new line separately so backtracking doesn't run away due to above
|\[(?!/?spoiler(?:\s*=[^\]*])?) #Also match all tags that aren't spoilers
|(?R))*+) #Allow the pattern to recurse - we also want to match spoilers inside spoilers,
# without messing up nesting
\n? #Get rid of an extra new line before the closing tag if necessary
\[/spoiler] #match the closing tag
~xi"
There are a couple of other bugs with it as well though.

The simplest solution is to use alternatives instead:
<(?:a|"a")>
([^<]++ | (?R))*
</a>
But if you really don't want to repeat that a part, you can do the following:
<("?)a\1>
([^<]++ | (?R))*
</a>
I've just moved the optional quantifier ? inside the group. This time, the capturing group always matches, but the match can be empty, and the conditional isn't necessary anymore.
Side note: I've applied a possessive quantifier to [^<] to avoid catastrophic backtracking.
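A short sketch of the second pattern applied to the test string from the question (the ~ delimiter, the x modifier and the variable names are choices made for this illustration):

```php
<?php
// Group 1 captures an optional quote (possibly empty) and \1 requires the same
// text again, so the quoted and unquoted tags can be mixed freely.
$pattern = '~<("?)a\1> ([^<]++ | (?R))* </a>~x';
$subject = '<"a">Content<a>Also Content</a></a>';

preg_match($pattern, $subject, $m);

// The whole string is matched, including the nested unquoted tag.
echo $m[0], "\n";
```

Inside the recursion, group 1 is set again (to an empty string for the inner tag), so the backreference there does not demand the outer quote.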
In your case I believe it's better to match a generic tag than a specific one. Match all tags, and then decide in your code what to do with the match.
Here's a full regex:
\[
(?<tag>\w+) \s*
(?:=\s*
(?:
(?<quote>["']) (?<arg>.{0,100}?) \k<quote>
| (?<arg>[^\]]+)
)
)?
\]
(?<content>
(?:[^[]++ | (?R) )*+
)
\[/\k<tag>\]
Note that I added the J option (PCRE_DUPNAMES) to be able to use (?<arg>...) twice.

(?(1)...) only checks whether group 1 has been defined, so the condition is true once the group has been defined the first time. That is why you obtain this result (it is not related to the recursion level).
So when <a> is reached in the recursion, the regex engine tries to match <a"> and fails.
If you want to use a conditional statement, you can write <("?)a(?(1)\1)> instead. This way group 1 is redefined each time.
Obviously you can write your pattern in a more efficient way like this:
~<(?:a|"a")>[^<]*+(?:(?R)[^<]*)*+</a>~
For your particular problem, I would use this kind of pattern to match any tag:
$pattern = <<<'EOD'
~
\[ (?<tag>\w+) \s*
(?:
= \s*
(?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) ) # branch reset feature
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;
If you want to require a specific tag at the top level, you can add (?(R)|(?=spoiler\b)) before the tag name.

Trouble with regular expression matching in lexer

I am in the process of making a templating engine that is quite complex as it will feature typical constructs in programming languages such as if statements and loops.
Currently, I am working on the lexer, which I believe, deals with the job of converting a stream of characters into tokens. What I want to do is capture certain structures within the HTML document, which later can be worked on by the parser.
This is an example of the syntax:
<head>
<title>Template</title>
<meta charset="utf-8">
</head>
<body>
<h1>{{title}}</h1>
<p>This is also being matched.</p>
{{#myName}}
<p>My name is {{myName}}</p>
{{/}}
<p>This content too.</p>
{{^myName}}
<p>I have on name.</p>
{{/}}
<p>No matching here...</p>
</body>
I am trying to scan only for everything between the starting '{{' characters and ending '}}' characters. So, {{title}} would be one match, along with {{#myName}}, the text and content leading up to {{/}}, this should then be the second match.
I am not particularly the best at regular expressions, and I am pretty sure it is an issue with the pattern I have devised, which is this:
({{([#\^]?)([a-zA-Z0-9\._]+)}}([\w\W]+){{\/?}})
I read this as match two { characters, then either # or ^ any words containing uppercase or lowercase letters, along with any digits, dots, or underscores. Match anything that comes after the closing }} characters, until either the {{/}} characters are met, but the /}} part is optional.
The problem is visible in the link below. It is matching text that is not within the {{ and }} blocks. I am wondering whether it is linked to the use of \w and \W, because if I specify exactly which characters I want to match in the set, it then seems to work.
The regular expression test is here. I did look at the regular expression in the shared list for capturing all text that isn't HTML, and I noticed it uses lookaheads, which I just cannot grasp, nor understand why they would help me.
Can someone help me by pointing out the problem with the regular expression, or whether or not I am going the wrong way about it in terms of creating the lexer?
I hope I've provided enough information, and thank you for any help!

Your pattern doesn't work because [\w\W]+ takes all possible characters up to the last {{/}} of your string. Quantifiers (i.e. +, *, {1,3}, ?) are greedy by default. To obtain a lazy quantifier you must add a ? after it: [\w\W]+?
A pattern to deal with nested structures:
$pattern = <<<'LOD'
~
{{
(?| # branch reset group: the interest of this feature is that
# capturing group numbers are the same in all alternatives
([\w.]++)}} # self-closing tag: capturing group 1: tag name
| # OR
([#^][\w.]++)}} # opening tag: capturing group 1: tag name
( # capturing group 2: content
(?> # atomic group: three possible content type
[^{]++ # all characters except {
| # OR
{(?!{) # { not followed by another {
| # OR
(?R) # another tag is met, attempt the whole pattern again
)* # repeat the atomic group 0 or more times
) # close the second capturing group
{{/}} # closing tag
) # close the branch reset group
~x
LOD;
preg_match_all($pattern, $html, $matches);
var_dump($matches);
To obtain all nested levels you can use this pattern:
$pattern = <<<'LOD'
~
(?=( # open a lookahead and the 1st capturing group
{{
(?|
([\w.]++)}}
|
([#^][\w.]++)}}
( # ?R was changed to ?1 because I don't want to
(?>[^{]++|{(?!{)|(?1))* # repeat the whole pattern but only the
) # subpattern in the first capturing group
{{/}}
)
) # close the 1st capturing group
) # and the lookahead
~x
LOD;
preg_match_all($pattern, $html, $matches);
var_dump($matches);
This pattern is just the first pattern enclosed in a lookahead and a capturing group. This construct makes it possible to capture overlapping substrings.
More information about the regex features used in these two patterns:
possessive quantifiers ++
atomic groups (?>..)
lookahead (?=..), (?!..)
branch reset group (?|..|..)
recursion (?R), (?1)
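The lookahead-plus-capture construct used in the second pattern can be seen in isolation with a much smaller sketch (digits instead of tags, illustrative variable names):

```php
<?php
// A plain pattern consumes what it matches, so results cannot overlap.
preg_match_all('~\d\d~', '1234', $plain);
print_r($plain[0]);   // 12, 34

// Inside a lookahead nothing is consumed: the overall match is empty and
// the engine retries at the next character, while group 1 keeps the text.
preg_match_all('~(?=(\d\d))~', '1234', $overlap);
print_r($overlap[1]); // 12, 23, 34
```

In the template pattern the same thing happens with nested tags: the outer tag and each inner tag are captured separately, because the search resumes just after the position where the previous lookahead succeeded.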

Remove comments from JSON data

I need to remove all /*...*/ style comments from JSON data. How do I do it with regular expressions so that string values like this
{
"propName": "Hello \" /* hi */ there."
}
remain unchanged?

You must first avoid all the content that is inside double quotes, using the backtracking control verbs SKIP and FAIL (or a capture):
$string = <<<'LOD'
{
"propName": "Hello \" /* don't remove **/ there." /*this must be removed*/
}
LOD;
$result = preg_replace('~"(?:[^\\\"]+|\\\.)*+"(*SKIP)(*FAIL)|/\*(?:[^*]+|\*+(?!/))*+\*/~s', '',$string);
// The same with a capture:
$result = preg_replace('~("(?:[^\\\"]+|\\\.)*+")|/\*(?:[^*]+|\*+(?!/))*+\*/~s', '$1',$string);
Pattern details:
"(?:[^\\\"]+|\\\.)*+"
This part describes the possible content inside quotes:
" # literal quote
(?: # open a non-capturing group
[^\\\"]+ # all characters that are not \ or "
| # OR
\\\.)*+ # escaped char (that can be a quote)
"
Then you can make this subpattern fail with (*SKIP)(*FAIL) or (*SKIP)(?!). SKIP forbids backtracking past this point if the pattern fails later. FAIL forces the pattern to fail. Thus, quoted parts are skipped (and can't be in the result, since you make the subpattern fail afterwards).
Alternatively, use a capturing group and put its reference in the replacement pattern.
/\*(?:[^*]+|\*+(?!/))*+\*/
This part describes the content inside comments.
/\* # open the comment
(?:
[^*]+ # all characters except *
| # OR
\*+(?!/) # * not followed by / (note that you can't use
# a possessive quantifier here)
)*+ # repeat the group zero or more times
\*/ # close the comment
The s modifier matters here only when a backslash precedes a newline inside quotes.
