Match the body of a function using Regex - php

Given a dummy function as such:
public function handle()
{
    if (isset($input['data'])) {
        switch($data) {
            ...
        }
    } else {
        switch($data) {
            ...
        }
    }
}
My intention is to get the contents of that function, the problem is matching nested patterns of curly braces {...}.
I've come across recursive patterns but couldn't get my head around a regex that would match the function's body.
I've tried the following (no recursion):
$pattern = "/function\shandle\([a-zA-Z0-9_\$\s,]+\)?". // match "function handle(...)"
'[\n\s]?[\t\s]*'. // regardless of the indentation preceding the {
'{([^{}]*)}/'; // find everything within braces.
preg_match($pattern, $contents, $match);
That pattern doesn't match at all. I am sure it is the last bit that is wrong '{([^{}]*)}/' since that pattern works when there are no other braces within the body.
By replacing it with:
'{([^}]*)}/';
It matched till the closing } of the switch inside the if statement and stopped there (including } of the switch but excluding that of the if).
As well as this pattern, same result:
'{(\K[^}]*(?=)})/m';

Update #2
According to others' comments:
^\s*[\w\s]+\(.*\)\s*\K({((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/|#.*$|<<<\s*["']?(\w+)["']?[^;]+\3;$|[^{}<'"/#]++|[^{}]++|(?1))*)})
Note: A short regex, i.e. {((?>[^{}]++|(?R))*)}, is enough if you know your input does not contain { or } outside of PHP syntax.
So in which evil cases does the long regex still work?
You have [{}] in a string between quotation marks ["']
You have those quotation marks escaped inside one another
You have [{}] in a comment: //... or /*...*/ or #...
You have [{}] in a heredoc or nowdoc: <<<STR or <<<['"]STR['"]
Otherwise it only needs the opening/closing braces to be paired; the depth of nesting does not matter.
Is there a case where it fails?
Not unless you have a Martian living inside your code.
^ \s* [\w\s]+ \( .* \) \s* \K # how it matches a function definition
( # (1 start)
{ # opening brace
( # (2 start)
(?> # atomic grouping (for its non-capturing purpose only)
"(?: [^"\\]*+ | \\ . )*" # double quoted strings
| '(?: [^'\\]*+ | \\ . )*' # single quoted strings
| // .* $ # a comment block starting with //
| /\* [\s\S]*? \*/ # a multi line comment block /*...*/
| \# .* $ # a single line comment block starting with #...
| <<< \s* ["']? # heredocs and nowdocs
( \w+ ) # (3) ^
["']? [^;]+ \3 ; $ # ^
| [^{}<'"/#]++ # force engine to backtrack if it encounters special characters [<'"/#] (possessive)
| [^{}]++ # default matching behaviour (possessive)
| (?1) # recurse 1st capturing group
)* # zero to many times of atomic group
) # (2 end)
} # closing brace
) # (1 end)
Formatting is done by @sln's RegexFormatter software.
What does the live demo use as input?
Laravel's Eloquent Model.php file (~3500 lines), picked at random. Check it out:
Live demo
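For a quick feel of the recursive idea, here is a minimal PHP sketch built on the short pattern from the note above. The class Job, its body, and the variable names are assumptions made up for illustration, and the sketch only holds when braces never appear inside strings or comments.

```php
<?php
// Hypothetical input: a method whose body contains nested braces.
$contents = <<<'CODE'
class Job
{
    public function handle()
    {
        if (isset($input['data'])) {
            switch ($data) {
                case 1: break;
            }
        } else {
            return null;
        }
    }
}
CODE;

// Find "function handle(...)", then match one balanced {...} block.
// (?1) re-runs group 1, so the nesting depth does not matter.
$pattern = '~function\s+handle\s*\([^)]*\)\s*(\{((?>[^{}]++|(?1))*)\})~';

if (preg_match($pattern, $contents, $match)) {
    $body = $match[2]; // text between the outermost braces of handle()
    echo trim($body), "\n";
}
```

Group 1 is the whole braced block and group 2 is its content, mirroring the numbering in the long pattern.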

This works for generating a header file (.h) from inline function blocks (.c).
Find Regular expression:
(void\s[^{};]*)\n^\{($[^}$]*)\}$
Replace with:
$1;
For input:
void bar(int var)
{
foo(var);
foo2();
}
will output:
void bar(int var);
To get the body of the function block, use the second capture group:
$2
will output:
foo(var);
foo2();
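As a rough illustration, the same find/replace can be run from PHP instead of an editor. This is only a sketch: the m modifier and the sample source string are assumptions, since the answer does not say which tool or flags were used.

```php
<?php
// Sample input: the function from above, with hypothetical indentation.
$source = "void bar(int var)\n{\n    foo(var);\n    foo2();\n}\n";

// m lets ^ and $ match at line boundaries, which the pattern relies on.
$pattern = '/(void\s[^{};]*)\n^\{($[^}$]*)\}$/m';

// Replace the whole definition with a prototype.
$header = preg_replace($pattern, '$1;', $source);

// Capture group 2 holds everything between the braces.
preg_match($pattern, $source, $match);
$body = trim($match[2]);

echo $header;     // the prototype line
echo $body, "\n"; // the statements of the body
```

Note that [^}$]* stops at the first }, so this only works for bodies without nested braces.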

Related

Extracting urls from @font-face by searching within @font-face for replacement

I have a web service that rewrites urls in css files so that they can be served via a CDN.
The css files can contain urls to images or fonts.
I currently have the following regex to match ALL urls within the css file:
(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))
However, I now want to introduce support for custom fonts and need to target the urls within @font-face:
@font-face {
font-family: 'FontAwesome';
src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
src: url("fonts/fontawesome-webfont.eot?#iefix&v=4.0.3") format("embedded-opentype"), url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"), url("fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"), url("fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular") format("svg");
font-weight: normal;
font-style: normal;
}
I then came up with the following:
@font-face\s*\{.*(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))\s*\}
The problem is that this matches everything and not just the urls inside. I thought I could use a lookbehind like so:
(?<=@font-face\s*\{.*)(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))(?<=-\s*\})
Unfortunately, PCRE (which PHP uses) does not support variable repetition within a lookbehind, so I am stuck.
I do not wish to check for fonts by their extension as some fonts have the .svg extension which can conflict with images with the .svg extension.
In addition, I would also like to modify my original regex to match all other urls that are NOT within a @font-face:
.someclass {
background: url('images/someimage.png') no-repeat;
}
Since I am unable to use lookbehinds, how can I extract the urls from those within a @font-face and those that are not within a @font-face?
Disclaimer: You may be better off using a library, because it's tougher than you think. I also want to start this answer with how to match URLs that are not within @font-face {}. I also assume that the braces {} are balanced within @font-face {}.
Note: I'm going to use "~" as delimiters instead of "/", which will relieve me from escaping later on in my expressions. Also note that I will be posting online demos from regex101.com; on that site I'll be using the g modifier. You should remove the g modifier and just use preg_match_all().
Let's use some regex Fu !!!
Part 1 : matching URLs that are not within @font-face {}
1.1 Matching @font-face {}
Oh yes, this might sound "weird" but you will notice later on why :)
We'll need some recursive regex here:
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
demo
1.2 Escaping @font-face {}
We'll use (*SKIP)(*FAIL) just after the previous regex, it will skip it. See this answer to get an idea how it works.
demo
1.3 Matching url()
We'll use something like this:
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
(?!["']?(?:https?://|ftp://)) # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
\2 # Match what was matched in group 2
\s* # Match optionally some whitespaces
\) # Match )
Note that I'm using \2 because I've appended this to the previous regex which has group 1.
Here's another use of ("|')(?:[^\\]|\\.)*?\1.
demo
1.4 Matching the value inside url()
You might have guessed we need to use some lookaround-fu, the problem is with a lookbehind since it needs to be fixed length. I've got a workaround for that, I'll introduce you to the \K escape sequence. It will reset the beginning of the match to the current position in the token list. more-info
Well let's drop \K somewhere in our expression and use a lookahead, our final regex will be :
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
(*SKIP)(*FAIL) # Skip it
| # Or
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K # Reset the match
(?!["']?(?:https?://|ftp://)) # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?= # Lookahead
\2 # Match what was matched in group 2
\s* # Match optionally some whitespaces
\) # Match )
)
demo
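To see \K on its own before reading the full PHP version, here is a small sketch. The CSS string is a made-up example and the pattern is deliberately simpler than the one above: everything before \K has to match, but it is dropped from the reported match.

```php
<?php
// Hypothetical CSS snippet, used only for illustration.
$css = "a { background: url('images/a.png'); } b { background: url(images/b.png); }";

// "url(" and the optional quote must be present, but \K discards them,
// so only the value inside url() is reported.
preg_match_all('~url\(\s*["\']?\K[^"\')\s]+~', $css, $m);

print_r($m[0]);
```

This gives the effect of a variable-length lookbehind, which PCRE does not allow directly.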
1.5 Using the pattern in PHP
We'll need to escape some things like quotes, backslashes \\\\ = \, use the right function and the right modifiers:
$regex = '~
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
(*SKIP)(*FAIL) # Skip it
| # Or
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|\'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K # Reset the match
(?!["\']?(?:https?://|ftp://)) # Put your negative-rules here (do not match URLs with http, https or ftp)
(?:[^\\\\]|\\\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?= # Lookahead
\2 # Match what was matched in group 2
\s* # Match optionally some whitespaces
\) # Match )
)
~xs';
$input = file_get_contents($css_file);
preg_match_all($regex, $input, $m);
echo '<pre>'. print_r($m[0], true) . '</pre>';
demo
Part 2 : matching URLs that are within @font-face {}
2.1 Different approach
I want to do this part in 2 regexes because it will be a pain to match URLs that are within @font-face {} while taking care of the state of braces {} in a recursive regex.
And since we already have the pieces we need, we'll only need to apply them in some code:
Match all @font-face {} instances
Loop through these and match all url()'s
2.2 Putting it into code
$results = array(); // Just an empty array;
$fontface_regex = '~
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
~xs';
$url_regex = '~
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|\'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K # Reset the match
(?!["\']?(?:https?://|ftp://)) # Put your negative-rules here (do not match url\'s with http, https or ftp)
(?:[^\\\\]|\\\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?= # Lookahead
\1 # Match what was matched in group 1
\s* # Match optionally some whitespaces
\) # Match )
)
~xs';
$input = file_get_contents($css_file);
preg_match_all($fontface_regex, $input, $fontfaces); // Get all font-face instances
if(isset($fontfaces[0])){ // If there is a match then
foreach($fontfaces[0] as $fontface){ // Foreach instance
preg_match_all($url_regex, $fontface, $r); // Let's match the url's
if(isset($r[0])){ // If there is a hit
$results[] = $r[0]; // Then add it to the results array
}
}
}
echo '<pre>'. print_r($results, true) . '</pre>'; // Show the results
demo
You can use this:
$pattern = <<<'LOD'
~
(?(DEFINE)
(?<quoted_content>
(["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
)
(?<comment> /\* .*? \*/ )
(?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
(?<other_content>
(?> [^u}/"']++ | \g<quoted_content> | \g<comment>
| \Bu | u(?!rl\s*+\() | /(?!\*)
| \g<url_start> \g<url_skip> ["']?+
)++
)
(?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
(?<url_start> url\( \s*+ ["']?+ )
)
\g<comment> (*SKIP)(*FAIL) |
\g<anchor> \g<other_content>?+ \g<url_start> \K [./]*+
( [^"'\s)}]*+ ) # url
~xs
LOD;
$result = preg_replace($pattern, 'http://cdn.test.com/fonts/$8', $data);
print_r($result);
Test string:
$data = <<<'LOD'
@font-face {
font-family: 'FontAwesome';
src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
src: url(fonts/fontawesome-webfont.eot?#iefix&v=4.0.3) format("embedded-opentype"),
/*url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"),*/
url("http://domain.com/fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"),
url('fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular') format("svg");
font-weight: normal;
font-style: normal;
}
/*
@font-face {
font-family: 'Font1';
src: url("fonts/font1.eot");
} */
@font-face {
font-family: 'Fon\'t2';
src: url("fonts/font2.eot");
}
@font-face {
font-family: 'Font3';
src: url("../fonts/font3.eot");
}
LOD;
Main idea:
For readability, the pattern is divided into named subpatterns. The (?(DEFINE)...) section doesn't match anything; it only holds definitions.
The main trick of this pattern is the \G anchor, which means: start of the string or contiguous to the previous match. I added a negative lookbehind (?<!^) to exclude the first of these two cases.
The <anchor> named subpattern is the most important because it allows a match only if @font-face { is found or immediately after the end of a url (this is why you can see a ["']?+).
<other_content> represents everything that is not a url section, but it also matches url sections that must be skipped (urls that begin with "http:" or "data:"). The important detail of this subpattern is that it can't match the closing curly bracket of @font-face.
The job of <url_start> is only to match url(".
\K removes everything matched so far from the match result.
([^"'\s)}]*+) matches the url (the only thing that stays in the match result, along with the leading ./ or ../).
Since <other_content> and the url subpattern can't match a } (other than inside quoted or comment parts), you are sure never to match anything outside of the @font-face definition. A second consequence is that the pattern always fails after the last url. Thus, at the next attempt the "contiguous branch" will fail until the next @font-face.
Another trick:
The main pattern begins with \g<comment> (*SKIP)(*FAIL) | to skip all content inside comments /*....*/. \g<comment> refers to the basic subpattern that describes what a comment looks like. (*SKIP) forbids retrying the substring that has been matched before it (on its left, by \g<comment>) if the pattern fails on its right. (*FAIL) forces the pattern to fail.
With this trick, comments are skipped and are not part of the match result (since the pattern fails).
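The \G chaining can be seen in isolation with a much smaller sketch. The input string and the [...] syntax are assumptions invented for this example, and (?!\A) is used here in the role that (?<!^) plays in the pattern above.

```php
<?php
// Hypothetical input: only the numbers inside [...] are wanted.
$text = '1, 2 [3, 4, 5] 6, 7';

// A match may start either right after "[" or, through \G, right after
// the previous match followed by a comma. (?!\A) stops \G from matching
// at the very start of the string.
$pattern = '~(?:\G(?!\A),\s*|\[)\K\d+~';

preg_match_all($pattern, $text, $m);
print_r($m[0]);
```

Once the chain is broken by the ], \G no longer applies, so nothing is matched until the next [.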
subpatterns details:
quoted_content:
It's used in <other_content> to avoid to match url( or /* that are inside quotes.
(["']) # capture group: the opening quote
(?> # atomic group: all possible content between quotes
[^"'\\]++ # all that is not a quote or a backslash
| # OR
\\{2} # two backslashes: (two \ doesn't escape anything)
| # OR
\\. # any escaped character
| # OR
(?!\g{-1})["'] # the other quote (this one that is not in the capture group)
)*+ # repeat zero or more time the atomic group
\g{-1} # backreference to the last capturing group
other_content: all that is not the closing curly bracket, or an url without http: or data:
(?> # open an atomic group
[^u}/"']++ # all character that are not problematic!
|
\g<quoted_content> # string inside quotes
|
\g<comment> # string inside comments
|
\Bu # "u" not preceded by a word boundary
|
u(?!rl\s*+\() # "u" not followed by "rl(" (not the start of an url definition)
|
/(?!\*) # "/" not followed by "*" (not the start of a comment)
|
\g<url_start> # match the url that begins with "http:"
\g<url_skip> ["']?+ # until the possible quote
)++ # repeat the atomic group one or more times
anchor
\G(?<!^) ["']?+ # contiguous to a precedent match with a possible closing quote
| # OR
@font-face \s*+ { # start of the @font-face definition
Notice:
You can improve the main pattern:
After the last url of @font-face, the regex engine attempts to match with the "contiguous branch" of <anchor> and matches all characters until the }, which makes the pattern fail. Then, at each of these same characters, the regex engine must try the two branches of <anchor> (which will always fail until the }).
To avoid these useless tries, you can change the main pattern to:
\g<comment> (*SKIP)(*FAIL) |
\g<anchor> \g<other_content>?+
(?>
\g<url_start> \K [./]*+ ([^"'\s)}]*+)
|
} (*SKIP)(*FAIL)
)
With this new scenario, the first character after the last url is matched by the "contiguous branch", \g<other_content> matches everything until the }, \g<url_start> fails immediately, the } is matched, and (*SKIP)(*FAIL) makes the pattern fail and forbids retrying these characters.

Remove comments from JSON data

I need to remove all /*...*/ style comments from JSON data. How do I do it with regular expressions so that string values like this
{
"propName": "Hello \" /* hi */ there."
}
remain unchanged?
You must first skip all content that is inside double quotes, using the backtracking control verbs SKIP and FAIL (or a capture):
$string = <<<'LOD'
{
"propName": "Hello \" /* don't remove **/ there." /*this must be removed*/
}
LOD;
$result = preg_replace('~"(?:[^\\\"]+|\\\.)*+"(*SKIP)(*FAIL)|/\*(?:[^*]+|\*+(?!/))*+\*/~s', '',$string);
// The same with a capture:
$result = preg_replace('~("(?:[^\\\"]+|\\\.)*+")|/\*(?:[^*]+|\*+(?!/))*+\*/~s', '$1',$string);
Pattern details:
"(?:[^\\\"]+|\\\.)*+"
This part describes the possible content inside quotes:
" # literal quote
(?: # open a non-capturing group
[^\\\"]+ # all characters that are not \ or "
| # OR
\\\.)*+ # escaped char (that can be a quote)
"
Then you can make this subpattern fail with (*SKIP)(*FAIL) or (*SKIP)(?!). SKIP forbids backtracking before this point if the pattern fails afterwards. FAIL forces the pattern to fail. Thus, quoted parts are skipped (and can't be in the result since you make the subpattern fail afterwards).
Or you can use a capturing group and add the reference in the replacement pattern.
/\*(?:[^*]+|\*+(?!/))*+\*/
This part describes the content inside comments.
/\* # open the comment
(?:
[^*]+ # all characters except *
| # OR
\*+(?!/) # * not followed by / (note that you can't use
# a possessive quantifier here)
)*+ # repeat the group zero or more times
\*/ # close the comment
The s modifier is needed here only for the case where a backslash precedes a newline inside quotes.

Regular expression for template engine?

I'm learning about regular expressions and want to write a templating engine in PHP.
Consider the following "template":
<!DOCTYPE html>
<html lang="{{print("{hey}")}}" dir="{{$dir}}">
<head>
<meta charset="{{$charset}}">
</head>
<body>
{{$body}}
{{}}
</body>
</html>
I managed to create a regex that will find anything except for {{}}.
Here's my regex:
{{[^}]+([^{])*}}
There's just one problem. How do I allow the literal { and } to be used within {{}} tags?
It will not find {{print("{hey}")}}.
Thanks in advance.
This is a pattern to match the content inside double curly brackets:
$pattern = <<<'LOD'
~
(?(DEFINE)
(?<quoted>
' (?: [^'\\]+ | (?:\\.)+ )++ ' |
" (?: [^"\\]+ | (?:\\.)+ )++ "
)
(?<nested>
{ (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+ }
)
)
{{
(?<content>
(?:
[^"'{}]+
| \g<quoted>
| \g<nested>
)*+
)
}}
~xs
LOD;
Compact version:
$pattern = '~{{((?>[^"\'{}]+|((["\'])(?:[^"\'\\\]+|(?:\\.)+|(?:(?!\3)["\'])+)++\3)|({(?:[^"\'{}]+|\g<2>|(?4))*+}))*+)}}~s';
The content is in the first capturing group, but you can use the named capture 'content' with the detailed version.
Although this pattern is longer, it allows everything you want inside quoted parts, including escaped quotes, and is faster than a simple lazy quantifier in many cases.
Nested curly brackets are allowed too; you can write {{ doThat(){ doThis(){ }}}} without problems.
The subpattern for quotes can also be written like this, which avoids repeating the same thing for single and double quotes (I use it in the compact version):
(["']) # the quote type is captured (single or double)
(?: # open a group (for the various alternatives)
[^"'\\]+ # all characters that are not a quote or a backslash
| # OR
(?:\\.)+ # escaped characters (with the s modifier)
| #
(?!\g{-1})["'] # a quote that is not the captured quote
)++ # repeat one or more times
\g{-1} # the captured quote (-1 refers to the last capturing group)
Notice: a backslash must be written \\ in nowdoc syntax but \\\ or \\\\ inside single quotes.
Explanations for the detailed pattern:
The pattern is divided in two parts:
the definitions, where I define named subpatterns
the whole pattern itself
The definition section is useful to avoid repeating the same subpattern several times in the main pattern, or to make it clearer. You can define subpatterns that you will use later in this space: (?(DEFINE)....)
This section contains 2 named subpatterns:
quoted : that contains the description of quoted parts
nested : that describes nested curly brackets parts
detail of nested
(?<nested> # open the named group "nested"
{ # literal {
## what can contain curly brackets? ##
(?> # open an atomic* group
[^"'{}]+ # all characters one or more times, except "'{}
| # OR
\g<quoted> # quoted content, to avoid curly brackets inside quoted parts
# (I call the subpattern I have defined before, instead of rewrite all)
| \g<nested> # OR curly parts. This is a recursion
)*+ # repeat the atomic group zero or more times (possessive *)
} # literal }
) # close the named group
(* more informations about atomic groups and possessive quantifiers)
But all of these are only definitions; the pattern really begins with: {{
Then I open a named capture group (content) and describe what can be found inside (nothing new here).
I use two modifiers, x and s. x activates the verbose mode, which allows spaces to be placed freely in the pattern (useful for indenting). s is the singleline mode. In this mode, the dot can match newlines (it can't by default). I use this mode because there is a dot in the subpattern quoted.
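A short usage sketch may help here. It assumes a one-line template based on the question's example and uses a simplified variant of the detailed pattern (the quoted subpattern is written slightly differently from the answer's and also accepts empty strings), so treat it as an illustration, not as the exact pattern above.

```php
<?php
// Hypothetical one-line template based on the question's example.
$template = <<<'TPL'
<html lang="{{print("{hey}")}}" dir="{{$dir}}">{{}}</html>
TPL;

// Simplified variant of the detailed pattern: definitions first,
// then the actual {{ ... }} match with a named capture.
$pattern = <<<'RE'
~
(?(DEFINE)
  (?<quoted> " (?: [^"\\]++ | \\. )*+ " | ' (?: [^'\\]++ | \\. )*+ ' )
  (?<nested> \{ (?: [^"'{}]++ | \g<quoted> | \g<nested> )*+ \} )
)
\{\{
(?<content> (?: [^"'{}]++ | \g<quoted> | \g<nested> )*+ )
\}\}
~xs
RE;

preg_match_all($pattern, $template, $m);
print_r($m['content']); // one entry per {{...}} tag, including the empty one
```

With preg_match_all, the named capture is available as $m['content'], one entry per tag.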
You can just use "." instead of the character classes. But you then have to make use of non-greedy quantifiers:
\{\{(.+?)\}\}
The quantifier "+?" means it will consume the least necessary number of characters.
Consider this example:
<table>
<tr>
<td>{{print("{first name}")}}</td><td>{{print("{last name}")}}</td>
</tr>
</table>
With a greedy quantifier (+ or *), you'd only get one result, because it sees the first {{ and then the .+ consumes as many characters as it can as long as the pattern is matched:
{{print("{first name}")}}</td><td>{{print("{last name}")}}
With a non-greedy one (+? or *?) you'll get the two as separate results:
{{print("{first name}")}}
{{print("{last name}")}}
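The difference can be checked with a few lines of PHP. This is a sketch on a one-line version of the table row above; the variable names are made up for the example.

```php
<?php
// The table row from the example above, on a single line.
$html = '<td>{{print("{first name}")}}</td><td>{{print("{last name}")}}</td>';

// Greedy: .+ runs to the end and backtracks to the LAST }}.
$greedyCount = preg_match_all('~\{\{(.+)\}\}~', $html, $greedy);

// Lazy: .+? stops at the FIRST }} it can reach.
$lazyCount = preg_match_all('~\{\{(.+?)\}\}~', $html, $lazy);

echo $greedyCount, "\n"; // number of greedy matches
echo $lazyCount, "\n";   // number of lazy matches
print_r($lazy[1]);
```

Keep in mind that the lazy version would still stop too early on a tag whose content itself contains }}.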
Make your regex less greedy using {{(.*?)}}.
I figured it out. Don't ask me how.
{{[^{}]*("[^"]*"\))?(}})
This will match pretty much anything.. like for example:
{{print("{{}}}{{{}}}}{}}{}{hey}}{}}}{}7")}}

Regex match semicolon but not in comments or quotes

I want to use a regex test to return all the matched semicolons, but only if they're outside of quotes (nested quotes), and not commented code.
testfunc();
testfunc2("test;test");
testfunc3("test';test");
testfunc4('test";test');
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test\"test");
Only the semicolons on the end of each example should be returned by the regex string.
I've been playing around with the below, but it fails on examples testfunc3 and testfunc9. It also doesn't ignore comments...
/;(?=(?:(?:[^"']*+["']){2})*+[^"']*+\z)/g
Any help would be appreciated!
Don't have time to convert this into JS. Here is the regex in a Perl sample, the regex will work with JS though.
C comments, double/single string quotes - taken from "strip C comments" by Jeffrey Friedl and later modified by Fred Curtis, adapted to include C++ comments and the target semi-colon (by me).
Capture group 1 (optional), includes all up to semi-colon, group 2 is semi-colon (but can be anything).
Modifiers are //xsg.
The regex below is used in the substitution operator s/pattern/replace/xsg (ie: replace with $1[$2] ).
I think your post is just to find out if this can be done. I can include a commented regex if you really need it.
$str = <<EOS;
testfunc();
testfunc2("test;test");
testfunc3("test';test");
testfunc4('test";test');
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test\"test");
EOS
$str =~ s{
((?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|.[^/"'\\;]*))*?)(;)
}
{$1\[$2\]}xsg;
print $str;
Output
testfunc()[;]
testfunc2("test;test")[;]
testfunc3("test';test")[;]
testfunc4('test";test')[;]
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test"test")[;]
Expanded with comments
( ## Optional non-greedy, Capture group 1
(?:
## Comments
(?:
/\* ## Start of /* ... */ comment
[^*]*\*+ ## Non-* followed by 1-or-more *'s
(?:
[^/*][^*]*\*+
)* ## 0-or-more things which don't start with /
## but do end with '*'
/ ## End of /* ... */ comment
|
// ## Start of // ... comment
(?:
[^\\] ## Any Non-Continuation character ^\
| ## OR
\\\n? ## Any Continuation character followed by 0-1 newline \n
)*? ## To be done 0-many times, stopping at the first end of comment
\n ## End of // comment
)
| ## OR, various things which aren't comments, group 2:
(?:
" (?: \\. | [^"\\] )* " ## Double quoted text
|
' (?: \\. | [^'\\] )* ' ## Single quoted text
|
. ## Any other char
[^/"'\\;]* ## Chars which don't start a comment, string, escape
) ## or continuation (escape + newline) AND are NOT semi-colon ;
)*?
)
## Capture group 2, the semi-colon
(;)
This would work for all your examples, but it depends how close the code you want to apply it to is to the example:
;(?!\S|(?:[^;]*\*/))
; - match the semicolon
(?! - negative lookahead - ensure that ->
\S - there is no non-whitespace character after the semicolon
|(?:[^;]*\*/)) - and if there is a whitespace char, ensure that up to the next ; there is no */ sign
Let me know if you get any problems with that.
If that's something you want to use once, there is no harm in using regex, but if it's something you might want to reuse later, regex might not prove the most reliable of tools.
EDIT:
fix for No. 5 - now the semicolon will be in the first matching group:
^(?:[^/]*)(;)(?!\S|(?:[^;]*\*/))
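For comparison, here is a PHP sketch of a different technique from the two answers above: PCRE's (*SKIP)(*FAIL) verbs consume strings and comments first, so any semicolon that is still matched must be real code. It assumes C-style comments and backslash escapes inside quotes, and it is run against the sample from the question.

```php
<?php
// The sample from the question, kept verbatim in a nowdoc.
$code = <<<'SRC'
testfunc();
testfunc2("test;test");
testfunc3("test';test");
testfunc4('test";test');
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test\"test");
SRC;

$pattern = <<<'RE'
~
  (?: "(?:\\.|[^"\\])*"    # double-quoted string
    | '(?:\\.|[^'\\])*'    # single-quoted string
    | /\*.*?\*/            # block comment
    | //[^\n]*             # line comment
  )(*SKIP)(*FAIL)          # consume these spans and discard them
  | ;                      # any semicolon left over is real code
~xs
RE;

$count = preg_match_all($pattern, $code, $m, PREG_OFFSET_CAPTURE);
echo $count, "\n"; // number of semicolons outside strings and comments
```

PREG_OFFSET_CAPTURE is used so that $m[0] also carries the position of each semicolon, which is usually what a caller needs next.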
