I want to use a regex test to return all the matched semicolons, but only if they're outside of quotes (nested quotes), and not commented code.
testfunc();
testfunc2("test;test");
testfunc3("test';test");
testfunc4('test";test');
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test\"test");
Only the semicolons on the end of each example should be returned by the regex string.
I've been playing around with the pattern below, but it fails on examples testfunc3 and testfunc9. It also doesn't ignore comments...
/;(?=(?:(?:[^"']*+["']){2})*+[^"']*+\z)/g
Any help would be appreciated!
I don't have time to convert this into JS, so here is the regex in a Perl sample; the regex itself will work with JS, though.
It handles C comments and double/single-quoted strings. It is taken from "strip C comments" by Jeffrey Friedl, later modified by Fred Curtis, and adapted by me to include C++ comments and the target semicolon.
Capture group 1 (optional) includes everything up to the semicolon; group 2 is the semicolon (but can be anything).
Modifiers are //xsg.
The regex below is used in the substitution operator s/pattern/replace/xsg (i.e. replace with $1[$2]).
I think your post is just to find out whether this can be done. I can include a commented regex if you really need it.
$str = <<EOS;
testfunc();
testfunc2("test;test");
testfunc3("test';test");
testfunc4('test";test');
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test\"test");
EOS
$str =~ s{
((?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|.[^/"'\\;]*))*?)(;)
}
{$1\[$2\]}xsg;
print $str;
Output
testfunc()[;]
testfunc2("test;test")[;]
testfunc3("test';test")[;]
testfunc4('test";test')[;]
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test"test")[;]
Expanded with comments
( ## Optional non-greedy, Capture group 1
(?:
## Comments
(?:
/\* ## Start of /* ... */ comment
[^*]*\*+ ## Non-* followed by 1-or-more *'s
(?:
[^/*][^*]*\*+
)* ## 0-or-more things which don't start with /
## but do end with '*'
/ ## End of /* ... */ comment
|
// ## Start of // ... comment
(?:
[^\\] ## Any Non-Continuation character ^\
| ## OR
\\\n? ## Any Continuation character followed by 0-1 newline \n
)*? ## To be done 0-many times, stopping at the first end of comment
\n ## End of // comment
)
| ## OR, various things which aren't comments, group 2:
(?:
" (?: \\. | [^"\\] )* " ## Double quoted text
|
' (?: \\. | [^'\\] )* ' ## Single quoted text
|
. ## Any other char
[^/"'\\;]* ## Chars which don't start a comment, string, escape
) ## or continuation (escape + newline) AND are NOT a semicolon ;
)*?
)
## Capture group 2, the semicolon
(;)
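For readers who want to experiment outside Perl, the same idea (consume comments and strings first, so that only a semicolon in plain code reaches the capture group) can be sketched in Python. This is a simplified sketch, not a port of the regex above: TOKEN, SAMPLE and semicolon_lines are names made up for illustration, and line continuations inside // comments are not handled.

```python
import re

# Alternatives are tried left to right at each position: comments and
# strings are consumed whole, so a ';' inside them never reaches group 1.
TOKEN = re.compile(
    r"""
      /\*.*?\*/              # /* ... */ comment (may span lines)
    | //[^\n]*               # // comment up to the end of the line
    | "(?:\\.|[^"\\])*"      # double-quoted string with escapes
    | '(?:\\.|[^'\\])*'      # single-quoted string with escapes
    | (;)                    # group 1: a semicolon in plain code
    """,
    re.VERBOSE | re.DOTALL,
)

# The sample from the question, kept verbatim in a raw string.
SAMPLE = r'''testfunc();
testfunc2("test;test");
testfunc3("test';test");
testfunc4('test";test');
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test\"test");
'''


def semicolon_lines(source):
    """Return the 0-based line index of each semicolon outside strings and comments."""
    return [
        source.count("\n", 0, m.start(1))
        for m in TOKEN.finditer(source)
        if m.group(1) is not None
    ]


print(semicolon_lines(SAMPLE))
```

Only the lines for testfunc, testfunc2, testfunc3, testfunc4 and testfunc9 are reported; the semicolons inside strings and comments are swallowed by the earlier alternatives.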
This would work for all your examples, but it depends on how close the code you want to apply it to is to the examples:
;(?!\S|(?:[^;]*\*/))
; - match the semicolon
(?! - negative lookahead - ensure that ->
\S - there is no non-whitespace character after the semicolon
|(?:[^;]*\*/)) - and, if whitespace follows, that no */ appears before the next ;
Let me know if you run into any problems with that.
If this is a one-off task, there is no harm in using a regex, but if it's something you might want to reuse later, a regex may not be the most reliable tool.
EDIT:
Fix for No. 5: the semicolon is now in the first capture group:
^(?:[^/]*)(;)(?!\S|(?:[^;]*\*/))
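One way to see why the EDIT was needed is to run the first pattern (the one without the capture group) over the sample from the question. The sketch below uses Python's re module, which is enough here because that pattern has no PCRE-only features; SIMPLE, SAMPLE and matched_lines are illustrative names.

```python
import re

# The first pattern from this answer, unchanged.
SIMPLE = re.compile(r";(?!\S|(?:[^;]*\*/))")

SAMPLE = r'''testfunc();
testfunc2("test;test");
testfunc3("test';test");
testfunc4('test";test');
//testfunc5();
/* testfunc6(); */
/*
testfunc7();
*/
/*
//testfunc8();
*/
testfunc9("test\"test");
'''


def matched_lines(source):
    """Return the text of each line on which the pattern finds a semicolon."""
    lines = source.split("\n")
    return [lines[source.count("\n", 0, m.start())] for m in SIMPLE.finditer(source)]


for line in matched_lines(SAMPLE):
    print(line)
```

On this sample the //testfunc5(); line is reported alongside the five wanted lines, because nothing in the lookahead knows about // comments.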
Given a dummy function as such:
public function handle()
{
if (isset($input['data'])) {
switch($data) {
...
}
} else {
switch($data) {
...
}
}
}
My intention is to get the contents of that function, the problem is matching nested patterns of curly braces {...}.
I've come across recursive patterns but couldn't get my head around a regex that would match the function's body.
I've tried the following (no recursion):
$pattern = "/function\shandle\([a-zA-Z0-9_\$\s,]+\)?". // match "function handle(...)"
'[\n\s]?[\t\s]*'. // regardless of the indentation preceding the {
'{([^{}]*)}/'; // find everything within braces.
preg_match($pattern, $contents, $match);
That pattern doesn't match at all. I am sure it is the last bit that is wrong '{([^{}]*)}/' since that pattern works when there are no other braces within the body.
By replacing it with:
'{([^}]*)}/';
It matched till the closing } of the switch inside the if statement and stopped there (including } of the switch but excluding that of the if).
As well as this pattern, same result:
'{(\K[^}]*(?=)})/m';
Update #2
Following others' comments:
^\s*[\w\s]+\(.*\)\s*\K({((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/|#.*$|<<<\s*["']?(\w+)["']?[^;]+\3;$|[^{}<'"/#]++|[^{}]++|(?1))*)})
Note: a short regex, i.e. {((?>[^{}]++|(?R))*)}, is enough if you know your input does not contain { or } outside of PHP syntax.
So in which evil cases does the long regex still work?
You have [{}] in a string between quotation marks ["']
You have those quotation marks escaped inside one another
You have [{}] in a comment block. //... or /*...*/ or #...
You have [{}] in a heredoc or nowdoc <<<STR or <<<['"]STR['"]
Otherwise it only expects a balanced pair of opening/closing braces; the depth of the nested braces does not matter.
Is there a case where it fails?
No, unless you have a Martian living inside your code.
^ \s* [\w\s]+ \( .* \) \s* \K # how it matches a function definition
( # (1 start)
{ # opening brace
( # (2 start)
(?> # atomic grouping (for its non-capturing purpose only)
"(?: [^"\\]*+ | \\ . )*" # double quoted strings
| '(?: [^'\\]*+ | \\ . )*' # single quoted strings
| // .* $ # a comment block starting with //
| /\* [\s\S]*? \*/ # a multi line comment block /*...*/
| \# .* $ # a single line comment block starting with #...
| <<< \s* ["']? # heredocs and nowdocs
( \w+ ) # (3) ^
["']? [^;]+ \3 ; $ # ^
| [^{}<'"/#]++ # force engine to backtrack if it encounters special characters [<'"/#] (possessive)
| [^{}]++ # default matching behaviour (possessive)
| (?1) # recurse 1st capturing group
)* # zero to many times of atomic group
) # (2 end)
} # closing brace
) # (1 end)
Formatting is done by #sln's RegexFormatter software.
What did I provide in the live demo?
Laravel's Eloquent Model.php file (~3500 lines), picked at random, is given as input. Check it out:
Live demo
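Python's re module has no recursion ((?1), (?R)), so the pattern above cannot be run there as written. As a rough illustration of what it does, the sketch below swaps in a different technique: a regex that skips strings and comments, plus an explicit depth counter for the braces. Heredocs and nowdocs are not handled, and SKIP_OR_BRACE, function_body and SOURCE are names invented here.

```python
import re

# Strings and comments are consumed whole so that braces inside them
# never reach group 1 and never change the nesting depth.
SKIP_OR_BRACE = re.compile(
    r"""
      "(?:\\.|[^"\\])*"      # double-quoted string
    | '(?:\\.|[^'\\])*'      # single-quoted string
    | //[^\n]*               # // comment
    | \#[^\n]*               # # comment
    | /\*.*?\*/              # /* ... */ comment
    | ([{}])                 # group 1: a brace in plain code
    """,
    re.VERBOSE | re.DOTALL,
)


def function_body(source, start):
    """Return the text between the first '{' at or after start and its matching '}'."""
    depth = 0
    body_start = None
    for m in SKIP_OR_BRACE.finditer(source, start):
        brace = m.group(1)
        if brace == "{":
            if depth == 0:
                body_start = m.end()
            depth += 1
        elif brace == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                return source[body_start:m.start()]
    return None  # no balanced block found


SOURCE = '''function handle()
{
    if ($ok) {
        $s = "}";  // a } inside a comment
    }
}
'''

print(function_body(SOURCE, SOURCE.index("function handle")))
```

The counter plays the role of the (?1) recursion: each real { goes one level deeper, and the body ends when the depth returns to zero.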
This works for generating a header file (.h) from function definitions in a .c file.
Find Regular expression:
(void\s[^{};]*)\n^\{($[^}$]*)\}$
Replace with:
$1;
For input:
void bar(int var)
{
foo(var);
foo2();
}
will output:
void bar(int var);
To get the body of the function block, use the second capture group:
$2
will output:
foo(var);
foo2();
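The find/replace pair above can be tried out with a short Python sketch. It assumes the editor's ^ and $ behave like Python's re.MULTILINE anchors and writes the replacement as \1 instead of $1; DEFINITION and the indented sample input are illustrative.

```python
import re

# Same pattern as above; MULTILINE makes ^ and $ match at line boundaries.
DEFINITION = re.compile(r"(void\s[^{};]*)\n^\{($[^}$]*)\}$", re.MULTILINE)

source = "void bar(int var)\n{\n    foo(var);\n    foo2();\n}"

# Replace the whole definition with its prototype.
header = DEFINITION.sub(r"\1;", source)
# Group 2 holds the body between the braces.
body = DEFINITION.search(source).group(2)

print(header)
print(body)
```

Because [^}$]* stops at the first }, this only works for bodies without nested braces.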
I'm not really good with regex (I've been on this one for hours) and I'm struggling to remove all empty lines between two identifiers ("{|" and "|}").
My regex looks like this (sorry for your eyes): (\{\|)((?:(?!\|\}).)+)(?:\n\n)((?:(?!\|\}).)+)(\|\})
(\{\|) : the characters "{|"
((?:(?!\|\}).)+) : everything up to, but not including, "|}" (negative lookahead)
(?:\n\n) : the empty line I want to delete
((?:(?!\|\}).)+) : everything up to, but not including, "|}" (negative lookahead)
(\|\}) : the characters "|}"
Demo
It works, but it deletes only the last empty line. Can you help me make it work for all the empty lines?
I tried adding a negative lookahead on \n\n with a repeating group around everything, but it did not work.
Several ways:
The \G based pattern: (only one pattern is needed)
$txt = preg_replace('~ (?: \G (?!\A) | \Q{|\E ) [^|\n]*+ (?s: (?! \Q|}\E | \n\n) . [^|\n]*)*+ \n \K \n+ ~x', '', $txt);
The \G anchor matches at the start of the string or at the position in the string after the last successful match. This ensures that successive matches are contiguous.
What I call a \G-based pattern can be sketched like this:
(?: \G position after a successful match | first match beginning ) reach the target \K target
The "reach the target" part is designed to never match the closing sequence |}. So once the last target is found, the \G part will fail until the first match part succeeds again.
~
### The beginning
(?:
\G (?!\A) # contiguous to a successful match
|
\Q{|\E # opening sequence
#; note that you can add `[^{]* (*SKIP)` before to quickly avoid
#; all failing positions
#; note that if you want to check that the opening sequence is followed by
#; a closing sequence (without an other opening sequence), you can do it
#; here using a lookahead
)
### lets reach the target
#; note that this whole part can also be written as `(?s:(?!\|}|\n\n).)*`
#; or `(?s:[^|\n]|(?!\|}|\n\n).)*`, but I chose the unrolled pattern, which is
#; more efficient.
[^|\n]*+ # all that isn't a pipe or a newline
# possibly followed by a character that isn't the start of |} or \n\n
(?s:
(?! \Q|}\E | \n\n ) # negative lookahead
. # the character
[^|\n]*
)*+
#; adding a `(*SKIP)` here can also be useful if there are no more empty lines
#; until the closing sequence
### The target
\n \K \n+ # the \K is a convenient way to define the start of the returned match
# result; this way, only \n+ is replaced (with nothing)
~x
Or with preg_replace_callback (simpler):
$txt = preg_replace_callback('~\Q{|\E .*? \Q|}\E~sx', function ($m) {
return preg_replace('~\n+~', "\n", $m[0]);
}, $txt);
demos
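The callback variant translates almost directly to Python, where re.sub accepts a function as the replacement. This is only a sketch of the same two-step idea (find each {| ... |} block, then squeeze newline runs inside it); BLOCK and collapse_blank_lines are illustrative names.

```python
import re

# Non-greedy match of one {| ... |} block, newlines included.
BLOCK = re.compile(r"\{\|.*?\|\}", re.DOTALL)


def collapse_blank_lines(text):
    """Squeeze runs of newlines to a single newline, but only inside {| ... |} blocks."""
    return BLOCK.sub(lambda m: re.sub(r"\n+", "\n", m.group(0)), text)


sample = "a\n\n{|\nx\n\ny\n\n\nz\n|}\n\nb"
print(collapse_blank_lines(sample))
```

Blank lines outside the block (around a and b here) are left alone, since the inner substitution only sees the text of each block.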
You can use a positive lookahead pattern to ensure that a matching blank line is followed by |}, but also use a negative lookahead pattern to ensure that none of the characters between the blank line and the |} is the starting position of a {|:
\n{2,}(?=(?:(?!\{\|).)*?\|\})
Demo: https://regex101.com/r/oWfkg1/8
If you use:
(?<={\|)(\n{2,}|(\r\n){2,}|\s+)(?=\|})
then it will match newlines and whitespace when they are the only content between {| and |}.
I am masking all characters between single quotes (inclusive) within a string using preg_replace_callback(). I would like to use only preg_replace() if possible, but haven't been able to figure it out. Any help would be appreciated.
This is what I have using preg_replace_callback() which produces the correct output:
function maskCallback( $matches ) {
return str_repeat( '-', strlen( $matches[0] ) );
}
function maskString( $str ) {
return preg_replace_callback( "('.*?')", 'maskCallback', $str );
}
$str = "TEST 'replace''me' ok 'me too'";
echo $str,"\n";
echo maskString( $str ),"\n";
Output is:
TEST 'replace''me' ok 'me too'
TEST ------------- ok --------
I have tried using:
preg_replace( "/('.*?')/", '-', $str );
but each quoted run collapses into a single dash, e.g.:
TEST -- ok -
Everything else I have tried doesn't work either. (I'm obviously not a regex expert.) Is this possible to do? If so, how?
Yes, you can do it (assuming that the quotes are balanced). Example:
$str = "TEST 'replace''me' ok 'me too'";
$pattern = "~[^'](?=[^']*(?:'[^']*'[^']*)*+'[^']*\z)|'~";
$result = preg_replace($pattern, '-', $str);
The idea is: a character can be replaced if it is a quote or if it is followed by an odd number of quotes.
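The odd-number-of-quotes idea also works outside PCRE. Below is a Python sketch of the same pattern with two adjustments that are assumptions of this sketch, not part of the answer: \z is written as \Z (Python's end-of-string anchor), and the possessive *+ is relaxed to * so it runs on Python versions before 3.11. MASK and mask_quoted are illustrative names.

```python
import re

# A character is masked if it is a quote, or if an odd number of quotes
# remains between it and the end of the string (i.e. it sits inside a pair).
MASK = re.compile(r"[^'](?=[^']*(?:'[^']*'[^']*)*'[^']*\Z)|'")


def mask_quoted(text):
    """Replace every quoted character, quotes included, with a dash."""
    return MASK.sub("-", text)


print(mask_quoted("TEST 'replace''me' ok 'me too'"))
```

As in the PHP version, this relies on the quotes being balanced; a stray quote flips the parity for everything before it.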
To leave the quotes themselves unmasked:
$pattern = "~(?:(?!\A)\G|(?:(?!\G)|\A)'\K)[^']~";
$result = preg_replace($pattern, '-', $str);
The pattern matches a character only when it is contiguous to the previous match (in other words, immediately after it) or when it is preceded by a quote that is not contiguous to the previous match.
\G is the position after the last match (at the beginning, it is the start of the string).
Pattern details:
~ # pattern delimiter
(?: # non capturing group: describe the two possibilities
# before the target character
(?!\A)\G # at the position in the string after the last match
# the negative lookahead ensures that this is not the start
# of the string
| # OR
(?: # (to ensure that the quote is not a closing quote)
(?!\G) # not contiguous to a previous match
| # OR
\A # at the start of the string
)
' # the opening quote
\K # drop all previous characters from the match result
# (only one quote here)
)
[^'] # a character that is not a quote
~
Note that since the closing quote is not matched by the pattern, the following characters that are not quotes can't be matched, because there is no previous match for them to be contiguous to.
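Python's re module supports neither \G nor \K, so the pattern above cannot be run there. To make the intended result concrete for the sample string, here is a reference sketch that deliberately uses a callback, i.e. a different technique from the one this answer demonstrates; QUOTED and mask_inside_quotes are illustrative names.

```python
import re

# One quoted run: opening quote, content without quotes, closing quote.
QUOTED = re.compile(r"'([^']*)'")


def mask_inside_quotes(text):
    """Mask the content of each quoted run while keeping the quotes themselves."""
    return QUOTED.sub(lambda m: "'" + "-" * len(m.group(1)) + "'", text)


print(mask_inside_quotes("TEST 'replace''me' ok 'me too'"))
```

For the sample string this keeps every quote and masks only replace, me and me too, which is what the \G-based pattern should produce there.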
EDIT:
The (*SKIP)(*FAIL) way:
Instead of testing whether a single quote is not a closing quote with (?:(?!\G)|\A)', as in the previous pattern, you can break the match contiguity on closing quotes using the backtracking control verbs (*SKIP) and (*FAIL) (which can be shortened to (*F)).
$pattern = "~(?:(?!\A)\G|')(?:'(*SKIP)(*F)|\K[^'])~";
$result = preg_replace($pattern, '-', $str);
Since the pattern fails on each closing quote, the following characters will not be matched until the next opening quote.
The pattern may be more efficient written like this:
$pattern = "~(?:\G(?!\A)(?:'(*SKIP)(*F))?|'\K)[^']~";
(You can also use (*PRUNE) in place of (*SKIP).)
Short answer: it's possible!
Use the following pattern:
' # Match a single quote
(?= # Positive lookahead, this basically makes sure there is an odd number of single quotes ahead in this line
(?:(?:[^'\r\n]*'){2})* # Match anything except single quote or newlines zero or more times followed by a single quote, repeat this twice and repeat this whole process zero or more times (basically a pair of single quotes)
(?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # You guessed it: this matches one more single quote, then everything up to the end of the line
)
| # or
\G(?<!^) # Preceding contiguous match (not beginning of line)
[^'] # Match anything that's not a single quote
(?= # Same as above
(?:(?:[^'\r\n]*'){2})* # Same as above
(?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # Same as above
)
|
\G(?<!^) # Preceding contiguous match (not beginning of line)
' # Match a single quote
Make sure to use the m modifier.
Online demo.
Long answer: it's a pain :)
Unless your whole team, not just you, loves regex, think twice before using this pattern: it is insane and quite difficult for beginners to grasp. Readability (almost) always comes first.
I'll break down how I arrived at this regex:
1) We first need to know what we actually want to replace, we want to replace every character (including the single quotes) that's between two single quotes with a hyphen.
2) If we're going to use preg_replace() that means our pattern needs to match one single character each time.
3) So the first step would be obvious : '.
4) We'll use \G, which matches at the beginning of the string or right after the previous match. Take this simple example: ~a|\Gb~. It will match a anywhere, and b only if it is at the beginning of the string or immediately follows the previous match. See this demo.
5) We don't want anything to do with the beginning of the string, so we'll use \G(?<!^).
6) Now we need to match anything that's not a single quote ~'|\G(?<!^)[^']~.
7) Now the real pain begins: how do we know that the above pattern won't go on to match c in 'ab'c? Well, it will, so we need to count the single quotes...
Let's recap:
a 'bcd' efg 'hij'
^ It will match this first
^^^ Then it will match these individually with \G(?<!^)[^']
^ It will match since we're matching single quotes without checking anything
^^^^^ And it will continue to match ...
What we want can be expressed as these 3 rules:
a 'bcd' efg 'hij'
1 ^ Match a single quote only if there is an odd number of single quotes ahead
2 ^^^ Match individually those characters only if there is an odd number of single quotes ahead
3 ^ Match a single quote only if there was a match before this character
8) Checking if there is an odd number of single quotes could be done if we knew how to match an even number :
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
9) An odd number would be easy now, we just need to add :
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
10) Merging above in a single lookahead:
(?=
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
)
11) Now we need to merge all 3 rules we defined earlier:
~ # Pattern delimiter
#################################### Rule 1 ####################################
' # A single quote
(?= # Lookahead to make sure there is an odd number of single quotes ahead
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
)
| # Or
#################################### Rule 2 ####################################
\G(?<!^) # Preceding contiguous match (not beginning of line)
[^'] # Match anything that's not a single quote
(?= # Lookahead to make sure there is an odd number of single quotes ahead
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
)
| # Or
#################################### Rule 3 ####################################
\G(?<!^) # Preceding contiguous match (not beginning of line)
' # Match a single quote
~x
Online regex demo.
Online PHP demo
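The odd-number-of-quotes lookahead from steps 8 to 10 can be checked in isolation. The Python sketch below compiles just that lookahead and anchors it at a given index with Pattern.match; ODD_QUOTES_AHEAD and inside_quotes are illustrative names, and the index convention (counting the character at the index itself) is a choice of this sketch.

```python
import re

# Zero-width test: from the current position to the end of the line there
# is an even number of quotes (0, 2, 4, ...) plus exactly one more.
ODD_QUOTES_AHEAD = re.compile(
    r"""
    (?=
        (?: (?: [^'\r\n]* ' ){2} )*     # any number of quote pairs
        [^'\r\n]* ' [^'\r\n]*           # plus one more quote
        (?: \r?\n | $ )                 # up to the end of the line
    )
    """,
    re.VERBOSE | re.MULTILINE,
)


def inside_quotes(line, index):
    """True if an odd number of single quotes lies at or after index on this line."""
    return ODD_QUOTES_AHEAD.match(line, index) is not None


line = "a 'bcd' efg"
print([i for i in range(len(line)) if inside_quotes(line, i)])
```

With this convention the positions of b, c, d and the closing quote count as inside; the opening quote gets its own check in the full pattern because there the lookahead runs after the quote has been consumed.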
Well, just for the fun of it (I seriously wouldn't recommend something like this, because I try to avoid lookarounds when they are not necessary), here's a regex that uses the concept of 'back to the future':
(?<=^|\s)'(?!\s)|(?!^)(?<!'(?=\s))\G.
regex101 demo
Okay, it's broken down into two parts:
1. Matching the beginning single quote
(?<=^|\s)'(?!\s)
The rules that I believe should be established here are:
There should be either ^ or \s before the beginning quote (hence (?<=^|\s)).
There is no \s after the beginning quote (hence (?!\s)).
2. Matching the things inside the quote, and the ending quote
(?!^)\G(?<!'(?=\s)).
The rules that I believe should be established here are:
The character can be any character (hence .)
The match is 1 character long and following the immediate previous match (hence (?!^)\G).
It must not be preceded by a single quote that is itself followed by whitespace (hence (?<!'(?=\s)); this is the 'back to the future' part). This effectively refuses to match a \s that is preceded by a ', which marks the end of the characters wrapped in single quotes. In other words, the closing quote is identified as a single quote followed by \s.
I need to remove all /*...*/ style comments from JSON data. How do I do it with regular expressions so that string values like this
{
"propName": "Hello \" /* hi */ there."
}
remain unchanged?
You must first skip all the content inside double quotes, using the backtracking control verbs (*SKIP) and (*FAIL) (or a capture group):
$string = <<<'LOD'
{
"propName": "Hello \" /* don't remove **/ there." /*this must be removed*/
}
LOD;
$result = preg_replace('~"(?:[^\\\"]+|\\\.)*+"(*SKIP)(*FAIL)|/\*(?:[^*]+|\*+(?!/))*+\*/~s', '',$string);
// The same with a capture:
$result = preg_replace('~("(?:[^\\\"]+|\\\.)*+")|/\*(?:[^*]+|\*+(?!/))*+\*/~s', '$1',$string);
Pattern details:
"(?:[^\\\"]+|\\\.)*+"
This part describes the possible content inside quotes:
" # literal quote
(?: # open a non-capturing group
[^\\\"]+ # all characters that are not \ or "
| # OR
\\\.)*+ # escaped char (that can be a quote)
"
Then you can make this subpattern fail with (*SKIP)(*FAIL) or (*SKIP)(?!). (*SKIP) forbids backtracking before this point if the pattern fails afterwards. (*FAIL) forces the pattern to fail. Thus, quoted parts are skipped (and can't be in the result, since you make the subpattern fail afterwards).
Or you can use a capturing group and add its reference to the replacement pattern.
/\*(?:[^*]+|\*+(?!/))*+\*/
This part describes the content inside comments.
/\* # open the comment
(?:
[^*]+ # all characters except *
| # OR
\*+(?!/) # * not followed by / (note that you can't use
# a possessive quantifier here)
)*+ # repeat the group zero or more times
\*/ # close the comment
The s modifier is needed here only for the case where a backslash precedes a newline inside quotes.
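The capture-group variant carries over to engines without backtracking verbs. Here is a Python sketch of it, with simplifications that are assumptions of this sketch: the comment part is written as a lazy /\*.*?\*/ rather than the unrolled form, and a function is used as the replacement so that an unmatched group 1 becomes an empty string. STRING_OR_COMMENT and strip_block_comments are illustrative names.

```python
import re

# Strings come first in the alternation, so a /* inside a string is
# consumed as part of the string and never seen as a comment opener.
STRING_OR_COMMENT = re.compile(
    r"""
      ("(?:\\.|[^"\\])*")    # group 1: a double-quoted string, kept as is
    | /\*.*?\*/              # a /* ... */ comment, dropped
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_block_comments(text):
    """Remove /* ... */ comments that are not inside double-quoted strings."""
    return STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", text)


data = '{\n  "propName": "Hello \\" /* hi */ there." /* remove me */\n}'
print(strip_block_comments(data))
```

The string value keeps its /* hi */ because the whole string is matched by group 1 and put back unchanged; only the trailing comment is removed.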