I am tired of always trying to guess whether I should escape special characters like '()[]{}|' etc. when using the many implementations of regexps.
It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?
Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.
For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:
.^$*+?()[{\|
and these inside character classes:
^-]\
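As a rough illustration in Python, whose re module follows these Perl-style rules closely, the two lists above can be turned into a pair of helpers. The names escape_outside and escape_inside are made up for this sketch; the standard library's re.escape does a similar job but escapes more characters than strictly necessary.

```python
import re

# Metacharacters taken from the two lists above (Perl-style flavors).
OUTSIDE_CLASS = set(".^$*+?()[{\\|")
INSIDE_CLASS = set("^-]\\")

def escape_outside(text):
    """Backslash-escape metacharacters for use outside a character class."""
    return "".join("\\" + c if c in OUTSIDE_CLASS else c for c in text)

def escape_inside(chars):
    """Backslash-escape metacharacters for use inside [...]."""
    return "".join("\\" + c if c in INSIDE_CLASS else c for c in chars)

literal = "1+1=2 (really?)"
pattern = re.compile(escape_outside(literal))

# A class that matches any run of the four class metacharacters themselves.
char_class = re.compile("[" + escape_inside("^-]\\") + "]+")
```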
For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):
.^$*+?()[{\|
Escaping any other characters is an error with POSIX ERE.
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:
[]^-]
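The placement rule can be sketched as a small Python string builder. The function name posix_bracket is hypothetical, and the sketch only produces the text of the bracket expression; it assumes the result is handed to a POSIX tool such as grep or sed.

```python
def posix_bracket(chars):
    """Build a POSIX bracket expression matching any one of `chars` literally.

    No backslashes are used; literals are made safe purely by position.
    """
    wanted = set(chars)
    if not wanted:
        raise ValueError("need at least one character")
    if wanted == {"^"}:
        # "[^]" would be an unterminated negated class; no placement helps.
        raise ValueError("a lone ^ cannot be expressed by placement")
    others = "".join(sorted(wanted - set("]^-[")))
    body = (
        ("]" if "]" in wanted else "")    # ] must come first
        + others
        + ("[" if "[" in wanted else "")  # keep [ from preceding . : or =
        + ("^" if "^" in wanted else "")  # ^ anywhere except first
        + ("-" if "-" in wanted else "")  # - goes last
    )
    if body.startswith("^"):
        # Only ^ and - were requested; a leading - is literal as well.
        body = "-^"
    return "[" + body + "]"
```

For the three characters discussed above, posix_bracket("^-]") reproduces the example expression.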
In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:
.^$*[\
Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and \+. Escaping a character other than .^$*(){} is normally an error with BREs.
Inside character classes, BREs follow the same rule as EREs.
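A minimal Python sketch of the BRE/ERE difference outside character classes, using only the metacharacter lists given above. The helper name escape_posix is an assumption for illustration, and the output is meant for tools like grep (BRE) or grep -E (ERE), not for Python's own re module.

```python
BRE_META = set(".^$*[\\")
ERE_META = set(".^$*+?()[{\\|")

def escape_posix(text, extended=False):
    """Escape `text` so a POSIX BRE (default) or ERE matches it literally."""
    meta = ERE_META if extended else BRE_META
    return "".join("\\" + c if c in meta else c for c in text)

# In a BRE, + and ( are already literal, so nothing is added here.
bre = escape_posix("a+b (c)")
# In an ERE they are metacharacters and need a backslash.
ere = escape_posix("a+b (c)", extended=True)
```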
If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.
Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp. PCRE compatibility may vary
Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed. PCRE support may be enabled in later versions or by using extensions
ERE/awk/egrep/emacs
Outside a character class: . ^ $ * + ? ( ) [ { } \ |
Inside a character class: ^ - [ ]
BRE/ed/grep/sed
Outside a character class: . ^ $ * [ \
Inside a character class: ^ - [ ]
For literals, don't escape: + ? ( ) { } |
For standard regex behavior, escape: \+ \? \( \) \{ \} \|
Notes
If unsure about a specific character, it can be escaped like \xFF
Alphanumeric characters cannot be escaped with a backslash
Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
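The same effect can be seen in Python, shown here only as an analogy to the JavaScript case in the notes: an ordinary string literal needs the quote escaped and the backslash doubled, while a raw string in the other quote style needs neither.

```python
import re

plain = "(\")(/)(\\.)"  # ordinary literal: quote escaped, backslash doubled
raw = r'(")(/)(\.)'     # raw literal in single quotes: written as the regex reads

# Both literals produce the same ten-character pattern text.
same_text = plain == raw
match = re.fullmatch(plain, '"/.')
```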
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live
Unfortunately there really isn't a fixed set of escape codes, since it varies based on the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.
POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even which notation a given command uses.
Check out Jeff Friedl's Mastering Regular Expressions book.
Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.
So you really have to know what style you are trying to quote.
Really, there isn't. There are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.
Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a parenthesis isn't going to work on the left-hand side of a substitution in sed, because \( starts a group in a basic regular expression, namely
sed -e 's/foo\(bar/something_else/'
I tend to just use a simple character class definition instead, so the above expression becomes
sed -e 's/foo[(]bar/something_else/'
which I find works for most regexp implementations.
BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
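Here is a small Python sketch of the one-member character class trick. The function name bracket_quote and the split into two character sets are choices made for this example: characters that are awkward inside brackets in some engines (\ ^ [ ]) fall back to a backslash instead.

```python
import re

BRACKETABLE = set(".$*+?(){}|")  # wrapped as [c]
NEEDS_BACKSLASH = set("\\^[]")   # awkward or special inside [...]

def bracket_quote(text):
    """Quote regex metacharacters, preferring [c] over backslash-c."""
    out = []
    for c in text:
        if c in BRACKETABLE:
            out.append("[" + c + "]")
        elif c in NEEDS_BACKSLASH:
            out.append("\\" + c)
        else:
            out.append(c)
    return "".join(out)

quoted = bracket_quote("foo(bar")
```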
Edit: After the comment below, just thought I'd mention that you also have to consider the difference between deterministic and non-deterministic finite automata (DFA and NFA engines) when looking at the behaviour of regexp evaluation.
You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for the difference in regexp engine evaluation types.
Not all the world's a PCRE!
Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.
Ah the joys of studying at UNSW in the late '70's! (-:
https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting:
my $regex = quotemeta($string)
s/$regex/something/
For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.
Except if it's a " or '. :/
To escape regex pattern variables (or partial variables) in PHP use preg_quote()
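Python's counterpart to quotemeta and preg_quote() is re.escape. A short sketch with made-up sample text:

```python
import re

user_text = "price: $5.00 (approx.)"
haystack = "Listed price: $5.00 (approx.) per unit"

# Escaped: every metacharacter in user_text is matched literally.
found = re.search(re.escape(user_text), haystack)

# Unescaped: $ . ( ) keep their regex meaning, so the search fails.
unescaped = re.search(user_text, haystack)
```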
To know what to escape, and when, without trial and error, you need to understand precisely the chain of contexts the string passes through. Trace the string from where you write it to its final destination, which is the memory handled by the regexp parsing code.
Be aware of how the string gets there: it can be a plain string literal in the code, or a string entered on the command line (which may be an interactive command line or one stated inside a shell script file), or the contents of a variable, or a string argument that goes through further evaluation, or a string containing dynamically generated code with any sort of encapsulation...
Each of these contexts assigns special functionality to some characters.
When you want to pass a character literally, without its special function in a given context, you have to escape it for that context. The escape characters you add for a later context may themselves need escaping in the preceding context(s).
Furthermore, there can be things like character encoding to consider. The most insidious is UTF-8, because it looks like ASCII for common characters but may be interpreted differently even by the terminal, depending on its settings; then there is the encoding attribute of HTML/XML. It is necessary to understand the whole process precisely to get it right.
E.g. a regexp on a command line starting with perl -npe has to be turned into a set of exec system calls with pipes connecting their file handles. Each of these exec system calls just receives a list of arguments that were separated by (unescaped) spaces, and possibly pipes (|), redirections (> N> N>&M), parentheses, interactive expansion of * and ?, $(()) ... All of these are special characters for the *sh, and they may appear to interfere with the characters of the regular expression in the next context, but they are evaluated in order: the shell processes the command line first. The command line is read by a program such as bash/sh/csh/tcsh/zsh. Escaping is simpler inside double or single quotes, but quoting a string on the command line is not required: mostly it is enough to prefix spaces with a backslash, which leaves the expansion of * and ? available. An unquoted string is, however, parsed as a different context from a quoted one. Once the command line has been evaluated, the regexp obtained in memory (not as written on the command line) receives the same treatment as it would in a source file.
Within a regexp there is the character-set context inside square brackets [ ]. A Perl regular expression can be delimited by a large set of non-alphanumeric characters (e.g. m// or m:/better/for/path: ...).
The other answers give more details about the characters, which are very specific to the final regexp context. You mention that you find the right regexp escapes by trial and error; that is probably because different contexts have different sets of special characters, which confuses your memory of earlier attempts (backslash is often the character used in those different contexts to escape a literal character instead of its function).
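The layering can be made concrete with a short Python sketch: the literal is first escaped for the regex context and then quoted for the shell context, and undoing the outer layer gives back exactly what the inner layer needs. The script name match.py is a placeholder, and re.escape targets Python's own flavor, so this assumes the receiving program also uses Python regexes.

```python
import re
import shlex

literal = "cost is $5 (a*b)"

# Layer 1: regex context.
regex_text = re.escape(literal)

# Layer 2: shell context. shlex.quote protects the text from a POSIX shell.
shell_word = shlex.quote(regex_text)
command = "python3 match.py " + shell_word  # match.py is hypothetical

# Parsing the command the way a POSIX shell would strips only the outer layer.
recovered = shlex.split(command)[-1]
```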
For Ionic (TypeScript) you have to double the backslash in order to escape the characters.
For example (this is to match some special characters):
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç##$%^&*(),;\\.?\":{}|<>\+\\/])"
Pay attention to these characters: ] [ - _ . / They have to be escaped with a double backslash. If you don't do that, you are going to get an error in your code.
To avoid having to worry about which regex variant you have and all its bespoke peculiarities, just use this generic function, which covers every regex variant other than BRE (unless they have Unicode multi-byte chars that are meta):
jot -s '' -c - 32 126 |
mawk '
function ___(__,_) {
return substr(_="",
gsub("[][!-/_\140:-@{-~]","[&]",__),
gsub("["(_="\\\\")"^]",_ "&",__))__
} ($++NF = ___($!_))^_'
!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
[!]["][#][$][%][&]['][(][)][*][+][,][-][.][/]
0 1 2 3 4 5 6 7 8 9 [:][;][<][=][>][?]
[@] ABCDEFGHIJKLMNOPQRSTUVWXYZ [[]\\ []]\^ [_]
[`] abcdefghijklmnopqrstuvwxyz [{][|][}][~]
square-brackets are much easier to deal with, since there's no risk of triggering warning messages about "escaping too much", e.g. :
function ____(_) {
return substr("", gsub("[[:punct:]]","\\\\&",_))_
}
\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/ 0123456789\:\;\<\=\>\?
\@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^\_\`abcdefghijklmnopqrstuvwxyz \{\|\}\~
gawk: cmd. line:1: warning: regexp escape sequence `\!' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\%' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\,' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\:' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\=' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\@' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\_' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\~' is not a known regexp operator
Using Raku (formerly known as Perl_6)
Works (backslash or quote all non-alphanumeric characters except underscore):
~$ raku -e 'say $/ if "#.*?" ~~ m/ \# \. \* \? /; #works fine'
「#.*?」
There exist six flavors of Regular Expression languages, according to Damian Conway's pdf/talk "Everything You Know About Regexes Is Wrong". Raku represents a significant (~15 year) re-working of standard Perl(5)/PCRE Regular Expressions.
In those 15 years the Perl_6 / Raku language experts decided that all non-alphanumeric characters (except underscore) shall be reserved as Regex metacharacters even if no present usage exists. To denote non-alphanumeric characters (except underscore) as literals, backslash or escape them.
So the above example prints the $/ match variable if a match to a literal #.*? character sequence is found. Below is what happens if you don't: # is interpreted as the start of a comment, . dot is interpreted as any character (including whitespace), * asterisk is interpreted as a zero-or-more quantifier, and ? question mark is interpreted as either a zero-or-one quantifier or a frugal (i.e. non-greedy) quantifier-modifier (depending on context):
Errors:
~$ raku -e 'say $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!'
===SORRY!===
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Couldn't find terminator / (corresponding / was at line 1)
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
expecting any of:
/
https://docs.raku.org/language/regexes
https://raku.org/
Related
I have finally started to understand the context behind escaping hexadecimal characters such as \x80. The documentation talks about the escape sequences, but I can also see that some regular expression use double backslashes such as \\x80 - \\xFF.
What's the difference between \\x80 - \\xFF and \x80 - \xFF when using something like preg_replace ?
When using preg_ functions, your string is parsed twice - first, by php compiler, and then by the PCRE engine. So if you have, for example:
preg_match("/\x80/"....)
the compiler turns it into
preg_match("/�/"....) // let � be chr(0x80)
and passes this to PCRE. When you have two slashes:
preg_match("/\\x80/"....)
the compiler turns the string into
preg_match("/\x80/"....)
and then it's the PCRE engine that converts this to the literal character �.
It doesn't make a difference in this particular case, but consider:
preg_match("/\x5B/"....)
after compilation
preg_match("/[/"....)
and PCRE fails, because of the dangling metacharacter [. Now if you escape the slash
preg_match("/\\x5B/"....)
it's compiled to
preg_match("/\x5B/"....)
which makes PCRE happy, because it understands that [ should be taken literally.
How exactly php compiles your string depends on the quotes you use: double/single/heredocs/nowdocs. See docs for details. A simple rule of thumb is to use single quotes when possible, if you have to use doubles (for variable interpolation), escape everything twice, even if there's technically no need (e.g "\\b$word\\b").
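Python has the same two parsing layers (the string literal, then the regex engine), so the \x5B case can be reproduced there as an analogy. This is a sketch of the principle, not of PHP's exact behaviour.

```python
import re

one_layer = "\x5B"   # Python turns this into "[" before re ever sees it
two_layer = "\\x5B"  # re receives the four characters \x5B
raw_form = r"\x5B"   # the same four characters, written as a raw string

try:
    re.compile(one_layer)  # the pattern is a dangling "["
    compiled_ok = True
except re.error:
    compiled_ok = False

# The regex engine resolves \x5B itself and treats the result as a literal.
matches_bracket = re.fullmatch(two_layer, "[") is not None
```

Raw strings play the role that single quotes play in the PHP rule of thumb above.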
To write hex x80, you use \ and that way you get \x80.
Now, in a PHP string, \ escapes special characters. In the string "$var" PHP will try to insert the variable $var into the string (because the string uses "). To escape $ you write "\$var" and the output will be just the simple string $var.
Now, to write \ in a string (no matter if it uses " or ') you use the same escaping character \. So it becomes \\ to output \.
If you write "\x80", PHP itself converts the escape and your output will be the single byte 0x80 (no \ is left). Then you escape \ with another \ => "\\x80" outputs "\x80".
So to summarize everything:
\x80 is a hex character escape, and to keep it as written inside a string, you write \\x80.
Just some fun:
PHP that outputs js function to alert \x80:
echo "function alertHex(){
alert('\\\\x80 - \\\\xFF');
}";
Why 4 x \? First you escape for the PHP string to get alert('\\x80 - \\xFF'), then you escape for the JS string to get \x80 - \xFF.
Same with preg_replace. Allowed symbols: \, $, a-z, [, ]; pattern: \\\$[a-z]\[\]; as a call: preg_replace('/\\\\\$[a-z]\\[\\]/', '', $str);
I'm not sure on this and think this is impossible, but I thought I'd ask anyway.
I would like to use a regex delimiter that is a metachar. Examples would be
brackets, parentheses, etc.. [ ], ( ), ...
but anything really.
It's not that I need to do it; it's that I'm trying to write an escaping routine as part of a project.
So, what's the problem? The problem comes in the regex body when it's not really a metachar,
it's a literal, like:
/ \( \) / where the forward slash delimiters are to be replaced with ( and )
In Perl for instance, these won't work
=~ m( \( \) )
=~ m( \\( \\) )
=~ m( \\\( \\\) )
=~ m( \\\\( \\\\) )
No amount of escaping the parentheses will allow a single backslash, i.e. a literal \(
The backslash on the delimiter is always removed; the remaining backslashes are then subject to normal quoting rules. This always results in an even number of backslashes.
PHP is apparently the same way.
Like I said, I wouldn't use metacharacters as delimiters in normal operation; this
is just a utility I'm trying to write (which seems in jeopardy right now).
I'm trying to use just basic escaping rules and want to avoid having to scan the string
ahead of time, comparing selected delimiters for literal (escaped) metacharacters in
the regex text body.
Perl has q() and qq(), which do this correctly (not qr(), unfortunately).
It does this by removing escapes on escapes and escapes on delimiters at the same time.
So q( \\\( \\\) ) results in \( \).
Thanks for any help.
Edit
After some research I found this to be impossible, so the utility is scrapped.
Thanks for the valuable input though. I'm fairly impressed with Perl's array of
quoting options, especially the 'quote-like operators', which do the job,
but the delimiter is then really for the quote operator and not a regex.
[ I'm not sure if you're asking about Perl or PHP. I just know about Perl ]
Regex literals are parsed twice, once by the Perl compiler and once by the regex compiler.
The Perl parser finds the end of the literal while handling interpolation, escaped delimiters and sequences like \Q and \L. This produces the regex pattern (as a string) and the matching options (e.g. case-insensitive matching).
qr/\/\(/ produces the pattern /\( (/ got unescaped). Similarly,
qr(\/\() produces the pattern \/( (( got unescaped).
The regex compiler takes the regex pattern and the matching options and returns a compiled regex.
/\( produces a regex that matches exactly /(, while
\/( produces a regex syntax error.
To produce a regex that matches exactly (, you would need to produce the pattern \( or equivalent. Here are your options:
qr/\(/ (Don't use it as a delimiter)
$d='('; qr(\Q$d\E) (Don't use it in the literal)
qr(\Q\(\E) (Use \Q to insert an escape after \( has become ()
qr(\x28) (Use something equivalent)
qr([\(]) (Use it in a way that doesn't require it being escaped)
Your best option by far is to simply choose a different delimiter: one that isn't a meta char, or one that's not used in the pattern. This is trivial since it only matters for hardcoded patterns.
I do not know about PHP, but you can use the \Q in Perl:
"()" =~ m(\Q\(\)\E) and print "YES\n"
Using one-member character classes should work in both Perl and PHP:
"()" =~ m([(][)]) and print "YES\n"
Can you make your example a bit more precise?
Because
If the original string -> '\('
then /[\\][(]/ will match it
As preface, I am new to (and really bad at) writing regular expressions.
I am trying to use a regular expression in the PHP function preg_split, and am looking to delineate by
*
**
`
I'm having trouble because these characters are commands. How can I write a regular expression to do this?
For PCRE and other so-called compatible flavors, you must escape these outside character classes.
. ^ $ * + ? () [ { \ |
The backtick has no special meaning, so you don't need to escape it.
preg_split('/\*{1,2}|`/', $text);
See Demo
Note: For future reference, you may want to look into using preg_quote()
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.
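For comparison, here is the same split in Python, which follows the same escaping rule for * and also leaves the backtick alone. The sample text is made up for this sketch.

```python
import re

text = "plain *em* and **strong** and `code`"

# \*{1,2} matches one or two literal asterisks; the backtick needs no escape.
parts = re.split(r"\*{1,2}|`", text)
```

Note that a delimiter at the very end of the input leaves an empty string as the last element.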
preg_split('/(?:\*{1,2}|`)/', $string);
I want to declare the same regex pattern for both languages. For TCL I do this
set pattern "\d\s\S"
but for C++ I have to do this for the same pattern
boost::regex pattern("\\d\\s\\S");
otherwise C++ compiler will tell us the following:
warning C4129: 'd' : unrecognized character escape sequence
So why doesn't TCL try to interpret \d \s \S as escape sequences and just leave the backslashes alone, while C++ tries and chokes?
P.S. PHP works like TCL, as far as I remember.
This is just how C++ and PHP differ; in PHP, the character following a backslash is matched against a small set of special characters (I believe "rnvtx"). If the match fails it will just continue without altering the meaning.
However, C++ expects the character to be in that small set (I think the set is bigger btw) but if the match fails you will see an error instead.
C++ has the concept of Character Escape Sequences. Escape sequences, which take the form \c (the 'c' being a character), are used to define certain special characters within string literals, so it follows that backslashes by themselves must also be escaped to denote that a special character isn't being implied.
^([a-zA-Z0-9!@#$%^&*|()_\-+=\[\]{}:;\"',<.>?\/~`]{4,})$
Would this regular expression work for these rules?
Must be at least 4 characters
Characters can be a mix of alphabet (capitalized/non-capitalized), numeric, and the following characters: ! @ # $ % ^ & * ( ) _ - + = | [ { } ] ; : ' " , < . > ? /
It's intended to be a password validator. The language is PHP.
Yes?
Honestly, what are you asking for? Why don't you test it?
If, however, you want suggestions on improving it, some questions:
What is this regex checking for?
Why do you have such a large set of allowed characters?
Why don't you use /\w/ instead of /0-9a-zA-Z_/?
Why do you have the whole thing in ()s? You don't need to capture the whole thing, since you already have the whole thing, and they aren't needed to group anything.
What I would do is check the length separately, and then check against a regex to see if it has any bad characters. Your list of good characters seems to be sufficiently large that it might just be easier to do it that way. But it may depend on what you're doing it for.
EDIT: Now that I know this is PHP-centric, /\w/ is safe because PHP uses the PCRE library, which is not exactly Perl, and in PCRE, \w will not match Unicode word characters. Thus, why not check for length and ensure there are no invalid characters:
if (strlen($string) >= 4 && preg_match('/[\s~\\\\]/', $string) === 0) {
# valid password
}
Alternatively, use the little-used POSIX character class [[:graph:]]. It should work pretty much the same in PHP as it does in Perl. [[:graph:]] matches any alphanumeric or punctuation character, which sounds like what you want, and [[:^graph:]] should match the opposite. To test if all characters match graph:
preg_match('/^[[:graph:]]+$/', $string) == 1
To test if any characters don't match graph:
preg_match('/[[:^graph:]]/', $string) == 0
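The "check the length, then look for bad characters" approach might look like this in Python. Python's re module has no POSIX classes, so the sketch assumes the ASCII range [!-~] (codes 33 to 126) as a stand-in for [[:graph:]]; the function name is_acceptable is hypothetical.

```python
import re

# [^!-~] matches anything outside ASCII 33..126, i.e. a non-graph character.
NOT_GRAPH = re.compile(r"[^!-~]")

def is_acceptable(password, min_length=4):
    """True if long enough and free of spaces, control and non-ASCII chars."""
    return len(password) >= min_length and NOT_GRAPH.search(password) is None
```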
You forgot the comma (,) and full stop (.) and added the tilde (~) and grave accent (`) that were not part of your specification. Additionally just a few characters inside a character set declaration have to be escaped:
^([a-zA-Z0-9!@#$%^&*()|_\-+=[\]{}:;"',<.>?/~`]{4,})$
And that as a PHP string declaration for preg_match:
'/^([a-zA-Z0-9!@#$%^&*()|_\\-+=[\\]{}:;"\',<.>?\\/~`]{4,})$/'
I noticed that you essentially have all of ASCII, except for backslash, space and the control characters at the start, so what about this one, instead?
^([!-\[\]-~]{4,})$
You are over-escaping and aren't using some predefined character classes (such as \w, or at least \d).
Besides that, and the fact that you are anchoring at the beginning and at the end (meaning that the regex will only match if the whole string matches), it looks correct:
^([a-zA-Z\d\-!$@#$%^&*()|_+=\[\]{};,."'<>?/~`]{4,})$
If you really mean to use this as a password validator, it reeks of insecurity:
Why are you allowing 4 chars passwords?
Why are you forbidding some characters? PHP can't handle some? Why would you care? Let the user enter the characters he pleases, after all you'll just end up storing a hash + salt of it.
No. That regular expression would not work for the rules you state, for the simple reason that $ by default matches before the final character if it is a newline. You are allowing password strings like "1234\n".
The solution is simple. Either use \z instead of $, or apply the D modifier to the regex.
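The trailing-newline behaviour of $ is easy to reproduce in Python, used here only as an analogy to PCRE. One naming difference to keep in mind: Python spells the strict end-of-string anchor \Z, which corresponds to PCRE's \z. The character range [!-~] is a simplified stand-in for the password class discussed above.

```python
import re

loose = re.compile(r"^[!-~]{4,}$")    # $ also matches before a final newline
strict = re.compile(r"^[!-~]{4,}\Z")  # \Z matches only at the very end

sneaky = "1234\n"
loose_accepts = loose.search(sneaky) is not None
strict_accepts = strict.search(sneaky) is not None
```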