In the top-voted answer to this fantastic question, the following regular expression is used in a preg_replace call (from the answer's auto_version function):
'{\\.([^./]+)$}'
The end goal of this regular expression is to extract the file's extension from the given filename. However, I'm confused about why the very beginning of this regular expression works. Namely:
Why does \\. match the same way as \. in a regex?
Shouldn't the former match (a) one literal backslash, followed by (b) any character, while the second matches one literal period? The rules for single quoted strings state that \\ yields a literal backslash.
Consider this simple example:
$regex1 = '{\.([^./]+)$}'; // Variant 1 (one backslash)
$regex2 = '{\\.([^./]+)$}'; // Variant 2 (two backslashes)
$subject1 = '/css/foobar.css'; // Regular path
$subject2 = '/css/foobar\\.css'; // Literal backslash before period
echo "<pre>\n";
echo "Subject 1: $subject1\n";
echo "Subject 2: $subject2\n\n";
echo "Regex 1: $regex1\n";
echo "Regex 2: $regex2\n\n";
// Test Variant 1
echo preg_replace($regex1, "-test.\$1", $subject1) . "\n";
echo preg_replace($regex1, "-test.\$1", $subject2) . "\n\n";
// Test Variant 2
echo preg_replace($regex2, "-test.\$1", $subject1) . "\n";
echo preg_replace($regex2, "-test.\$1", $subject2) . "\n\n";
echo "</pre>\n";
The output is:
Subject 1: /css/foobar.css
Subject 2: /css/foobar\.css
Regex 1: {\.([^./]+)$} <-- Output matches regex 2
Regex 2: {\.([^./]+)$} <-- Output matches regex 1
/css/foobar-test.css
/css/foobar\-test.css
/css/foobar-test.css
/css/foobar\-test.css
Long story short: why should \\. yield the same matched results in a preg_replace call as \.?
Consider that there is double escaping going on: PHP sees \\. and says "OK, this is really \.". Then the regex engine sees \. and says "OK, this means a literal dot".
If you remove the first backslash, PHP sees \. and says "this is a backslash followed by a random character -- not a single quote or a backslash as per the spec -- so it remains \.". The regex engine again sees \. and gives the same result as above.
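A quick way to check this is to compare the two literals directly. The sketch below (variable names are only for illustration) suggests that PHP hands the regex engine the very same string in both cases:

```php
<?php
// Both literals are single-quoted, so PHP only treats \\ and \' specially.
$oneBackslash = '{\.([^./]+)$}';    // "\." is not a PHP escape, so the backslash stays
$twoBackslashes = '{\\.([^./]+)$}'; // "\\" collapses to a single backslash

// The regex engine therefore receives identical patterns.
var_dump($oneBackslash === $twoBackslashes); // bool(true)

// Same replacement result for both variants.
$resultOne = preg_replace($oneBackslash, '-test.$1', '/css/foobar.css');
$resultTwo = preg_replace($twoBackslashes, '-test.$1', '/css/foobar.css');
echo $resultOne, "\n", $resultTwo, "\n";
```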
An addition to the perfectly correct answer by Jon:
Please consider how the two kinds of quotes (" vs ') differ. Inside ' the escape sequences for control characters (such as \n for a new line) are not interpreted. Inside " they are: \ followed by certain characters (\n, \t, etc.) produces the corresponding control character. So, if you want a real \ in a double-quoted string, the safe way is to escape the backslash by writing \\. With single quotes this is only required before another backslash, before a single quote, or at the end of the string; anywhere else a lone \ is kept as is.
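As a small illustration of the difference between the quote styles (a sketch, with variable names chosen only for this example):

```php
<?php
// Double quotes interpret escape sequences; single quotes do not.
$double = "a\nb"; // a, newline, b
$single = 'a\nb'; // a, backslash, n, b

var_dump(strlen($double)); // int(3)
var_dump(strlen($single)); // int(4)

// "\\" and '\\' both produce one backslash.
var_dump("\\" === '\\'); // bool(true)

// An unrecognised escape such as \. is left untouched by both quote styles.
var_dump("\." === '\.'); // bool(true)
```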
Just out of curiosity, I'm trying to figure out which exactly is the right way to escape a backslash for use in a PHP regular expression pattern like so:
TEST 01: (3 backslashes)
$pattern = "/^[\\\]{1,}$/";
$string = '\\';
// ----- RETURNS A MATCH -----
TEST 02: (4 backslashes)
$pattern = "/^[\\\\]{1,}$/";
$string = '\\';
// ----- ALSO RETURNS A MATCH -----
According to the articles below, 4 is supposedly the right way but what confuses me is that both tests returned a match. If both are right, then is 4 the preferred way?
RESOURCES:
http://www.developwebsites.net/match-backslash-preg_match-php/
Can't escape the backslash with regex?
// PHP 5.4.1
// Either three or four \ can be used to match a '\'.
echo preg_match( '/\\\/', '\\' ); // 1
echo preg_match( '/\\\\/', '\\' ); // 1
// Match two backslashes `\\`.
echo preg_match( '/\\\\\\/', '\\\\' ); // Warning: No ending delimiter '/' found
echo preg_match( '/\\\\\\\/', '\\\\' ); // 1
echo preg_match( '/\\\\\\\\/', '\\\\' ); // 1
// Match one backslash using a character class.
echo preg_match( '/[\\]/', '\\' ); // prints nothing: the pattern fails to compile (missing terminating ]) and preg_match() returns false
echo preg_match( '/[\\\]/', '\\' ); // 1
echo preg_match( '/[\\\\]/', '\\' ); // 1
When three backslashes are used to match a '\' and the next pattern element is \s, the four backslashes that result pair up, so the pattern below is interpreted as: match a '\' followed by an 's'.
echo preg_match( '/\\\\s/', '\\ ' ); // 0
echo preg_match( '/\\\\s/', '\\s' ); // 1
When four backslashes are used to match a '\', the \s that follows keeps its meaning, so the pattern below is interpreted as: match a '\' followed by a whitespace character.
echo preg_match( '/\\\\\s/', '\\ ' ); // 1
echo preg_match( '/\\\\\s/', '\\s' ); // 0
The same applies if inside a character class.
echo preg_match( '/[\\\\s]/', ' ' ); // 0
echo preg_match( '/[\\\\\s]/', ' ' ); // 1
None of the above results are affected by enclosing the strings in double instead of single quotes.
Conclusions:
Whether inside or outside a bracketed character class, a literal backslash can be matched using just three backslashes '\\\' unless the next character in the pattern is also backslashed, in which case the literal backslash must be matched using four backslashes.
Recommendation:
Always use four backslashes '\\\\' in a regex pattern when seeking to match a backslash.
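To make the recommendation concrete, here is a minimal sketch (the subject strings are made up for illustration) that echoes the pattern as the regex engine receives it:

```php
<?php
// Four backslashes in the PHP literal become two in the pattern,
// which the regex engine reads as one literal backslash.
$pattern = '/\\\\/';
echo $pattern, "\n"; // prints /\\/

$matched = preg_match($pattern, 'C:\\temp'); // subject is C:\temp
var_dump($matched); // int(1)

// Followed by \s, four backslashes keep "\s" meaning whitespace.
$backslashThenSpace = preg_match('/\\\\\s/', '\\ '); // subject is a backslash and a space
var_dump($backslashThenSpace); // int(1)
```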
Escape sequences.
To avoid this kind of unclear code you can use \x5c
Like this :)
echo preg_replace( '/\x5c\w+\.php$/i', '<b>${0}</b>', __FILE__ );
The thing is, you're using a character class, [], so it doesn't matter how many literal backslashes are embedded in it, it'll be treated as a single backslash.
e.g. the following two regexes:
/[a]/
/[aa]/
are for all intents and purposes identical as far as the regex engine is concerned. Character classes take a list of characters and "collapse" them down to match a single character, along the lines of "for the current character being considered, is it any of the characters listed inside the []?". If you list two backslashes in the class, then it'll be "is the char a backslash or is it a backslash?".
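This can be checked with a short sketch (the subject strings are arbitrary examples):

```php
<?php
// Duplicates inside a character class add nothing: the class still matches one character.
$subject = 'banana';
$single = preg_match_all('/[a]/', $subject);
$doubled = preg_match_all('/[aa]/', $subject);
var_dump($single === $doubled); // bool(true)

// The same holds for backslashes: one or two listed in the class behave alike.
// '/[\\\\]/' reaches the engine as /[\\]/, '/[\\\\\\\\]/' as /[\\\\]/.
$one = preg_match_all('/[\\\\]/', 'a\\b\\c');         // subject is a\b\c
$two = preg_match_all('/[\\\\\\\\]/', 'a\\b\\c');
var_dump($one, $two); // int(2) int(2)
```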
I studied this years ago. The 1st backslash escapes the 2nd one, and together they form a 'true backslash' character in the pattern; this true one then escapes the 3rd one. That is how 3 backslashes happen to work.
However, normal suggestion is to use 4 backslashes instead of the ambiguous 3 backslashes.
If I'm wrong about anything, please feel free to correct me.
The answer https://stackoverflow.com/a/15369828/2311074 is very illustrative, but if you don't know the core problem of backslashes in PHP strings you won't understand it at all.
The core problem of backslashes in PHP strings is explained at https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.single. You may want to pay attention to the last two sentences:
The simplest way to specify a string is to enclose it in single quotes
(the character ').
To specify a literal single quote, escape it with a backslash (\). To
specify a literal backslash, double it (\\). All other instances of
backslash will be treated as a literal backslash.
So in short, two backslashes in a string represent a literal backslash. A single backslash that is not followed by a ' or by another backslash also represents a literal backslash.
This is a bit odd, but it means a string '\\xxx' and '\xxx' both represent the same string \xxx.
Note, that '\\'xxx' is an invalid string whereas '\'xxx' represents the string 'xxx.
I guess it originates from this: If you want to have a literal single quote, you need to escape it with backslash. So 'hi\'' represents the string hi'.
But now you end up in the situation where you may want to create the string hi\, and 'hi\' no longer works (the \' is read as an escaped quote, so the string never ends). Therefore an extra escape was needed to strip \ of its special meaning: \ escapes \, and hi\ can be written as 'hi\\'.
And this is the reason why three backslashes followed by an ordinary character, as in '\\\/', give the same string as four, as in '\\\\/' (both contain \\), so for those two it does not matter which one you use.
However, it has the surprising effect that if you double the number of backslashes, the results are no longer the same.
This is because 3 backslashes in a single-quoted string represent 2 literal backslashes, but 6 backslashes represent only 3. In contrast, 4 backslashes represent 2 literal backslashes and 8 represent 4 (see the examples from MikeM). Thus, it's recommended to always use 4 instead of 3.
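The counts above can be reproduced with a small sketch. A trailing x is appended to every literal (an assumption made only for this demo) so that the odd-length runs are not followed directly by the closing quote:

```php
<?php
// Count how many backslashes survive PHP's single-quote parsing.
// Keys are the number of backslashes typed in the source code.
$literals = [
    3 => '\\\x',
    4 => '\\\\x',
    6 => '\\\\\\x',
    8 => '\\\\\\\\x',
];

$surviving = [];
foreach ($literals as $written => $value) {
    // '\\' is a needle of exactly one backslash.
    $surviving[$written] = substr_count($value, '\\');
    echo "$written written -> {$surviving[$written]} in the string\n";
}
```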
You can also use the following
$regexp = <<<EOR
schemaLocation\s*=\s*["'](.*?)["']
EOR;
preg_match_all("/".$regexp."/", $xml, $matches);
print_r($matches);
Keywords: heredoc, nowdoc
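Note that the two notations treat backslashes differently: a heredoc (<<<EOR) processes escapes like a double-quoted string, while a nowdoc (<<<'EOR') takes the body verbatim. A minimal sketch of the difference, using a made-up pattern for a \\server style name:

```php
<?php
// Nowdoc: the body is taken verbatim, so the pattern reads exactly
// as the regex engine will see it (four backslashes remain).
$nowdoc = <<<'EOR'
/^\\\\[a-z]+$/
EOR;

// Heredoc: backslash escapes are processed as in a double-quoted string,
// so each "\\" collapses to one backslash (two backslashes remain).
$heredoc = <<<EOR
/^\\\\[a-z]+$/
EOR;

var_dump($nowdoc === $heredoc); // bool(false)

// Subject is \\server: only the nowdoc pattern requires two leading backslashes.
var_dump(preg_match($nowdoc, '\\\\server'));  // int(1)
var_dump(preg_match($heredoc, '\\\\server')); // int(0)
```

So for patterns that contain backslashes, nowdoc is probably the notation that removes the extra escaping layer.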
I'll give my example in PHP. I am testing if quoted strings are properly closed (e.g., quoted string must close with double quotes if begins with dq). There must be at least 1 char between the quotes, and that char-set between the quotes cannot include the same start/end quote character. For example:
$myString = "hello";// 'hello' also good but "hello' should fail
if (preg_match("/^(\")?[^\"]+(?(1)\")|(\')?[^\']+(?(1)\')$/", $myString)) {
die('1');
} else {
die('2');
}
// The string '1' is outputted which is correct
I am new to conditional regex but to me it seems that I cannot make the preg_match() any simpler. Is this correct?
To do that, there's no need to use the "conditional feature". But you need to check the string from the start until the end (in other words, you can't do it by checking only a part of the string):
preg_match('~\A[^"\']*+(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
If you absolutely want at least one character inside quotes, you need to add these lookaheads (?=[^"]) and (?=[^']):
preg_match('~\A[^"\']*+(?:"(?=[^"])[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'(?=[^\'])[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
details:
~
\A # start of the string
[^"']*+ #"# all that is not a quote
(?:
" #"# opening quote
(?=[^"]) #"# at least one character that isn't a quote
[^"\\]*+ #"# all characters that are not quotes or backslashes
(?:\\.[^"\\]*)*+ #"# an escaped character and the same (zero or more times)
" #"# closing quote
[^"']*
| #"# or same thing for single quotes
'(?=[^'])[^'\\]*+(?:\\.[^'\\]*)*+'[^"']*
)*+
\z # end of the string
~s # singleline mode: the dot matches newlines too
Note that these patterns are designed to deal with escaped characters.
Most of the time a conditional can be replaced with a simple alternation.
As an aside: don't assume that shorter patterns are always better than longer ones; that is a misconception.
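For the simple case in the question (one quoted string, at least one character inside, no handling of escaped quotes), replacing the conditional with an alternation might look like the sketch below. The function name is hypothetical:

```php
<?php
// Hypothetical helper: true when $s is wrapped in matching quotes with
// at least one character inside and no quote of the same kind in between.
// Escaped quotes are deliberately NOT handled here.
function isProperlyQuoted(string $s): bool
{
    // Regex seen by the engine: ~\A(?:"[^"]+"|'[^']+')\z~
    return preg_match('~\A(?:"[^"]+"|\'[^\']+\')\z~', $s) === 1;
}

var_dump(isProperlyQuoted('"hello"'));  // bool(true)
var_dump(isProperlyQuoted("'hello'"));  // bool(true)
var_dump(isProperlyQuoted('"hello\'')); // bool(false)
var_dump(isProperlyQuoted('""'));       // bool(false)
```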
Based on the two observations below, I built my regex to be simple and fast, but not to deal with escaped quotes:
The OP was asked specifically whether the string $str = "hello, I said: \"How are you?\"" would be invalid and did not respond.
The OP mentioned performance (efficiency) as a criterion.
I'm also not a fan of code that is tough to read, so I used the <<< Nowdoc notation to avoid having to escape anything in the regex pattern
My solution:
$strings = [
"'hello's the word'",
"'hello is the word'",
'"hello "there" he said"',
'"hello there he said"',
'"Hi',
"'hello",
"no quotes",
"''"
];
$regexp = <<< 'TEXT'
/^('|")(?:(?!\1).)+\1$/
TEXT;
foreach ($strings as $string):
echo "$string - ".(preg_match($regexp,$string)?'true':'false')."<br/>";
endforeach;
Output:
'hello's the word' - false
'hello is the word' - true
"hello "there" he said" - false
"hello there he said" - true
"Hi - false
'hello - false
no quotes - false
'' - false
How it works:
^('|") //starts with single or double-quote
(?: //non-capturing group
(?!\1) //next char is not the same as first single/double quote
. //advance one character
)+ //repeat group with next char (there must be at least one char)
\1$ //End with the same single or double-quote that started the string
I am pretty familiar with PHP, but I am brand new to regular expressions in PHP. I am trying to figure out how to allow only a-z, A-Z, 0-9, :, ' (single quote), " (double quote), +, -, ., , (comma), &, !, *, (, and ).
I have found several working examples of what I am looking for EXCEPT how to allow the single quote and the double quote.
An Example of what I am looking for is:
Hello, this is just an example of what I am looking for: "Hello World!".
I am trying to validate a textarea $_POST['suggestion'] using:
$errors = array();
if(!preg_match('insert regular expression',$_POST['suggestion'])){
$errors['suggestion2'] = "Invalid";
}
With everything I have tried, I always get:
An Example of what I am looking for is: Hello, this is just an example of what I am looking for: \"Hello World!\".
I don't understand why the \ characters are in front of the quotes?
You can use the following regex:
[a-zA-Z0-9:'"+.,&!*()-]
Note that the hyphen - is placed at the end position so as not to form a range (and it can match a literal -). +, *, ., ( and ) do not have to be escaped inside a character class. Generally, ^, - and ] should be escaped, but not when they sit in a position where they cannot be misread: - at the start or end of the class, ^ anywhere except the start, and ] right at the start. \ must be escaped in the character class, but you are not allowing it.
Also, if you want to match chunks of allowed symbols, add a + quantifier after the character class: [a-zA-Z0-9:'"+.,&!*()-]+.
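To validate a whole field rather than find individual characters, the class can be anchored at both ends. The sketch below is only one way to do it; the function name is hypothetical, and a space has been added to the class as an assumption, since the example sentence in the question contains spaces:

```php
<?php
// Hypothetical validator: the whole input must consist of allowed characters.
// Assumption: a space is added to the class; it is not in the answer's list.
function isValidSuggestion(string $text): bool
{
    // Regex seen by the engine: /\A[a-zA-Z0-9:'"+.,&!*() -]+\z/
    return preg_match('/\A[a-zA-Z0-9:\'"+.,&!*() -]+\z/', $text) === 1;
}

var_dump(isValidSuggestion('Hello, this is an example: "Hello World!".')); // bool(true)
var_dump(isValidSuggestion('No <tags> please'));                           // bool(false)
```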
Sample PHP code:
$re = "/[a-zA-Z0-9:'\"+.,&!*()-]/";
$str = "a-zA-Z0-9:'\"+.,&!*()-";
preg_match_all($re, $str, $matches);
EDIT:
Since you updated the question, here is how to turn off the automatic escaping of double quotes in earlier PHP versions (magic quotes were removed in PHP 5.4). As one option, you can add php_flag magic_quotes_gpc Off to your .htaccess file.
Why is it that when searching for a backslash in a regex you need to write the backslash 4 times?
Example:
$pattern = '/\\\\/';
$string = 'to\m';
preg_match( $pattern, $string, $matches );
echo "<pre>";
print_r($matches);
echo "</pre>";
Returns:
Array
(
[0] => \
)
Because there are two levels of parsing being done, once by PHP, and a second time by the regular expression engine:
The intended target: \
Well I need to put that in a string without it escaping the character after it: "\\", PHP sees \
Now I need to feed that into a regex: "\\\\" PHP sees \\, regex engine sees \
The function preg_quote() will remove a layer of confusion for you by escaping all regular expression metacharacters for you. eg:
$foo = preg_quote("c:\\some\\path\\or_whatever", '/'); // the second argument also escapes the / delimiter
preg_match("/$foo/", $bar);
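A slightly fuller sketch of the same idea, with a made-up subject string, shows the quoted value being used inside a pattern:

```php
<?php
// The path as PHP stores it: c:\some\path\or_whatever
$path = 'c:\\some\\path\\or_whatever';

// preg_quote() escapes regex metacharacters, including the backslash.
// Passing '/' as the second argument also escapes the delimiter used below.
$quoted = preg_quote($path, '/');
echo $quoted, "\n";

// Hypothetical subject text containing the path.
$subject = 'Saved to c:\\some\\path\\or_whatever today';
$found = preg_match("/$quoted/", $subject);
var_dump($found); // int(1)
```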
edit
You seem to be thinking of this as "units of \\", which doesn't seem like an accurate depiction of what is happening. For a better example let's use a different character that is also significant in both PHP and regular expressions, $.
Intended target: $
Escaping for a PHP string: "\$", the literal string seen by PHP is $
Escaping for a PHP string to be interpreted as a literal $ in a regular expression:
"\\\$", PHP sees the literal string \$, the regular expression sees the literal string $
Illustrated with different styles of braces representing different levels of escaping:
0: $ $
1: \$ [\$]
2: \\\$ [{\\}{\$}]
0: \ \
1: \\ [\\]
2: \\\\ [{\\}{\\}]
0: \\server\$c\Windows
1: [\\][\\]server[\\][\$]c[\\]Windows
2: [{\\}{\\}][{\\}{\\}]server[{\\}{\\}][{\\}{\$}]c[{\\}{\\}]Windows
Which also illustrates why dealing with Windows paths sucks butts.
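The three levels for $ can be verified with a short sketch (variable names are only illustrative):

```php
<?php
$target = '$';          // level 0: the character to match
$phpString = "\$";      // level 1: escaped for a double-quoted PHP string
$regexInPhp = "/\\\$/"; // level 2: PHP yields /\$/, which matches a literal $

var_dump($phpString === $target);         // bool(true)
var_dump($regexInPhp === '/\$/');         // bool(true)
var_dump(preg_match($regexInPhp, 'US$')); // int(1)
```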
This is because the backslash has a special meaning in both a php string and a regular expression, so you must escape it twice:
To match a single backslash, the pure regex should be:
/\\/
If it was:
/\/
then the backslash would escape the forward slash, leaving an invalid regex that tries to match a single forward slash but is missing its closing delimiter.
Then, this pure regex is put into a php string, and each backslash is again escaped:
'/\\\\/'
Because a backslash is a special character, you need to escape it twice. So \\ for the first backslash, and \\ for the second.
I'm after a bit of regex to be used in PHP to validate a UNC path passed through a form. It should be of the format:
\\server\something
... and allow for further sub-folders. It might be good to strip off a trailing slash for consistency although I can easily do this with substr if need be.
I've read online that matching a single backslash in PHP requires 4 backslashes (when using a "C like string") and think I understand why that is: PHP escaping first (e.g. 2 = 1, so 4 = 2), then regex engine escaping (the remaining 2 = 1). I've seen the following two quoted as equivalent, suitable regexes to match a single backslash:
$regex = "/\\\\/s";
or apparently this also:
$regex = "/[\\]/s";
However these produce different results, and that is slightly aside from my final aim to match a complete UNC path.
To see if I could match two backslashes I used the following to test:
$path = "\\\\server";
echo "the path is: $path <br />"; // which is \\server
$regex = "/\\\\\\\\/s";
if (preg_match($regex, $path))
{
echo "matched";
}
else
{
echo "not matched";
}
The above however seems to match on two or more backslashes :( The pattern is 8 backslashes, translating to 2, so why would an input of 3 backslashes ($path = "\\\\\\server") match?
I thought perhaps the following would work:
$regex = "/[\\][\\]/s";
and again, no :(
Please help before I jump out a window lol :)
Use this little gem:
$UNC_regex = '=^\\\\\\\\[a-zA-Z0-9-]+(\\\\[a-zA-Z0-9`~!##$%^&(){}\'._-]+([ ]+[a-zA-Z0-9`~!##$%^&(){}\'._-]+)*)+$=s';
Source: http://regexlib.com/REDetails.aspx?regexp_id=2285 (adapted to PHP string escaping)
The RegEx shown above matches a valid hostname (which allows only a few characters) followed by the path part (which allows many, but not all, characters).
Sidenote on the backslashes issue:
When you use double quotes (") to enclose your string, you must be aware of PHP special character escaping.. "\\" is a single \ in PHP.
Important: even with single quotes (') those backslashes must be escaped.
A PHP string with single quotes takes everything in the string literally (unescaped) with a few exceptions:
A backslash followed by a backslash (\\) is interpreted as a single backslash.
('C:\\*.*' => C:\*.*)
A backslash followed by a single-quote (\') is interpreted as a single quote.
('I\'ll be back' => I'll be back)
A backslash followed by anything else is interpreted as a backslash.
('Just a \ somewhere' => Just a \ somewhere)
Also, you must be aware of PCRE escape sequences.
The RegEx parser uses \ to introduce escape sequences and shorthand character classes (such as \d), so you need to escape it again for the RegEx.
To match two \\ you must write $regex = "\\\\\\\\" or $regex = '\\\\\\\\'
From the PHP docs on PCRE escape sequences:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
Regarding your Question:
why would an input of 3 backslashes ($path = "\\\\\\server") match with regex "/\\\\\\\\/s"?
The reason is that you have no boundaries defined (use ^ for beginning and $ for end of string), thus it finds \\ "somewhere" resulting in a positive match. To get the expected result, you should do something like this:
$regex = '/^\\\\\\\\[^\\\\]/s';
The RegEx above has 2 modifications:
^ at the beginning to only match two \\ at the beginning of the string
[^\\] negative character class to say: not followed by an additional backslash
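The effect of the two modifications can be seen in this sketch, which runs both patterns against a path with two and a path with three leading backslashes:

```php
<?php
$unanchored = '/\\\\\\\\/s';       // regex: \\\\ (two backslashes anywhere)
$anchored = '/^\\\\\\\\[^\\\\]/s'; // regex: ^\\\\[^\\]

$two = '\\\\server';     // \\server
$three = '\\\\\\server'; // \\\server

// Without anchors, both subjects contain two consecutive backslashes somewhere.
var_dump(preg_match($unanchored, $two), preg_match($unanchored, $three)); // int(1) int(1)

// With ^ and the negated class, only exactly two leading backslashes match.
var_dump(preg_match($anchored, $two), preg_match($anchored, $three)); // int(1) int(0)
```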
Regarding your last RegEx:
$regex = "/[\\][\\]/s";
You have confused the backslash escaping here (see above for clarification). "/[\\][\\]/s" is interpreted by PHP as /[\][\]/s, which makes the RegEx fail to compile: each \] is read as an escaped ], so the character class is never closed.
This variant of your RegEx would work, but it would also match any occurrence of two backslashes, for the same reason I already explained above:
$regex = '/[\\\\][\\\\]/s';
Echo your regex as well, so you can see the actual pattern. Writing those backslashes inside PHP strings quickly becomes awkward, and echoing lets you verify that the pattern is correct.
Also, you should put ^ at the beginning of the pattern to match from the start of the string and $ at the end, to specify that the whole string has to be matched.
\\server\something
Regex:
~^\\\\server\\something$~
PHP String:
$pattern = '~^\\\\\\\\server\\\\something$~';
For the repetition, you want to say that a server exists and it's followed by one or more \something parts. If server is like something, this can be simplified:
^\\(?:\\[a-z]+){2,}$
PHP String:
$pattern = '~^\\\\(?:\\\\[a-z]+){2,}$~';
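Putting this together with the trailing-backslash stripping mentioned in the question, a validation helper could be sketched as follows. The function name is hypothetical, and the lowercase-only [a-z]+ segments are carried over from the simplified pattern above rather than being a full UNC grammar:

```php
<?php
// Hypothetical helper: returns the path without trailing backslashes
// when it looks like \\server\something, or null otherwise.
function normalizeUncPath(string $path): ?string
{
    // Drop trailing backslashes for consistency.
    $trimmed = rtrim($path, '\\');

    // Regex seen by the engine: ^\\(?:\\[a-z]+){2,}$
    $pattern = '~^\\\\(?:\\\\[a-z]+){2,}$~';

    return preg_match($pattern, $trimmed) === 1 ? $trimmed : null;
}

var_dump(normalizeUncPath('\\\\server\\something\\')); // the string \\server\something
var_dump(normalizeUncPath('\\server\\something'));     // NULL
```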
As there was some confusion about how \ characters should be written inside single quoted strings:
# Output:
#
# * Definition as '\\' ....... results in string(1) "\"
# * Definition as '\\\\' ..... results in string(2) "\\"
# * Definition as '\\\\\\' ... results in string(3) "\\\"
$slashes = array(
'\\',
'\\\\',
'\\\\\\',
);
foreach($slashes as $i => $slashed) {
$definition = sprintf('%s ', var_export($slashed, 1));
ob_start();
var_dump($slashed);
$result = rtrim(ob_get_clean());
printf(" * Definition as %'.-12s results in %s\n", $definition, $result);
}