PHP Regex for matching a UNC path - php

I'm after a bit of regex to be used in PHP to validate a UNC path passed through a form. It should be of the format:
\\server\something
... and allow for further sub-folders. It might be good to strip off a trailing slash for consistency although I can easily do this with substr if need be.
I've read online that matching a single backslash in PHP requires 4 backslashes (when using a "C like string"), and I think I understand why: PHP string escaping comes first (2 = 1, so 4 = 2), then regex engine escaping (the remaining 2 = 1). I've seen the following two quoted as equivalent regexes for matching a single backslash:
$regex = "/\\\\/s";
or apparently this also:
$regex = "/[\\]/s";
However these produce different results, and that is slightly aside from my final aim to match a complete UNC path.
To see if I could match two backslashes I used the following to test:
$path = "\\\\server";
echo "the path is: $path <br />"; // which is \\server
$regex = "/\\\\\\\\/s";
if (preg_match($regex, $path))
{
echo "matched";
}
else
{
echo "not matched";
}
The above however seems to match on two or more backslashes :( The pattern is 8 slashes, translating to 2, so why would an input of 3 backslashes ($path = "\\\\\\server") match?
I thought perhaps the following would work:
$regex = "/[\\][\\]/s";
and again, no :(
Please help before I jump out a window lol :)

Use this little gem:
$UNC_regex = '=^\\\\\\\\[a-zA-Z0-9-]+(\\\\[a-zA-Z0-9`~!##$%^&(){}\'._-]+([ ]+[a-zA-Z0-9`~!##$%^&(){}\'._-]+)*)+$=s';
Source: http://regexlib.com/REDetails.aspx?regexp_id=2285 (adapted to PHP string escaping)
The RegEx shown above matches for valid hostname (which allows only a few valid characters) and the path part behind the hostname (which allows many, but not all characters)
Sidenote on the backslashes issue:
When you use double quotes (") to enclose your string, you must be aware of PHP special character escaping. "\\" is a single \ in PHP.
Important: even with single quotes (') those backslashes must be escaped.
A PHP string with single quotes takes everything in the string literally (unescaped) with a few exceptions:
A backslash followed by a backslash (\\) is interpreted as a single backslash.
('C:\\*.*' => C:\*.*)
A backslash followed by a single-quote (\') is interpreted as a single quote.
('I\'ll be back' => I'll be back)
A backslash followed by anything else is interpreted as a backslash.
('Just a \ somewhere' => Just a \ somewhere)
Also, you must be aware of PCRE escape sequences.
The RegEx parser also uses \ as its escape character (for example in \d or \s), so a literal backslash has to be escaped again for the RegEx.
To match two \\ you must write $regex = "\\\\\\\\" or $regex = '\\\\\\\\'
From the PHP docs on PCRE escape sequences:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
Regarding your Question:
why would an input of 3 backslashes ($path = "\\\\\\server") match with regex "/\\\\\\\\/s"?
The reason is that you have no boundaries defined (use ^ for beginning and $ for end of string), thus it finds \\ "somewhere" resulting in a positive match. To get the expected result, you should do something like this:
$regex = '/^\\\\\\\\[^\\\\]/s';
The RegEx above has 2 modifications:
^ at the beginning to only match two \\ at the beginning of the string
[^\\] negative character class to say: not followed by an additional backslash
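To see the effect of the two modifications, here is a minimal sketch that runs the unanchored and the anchored pattern against a two-backslash and a three-backslash input. The variable names are made up for illustration, and the s modifier is left out because it makes no difference here.

```php
<?php
// Unanchored: finds two backslashes anywhere in the subject.
$unanchored = '/\\\\\\\\/';
// Anchored: two backslashes at the start, followed by a non-backslash.
$anchored = '/^\\\\\\\\[^\\\\]/';

$two   = '\\\\server';    // the string \\server
$three = '\\\\\\server';  // the string \\\server

$results = [
    'unanchored_two'   => preg_match($unanchored, $two),   // 1
    'unanchored_three' => preg_match($unanchored, $three), // 1
    'anchored_two'     => preg_match($anchored, $two),     // 1
    'anchored_three'   => preg_match($anchored, $three),   // 0
];
print_r($results);
```

Only the anchored pattern rejects the three-backslash input.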
Regarding your last RegEx:
$regex = "/[\\][\\]/s";
You have confused the backslash escaping here (see above for clarification). PHP turns "/[\\][\\]/s" into /[\][\]/s. In that pattern each \] is an escaped closing bracket, so the first character class is never closed and the RegEx fails to compile.
This variant of your RegEx would work, but it would also match any occurrence of two backslashes for the same reason I already explained above:
$regex = '/[\\\\][\\\\]/s';

Echo your regex as well so you can see the actual pattern. Writing those slashes inside PHP strings gets awkward, and echoing lets you verify the pattern is correct.
Also you should put ^ at the beginning of the pattern to match from string start and $ to the end to specify that the whole string has to be matched.
\\server\something
Regex:
~^\\\\server\\something$~
PHP String:
$pattern = '~^\\\\\\\\server\\\\something$~';
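As a quick check of the "echo your regex" advice, the sketch below prints the PHP string so you can compare it with the intended regex, then tests it against a sample path. The sample path is an assumption taken from the question's format.

```php
<?php
$pattern = '~^\\\\\\\\server\\\\something$~';
$path    = '\\\\server\\something';

echo $pattern, "\n"; // prints ~^\\\\server\\something$~  (what the regex engine receives)
echo $path, "\n";    // prints \\server\something
echo preg_match($pattern, $path), "\n"; // prints 1
```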
For the repetition, you want to say that a server exists and it's followed by one or more \something parts. If server is like something, this can be simplified:
^\\(?:\\[a-z]+){2,}$
PHP String:
$pattern = '~^\\\\(?:\\\\[a-z]+){2,}$~';
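Putting the simplified pattern together with the trailing-slash cleanup mentioned in the question could look like the sketch below. The function name normalizeUncPath is hypothetical, the D modifier is an addition of mine so that $ does not accept a trailing newline, and the [a-z]+ segments are as restrictive as in the pattern above.

```php
<?php
// Hypothetical helper: returns the path without trailing backslashes,
// or null if it does not look like \\server\something[\more...].
function normalizeUncPath(string $path): ?string
{
    // Strip trailing backslashes for consistency (note: this strips all of them).
    $trimmed = rtrim($path, '\\');

    // Regex: ^\\(?:\\[a-z]+){2,}$  with D so that $ only matches at the very end.
    $pattern = '~^\\\\(?:\\\\[a-z]+){2,}$~D';

    return preg_match($pattern, $trimmed) === 1 ? $trimmed : null;
}

var_dump(normalizeUncPath('\\\\server\\share\\'));  // the string \\server\share
var_dump(normalizeUncPath('\\\\\\server\\share'));  // NULL (three leading backslashes)
var_dump(normalizeUncPath('\\\\server'));           // NULL (no share part)
```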
As there was some confusion about how \ characters should be written inside single quoted strings:
# Output:
#
# * Definition as '\\' ....... results in string(1) "\"
# * Definition as '\\\\' ..... results in string(2) "\\"
# * Definition as '\\\\\\' ... results in string(3) "\\\"
$slashes = array(
'\\',
'\\\\',
'\\\\\\',
);
foreach($slashes as $i => $slashed) {
$definition = sprintf('%s ', var_export($slashed, 1));
ob_start();
var_dump($slashed);
$result = rtrim(ob_get_clean());
printf(" * Definition as %'.-12s results in %s\n", $definition, $result);
}

Related

Filename Regular Expression in PHP with Named Capture [duplicate]

Just out of curiosity, I'm trying to figure out which exactly is the right way to escape a backslash for use in a PHP regular expression pattern like so:
TEST 01: (3 backslashes)
$pattern = "/^[\\\]{1,}$/";
$string = '\\';
// ----- RETURNS A MATCH -----
TEST 02: (4 backslashes)
$pattern = "/^[\\\\]{1,}$/";
$string = '\\';
// ----- ALSO RETURNS A MATCH -----
According to the articles below, 4 is supposedly the right way but what confuses me is that both tests returned a match. If both are right, then is 4 the preferred way?
RESOURCES:
http://www.developwebsites.net/match-backslash-preg_match-php/
Can't escape the backslash with regex?
// PHP 5.4.1
// Either three or four \ can be used to match a '\'.
echo preg_match( '/\\\/', '\\' ); // 1
echo preg_match( '/\\\\/', '\\' ); // 1
// Match two backslashes `\\`.
echo preg_match( '/\\\\\\/', '\\\\' ); // Warning: No ending delimiter '/' found
echo preg_match( '/\\\\\\\/', '\\\\' ); // 1
echo preg_match( '/\\\\\\\\/', '\\\\' ); // 1
// Match one backslash using a character class.
echo preg_match( '/[\\]/', '\\' ); // 0
echo preg_match( '/[\\\]/', '\\' ); // 1
echo preg_match( '/[\\\\]/', '\\' ); // 1
When using three backslashes to match a '\' the pattern below is interpreted as match a '\' followed by an 's'.
echo preg_match( '/\\\\s/', '\\ ' ); // 0
echo preg_match( '/\\\\s/', '\\s' ); // 1
When using four backslashes to match a '\' the pattern below is interpreted as match a '\' followed by a space character.
echo preg_match( '/\\\\\s/', '\\ ' ); // 1
echo preg_match( '/\\\\\s/', '\\s' ); // 0
The same applies if inside a character class.
echo preg_match( '/[\\\\s]/', ' ' ); // 0
echo preg_match( '/[\\\\\s]/', ' ' ); // 1
None of the above results are affected by enclosing the strings in double instead of single quotes.
Conclusions:
Whether inside or outside a bracketed character class, a literal backslash can be matched using just three backslashes '\\\' unless the next character in the pattern is also backslashed, in which case the literal backslash must be matched using four backslashes.
Recommendation:
Always use four backslashes '\\\\' in a regex pattern when seeking to match a backslash.
Escape sequences.
To avoid this kind of unclear code you can use \x5c
Like this :)
echo preg_replace( '/\x5c\w+\.php$/i', '<b>${0}</b>', __FILE__ );
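Applied to the UNC prefix from the first question, the same idea might look like the following sketch. It assumes a single-quoted PHP string, where \x5c is passed through to the regex engine unchanged; in a double-quoted string PHP itself would turn \x5c into a backslash and the pattern would change.

```php
<?php
// Regex: exactly two backslashes at the start, then a non-backslash.
// No doubled backslashes are needed because \x5c is resolved by the regex engine.
$pattern = '/^\x5c{2}[^\x5c]/';

echo preg_match($pattern, '\\\\server'), "\n";    // 1 (subject is \\server)
echo preg_match($pattern, '\\\\\\server'), "\n";  // 0 (subject is \\\server)
echo preg_match($pattern, '\\server'), "\n";      // 0 (subject is \server)
```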
The thing is, you're using a character class, [], so it doesn't matter how many literal backslashes are embedded in it, it'll be treated as a single backslash.
e.g. the following two regexes:
/[a]/
/[aa]/
are for all intents and purposes identical as far as the regex engine is concerned. Character classes take a list of characters and "collapse" them down to match a single character, along the lines of "for the current character being considered, is it any of the characters listed inside the []?". If you list two backslashes in the class, then it'll be "is the char a backslash or is it a backslash?".
I studied this years ago. It works because the 1st backslash escapes the 2nd one, and together they form a 'true backslash' character in the pattern; this true one then escapes the 3rd one. So it magically makes 3 backslashes work.
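A small sketch of that collapsing behaviour, with variable names chosen for illustration: a class that lists the backslash twice still matches exactly one character, so a quantifier is needed to match two.

```php
<?php
$single = '\\';    // one backslash
$double = '\\\\';  // two backslashes

// Regex: ^[\\\\]$  (the class lists "backslash" twice, but matches one character)
$classTwice = '/^[\\\\\\\\]$/';
// Regex: ^[\\]{2}$ (one backslash in the class, quantified to match two characters)
$classQuantified = '/^[\\\\]{2}$/';

echo preg_match($classTwice, $single), "\n";      // 1
echo preg_match($classTwice, $double), "\n";      // 0
echo preg_match($classQuantified, $double), "\n"; // 1
```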
However, normal suggestion is to use 4 backslashes instead of the ambiguous 3 backslashes.
If I'm wrong about anything, please feel free to correct me.
The answer https://stackoverflow.com/a/15369828/2311074 is very illustrative, but if you don't know the core problem of backslashes in PHP string you won't understand it at all.
The core problem of backslashes in PHP strings is explained at https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.single You may want to pay attention to the last two sentences:
The simplest way to specify a string is to enclose it in single quotes
(the character ').
To specify a literal single quote, escape it with a backslash (\). To
specify a literal backslash, double it (\\). All other instances of
backslash will be treated as a literal backslash
So in short, two backslashes in a string represent a literal backslash. A single backslash not followed by a ' also represents a literal backslash.
This is a bit odd, but it means a string '\\xxx' and '\xxx' both represent the same string \xxx.
Note, that '\\'xxx' is an invalid string whereas '\'xxx' represents the string 'xxx.
I guess it originates from this: If you want to have a literal single quote, you need to escape it with backslash. So 'hi\'' represents the string hi'.
But now you end up in the situation that you maybe want to create the string hi\ but 'hi\' would not work anymore (invalid string like this without ending '). Therefore, one needed an extra escape to prevent the special meaning from \ Thus, one decided \ escapes \ and hi\ can be written by 'hi\\'.
And this is the reason why three backslashes followed by an ordinary character give the same string as four: '\\\x' is the same as '\\\\x' (both represent \\x), so for those two strings it does not matter which you use. (Three backslashes directly before the closing quote do not work, because the last one would escape the quote.)
However, it has the surprising effect that if you double the number of backslashes, the results are no longer the same.
This is because 3 backslashes in a single-quoted string represent 2 literal backslashes, but 6 backslashes represent only 3 literal backslashes. In contrast, 4 backslashes represent 2 literal backslashes and 8 backslashes represent 4 (see the examples from MikeM). Thus, it's recommended to always use 4 instead of 3.
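The counting can be checked directly with strlen. This is only a sketch; the trailing x is there because a backslash directly before the closing quote would escape the quote.

```php
<?php
$three = '\\\x';       // \\ becomes \, then \x stays as it is: 2 backslashes + x
$four  = '\\\\x';      // two pairs: 2 backslashes + x
$six   = '\\\\\\x';    // three pairs: 3 backslashes + x
$eight = '\\\\\\\\x';  // four pairs: 4 backslashes + x

var_dump($three === $four); // bool(true)
var_dump(strlen($three));   // int(3)
var_dump(strlen($six));     // int(4)
var_dump(strlen($eight));   // int(5)
```

Doubling the three-backslash form gives 3 literal backslashes, while doubling the four-backslash form gives 4, which is why the two forms stop being interchangeable.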
You can also use the following
$regexp = <<<EOR
schemaLocation\s*=\s*["'](.*?)["']
EOR;
preg_match_all("/".$regexp."/", $xml, $matches);
print_r($matches);
keywords: heredoc, nowdoc
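For backslash-heavy patterns the nowdoc form (quoted identifier) is the more relevant one, because PHP does no escape processing inside it at all. A minimal sketch, assuming a simple \\server\share shape as the target:

```php
<?php
// Nowdoc: the text between the markers is taken literally, so the regex
// can be written exactly as the regex engine should see it.
$regexp = <<<'EOR'
^\\\\[a-z]+\\[a-z]+$
EOR;

$path = '\\\\server\\share'; // the string \\server\share

echo $regexp, "\n";                               // prints ^\\\\[a-z]+\\[a-z]+$
echo preg_match('/' . $regexp . '/', $path), "\n"; // prints 1
```

Only the regex level of escaping remains: four backslashes in the nowdoc match two in the subject.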

Double quotes in PHP regular expression?

I am pretty familiar with PHP, but I am brand new with regular expressions in PHP. I am trying to figure out how to only allow a-z, A-Z, 0-9, :, ' (single quote), " (double quote), +, -, ., , (comma), &, !, *, (, and ).
I have found several working examples of what I am looking for EXCEPT how to allow the single quote and the double quote.
An Example of what I am looking for is:
Hello, this is just an example of what I am looking for: "Hello World!".
I am trying to validate a textarea $_POST['suggestion'] using:
$errors = array();
if(!preg_match('insert regular expression',$_POST['suggestion'])){
$errors['suggestion2'] = "Invalid";
}
With everything I have tried, I always get:
An Example of what I am looking for is: Hello, this is just an example of what I am looking for: \"Hello World!\".
I don't understand why the \ are in front of the quotes?
You can use the following regex:
[a-zA-Z0-9:'"+.,&!*()-]
Note that the hyphen - is placed at the end position so as not to form a range (and it can match a literal -). +, *, ., ( and ) do not have to be escaped inside a character class. Generally, ^, - and ] should be escaped, but depending on their position they do not have to be: - is literal at the start or end of the class, ] is literal directly after the opening [, and ^ is literal anywhere except the first position. \ must be escaped in the character class, but you are not allowing it.
Also, if you want to match chunks of allowed symbols, add a + quantifier after the character class: [a-zA-Z0-9:'"+.,&!*()-]+.
Sample PHP code:
$re = "/[a-zA-Z0-9:'\"+.,&!*()-]/";
$str = "a-zA-Z0-9:'\"+.,&!*()-";
preg_match_all($re, $str, $matches);
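For validating the whole textarea rather than finding individual allowed characters, an anchored variant is one option. The sketch below makes a few assumptions that are not in the answer: whitespace (\s) is added to the class because the sample sentence contains spaces, the function name isAllowedText is hypothetical, and the D modifier keeps $ from accepting a trailing newline that \s would otherwise cover anyway.

```php
<?php
// Hypothetical helper: true if $text consists only of the allowed characters.
function isAllowedText(string $text): bool
{
    // Single-quoted PHP string: only the ' needs a PHP-level escape.
    // Regex: ^[a-zA-Z0-9:'"+.,&!*()\s-]+$
    $re = '/^[a-zA-Z0-9:\'"+.,&!*()\s-]+$/D';
    return preg_match($re, $text) === 1;
}

var_dump(isAllowedText('Hello, this is just an example: "Hello World!".')); // bool(true)
var_dump(isAllowedText('Hello \"World\"'));                                 // bool(false)
var_dump(isAllowedText('a < b'));                                           // bool(false)
```

The second call fails because the backslashes are not in the allowed set, which is the situation described in the question when the quotes arrive escaped.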
EDIT:
Since you updated the question, here is the information to turn off escaping double quotes in earlier PHP versions. As one of the options, you may go to .htaccess file and set php_flag magic_quotes_gpc Off.

regex searching for a backslash

why is it that in searching for a backslash in a regex you need to escape the backslash 4 times?
Example:
$pattern = '/\\\\/';
$string = 'to\m';
preg_match( $pattern, $string, $matches );
echo "<pre>";
print_r($matches);
echo "</pre>";
Returns:
Array
(
[0] => \
)
Because there are two levels of parsing being done, once by PHP, and a second time by the regular expression engine:
The intended target: \
Well I need to put that in a string without it escaping the character after it: "\\", PHP sees \
Now I need to feed that into a regex: "\\\\" PHP sees \\, regex engine sees \
The function preg_quote() will remove a layer of confusion for you by escaping all regular expression metacharacters for you. eg:
$foo = preg_quote("c:\\some\\path\\or_whatever");
preg_match("/$foo/", $bar);
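A slightly fuller sketch of the same idea, passing the delimiter as the second argument to preg_quote so that it would also be escaped if it occurred in the input. The subject string is made up for illustration.

```php
<?php
$path    = 'c:\\some\\path\\or_whatever';  // the string c:\some\path\or_whatever
$subject = 'saved to c:\\some\\path\\or_whatever today';

// preg_quote() adds the regex-level escaping; only the PHP-level escaping is written by hand.
$quoted = preg_quote($path, '/');
$found  = preg_match('/' . $quoted . '/', $subject, $m);

echo $quoted, "\n"; // the path with every backslash (and other metacharacters) escaped
echo $found, "\n";  // 1
echo $m[0], "\n";   // c:\some\path\or_whatever
```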
edit
You seem to be thinking of this as "units of \\", which doesn't seem like an accurate depiction of what is happening. For a better example let's use a different character that is also significant in both PHP and regular expressions, $.
Intended target: $
Escaping for a PHP string: "\$", the literal string seen by PHP is $
Escaping for a PHP string to be interpreted as a literal $ in a regular expression:
"\\\$", PHP sees the literal string \$, the regular expression sees the literal string $
Illustrated with different styles of braces representing different levels of escaping:
0: $ $
1: \$ [\$]
2: \\\$ [{\\}{\$}]
0: \ \
1: \\ [\\]
2: \\\\ [{\\}{\\}]
0: \\server\$c\Windows
1: [\\][\\]server[\\][\$]c[\\]Windows
2: [{\\}{\\}][{\\}{\\}]server[{\\}{\\}][{\\}{\$}]c[{\\}{\\}]Windows
Which also illustrates why dealing with Windows paths sucks butts.
This is because the backslash has a special meaning in both a php string and a regular expression, so you must escape it twice:
To match a single backslash, the pure regex should be:
/\\/
If it was:
/\/
, the backslash would be escaping the forward slash, leading to an invalid regex matching a single forward slash, but missing its ending slash.
Then, this pure regex is put into a php string, and each backslash is again escaped:
'/\\\\/'
Because a backslash is a special character, you need to escape it twice. So \\ for the first backslash, and \\ for the second.

Backslash in Regex- PHP

I am trying to learn Regex in PHP and stuck in here now. My ques may appear silly but pls do explain.
I went through a link:
Extra backslash needed in PHP regexp pattern
But I just could not understand something:
In the answer he mentions two statements:
2 backslashes are used for unescaping in a string ("\\\\" -> \\)
1 backslash is used for unescaping in the regex engine (\\ -> \)
My ques:
What does the word "unescaping" actually mean? What is the purpose of unescaping?
Why do we need 4 backslashes to include it in the regex?
The backslash has a special meaning in both regexen and PHP. In both cases it is used as an escape character. For example, if you want to write a literal quote character inside a PHP string literal, this won't work:
$str = ''';
PHP would get "confused" which ' ends the string and which is part of the string. That's where \ comes in:
$str = '\'';
It escapes the special meaning of ', so instead of terminating the string literal, it is now just a normal character in the string. There are more escape sequences like \n as well.
This now means that \ is a special character with a special meaning. To escape this conundrum when you want to write a literal \, you'll have to escape literal backslashes as \\:
$str = '\\'; // string literal representing one backslash
This works the same in both PHP and regexen. If you want to write a literal backslash in a regex, you have to write /\\/. Now, since you're writing your regexen as PHP strings, you need to double escape them:
$regex = '/\\\\/';
One pair of \\ is first reduced to one \ by the PHP string escaping mechanism, so the actual regex is /\\/, which is a regex which means "one backslash".
I think you can use "preg_quote()":
http://php.net/preg_quote
This function escapes special chars, so you can give an input as it is, without escaping by yourself:
<?php
$string = "online 24/7. Only for \o/";
$escaped_string = preg_quote($string, "/"); // 2nd param is optional and used if you want to escape also the delimiter of your regex
echo $escaped_string; // $escaped_string: "online 24\/7\. Only for \\o\/"
?>

PCRE regex with lookahead and lookbehind always returns true

I’m trying to create a regex for form validation but it always returns true. The user must be able to add something like {user|2|S} as input but also use brackets if they are escaped with \.
This code checks for the left bracket { for now.
$regex = '/({(?=([a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)}))|[^{]|(?<=\\\){)*/';
if (preg_match($regex, $value)) {
return TRUE;
} else {
return FALSE;
}
A possible correct input would be:
Hello {user|1|S}, you have {amount|2|D2}
or
Hello {user|1|S}, you have {amount|2|D2} in \{the_bracket_bank\}
However, this should return false:
Hello {user|1|S}, you have {amount|2}
and this also:
Hello {user|1|S}, you have {amount|2|D2} in {the_bracket_bank}
A live example can be found here: http://regexr.com?37tpu Note that there is a \ in the lookbehind at the end, PHP was giving me error messages because I had to escape it an extra time in my code.
The main error is that you do not specify that the regex should match from the beginning to the end of the checked string. Use the ^ and $ assertions.
I think you have to escape { and } in your regex as they have special meaning. Together they form a quantifier.
The (?<=\\\) is better written (?<=\\\\). The backslash has to be double escaped as it has special meaning in both single-quoted string and PCRE regex. Using \\\ works too, because if single-quoted string contains any escape sequence except \\ and \', it handles it as literal backslash and letter, therefore \) is taken literally. But explicitly escaping the backslash twice seems easier to read to me.
The regex should be
$regex = '/^(\{(?=([a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\}))|[^{]|(?<=\\\\)\{)*$/';
But notice that the look-around assertions are not necessary. This regex should do the job too:
$regex = '/^([^{]|\\\{|\{[a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\})*$/';
Any non-{ characters are matched by the first alternative. When a { is read, one of the remaining two alternatives is used. Either the pattern for the brace thing matches, or the regex engine backtracks one character and tries to match \{ character sequence. If it fails, both ways, it backtracks further till it reaches string start and fails completely.
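The look-around-free pattern can be tried against the four sample inputs from the question with a sketch like this. The function name isValidTemplate is hypothetical; the pattern itself is the one given above.

```php
<?php
// Hypothetical wrapper around the look-around-free regex.
function isValidTemplate(string $value): bool
{
    $regex = '/^([^{]|\\\{|\{[a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\})*$/';
    return preg_match($regex, $value) === 1;
}

$samples = [
    'Hello {user|1|S}, you have {amount|2|D2}',                            // valid
    'Hello {user|1|S}, you have {amount|2|D2} in \{the_bracket_bank\}',    // valid
    'Hello {user|1|S}, you have {amount|2}',                               // invalid
    'Hello {user|1|S}, you have {amount|2|D2} in {the_bracket_bank}',      // invalid
];

foreach ($samples as $sample) {
    echo isValidTemplate($sample) ? 'valid:   ' : 'invalid: ', $sample, "\n";
}
```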
Matching without lookbehind
You can make a regex for this without using lookbehind/lookaheads (which is usually recommended).
For example, if your requirement is that you can match any character but a { and a } unless it's preceded by a \. You can also say:
Match any character but a { and a } OR match a \{ or a \}. To match any character but a { and a } use:
[^{}]
To match a \{ use:
\\\{
One backslash is for escaping the { (which might not be necessary, depending on your regex compiler) and one backslash is for escaping the other backslash.
You would end up with this:
(?:
[^{}]
|
\\\{
|
\\\}
)+
I nicely formatted this regex so that it's readable. If you want to use it in your code like this, make sure to use the PCRE_EXTENDED (x) modifier.
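In PHP the formatted version could be written roughly as follows. This is a sketch with two additions of mine: the ^ and $ anchors, so that the whole input is checked, and the doubled backslashes required by the single-quoted PHP string.

```php
<?php
// x modifier: whitespace and #-comments inside the pattern are ignored.
$regex = '/^(?:
    [^{}]      # any character except a brace
  |
    \\\\\{     # a backslash followed by an opening brace
  |
    \\\\\}     # a backslash followed by a closing brace
)+$/x';

echo preg_match($regex, 'in \{the_bracket_bank\}'), "\n"; // 1
echo preg_match($regex, 'in {the_bracket_bank}'), "\n";   // 0
```

Note that this only covers the escaped-brace part of the problem; the {user|2|S} placeholders from the question would need a further alternative.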
Looks more of a job for a lookbehind to me:
/((?<!\\\\)\{[a-zA-Z0-9]+\|[0-9]+\|[SD][0-9]*\})/
However, the obfuscation factor is so high that I would rather recognize all bracketed strings and parse them later.
