Regex for matching single-quoted strings fails with PHP - php

So I have this regex:
/'((?:[^\\']|\\.)*)'/
It is supposed to match single-quoted strings while ignoring internal, escaped single quotes \'
It works here, but when executed with PHP, I get different results. Why is that?

This might be easier using negative lookbehind. Note also that you need to escape the slashes twice - once to tell PHP that you want a literal backslash, and then again to tell the regex engine that you want a literal backslash.
Note also that your capturing expression (.*) is greedy - it will capture everything between ' characters, including other ' characters, whether they are escaped or not. If you want it to stop after the first unescaped ', use .*? instead. I have used the non-greedy version in my example below.
<?php
$test = "This is a 'test \' string' for regex selection";
$pattern = "/(?<!\\\\)'(.*?)(?<!\\\\)'/";
echo "Test data: $test\n";
echo "Pattern: $pattern\n";
if (preg_match($pattern, $test, $matches)) {
echo "Matches:\n";
var_dump($matches);
}

This is kinda escaping hell. Despite the fact that there's already an accepted answer, the original pattern is actually better. Why? It allows escaping the escape character itself. The same idea can be expressed using the "unrolling the loop" technique described by Jeffrey Friedl in "Mastering Regular Expressions": "([^\\"]*(?:\\.[^\\"]*)*)" (adapted for single quotes)
Unrolling the Loop (using double quotes)
" # the start delimiter
([^\\"]* # anything but the end of the string or the escape char
(?:\\. # the escape char preceding an escaped char (any char)
[^\\"]* # anything but the end of the string or the escape char
)*) # repeat
" # the end delimiter
This does not resolve the escaping hell but you have been covered here as well:
Sample Code:
$re = '/\'([^\\\\\']*(?:\\\\.[^\\\\\']*)*)\'/';
$str = '\'foo\', \'can\\\'t\', \'bar\'
\'foo\', \' \\\'cannott\\\'\\\\\', \'bar\'
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
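To see why handling an escaped escape character matters, here is a minimal sketch that runs the lookbehind pattern from the earlier answer and the unrolled pattern against the same input. The subject string is made up for illustration: its first quoted string ends with an escaped backslash.

```php
<?php
// Hypothetical subject: 'a\\', 'b'  (the first string ends with an escaped backslash)
$subject = "'a\\\\', 'b'";

$lookbehind = "/(?<!\\\\)'(.*?)(?<!\\\\)'/";
$unrolled   = '/\'([^\\\\\']*(?:\\\\.[^\\\\\']*)*)\'/';

preg_match($lookbehind, $subject, $m1);
preg_match($unrolled, $subject, $m2);

// The lookbehind version treats the real closing quote as escaped
// and runs on to the next quote: a\\', (with a trailing space)
echo $m1[1], "\n";
// The unrolled version consumes \\ as one escaped pair and stops
// at the correct closing quote: a\\
echo $m2[1], "\n";
```

The difference only shows up when a backslash immediately precedes the closing quote, which is exactly the case a lookbehind cannot tell apart from an escaped quote.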

Related

Use preg_replace() to add two backslashes before each match

I have the code below. What do I need to change to get the result mercedes\\-benz instead of mercedes\-benz?
$value = 'mercedes-benz';
$pattern = '/(\+|-|\/|&&|\|\||!|\(|\)|\{|}|\[|]|\^|"|~|\*|\?|:|\\\)/';
$replace = '\\\\${1}';
echo preg_replace($pattern, $replace, $value);
Welcome to the joys of "leaning toothpick syndrome" - backslash is such a commonly used escape character that it frequently requires escaping multiple times. Let's have a look at your case:
Required output (presumably because of some other escaping context): \\
Escape each \ with an additional \ for use in the PCRE regex engine: \\\\
Escape each \ there for use in a PHP string: \\\\\\\\
$value = 'mercedes-benz';
$pattern = '/(\+|-|\/|&&|\|\||!|\(|\)|\{|}|\[|]|\^|"|~|\*|\?|:|\\\)/';
$replace = '\\\\\\\\${1}';
echo preg_replace($pattern, $replace, $value);
As mickmackusa points out, you can get away with six rather than eight backslashes in some cases, such as a replacement of '\\\\\\'; this works because the regex engine sees \\\, which is an escaped backslash (\\) followed by a single backslash (\) that can't be escaping anything because it's the end of the string. Simply doubling for each "layer" of escaping is probably safer than learning when this short-cut is and isn't valid, though.
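The layers described above can be checked with a short sketch. The pattern here is deliberately simplified to a single hyphen (an assumption made for illustration, not the full pattern from the question), so that only the replacement string varies.

```php
<?php
$value   = 'mercedes-benz';
$pattern = '/(-)/'; // simplified pattern, for illustration only

// 4 backslashes in source -> 2 in the PHP string -> 1 literal backslash in the output
$four = preg_replace($pattern, '\\\\${1}', $value);
// 8 backslashes in source -> 4 in the PHP string -> 2 literal backslashes in the output
$eight = preg_replace($pattern, '\\\\\\\\${1}', $value);

echo strlen('\\\\\\\\'), "\n"; // 4: the PHP string holds four backslashes
echo $four, "\n";  // mercedes\-benz
echo $eight, "\n"; // mercedes\\-benz
```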
I can't be sure that I've 100% translated your original attempt, but this works for your lone sample input.
The pattern uses a character class and curly braced quantifiers to improve readability and brevity. Wrapping it in a lookahead makes the match zero-length at the position just before each special character, which eliminates the need for the reference in the replacement string.
Code:
$value = 'mercedes-benz';
$pattern = '`(?=&{2}|\|{2}|[-+/!(){}[\]^"~*?:\\\])`';
$replace = '\\\\\\';
echo preg_replace($pattern, $replace, $value);
Ultimately, the trick was to keep adding backslashes to the replacement to get them to show up.

php -> preg_replace -> remove space ONLY between quotes

I'm trying to remove space ONLY between quotes like:
$text = 'good with spaces "here all spaces should be removed" and here also good';
can someone help with a working piece of code ? I already tried:
$regex = '/(\".+?\")|\s/';
or
$regex = '/"(?!.?\s+.?)/';
without success, and I found a sample that works in the wrong direction :-(
Removing whitespace-characters, except inside quotation marks in PHP? but I can't change it.
thx Newi
This kind of problem is easily solved with preg_replace_callback. The idea is to extract the substring between quotes and then edit it in the callback function:
$text = preg_replace_callback('~"[^"]*"~', function ($m) {
return preg_replace('~\s~', '#', $m[0]);
}, $text);
It's the simplest way. (The '#' replacement only makes the result visible; use an empty string to remove the whitespace.)
It's more complicated to do it with a single pattern with preg_replace but it's possible:
$text = preg_replace('~(?:\G(?!\A)|")[^"\s]*\K(?:\s|"(*SKIP)(*F))~', '#', $text);
Pattern details:
(?:
\G (?!\A) # match the next position after the last successful match
|
" # or the opening double quote
)
[^"\s]* # characters that aren't double quotes or whitespace
\K # discard all characters matched before from the match result
(?:
\s # a whitespace
|
" # or the closing quote
(*SKIP)(*F) # force the pattern to fail and to skip the quote position
# (this way, the closing quote isn't seen as an opening quote
# in the second branch.)
)
This approach uses the \G anchor to ensure that every matched whitespace character is between quotes.
Edge cases:
An orphan opening quote: in this case, all whitespace from the last quote to the end of the string is replaced. If you want, you can change this behavior by adding a lookahead that checks whether the closing quote exists:
~(?:\G(?!\A)|"(?=[^"]*"))[^"\s]*\K(?:\s|"(*SKIP)(*F))~
double quotes can contain escaped double quotes that have to be ignored: You have to describe escaped characters like this:
~(?:\G(?!\A)|")[^"\s\\\\]*+(?:\\\\\S[^"\s\\\\]*)*+(?:\\\\?\K\s|"(*SKIP)(*F))~
Another strategy, suggested by @revo: check whether the number of remaining quotes at a position is odd or even using a lookahead:
\s(?=[^"]*+(?:"[^"]*"[^"]*)*+")
It is a short pattern, but it can be problematic with long strings since for each position with a whitespace you have to check the string until the last quote with the lookahead.
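As a rough check, the callback approach and the lookahead approach can be run side by side on the question's sample text. This sketch assumes the goal is to delete the whitespace, so it replaces with an empty string instead of '#'.

```php
<?php
$text = 'good with spaces "here all spaces should be removed" and here also good';

// Approach 1: isolate each quoted substring, then strip whitespace inside it
$viaCallback = preg_replace_callback('~"[^"]*"~', function ($m) {
    return preg_replace('~\s+~', '', $m[0]);
}, $text);

// Approach 2: remove a whitespace character only when an odd number of quotes follows it
$viaLookahead = preg_replace('~\s(?=[^"]*+(?:"[^"]*"[^"]*)*+")~', '', $text);

echo $viaCallback, "\n";
echo $viaLookahead, "\n";
// both print: good with spaces "hereallspacesshouldberemoved" and here also good
```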
See the following code snippet:
<?php
$text = 'good with spaces "here all spaces should be removed" and here also good';
echo "$text \n";
$regex = '/(\".+?\")|\s/';
$regex = '/"(?!.?\s+.?)/';
$text = preg_replace($regex,'', $text);
echo "$text \n";
?>
I found a sample that works in the wrong direction :-(
@Graham: correct
$text = 'good with spaces "here all spaces should be removed" and here also good';
should be
$text = 'good with spaces "hereallspacesshouldberemoved" and here also good';

Is this conditional regex the most efficient?

I'll give my example in PHP. I am testing if quoted strings are properly closed (e.g., quoted string must close with double quotes if begins with dq). There must be at least 1 char between the quotes, and that char-set between the quotes cannot include the same start/end quote character. For example:
$myString = "hello";// 'hello' also good but "hello' should fail
if (preg_match("/^(\")?[^\"]+(?(1)\")|(\')?[^\']+(?(1)\')$/", $myString)) {
die('1');
} else {
die('2');
}
// The string '1' is outputted which is correct
I am new to conditional regex but to me it seems that I cannot make the preg_match() any simpler. Is this correct?
To do that, there's no need to use the "conditional feature". But you need to check the string from start to end (in other words, you can't do it by checking only part of the string):
preg_match('~\A[^"\']*+(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
If you absolutely want at least one character inside quotes, you need to add these lookaheads (?=[^"]) and (?=[^']):
preg_match('~\A[^"\']*+(?:"(?=[^"])[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'(?=[^\'])[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
details:
~
\A # start of the string
[^"']*+ # all that is not a quote
(?:
" # opening quote
(?=[^"]) # at least one character that isn't a quote
[^"\\]*+ # all characters that are not quotes or backslashes
(?:\\.[^"\\]*)*+ # an escaped character and the same (zero or more times)
" # closing quote
[^"']* # all that is not a quote
| # or same thing for single quotes
'(?=[^'])[^'\\]*+(?:\\.[^'\\]*)*+'[^"']*
)*+
\z # end of the string
~s # singleline mode: the dot matches newlines too
Note that these patterns are designed to deal with escaped characters.
Most of the time a conditional can be replaced with a simple alternation.
As an aside: don't believe that shorter patterns are always better than longer patterns, it's a false idea.
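A small sketch of how the pattern requiring at least one character inside quotes behaves. The helper name quotesAreBalanced and the sample strings are assumptions added for illustration; only the pattern itself comes from the answer.

```php
<?php
// Hypothetical helper wrapping the pattern that requires at least one character inside quotes
function quotesAreBalanced(string $str): bool
{
    $pattern = '~\A[^"\']*+(?:"(?=[^"])[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'(?=[^\'])[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s';
    return preg_match($pattern, $str) === 1;
}

// Sample inputs: "hello"  'hello'  "hello'  ""  "say \"hi\""
$samples = ['"hello"', "'hello'", '"hello\'', '""', '"say \\"hi\\""'];

foreach ($samples as $s) {
    echo $s, ' => ', quotesAreBalanced($s) ? 'valid' : 'invalid', "\n";
}
```

The mismatched pair and the empty pair are rejected, while the string containing escaped double quotes is accepted because each backslash consumes the character after it.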
Based on the two observations below, I built my regex to be simple and fast, but not to deal with escaped quotes:
The OP was asked specifically whether the string $str = "hello, I said: \"How are you?\"" would be invalid and did not respond
The OP mentioned performance (efficiency) as a criterion
I'm also not a fan of code that is tough to read, so I used the <<< Nowdoc notation to avoid having to escape anything in the regex pattern
My solution:
$strings = [
"'hello's the word'",
"'hello is the word'",
'"hello "there" he said"',
'"hello there he said"',
'"Hi',
"'hello",
"no quotes",
"''"
];
$regexp = <<< 'TEXT'
/^('|")(?:(?!\1).)+\1$/
TEXT;
foreach ($strings as $string):
echo "$string - ".(preg_match($regexp,$string)?'true':'false')."<br/>";
endforeach;
Output:
'hello's the word' - false
'hello is the word' - true
"hello "there" he said" - false
"hello there he said" - true
"Hi - false
'hello - false
no quotes - false
'' - false
How it works:
^('|") //starts with single or double-quote
(?: //non-capturing group
(?!\1) //next char is not the same as first single/double quote
. //advance one character
)+ //repeat group with next char (there must be at least one char)
\1$ //End with the same single or double-quote that started the string

PHP Regex for matching a UNC path

I'm after a bit of regex to be used in PHP to validate a UNC path passed through a form. It should be of the format:
\\server\something
... and allow for further sub-folders. It might be good to strip off a trailing slash for consistency although I can easily do this with substr if need be.
I've read online that matching a single backslash in PHP requires 4 backslashes (when using a "C like string") and think I understand why that is: PHP escaping (e.g. 2 = 1, so 4 = 2), then regex engine escaping (the remaining 2 = 1). I've seen the following two quoted as equivalent suitable regex to match a single backslash:
$regex = "/\\\\/s";
or apparently this also:
$regex = "/[\\]/s";
However these produce different results, and that is slightly aside from my final aim to match a complete UNC path.
To see if I could match two backslashes I used the following to test:
$path = "\\\\server";
echo "the path is: $path <br />"; // which is \\server
$regex = "/\\\\\\\\/s";
if (preg_match($regex, $path))
{
echo "matched";
}
else
{
echo "not matched";
}
The above however seems to match on two or more backslashes :( The pattern is 8 slashes, translating to 2, so why would an input of 3 backslashes ($path = "\\\\\\server") match?
I thought perhaps the following would work:
$regex = "/[\\][\\]/s";
and again, no :(
Please help before I jump out a window lol :)
Use this little gem:
$UNC_regex = '=^\\\\\\\\[a-zA-Z0-9-]+(\\\\[a-zA-Z0-9`~!@#$%^&(){}\'._-]+([ ]+[a-zA-Z0-9`~!@#$%^&(){}\'._-]+)*)+$=s';
Source: http://regexlib.com/REDetails.aspx?regexp_id=2285 (adapted to PHP string escaping)
The RegEx shown above matches for valid hostname (which allows only a few valid characters) and the path part behind the hostname (which allows many, but not all characters)
Sidenote on the backslashes issue:
When you use double quotes (") to enclose your string, you must be aware of PHP special character escaping: "\\" is a single \ in PHP.
Important: even with single quotes (') those backslashes must be escaped.
A PHP string with single quotes takes everything in the string literally (unescaped) with a few exceptions:
A backslash followed by a backslash (\\) is interpreted as a single backslash.
('C:\\*.*' => C:\*.*)
A backslash followed by a single-quote (\') is interpreted as a single quote.
('I\'ll be back' => I'll be back)
A backslash followed by anything else is interpreted as a backslash.
('Just a \ somewhere' => Just a \ somewhere)
Also, you must be aware of PCRE escape sequences.
The RegEx parser uses \ to introduce escape sequences and character classes (such as \d or \s), so a literal backslash must be escaped once more for the RegEx.
To match two \\ you must write $regex = "\\\\\\\\" or $regex = '\\\\\\\\'
From the PHP docs on PCRE escape sequences:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
Regarding your Question:
why would an input of 3 backslashes ($path = "\\\\\\server") match with regex "/\\\\\\\\/s"?
The reason is that you have no boundaries defined (use ^ for beginning and $ for end of string), thus it finds \\ "somewhere" resulting in a positive match. To get the expected result, you should do something like this:
$regex = '/^\\\\\\\\[^\\\\]/s';
The RegEx above has 2 modifications:
^ at the beginning to only match two \\ at the beginning of the string
[^\\] negative character class to say: not followed by an additional backslash
Regarding your last RegEx:
$regex = "/[\\][\\]/s";
You have mixed up the backslash escaping here (see above for clarification). "/[\\][\\]/s" is reduced by PHP to /[\][\]/s. In that RegEx each \] is an escaped closing bracket, so the character class is never closed and the RegEx fails.
This variant of your RegEx would work, but it would also match any occurrence of two backslashes, for the same reason I already explained above:
$regex = '/[\\\\][\\\\]/s';
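The effect of the missing boundaries can be seen in a short sketch. The variable names are made up for illustration; the two patterns are the unanchored one from the question and the anchored one suggested above.

```php
<?php
$two   = '\\\\server';    // the string \\server
$three = '\\\\\\server';  // the string \\\server

$unanchored = '/\\\\\\\\/s';        // regex engine sees /\\\\/s
$anchored   = '/^\\\\\\\\[^\\\\]/s'; // regex engine sees /^\\\\[^\\]/s

// Unanchored: two backslashes are found "somewhere" inside the three
var_dump(preg_match($unanchored, $three)); // int(1)
// Anchored: exactly two at the start, then something that is not a backslash
var_dump(preg_match($anchored, $two));     // int(1)
var_dump(preg_match($anchored, $three));   // int(0)
```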
Echo your regex as well so that you can see the actual pattern and verify it is correct; writing all those backslashes inside a PHP string quickly becomes awkward.
Also you should put ^ at the beginning of the pattern to match from string start and $ to the end to specify that the whole string has to be matched.
\\server\something
Regex:
~^\\\\server\\something$~
PHP String:
$pattern = '~^\\\\\\\\server\\\\something$~';
For the repetition, you want to say that a server exists and it's followed by one or more \something parts. If server is like something, this can be simplified:
^\\(?:\\[a-z]+){2,}$
PHP String:
$pattern = '~^\\\\(?:\\\\[a-z]+){2,}$~';
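Putting this together with the question's wish to strip a trailing slash, a validation helper might look like the sketch below. The function name isUncPath, the rtrim step, and the added i modifier are assumptions for illustration; the pattern is the simplified letters-only one above, so real host names with digits or hyphens would need a wider character class.

```php
<?php
// Hypothetical helper: validates \\server\share[\sub...] using the simplified pattern
function isUncPath(string $path): bool
{
    // Strip trailing backslashes for consistency ('\\' is one backslash here)
    $path = rtrim($path, '\\');
    // Regex engine sees: ~^\\(?:\\[a-z]+){2,}$~i
    return preg_match('~^\\\\(?:\\\\[a-z]+){2,}$~i', $path) === 1;
}

var_dump(isUncPath('\\\\server\\something'));       // bool(true)
var_dump(isUncPath('\\\\server\\something\\sub\\')); // bool(true), trailing slash stripped
var_dump(isUncPath('\\\\server'));                  // bool(false), no share part
var_dump(isUncPath('\\server\\something'));         // bool(false), only one leading backslash
```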
As there was some confusion about how \ characters should be written inside single quoted strings:
# Output:
#
# * Definition as '\\' ....... results in string(1) "\"
# * Definition as '\\\\' ..... results in string(2) "\\"
# * Definition as '\\\\\\' ... results in string(3) "\\\"
$slashes = array(
'\\',
'\\\\',
'\\\\\\',
);
foreach($slashes as $i => $slashed) {
$definition = sprintf('%s ', var_export($slashed, 1));
ob_start();
var_dump($slashed);
$result = rtrim(ob_get_clean());
printf(" * Definition as %'.-12s results in %s\n", $definition, $result);
}

how do i correct this regular expressions pattern for php

How do I make this match the following text correctly?
$string = "(\'streamer\',\'http://dv_fs06.ovfile.com:182/d/pftume4ksnroarhlslexwl7bcnoqyljeudgmd7dimssniu2b2r2ikr2h/video.flv\')";
preg_match("/streamer\\'\,\\\'(.*?)\\\'\)/", $string , $result);
var_dump($result);
Your $string looks weird. Better to make a three pass parse:
$string = str_replace(array("\'"), '', $string);
Now we have string:
"(streamer,http://dv_fs06.ovfile.com:182/d/pftume4ksnroarhlslexwl7bcnoqyljeudgmd7dimssniu2b2r2ikr2h/video.flv)"
Now let's trim brackets:
$string = trim($string, '()');
And finally, explode:
list($streamer, $url) = explode(',', $string, 2);
No need of regex.
Btw, your string looks like it was badly escaped in a MySQL query.
It's been a while since I last did regexp matching in PHP, but I think you have to remember that:
' doesn't need to be escaped in PHP strings enclosed by "
\ always needs to be escaped in PHP strings
\ needs to be escaped yet another time in regexps (for it's a special character and you want to treat it as a normal one)
=> a \ that is part of the string to be matched must be written as 4 backslashes.
My suggestion:
preg_match("/\\(\\\\'streamer\\\\',\\\\'(.*?)\\\\'\\)/", $string , $result);
You're on the right track. Two barriers to overcome (As codethief says):
1 - Double quoted string interpolation
2 - Regex escape interpolation
For (2), neither commas nor quotes need to be escaped because they are not metacharacters
special to regexes. Only the backslash as a literal needs to be escaped; otherwise,
in regex context, it represents the start of a metachar sequence (like \s).
For (1), PHP will try to interpolate escaped chars as control codes (like \n); for
that reason the literal backslash needs to be escaped. Since this is double quoted,
\' (the escaped single quote) has no escape meaning.
Therefore, in "\\\'" PHP turns \\ into \ and leaves \' as \', giving \\', which is what the regex sees.
The regex then reads the sequence \\' as a literal \ followed by '.
Making a slight change of your regex solves the problem:
preg_match("/streamer\\\',\\\'(.*?)\\\'\)/", $string , $result);
A working example is here http://beta.ideone.com/47EIY
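The two layers can be observed directly by echoing the pattern before using it. In this sketch the input has the same shape as the question's string, but the URL is shortened to a made-up placeholder.

```php
<?php
// Hypothetical input in the same shape as the question's string (shortened URL)
$string  = "(\'streamer\',\'http://example.invalid/video.flv\')";
$pattern = "/streamer\\\',\\\'(.*?)\\\'\)/";

// What the regex engine receives after PHP's double-quote processing:
// /streamer\\',\\'(.*?)\\'\)/
echo $pattern, "\n";

if (preg_match($pattern, $string, $result)) {
    echo $result[1], "\n"; // http://example.invalid/video.flv
}
```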
