Preg Patterns, to ignore escaped characters - php

I want to create a RegEx that finds strings that begin and end in single or double quotes.
For example I can match such a case like this:
String: "Hello World"
RegEx: /[\"\'][^\"\']+[\"\']/
However, the problem occurs when quotes appear in the string itself like so:
String: "Hello" World"
We know the above expression will not work.
What I want to be able to do is to have the escape within the string itself, since that will be functionality required anyway:
String: "Hello\" World"
Now I could come up with a long and complicated expression with various patterns in a group, one of them being:
RegEx: /[\"\'][^\"\']+(\\\"|\\\')+[^\"\']+[\"\']/
However that to me seems excessive, and I think there may be a shorter and more elegant solution.
Intended syntax:
run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"
As you can see, the quotes are really only used to make sure that strings with spaces are counted as a single string. Do not worry about arg1, I should be able to match unquoted arguments.
I will make this easier: arguments can only be quoted using double quotes. So I've taken single quotes out of the requirements of this question.
I have modified Rui Jarimba's example:
/(?<=")(\\")*([^"]+((\\(\"))*[^"])+)((\\"")|")/
This now accounts pretty well for most cases, however there is one final case that can defeat this:
run -a "arg3 \" p2" "\"sa\"mple\"\\"
The second argument ends with \\" which is a conventional way in this case to allow a backslash at the end of a nested string. Unfortunately the regex thinks this is an escaped quote, since the pattern \" still exists at the end of the argument.

Firstly, please use ' strings to write your regexes. That saves you a lot of escaping.
Then I see two possibilities. The problem with your attempt is, it allows only consecutive escaped quotes in one place in the string. Also, this allows the use of different quotes at the beginning and the end. You could use a backreference to get around that. So this would be a) slightly more elegant and b) correct:
$pattern = '/(["\'])(\\\\"|\\\\\'|[^"\'])+\1/';
Note that the order of the alternation is important!
The problem with this is, you don't want to escape the quote that you don't use to delimit the string. Therefore, the other possibility is to use lookarounds (since backreferences cannot be used inside character classes):
$pattern = '/(["\'])(?:(?!\1).|(?<=\\\\)\1)+\1/';
Note that four consecutive backslashes are always necessary to match a single literal backslash. That is because in the actual string $pattern they end up as \\ and then the regex engine "uses" the first one to escape the second one.
This will match either an arbitrary character if it is not the starting quote. Or it will match the starting quote if the previous character was a backslash.
Working demo.
This by the way is equivalent to:
$pattern = '/(["\'])(?:\\\\\1|(?!\1).)+\1/';
But here you have to write the alternation in this order again.
Working demo.
One final note. You can avoid the backreference by providing the two possible strings separately (single and double quoted strings):
$pattern = '/"(?:\\\\"|[^"])+"|\'(?:\\\\\'|[^\'])+\'/';
But you said you were looking for something short and elegant ;) (although, this last one might be more efficient... but you'd have to profile that).
Note that all my regexes leave one case unconsidered: escaped quotes outside of quoted strings. I.e. Hello \" World "Hello" World will give you " World". You can avoid this using another negative lookbehind (using as an example the second regex for which I provided a working demo; it would work the same for all others):
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';
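For the trailing \\" case from the question's last edit, another option (a sketch, not one of the patterns above) is to consume a backslash together with whatever character follows it, so an escaped backslash can never be mistaken for the start of an escaped quote. The helper name quoted_args is hypothetical, and the sketch assumes arguments are quoted with double quotes only:

```php
<?php
// Hypothetical helper: returns the contents of every double-quoted argument.
// A backslash and the character after it are always consumed as one unit.
function quoted_args(string $line): array
{
    // Inside this single-quoted literal, \\\\ reaches PCRE as \\,
    // which matches one literal backslash.
    $pattern = '/"((?:[^"\\\\]|\\\\.)*)"/s';
    preg_match_all($pattern, $line, $m);
    return $m[1];
}

// The sample command line from the question: run -a "arg3 \" p2" "\"sa\"mple\"\\"
$line = 'run -a "arg3 \" p2" "\"sa\"mple\"\\\\"';
$args = quoted_args($line);
print_r($args);
```

On that sample line this should give two arguments, arg3 \" p2 and \"sa\"mple\"\\, with the final \\ treated as an escaped backslash rather than an escaped quote.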

Try this regex:
['"]([^'"]+((\\(\"|'))*[^'"])+)['"]
Given the following string:
"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"
It will match
"Hello"
'match 2'
"wqwqwqwq wwqwqqwqw"
"Hello \" World"

Related

RegExp Capture literals

I need a way to strip all literals from PHP files. My current regexp solution works fine when there are no nested quotes in the string. I tried updating it to handle escaped quotes as well, which did work in most cases, except when there are escaped escape characters in the string.
This is what it should be able to handle, if this should be done correctly
"text"
"\"text\""
"\\"
"\"\\\""
So as I see it, it needs to handle cases where there is an even number of escape characters and cases where there is an odd number. But how do you get this into regexp?
Update
I want to clean up PHP files to make them easier to search through and index different parts, something for a small project that I am playing with. Since literals can contain mostly anything, they can also contain data similar to some of the searches. So I want to remove anything in the files that is wrapped in " or '.
"/\"[^\"]*\"/"
This will work unless there is a nested quote "\"data\"".
"/\"(\\\\\"|[^\"])*\"/"
This will work unless there is "\\"
This is what I need
$var = "...";
Becomes
$var = ;
You could use this regular expression based substitution:
Find: ((?<!\\)(?:\\.)*)(["'])(?:\\.|(?!\2).)*?\2
Replace: $1
Note that if you are going to use this regular expression in PHP (where you encode it as a string literal) you need to escape the backslashes and quote in that regular expression, so like this:
preg_replace("~((?<!\\\\)(?:\\\\.)*)([\"'])(?:\\\\.|(?!\\2).)*?\\2~s", "$1", $input);
As PHP string literals can span multiple lines, the s modifier is added so that . matches newline characters also.
See it run on eval.in
NB: You'll need to think about heredoc notation also...
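As a quick check, here is the same substitution written as a single-quoted PHP literal and run against the four cases from the question. The variable names and the nowdoc input are only illustrative; heredoc and nowdoc literals are still not handled:

```php
<?php
// Same regex as above; in a single-quoted literal each \\\\ becomes \\ for PCRE,
// \' becomes ' and \\2 becomes the backreference \2.
$pattern = '~((?<!\\\\)(?:\\\\.)*)(["\'])(?:\\\\.|(?!\\2).)*?\\2~s';

// A nowdoc keeps the sample source exactly as typed, backslashes included.
$source = <<<'PHP'
$a = "text";
$b = "\"text\"";
$c = "\\";
$d = "\"\\\"";
PHP;

$stripped = preg_replace($pattern, '$1', $source);
echo $stripped, "\n";
```

Each of the four assignments should come out as $x = ; with the literal removed.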

To double escape or not to double escape in PHP PCRE functions?

I was looking for a solid article on when double escaping is necessary and when it is not, but I was not able to find anything. Perhaps I didn't look hard enough, because I'm sure there is an explanation out there somewhere, but let's just make it easy to find for the next guy that has this question!
Take for example the following regex patterns:
/\n/
/domain\.com/
/myfeet \$ your feet/
Nothing groundbreaking, right? OK, let's use those examples within the context of PHP's preg_match function:
$foo = preg_match("/\n/", $bar);
$foo = preg_match("/domain\.com/", $bar);
$foo = preg_match("/myfeet \$ your feet/", $bar);
To my understanding, a backslash in the context of a quoted string value escapes the following character, and the expression is being given via a quoted string value.
Would the previous be like doing the following, and wouldn't this cause an error?
$foo = preg_match("/n/", $bar);
$foo = preg_match("/domain.com/", $bar);
$foo = preg_match("/myfeet $ your feet/", $bar);
Which is not what I want right? those expressions are not the same as above.
Would I not have to write them double escaped like this?
$foo = preg_match("/\\n/", $bar);
$foo = preg_match("/domain\\.com/", $bar);
$foo = preg_match("/myfeet \\$ your feet/", $bar);
So that when PHP processes the string it escapes the backslash to a backslash, which is then left in when it's passed to the PCRE interpreter?
Or does PHP just magically know that I want to pass that backslash to the PCRE interpreter... I mean, how does it know I'm not trying to \" escape a quote that I want to use in my expression? Or are double slashes only required when using an escaped quote? And for that matter, would you need to TRIPLE escape a quote? \\\" You know, so that the quote is escaped and a double is left over?
What's the rule of thumb with this?
I just did a test with PHP:
$bar = "asdfasdf a\"ONE\"sfda dsf adsf me & mine adsf asdf asfd ";
echo preg_match("/me \$ mine/", $bar);
echo "<br /><br />";
echo preg_match("/me \\$ mine/", $bar);
echo "<br /><br />";
echo preg_match("/a\"ONE\"/", $bar);
echo "<br /><br />";
echo preg_match("/a\\\"ONE\\\"/", $bar);
echo "<br /><br />";
Output:
0
1
1
1
So, it looks like somehow it doesn't really matter for quotes, but for the dollar sign, a double escape is required as I thought.
Double quoted strings
When it comes to escaping inside double quotes, the rule is that PHP will inspect the character(s) immediately following the backslash.
If the neighboring character is in the set ntrvef\$" or if a numeric value follows it (rules can be found here) it gets evaluated as the corresponding control character or ordinal (hexadecimal or octal) representation, respectively.
It's important to note that if an invalid escape sequence is given, the expression is not evaluated and both the backslash and character remain. This is different from some other languages where an invalid escape sequence would cause an error instead.
E.g. "domain\.com" will be left as is.
Note that variables get expanded inside double quotes as well, e.g. "$var" needs to be escaped as "\$var".
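A small sketch of these rules: each entry compares a double-quoted literal with the single-quoted literal that spells out the same characters. The $checks array and its labels are just for illustration:

```php
<?php
// Each value is true when the double-quoted literal on the left contains
// exactly the characters of the single-quoted literal on the right.
$checks = [
    'unknown escape \. is kept as is'   => "domain\.com" === 'domain\.com',
    '\$ becomes a bare dollar sign'     => "/me \$ mine/" === '/me $ mine/',
    'doubled backslash becomes one'     => "/me \\$ mine/" === '/me \$ mine/',
    '\n becomes one newline character'  => strlen("\n") === 1 && strlen('\n') === 2,
];

foreach ($checks as $label => $ok) {
    echo $label, ': ', $ok ? 'true' : 'false', "\n";
}
```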
Single quotes strings
Since PHP 5.1.1, a backslash inside a single quoted string is kept as is unless it precedes a single quote or another backslash (\' and \\ are the only escape sequences there), and no variables get substituted either. This is by far the most convenient feature of single quoted strings.
Regular expressions
For escaping regular expressions, it's best to leave escaping to preg_quote():
$foo = preg_match('/' . preg_quote('mine & yours', '/') . '/', $bar);
This way you don't have to worry about which characters need to be escaped, so it works well for user input.
See also: preg_quote
Update
You added this test:
"/me \$ mine/"
This gets evaluated as "/me $ mine/"; but in PCRE the $ has a special meaning (it's an end-of-subject anchor).
"/me \\$ mine/"
This is evaluated as "/me \$ mine/" and so the backslash is escaped for PHP itself whereas the $ is escaped for PCRE. This only works by accident btw.
$var = 'something';
"/me \\$var mine/"
This gets evaluated as "/me \something mine/", so you need to escape the $ again.
"/me \\\$var mine/"
Use single quotes. They prevent escape sequences from occurring.
For example:
php > print "hi\n";
hi
php > print 'hi\n';
hi\nphp >
Whenever you have an invalid escape sequence, PHP actually leaves the characters literally in the string. From the documentation:
As in single quoted strings, escaping any other character will result in the backslash being printed too.
I.e. "\&" really is interpreted as "\&". There are not that many escape sequences, so in most cases you probably get away with a single backslash. But for consistency, escaping the backslash might be a better choice.
As always: Know what you are doing :)
OK So I did some more testing and discovered the RULE OF THUMB when encapsulating a PCRE in DOUBLE QUOTES, the following holds true:
$ - Requires double escape because PHP will interpret that as the beginning of a variable if text is immediately following it. Left unescaped and it will indicate the end of your needle and will break.
\r\n\t\v - Special PHP string escapes, single escape required only.
[\^$.|?*+() - Special RegEx characters, require single escape only. Double escape does not seem to break expressions when used unnecessarily.
" - Quotes are obviously going to have to be escaped due to the encapsulation, but only need to be escaped once.
\ - Searching for a backslash? Using the double quote encapsulation of your expression, this will require 3 escapes! \\\\ (four backslashes in total)
Anything I'm missing?
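The rules above can be checked with a short script. The subject string and the $results keys are made up for the test, and the dollar sign is written as \\\$ (rather than \\$) so that it cannot be mistaken for the start of a variable:

```php
<?php
// Single-quoted subject: \t here is a literal backslash followed by t.
$subject = 'cost: 5$ at C:\temp "quoted" a.b';

$results = [
    // PHP turns \\ into \ and \$ into $, so PCRE receives /5\$ at/
    'dollar'    => preg_match("/5\\\$ at/", $subject),
    // \. is not a PHP escape, so PCRE receives /a\.b/
    'dot'       => preg_match("/a\.b/", $subject),
    // \" only protects the quote from PHP; PCRE receives /"quoted"/
    'quote'     => preg_match("/\"quoted\"/", $subject),
    // Four backslashes become two for PCRE, which match one literal backslash
    'backslash' => preg_match("/C:\\\\temp/", $subject),
];

var_dump($results);
```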
I'll start saying that all I'll write below is not exactly what happens, but, for clarity, I'll simplify it.
Imagine that there are two evaluations happening when using regular expressions: the first being done by PHP and the second being done by PCRE, as if they were separate engines. And for our bad luck,
PHP AND PCRE EVALUATE THINGS IN DIFFERENT WAYS.
We have 3 "guys" here: 1) the USER; 2) the PHP and; 3) the PCRE.
The USER communicates with PHP by writing the CODE, which is exactly what you type in a code editor.
PHP then evaluates this CODE and sends another bit of information to PCRE. This bit of information is different from what you typed in your CODE.
PCRE then evaluates it and returns something to PHP, that evaluates this response and returns something to the USER.
I'll explain better in the example below. There I'm going to use the backslash ("\") to illustrate what's going on.
Assume this bit of CODE in a php file:
<?php
$sub = "A backslash \ in a string";
$pat1 = "#\#";
$pat2 = "#\\#";
$pat3 = "#\\\#";
$pat4 = "#\\\\#";
echo "sub: ".$sub;
echo "\n\n";
echo "pat1: ".$pat1;
echo "\n";
echo "pat2: ".$pat2;
echo "\n";
echo "pat3: ".$pat3;
echo "\n";
echo "pat4: ".$pat4;
?>
This will print:
sub: A backslash \ in a string
pat1: #\#
pat2: #\#
pat3: #\\#
pat4: #\\#
In this example, there is no regular expression involved, so there is only the PHP evaluation of the code happening.
PHP leaves a backslash as is if it doesn't precede any special character. That's why it prints the backslash correctly in $sub.
PHP evaluates $pat1 and $pat2 EXACTLY the same, because in $pat1 the backslash is left as is, and in $pat2 the first backslash escapes the second, resulting in a single backslash.
Now, in $pat3, the first backslash escapes the second, resulting in one backslash. Then PHP evaluates the third backslash and leaves it as is because it is not preceding anything special. The result is going to be the double backslash.
Now someone could say "but now we have two backslashes again! shouldn't the first one escape the second one again?!"
The answer is "No". After PHP evaluates the first two backslashes into a single one, it doesn't look back again, and keeps moving on evaluating what is next.
At this point you already know what's going on with $pat4: the first backslash escapes the second and the third escapes the fourth, leaving two in the end.
Now that it's clear what PHP is doing to these strings, let's add some more code after the previous one.
if (preg_match($pat1, $sub)) echo "test1: true"; else echo "test1: false";
echo "\n";
if (preg_match($pat2, $sub)) echo "test2: true"; else echo "test2: false";
echo "\n";
if (preg_match($pat3, $sub)) echo "test3: true"; else echo "test3: false";
echo "\n";
if (preg_match($pat4, $sub)) echo "test4: true"; else echo "test4: false";
And the result is:
test1: false
test2: false
test3: true
test4: true
So, what's going on here is that PHP is not sending "what you typed" in the CODE directly to PCRE. Instead, PHP is sending what it has evaluated previously (which are exactly what we saw above).
For test1 and test2, even though we have written different patterns in the CODE for each test, PHP is sending the same pattern #\# to PCRE. The same thing happens for test3 and test4: PHP is sending #\\#. So, the results for test1 and test2 are the same, as well as for test3 and test4.
Now, what's going on when PCRE evaluates these patterns? PCRE doesn't act like PHP.
In test1 and test2, when PCRE sees a single backslash escaping nothing special (or nothing at all), it doesn't leave it as is. Instead, it probably thinks "what the hell is this?" and returns an error to PHP (actually, I don't really know what goes on when sending a single backslash to PCRE; I searched for this but found nothing conclusive). Then PHP takes what we are assuming is an error and evaluates it as "false" and returns that to the rest of the CODE (in this example, the if () statement).
In test3 and test4, things go as we now expect: PCRE evaluates the first backslash as escaping the second, resulting in a single backslash. That of course matches the $sub string and returns a "successful message" to PHP, which evaluates it as "true".
ANSWERING QUESTIONS
Some characters are special to PHP (e.g. n for NEW LINE, t for TAB).
Some characters are special to PCRE (e.g. . (dot) to match any character, s to match whitespaces).
And some characters are special to both (e.g. $ to php is the beginning of the name of a variable and to PCRE it asserts the end of the subject).
That's why you need to escape newlines just once, like this \n. PHP will evaluate it as the REAL character NEW LINE and send that to PCRE.
For the dot, if you want to match that specific character, you should use \. and PHP will do nothing because the dot isn't a special character to PHP in a string. Instead, it will send them as is to PCRE. Now on PCRE, it will "see" a backslash preceding a dot and understand that it should match that specific character. If you use a double escape \\. the first backslash will escape the second, leaving you with the same result.
And if you want to match a dollar sign in a string, then you should use \\\$. In PHP, the first backslash will escape the second one, leaving a single backslash. Then the third backslash will escape the dollar sign. In the end, the result is \$. This is what PCRE will receive. PCRE will see that backslash and understand that the dollar sign is not asserting end of subject, but the literal character.
QUOTES
And now we've come to quotes. The problem with them is the fact that PHP evaluates a string in different ways, depending on the quotes used to surround it. Check it out: Strings
All I said until this point is valid for double quotes.
If you try this '\n' in single quotes, PHP will evaluate that backslash as a literal one.
But, if it is used in a regular expression, PCRE will get this string as is. And since n is also special to PCRE, it will interpret that as a newline character, and BOOM, it "magically" matches a newline in a string.
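To see this in practice, the two patterns below are different strings as far as PHP is concerned, yet both match a newline. The variable names are only for illustration:

```php
<?php
$subject = "line one\nline two";

$doubleQuoted = "/\n/"; // PHP converts \n, so PCRE receives a real newline character
$singleQuoted = '/\n/'; // PHP keeps backslash and n, PCRE interprets \n itself

// The patterns differ in length: 3 characters versus 4.
var_dump(strlen($doubleQuoted), strlen($singleQuoted));

// Both still find the newline in the subject.
$matchDouble = preg_match($doubleQuoted, $subject);
$matchSingle = preg_match($singleQuoted, $subject);
var_dump($matchDouble, $matchSingle);
```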
Check the escape sequences here: Escape Sequences
As I said in the beginning, things are not exactly as I tried to explain here, but I really hope it helps (and does not make it more confusing than it already is).

PHP preg_replace backslash

I have double backslashes '\\' in my string that need to be converted into single backslashes '\'. I've tried several combinations and end up with the whole string disappearing when I use echo, or more backslashes are added to the string by accident. This regex thing is making me go bonkers...lol...
I tried this amongst other failed attempts:
$pattern = '[\\]';
$replacement = '/\/';
?>
<td width="100%"> <?php echo preg_replace($pattern, $replacement,$q[$i]);?></td>
I do apologise if this is a foolish issue and I appreciate any pointers.
Use stripslashes() - it does exactly what you're looking for.
<td width="100%"> <?php echo stripslashes($q[$i]);?></td>
Use stripslashes instead. Also, in your regex, you are searching for single backslashes and your replacement is incorrect. \\{2} should search for double backslashes and \ should replace them with singles, although I haven't tested this.
Just to explain further, the pattern [\\] matches any character in a set comprised of a single backslash. In php, you should also delimit your regex with forward slashes: /[\\]/
Your replacement, which is (without delimiters) \, is not a regular expression for matching a single backslash. The regex for matching a single backslash is \\. Note the escaping. This said, the replacement term needs to be a string, not a regex (with the exception of backreferences).
EDIT: Sven claims below that stripslashes removes all backslashes. This is simply not true, and I will explain why below.
If a string contains 2 backslashes, the first one will be considered an escaping backslash and will be removed. This can be seen at http://www.phpfiddle.org/main/code/3yn-2ut. The fact that any backslashes remain at all by itself contradicts the claim that stripslashes removes all backslashes.
Just to clarify, this string declaration is invalid: $x = "\";, since the backslash escapes the second quote. This string "\\" contains one backslash. In the process of unquoting this string, this backslash will be removed. This "\\\\" string contains two backslashes. When unquoting, the first will be considered an escaping backslash, and will be removed.
Use preg_replace to turn double backslash into single backslash:
preg_replace('/\\\\{2}/', '\\', $str)
The \ in the first parameter needs to be escaped twice, once for string and once more for regex, just like CodeAngry says.
In the second parameter it only gets escaped once for string.
Make sense?
Never use a regular expression if the string you are looking for is constant, as is the case with "Every instance of double backslash".
Use str_replace() for this task. It is a very easy function that replaces every occurrence of a string with another.
In your case: str_replace('\\\\', '\\', $var).
The double backslash actually translates into four backslashes, because inside any quotes (single or double), a single backslash is the start of an escape sequence for the following character. If you want one literal backslash, you have to write two of them. If you want two backslashes, you have to write four of them.
I do not like the suggestion of stripslashes(). This will of course "decode" your double backslash into one single backslash. But it will also remove all single backslashes in the whole string. If there were none - fine, otherwise things will fail now.
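To make the difference between the suggestions concrete, here is a small comparison on a made-up input that contains both a double backslash and a single one:

```php
<?php
// Single-quoted literal: \\\\ is two backslashes, \c stays as backslash + c.
// The actual string is: a\\b\c
$input = 'a\\\\b\c';

// Replaces only the doubled backslash: a\b\c
$viaStrReplace = str_replace('\\\\', '\\', $input);

// The regex \\{2} matches two literal backslashes: a\b\c
$viaPregReplace = preg_replace('/\\\\{2}/', '\\', $input);

// Unescapes every backslash, so the single one before c is dropped: a\bc
$viaStripslashes = stripslashes($input);

var_dump($viaStrReplace, $viaPregReplace, $viaStripslashes);
```

Which result is correct depends on whether the single backslashes in the data are meant to survive.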
$pattern = '[\\]'; // wrong
$pattern = '[\\\\]'; // right
escape \ as \\ and escape \\ as \\\\ because \\] means escaped ].
Use htmlentities function to convert your slashes to html entities then using str_replace or preg_match to change them with new entity

Can you use back references in the pattern part of a regular expression?

Is there a way to back reference in the regular expression pattern?
Example input string:
Here is "some quoted" text.
Say I want to pull out the quoted text, I could create the following expression:
"([^"]+)"
This regular expression would match some quoted.
Say I want it to also support single quotes, I could change the expression to:
["']([^"']+)["']
But what if the input string has a mixture of quotes say Here is 'some quoted" text. I would not want the regex to match. Currently the regex in the second example would still match.
What I would like to be able to do is if the first quote is a double quote then the closing quote must be a double. And if the start quote is single quote then the closing quote must be single.
Can I use a back reference to achieve this?
My other related question: Getting text between quotes using regular expression
You can make use of the regex:
(["'])[^"']+\1
() : used for grouping
[..] : is the char class, so ["'] matches either " or ', equivalent to "|'
[^..] : char class with negation. It matches any char not listed after the ^
+ : quantifier for one or more
\1 : backreference to the first group, which is (["'])
In PHP you'd use this as:
preg_match('#(["\'])[^"\']+\1#',$str)
preg_match('/(["\'])([^"\']+)\1/', 'Here is \'quoted text" some quoted text.');
Explanation: /(["'])([^"']+)\1/ I placed the first quote in parentheses. Because this is the first group, its back reference number is 1. Then, where the closing quote would be, I placed \1, which means whichever character was matched in group 1.
/"\(.*?\)".*?\1/ should work, but it depends on the regular expression engine
This is old, but you need to pass a $matches variable as the third argument: preg_match($pattern, $subject, $matches). The parameter is taken by reference, so no & is needed at the call site.
Then you can inspect it with var_dump($matches)
see https://www.php.net/manual/en/function.preg-match
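Putting the backreference pattern and the $matches argument together, a minimal sketch using the sample strings from the question:

```php
<?php
// Group 1 is the opening quote, group 2 the quoted text, \1 the matching close quote.
$pattern = '/(["\'])([^"\']+)\1/';

// Matching quotes: succeeds and fills $matches.
$matched = preg_match($pattern, 'Here is "some quoted" text.', $matches);

// Mixed quotes: the closing quote differs from group 1, so nothing matches.
$mixed = preg_match($pattern, 'Here is \'some quoted" text.');

var_dump($matched, $matches, $mixed);
```

Here $matches[2] holds the text between the quotes, and $matches[1] tells you which quote character was used.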

RegEx in PHP: Matching a pattern outside of non-escaped quotes

I'm writing a method to lift certain data out of an SQL query string, and I need to regex match any word inside of curly braces ONLY when it appears outside of single-quotes. I also need it to factor in the possibility of escaped (preceded by backslash) quotes, as well as escaped backslashes.
In the following examples, I need the regex to match {FOO} and not {BAR}:
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
I'm using preg_match in PHP to get the word in the braces ("FOO", in this case). Here's the regex string I have so far:
$regex = '/' .
// Match the word in braces
'\{(\w+)\}' .
// Only if it is followed by an even number of single-quotes
'(?=(?:[^\']*\'[^\']*\')*[^\']*$)' .
// The end
'/';
My logic is that, since the only thing I'm parsing is a legal SQL string (besides the brace-thing I added), if a set of braces is followed by an even number of non-escaped quotes, then it must be outside of quotes.
The regex I provided is 100% successful EXCEPT for taking escaped quotes into consideration. I just need to make sure there is no odd number of backslashes before a quote match, but for the life of me I can't seem to convey this in RegEx. Any takers?
The way to deal with escaped quotes and backslashes is to consume them in matched pairs.
(?=(?:(?:(?:[^\'\\]++|\\.)*+\'){2})*+(?:[^\'\\]++|\\.)*+$)
In other words, as you scan for the next quote, you skip any pair of characters that starts with a backslash. That takes care of both escaped quotes and escaped backslashes. This lookahead will allow escaped characters outside of quoted sections, which probably isn't necessary, but it probably won't hurt either.
p.s., Notice the liberal use of possessive quantifiers (*+ and ++); without those you could have performance problems, especially if the target strings are large. Also, if the strings can contain line breaks, you may need to do the matching in DOTALL mode (aka, "singleline" or "/s" mode).
However, I agree with mmyers: if you're trying to parse SQL, you will run into problems that regexes can't handle at all. Of all the things that regexes are bad at, SQL is one of the worst.
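For reference, here is that lookahead combined with the brace pattern from the question and written as a single-quoted PHP literal, where every regex backslash that should reach PCRE as \\ has to be typed as \\\\. The variable names and the nowdoc input are only for the test; the two lines are the examples from the question:

```php
<?php
// \{(\w+)\} captures the word; the lookahead requires an even number of
// unescaped single quotes between the brace and the end of the string,
// skipping backslash + character pairs as it scans.
$regex = '/\{(\w+)\}'
       . '(?=(?:(?:(?:[^\'\\\\]++|\\\\.)*+\'){2})*+(?:[^\'\\\\]++|\\\\.)*+$)/s';

// A nowdoc keeps the backslashes of the sample lines exactly as typed.
$text = <<<'TXT'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
TXT;

$found = [];
foreach (explode("\n", $text) as $line) {
    preg_match_all($regex, $line, $m);
    $found[] = $m[1]; // words found outside quoted sections on this line
}

print_r($found);
```

Each line is matched separately, and only FOO should be reported for both.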
Simply, and perhaps naively, str_replace out all your double backslashes. Then str_replace out escaped single quotes. At that point it's relatively simple to find matches that are not between single quotes (using your existing regex, for example).
If you really want to use regular expressions for this, I would do it in two steps:
Separate the strings from the non-strings with preg_split:
$re = "('(?:[^\\\\']+|\\\\(\\\\\\\\)*.)*')";
$parts = preg_split('/'.$re.'/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
Replace the whatever in the strings:
foreach ($parts as $key => $val) {
    if (preg_match('/^'.$re.'$/', $val)) {
        $parts[$key] = preg_replace('/\{([^}]*)}/', '$1', $val);
    }
}
But a real parser would probably be better as this approach is not that efficient.
