RegEx in PHP: Matching a pattern outside of non-escaped quotes

I'm writing a method to lift certain data out of an SQL query string, and I need to regex match any word inside of curly braces ONLY when it appears outside of single-quotes. I also need it to factor in the possibility of escaped (preceded by backslash) quotes, as well as escaped backslashes.
In the following examples, I need the regex to match {FOO} and not {BAR}:
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
I'm using preg_match in PHP to get the word in the braces ("FOO", in this case). Here's the regex string I have so far:
$regex = '/' .
// Match the word in braces
'\{(\w+)\}' .
// Only if it is followed by an even number of single-quotes
'(?=(?:[^\']*\'[^\']*\')*[^\']*$)' .
// The end
'/';
My logic is that, since the only thing I'm parsing is a legal SQL string (besides the brace-thing I added), if a set of braces is followed by an even number of non-escaped quotes, then it must be outside of quotes.
The regex I provided is 100% successful EXCEPT for taking escaped quotes into consideration. I just need to make sure there is no odd number of backslashes before a quote match, but for the life of me I can't seem to convey this in RegEx. Any takers?

The way to deal with escaped quotes and backslashes is to consume them in matched pairs.
(?=(?:(?:(?:[^\'\\]++|\\.)*+\'){2})*+(?:[^\'\\]++|\\.)*+$)
In other words, as you scan for the next quote, you skip any pair of characters that starts with a backslash. That takes care of both escaped quotes and escaped backslashes. This lookahead will allow escaped characters outside of quoted sections, which probably isn't necessary, but it probably won't hurt either.
p.s., Notice the liberal use of possessive quantifiers (*+ and ++); without those you could have performance problems, especially if the target strings are large. Also, if the strings can contain line breaks, you may need to do the matching in DOTALL mode (aka, "singleline" or "/s" mode).
However, I agree with mmyers: if you're trying to parse SQL, you will run into problems that regexes can't handle at all. Of all the things that regexes are bad at, SQL is one of the worst.
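To make the pair-consuming lookahead above concrete, here is a minimal sketch of it bolted onto the brace pattern from the question. The function name bracesOutsideQuotes is made up for illustration, and the pattern is written as a nowdoc on the assumption that you would rather not double every backslash for PHP's string parser.

```php
<?php
// Hypothetical helper: returns the words found in {braces} outside single-quoted strings.
function bracesOutsideQuotes(string $sql): array
{
    // Nowdoc keeps the pattern byte-for-byte, so no PHP-level backslash doubling is needed.
    // The /s flag lets "." in "\\." also cover a backslash followed by a line break.
    $regex = <<<'REGEX'
/\{(\w+)\}(?=(?:(?:(?:[^'\\]++|\\.)*+'){2})*+(?:[^'\\]++|\\.)*+$)/s
REGEX;

    preg_match_all($regex, $sql, $m);
    return $m[1]; // contents of capture group 1 for every match
}

// The two sample strings from the question, also as nowdocs.
$s1 = <<<'SQL'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
SQL;
$s2 = <<<'SQL'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
SQL;

print_r(bracesOutsideQuotes($s1));
print_r(bracesOutsideQuotes($s2));
```

Both calls should report only FOO, since every {BAR} is followed by an odd number of unescaped quotes.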

Simply, and perhaps naively, str_replace out all your double backslashes. Then str_replace out escaped single quotes. At that point it's relatively simple to find matches that are not between single quotes (using your existing regex, for example).
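A rough sketch of that idea, assuming the input is well-formed so that every run of backslashes pairs up from the left. The function name bracesOutsideQuotesSimple is hypothetical; the final regex is the one from the question, unchanged.

```php
<?php
// Hypothetical helper: strip the escapes first, then reuse the question's original regex.
function bracesOutsideQuotesSimple(string $sql): array
{
    // 1. Remove escaped backslashes first, so a quote that follows "\\"
    //    is not mistaken for an escaped quote in step 2.
    $stripped = str_replace('\\\\', '', $sql);

    // 2. Remove escaped single quotes; every quote left is a real delimiter.
    $stripped = str_replace('\\\'', '', $stripped);

    // 3. The question's regex: braces followed by an even number of quotes.
    preg_match_all('/\{(\w+)\}(?=(?:[^\']*\'[^\']*\')*[^\']*$)/', $stripped, $m);
    return $m[1];
}

$sample = <<<'SQL'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
SQL;

print_r(bracesOutsideQuotesSimple($sample));
```

The order of the two str_replace calls matters: swapping them would turn the trailing \\' of a string into a stray backslash and lose the closing quote.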

If you really want to use regular expressions for this, I would do it in two steps:
Separate the strings from the non-strings with preg_split:
$re = "('(?:[^\\\\']+|\\\\(?:\\\\\\\\)*.)*')";
$parts = preg_split('/'.$re.'/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
Replace the whatever in the strings:
foreach ($parts as $key => $val) {
if (preg_match('/^'.$re.'$/', $val)) {
$parts[$key] = preg_replace('/\{([^}]*)}/', '$1', $val);
}
}
But a real parser would probably be better as this approach is not that efficient.
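For the original question (finding braces outside the strings rather than rewriting inside them), the same split idea can be sketched like this. It is a variation, not the code above: the name bracesViaSplit is made up, the string pattern uses possessive quantifiers, and the delimiters are simply thrown away instead of captured.

```php
<?php
// Hypothetical helper: split on string literals, then search only what lies between them.
function bracesViaSplit(string $sql): array
{
    // One single-quoted string: runs of ordinary characters or backslash-escaped pairs.
    $string = <<<'REGEX'
'(?:[^'\\]++|\\.)*+'
REGEX;

    // Splitting on the literals leaves only the text outside of quotes.
    $outside = preg_split('/' . $string . '/s', $sql, -1, PREG_SPLIT_NO_EMPTY);

    $found = [];
    foreach ($outside as $chunk) {
        preg_match_all('/\{(\w+)\}/', $chunk, $m);
        $found = array_merge($found, $m[1]);
    }
    return $found;
}

$sample = <<<'SQL'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
SQL;

print_r(bracesViaSplit($sample));
```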

Related

Problems with quotation marks, preg_replace or str_replace

Okay so, I have done stuff with regex before, but this time my brain doesn't seem to wanna work with me.
I'm trying to remove some " and some ', from some json in a string. Here's how far I got with preg_replace.
$string = '"cusComment": "Direct from user input, so need to remove "double quotation marks" and \'single ones as well", "intComment": "" }}';
$blab = preg_replace('/["cusComment": "]"[", "intComment"]/', "", $string);
echo $blab;
This almost works for removing ", with some unwanted results.
Edit:
I guess you could do it the "other way around", and only let letters, numbers, punctuation, comma, dash, underscore and white space through... still need help :)
This can be achieved with the (*SKIP)(*FAIL) technique. It effectively works to disqualify specific matches and only return the substrings (double quotes in this case) that you want.
The following pattern only handles the tricky double quote matching with regex. A simple call of str_replace() handles the single quotes because ALL of them are to be omitted.
Pattern: /("[:,] "|^"|" \}{2})(*SKIP)(*FAIL)|"|\\'/
This pattern says, disqualify the following matches:
two double quotes that are separated by a colon or comma and space
a double quote at the start of the line
a double quote followed by a space and two closing curly brackets
Then, from whatever is left, match any double quote, or a single quote preceded by a backslash.
Demo of Pattern & Replacement
PHP Code: (Demo) *the pattern needs an extra backslash in the PHP implementation
$string = '"cusComment": "Direct from user input, so need to remove "double quotation marks" and \'single ones as well", "intComment": "" }}';
$blab = preg_replace('/("[:,] "|^"|" \}{2})(*SKIP)(*FAIL)|"|\\\'/', "", $string);
echo $blab;
I'm not sure what further modifications you are making to this string to prepare it, but those tasks may or may not couple well with this pattern.
If you are going to remove the trailing } and empty spaces, it seems logical to call rtrim($blab,' }'), but it is also reasonable to spare the extra function call and just extend the pattern: /("[:,] "|^"|"(?= \}{2}$))(*SKIP)(*FAIL)|"|\\'| \}{2}$/

Preg Patterns, to ignore escaped characters

I want to create a RegEx that finds strings that begin and end in single or double quotes.
For example I can match such a case like this:
String: "Hello World"
RegEx: /[\"\'][^\"\']+[\"\']/
However, the problem occurs when quotes appear in the string itself like so:
String: "Hello" World"
We know the above expression will not work.
What I want to be able to do is to have the escape within the string itself, since that will be functionality required anyway:
String: "Hello\" World"
Now I could come up with a long and complicated expression with various patterns in a group, one of them being:
RegEx: /[\"\'][^\"\']+(\\\"|\\\')+[^\"\']+[\"\']/
However that to me seems excessive, and I think there may be a shorter and more elegant solution.
Intended syntax:
run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"
As you can see, the quotes are really only used to make sure that string with spaces are counted as a single string. Do not worry about arg1, I should be able to match unquoted arguments.
I will make this easier: arguments can only be quoted using double-quotes. So I've taken single quotes out of the requirements of this question.
I have modified Rui Jarimba's example:
/(?<=")(\\")*([^"]+((\\(\"))*[^"])+)((\\"")|")/
This now accounts pretty well for most cases, however there is one final case that can defeat this:
run -a "arg3 \" p2" "\"sa\"mple\"\\"
The second argument ends with \\" which is a conventional way in this case to allow a backslash at the end of a nested string. Unfortunately the regex thinks this is an escaped quote, since the pattern \" still exists at the end of the string.
Firstly, please use ' strings to write your regexes. That saves you a lot of escaping.
Then I see two possibilities. The problem with your attempt is, it allows only consecutive escaped quotes in one place in the string. Also, this allows the use of different quotes at the beginning and the end. You could use a backreference to get around that. So this would be a) slightly more elegant and b) correct:
$pattern = '/(["\'])(\\\\"|\\\\\'|[^"\'])+\1/';
Note that the order of the alternation is important!
The problem with this is, you don't want to escape the quote that you don't use to delimit the string. Therefore, the other possibility is to use lookarounds (since backreferences cannot be used inside character classes):
$pattern = '/(["\'])(?:(?!\1).|(?<=\\\\)\1)+\1/';
Note that four consecutive backslashes are always necessary to match a single literal backslash. That is because in the actual string $pattern they end up as \\ and then the regex engine "uses" the first one to escape the second one.
This will match either an arbitrary character if it is not the starting quote. Or it will match the starting quote if the previous character was a backslash.
Working demo.
This by the way is equivalent to:
$pattern = '/(["\'])(?:\\\\\1|(?!\1).)+\1/';
But here you have to write the alternation in this order again.
Working demo.
One final note. You can avoid the backreference by providing the two possible strings separately (single and double quoted strings):
$pattern = '/"(?:\\\\"|[^"])+"|\'(?:\\\\\'|[^\'])+\'/';
But you said you were looking for something short and elegant ;) (although, this last one might be more efficient... but you'd have to profile that).
Note that all my regexes leave one case unconsidered: escaped quotes outside of quoted strings. I.e. Hello \" World "Hello" World will give you " World". You can avoid this using another negative lookbehind (using as an example the second regex for which I provided a working demo; it would work the same for all others):
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';
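For the final case in the question (an argument ending in \\"), one option is to consume every backslash together with the character after it, so that \\ is eaten as a pair before the closing quote is looked at. This is a sketch of that swapped-in technique rather than one of the patterns above; it assumes double-quoted arguments only, and the function name quotedArguments is hypothetical.

```php
<?php
// Hypothetical helper: returns the raw contents of each double-quoted argument.
function quotedArguments(string $command): array
{
    // Inside the quotes: runs of plain characters, or a backslash plus ANY next character.
    // Because "\\" is consumed as one pair, the quote after it is seen as the real terminator.
    $regex = <<<'REGEX'
/"((?:[^"\\]++|\\.)*+)"/s
REGEX;

    preg_match_all($regex, $command, $m);
    return $m[1];
}

$command = <<<'CMD'
run -a "arg3 \" p2" "\"sa\"mple\"\\"
CMD;

print_r(quotedArguments($command));
```

The captured text still contains the backslashes; unescaping them would be a separate step.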
Try this regex:
['"]([^'"]+((\\(\"|'))*[^'"])+)['"]
Given the following string:
"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"
It will match
"Hello"
'match 2'
"wqwqwqwq wwqwqqwqw"
"Hello \" World"

PHP preg_replace backslash

I have double backslashes '\\' in my string that need to be converted into single backslashes '\'. I've tried several combinations and end up with the whole string disappearing when I use echo, or more backslashes are added to the string by accident. This regex thing is making me go bonkers...lol...
I tried this amongst other failed attempts:
$pattern = '[\\]';
$replacement = '/\/';
?>
<td width="100%"> <?php echo preg_replace($pattern, $replacement,$q[$i]);?></td>
I do apologise if this is a foolish issue and I appreciate any pointers.
Use stripslashes() - it does exactly what you're looking for.
<td width="100%"> <?php echo stripslashes($q[$i]);?></td>
Use stripslashes instead. Also, in your regex, you are searching for single backslashes and your replacement is incorrect. \\{2} should search for double backslashes and \ should replace them with singles, although I haven't tested this.
Just to explain further, the pattern [\\] matches any character in a set comprised of a single backslash. In php, you should also delimit your regex with forward slashes: /[\\]/
Your replacement, which is (without delimiters) \, is not a regular expression for matching a single backslash. The regex for matching a single backslash is \\. Note the escaping. This said, the replacement term needs to be a string, not a regex (with the exception of backreferences).
EDIT: Sven claims below that stripslashes removes all backslashes. This is simply not true, and I will explain why below.
If a string contains 2 backslashes, the first one will be considered an escaping backslash and will be removed. This can be seen at http://www.phpfiddle.org/main/code/3yn-2ut. The fact that any backslashes remain at all by itself contradicts the claim that stripslashes removes all backslashes.
Just to clarify, this string declaration is invalid: $x = "\";, since the backslash escapes the second quote. This string "\\" contains one backslash. In the process of unquoting this string, this backslash will be removed. This "\\\\" string contains two backslashes. When unquoting, the first will be considered an escaping backslash, and will be removed.
Use preg_replace to turn double backslash into single backslash:
preg_replace('/\\\\{2}/', '\\', $str)
The \ in the first parameter needs to be escaped twice, once for string and once more for regex, just like CodeAngry says.
In the second parameter it only gets escaped once for string.
Make sense?
Never use a regular expression if the string you are looking for is constant, as is the case with "Every instance of double backslash".
Use str_replace() for this task. It is a very easy function that replaces every occurrence of a string with another.
In your case: str_replace('\\\\', '\\', $var).
The double backslash actually translates into four backslashes, because inside any quotes (single or double), a single backslash is the start of an escape sequence for the following character. If you want one literal backslash, you have to write two of them. If you want two backslashes, you have to write four of them.
I do not like the suggestion of stripslashes(). This will of course "decode" your double backslash into one single backslash. But it will also remove all single backslashes in the whole string. If there were none - fine, otherwise things will fail now.
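A small comparison of the two, as a sketch. The sample text is invented for illustration: it holds two doubled backslashes and one lone backslash in front of an n.

```php
<?php
// Invented sample: "C:\\temp\\new \n" exactly as typed, thanks to nowdoc.
$input = <<<'TXT'
C:\\temp\\new \n
TXT;

// Collapses only the doubled backslashes; the lone one before "n" is kept.
$viaStrReplace = str_replace('\\\\', '\\', $input);

// Also collapses the doubled ones, but drops the lone backslash as well.
$viaStripslashes = stripslashes($input);

echo $viaStrReplace, "\n";
echo $viaStripslashes, "\n";
```

Which of the two is right depends on whether the lone backslashes in your data are meaningful.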
$pattern = '[\\]'; // wrong
$pattern = '[\\\\]'; // right
escape \ as \\ and escape \\ as \\\\ because \\] means escaped ].
Use the htmlentities function to convert your slashes to HTML entities, then use str_replace or preg_match to swap them for the new entity.

how do i correct this regular expressions pattern for php

How do i make this match the following text correctly?
$string = "(\'streamer\',\'http://dv_fs06.ovfile.com:182/d/pftume4ksnroarhlslexwl7bcnoqyljeudgmd7dimssniu2b2r2ikr2h/video.flv\')";
preg_match("/streamer\\'\,\\\'(.*?)\\\'\)/", $string , $result);
var_dump($result);
Your $string looks weird. Better to make a three pass parse:
$string = str_replace(array("\'"), '', $string);
Now we have string:
"(streamer,http://dv_fs06.ovfile.com:182/d/pftume4ksnroarhlslexwl7bcnoqyljeudgmd7dimssniu2b2r2ikr2h/video.flv)"
Now let's trim brackets:
$string = trim($string, '()');
And finally, explode:
list($streamer, $url) = explode(',', $string, 2);
No need of regex.
Btw, your string looks like it was sloppily slashed in a MySQL query.
It's been a while since I last did regexp matching in PHP, but I think you have to remember that:
' doesn't need to be escaped in PHP strings enclosed by "
\ always needs to be escaped in PHP strings
\ needs to be escaped yet another time in regexps (for it's a special character and you want to treat it as a normal one)
=> \ as part of the string to be matched must be escaped 4 times.
My suggestion:
preg_match("/\\(\\\\'streamer\\\\',\\\\'(.*?)\\\\'\\)/", $string , $result);
You're on the right track. Two barriers to overcome (As codethief says):
1 - Double quoted string interpolation
2 - Regex escape interpolation
For (2), neither commas nor quotes need to be escaped because they are not metachars
special to regexes. Only the backslash as a literal needs to be escaped, otherwise
in regex context, it represents the start of a metachar sequence (like \s).
For (1), PHP will try to interpolate escaped chars as a control code (like \n), for
that reason the literal backslash needs to be escaped. Since this is double quoted,
\' the escaped single quote has no escape meaning and is left as it is.
Therefore, "\\\'" resolves as \\ -> \ followed by \' -> \', giving \\' which is what the regex sees.
Then the regex interprets the sequence /\\'/ as a literal \ followed by '.
Making a slight change of your regex solves the problem:
preg_match("/streamer\\\',\\\'(.*?)\\\'\)/", $string , $result);
A working example is here http://beta.ideone.com/47EIY
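To see the two layers separately, here is a sketch that writes the pattern once as a nowdoc (exactly what the regex engine receives) and once as a double-quoted literal with every backslash doubled, then checks that they are the same string. The URL is a shortened placeholder, not the one from the question.

```php
<?php
// What the regex engine must receive: nowdoc, so PHP unescapes nothing.
$pattern = <<<'REGEX'
/streamer\\',\\'(.*?)\\'\)/
REGEX;

// The same pattern as a double-quoted PHP literal: each backslash is written twice.
$doubleQuoted = "/streamer\\\\',\\\\'(.*?)\\\\'\\)/";

// Subject shaped like the one in the question, with a placeholder URL.
$string = <<<'TXT'
(\'streamer\',\'http://example.invalid/video.flv\')
TXT;

var_dump($pattern === $doubleQuoted);
preg_match($pattern, $string, $result);
var_dump($result[1]);
```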

Error in regexp php

There is a mistake in this code, I could not find it. What is the missing character do I need?
preg_replace(/<(?!\/?(?:'.implode('|',$white).'))[^\s>]+(?:\s(?:(["''])(?:\\\1|[^\1])*?\1|[^>])*)?>/','',$html);
It looks like among other things you're missing a single quote:
preg_replace('/<(?!\/?(?:' . implode('|',$white) . '))[...
             ^
             here!
Also, since the pattern contains single-quotes, those would also have to be escaped by preceding with backslash.
Alternatively you could also use heredoc syntax; this would not require any escaping of quotes in the pattern, and expressions can be embedded for expansion.
$pattern = <<<EOD
/pattern{embeddedExpression}morePattern/
EOD;
... preg_replace($pattern, ...)
Do yourself a favor and use DOM and XPath instead of regex to parse HTML to avoid problems.
Well, this part is wrong:
(["'])(?:\\\1|[^\1])*?\1
That's supposed to match a sequence enclosed in single- or double quotes, possibly including backslash-escaped quotes. But it won't work because backreferences don't work in character classes. The \1 is treated as the number 1 in octal notation, so [^\1] matches any character except U+0001.
If it seems to work most of the time, it's because of the reluctant quantifier (*?). The first alternative in (?:\\\1|[^\1])*? correctly consumes an escaped quote, but otherwise it just matches any character, reluctantly, until it sees an unescaped quote. It works okay on well-formed text, but toss in an extra quote and it goes haywire.
The correct way to match "anything except what group #1 captured" is (?:(?!\1).)* - that is, consume one character at a time, but only after the lookahead confirms that it's not the first character of the captured text. But I think you'll be better off dealing with each kind of quote separately; this regex is complicated enough as it is.
'~<(?!/?+(?:'.implode('|',$white).')\b)[^\s>]++(?:\s++'.
'(?:[^\'">]++|"(?:[^"\\\\]++|\\\\")*+"|\'(?:[^\'\\\\]++|\\\\\')*+\')*+)?+>~'
Notice the addition of the \b (word boundary) after the whitelist alternation. Without that, if you have (for example) <B> in your list, you'll unintentionally whitelist <BODY> and <BLOCKQUOTE> tags as well.
I also used possessive quantifiers (*+, ++, ?+) everywhere, because the way this regex is written, I know backtracking will never be useful. If it's going to fail, I want it to fail as quickly as possible.
Now that I've told you how to get the regex to work, let me urge you not to use it. This job is too complex and too important to be done with such a poorly suited tool as regex. And if you really got that regex from a book on PHP security, I suggest you get your money back.
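As a footnote to the backreference point above, here is a small sketch of the (?:(?!\1).) idiom on its own, away from the tag-stripping regex. The function name quotedSpans and the sample attribute text are invented; the last two lines probe the claim that \1 inside a character class is read as the character U+0001.

```php
<?php
// Hypothetical helper: finds spans delimited by matching single or double quotes.
function quotedSpans(string $text): array
{
    // \\.          consumes a backslash and whatever follows it (an escaped quote, say)
    // (?!\1).      consumes one character, but only if it is not the opening quote
    $regex = <<<'REGEX'
/(["'])(?:\\.|(?!\1).)*+\1/s
REGEX;

    preg_match_all($regex, $text, $m);
    return $m[0]; // whole matches, quotes included
}

$text = <<<'TXT'
title="it's \"ok\"" alt='x'
TXT;

print_r(quotedSpans($text));

// Inside a character class, \1 is not a backreference:
var_dump(preg_match('/[^\1]/', "\x01")); // the class excludes U+0001 itself
var_dump(preg_match('/[^\1]/', '"'));    // but happily matches a quote
```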
