Problems with quotation marks, preg_replace or str_replace - php

Okay so, I have done stuff with regex before, but this time my brain doesn't seem to wanna work with me.
I'm trying to remove some " and some ', from some json in a string. Here's how far I got with preg_replace.
$string = '"cusComment": "Direct from user input, so need to remove "double quotation marks" and \'single ones as well", "intComment": "" }}';
$blab = preg_replace('/["cusComment": "]"[", "intComment"]/', "", $string);
echo $blab;
This almost works for removing ", with some unwanted results.
Edit:
I guess you could do it the "other way around", and only let letters, numbers, punctuation, commas, dashes, underscores and white space through... still need help :)

This can be achieved with the (*SKIP)(*FAIL) technique. It effectively works to disqualify specific matches and only return the substrings (double quotes in this case) that you want.
The following pattern only handles the tricky double quote matching with regex. A simple call of str_replace() handles the single quotes because ALL of them are to be omitted.
Pattern: /("[:,] "|^"|" \}{2})(*SKIP)(*FAIL)|"|\\'/
This pattern says, disqualify the following matches:
two double quotes that are separated by a colon or comma and space
a double quote at the start of the line
a double quote followed by a space and two closing curly brackets
Whatever is not disqualified is then matched for removal: any remaining double quote, and any single quote preceded by a backslash.
Demo of Pattern & Replacement
PHP Code: (Demo) *the pattern needs an extra backslash in the PHP implementation
$string = '"cusComment": "Direct from user input, so need to remove "double quotation marks" and \'single ones as well", "intComment": "" }}';
$blab = preg_replace('/("[:,] "|^"|" \}{2})(*SKIP)(*FAIL)|"|\\\'/', "", $string);
echo $blab;
I'm not sure what further modifications you are making to prepare this string, but those tasks may or may not be well coupled with this pattern.
If you are going to remove the trailing } and empty spaces, it seems logical to call rtrim($blab,' }'), but it is also reasonable to spare the extra function call and just extend the pattern: /("[:,] "|^"|"(?= \}{2}$))(*SKIP)(*FAIL)|"|\\'| \}{2}$/
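As a quick check of the first pattern above, here is a minimal sketch that wraps it in a helper and runs it on the sample string. The function name stripInnerQuotes is hypothetical, and the pattern still assumes the exact `": "`, `", "` and `" }}` spacing seen in the sample.

```php
<?php
// Hypothetical helper: removes stray quotes inside values while keeping
// the quotes that delimit keys and values in the sample format.
function stripInnerQuotes(string $json): string
{
    // The alternatives before (*SKIP)(*FAIL) are consumed and discarded,
    // so only the quotes matched by the trailing alternatives are replaced.
    $pattern = '/("[:,] "|^"|" \}{2})(*SKIP)(*FAIL)|"|\'/';

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '', $json) ?? $json;
}

$string = '"cusComment": "Direct from user input, so need to remove "double quotation marks" and \'single ones as well", "intComment": "" }}';

echo stripInnerQuotes($string), PHP_EOL;
```

If the real input uses different spacing around colons and commas, the disqualifying alternatives would need to be adjusted to match.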


Regex replace only double hyphens inside quotations

I have a document that's full of quotes, so like: "this is a quote". Some of those quotes have subclauses in two hyphens like: "this quote - this one right here - has em dashes", and some just have one hyphen like: "this quote has just one thing - a hyphen".
I'm trying to have some regex that matches all of the quotes with two hyphens, but not match any quotes with zero or one hyphen, and not match any of the text outside of the quotes. Also I should mention that there are some sentences with one or more hyphens that lie outside of quotes, I need to ignore them as well and not have them interfere with my matches in quotes. I want to change the properly matched quotes' double hyphens to proper em dash characters.
I've tried using lookaheads and negated characters, but can't seem to figure this one out.
Is this something regex can do, or do I need to come up with some kind of other approach (like splitting all of the text into an array and stepping through it, making my changes and then recombining it all at the end)? I can do that instead it just seems like a silly waste of time if there's a one-line regex statement that will do what I want.
Add a \b word boundary at the beginning of the quote, and check that the last character inside the quote is either a letter or number or some kind of punctuation.
("\b[^-"]*-[^-"]*-[^-"]*[\w.!?]")
"(?:[^-"]*-){2}[^-"]*" is about the best you can get with only regex, but it doesn't work if there are two hyphens outside of quotes. Splitting the text into an array is probably the best way to do what you want to.
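For illustration, the second pattern can be combined with preg_replace_callback() so that only the hyphens inside a matched quote are converted. This is a rough sketch: the helper name emDashQuotes and the sample sentence are made up, and the "\u{2014}" escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: converts hyphens to em dashes, but only inside
// double-quoted spans that contain exactly two hyphens.
function emDashQuotes(string $text): string
{
    $pattern = '/"(?:[^-"]*-){2}[^-"]*"/';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // $m[0] is the whole quoted span, including both quote marks.
            return str_replace('-', "\u{2014}", $m[0]);
        },
        $text
    );
}

$sample = 'He said "this quote - this one right here - has em dashes" and "just one - hyphen".';

echo emDashQuotes($sample), PHP_EOL;
```

The limitation mentioned above still applies: if two hyphens sit between the closing quote of one quotation and the opening quote of the next, the pattern can pair the wrong quote marks.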

How to escape quotes immediately following each other?

I want to escape all double quotes which are NOT escaped already.
I am using real_escape_string() already, but I wish to understand what is wrong with the following regular expression/approach:
$str = '"Hello "" """ world!\"';
preg_replace('/(^|[^\\\\]{1})\"/', '${1}\"', $str);
(PS: I know - \\" will NOT be escaped and MIGHT be a problem in some other cases though this doesn't matter to my script.)
The result was:
\"Hello \"" \""\" world!\"
But I wanted it to be:
\"Hello \"\" \"\"\" world!\"
Here is how you escape your sql:
$str = mysql_real_escape_string($str); // legacy mysql_* extension, removed in PHP 7.0
or:
$str = mysqli_real_escape_string($link, $str); // $link is your open mysqli connection
or
$str = *_real_escape_string($str);
// * is your db extension
Or you can use PDO to parametrize your input.
I think you're on the right track, but you're missing two key elements. The first is that you have to include the quote in the negated character class along with the backslash: [^"\\]*. When that part runs out of things to match, the next character (if there is one) must be a quote or a backslash.
If it's a backslash, \\. consumes it and the next character, whatever it is. It might be a quote, a backslash, or anything else; you don't care because you know it's been escaped. Then you go back to gobbling up non-special characters with [^"\\]*.
The other missing element is \G. It anchors the first match to the beginning of the string, just like \A. Each match after that has to start where the previous match ended. This way, when the final " in the regex comes into play, you know that every character before it has been examined, and you are indeed matching an unescaped quote.
$str = '"Hello "" """ world!\"';
$str = preg_replace('/\G([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"/', '$1\"', $str);
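To make the behaviour easy to verify, here is a small sketch that wraps the same pattern in a function and prints the result for the string from the question. The name escapeBareQuotes is hypothetical, and the replacement is written with a doubled backslash, which preg_replace() turns into one literal backslash.

```php
<?php
// Hypothetical helper: adds a backslash before every double quote that
// is not already escaped, scanning the string strictly left to right.
function escapeBareQuotes(string $str): string
{
    // \G forces each match to start where the previous one ended, so every
    // character before the final quote has been examined by the pattern.
    $pattern = '/\G([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"/';

    // Replacement as seen by preg_replace(): $1, a literal backslash, a quote.
    return preg_replace($pattern, '$1\\\\"', $str) ?? $str;
}

echo escapeBareQuotes('"Hello "" """ world!\"'), PHP_EOL;
```

As the question already notes, a sequence such as \\" is treated as an escaped backslash followed by an unescaped quote, so that quote gets escaped too.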

Preg Patterns, to ignore escaped characters

I want to create a RegEx that finds strings that begin and end in single or double quotes.
For example I can match such a case like this:
String: "Hello World"
RegEx: /[\"\'][^\"\']+[\"\']/
However, the problem occurs when quotes appear in the string itself like so:
String: "Hello" World"
We know the above expression will not work.
What I want to be able to do is to have the escape within the string itself, since that will be functionality required anyway:
String: "Hello\" World"
Now I could come up with a long and complicated expression with various patterns in a group, one of them being:
RegEx: /[\"\'][^\"\']+(\\\"|\\\')+[^\"\']+[\"\']/
However that to me seems excessive, and I think there may be a shorter and more elegant solution.
Intended syntax:
run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"
As you can see, the quotes are really only used to make sure that strings with spaces are counted as a single string. Do not worry about arg1, I should be able to match unquoted arguments.
I will make this easier: arguments can only be quoted using double quotes. So I've taken single quotes out of the requirements of this question.
I have modified Rui Jarimba's example:
/(?<=")(\\")*([^"]+((\\(\"))*[^"])+)((\\"")|")/
This now accounts pretty well for most cases, however there is one final case that can defeat this:
run -a "arg3 \" p2" "\"sa\"mple\"\\"
The second argument ends with \\", which is a conventional way in this case to allow a backslash at the end of a nested string. Unfortunately, the regex thinks this is an escaped quote, since the sequence \" still exists at the end of the string.
Firstly, please use ' strings to write your regexes. That saves you a lot of escaping.
Then I see two possibilities. The problem with your attempt is, it allows only consecutive escaped quotes in one place in the string. Also, this allows the use of different quotes at the beginning and the end. You could use a backreference to get around that. So this would be a) slightly more elegant and b) correct:
$pattern = '/(["\'])(\\\\"|\\\\\'|[^"\'])+\1/';
Note that the order of the alternation is important!
The problem with this is, you don't want to escape the quote that you don't use to delimit the string. Therefore, the other possibility is to use lookarounds (since backreferences cannot be used inside character classes):
$pattern = '/(["\'])(?:(?!\1).|(?<=\\\\)\1)+\1/';
Note that four consecutive backslashes are always necessary to match a single literal backslash. That is because in the actual string $pattern they end up as \\ and then the regex engine "uses" the first one to escape the second one.
This will match either an arbitrary character, as long as it is not the starting quote, or the starting quote itself if the previous character was a backslash.
Working demo.
This by the way is equivalent to:
$pattern = '/(["\'])(?:\\\\\1|(?!\1).)+\1/';
But here you have to write the alternation in this order again.
Working demo.
One final note. You can avoid the backreference by providing the two possible strings separately (single and double quoted strings):
$pattern = '/"(?:\\\\"|[^"])+"|\'(?:\\\\\'|[^\'])+\'/';
But you said you were looking for something short and elegant ;) (although, this last one might be more efficient... but you'd have to profile that).
Note that all my regexes leave one case unconsidered: escaped quotes outside of quoted strings. I.e. Hello \" World "Hello" World will give you " World". You can avoid this using another negative lookbehind (using as an example the second regex for which I provided a working demo; it would work the same for all others):
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';
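As a sanity check, this last pattern can be run against the "intended syntax" line from the question with preg_match_all(). The sketch below is only illustrative; the helper name quotedArguments is made up.

```php
<?php
// Hypothetical helper: returns every quoted argument found in a command line,
// treating a backslash followed by the delimiting quote as an escaped quote.
function quotedArguments(string $line): array
{
    $pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';

    preg_match_all($pattern, $line, $matches);

    // $matches[0] holds the full matches, quotes included.
    return $matches[0];
}

$line = 'run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"';

print_r(quotedArguments($line));
```

Because of the + quantifier, empty strings such as "" are not matched, and the trailing \\" case from the question is still read as an escaped quote.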
Try this regex:
['"]([^'"]+((\\(\"|'))*[^'"])+)['"]
Given the following string:
"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"
It will match
"Hello"
'match 2'
"wqwqwqwq wwqwqqwqw"
"Hello \" World"

PHP preg_replace backslash

I have double backslashes '\\' in my string that need to be converted into single backslashes '\'. I've tried several combinations and ended up with the whole string disappearing when I used echo, or with more backslashes being added to the string by accident. This regex thing is making me go bonkers...lol...
I tried this amongst other failed attempts:
$pattern = '[\\]';
$replacement = '/\/';
?>
<td width="100%"> <?php echo preg_replace($pattern, $replacement,$q[$i]);?></td>
I do apologise if this is a foolish issue and I appreciate any pointers.
Use stripslashes() - it does exactly what you're looking for.
<td width="100%"> <?php echo stripslashes($q[$i]);?></td>
Use stripslashes instead. Also, in your regex, you are searching for single backslashes and your replacement is incorrect. \\{2} should search for double backslashes and \ should replace them with singles, although I haven't tested this.
Just to explain further, the pattern [\\] matches any character in a set comprised of a single backslash. In php, you should also delimit your regex with forward slashes: /[\\]/
Your replacement, which is (without delimiters) \, is not a regular expression for matching a single backslash. The regex for matching a single backslash is \\. Note the escaping. This said, the replacement term needs to be a string, not a regex (with the exception of backreferences).
EDIT: Sven claims below that stripslashes removes all backslashes. This is simply not true, and I will explain why below.
If a string contains 2 backslashes, the first one will be considered an escaping backslash and will be removed. This can be seen at http://www.phpfiddle.org/main/code/3yn-2ut. The fact that any backslashes remain at all by itself contradicts the claim that stripslashes removes all backslashes.
Just to clarify, this string declaration is invalid: $x = "\";, since the backslash escapes the second quote. This string "\\" contains one backslash. In the process of unquoting this string, this backslash will be removed. This "\\\\" string contains two backslashes. When unquoting, the first will be considered an escaping backslash, and will be removed.
Use preg_replace to turn double backslash into single backslash:
preg_replace('/\\\\{2}/', '\\', $str)
The \ in the first parameter needs to be escaped twice, once for string and once more for regex, just like CodeAngry says.
In the second parameter it only gets escaped once for string.
Make sense?
Never use a regular expression if the string you are looking for is constant, as is the case with "Every instance of double backslash".
Use str_replace() for this task. It is a very easy function that replaces every occurrence of a string with another.
In your case: str_replace('\\\\', '\\', $var).
The double backslash actually translates into four backslashes, because inside any quotes (single or double), a single backslash is the start of an escape sequence for the following character. If you want one literal backslash, you have to write two of them. If you want two backslashes, you have to write four of them.
I do not like the suggestion of stripslashes(). This will of course "decode" your double backslash into one single backslash. But it will also remove all single backslashes in the whole string. If there were none - fine, otherwise things will fail now.
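The difference between the two approaches shows up on a string that mixes double and single backslashes. The following is a small sketch with a made-up input value; the variable names are only for illustration.

```php
<?php
// The PHP literal below holds these actual characters: a\\b\c
$input = 'a\\\\b\\c';

// str_replace() only collapses the double backslash: a\b\c
$viaStrReplace = str_replace('\\\\', '\\', $input);

// stripslashes() also drops the lone backslash before "c": a\bc
$viaStripslashes = stripslashes($input);

echo $viaStrReplace, PHP_EOL;
echo $viaStripslashes, PHP_EOL;
```

Which result is wanted depends on whether single backslashes in the data are meaningful.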
$pattern = '[\\]'; // wrong
$pattern = '[\\\\]'; // right
escape \ as \\ and escape \\ as \\\\ because \\] means escaped ].
Use htmlentities function to convert your slashes to html entities then using str_replace or preg_match to change them with new entity

RegEx in PHP: Matching a pattern outside of non-escaped quotes

I'm writing a method to lift certain data out of an SQL query string, and I need to regex match any word inside of curly braces ONLY when it appears outside of single-quotes. I also need it to factor in the possibility of escaped (preceded by backslash) quotes, as well as escaped backslashes.
In the following examples, I need the regex to match {FOO} and not {BAR}:
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
I'm using preg_match in PHP to get the word in the braces ("FOO", in this case). Here's the regex string I have so far:
$regex = '/' .
// Match the word in braces
'\{(\w+)\}' .
// Only if it is followed by an even number of single-quotes
'(?=(?:[^\']*\'[^\']*\')*[^\']*$)' .
// The end
'/';
My logic is that, since the only thing I'm parsing is a legal SQL string (besides the brace-thing I added), if a set of braces is followed by an even number of non-escaped quotes, then it must be outside of quotes.
The regex I provided is 100% successful EXCEPT for taking escaped quotes into consideration. I just need to make sure there is no odd number of backslashes before a quote match, but for the life of me I can't seem to convey this in RegEx. Any takers?
The way to deal with escaped quotes and backslashes is to consume them in matched pairs.
(?=(?:(?:(?:[^\'\\]++|\\.)*+\'){2})*+(?:[^\'\\]++|\\.)*+$)
In other words, as you scan for the next quote, you skip any pair of characters that starts with a backslash. That takes care of both escaped quotes and escaped backslashes. This lookahead will allow escaped characters outside of quoted sections, which probably isn't necessary, but it probably won't hurt either.
p.s., Notice the liberal use of possessive quantifiers (*+ and ++); without those you could have performance problems, especially if the target strings are large. Also, if the strings can contain line breaks, you may need to do the matching in DOTALL mode (aka, "singleline" or "/s" mode).
However, I agree with mmyers: if you're trying to parse SQL, you will run into problems that regexes can't handle at all. Of all the things that regexes are bad at, SQL is one of the worst.
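A rough sketch of how the lookahead above could be attached to the brace pattern from the question, assuming the input is a single well-formed string as in the two examples. The helper name bracesOutsideQuotes is hypothetical, and a nowdoc is used so the regex can be written without extra PHP-level escaping.

```php
<?php
// Hypothetical helper: returns the words in {braces} that sit outside
// single-quoted sections, honouring backslash-escaped characters.
function bracesOutsideQuotes(string $sql): array
{
    // Succeeds only if an even number of unescaped quotes follows.
    $lookahead = <<<'RE'
(?=(?:(?:(?:[^'\\]++|\\.)*+'){2})*+(?:[^'\\]++|\\.)*+$)
RE;

    // The s flag lets "." also consume an escaped line break.
    preg_match_all('/\{(\w+)\}' . $lookahead . '/s', $sql, $matches);

    return $matches[1];
}

$sample1 = <<<'SQL'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
SQL;

$sample2 = <<<'SQL'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
SQL;

print_r(bracesOutsideQuotes($sample1));
print_r(bracesOutsideQuotes($sample2));
```

Each candidate match rescans the rest of the string, so on very large inputs the split-based approach or a real parser may be the better fit.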
Simply, and perhaps naively, str_replace out all your double backslashes. Then str_replace out escaped single quotes. At that point it's relatively simple to find matches that are not between single quotes (using your existing regex, for example).
If you really want to use regular expressions for this, I would do it in two steps:
Separate the strings from the non-strings with preg_split:
$re = "('(?:[^\\\\']+|\\\\(\\\\\\\\)*.)*')";
$parts = preg_split('/'.$re.'/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
Replace the whatever in the strings:
foreach ($parts as $key => $val) {
if (preg_match('/^'.$re.'$/', $val)) {
$parts[$key] = preg_replace('/\{([^}]*)}/', '$1', $val);
}
}
But a real parser would probably be better as this approach is not that efficient.
