PHP: Regex to ignore escaped quotes within quotes - php

I looked through related questions before posting this and I couldn't modify any relevant answers to work with my method (not good at regex).
Basically, here are my existing lines:
$code = preg_replace_callback( '/"(.*?)"/', array( &$this, '_getPHPString' ), $code );
$code = preg_replace_callback( "#'(.*?)'#", array( &$this, '_getPHPString' ), $code );
Both match strings enclosed in '' or "". I need the regexes to ignore escaped quotes inside the string they delimit, so data between '' should ignore \' and data between "" should ignore \".
Any help would be greatly appreciated.

For most strings, you need to allow escaped anything (not just escaped quotes). e.g. you most likely need to allow escaped characters like "\n" and "\t" and of course, the escaped-escape: "\\".
This is a frequently asked question, and one which was solved (and optimized) long ago. Jeffrey Friedl covers this question in depth (as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is the regex you are looking for:
Good:
"([^"\\]|\\.)*"
Version 1: Works correctly but is not terribly efficient.
Better:
"([^"\\]++|\\.)*" or "((?>[^"\\]+)|\\.)*"
Version 2: More efficient if you have possessive quantifiers or atomic groups (See: sin's correct answer which uses the atomic group method).
Best:
"[^"\\]*(?:\\.[^"\\]*)*"
Version 3: More efficient still. Implements Friedl's "unrolling-the-loop" technique. Does not require possessive quantifiers or atomic groups (i.e. this can be used in JavaScript and other less-featured regex engines).
Here are the recommended regexes in PHP syntax for both double and single quoted sub-strings:
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";
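As a rough illustration of how these two patterns could slot into the preg_replace_callback() calls from the question, here is a minimal sketch. The helper name maskStrings, the <S0>-style placeholders and the sample input are made up for this example and are not from the question.

```php
<?php
// Patterns from the answer above (unrolled loop, double- and single-quoted).
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";

// Hypothetical helper: swap each matched literal for a placeholder and
// remember the original text in $found.
function maskStrings($pattern, $code, array &$found)
{
    return preg_replace_callback($pattern, function ($m) use (&$found) {
        $found[] = $m[0];
        return '<S' . (count($found) - 1) . '>';
    }, $code);
}

// Nowdoc, so the backslashes below reach the regex engine untouched.
$code = <<<'SRC'
echo "say \"hi\"" . 'it\'s';
SRC;

$found  = [];
$masked = maskStrings($re_dq, $code, $found);
$masked = maskStrings($re_sq, $masked, $found);

echo $masked, "\n";   // echo <S0> . <S1>;
print_r($found);      // the two original literals, escapes intact
```

Running the two passes separately can still mis-pair quotes when one kind of quote appears inside the other kind of string (for example a " inside a single-quoted literal). Joining both patterns with | in a single regex avoids that.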

Try a regex like this:
'/"(\\\\[\\\\"]|[^\\\\"])*"/'
A (short) explanation:
" # match a `"`
( # open group 1
\\\\[\\\\"] # match either `\\` or `\"`
| # OR
[^\\\\"] # match any char other than `\` and `"`
)* # close group 1, and repeat it zero or more times
" # match a `"`
The following snippet:
<?php
$text = 'abc "string \\\\ \\" literal" def';
preg_match_all('/"(\\\\[\\\\"]|[^\\\\"])*"/', $text, $matches);
echo $text . "\n";
print_r($matches);
?>
produces:
abc "string \\ \" literal" def
Array
(
[0] => Array
(
[0] => "string \\ \" literal"
)
[1] => Array
(
[0] => l
)
)
as you can see on Ideone.

This has possibilities:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
/'(?>(?:(?>[^'\\]+)|\\.)*)'/

This will leave the quotes outside
(?<=['"])(.*?)(?=["'])
With the global flag (/g) it will match all groups; PHP has no /g flag, so use preg_match_all() there instead.

This seems to be as fast as the unrolled loop, based on some cursory benchmarks, but is much easier to read and understand. It doesn't require any backtracking in the first place.
"[^"\\]*(\\.[^"\\]*)*"

According to W3 resources :
https://www.w3.org/TR/2010/REC-xpath20-20101214/#doc-xpath-StringLiteral
The general Regex is:
"(\\.|[^"])*"
(There is no need to exclude backslashes in the character class, because the escaped-character alternative is tried first.)
Explanation:
"..." any match between quotes
(...)* The inside can have any length from 0 to infinity
\\.|[^"] First accept a backslash followed by any character; otherwise (|) accept any character that is not a quote
A PHP version of the regex, with extra grouping to handle either kind of quote, can look like this:
<?php
$str='"First \\" \n Second" then \'This \\\' That\'';
echo $str."\n";
// "First \" \n Second" then 'This \' That'
$RX_inQuotes='/"((\\\\.|[^"])*)"/';
preg_match_all($RX_inQuotes,$str,$r,PREG_SET_ORDER);
echo $r[0][1]."\n";
// First \" \n Second
$RX_inAnyQuotes='/("((\\\\.|[^"])*)")|(\'((\\\\.|[^\'])*)\')/';
preg_match_all($RX_inAnyQuotes,$str,$r,PREG_SET_ORDER);
echo $r[0][2]." --- ".$r[1][5];
// First \" \n Second --- This \' That
?>
Try it: http://sandbox.onlinephpfunctions.com/code/4328cc4dfc09183f7f1209c08ca5349bef9eb5b4
Important note: for content whose encoding you are not sure of, add the u flag at the end of the regex (/.../u) to avoid corrupting multi-byte strings such as UTF-8, or use functions like mb_ereg_match.
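To make the note about the u flag concrete, here is a small sketch of what goes wrong without it. The sample strings are hypothetical; the second pattern is the $RX_inQuotes regex from the snippet above.

```php
<?php
// "é" encoded as UTF-8 is two bytes: 0xC3 0xA9.
$s = "\xC3\xA9";

// Without /u, "." matches a single byte, so one dot cannot span the string.
$bytewise = preg_match('/^.$/', $s);    // 0

// With /u, "." matches a whole code point.
$unicode = preg_match('/^.$/u', $s);    // 1

// The inner group captures one "character" per repetition, so without /u
// the last capture is only the second byte of the "é".
preg_match('/"((\\\\.|[^"])*)"/', "\"caf\xC3\xA9\"", $m);
preg_match('/"((\\\\.|[^"])*)"/u', "\"caf\xC3\xA9\"", $mu);

var_dump($bytewise, $unicode);
var_dump(bin2hex($m[2]), bin2hex($mu[2]));   // "a9" versus "c3a9"
```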

Related

exploding a search string

I'm trying to make a search string, which can accept a query like this:
$string = 'title -launch category:technology -tag:news -tag:"outer space"$';
Here's a quick explanation of what I want to do:
$ = suffix indicating that the match should be exact
" = double quotes indicate that the multi-word is taken as a single keyword
- = a prefix indicating that the keyword is excluded
Here's my current parser:
$string = preg_replace('/(\w+)\:"(\w+)/', '"${1}:${2}', $string);
$array = str_getcsv($string, ' ');
I was using the code above before, but it doesn't work as intended for searches with keywords like -tag:"outer space". It doesn't recognize strings starting with the - character, and it breaks the keyword at the whitespace between outer and space despite the enclosing double quotes.
EDIT: What I'm trying to do with that code is to preg_replace -tag:"outer space" into "-tag:outer space" so that they won't be broken when I pass the string to str_getcsv().
You may use preg_replace like this:
preg_replace('/(-?\w+:)"([^"]+)"/', '"$1$2"', $str);
See the PHP demo online.
The regex matches:
(-?\w+:) - Capturing group 1: an optional - (? matches 1 or 0 occurrences), then 1+ letters/digits/underscores and a :
" - a double quote (it will be removed)
([^"]+) - Capturing group 2: one or more chars other than a double quote
" - a double quote
The replacement pattern is "$1$2": ", capturing group 1 value,
capturing group 2 value, and a ".
See the regex demo here.
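Put together with str_getcsv(), the replacement above might be used roughly like this. The sample query is shortened from the question (the trailing $ is left out), and the enclosure and escape arguments are passed explicitly rather than relying on defaults.

```php
<?php
// Sample query, shortened from the question.
$str = 'title -launch -tag:"outer space"';

// Move the opening quote in front of the "-tag:" prefix.
$quoted = preg_replace('/(-?\w+:)"([^"]+)"/', '"$1$2"', $str);
echo $quoted, "\n";   // title -launch "-tag:outer space"

// Split on spaces; the quoted keyword stays in one piece.
$array = str_getcsv($quoted, ' ', '"', '\\');
print_r($array);      // title, -launch, -tag:outer space
```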
Here's how I did it:
$string = preg_replace('/(\-?)(\w+?\:?)"(\w+)/', '"$1$2$3', $string);
$array = str_getcsv($string, ' ');
I also considered formats like -"top ten", for quoted multi-word keywords that don't have a category/tag + colon prefix.
I'm sorry for being slow, I'm new on regex, php and programming in general and this is also my first post in stackoverflow. I'm trying to learn it as a personal hobby. I'm glad that I learned something new today. I'll be reading more about regex since it looks like it can do a lot of stuff.

preg_replace not working but regex working [duplicate]

Lately I've been studying regex (more in practice than in theory, to tell the truth), and I'm noticing its power. Through this question of mine (link), I became aware of backreferences. I think I understand how they work: my pattern works in JavaScript, but not in PHP.
For example I have this string:
[b]Text B[/b]
[i]Text I[/i]
[u]Text U[/u]
[s]Text S[/s]
And use the following regex:
\[(b|i|u|s)\]\s*(.*?)\s*\[\/\1\]
Testing it on regex101.com works, and so does JavaScript, but it does not work with PHP.
Example of preg_replace (not working):
echo preg_replace(
"/\[(b|i|u|s)\]\s*(.*?)\s*\[\/\1\]/i",
"<$1>$2</$1>",
"[b]Text[/b]"
);
While this way works:
echo preg_replace(
"/\[(b|i|u|s)\]\s*(.*?)\s*\[\/(b|i|u|s)\]/i",
"<$1>$2</$1>",
"[b]Text[/b]"
);
I can not understand where I'm wrong, thanks to everyone who helps me.
It is because you use a double-quoted string. Inside a double-quoted string, \1 is read as the octal notation of a character (the control character SOH = start of heading), not as an escaped 1.
So two ways:
use single quoted string:
'/\[(b|i|u|s)\]\s*(.*?)\s*\[\/\1\]/i'
or escape the backslash to obtain a literal backslash (for the string, not for the pattern):
"/\[(b|i|u|s)\]\s*(.*?)\s*\[\/\\1\]/i"
As an aside, you can write your pattern like this:
$pattern = '~\[([bius])]\s*(.*?)\s*\[/\1]~i';
// with the \g{n} notation (Perl/PCRE syntax)
$pattern = '~\[([bius])]\s*(.*?)\s*\[/\g{1}]~i';
// \g{-n} is the same notation, but relative:
// (the second group on the left from the current position)
$pattern = '~\[([bius])]\s*(.*?)\s*\[/\g{-2}]~i';
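The difference between the two quoting styles can be checked directly. This is a small sketch using the pattern and subject from the question; the variable names $ok and $bad are only for illustration.

```php
<?php
// In a double-quoted PHP string, "\1" is an octal escape: one byte, 0x01.
var_dump("\1" === chr(1));   // bool(true)
var_dump(strlen('\1'));      // int(2): a backslash followed by "1"

$subject = '[b]Text[/b]';

// Single-quoted pattern: PCRE receives \1 and treats it as a backreference.
$ok = preg_replace('/\[(b|i|u|s)\]\s*(.*?)\s*\[\/\1\]/i', '<$1>$2</$1>', $subject);

// Double-quoted pattern: PCRE receives a literal 0x01 byte instead,
// so nothing matches and the subject comes back unchanged.
$bad = preg_replace("/\[(b|i|u|s)\]\s*(.*?)\s*\[\/\1\]/i", '<$1>$2</$1>', $subject);

echo $ok, "\n";    // <b>Text</b>
echo $bad, "\n";   // [b]Text[/b]
```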

Regex grab all text between brackets, and NOT in quotes

I'm attempting to match all text between {brackets}, however not if it is in quotation marks:
For example:
$str = 'value that I {want}, vs value "I do {NOT} want" ';
my results should snatch "want", but omit "NOT". I've searched stackoverflow desperately for the regex that could perform this operation with no luck. I've seen answers that allow me to get the text between quotes but not outside quotes and in brackets. Is this even possible?
And if so how is it done?
So far this is what I have:
preg_match_all('/{([^}]*)}/', $str, $matches);
But unfortunately it only gets all text inside brackets, including {NOT}
It's quite tricky to get this done in one go. I even wanted to make it compatible with nested brackets so let's also use a recursive pattern :
("|').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}
Ok, let's explain this mysterious regex :
("|') # match either a single quote or a double and put it in group 1
.*? # match anything ungreedy until ...
\1 # match what was matched in group 1
(*SKIP)(*FAIL) # make it skip this match since it's a quoted set of characters
| # or
\{(?:[^{}]|(?R))*\} # match a pair of brackets (even if they are nested)
Some php code:
$input = <<<INP
value that I {want}, vs value "I do {NOT} want".
Let's make it {nested {this {time}}}
And yes, it's even "{bullet-{proof}}" :)
INP;
preg_match_all('~("|\').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}~', $input, $m);
print_r($m[0]);
Sample output:
Array
(
[0] => {want}
[1] => {nested {this {time}}}
)
Personally I'd process this in two passes. The first to strip out everything in between double quotes, the second to pull out the text you want.
Something like this perhaps:
$str = 'value that I {want}, vs value "I do {NOT} want" ';
// Get rid of everything in between double quotes
$str = preg_replace("/\".*\"/U","",$str);
// Now I can safely grab any text between curly brackets
preg_match_all("/\{(.*)\}/U",$str,$matches);
Working example here: http://3v4l.org/SRnva
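The two-pass idea can also be wrapped in a small function. This is only a sketch: the name bracesOutsideQuotes is made up, it swaps the U modifier for negated character classes, and it assumes no nested braces and no escaped quotes.

```php
<?php
// Hypothetical helper: return the text inside {...} that is not within "...".
function bracesOutsideQuotes(string $str): array
{
    // Pass 1: drop every double-quoted run.
    $unquoted = preg_replace('/"[^"]*"/', '', $str);

    // Pass 2: capture the contents of each remaining {...} pair.
    preg_match_all('/\{([^{}]*)\}/', $unquoted, $m);

    return $m[1];
}

$result = bracesOutsideQuotes('value that I {want}, vs value "I do {NOT} want" ');
print_r($result);   // only "want"
```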

Get Everything between two characters

I'm using PHP. I'm trying to get a Regex pattern to match everything between value=" and " i.e. Line 1 Line 2,...,to Line 4.
value="Line 1
Line 2
Line 3
Line 4"
I've tried /.*?/ but it doesn't seem to work.
I'd appreciate some help.
Thanks.
P.S. I'd just like to add, in response to some comments, that all strings between the first " and last " are acceptable. I'm just trying to find a way to get everything between the very first " and very last " even when there is a " in between. I hope this makes sense. Thanks.
Assuming the desired character is "double quote":
$pat = '/\"([^\"]*?)\"/'; // text between quotes excluding quotes
$value='"Line 1 Line 2 Line 3 Line 4"';
preg_match($pat, $value, $matches);
echo $matches[1]; // $matches[0] is string with the outer quotes
If you just want the answer and don't need a regex specifically, you can use this:
<?php
$str='value="Line 1
Line 2
Line 3
Line 4"';
$need=explode("\"",$str);
var_dump($need[1]);
?>
/.*?/ does not match newline characters (without the s modifier, . stops at a line break). If you want to match them too, use a regular expression like /([^"]*)/.
I agree with Josh K that a regular expression is not required in this case (especially if you know there will be no quotes other than the ones delimiting the string). You could use his solution as well.
If you must use regex:
if (preg_match('!"([^"]+)"!', $value, $m))
echo $m[1];
You need the s pattern modifier, so that . also matches newlines. Something like: /value="(.*)"/s
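For comparison, here is a short sketch of how the greedy pattern with the s modifier differs from a negated character class when the value itself contains quotes. The sample input is hypothetical.

```php
<?php
// Sample value with an extra pair of quotes inside.
$str = "value=\"Line 1\nLine 2 \"quoted\"\nLine 4\"";

// Greedy .* with /s runs to the LAST double quote, across newlines.
preg_match('/value="(.*)"/s', $str, $greedy);

// A negated class stops at the FIRST double quote after value=".
preg_match('/value="([^"]*)"/', $str, $first);

var_dump($greedy[1]);   // everything up to the final quote
var_dump($first[1]);    // "Line 1\nLine 2 " only
```

Which one is wanted depends on the input: the greedy form fits the "very first to very last quote" requirement from the question, while the negated class is the safer choice when several value="..." pairs can appear in one string.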
I'm not a regex guru, but why not just explode it?
// Say $var contains this value="..." string
$arr = explode('value="', $var);
$mid = explode('"', $arr[1]);
$fin = $mid[0]; // Contains what you're looking for.
The specification isn't clear, but you can try something like this:
/value="[^"]*"/
Explanation:
First, value=" is matched literally
Then, match [^"]*, i.e. anything but ", possibly spanning multiple lines
Lastly, match " literally
This does not allow " to appear between the "real" quotes, not even if it's escaped by e.g. preceding with a backslash.
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
References
regular-expressions.info/Examples - Programming Language Constructs - Strings
Has variations on different string patterns (e.g. allowing escaped quotes)
Related questions
Difference between .*? and .* for regex
As much as is practical, negated character class is always a better option than .*?
