Is this conditional regex the most efficient? - php

I'll give my example in PHP. I am testing whether quoted strings are properly closed (e.g., a quoted string must close with a double quote if it begins with one). There must be at least one character between the quotes, and the characters between the quotes cannot include the quote character that starts and ends the string. For example:
$myString = "hello";// 'hello' also good but "hello' should fail
if (preg_match("/^(\")?[^\"]+(?(1)\")|(\')?[^\']+(?(1)\')$/", $myString)) {
die('1');
} else {
die('2');
}
// The string '1' is outputted which is correct
I am new to conditional regex but to me it seems that I cannot make the preg_match() any simpler. Is this correct?

To do that, there's no need to use the "conditional feature". But you need to check the string from the start until the end (in other words, you can't do it by checking only a part of the string):
preg_match('~\A[^"\']*+(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
If you absolutely want at least one character inside quotes, you need to add these lookaheads (?=[^"]) and (?=[^']):
preg_match('~\A[^"\']*+(?:"(?=[^"])[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'(?=[^\'])[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
details:
~
\A # start of the string
[^"']*+ #"# all that is not a quote
(?:
" #"# opening quote
(?=[^"]) #"# at least one character that isn't a quote
[^"\\]*+ #"# all characters that are not quotes or backslashes
(?:\\.[^"\\]*)*+ #"# an escaped character and the same (zero or more times)
" #"# closing quote
[^"']*
| #"# or same thing for single quotes
'(?=[^'])[^'\\]*+(?:\\.[^'\\]*)*+'[^"']*
)*+
\z # end of the string
~s # singleline mode: the dot matches newlines too
Note that these patterns are designed to deal with escaped characters.
Most of the time a conditional can be replaced with a simple alternation.
As an aside: don't believe that shorter patterns are always better than longer patterns, it's a false idea.
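As a rough illustration of replacing the conditional with an alternation, here is a minimal sketch of the whole-string check from the question. It assumes escaped quotes do not need to be handled, and the helper name isProperlyQuoted is made up for this example:

```php
<?php
// Hypothetical helper: true when $s is exactly one quoted string whose
// closing quote matches its opening quote, with at least one character inside.
// Assumption: backslash-escaped quotes are NOT supported.
function isProperlyQuoted(string $s): bool
{
    // Branch 1: "..." with no inner double quote.
    // Branch 2: '...' with no inner single quote.
    return preg_match('~\A(?:"[^"]+"|\'[^\']+\')\z~', $s) === 1;
}

var_dump(isProperlyQuoted('"hello"'));   // bool(true)
var_dump(isProperlyQuoted("'hello'"));   // bool(true)
var_dump(isProperlyQuoted('"hello\''));  // bool(false): mismatched quotes
var_dump(isProperlyQuoted('""'));        // bool(false): nothing between the quotes
```

Each branch spells out one quote type, so no backreference or conditional is needed.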

Based on the two observations below, I built my regex to be simple and fast, but not to deal with escaped quotes:
The OP was asked specifically whether the string $str = "hello, I said: \"How are you?\"" would be invalid and did not respond
The OP mentioned performance (efficiency) as a criterion
I'm also not a fan of code that is tough to read, so I used the <<< Nowdoc notation to avoid having to escape anything in the regex pattern
My solution:
$strings = [
"'hello's the word'",
"'hello is the word'",
'"hello "there" he said"',
'"hello there he said"',
'"Hi',
"'hello",
"no quotes",
"''"
];
$regexp = <<< 'TEXT'
/^('|")(?:(?!\1).)+\1$/
TEXT;
foreach ($strings as $string):
echo "$string - ".(preg_match($regexp,$string)?'true':'false')."<br/>";
endforeach;
Output:
'hello's the word' - false
'hello is the word' - true
"hello "there" he said" - false
"hello there he said" - true
"Hi - false
'hello - false
no quotes - false
'' - false
How it works:
^('|") //starts with single or double-quote
(?: //non-capturing group
(?!\1) //next char is not the same as first single/double quote
. //advance one character
)+ //repeat group with next char (there must be at least one char)
\1$ //End with the same single or double-quote that started the string

Related

Regex for matching single-quoted strings fails with PHP

So I have this regex:
/'((?:[^\\']|\\.)*)'/
It is supposed to match single-quoted strings while ignoring internal, escaped single quotes \'
It works here, but when executed with PHP, I get different results. Why is that?
This might be easier using negative lookbehind. Note also that you need to escape the slashes twice - once to tell PHP that you want a literal backslash, and then again to tell the regex engine that you want a literal backslash.
Note also that your capturing expression (.*) is greedy - it will capture everything between ' characters, including other ' characters, whether they are escaped or not. If you want it to stop after the first unescaped ', use .*? instead. I have used the non-greedy version in my example below.
<?php
$test = "This is a 'test \' string' for regex selection";
$pattern = "/(?<!\\\\)'(.*?)(?<!\\\\)'/";
echo "Test data: $test\n";
echo "Pattern: $pattern\n";
if (preg_match($pattern, $test, $matches)) {
echo "Matches:\n";
var_dump($matches);
}
This is kind of an escaping hell. Despite the fact that there's already an accepted answer, the original pattern is actually better. Why? It allows escaping the escape character, using the "unrolling the loop" technique described by Jeffrey Friedl in "Mastering Regular Expressions": "([^\\"]*(?:\\.[^\\"]*)*)" (adapted for single quotes)
Unrolling the Loop (using double quotes)
" # the start delimiter
([^\\"]* # anything but the end delimiter or the escape char
(?:\\. # the escape char preceding an escaped char (any char)
[^\\"]* # anything but the end delimiter or the escape char
)*) # repeat
" # the end delimiter
This does not resolve the escaping hell but you have been covered here as well:
Sample Code:
$re = '/\'([^\\\\\']*(?:\\\\.[^\\\\\']*)*)\'/';
$str = '\'foo\', \'can\\\'t\', \'bar\'
\'foo\', \' \\\'cannott\\\'\\\\\', \'bar\'
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);

php -> preg_replace -> remove space ONLY between quotes

I'm trying to remove space ONLY between quotes like:
$text = 'good with spaces "here all spaces should be removed" and here also good';
can someone help with a working piece of code ? I already tried:
$regex = '/(\".+?\")|\s/';
or
$regex = '/"(?!.?\s+.?)/';
without success, and I found a sample that works in the wrong direction :-(
Removing whitespace-characters, except inside quotation marks in PHP? but I can't change it.
thx Newi
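This kind of problem is easily solved with preg_replace_callback. The idea is to extract the substring between quotes and then edit it in the callback function (the examples here replace each whitespace with # so the result is easy to see; use an empty string to remove it):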
$text = preg_replace_callback('~"[^"]*"~', function ($m) {
return preg_replace('~\s~', '#', $m[0]);
}, $text);
It's the simplest way.
It's more complicated to do it with a single pattern with preg_replace but it's possible:
$text = preg_replace('~(?:\G(?!\A)|")[^"\s]*\K(?:\s|"(*SKIP)(*F))~', '#', $text);
Pattern details:
(?:
\G (?!\A) # match the next position after the last successful match
|
" # or the opening double quote
)
[^"\s]* # characters that aren't double quotes or whitespace
\K # discard all characters matched before from the match result
(?:
\s # a whitespace
|
" # or the closing quote
(*SKIP)(*F) # force the pattern to fail and to skip the quote position
# (this way, the closing quote isn't seen as an opening quote
# in the second branch.)
)
This approach uses the \G anchor to ensure that all matched whitespace characters are between the quotes.
Edge cases:
there's an orphan opening quote: In this case, all whitespace from the last quote until the end of the string is replaced. If you want, you can change this behavior by adding a lookahead that checks whether the closing quote exists:
~(?:\G(?!\A)|"(?=[^"]*"))[^"\s]*\K(?:\s|"(*SKIP)(*F))~
double quotes can contain escaped double quotes that have to be ignored: You have to describe escaped characters like this:
~(?:\G(?!\A)|")[^"\s\\\\]*+(?:\\\\\S[^"\s\\\\]*)*+(?:\\\\?\K\s|"(*SKIP)(*F))~
Another strategy, suggested by @revo: check whether the number of remaining quotes at a position is odd or even using a lookahead:
\s(?=[^"]*+(?:"[^"]*"[^"]*)*+")
It is a short pattern, but it can be problematic with long strings since for each position with a whitespace you have to check the string until the last quote with the lookahead.
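A minimal sketch of that lookahead strategy applied to the sample text from the question. It assumes the double quotes are balanced and never escaped, and it uses an empty replacement to actually remove the whitespace:

```php
<?php
$text = 'good with spaces "here all spaces should be removed" and here also good';

// A whitespace is removed only when an odd number of double quotes follows it,
// i.e. when it sits between an opening quote and its closing quote.
$result = preg_replace('~\s(?=[^"]*+(?:"[^"]*"[^"]*)*+")~', '', $text);

echo $result, "\n";
// good with spaces "hereallspacesshouldberemoved" and here also good
```

The possessive quantifiers matter here: without them the engine could backtrack out of a quote pair and wrongly accept whitespace outside the quotes.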
See the following code snippet:
<?php
$text = 'good with spaces "here all spaces should be removed" and here also good';
echo "$text \n";
$regex = '/(\".+?\")|\s/';
$regex = '/"(?!.?\s+.?)/';
$text = preg_replace($regex,'', $text);
echo "$text \n";
?>
I found a sample that works in the wrong direction :-(
@Graham: correct
$text = 'good with spaces "here all spaces should be removed" and here also good'
should be
$text = 'good with spaces "hereallspacesshouldberemoved" and here also good';

regular expressions escaping rules- Perl-compatible regular expressions

Why, in the following code, do we have to escape the '$' with two backslashes and not one in order to match the string?
<?php
$text = "$3.99";
preg_match_all("/\\$\d+\.\d{2}/", $text, $matches) ;
var_dump($matches) ;
?>
output: array (size=1)
0 =>
array (size=1)
0 => string '$3.99' (length=5)
What is the matching rule for the pattern "/\$\d+.\d{2}/" (one backslash)?
http://php.net/manual/en/language.types.string.php
From the docs
If the string is enclosed in double-quotes ("), PHP will interpret more escape sequences for special characters:
Then
\\ backslash
\$ dollar sign
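So the double backslash is for the PHP string, not the regex: "\\$" reaches the regex engine as \$, a literal dollar sign.
With a single backslash, PHP turns "\$" into a bare $, which the regex engine then reads as an end-of-string anchor rather than a literal dollar sign.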
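To make the two layers of escaping visible, here is a small sketch that compares what PHP hands to the regex engine in each case. The variable names are only for illustration:

```php
<?php
// Double-quoted: PHP turns \\ into one backslash before PCRE sees the pattern.
$twoBackslashes = "/\\$\d+\.\d{2}/";
// Double-quoted: \$ is PHP's escape for a plain dollar sign, so the backslash is gone.
$oneBackslash = "/\$\d+.\d{2}/";

var_dump($twoBackslashes === '/\$\d+\.\d{2}/'); // bool(true)
var_dump($oneBackslash === '/$\d+.\d{2}/');     // bool(true)

var_dump(preg_match($twoBackslashes, '$3.99')); // int(1): \$ is a literal dollar sign
var_dump(preg_match($oneBackslash, '$3.99'));   // int(0): $ is an end-of-string anchor
```

Writing the pattern in single quotes avoids the first layer, because single-quoted strings leave \$ untouched.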
I think PHP has a few options for depicting the regex string.
At the first level, they seem to want strings of some sort, because they don't have quote-like operators like Perl, at least I don't think so.
Level 2, they move on to the regex delimiter (stay away from double quotes as a delimiter).
Level 3, the raw regex is left bare after 1 & 2 are done.
So usually, you have to reverse the process, 3 - 2 - 1, and present that to the PHP source code.
A note: regex delimiters are a tricky business. In level 2, it is possible that you could choose a delimiter that is un-escapable in the regular expression. In your case '$' would not be a viable delimiter.
Some possibilities might help you...
\$\d+\.\d{2} # raw regex
/\$\d+\.\d{2}/ # no quote, using '/' for delimiter
'/\$\d+\.\d{2}/' # single quotes ""
"/\\$\\d+\\.\\d{2}/" # double quotes ""
~\$\d+\.\d{2}~ # no quote, using '~' for delimiter
'~\$\d+\.\d{2}~' # single quotes ""
"~\\$\\d+\\.\\d{2}~" # double quotes ""

Why does \\. equal \. in preg_replace?

In the top-voted answer to this fantastic question, the following regular expression is used in a preg_replace call (from the answer's auto_version function):
'{\\.([^./]+)$}'
The end goal of this regular expression is to extract the file's extension from the given filename. However, I'm confused about why the very beginning of this regular expression works. Namely:
Why does \\. match the same way as \. in a regex?
Shouldn't the former match (a) one literal backslash, followed by (b) any character, while the second matches one literal period? The rules for single quoted strings state that \\ yields a literal backslash.
Consider this simple example:
$regex1 = '{\.([^./]+)$}'; // Variant 1 (one backslash)
$regex2 = '{\\.([^./]+)$}'; // Variant 2 (two backslashes)
$subject1 = '/css/foobar.css'; // Regular path
$subject2 = '/css/foobar\\.css'; // Literal backslash before period
echo "<pre>\n";
echo "Subject 1: $subject1\n";
echo "Subject 2: $subject2\n\n";
echo "Regex 1: $regex1\n";
echo "Regex 2: $regex2\n\n";
// Test Variant 1
echo preg_replace($regex1, "-test.\$1", $subject1) . "\n";
echo preg_replace($regex1, "-test.\$1", $subject2) . "\n\n";
// Test Variant 2
echo preg_replace($regex2, "-test.\$1", $subject1) . "\n";
echo preg_replace($regex2, "-test.\$1", $subject2) . "\n\n";
echo "</pre>\n";
The output is:
Subject 1: /css/foobar.css
Subject 2: /css/foobar\.css
Regex 1: {\.([^./]+)$} <-- Output matches regex 2
Regex 2: {\.([^./]+)$} <-- Output matches regex 1
/css/foobar-test.css
/css/foobar\-test.css
/css/foobar-test.css
/css/foobar\-test.css
Long story short: why should \\. yield the same matched results in a preg_replace call as \.?
Consider that there is double escaping going on: PHP sees \\. and says "OK, this is really \.". Then the regex engine sees \. and says "OK, this means a literal dot".
If you remove the first backslash, PHP sees \. and says "this is a backslash followed by a random character -- not a single quote or a backslash as per the spec -- so it remains \.". The regex engine again sees \. and gives the same result as above.
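A quick way to check this is to compare the two string literals directly. The sketch below also adds a third, four-backslash variant (not from the question) that does what the asker expected, a literal backslash followed by any character:

```php
<?php
$oneBackslash   = '{\.([^./]+)$}';
$twoBackslashes = '{\\.([^./]+)$}';

// Both literals produce the same 13-character string, so PCRE sees one pattern.
var_dump($oneBackslash === $twoBackslashes); // bool(true)
var_dump(strlen($twoBackslashes));           // int(13)

// Illustrative variant: PHP reduces \\\\ to \\, which PCRE reads as one literal backslash.
$literalBackslash = '{\\\\.([^./]+)$}';
var_dump(preg_match($literalBackslash, '/css/foobar.css'));    // int(0)
var_dump(preg_match($literalBackslash, '/css/foobar\\.css'));  // int(1)
```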
An addition to the perfectly correct answer by Jon:
Please consider the different kinds of quotes (" vs '). With ' you cannot write control characters (like a new line) as escape sequences. With " this is possible, using the escape sequences \? where ? can be different things (like \n, \t, etc.). So, if you want a real \ in your double-quoted string, you need to escape the backslash by using \\. Please note that this is usually not necessary when using single quotes.

get initialized string regex

kNO = "Get this value now if you can";
How do I get Get this value now if you can from that string? It looks easy but I don't know where to start.
Start by reading PHP PCRE and see the examples. For your question:
$str = 'kNO = "Get this value now if you can";';
preg_match('/kNO\s+=\s+"([^"]+)"/', $str, $m);
echo $m[1]; // Get this value now if you can
Explanation:
kNO Match with "kNO" in the input string
\s+ Followed by one or more whitespace characters
"([^"]+)" Get any characters within double-quotes
Depending on how you're getting that input, you could use parse_ini_file or parse_ini_string. Dead simple.
Use character classes to start extracting from one open quote to the next:
$str = 'kNO = "Get this value now if you can";';
preg_match('~"([^"]*)"~', $str, $matches);
print_r($matches[1]);
Explanation:
~ //php requires explicit regex bounds
" //match the first literal double quotation
( //begin the capturing group, we want to omit the actual quotes from the result so group the relevant results
[^"] //character class, matches any character that is NOT a double quote
* //matches the aforementioned character class zero or more times (empty string case)
) //end group
" //closing quote for the string.
~ //close the boundary.
EDIT, you may also want to account for escaped quotes, use the following regex instead:
'~"((?:[^\\\\"]+|\\\\.)*)"~'
This pattern is slightly more difficult to wrap your head around. Essentially it is broken into two possible matches (separated by the regex OR character |)
[^\\\\"]+ //match any character that is NOT a backslash and is NOT a double quote
| //or
\\\\. //match a backslash followed by any character.
The logic is pretty straightforward: the first character class will match all characters except a double quote or a backslash. If a quote or a backslash is found, the regex attempts to match the second part of the group. In the event that it's a backslash, it will of course match the pattern \\\\., but it will also advance the match by one character, effectively skipping whatever escaped character followed the backslash. The only time this pattern will stop matching is when a lone, unescaped double quote is encountered.
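A short sketch of the escaped-quote pattern in use. The input string is a made-up variation of the one in the question, and the stripslashes call is an optional extra step, not part of the pattern itself:

```php
<?php
// Single-quoted PHP string, so the value really contains \" sequences.
$str = 'kNO = "Get \"this\" value";';

// Regex as seen by PCRE: "((?:[^\\"]+|\\.)*)"
preg_match('~"((?:[^\\\\"]+|\\\\.)*)"~', $str, $m);

echo $m[1], "\n";               // Get \"this\" value
echo stripslashes($m[1]), "\n"; // Get "this" value
```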
