Why, in the following code, do we have to escape the '$' with two backslashes rather than one in order to match the string?
<?php
$text = "$3.99";
preg_match_all("/\\$\d+\.\d{2}/", $text, $matches) ;
var_dump($matches) ;
?>
output: array (size=1)
0 =>
array (size=1)
0 => string '$3.99' (length=5)
What is the matching rule for the pattern "/\$\d+.\d{2}/" (one backslash)?
http://php.net/manual/en/language.types.string.php
From the docs
If the string is enclosed in double-quotes ("), PHP will interpret more escape sequences for special characters:
Then
\\ backslash
\$ dollar sign
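So the double backslash is for the PHP string, not the regex: "\\" becomes a single backslash, and the regex engine then receives \$.
With a single backslash, PHP turns "\$" into a bare $, which is then passed to the regex, where it acts as the end-of-string anchor instead of a literal dollar sign.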
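A minimal sketch of the difference, assuming the same $text as in the question (the names $two, $one, $matchesTwo and $matchesOne are only for illustration). It prints the string that actually reaches the regex engine in each case:

```php
<?php
$text = "$3.99";

// Double-quoted: "\\" collapses to one backslash, so the regex engine
// receives \$ (a literal dollar sign).
$two = "/\\$\d+\.\d{2}/";

// Double-quoted: "\$" collapses to a bare $, which the regex engine
// treats as the end-of-subject anchor.
$one = "/\$\d+\.\d{2}/";

echo $two, "\n";   // /\$\d+\.\d{2}/
echo $one, "\n";   // /$\d+\.\d{2}/

$matchesTwo = preg_match($two, $text);  // 1: the literal "$3.99" is found
$matchesOne = preg_match($one, $text);  // 0: no digits can follow the end of the subject
```

Echoing the pattern before handing it to preg_match() is a quick way to see which backslashes PHP has already consumed.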
I think PHP has a few options for writing the regex string.
Level 1: the pattern has to be a string of some sort, because PHP
has no quote-like operators as Perl does, at least I don't think so.
Level 2: inside the string comes the regex delimiter (stay away from double quotes as the delimiter).
Level 3: the raw regex is what remains once levels 1 and 2 are stripped away.
So usually you have to reverse the process (3, 2, 1) and put the result in the PHP source code.
A note: regex delimiters are a tricky business. At level 2 it is possible to
choose a delimiter that cannot be escaped inside the regular expression. In your case '$' would
not be a viable delimiter.
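The three levels can be checked directly in PHP. The sketch below is only an illustration (the variable names and the sample subject are assumptions, not from the question). It shows that a single-quoted and a double-quoted literal end up as the same raw regex, and that switching the delimiter to ~ changes nothing about what is matched:

```php
<?php
// Level 1 differs (quote style), levels 2 and 3 are identical.
$single = '/\$\d+\.\d{2}/';        // single quotes: backslashes before $, d and . are kept as written
$double = "/\\$\\d+\\.\\d{2}/";    // double quotes: every backslash is doubled

// Level 2 differs (delimiter), level 3 is identical.
$tilde = '~\$\d+\.\d{2}~';

$same = ($single === $double);     // true: both literals encode /\$\d+\.\d{2}/

$subject = 'Total: $3.99';
preg_match($single, $subject, $m1);
preg_match($tilde, $subject, $m2);

echo $m1[0], "\n";                 // $3.99
echo $m2[0], "\n";                 // $3.99
```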
Some possibilities that might help you...
\$\d+\.\d{2} # raw regex
/\$\d+\.\d{2}/ # no quote, using '/' for delimiter
'/\$\d+\.\d{2}/' # single quotes, same delimiter
"/\\$\\d+\\.\\d{2}/" # double quotes, same delimiter
~\$\d+\.\d{2}~ # no quote, using '~' for delimiter
'~\$\d+\.\d{2}~' # single quotes, same delimiter
"~\\$\\d+\\.\\d{2}~" # double quotes, same delimiter
Related
So I have this regex that works on regex101.com
(?:[^\#\\S\\+]*)
It matches the first from first#second.
Whenever I try to use my regex with PHP's preg_replace I don't get the result I expect.
So far I tried it via preg_quote():
\(\?\:\[\^\\#\\S\\\+\]\*\)
And tried it with escaping the original \\ with 4 \'s:
\(\?\:\[\^\\#\\\\S\\\\\+\]\*\)
Still no success. Am I doing something fundamentaly wrong?
I'm just using:
preg_replace("/$regex/", "", $string);
All my other regexes that don't need so many escape chars work perfectly that way.
When you use (?:[^\#\\S\\+]*) in a preg_match in PHP, in either a single- or a double-quoted string literal, the \\S reaches the regex engine as \S, the non-whitespace pattern. [^\S] is equal to \s, i.e. it matches whitespace.
The preg_quote() function is only meant to turn an arbitrary string into a literal for use in a regex: it escapes all chars that are special regex metacharacters / operators (like (, ), [, etc.), so you should not use it here.
While you could use a regex to match 1+ chars other than whitespace and # from the start of a string like preg_match('~^[^#\s]+~', $s, $match), you can just explode your input string with # and get the 0th item.
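Both routes can be sketched as follows, assuming the sample input first#second from the question (the variable names are illustrative only):

```php
<?php
$s = 'first#second';

// Regex route: one or more characters from the start of the string
// that are neither '#' nor whitespace.
preg_match('~^[^#\s]+~', $s, $match);
$viaRegex = $match[0];

// Plain string route: split on '#' and take the 0th item.
$viaExplode = explode('#', $s)[0];

echo $viaRegex, "\n";    // first
echo $viaExplode, "\n";  // first
```

Note that the two are not identical for every input: the regex also stops at whitespace and requires at least one character, while explode() splits on # only.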
I'll give my example in PHP. I am testing if quoted strings are properly closed (e.g., quoted string must close with double quotes if begins with dq). There must be at least 1 char between the quotes, and that char-set between the quotes cannot include the same start/end quote character. For example:
$myString = "hello";// 'hello' also good but "hello' should fail
if (preg_match("/^(\")?[^\"]+(?(1)\")|(\')?[^\']+(?(1)\')$/", $myString)) {
die('1');
} else {
die('2');
}
// The string '1' is outputted which is correct
I am new to conditional regex but to me it seems that I cannot make the preg_match() any simpler. Is this correct?
To do that, there's no need to use the "conditional feature". But you need to check the string from the start until the end (in other words, you can't do it by checking only a part of the string):
preg_match('~\A[^"\']*+(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
If you absolutely want at least one character inside quotes, you need to add these lookaheads (?=[^"]) and (?=[^']):
preg_match('~\A[^"\']*+(?:"(?=[^"])[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'(?=[^\'])[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
details:
~
\A # start of the string
[^"']*+ #"# all that is not a quote
(?:
" #"# opening quote
(?=[^"]) #"# at least one character that isn't a quote
[^"\\]*+ #"# all characters that are not quotes or backslashes
(?:\\.[^"\\]*)*+ #"# an escaped character and the same (zero or more times)
" #"# closing quote
[^"']*
| #"# or same thing for single quotes
'(?=[^'])[^'\\]*+(?:\\.[^'\\]*)*+'[^"']*
)*+
\z # end of the string
~s # singleline mode: the dot matches newlines too
demo
Note that these patterns are designed to deal with escaped characters.
Most of the time a conditional can be replaced with a simple alternation.
As an aside: don't believe that shorter patterns are always better than longer patterns, it's a false idea.
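As a rough illustration of replacing a conditional with an alternation, here is a deliberately simplified sketch. It assumes the whole string must be exactly one quoted string with at least one character inside, and, unlike the patterns above, it makes no attempt to handle escaped quotes. The sample strings are made up for the example:

```php
<?php
// Either "..." with no double quote inside, or '...' with no single quote inside.
$pattern = '/^(?:"[^"]+"|\'[^\']+\')$/';

$samples = ['"hello"', "'hello'", "\"hello'", '""'];

$results = [];
foreach ($samples as $sample) {
    $results[$sample] = (preg_match($pattern, $sample) === 1);
}

foreach ($results as $sample => $ok) {
    echo $sample, ' - ', ($ok ? 'true' : 'false'), "\n";
}
// "hello" - true
// 'hello' - true
// "hello' - false
// "" - false
```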
Based on the two observations below, I built my regex to be simple and fast, but not to deal with escaped quotes:
The OP was asked specifically whether the string $str = "hello, I said: \"How are you?\"" would be invalid and did not respond.
The OP mentioned performance (efficiency) as a criterion.
I'm also not a fan of code that is tough to read, so I used the <<< Nowdoc notation to avoid having to escape anything in the regex pattern
My solution:
$strings = [
"'hello's the word'",
"'hello is the word'",
'"hello "there" he said"',
'"hello there he said"',
'"Hi',
"'hello",
"no quotes",
"''"
];
$regexp = <<< 'TEXT'
/^('|")(?:(?!\1).)+\1$/
TEXT;
foreach ($strings as $string):
echo "$string - ".(preg_match($regexp,$string)?'true':'false')."<br/>";
endforeach;
Output:
'hello's the word' - false
'hello is the word' - true
"hello "there" he said" - false
"hello there he said" - true
"Hi - false
'hello - false
no quotes - false
'' - false
How it works:
^('|") //starts with single or double-quote
(?: //non-capturing group
(?!\1) //next char is not the same as first single/double quote
. //advance one character
)+ //repeat group with next char (there must be at least one char)
\1$ //End with the same single or double-quote that started the string
I am trying to escape a string for use in a regular expression in PHP. So far I tried:
preg_quote(addslashes($string));
I thought I need addslashes in order to properly account for any quotes that are in the string. Then preg_quote escapes the regular expression characters.
However, the problem is that quotes are escaped with backslash, e.g. \'. But then preg_quote escapes the backslash with another one, e.g. \\'. So this leaves the quote unescaped once again. Switching the two functions does not work either because that would leave an unescaped backslash which is then interpreted as a special regular expression character.
Is there a function in PHP to accomplish the task? Or how would one do it?
The proper way is to use preg_quote and specify the used pattern delimiter.
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax... characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
Trying to use a backslash as delimiter is a bad idea. Usually you pick a character, that's not used in the pattern. Commonly used is slash /pattern/, tilde ~pattern~, number sign #pattern# or percent sign %pattern%. It is also possible to use bracket style delimiters: (pattern)
Your regex with the modification mentioned in the comments by @CasimiretHippolyte and @anubhava:
$pattern = '/(?<![a-z])' . preg_quote($string, "/") . '/i';
Maybe you wanted to use a \b word boundary instead. No additional escaping is needed.
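A small sketch of why addslashes() is unnecessary, assuming a made-up $string that contains an apostrophe, the delimiter and several metacharacters. Quotes only matter when writing PHP source literals. At runtime the string is just data, and preg_quote() with the delimiter as its second argument covers everything the regex engine cares about:

```php
<?php
$string = "it's 1/2 (approx.)";

// Escapes regex metacharacters and, because of the second argument, the '/' delimiter.
$quoted = preg_quote($string, '/');
echo $quoted, "\n";   // it's 1\/2 \(approx\.\)

$pattern = '/(?<![a-z])' . $quoted . '/i';

$hit  = preg_match($pattern, "Roughly, it's 1/2 (approx.) full");  // 1
$miss = preg_match($pattern, "bit's 1/2 (approx.)");               // 0: preceded by a letter
```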
I use the below regular expression to only match a given character sequence if it is not surrounded by quotes - that is, if it is followed by an even number of quotes (using a positive lookahead) until the end of the string.
Say I want to match the word section only if it is not between quotes:
\bsection\b(?=[^"]*(?:"[^"]*"[^"]*)*$)
Working example on RegExr
How would I extend this to take escaped quotes into consideration? That is, if I insert a \" between the quotes in the linked example, the results stay the same.
With PCRE you can skip the quoted parts using the (*SKIP)(*F) verbs:
(?s)".*?(?<!\\)"(*SKIP)(*F)|\bsection\b
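In a PHP string the backslash in the lookbehind has to be escaped a second time: the regex needs \\ to match one literal backslash, which is normally written \\\\ in PHP source. In a single-quoted pattern three backslashes (\\\) are enough in this case, because PHP leaves the \) that follows untouched.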
$pattern = '/".*?(?<!\\\)"(*SKIP)(*F)|\bsection\b/s';
See test at regex101.
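For a quick local check, here is a sketch that runs the pattern above against a made-up subject containing an escaped quote inside the quoted part. The subject and the variable names are assumptions for illustration:

```php
<?php
// Three backslashes in single quotes: PHP turns \\ into \ and leaves \) alone,
// so the regex engine sees (?<!\\) .
$pattern = '/".*?(?<!\\\)"(*SKIP)(*F)|\bsection\b/s';

// Single-quoted, so \" stays as a backslash followed by a double quote.
$subject = 'section "a section \" section" section';

$count = preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

echo $count, "\n";   // 2
foreach ($matches[0] as [$text, $offset]) {
    echo $offset, ': ', $text, "\n";
}
// 0: section
// 31: section
```

Only the two occurrences outside the quotes are reported. The two inside the quoted part are consumed by the first alternative and then discarded by (*SKIP)(*F).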
<?php
$a='/\\\/';
$b='/\\\\/';
var_dump($a);//string '/\\/' (length=4)
var_dump($b);//string '/\\/' (length=4)
var_dump($a===$b);//boolean true
?>
Why is the string with 3 backslashes equal to the string with 4 backslashes in PHP?
And can we use the 3-backslash version in regular expression?
The PHP reference says we must use 4 backslashes.
Note:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
$b='/\\\\/';
PHP parses the string literal (more or less) character by character. The first input symbol is the forward slash. The result is a forward slash in the result (of the parsing step) and the input symbol (one character, the /) is taken away from the input.
The next input symbol is a backslash. It's taken from the input and the next character/symbol is inspected. It's also a backslash. That's a valid combination, so the second symbol is also taken from the input and the result is a single backslash (for both input symbols).
The same with the third and fourth backslash.
The last input symbol (within the literal) is the forwardslash -> forwardslash in the result.
-> /\\/
Now for the string with three backslashes:
$a='/\\\/';
PHP "finds" the first backslash, the next character is a backslash - that's a valid combination resulting in one single backslash in the result and both characters in the input literal taken.
PHP then "finds" the third backslash, the next character is a forward slash, and this is not a valid combination. So the result is a single backslash (because PHP loves and forgives you....) and only one character is taken from the input.
The next input character is the forward-slash, resulting in a forwardslash in the result.
-> /\\/
=> both literals encode the same string.
It is explained in the documentation on the page about Strings:
Under the Single quoted section it says:
The simplest way to specify a string is to enclose it in single quotes (the character ').
To specify a literal single quote, escape it with a backslash (\). To specify a literal backslash, double it (\\). All other instances of backslash will be treated as a literal backslash.
Let's try to interpret your strings:
$a='/\\\/';
The forward slashes (/) have no special meaning in PHP strings, they represent themselves.
The first backslash (\) escapes the second backslash, as explained in the second sentence of the second paragraph quoted above.
The third backslash stands for itself, as explained in the last sentence of the above quote, because it is not followed by an apostrophe (') or a backslash (\).
As a result, the variable $a contains this string: /\\/.
In
$b='/\\\\/';
there are two backslashes (the second and the fourth) that are escaped by the first and the third backslash. The final (runtime) string is the same as for $a: /\\/.
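The single-quoted rule is easy to verify with strlen(). The following is just a sketch with illustrative variable names. It shows the two recognised escapes, one unrecognised combination, and the two literals from the question:

```php
<?php
$pair  = '\\';   // backslash + backslash: recognised escape, yields one character
$quote = '\'';   // backslash + apostrophe: recognised escape, yields one character
$other = '\/';   // backslash + slash: not an escape, both characters are kept

$lengths = [strlen($pair), strlen($quote), strlen($other)];
echo implode(',', $lengths), "\n";   // 1,1,2

$a = '/\\\/';    // escaped backslash, then a lone backslash kept literally
$b = '/\\\\/';   // two escaped backslashes

var_dump($a === $b);   // bool(true)
echo strlen($a), "\n"; // 4
```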
Note
The discussion above is about the encoding of strings in PHP source. As you can see, there is always more than one (correct) way to encode the same string. Another option (besides string literals enclosed in single or double quotes, or heredoc or nowdoc syntax) is to use constants (for literal backslashes, for example) and build the strings from pieces.
For example:
define('BS', '\\'); // one literal backslash; it must be written '\\', because '\' would escape the closing quote
$c = '/'.BS.BS.'/';
needs escaping only once, in the definition of the constant. The constant BS contains a literal backslash and it is used everywhere a backslash is needed for its intrinsic value. Where a backslash is needed for escaping, a real backslash has to be used (there is no way to use BS for that).
The escaping in regex is a different thing. First, the regex is parsed at runtime, and at runtime $a, $b and $c above contain /\\/, no matter how they were generated.
Then, in regex a backslash followed by a non-alphanumeric character just makes that character literal, so the backslash itself disappears (see the difference above: in a PHP string an unrecognised backslash is kept as a literal backslash).
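To check this at runtime, here is a small sketch. The sample path is an assumption made up for the illustration. Both literals from the question match a single backslash, and a backslash in front of a non-alphanumeric character such as - does not become part of what is matched:

```php
<?php
$a = '/\\\/';    // three backslashes in the source
$b = '/\\\\/';   // four backslashes in the source

// Single-quoted: \t and \f are not escapes here, so the path holds two real backslashes.
$path = 'C:\temp\file.txt';

$countA = preg_match_all($a, $path);
$countB = preg_match_all($b, $path);
echo $countA, ' ', $countB, "\n";   // 2 2

// PHP keeps the backslash in '\-' (two characters), but the regex engine
// treats \- as a plain '-'.
$phpLength = strlen('\-');                 // 2
$dashMatch = preg_match('/^\-$/', '-');    // 1
```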
Combining PHP & regex
There are endless possibilities to make things complicated. Let's try to keep them simple and set some guidelines for regex in PHP:
enclose the regex string in apostrophes ('), if possible; this way there are only two characters that need to be escaped for PHP: the apostrophe and the backslash;
when parsing URLs, paths or other strings that can contain forward slashes (/), use #, ~, ! or @ as the regex delimiter (whichever one is not used in the regex itself); this way there is no need to escape the delimiter when it is used inside the regex;
don't escape characters in the regex when it's not needed; for example, the dash (-) has a special meaning only when it is used in character classes; outside them it's useless to escape it (and even in character classes it can be used unescaped without having any special meaning if it is placed as the very first or the very last character inside the [...] enclosure);