I have finally started to understand the context behind escaping hexadecimal characters such as \x80. The documentation talks about the escape sequences, but I can also see that some regular expressions use double backslashes such as \\x80 - \\xFF.
What's the difference between \\x80 - \\xFF and \x80 - \xFF when using something like preg_replace ?
When using preg_ functions, your string is parsed twice: first by the PHP compiler, and then by the PCRE engine. So if you have, for example:
preg_match("/\x80/"....)
the compiler turns it into
preg_match("/�/"....) // let � be chr(0x80)
and passes this to PCRE. When you have two slashes:
preg_match("/\\x80/"....)
the compiler turns the string into
preg_match("/\x80/"....)
and then it's the PCRE engine that converts this to the literal character �.
It doesn't make a difference in this particular case, but consider:
preg_match("/\x5B/"....)
after compilation
preg_match("/[/"....)
and PCRE fails, because of the dangling metacharacter [. Now if you escape the backslash
preg_match("/\\x5B/"....)
it's compiled to
preg_match("/\x5B/"....)
which makes PCRE happy, because it understands that [ should be taken literally.
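A minimal sketch of the two cases above, assuming PHP 7 or later; the variable names are illustrative only. It prints what each double-quoted literal actually hands to PCRE and whether the pattern compiles (the @ is there only to silence the expected compilation warning).

```php
<?php
// "\x5B" is converted by PHP itself, so PCRE receives a bare "[".
$convertedByPhp = "/\x5B/";
// "\\x5B" reaches PCRE as the four characters \x5B, which PCRE reads as a literal "[".
$convertedByPcre = "/\\x5B/";

var_dump($convertedByPhp);   // string(3) "/[/"
var_dump($convertedByPcre);  // string(6) "/\x5B/"

// Compilation fails for the first pattern, so preg_match() returns false.
$first = @preg_match($convertedByPhp, 'a[b');
// The second pattern compiles and finds the bracket.
$second = preg_match($convertedByPcre, 'a[b');

var_dump($first);   // bool(false)
var_dump($second);  // int(1)
```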
How exactly PHP compiles your string depends on the quotes you use: double, single, heredoc, or nowdoc. See the docs for details. A simple rule of thumb is to use single quotes when possible; if you have to use double quotes (for variable interpolation), escape everything twice, even if there's technically no need (e.g. "\\b$word\\b").
To write the hex character 0x80, you prefix x80 with \, which gives \x80.
In a PHP string, \ escapes special characters. In the string "$var", PHP will try to insert the variable $var (because the string uses "). To escape the $ you write "\$var", and the output is just the plain string $var.
To write a \ itself in a string (no matter whether it uses " or '), you use the same escape character \. So you write \\ to output \.
If you write "\x80", PHP converts the escape sequence and your output is the single byte 0x80, not the four characters \x80. So you escape the \ with another \: "\\x80" outputs \x80.
So to summarize everything:
\x80 is a hex escape sequence; to get it through a double-quoted string literally (for example, to hand it to the regex engine), you write \\x80.
Just some fun:
PHP that outputs js function to alert \x80:
echo "function alertHex(){
alert('\\\\x80 - \\\\xFF');
}";
Why 4 x \? First you escape the PHP string to get alert('\\x80 - \\xFF'), then you escape the JS string to get \x80 - \xFF.
The same goes for preg_replace. Allowed symbols: \, $, a-z, [, ]; pattern: \\\$[a-z]\[\]; in PHP: preg_replace('/\\\\\$[a-z]\\[\\]/', '', $str);
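To make the two layers in the JavaScript example visible, here is a small sketch that builds the same alert() line and counts the backslashes that survive PHP's own parsing. The variable name $js is just for illustration.

```php
<?php
// Four backslashes in the PHP source become two in the generated JavaScript,
// which the JavaScript parser then reduces to one when it reads its own string.
$js = "alert('\\\\x80 - \\\\xFF');";

echo $js, "\n";                      // alert('\\x80 - \\xFF');
echo substr_count($js, '\\'), "\n";  // 4 (two before x80, two before xFF)
```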
I would like to get everything between two stars, except when a star has a leading backslash.
So for example:
*hello* world
should return "hello", but
*hello \* world*
should return "hello * world"
I tried the following regex:
/(?<!\\)\*(.+?)(?<!\\)\*/s
which works perfectly on http://regex101.com/, but PHP returns:
Warning: preg_replace(): Compilation failed: missing ) at offset 21
What am I doing wrong?
--
EDIT 1:
Here's my PHP-Code for that:
var_dump(preg_replace('/(?<!\\)\*(.+?)(?<!\\)\*/s', '<strong>$1</strong>', '*hello world*'));
You are not escaping the backslashes correctly which results in escaping the ) character.
To match a \ in PHP you need 4 backslashes
/(?<!\\\\)\*(.+?)(?<!\\\\)\*/s
It must be done like this because every backslash in a C-like string
must be escaped by a backslash. That would give us a regular
expression with 2 backslashes, as you might have assumed at first.
However, each backslash in a regular expression must be escaped by a
backslash, too. This is the reason that we end up with 4 backslashes.
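As a quick check of the four-backslash version, the sketch below runs it against both sample inputs from the question. Note that the captured text still contains the escaping backslash; stripping it would be a separate step.

```php
<?php
// Single-quoted PHP string: \\\\ becomes \\ for PCRE, i.e. one literal backslash.
$pattern = '/(?<!\\\\)\*(.+?)(?<!\\\\)\*/s';

$plain = preg_replace($pattern, '<strong>$1</strong>', '*hello world*');
$escaped = preg_replace($pattern, '<strong>$1</strong>', '*hello \* world*');

echo $plain, "\n";    // <strong>hello world</strong>
echo $escaped, "\n";  // <strong>hello \* world</strong>
```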
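Or use a character class. In PCRE the backslash is still an escape character inside a character class, so the class needs 4 backslashes in the PHP string as well:
/(?<![\\\\])\*(.+?)(?<![\\\\])\*/s
The article this answer draws on suggests a shorter form:
A literal backslash can also be matched using preg_match() by using a
character class instead. Backslashes are not escaped when they appear
within character classes in regular expressions. Therefore ("[\]")
would match a literal backslash. The backslash must still be escaped
once by another backslash because it is still a C-like string.
That only holds for regex flavors that treat a backslash literally inside brackets (such as POSIX bracket expressions). With preg_ functions, '/[\\]/' reaches PCRE as /[\]/, where \] is an escaped bracket, and compilation fails with a "missing terminating ]" error.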
Edit Found this article which explains why this is necessary. Also, added explanations.
You can use this regex:
\*(.*?(?<!\\))\*
I am trying to learn regex in PHP and am stuck here. My question may seem silly, but please do explain.
I went through a link:
Extra backslash needed in PHP regexp pattern
But I just could not understand something:
In the answer he mentions two statements:
2 backslashes are used for unescaping in a string ("\\\\" -> \\)
1 backslash is used for unescaping in the regex engine (\\ -> \)
My questions:
What does the word "unescaping" actually mean? What is the purpose of unescaping?
Why do we need 4 backslashes to include it in the regex?
The backslash has a special meaning in both regexen and PHP. In both cases it is used as an escape character. For example, if you want to write a literal quote character inside a PHP string literal, this won't work:
$str = ''';
PHP would get "confused" which ' ends the string and which is part of the string. That's where \ comes in:
$str = '\'';
It escapes the special meaning of ', so instead of terminating the string literal, it is now just a normal character in the string. There are more escape sequences like \n as well.
This now means that \ is a special character with a special meaning. To escape this conundrum when you want to write a literal \, you'll have to escape literal backslashes as \\:
$str = '\\'; // string literal representing one backslash
This works the same in both PHP and regexen. If you want to write a literal backslash in a regex, you have to write /\\/. Now, since you're writing your regexen as PHP strings, you need to double escape them:
$regex = '/\\\\/';
One pair of \\ is first reduced to one \ by the PHP string escaping mechanism, so the actual regex is /\\/, which is a regex which means "one backslash".
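The reduction can be observed directly. In this sketch (the sample path is made up), the four backslashes in the source leave a four-character pattern, and that pattern finds the single backslash in the subject.

```php
<?php
$regex = '/\\\\/';      // PHP string escaping leaves the 4 characters /\\/
$subject = 'C:\\temp';  // the string C:\temp, containing one backslash

echo $regex, "\n";                            // /\\/
echo strlen($regex), "\n";                    // 4
echo preg_match_all($regex, $subject), "\n";  // 1
```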
I think you can use "preg_quote()":
http://php.net/preg_quote
This function escapes special chars, so you can give an input as it is, without escaping by yourself:
<?php
$string = "online 24/7. Only for \o/";
$escaped_string = preg_quote($string, "/"); // 2nd param is optional and used if you want to escape also the delimiter of your regex
echo $escaped_string; // $escaped_string: "online 24\/7\. Only for \\o\/"
?>
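A short sketch of how the quoted string might then be used inside a pattern; the subject strings are invented for the example. Because preg_quote() also escapes the dot, only the exact text matches.

```php
<?php
$needle = 'online 24/7. Only for \o/';
// Quote the needle, then wrap it in the same delimiter that was passed to preg_quote().
$pattern = '/' . preg_quote($needle, '/') . '/';

echo $pattern, "\n"; // /online 24\/7\. Only for \\o\//

$exact = preg_match($pattern, 'We are online 24/7. Only for \o/ fans');
$other = preg_match($pattern, 'We are online 24/7! Only for \o/ fans');
var_dump($exact); // int(1)
var_dump($other); // int(0), the escaped dot does not match "!"
```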
To match a literal backslash, many people and the PHP manual say: Always triple escape it, like this \\\\
Note:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
Here is an example string: \test
$test = "\\test"; // outputs \test;
// WON'T WORK: pattern in double-quotes double-escaped backslash
#echo preg_replace("~\\\t~", '', $test); #output -> \test
// WORKS: pattern in double-quotes with triple-escaped backslash
#echo preg_replace("~\\\\t~", '', $test); #output -> est
// WORKS: pattern in single-quotes with double-escaped backslash
#echo preg_replace('~\\\t~', '', $test); #output -> est
// WORKS: pattern in double-quotes with double-escaped backslash inside a character class
#echo preg_replace("~[\\\]t~", '', $test); #output -> est
// WORKS: pattern in single-quotes with double-escaped backslash inside a character class
#echo preg_replace('~[\\\]t~', '', $test); #output -> est
Conclusion:
If the pattern is single-quoted, a backslash has to be double-escaped (\\\) to match a literal \.
If the pattern is double-quoted, it depends on where the backslash is:
- inside a character class, it must be at least double-escaped (\\\)
- outside a character class, it has to be triple-escaped (\\\\)
Who can show me a difference, where a double-escaped backslash in a single-quoted pattern e.g. '~\\\~' would match anything different than a triple-escaped backslash in a double-quoted pattern e.g. "~\\\\~", or fail?
When/why/in what scenario would it be wrong to use a double-escaped \ in a single-quoted pattern e.g. '~\\\~' for matching a literal backslash?
If there's no answer to this question, I would continue to always use a double-escaped backslash \\\ in a single-quoted PHP regex pattern to match a literal \ because there's possibly nothing wrong with it.
A backslash character (\) is considered to be an escape character by both PHP's parser and the regular expression engine (PCRE). If you write a single backslash character, it will be considered as an escape character by PHP parser. If you write two backslashes, it will be interpreted as a literal backslash by PHP's parser. But when used in a regular expression, the regular expression engine picks it up as an escape character. To avoid this, you need to write four backslash characters, depending upon how you quote the pattern.
To understand the difference between the two types of quoting patterns, consider the following two var_dump() statements:
var_dump('~\\\~');
var_dump("~\\\\~");
Output:
string(4) "~\\~"
string(4) "~\\~"
The escape sequence \~ has no special meaning in PHP when it's used in a single-quoted string. Three backslashes do also work because the PHP parser doesn't know about the escape sequence \~. So \\ will become \ but \~ will remain as \~.
Which one should you use:
For clarity, I'd always use ~\\\\~ when I want to match a literal backslash. The other one works too, but I think ~\\\\~ is more clear.
There is no difference between the actual escaping of the backslash in single- and double-quoted strings in PHP, as long as you do it correctly. The reason why you're getting a WON'T WORK on your first example is, as pointed out in the comments, that PHP expands \t to the tab character.
When you're using just three backslashes, the last one in your single-quoted string will be interpreted as \~, which as far as single-quoted strings go, will be left as it is (since it does not match a valid escape sequence). It is however just a coincidence that this will be parsed as you expect in this case, and not have some sort of side effect (e.g. \\\' would not behave the same way).
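The coincidence can be shown with two pairs of literals; this is only a sketch with made-up variable names. Before ~ the third backslash survives, but before a quote it is consumed as part of the \' escape.

```php
<?php
// Three and four backslashes give the same string when followed by "~"...
$three = '~\\\~';
$four = '~\\\\~';
var_dump($three === $four); // bool(true)

// ...but not when the next character is a quote: \' is an escaped quote,
// so the three-backslash version hands PCRE one backslash fewer.
$threeQuote = '~\\\'~';
$fourQuote = '~\\\\\'~';
echo $threeQuote, "\n"; // ~\'~
echo $fourQuote, "\n";  // ~\\'~
```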
The reason for all the escaping is that the regular expression also needs backslashes escaped in certain situations, as they have special meaning there as well. This leads to the large number of backslashes after each other, such as \\\\ (which takes eight backslashes for the markdown parser, as it yet again adds another level of escaping).
Hopefully that clears it up, as you seem to be confused regarding the handling of backslashes in single/double quoted strings more than the behaviour in the regular expression itself (which will be the same regardless of " or ', as long as you escape things correctly).
Using PHP 5.3.1 on windows.
I am just trying to add spaces between numbers and letters, but PHP is mangling my data!
$text = "TUES:8:30AM-5:00PMTHURS:8:30AM-5:00PMSAT:8:00AM-1:00PM";
echo preg_replace("/([0-9]+)([A-Z]+)/","\1 \2",$text);
> TUES:8:☺ ☻AM-5:☺ ☻PMTHURS:8:☺ ☻AM-5:☺ ☻PMSAT:8:☺ ☻AM-1:☺ ☻PM
My file type is ANSI; there is no Unicode in the source.
What the fun is going on here?
Try using $ as your backreference indicator, not '\':
echo preg_replace("/(\d)([A-Z])/","$1 $2",$text); // [A-Z] rather than \w, which would also match digits and split "30" into "3 0"
I'm betting \1 is getting translated to something funky... notice the strange characters don't change between the minutes input being '30' and '00'
The PHP manual says you should double-escape your backreference, or use $ (if you are using version 4.0.4 or newer).
You should use a double backslash when you are using them in a string delimited by double quotes:
echo preg_replace("/(\d)([A-Z])/","\\1 \\2",$text);
The \1 and \2 are being interpreted by PHP as octal escape sequences, i.e. the characters with ASCII codes 1 and 2, which in most standard Windows fonts show up as the two smiley faces you're seeing (when I run the same program on my Linux box, I get character code symbols 0001 and 0002 instead of the smiley faces).
If you want to actually use the regex replacement symbols, you need to do one of two things:
Use single quotes for your regex strings, so that the backslashes aren't used as escape characters by PHP:
preg_replace('/(\d)([A-Z])/','\1 \2',$text);
Use double quotes, but escape the backslashes:
preg_replace("/(\\d)([A-Z])/","\\1 \\2",$text);
I'd suggest the single quote solution as it's easier to read.
Be aware that with double quotes, PHP escaping will always take precedence over regex escaping. This can affect both your regex pattern and the replacement strings. Many PHP escaped characters are the same for regex anyway - for example, \n will work the same in the regex pattern regardless of whether it is escaped by PHP or by regex. But there are some which do not work the same - as you've discovered - so you need to be careful.
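The sketch below puts the three working spellings of the replacement side by side, using the pattern and input from the question. It also confirms that "\1" in double quotes is a single byte with value 1.

```php
<?php
$text = "TUES:8:30AM-5:00PMTHURS:8:30AM-5:00PMSAT:8:00AM-1:00PM";

// In double quotes, \1 is an octal escape: one byte with value 1.
var_dump("\1" === chr(1)); // bool(true)

$single = preg_replace('/([0-9]+)([A-Z]+)/', '\1 \2', $text);     // single quotes
$doubled = preg_replace("/([0-9]+)([A-Z]+)/", "\\1 \\2", $text);  // escaped backslashes
$dollar = preg_replace('/([0-9]+)([A-Z]+)/', '$1 $2', $text);     // $n form

echo $single, "\n";
// TUES:8:30 AM-5:00 PMTHURS:8:30 AM-5:00 PMSAT:8:00 AM-1:00 PM
var_dump($single === $doubled && $doubled === $dollar); // bool(true)
```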
When testing an answer for another user's question I found something I don't understand. The problem was to replace all literal \t \n \r characters from a string with a single space.
Now, the first pattern I tried was:
/(?:\\[trn])+/
which surprisingly didn't work. I tried the same pattern in Perl and it worked fine. After some trial and error I found that PHP wants 3 or 4 backslashes for that pattern to match, as in:
/(?:\\\\[trn])+/
or
/(?:\\\[trn])+/
these patterns - to my surprise - both work. Why are these extra backslashes necessary?
You need 4 backslashes to represent 1 in regex because:
2 backslashes are used for unescaping in a string ("\\\\" -> \\)
1 backslash is used for unescaping in the regex engine (\\ -> \)
From the PHP doc,
escaping any other character will result in the backslash being printed too
Hence for \\\[,
1 backslash is used for unescaping the \, one stay because \[ is invalid ("\\\[" -> \\[)
1 backslash is used for unescaping in the regex engine (\\[ -> \[)
Yes, it works, but it is not good practice.
It works in Perl because you pass it directly as a regex pattern /(?:\\[trn])+/
but in PHP you need to pass it as a string, so the backslash itself needs extra escaping.
"/(?:\\\\[trn])+/"
The regex \\ to match a single
backslash would become '/\\\\/' as a
PHP preg string
The regular expression is just /(?:\\[trn])+/. But since you need to escape the backslashes in string declarations as well, each backslash must be expressed with \\:
"/(?:\\\\[trn])+/"
'/(?:\\\\[trn])+/'
Just three backslashes also work because PHP doesn't know the escape sequence \[ and leaves it alone. So \\ will become \ but \[ will stay \[.
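To see all three variants side by side, here is a sketch using single-quoted literals; the sample subject is invented. The three- and four-backslash literals produce the identical pattern string, while the two-backslash one ends up searching for the text "[trn]".

```php
<?php
$subject = 'a\tb\n\rc';      // single quotes: literal backslash sequences, no control characters

$two = '/(?:\\[trn])+/';     // PCRE sees (?:\[trn])+ and looks for the text "[trn]"
$three = '/(?:\\\[trn])+/';  // PCRE sees (?:\\[trn])+
$four = '/(?:\\\\[trn])+/';  // PCRE sees (?:\\[trn])+

var_dump($three === $four);                     // bool(true)
echo preg_replace($two, ' ', $subject), "\n";   // a\tb\n\rc (unchanged)
echo preg_replace($four, ' ', $subject), "\n";  // a b c
```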
Use str_replace!
$code = str_replace(array('\t', '\n', '\r'), ' ', $code); // single quotes, so the literal two-character sequences \t, \n, \r are replaced
Should do the trick