Extra backslash needed in PHP regexp pattern - php

When testing an answer for another user's question I found something I don't understand. The problem was to replace all literal \t \n \r characters from a string with a single space.
Now, the first pattern I tried was:
/(?:\\[trn])+/
which surprisingly didn't work. I tried the same pattern in Perl and it worked fine. After some trial and error I found that PHP wants 3 or 4 backslashes for that pattern to match, as in:
/(?:\\\\[trn])+/
or
/(?:\\\[trn])+/
these patterns - to my surprise - both work. Why are these extra backslashes necessary?

You need 4 backslashes to represent 1 in regex because:
2 backslashes are used for unescaping in a string ("\\\\" -> \\)
1 backslash is used for unescaping in the regex engine (\\ -> \)
From the PHP doc,
escaping any other character will result in the backslash being printed too
Hence for \\\[,
1 backslash is used for unescaping the \, and one stays because \[ is not a valid PHP escape sequence ("\\\[" -> \\[)
1 backslash is used for unescaping in the regex engine (\\[ -> \[)
Yes, it works, but it is not good practice.
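As a rough illustration of the explanation above, the following sketch prints each pattern as PCRE actually receives it and then applies it. The sample input and variable names are made up for the example. Single-quoted strings are used so that \t, \n and \r stay literal two-character sequences.

```php
<?php
// Hypothetical input: single quotes keep \t, \n, \r as backslash + letter.
$input = 'a\tb\n\rc';

$patterns = [
    'two'   => '/(?:\\[trn])+/',    // PHP reduces \\ to \, so PCRE sees \[trn]
    'three' => '/(?:\\\[trn])+/',   // \\ becomes \, and \[ is left alone
    'four'  => '/(?:\\\\[trn])+/',  // each \\ pair becomes one \
];

foreach ($patterns as $name => $pattern) {
    // Show the pattern after PHP string parsing, then the replacement result.
    echo $name, ': ', $pattern, ' => ', preg_replace($pattern, ' ', $input), "\n";
}
// two: /(?:\[trn])+/ => a\tb\n\rc
// three: /(?:\\[trn])+/ => a b c
// four: /(?:\\[trn])+/ => a b c
```

With two backslashes PCRE is looking for the literal text [trn], which is why nothing is replaced.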

It works in Perl because you pass the pattern directly as a regex literal: /(?:\\[trn])+/
In PHP, however, you pass the pattern as a string, so the backslash itself needs extra escaping:
"/(?:\\\\[trn])+/"
The regex \\ to match a single backslash would become '/\\\\/' as a PHP preg string

The regular expression is just /(?:\\[trn])+/. But since you need to escape the backslashes in string declarations as well, each backslash must be expressed with \\:
"/(?:\\\\[trn])+/"
'/(?:\\\\[trn])+/'
Just three backslashes also work because PHP doesn’t know the escape sequence \[ and leaves it untouched. So \\ will become \ but \[ will stay \[.

Use str_replace!
$code = str_replace(array("\t","\n","\r"),'',$code);
Should do the trick
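One caveat worth checking, sketched here with a made-up input: in double quotes, "\t", "\n" and "\r" are the real control characters, so they will not match the literal backslash sequences the question is about. Single quotes keep the sequences literal. A space is used as the replacement, as the question asks. Note that str_replace handles each sequence separately, so a run of them yields several spaces.

```php
<?php
// Hypothetical input containing literal backslash sequences, not control characters.
$code = 'one\ttwo\n\rthree';

// Double quotes: searches for real tab/newline/carriage return, finds none.
$doubleQuoted = str_replace(array("\t", "\n", "\r"), ' ', $code);

// Single quotes: searches for backslash + letter.
$singleQuoted = str_replace(array('\t', '\n', '\r'), ' ', $code);

var_dump($doubleQuoted === $code); // bool(true), nothing was replaced
echo $singleQuoted, "\n";          // one two  three
```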

Related

Double escaping hexadecimal characters ie \\x80 - \\xFF

I have finally started to understand the context behind escaping hexadecimal characters such as \x80. The documentation talks about the escape sequences, but I can also see that some regular expression use double backslashes such as \\x80 - \\xFF.
What's the difference between \\x80 - \\xFF and \x80 - \xFF when using something like preg_replace ?
When using preg_ functions, your string is parsed twice - first, by php compiler, and then by the PCRE engine. So if you have, for example:
preg_match("/\x80/"....)
the compiler turns it into
preg_match("/�/"....) // let � stand for chr(0x80)
and passes this to PCRE. When you have two slashes:
preg_match("/\\x80/"....)
the compiler turns the string into
preg_match("/\x80/"....)
and then it's the PCRE engine that converts this to the literal character �.
It doesn't make a difference in this particular case, but consider:
preg_match("/\x5B/"....)
after compilation
preg_match("/[/"....)
and PCRE fails, because of the dangling metacharacter [. Now if you escape the backslash
preg_match("/\\x5B/"....)
it's compiled to
preg_match("/\x5B/"....)
which makes PCRE happy, because it understands that [ should be taken literally.
How exactly php compiles your string depends on the quotes you use: double/single/heredocs/nowdocs. See docs for details. A simple rule of thumb is to use single quotes when possible, if you have to use doubles (for variable interpolation), escape everything twice, even if there's technically no need (e.g "\\b$word\\b").
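The two parsing stages can be made visible with a small sketch like the one below. The variable names and the test subject 'a[b' are only illustrative. The @ operator is used to silence the compilation warning that the first pattern triggers.

```php
<?php
$byPhp  = "/\x5B/";   // PHP converts \x5B itself, so PCRE receives /[/
$byPcre = "/\\x5B/";  // PHP only reduces \\ to \, so PCRE receives /\x5B/

var_dump(strlen($byPhp));   // int(3)
var_dump(strlen($byPcre));  // int(6)

// The first pattern fails to compile (unterminated character class).
var_dump(@preg_match($byPhp, 'a[b'));  // bool(false)

// The second pattern matches a literal "[".
var_dump(preg_match($byPcre, 'a[b'));  // int(1)
```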
To write the hex character x80, you prefix it with \, which gives \x80.
In a PHP string, \ escapes special characters. In the string "$var", PHP will try to insert the variable $var (because the string uses "). To escape the $, you write "\$var", and the output will be just the plain string $var.
To write a \ in a string (no matter whether it uses " or '), you use the same escape character \. So you write \\ to output \.
If you write "\x80", PHP converts the sequence itself, so the string contains the byte 0x80 rather than the four characters \x80. Then you escape the \ with another \: "\\x80" outputs \x80.
So to summarize everything:
\x80 is a hex character escape, and to keep it as the literal sequence \x80 inside a string, you write \\x80.
Just some fun:
PHP that outputs js function to alert \x80:
echo "function alertHex(){
alert('\\\\x80 - \\\\xFF');
}";
Why 4 x \? First you escape the PHP string to get alert('\\x80 - \\xFF'), then you escape the JS string to get \x80 - \xFF.
Same with preg_replace: Allowed symbols: \, $, a-z, [, ]: pattern: \\\$[a-z]\[\]; preg_replace('/\\\\\$[a-z]\\[\\]/', '', $str);

PHP/Regex: Get content between Stars but not if there's a leading backslash

I would like to get everything between two stars, except if they have a leading backslash.
So for example:
*hello* world
should return "hello", but
*hello \* world*
should return "hello * world"
I tried the following regex:
/(?<!\\)\*(.+?)(?<!\\)\*/s
which works perfectly on http://regex101.com/ but PHP returns:
Warning: preg_replace(): Compilation failed: missing ) at offset 21
What am I doing wrong?
--
EDIT 1:
Here's my PHP-Code for that:
var_dump(preg_replace('/(?<!\\)\*(.+?)(?<!\\)\*/s', '<strong>$1</strong>', '*hello world*'));
You are not escaping the backslashes correctly which results in escaping the ) character.
To match a \ in PHP you need 4 backslashes
/(?<!\\\\)\*(.+?)(?<!\\\\)\*/s
It must be done like this because every backslash in a C-like string
must be escaped by a backslash. That would give us a regular
expression with 2 backslashes, as you might have assumed at first.
However, each backslash in a regular expression must be escaped by a
backslash, too. This is the reason that we end up with 4 backslashes.
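A quick way to check this is to print the pattern after PHP has parsed the string and then run it on the two inputs from the question. This is only a sketch. Note that the captured text still contains the backslash in front of the inner star.

```php
<?php
// Four backslashes in the PHP source become two in the pattern PCRE receives.
$pattern = '/(?<!\\\\)\*(.+?)(?<!\\\\)\*/s';

echo $pattern, "\n";
// /(?<!\\)\*(.+?)(?<!\\)\*/s

echo preg_replace($pattern, '<strong>$1</strong>', '*hello* world'), "\n";
// <strong>hello</strong> world

echo preg_replace($pattern, '<strong>$1</strong>', '*hello \* world*'), "\n";
// <strong>hello \* world</strong>
```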
Or use a character class. Note that PCRE still treats the backslash as an escape character inside a character class, so the backslash has to be escaped for the regex engine there as well. The pattern below is shown as it is written in a PHP string:
/(?<![\\\\])\*(.+?)(?<![\\\\])\*/s
Edit: I found this article, which explains why this is necessary. I also added explanations.
You can use this regex:
\*(.*?(?<!\\))\*
Working demo

Backslash in Regex- PHP

I am trying to learn regex in PHP and am stuck here. My question may appear silly, but please do explain.
I went through a link:
Extra backslash needed in PHP regexp pattern
But I just could not understand something:
In the answer he mentions two statements:
2 backslashes are used for unescaping in a string ("\\\\" -> \\)
1 backslash is used for unescaping in the regex engine (\\ -> \)
My questions:
What does the word "unescaping" actually mean? What is the purpose of unescaping?
Why do we need 4 backslashes to include it in the regex?
The backslash has a special meaning in both regexen and PHP. In both cases it is used as an escape character. For example, if you want to write a literal quote character inside a PHP string literal, this won't work:
$str = ''';
PHP would get "confused" which ' ends the string and which is part of the string. That's where \ comes in:
$str = '\'';
It escapes the special meaning of ', so instead of terminating the string literal, it is now just a normal character in the string. There are more escape sequences like \n as well.
This now means that \ is a special character with a special meaning. To escape this conundrum when you want to write a literal \, you'll have to escape literal backslashes as \\:
$str = '\\'; // string literal representing one backslash
This works the same in both PHP and regexen. If you want to write a literal backslash in a regex, you have to write /\\/. Now, since you're writing your regexen as PHP strings, you need to double escape them:
$regex = '/\\\\/';
One pair of \\ is first reduced to one \ by the PHP string escaping mechanism, so the actual regex is /\\/, which is a regex which means "one backslash".
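To see the two levels separately, it can help to look at the lengths of the strings after PHP has parsed them. The test subjects below are made up for the example.

```php
<?php
$str   = '\\';      // PHP string literal for one backslash
$regex = '/\\\\/';  // after PHP parsing: /\\/ which PCRE reads as "one backslash"

var_dump(strlen($str));    // int(1)
var_dump(strlen($regex));  // int(4)

var_dump(preg_match($regex, 'C:\\temp'));           // int(1)
var_dump(preg_match($regex, 'no backslash here'));  // int(0)
```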
I think you can use "preg_quote()":
http://php.net/preg_quote
This function escapes special chars, so you can give an input as it is, without escaping by yourself:
<?php
$string = "online 24/7. Only for \o/";
$escaped_string = preg_quote($string, "/"); // 2nd param is optional and used if you want to escape also the delimiter of your regex
echo $escaped_string; // $escaped_string: "online 24\/7\. Only for \\o\/"
?>
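As a small follow-up sketch, the quoted string can then be wrapped in delimiters and used as a pattern. The haystack strings here are hypothetical.

```php
<?php
$string  = 'online 24/7. Only for \o/';

// Pass the delimiter as the second argument so "/" is escaped too.
$pattern = '/' . preg_quote($string, '/') . '/';

$haystack = 'We are online 24/7. Only for \o/ fans';
var_dump(preg_match($pattern, $haystack));                    // int(1)

// The dot was escaped as well, so it no longer matches any character.
var_dump(preg_match($pattern, 'online 24/7x Only for \o/'));  // int(0)
```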

How to quote a Regex Delimiter that is an escaped (literal) metacharacter (Perl / PHP)?

I'm not sure on this and think this is impossible, but I thought I'd ask anyway.
I would like to use a regex delimiter that is a metachar. Examples would be
brackets, parentheses, etc.: [ ], ( ), ...
but anything really.
It's not that I need to do it; it's that I'm trying to write an escaping routine as part of a project.
So, what's the problem? The problem comes in the regex body when it's not really a metachar,
it's a literal, like:
/ \( \) / where the forward slash delimiters are to be replaced with ( and )
In Perl for instance, these won't work
=~ m( \( \) )
=~ m( \\( \\) )
=~ m( \\\( \\\) )
=~ m( \\\\( \\\\) )
No amount of escaping the parentheses will allow a single backslash, i.e. a literal \(
The backslash on the delimiter is always removed, and the remaining backslashes are then subject to normal quoting rules. This always results in an even number of backslashes.
PHP is apparently the same way.
Like I said, I wouldn't use metacharacters as delimiters in normal operation; this
is just a utility I'm trying to write (which seems in jeopardy right now).
I'm trying to use just basic escaping rules and want to avoid having to scan the string
ahead of time comparing selected delimiters for literal (escaped) metacharacters in
the regex text body.
Perl uses q() and qq(), which do this correctly (not qr(), unfortunately).
They do this by removing escapes on escapes and escapes on delimiters at the same time.
So q( \\\( \\\) ) results in \( \).
Thanks for any help.
Edit
After some research I found this to be impossible, so the utility is scrapped.
Thanks for the valuable input though. I'm fairly impressed with Perl's array of
quoting options, especially the 'quote-like operators', which do the job,
but the delimiter is then really for the quote operator and not a regex.
[ I'm not sure if you're asking about Perl or PHP. I just know about Perl ]
Regex literals are parsed twice, once by the Perl compiler and once by the regex compiler.
The Perl parser finds the end of the literal while handling interpolation, escaped delimiters and sequences like \Q and \L. This produces the regex pattern (as a string) and the matching options (e.g. case-insensitive matching).
qr/\/\(/ produces the pattern /\( (/ got unescaped). Similarly,
qr(\/\() produces the pattern \/( (( got unescaped).
The regex compiler takes the regex pattern and the matching options and returns a compiled regex.
/\( produces a regex that matches exactly /(, while
\/( produces a regex syntax error.
To produce a regex that matches exactly (, you would need to produce the pattern \( or equivalent. Here are your options:
qr/\(/ (Don't use it as a delimiter)
$d='('; qr(\Q$d\E) (Don't use it in the literal)
qr(\Q\(\E) (Use \Q to insert an escape after \( has become ()
qr(\x28) (Use something equivalent)
qr([\(]) (Use it in a way that doesn't require it being escaped)
Your best option by far is to simply choose a different delimiter: one that isn't a metachar, or one that's not used in the pattern. This is trivial since it only matters for hardcoded patterns.
I do not know about PHP, but you can use the \Q in Perl:
"()" =~ m(\Q\(\)\E) and print "YES\n"
Using one-member character classes should work in both Perl and PHP:
"()" =~ m([(][)]) and print "YES\n"
Can you make your example a bit more precise?
Because if the original string is '\(',
then /[\\][(]/ will match it.
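For the PHP side, a minimal sketch of that character-class pattern might look like this. It keeps the ordinary / delimiters, and the subject strings are only examples. The backslash inside the class still has to be written four times in the PHP source.

```php
<?php
// After PHP parsing the pattern is /[\\][(]/ : a backslash followed by "(".
$pattern = '/[\\\\][(]/';

var_dump(preg_match($pattern, '\('));  // int(1), subject is backslash + "("
var_dump(preg_match($pattern, '('));   // int(0), no backslash in the subject
```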

PHP preg_replace not working as intended

I am trying to replace /admin and \admin from the following two strings:
F:\dev\htdocs\cms\admin
http://localhost/cms/admin
Using the following regular expression in preg_replace:
/[\/\\][a-zA-Z0-9_-]*$/i
1) From the first string it just replaces admin, whereas it should replace \admin
2) From the second string it replaces everything except http:, whereas it should replace only /admin
I have checked this expression on http://regexpal.com/ and it works perfectly there, but not in PHP.
Any idea?
Note that the last part of each string, admin, is not fixed; it can
be any user-selected value, and that's why I have used [a-zA-Z0-9_-]* in
the regular expression.
The original regular expression should be /[\/\\][a-zA-Z0-9_-]*$/i, but since you need to escape the backslashes in string declarations as well, each backslash must be expressed with \\ -- 4 backslashes in total.
From the PHP manual:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
So, your preg_replace() statement should look like:
echo preg_replace('/[\/\\\\][a-zA-Z0-9_-]*$/i', '', $str);
Your regex can be improved as follows:
echo preg_replace('~[/\\\\][\w-]*$~', '', $str);
Here, \w matches the ASCII characters [A-Za-z0-9_]. You can also avoid having to escape the forward slash / by using a different delimiter -- I've used ~ above.
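To compare the behaviour described in the question with the corrected pattern, a sketch along these lines can be run on both sample strings. The variable names are made up. The comments show what PCRE receives after PHP has parsed each string.

```php
<?php
$paths = ['F:\dev\htdocs\cms\admin', 'http://localhost/cms/admin'];

// With only two backslashes, PCRE receives /[\/\][a-zA-Z0-9_-]*$/i,
// where \] is an escaped "]" and the two classes merge into one.
$broken = '/[\/\\][a-zA-Z0-9_-]*$/i';

// With four backslashes, PCRE receives ~[/\\][\w-]*$~ as intended.
$fixed = '~[/\\\\][\w-]*$~';

foreach ($paths as $str) {
    echo preg_replace($broken, '', $str), ' | ', preg_replace($fixed, '', $str), "\n";
}
// F:\dev\htdocs\cms\ | F:\dev\htdocs\cms
// http: | http://localhost/cms
```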
/[\/\\\][a-zA-Z0-9_-]*$/i
Live demo
