Comments in preg regexes using # as delimiter? - php

With Perl-like regular expression syntax, you can write inline comments by using the /x modifier and the # character. But what if I'm using PHP with # as the delimiter for stylistic reasons? Is there any way to write a comment then?
preg_replace("/foo # This is a comment\n/x", "bar","foobar")
works but
preg_replace("#foo # This is a comment\n#x", "bar","foobar")
doesn't work, and neither does //, /**/ or any other common comment sequence I tried.

In a PHP regex pattern, a delimiter has more "weight" than a pattern part. If you define a delimiter as # you cannot use it as a part of another special construct. So, "#foo # This is a comment\n#x" and "#foo (?# This is a comment\n)#x" won't work as the # signals the end of the pattern space inside the regex.
When you escape a #, it becomes a literal # symbol. The pattern "#foo \\# This is a comment\n#x" matches "foo#Thisisacomment", because the escaped # is matched literally and the x modifier ignores the whitespace.
So, the best advice is available on the "Delimiters" page at php.net:
If the delimiter needs to be matched inside the pattern it must be escaped using a backslash. If the delimiter appears often inside the pattern, it is a good idea to choose another delimiter in order to increase readability.
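A minimal sketch of the behavior described above, reusing the patterns from the question. The ~ delimiter in the last call is an arbitrary choice for illustration, and the variable names are made up.

```php
<?php
// "/" delimiter: under the x modifier, "#" starts a comment that runs to the newline.
$withSlash = preg_replace("/foo # This is a comment\n/x", "bar", "foobar");

// "#" delimiter: the second "#" closes the pattern, the rest is read as
// modifiers, and the call fails (returns null; "@" hides the warning).
$withHash = @preg_replace("#foo # This is a comment\n#x", "bar", "foobar");

// Escaped "#": matched as a literal character, not as a comment.
$escaped = preg_match("#foo \\# This is a comment\n#x", "foo#Thisisacomment");

// A different delimiter keeps "#" free for comments.
$withTilde = preg_replace("~foo # This is a comment\n~x", "bar", "foobar");
```

So if comments matter more than the look of the delimiter, switching the delimiter is probably the simplest way out.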

Related

Literal delimiter ( delimiter inside \Q \E block )

I've been trying to write a few functions based on regex, and most of them use \Q and \E because part of the regex pattern is user input.
So, let's say hypothetically that we're using the delimiter / and want to match a / with it. The function would construct something along the lines of /\Q/\E/.
I'm not sure why /\Q/\E/ doesn't match / but with every other delimiter it does, unless you use the same delimiter as input.
Maybe, it considers the delimiter the end, even though, it's in a literal-only block and the escape as literal. Not sure, tried a bunch.
Hopefully someone can push me into the right direction as to what workarounds there are for this issue.
It helps to understand that / is not a regex metacharacter, like * or (. It's special because you're using it to delimit the regex itself, and the only way to escape the regex delimiter is with a backslash (\/).
But you shouldn't need to use \Q and \E. The preg_quote() function takes a delimiter argument, so it correctly adds backslashes everywhere they're needed.
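As a rough sketch of that suggestion, assuming / as the delimiter and a made-up $userInput value:

```php
<?php
// Hypothetical user input containing the delimiter and a metacharacter.
$userInput = 'a/b.c';

// The second argument tells preg_quote() which delimiter to escape as well.
$pattern = '/' . preg_quote($userInput, '/') . '/';

$found    = preg_match($pattern, 'path a/b.c here');  // literal text is present
$notFound = preg_match($pattern, 'path a/bxc here');  // "." no longer matches any character
```

The delimiter passed to preg_quote() has to be the same one used to wrap the pattern, otherwise it is left unescaped.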

What are those characters in a regular expression?

I found this regex, which works correctly, but I don't understand what the # at the start and at the end of the expression means. Aren't ^ and $ the start/end characters?
preg_match_all('#^/([^/]+)/([^/]+)/$#', $s, $matches);
Thanks
The pattern contains many / characters, so # is used as the regex delimiter. These are identical:
/^something$/
and
#^something$#
If you have multiple / in your pattern, the second form is better suited because it avoids ugly escaping with \/. This is how the regex would look using the standard // syntax:
/^\/([^\/]+)\/([^\/]+)\/$/
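To check that the two spellings behave the same, here is a small sketch; the input string and variable names are assumptions for illustration only.

```php
<?php
$s = '/foo/bar/';  // hypothetical input

// Same pattern, once with "#" delimiters and once with "/" delimiters.
preg_match_all('#^/([^/]+)/([^/]+)/$#', $s, $withHash);
preg_match_all('/^\/([^\/]+)\/([^\/]+)\/$/', $s, $withSlash);

// Both calls fill their match arrays with the same content.
$same = ($withHash === $withSlash);
```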
About #:
That's the delimiter of the regular expression itself. Its only purpose is to mark where the expression starts and ends. Commonly / is used, but others are possible. PCRE expressions need a delimiter with preg_match or preg_match_all.
About ^:
Inside character classes ([...]), the ^ has the meaning of not if it's the first character.
[abc] : matching a, b or c
[^abc] : NOT matching a, b or c, match every other character instead
Also, the # at the start and the end here are custom regex delimiters. Instead of the usual /.../ you have #...#, just like in Perl.
These are delimiters. You can use any delimiter you want, but they must appear at the start and end of the regular expression.
Please see this documentation for a detailed look at regular expressions:
http://www.php.net/manual/en/pcre.pattern.php
You can use pretty much anything as delimiters. The most common one is /.../, but if the pattern itself contains / and you don't want to escape any and all occurrences, you can use a different delimiter. My personal preference is (...) because it reminds me that $0 of the result is the entire pattern. But you can do anything, <...>, #...#, %...%, {...}... well, almost anything. I don't know exactly what the requirements are, but I think it's "any non-alphanumeric character".
Let me break it down:
# is the first character, so this is the character used as the delimiter of the regular expression - we know we've got to the end when we reach the next (unescaped) one of these
^ outside of a character class, this means the beginning of the string
/ is just a normal 'slash' character
([^/]+) This is a bracketed expression containing at least one (+) instance of any character that isn't a / (^ at the beginning of a character class inverts the character class - meaning it will only match characters that are not in this list)
/ again
([^/]+) again
/ again
$ this matches the end of the string
# this is the final delimiter, so we know that the regex is now finished.
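The breakdown above can be tried against a few inputs. This is only a sketch; the sample strings and variable names are made up.

```php
<?php
$re = '#^/([^/]+)/([^/]+)/$#';

// Two non-empty segments between slashes: matches, with one capture per segment.
preg_match($re, '/foo/bar/', $m);

$noLeading  = preg_match($re, 'foo/bar/');  // fails: ^ must be followed by /
$emptyPart  = preg_match($re, '/foo//');    // fails: [^/]+ needs at least one character
$threeParts = preg_match($re, '/a/b/c/');   // fails: $ is not reached after the second segment
```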

Php regex with safe delimiters

I thought that PHP's Perl-compatible regular expression library (preg) supports curly brackets as delimiters. This should be fine:
{ello {world}i // should match on Hello {World
The main point of curly brackets is that only the leftmost and rightmost ones are taken as delimiters, so the inner ones need no escaping. As far as I can tell, though, PHP requires the escaping:
{ello \{world}i // this actually matches on Hello {World
Is this the expected behavior or a bug in PHP's preg implementation?
In Perl, when you use any of the four paired ASCII bracket types as the pattern delimiter, you only need to escape unpaired brackets within the pattern. This is indeed the entire purpose of using brackets. It is documented in the perlop manpage under “Quote and Quote-like Operators”, which reads in part:
Non-bracketing delimiters use the same character fore and aft,
but the four sorts of brackets (round, angle, square, curly)
will all nest, which means that
q{foo{bar}baz}
is the same as
'foo{bar}baz'
Note, however, that this does not always work for quoting Perl code:
$s = q{ if($a eq "}") ... }; # WRONG
That’s why you often see people use m{…} or qr{…} in Perl code, especially for multiline patterns used with /x, aka (?x). For example:
return qr{
    (?=                     # pure lookahead for conjunctive matching
        \A                  # always from start
        . *?                # going only as far as we need to to find the pattern
        (?:
            ${case_flag}
            ${left_boundary}
            ${positive_pattern}
            ${right_boundary}
        )
    )
}sxm;
Notice how those nested braces are no problem.
Expected behavior as far as I know. Otherwise, how would the compiler allow braces used as quantifiers? For example:
[a-z]{1,5}
From http://lv.php.net/manual/en/regexp.reference.delimiters.php:
If the delimiter needs to be matched inside the pattern it must be escaped using a backslash. If the delimiter appears often inside the pattern, it is a good idea to choose another delimiter in order to increase readability.
So this is expected behavior, not a bug.
I found that no escaping is required in this case:
'ello {world'i
(ello {world)i
So my theory is that the problem only arises when the delimiter's own bracket type appears unpaired inside the pattern. The following two produce the same error:
{ello {world}i
(ello (world)i
Using opening/closing braces as delimiters may require escaping those braces inside the expression.
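A short sketch of the cases discussed in this thread, assuming the bracket-delimiter behavior described in the PHP manual (balanced brackets inside the pattern are allowed, literal unpaired ones must be escaped). Variable names are illustrative.

```php
<?php
$subject = 'Hello {World';

// An unpaired "{" inside "{...}" delimiters has to be escaped.
$escaped = preg_match('{ello \{world}i', $subject);

// With "(...)" delimiters, a lone "{" needs no escaping.
$parens = preg_match('(ello {world)i', $subject);

// A balanced pair inside matching delimiters, such as a quantifier, is accepted.
$balanced = preg_match('{^[a-z]{1,5}$}', 'hello');
```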

Regexp even number of backslashes (PHP)

I have a rather hard time getting my head around regular expressions, especially the more complex ones.
Currently I am writing my own markup language and am stumped by escaping. I want each special character to be "escapable", that is if *bold* would give me <b>bold</b>, then \*bold\* should leave it as-is, so I can do the stripping of backslashes later, but I can't think of a regular expression to convey this idea.
How can I select three groups:
Left asterisk if the number of BSes preceding it is even;
Content between asterisks;
Right asterisk if the number of BSes preceding it is even;
with one regular expression? I need it to be compliant with PHP's preg_replace.
This \\*(\*)\S(.)+?\S\\*(\*) would select both asterisks and content as three groups, but that doesn't check for 'evenity' and stuff.
UPDATE:
The second paragraph has been changed to better illustrate what I meant (please don't modify it anymore because the change that was made completely missed the point).
Plus, if that makes things easier, I can first parse any double backslash into some other character, so there is only need to check for ONE backslash before asterisk.
How about:
// In a single-quoted PHP string, \\\\ yields the regex \\ (one literal backslash).
$rx = '/
([^\\\\]*|^) # no backslash or beginning of line
\\\\ # one backslash
\* # an asterisk
([^*\\\\]+) # one or more characters not being asterisks or BSs
\\\\ # one backslash
\* # one asterisk
# "mx" = multiline,extended regex
/mx';
preg_replace($rx, '\1\2', $content);
Well, I guess I found an answer to my own question.
First I will have to replace each \\, and then use expression like this:
(?<!\\) #There is no backslash before...
\* #...Asterisk
( #Non-whitespace after first and before second asterisk
\S .*? \S
|
\S
)
(?<!\\) #There is no backslash before...
\* #...Asterisk
And from here on I can tweak it however I wish. Thanks to everyone for the input anyway :).
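Wrapped in PHP, the expression above might look like the following sketch. The / delimiter, the <b> replacement, the sample text and the variable names are assumptions; the step that first replaces double backslashes is left out.

```php
<?php
// In a single-quoted PHP string, \\\\ yields the regex \\ (one literal backslash).
$rx = '/
    (?<!\\\\)   # no backslash before...
    \*          # ...the opening asterisk
    (           # group 1: content that starts and ends with non-whitespace
        \S .*? \S
      |
        \S
    )
    (?<!\\\\)   # no backslash before...
    \*          # ...the closing asterisk
/x';

// Unescaped pairs become bold; escaped asterisks are left as-is for later stripping.
$html = preg_replace($rx, '<b>$1</b>', 'plain *bold* and \*not bold\*');
```

Note that the comments inside the pattern must not contain the delimiter itself, for the reason discussed at the top of this page.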

What does it mean when a regular expression is surrounded by # symbols?

Question
What does it mean when a regular expression is surrounded by # symbols? Does that mean something different than being surrounded by slashes? What about when #x or #i are on the end? Now that I think about it, what do the surrounding slashes even mean?
Background
I saw this StackOverflow answer, posted by John Kugelman, in which he displays serious Regex skills.
Now, I'm used to seeing regexes surrounded by slashes as in
/^abc/
But he used a regex surrounded by # symbols:
'#
^%
(.{2}) # State, 2 chars
([^^]{0,12}.) # City, 13 chars, delimited by ^
([^^]{0,34}.) # Name, 35 chars, delimited by ^
([^^]{0,28}.) # Address, 29 chars, delimited by ^
\?$
#x'
In fact, it seems to be in the format:
#^abc#x
In the process of trying to google what that means (it's a tough question to google!), I also saw the format:
#^abc#i
It's clear the x and the i are not matched characters.
So what does it all mean???
Thanks in advance for any and all responses,
-gMale
The surrounding slashes are just the regex delimiters. You can use any character (afaik) for that. The most commonly used is /; another one I've seen fairly often is #.
So in other words, #whatever#i is essentially the same as /whatever/i (i is modifier for a case-insensitive match)
The reason you might want to use something other than / is that your regex contains that character. You avoid having to escape it, similar to using '' for strings instead of "".
Found this from a "Related" link.
The delimiter can be any character that is not alphanumeric, whitespace or a backslash character.
/ is the most commonly used delimiter, since it is closely associated with regex literals, for instance in JavaScript where they are the only valid delimiter. However, any symbol can be used.
I have seen people use ~, #, even ! to delimit their regexes in a way that avoids using symbols that are also in the regex. Personally I find this ridiculous.
A lesser-known fact is that you can use a matching pair of brackets to delimit a regex in PHP. This has the tremendous advantage that the closing delimiter is easy to tell apart from symbols in the pattern, and balanced brackets inside the pattern don't need any escaping. My personal preference is this:
(^abc)i
By using parentheses, I remind myself that in a match, $m[0] is always the full match, and the subpatterns start at $m[1].
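A small sketch tying this thread together: the same pattern under different delimiters, the i and x modifiers, and parentheses as delimiters. The subject strings and variable names are made up.

```php
<?php
// "#...#i" and "/.../i" are the same pattern; "i" makes it case-insensitive.
$hash  = preg_match('#^abc#i', 'ABCdef');
$slash = preg_match('/^abc/i', 'ABCdef');

// "x" ignores whitespace in the pattern, so it can be spread out.
$extended = preg_match('/ ^ a b c /x', 'abcdef');

// Parentheses used as delimiters do not create a capture group themselves:
// $m[0] is the full match and $m[1] is the first real subpattern.
preg_match('(^(a)bc)i', 'ABCdef', $m);
```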
