PHP regex with safe delimiters - php

I thought that PHP's Perl-compatible regular expression library (preg) supports curly brackets as delimiters. This should be fine:
{ello {world}i // should match on Hello {World
The main point of curly brackets is that only the leftmost and rightmost ones are taken as delimiters, so the inner ones need no escaping. As far as I can tell, PHP requires the escaping:
{ello \{world}i // this actually matches on Hello {World
Is this the expected behavior or a bug in PHP's preg implementation?

When in Perl you use for the pattern delimiter any of the four paired ASCII bracket types, you only need to escape unpaired brackets within the pattern. This is indeed the entire purpose of using brackets. This is documented in the perlop manpage under “Quote and Quote-like Operators”, which reads in part:
Non-bracketing delimiters use the same character fore and aft,
but the four sorts of brackets (round, angle, square, curly)
will all nest, which means that
q{foo{bar}baz}
is the same as
'foo{bar}baz'
Note, however, that this does not always work for quoting Perl code:
$s = q{ if($a eq "}") ... }; # WRONG
That’s why you often see people use m{…} or qr{…} in Perl code, especially for multiline patterns used with /x ᴀᴋᴀ (?x). For example:
return qr{
    (?=                     # pure lookahead for conjunctive matching
        \A                  # always from start
        . *?                # going only as far as we need to to find the pattern
        (?:
            ${case_flag}
            ${left_boundary}
            ${positive_pattern}
            ${right_boundary}
        )
    )
}sxm;
Notice how those nested braces are no problem.

Expected behavior as far as I know; otherwise how else would the compiler allow quantifier braces inside the pattern? E.g.
[a-z]{1,5}

From http://lv.php.net/manual/en/regexp.reference.delimiters.php:
If the delimiter needs to be matched inside the pattern it must be escaped using a backslash. If the delimiter appears often inside the pattern, it is a good idea to choose another delimiter in order to increase readability.
So this is expected behavior, not a bug.

I found that no escaping is required in this case:
'ello {world'i
(ello {world)i
So my theory is that the problem only shows up when the opening delimiter character appears unpaired inside the pattern. The following two produce the same error:
{ello {world}i
(ello (world)i
Using opening/closing brackets as delimiters may require escaping those same brackets inside the expression.
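To make the behavior discussed above concrete, here is a minimal sketch, assuming a stock PHP build with the PCRE extension. It suggests that PHP does count nested bracket delimiters much as Perl does: properly paired inner braces are fine, while an unpaired inner brace leaves the delimiter scan without a closing brace. The subject strings and variable names are only illustrative.

```php
<?php
// Sample subject taken from the question.
$subject = 'Hello {World';

// Unpaired inner "{": the nesting count never returns to zero, so PHP finds
// no closing delimiter. The @ hides the warning; the call returns false.
$unbalanced = @preg_match('{ello {world}i', $subject);

// Escaping the inner brace removes it from the nesting count.
$escaped = preg_match('{ello \{world}i', $subject);

// Properly paired inner braces (here a quantifier) need no escaping.
$balanced = preg_match('{^a{1,2}$}', 'aa');

var_dump($unbalanced, $escaped, $balanced);
// bool(false)
// int(1)
// int(1)
```

So the original pattern fails because its inner brace has no partner, not because curly delimiters are unsupported.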

Related

Comments in preg regexes using # as delimiter?

With Perl-like regular expression syntax, you can write inline comments using the /x modifier and the # character. But what if I'm using PHP with # as the delimiter for stylistic reasons? Is there any way to write a comment then?
preg_replace("/foo # This is a comment\n/x", "bar","foobar")
works but
preg_replace("#foo # This is a comment\n#x", "bar","foobar")
doesn't work, and neither does //, /**/ or any other common comment sequence I tried.
In a PHP regex pattern, a delimiter has more "weight" than a pattern part. If you define # as the delimiter, you cannot use it as part of another special construct. So "#foo # This is a comment\n#x" and "#foo (?# This is a comment\n)#x" won't work, because the first unescaped # ends the pattern.
When you escape a #, it becomes a literal # symbol. The pattern "#foo \\# This is a comment\n#x" will match "foo#Thisisacomment": once escaped, the # is matched as a literal character and no longer starts a comment.
So, the best advice is available on the "Delimiters" page at php.net:
If the delimiter needs to be matched inside the pattern it must be escaped using a backslash. If the delimiter appears often inside the pattern, it is a good idea to choose another delimiter in order to increase readability.
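As a small illustration of that advice, the sketch below switches the delimiter to ~ (an arbitrary choice on my part) so that # is free to start an x-mode comment, and also shows the escaped # being matched literally. The variable names are only illustrative.

```php
<?php
// With "~" as the delimiter, "#" can start an x-mode comment again.
$withTilde = preg_replace("~foo # This is a comment\n~x", 'bar', 'foobar');

// With "#" as the delimiter, an escaped "#" is a literal character, so the
// words after it are part of the pattern (x mode ignores the whitespace).
$literalHash = preg_match("#foo \\# This is a comment\n#x", 'foo#Thisisacomment');

var_dump($withTilde, $literalHash);
// string(6) "barbar"
// int(1)
```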

Regular Expressions and preg_match()

I am insanely green with regular expressions. Here is what I am trying to accomplish: I want to require an 8 character password that allows upper and lowercase letters, numbers and !@#$%^&*-_ characters. Here is what I have that doesn't appear to be working:
preg_match('([A-za-z0-9-_!@#$%^&*]{8,})', $password)
Am I missing something really obvious?
Update: Yes, I was missing something really obvious - the open bracket [. However it still returns true when I use characters like a single quote or bracket. (Which are what I am trying to avoid.)
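Basically, you are missing the opening [ of the character class here:
             ↓
preg_match('([A-Za-z0-9-_!@#$%^&*()]{8,})', $password)
And you should also use delimiters. The parens will behave as such, but it's better to use a different pair to avoid ambiguity with a capture group:
preg_match('/^([A-Za-z0-9-_!@#$%^&*()]{8,})$/', $password)
This also adds start ^ and end $ assertions to match the whole string; without them, the pattern succeeds as soon as any run of 8 allowed characters appears somewhere in the password. The range is also written A-Z rather than A-z, because A-z additionally covers the ASCII characters between Z and a, which include [, ], ^ and the backtick.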
This may be easier to break into individual rules. I.e. a rule to check if the password is at least 8 characters long, another to check for an uppercase, another for lowercase, etc.
Also, it looks like you have some regex special characters without any escaping. For example, the * character has a special meaning in regular expressions outside of a few special cases. It needs to be either escaped (\*) or placed in brackets ([*]).
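Putting the suggestions from this thread together, a minimal sketch could look like the following. It assumes the intended character set is letters, digits and !@#$%^&*-_ and that "8 character" means at least 8; isValidPassword is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: true when $password has at least 8 characters,
// all drawn from letters, digits and !@#$%^&*_- (assumed set).
function isValidPassword(string $password): bool
{
    // \A and \z anchor the whole string, so no other character can slip in.
    // The hyphen is placed last in the class so it is taken literally.
    return preg_match('/\A[A-Za-z0-9!@#$%^&*_-]{8,}\z/', $password) === 1;
}

var_dump(isValidPassword('Passw0rd!'));   // bool(true)
var_dump(isValidPassword('short1!'));     // bool(false): only 7 characters
var_dump(isValidPassword("pass'word1"));  // bool(false): single quote
var_dump(isValidPassword('password[1]')); // bool(false): square brackets
```

Rules such as "at least one uppercase letter" could be added as separate checks, as suggested above.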

Using OR (|) with PHP Regex when ORing two expressions

I'm trying to combine two regular expressions with an OR condition in PHP so that two different string patterns can be found with one pass.
I have this pattern [\$?{[_A-Za-z0-9-]+[:[A-Za-z]*]*}] which matches strings like this ${product} and ${Product:Test}.
I have this pattern [<[A-Za-z]+:[A-Za-z]+\s*(\s[A-Za-z]+=\"[A-Za-z0-9\s]+\"){0,5}\s*/>] which matches strings like this <test:helloWorld /> and <calc:sum val1="10" val2="5" />.
However when I try to join the two patterns into one
[\$?{[_A-Za-z0-9-]+[:[A-Za-z]*]*}]|[<[A-Za-z]+:[A-Za-z]+\s*(\s[A-Za-z]+=\"[A-Za-z0-9\s]+\"){0,5}\s*/>]
so I can find all the matching strings with one call to
preg_match_all(REGEX_COMBINED, $markup, $results, PREG_SET_ORDER);
I get the following error message Unknown modifier '|'.
Can anyone please tell me where I am going wrong? I've tried multiple variations of the pattern, but nothing I do seems to work.
Thanks
In PHP, regexes have to be enclosed in delimiters, like /abc/ or ~abc~. Almost any ASCII punctuation character will do; it just has to be the same character at both ends in most cases. The exception is when you use "bracketing" characters like () and <>; then they have to be correctly paired.
With your original regexes, the square brackets were being used as regex delimiters. After you glued them together it no longer worked, because the compiler was still treating the ] that pairs with the opening [ as the closing delimiter, so everything after it, starting with |, was read as modifiers.
Another problem is that you're trying to use square brackets for grouping, which is wrong; you use parentheses for that. If you look below you'll see that I replaced square brackets with parentheses where needed, but the outermost pair I simply dropped; grouping isn't needed at that level. Then I added ~ to serve as the regex delimiter. I also added the i modifier and got rid of some clutter.
~\$?\{[\w-]+(?::[a-z]*)*\}~i
~<[a-z]+:[a-z]+\s*(?:\s[a-z]+=\"[a-z\d\s]+\"){0,5}\s*/>~i
To combine the regexes, just remove the ending ~i from the first regex and the opening ~ from the second, and replace them with a pipe:
~\$?\{[\w-]+(?::[a-z]*)*\}|<[a-z]+:[a-z]+\s*(?:\s[a-z]+=\"[a-z\d\s]+\"){0,5}\s*/>~i
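For illustration, here is a rough sketch of the combined pattern in use, plus a cut-down reproduction of the original error. The sample markup is mine; its attribute names are letters only, since [a-z]+ in the pattern does not accept digits in an attribute name. The short patterns [a] and [b] are stand-ins for the two original regexes.

```php
<?php
// Combined pattern from above, delimited by "~".
$pattern = '~\$?\{[\w-]+(?::[a-z]*)*\}|<[a-z]+:[a-z]+\s*(?:\s[a-z]+=\"[a-z\d\s]+\"){0,5}\s*/>~i';

// Sample markup (an assumption, not from the question).
$markup = 'Buy ${product} or ${Product:Test} <test:helloWorld /> <calc:sum first="10" second="5" />';

$count = preg_match_all($pattern, $markup, $results, PREG_SET_ORDER);
$found = array_column($results, 0); // full match of every set

print_r($found);

// Gluing two bracket-delimited patterns together ends the regex at the first
// closing "]", so "|" is read as a modifier. The @ hides the
// "Unknown modifier '|'" warning; the call returns false.
$broken = @preg_match_all('[a]|[b]', 'ab', $ignored);
var_dump($broken); // bool(false)
```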
Try wrapping the two conditions in an outer set of brackets "(...|...)":
([\$?{[_A-Za-z0-9-]+[:[A-Za-z]*]*}]|[<[A-Za-z]+:[A-Za-z]+\s*(\s[A-Za-z]+=\"[A-Za-z0-9\s]+\"){0,5}\s*/>])
Tested here and it seemed to work

What does it mean when a regular expression is surrounded by # symbols?

Question
What does it mean when a regular expression is surrounded by # symbols? Does that mean something different than being surrounded by slashes? What about when #x or #i are on the end? Now that I think about it, what do the surrounding slashes even mean?
Background
I saw this StackOverflow answer, posted by John Kugelman, in which he displays serious Regex skills.
Now, I'm used to seeing regexes surrounded by slashes as in
/^abc/
But he used a regex surrounded by # symbols:
'#
^%
(.{2}) # State, 2 chars
([^^]{0,12}.) # City, 13 chars, delimited by ^
([^^]{0,34}.) # Name, 35 chars, delimited by ^
([^^]{0,28}.) # Address, 29 chars, delimited by ^
\?$
#x'
In fact, it seems to be in the format:
#^abc#x
In the process of trying to google what that means (it's a tough question to google!), I also saw the format:
#^abc#i
It's clear the x and the i are not matched characters.
So what does it all mean???
Thanks in advance for any and all responses,
-gMale
The surrounding slashes are just the regex delimiters. You can use any character (afaik) to do that - the most commonly used is the /; another I've seen used somewhat commonly is #.
So in other words, #whatever#i is essentially the same as /whatever/i (i is the modifier for a case-insensitive match).
The reason you might want to use something other than / is that your regex contains that character. You avoid having to escape it, similar to using '' for strings instead of "".
Found this from a "Related" link.
The delimiter can be any character that is not alphanumeric, whitespace or a backslash character.
/ is the most commonly used delimiter, since it is closely associated with regex literals, for instance in JavaScript where it is the only valid delimiter. However, any symbol can be used.
I have seen people use ~, #, @, even ! to delimit their regexes in a way that avoids using symbols that are also in the regex. Personally I find this ridiculous.
A lesser-known fact is that you can use a matching pair of brackets to delimit a regex in PHP. This has the tremendous advantage of making the closing delimiter obviously different from the same symbol showing up in the pattern, so properly paired brackets inside the pattern don't need any escaping. My personal preference is this:
(^abc)i
By using parentheses, I remind myself that in a match, $m[0] is always the full match, and the subpatterns start at $m[1].
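A quick sketch of the points in these answers, with a sample subject of my own: the three delimiter styles behave the same, and parentheses used as delimiters are stripped like any other delimiter, so they do not create a capture group.

```php
<?php
$subject = 'ABCdef'; // sample input

// Same pattern, same i modifier, three different delimiters.
$withSlashes = preg_match('/^abc/i', $subject);
$withHashes  = preg_match('#^abc#i', $subject);
$withParens  = preg_match('(^abc)i', $subject, $m);

var_dump($withSlashes, $withHashes, $withParens);
// int(1)
// int(1)
// int(1)

// Only the full match is present: the delimiting parens capture nothing.
print_r($m); // [0] => ABC
```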

PHP regex delimiters, / vs. | vs. {} , what are the differences?

In the PHP manual of PCRE, http://us.php.net/manual/en/pcre.examples.php, it gives 4 examples of valid patterns:
/<\/\w+>/
|(\d{3})-\d+|Sm
/^(?i)php[34]/
{^\s+(\s+)?$}
It seems that /, | or a pair of curly braces can be used as delimiters, so is there any difference between them?
No difference, except the closing delimiter cannot appear without escaping.
This is useful when the standard delimiter is used a lot, e.g. instead of
preg_match("/^http:\\/\\/.+/", $str);
you can write
preg_match("[^http://.+]", $str);
to avoid needing to escape the /.
In fact, you can use any non-alphanumeric character as the delimiter (excluding whitespace and the backslash):
"%^[a-z]%"
works as well as
"*^[a-z]*"
as well as
"!^[a-z]!"
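The variants above can be compared side by side. This is only a sketch with a made-up sample URL; the ~ delimiter is my own pick for a character that does not occur in the pattern.

```php
<?php
$str = 'http://example.com/page'; // sample input

// Slash delimiter: every "/" in the pattern has to be escaped.
$escaped = preg_match("/^http:\\/\\/.+/", $str);

// A delimiter that does not occur in the pattern needs no escaping.
$tilde = preg_match('~^http://.+~', $str);

// Square brackets as a delimiter pair, as in the answer above.
$brackets = preg_match('[^http://.+]', $str);

// Any non-alphanumeric, non-whitespace, non-backslash character works.
$percent = preg_match('%^[a-z]%', $str);
$star    = preg_match('*^[a-z]*', $str);
$bang    = preg_match('!^[a-z]!', $str);

var_dump($escaped, $tilde, $brackets, $percent, $star, $bang); // int(1) six times
```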
