PHP regex delimiters, / vs. | vs. {} , what are the differences? - php

In the PHP manual of PCRE, http://us.php.net/manual/en/pcre.examples.php, it gives 4 examples of valid patterns:
/<\/\w+>/
|(\d{3})-\d+|Sm
/^(?i)php[34]/
{^\s+(\s+)?$}
It seems that / , | or a pair of curly braces can be used as delimiters, so is there any difference between them?

No difference, except the closing delimiter cannot appear without escaping.
This is useful when the standard delimiter is used a lot, e.g. instead of
preg_match("/^http:\\/\\/.+/", $str);
you can write
preg_match("[^http://.+]", $str);
to avoid needing to escape the /.
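A minimal sketch to check this yourself. The URL in $str is made up for illustration; the three calls differ only in the delimiter chosen.

```php
<?php
// Hypothetical input, chosen only for illustration.
$str = 'http://example.com/index.php';

// Slash delimiters: every literal / inside the pattern must be escaped.
$withSlashes = preg_match('/^http:\/\/.+/', $str);

// Hash delimiters: the slashes are ordinary characters now.
$withHashes = preg_match('#^http://.+#', $str);

// Bracket delimiters, as in the answer above.
$withBrackets = preg_match('[^http://.+]', $str);

var_dump($withSlashes, $withHashes, $withBrackets); // int(1) three times
```

All three calls compile to the same expression, so the choice comes down to readability.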

In fact you can use any non-alphanumeric delimiter (excluding whitespace and backslashes)
"%^[a-z]%"
works as well as
"*^[a-z]*"
as well as
"!^[a-z]!"

Related

Why are backslashes used in preg_match function of PHP?

I've been practicing the preg_match() function in PHP. The tutorial said that forward slashes need to be added before and after the characters.
I also noticed that without the slashes, it works strangely. It gives a warning:
preg_match(): Delimiter must not be alphanumeric or backslash.
Q: What difference do the forward slashes make?
Here's the code:
$string = 'Okay, I\'m fine with it! ';
$math = 'Okay'; // I need to add forward slashes for it to work
echo preg_match($math, $string); // It supposedly echoes out 1 or 0
// depending if the former argument
// is in the latter argument
There is no particular reason; it's a syntactic choice. This syntax has the advantage of making it easy to add global modifiers to the pattern:
delimiter - pattern - delimiter - [global modifiers]
As explained in the error message and in the PHP manual, you can choose the delimiter from among the special (non-alphanumeric) characters. The most commonly used is the slash, but it is not always a good choice, in particular when the pattern contains many literal slashes that would need to be escaped.
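Applied to the code from the question, a small sketch of the difference (the @ is there only to silence the warning for the demonstration):

```php
<?php
$string = 'Okay, I\'m fine with it! ';

// No delimiters: the first character "O" is taken as the delimiter. It is
// alphanumeric, so preg_match() raises a warning and returns false.
$withoutDelimiters = @preg_match('Okay', $string);

// With delimiters the pattern compiles and matches.
$withDelimiters = preg_match('/Okay/', $string);

var_dump($withoutDelimiters); // bool(false)
var_dump($withDelimiters);    // int(1)
```

Note that false (an error) is not the same as 0 (no match), so it is worth comparing the result with === when debugging.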
It's because you can also apply switches to the regular expression (eg. m for multiline, u for Unicode) and these need to be defined outside of the delimiter, so the syntax is
opening delimiter expression closing delimiter [optional switches]
e.g.
/^[a-z]*$/mi
for the multiline (m) and case insensitive (i) switches, using a delimiter of /
The delimiter must not be a character that can be misinterpreted by the regexp parser; it must be very clear that it is a delimiter. So it cannot be alphanumeric (e.g. i), nor a \, which is used to "escape" characters in the regexp.
Note that you can also use brackets as delimiters, so
[^[a-z]*$]mi
is valid
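A short sketch of what the switches change. The two-line $text is invented for the example; the patterns are the ones shown above.

```php
<?php
// Two lines; only the second consists solely of letters (in upper case).
$text = "123\nABC";

// No modifiers: ^ and $ anchor to the whole string, matching is case-sensitive.
$plain = preg_match('/^[a-z]*$/', $text);

// m: ^ and $ also match at line breaks; i: ignore case.
$withModifiers = preg_match('/^[a-z]*$/mi', $text);

// Bracket delimiters behave identically; the modifiers follow the closing ].
$withBrackets = preg_match('[^[a-z]*$]mi', $text);

var_dump($plain);         // int(0)
var_dump($withModifiers); // int(1)
var_dump($withBrackets);  // int(1)
```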

regexp solution for PHP

I have tried to work this out myself (even bought a Kindle book!), but I am struggling with backreferences in php.
What I want is like the following example:
$html = "hello %world|/worldlink/% again";
output:
hello world again
I tried stuff like:
preg_replace('/%([a-z]+)|([a-z]+)%/', '\1', $html);
but with no joy.
Any ideas please? I am sure someone will post the exact answer but I would like an explanation as well please - so that I don't have to keep asking these questions :)
The slashes "/" are not included in your allowed range [a-z]. Instead use
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
Your expression:
'/%([a-z]+)|([a-z]+)%/'
Is only capturing one thing. The | in the middle means "OR". You're trying to capture both, so you don't need an OR in there. You want a literal | symbol so you need to escape it:
'/%([a-z]+)\|([a-z\/]+)%/'
The / character also needs to be included in your char set, and escaped as above.
Your regex (/%([a-z]+)|([a-z]+)%/) reads this way:
Match % followed by + (= one or
more) a-z characters (and store this
into backreference #1).
Or (the |):
Match + (= one or more) a-z
characters (and store this into
backreference #2) followed by a
%.
What you are looking for is:
preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);
Basically I just escaped the | regex meta character (you can do this either by surrounding it with [] like I did or by prepending a backslash \; personally I find the former easier to read), and added a / to the second capture group.
I also changed your delimiters from / to ~ because tildes are much more unlikely to appear in strings, if you want to keep using / as your delimiter you also have to escape their occurrences in your regex.
It's also recommended that you use the $ syntax instead of \ in your replacement backreferences:
$replacement may contain references
of the form \\n or (since PHP 4.0.4)
$n, with the latter form being the
preferred one.
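Put together as a runnable sketch, using the input from the question:

```php
<?php
$html = "hello %world|/worldlink/% again";

// ~ as delimiter, so the / in the character class needs no escaping.
// [|] matches a literal pipe; $1 refers to the first capture group.
$result = preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);

echo $result, "\n"; // hello world again
```

The pattern is kept in single quotes so that PHP itself does not try to interpret $1 or any backslashes before the regex engine sees them.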
Here is a version that works according to the OP's data/information provided (using a non-slash delimiter to avoid escaping slashes):
preg_replace('#%([a-z]+)\|([a-z/]+)%#', '\1', $html);
Using a non-slash delimiter removes the need to escape slashes.
Outputs:
hello world again
The Explanation
Why yours did not work: first, the | is an OR operator and, in your example, should be escaped. Second, since your pattern contains slashes, it is better to use a non-slash delimiter such as #. Third, the slash needed to be added to the list of allowed characters. As stated before, you may want to allow more characters, since any word containing digits, underscores, periods, or hyphens will fail to match. Hopefully that is the explanation you were looking for.
Here's what works for me:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
Your regular expression doesn't escape the |, and doesn't include the proper characters for the URL.
Here's a basic live example supporting only a-z and slashes:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
In reality, you're going to want to change those [a-z]+ blocks to something more expressive. Do some searches for URL-matching regular expressions, and pick one that fits what you want.
$html = "hello %world|/worldlink/% again";
echo preg_replace('/([A-Za-z_ ]*)%(.+)\|(.+)%([A-Za-z_ ]*)/', '$1$2$4', $html);
output:
hello world again
here is a working code : http://www.ideone.com/0qhZ8

Php regex with safe delimiters

I thought that PHP's Perl-compatible regular expressions (preg library) support curly brackets as delimiters. This should be fine:
{ello {world}i // should match on Hello {World
The main point of curly brackets is that only the leftmost and rightmost ones are taken as delimiters, thus requiring no escaping for the inner ones. As far as I know, PHP requires the escaping:
{ello \{world}i // this actually matches on Hello {World
Is this the expected behavior or a bug in PHP's preg implementation?
When in Perl you use for the pattern delimiter any of the four paired ASCII bracket types, you only need to escape unpaired brackets within the pattern. This is indeed the entire purpose of using brackets. This is documented in the perlop manpage under “Quote and Quote-like Operators”, which reads in part:
Non-bracketing delimiters use the same character fore and aft,
but the four sorts of brackets (round, angle, square, curly)
will all nest, which means that
q{foo{bar}baz}
is the same as
'foo{bar}baz'
Note, however, that this does not always work for quoting Perl code:
$s = q{ if($a eq "}") ... }; # WRONG
That’s why you often see people use m{…} or qr{…} in Perl code, especially for multiline patterns used with /x ᴀᴋᴀ (?x). For example:
return qr{
    (?=                         # pure lookahead for conjunctive matching
        \A                      # always from start
        . *?                    # going only as far as we need to to find the pattern
        (?:
            ${case_flag}
            ${left_boundary}
            ${positive_pattern}
            ${right_boundary}
        )
    )
}sxm;
Notice how those nested braces are no problem.
Expected behavior as far as I know. Otherwise, how would the compiler allow quantifiers? e.g.
[a-z]{1,5}
From http://lv.php.net/manual/en/regexp.reference.delimiters.php:
If the delimiter needs to be matched
inside the pattern it must be escaped
using a backslash. If the delimiter
appears often inside the pattern, it
is a good idea to choose another
delimiter in order to increase
readability.
So this is expected behavior, not a bug.
I found that no escaping is required in this case:
'ello {world'i
(ello {world)i
So my theory is, that the problem is with the '{' delimiters only. Also, the following two produce the same error:
{ello {world}i
(ello (world)i
Using opening/closing brackets as delimiters may require escaping those brackets inside the expression.
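The behavior described in this thread can be reproduced with a short sketch. As far as I can tell, PHP counts nested opening and closing brackets of the delimiter type, so a balanced inner pair is fine while an unbalanced one is not. The 'Hello' subject in the last call is an assumption added for illustration.

```php
<?php
$subject = 'Hello {World';

// Unbalanced inner "{": PHP looks for a closing "}" for it as well, runs out
// of input, and reports that no ending delimiter was found (warning silenced).
$unbalanced = @preg_match('{ello {world}i', $subject);

// Escaping the inner brace takes it out of the delimiter matching.
$escaped = preg_match('{ello \{world}i', $subject);

// A balanced pair inside the pattern, such as a quantifier, needs no escaping.
$balanced = preg_match('{^[a-z]{1,5}$}i', 'Hello');

var_dump($unbalanced); // bool(false)
var_dump($escaped);    // int(1)
var_dump($balanced);   // int(1)
```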

why does this regex fail in PHP?

I got the expression directly from RegExr, but PHP has a problem with the =
"/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/"
The expression is for matching email addresses.
You used / as the delimiter marking the start and end of the pattern, but then also used that character within the pattern. You must either use a different delimiter, or escape instances of it within the pattern. If you meant to escape the equals signs, then you used the wrong slash.
Escape the slash preceding the = (and the other slash in that expression). You use / as a delimiter, therefore if it occurs inside the pattern it has to be escaped.
"/[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/"
should work, then.
You are using / as the delimiter. There are two / in the regex which are not escaped. Escape them as \/:
"/[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/"
                 ^^                                ^^
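To see the failure mode in isolation, here is a sketch with a much shorter stand-in pattern (not the full e-mail expression); only the +/= part of the character class matters. The subject string is made up.

```php
<?php
// Shortened stand-in for the e-mail pattern, for illustration only.
$subject = 'a+b/c=d';

// Unescaped: the pattern ends at the second "/", and "=" is then read as a
// modifier, which preg_match() rejects (warning silenced with @).
$unescaped = @preg_match('/^[a-z+/=]+$/', $subject);

// Escaped slash: the whole character class is part of the pattern.
$escaped = preg_match('/^[a-z+\/=]+$/', $subject);

// Alternative: a delimiter that does not occur in the pattern.
$otherDelimiter = preg_match('~^[a-z+/=]+$~', $subject);

var_dump($unescaped);      // bool(false)
var_dump($escaped);        // int(1)
var_dump($otherDelimiter); // int(1)
```

The ~ works here only because the stand-in does not contain one; the full expression above does, so there you would have to escape the slashes or pick a character the pattern really does not use.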

What does it mean when a regular expression is surrounded by # symbols?

Question
What does it mean when a regular expression is surrounded by # symbols? Does that mean something different than being surrounded by slashes? What about when #x or #i are on the end? Now that I think about it, what do the surrounding slashes even mean?
Background
I saw this StackOverflow answer, posted by John Kugelman, in which he displays serious Regex skills.
Now, I'm used to seeing regexes surrounded by slashes as in
/^abc/
But he used a regex surrounded by # symbols:
'#
^%
(.{2}) # State, 2 chars
([^^]{0,12}.) # City, 13 chars, delimited by ^
([^^]{0,34}.) # Name, 35 chars, delimited by ^
([^^]{0,28}.) # Address, 29 chars, delimited by ^
\?$
#x'
In fact, it seems to be in the format:
#^abc#x
In the process of trying to google what that means (it's a tough question to google!), I also saw the format:
#^abc#i
It's clear the x and the i are not matched characters.
So what does it all mean???
Thanks in advance for any and all responses,
-gMale
The surrounding slashes are just the regex delimiters. You can use any character (afaik) to do that - the most commonly used is the /; another I've seen fairly often is #
So in other words, #whatever#i is essentially the same as /whatever/i (i is modifier for a case-insensitive match)
The reason you might want to use something else than the / is if your regex contains the character. You avoid having to escape it, similar to using '' for strings instead of "".
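A small sketch of that equivalence, with a made-up subject string. The third call uses the x modifier, which makes the engine ignore whitespace inside the pattern.

```php
<?php
$subject = 'ABC-123';

$slashes = preg_match('/^abc/i', $subject); // i: case-insensitive
$hashes  = preg_match('#^abc#i', $subject); // same pattern, same modifier

// x: whitespace inside the pattern is ignored, so it can be spread out.
$extended = preg_match('~ ^ abc - \d+ $ ~xi', $subject);

var_dump($slashes, $hashes, $extended); // int(1) three times
```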
Found this from a "Related" link.
The delimiter can be any character that is not alphanumeric, whitespace or a backslash character.
/ is the most commonly used delimiter, since it is closely associated with regex literals, for instance in JavaScript, where it is the only valid delimiter. However, any symbol can be used.
I have seen people use ~, #, @, even ! to delimit their regexes in a way that avoids using symbols that are also in the regex. Personally I find this ridiculous.
A lesser-known fact is that you can use a matching pair of brackets to delimit a regex in PHP. This has the tremendous advantage of making the closing delimiter obviously different from the same symbol showing up in the pattern, so you don't need any escaping. My personal preference is this:
(^abc)i
By using parentheses, I remind myself that in a match, $m[0] is always the full match, and the subpatterns start at $m[1].
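A quick sketch of how that looks in practice. The inner (\d+) group and the subject string are additions for illustration, not part of the answer above.

```php
<?php
// The outer parentheses are delimiters only; they do not create a group.
preg_match('(^abc(\d+))i', 'ABC123', $m);

var_dump($m[0]); // string(6) "ABC123"  (full match)
var_dump($m[1]); // string(3) "123"     (first subpattern)
```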
