Regular expression with exclamation marks on both sides ('!\d!') - php

I've seen the regular expression '!\d!' inside the PHP preg_match function. What the heck is this?

From the PHP PCRE docs:
When using the PCRE functions, it is required that the pattern is enclosed by delimiters. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character.
In this case, it's simply using ! as the delimiter. Often it's used if you want to use the normal delimiter within the regex itself without having to escape it. Not really necessary in this case since the rest of the regex is simply \d, but it comes in handy for things like checking that a path contains more than three directory levels. You can use either of:
/\/.*\/.*\/.*\/ blah blah blah /
or:
!/.*/.*/.*/ blah blah blah !
Now they haven't been tested thoroughly, and may not work entirely as advertised, but you should get the general idea re the minimal escaping required.
Another example (from the page linked to above) is checking if a string starts with the http:// marker. Either of these two:
/^http:\/\//
!^http://!
would suffice, but the second is easier to understand.
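To make the equivalence concrete, here is a small sketch (the sample URL and the variable names are made up for illustration) showing that the two delimiter choices describe the same pattern:

```php
<?php
// Hypothetical sample input, not from the original question.
$url = 'http://example.com/index.php';

// Same pattern, two delimiters: with "/" the slashes must be escaped,
// with "!" they can be written as-is.
$withSlashes = preg_match('/^http:\/\//', $url);
$withBangs   = preg_match('!^http://!', $url);

var_dump($withSlashes === $withBangs); // both calls return 1 for this input
```

Only the amount of escaping differs; the regex engine receives the same pattern in both cases.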

! is used as the delimiter; \d matches a single digit.
It is the same as /[0-9]/
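A quick way to convince yourself, assuming plain ASCII input and no u modifier (the sample strings are invented):

```php
<?php
// Hypothetical inputs for illustration.
$subjects = ['abc', 'a1c', '42'];

foreach ($subjects as $s) {
    $bang  = preg_match('!\d!', $s);    // "!" as the delimiter
    $slash = preg_match('/[0-9]/', $s); // "/" as the delimiter, explicit class
    printf("%-3s  bang: %d  slash: %d\n", $s, $bang, $slash);
}
```

Both columns should agree for every input, since the two patterns differ only in delimiter and in how the digit class is spelled.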

Related

what does # mean when used in preg_match?

This is from a class. There is a # sign in the preg_match pattern; what does it mean, and what is its purpose? Does it mean a space?
if (preg_match("#Property Information </td>#",simplexml_import_dom($cols->item(0))->asXML(),$ok))
{
$table_name = 'Property Information';
}
In that case, it is being used as a pattern delimiter. As that manual page says,
When using the PCRE functions, it is required that the pattern is enclosed by delimiters. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character.
Often used delimiters are forward slashes (/), hash signs (#) and tildes (~).
It is just a delimiter. It can be any other character (or matching pair of brackets). The following are all equivalent:
"#Property Information </td>#"
"+Property Information </td>+"
"|Property Information </td>|"
"#Property Information </td>#"
"[Property Information </td>]"
...
The purpose of the delimiter is to separate the regex pattern from its modifiers. For example, if you need a case-insensitive match, you put an i after the closing delimiter:
"#Property Information </td>#i"
"+Property Information </td>+i"
"|Property Information </td>|i"
"#Property Information </td>#i"
"[Property Information </td>]i"
...
See http://www.php.net/manual/en/regexp.reference.delimiters.php for detail.
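As a rough sketch of the modifier behaviour (the lower-case sample string is an assumption made for the demo):

```php
<?php
// Hypothetical sample: note the lower-case text.
$html = '<td>property information </td>';

// Without a modifier the match is case-sensitive.
$caseSensitive = preg_match('#Property Information </td>#', $html);

// The "i" after the closing delimiter makes it case-insensitive.
$caseInsensitive = preg_match('#Property Information </td>#i', $html);

// A different delimiter changes nothing about the match itself.
$otherDelimiter = preg_match('~Property Information </td>~i', $html);

var_dump($caseSensitive, $caseInsensitive, $otherDelimiter);
```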
Almost any character, when appearing at the first position, can be used as a PCRE delimiter. In this case it's the # (another common one would be /, but when dealing with closing tags that one is not a good choice, as you'd have to escape every / in the text).
See http://www.php.net/manual/en/regexp.reference.delimiters.php for details.
However, you shouldn't use a Regex for this check at all - you are just testing if a plain string is in another string. Here's a proper solution:
$xml = simplexml_import_dom($cols->item(0))->asXML();
if(strpos($xml, 'Property Information </td>') !== false) { ... }
Actually, using string operators when dealing with html/xml is not really nice but if you are just doing simple "contains" checks it's usually the easiest way.
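For a self-contained version you can run, here is a sketch in which a hard-coded string stands in for the asXML() output (that stand-in is an assumption; the real markup will differ):

```php
<?php
// Hypothetical stand-in for simplexml_import_dom($cols->item(0))->asXML().
$xml = '<td class="h">Property Information </td>';

$tableName = null;

// strpos() returns the offset of the needle, or false if it is absent,
// so compare against false with the strict operator.
if (strpos($xml, 'Property Information </td>') !== false) {
    $tableName = 'Property Information';
}

var_dump($tableName);
```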
Every PCRE pattern in PHP must start and end with a delimiter: the same character at both ends, or a matching pair of brackets. The author of this regular expression chose the # sign.

Php regex with safe delimiters

I thought that PHP's Perl-compatible regular expressions (the preg library) support curly brackets as delimiters. This should be fine:
{ello {world}i // should match on Hello {World
The main point of curly brackets is that only the leftmost and rightmost ones are taken as delimiters, so the inner ones require no escaping. As far as I know, PHP requires the escaping:
{ello \{world}i // this actually matches on Hello {World
Is this the expected behavior or a bug in PHP's preg implementation?
When in Perl you use for the pattern delimiter any of the four paired ASCII bracket types, you only need to escape unpaired brackets within the pattern. This is indeed the entire purpose of using brackets. This is documented in the perlop manpage under “Quote and Quote-like Operators”, which reads in part:
Non-bracketing delimiters use the same character fore and aft,
but the four sorts of brackets (round, angle, square, curly)
will all nest, which means that
q{foo{bar}baz}
is the same as
'foo{bar}baz'
Note, however, that this does not always work for quoting Perl code:
$s = q{ if($a eq "}") ... }; # WRONG
That’s why you often see people use m{…} or qr{…} in Perl code, especially for multiline patterns used with /x ᴀᴋᴀ (?x). For example:
return qr{
(?= # pure lookahead for conjunctive matching
\A # always from start
. *? # going only as far as we need to to find the pattern
(?:
${case_flag}
${left_boundary}
${positive_pattern}
${right_boundary}
)
)
}sxm;
Notice how those nested braces are no problem.
Expected behavior as far as I know; otherwise, how would the compiler tell a delimiter apart from the braces of a quantifier? E.g.
[a-z]{1,5}
From http://lv.php.net/manual/en/regexp.reference.delimiters.php:
If the delimiter needs to be matched
inside the pattern it must be escaped
using a backslash. If the delimiter
appears often inside the pattern, it
is a good idea to choose another
delimiter in order to increase
readability.
So this is expected behavior, not a bug.
I found that no escaping is required in this case:
'ello {world'i
(ello {world)i
So my theory is, that the problem is with the '{' delimiters only. Also, the following two produce the same error:
{ello {world}i
(ello (world)i
Using opening/closing brackets as delimiters may require escaping those same brackets inside the expression.
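The three cases discussed in this thread can be checked with a short sketch. It assumes a current PHP build, where a pattern that fails to compile makes preg_match() return false and emit a warning (suppressed with @ below):

```php
<?php
$subject = 'Hello {World';

// Unbalanced "{" inside {...} delimiters: PHP cannot find the matching
// closing delimiter, so compilation fails and the call returns false.
$unescaped = @preg_match('{ello {world}i', $subject);

// Escaping the inner brace lets the delimiters pair up again.
$escaped = preg_match('{ello \{world}i', $subject);

// Balanced inner braces, such as a quantifier, need no escaping.
$balanced = preg_match('{^[a-z]{1,5}$}', 'hello');

var_dump($unescaped, $escaped, $balanced);
```

So it is only an unpaired bracket of the delimiter's own type that has to be escaped; balanced ones nest.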

Error in regexp php

There is a mistake in this code that I could not find. Which character is missing?
preg_replace(/<(?!\/?(?:'.implode('|',$white).'))[^\s>]+(?:\s(?:(["''])(?:\\\1|[^\1])*?\1|[^>])*)?>/','',$html);
It looks like among other things you're missing a single quote:
preg_replace('/<(?!\/?(?:' . implode('|',$white) . '))[...
             ^
             here!
Also, since the pattern itself contains single quotes, those have to be escaped with a preceding backslash.
Alternatively you could also use heredoc syntax; this would not require any escaping of quotes in the pattern, and expressions can be embedded for expansion.
$pattern = <<<EOD
/pattern{$embeddedExpression}morePattern/
EOD;
... preg_replace($pattern, ...)
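A runnable sketch of the heredoc idea follows. The whitelist and the deliberately simplified pattern are assumptions made for the demo; this is not the full pattern from the question and not something to sanitize real HTML with. Note that backslashes are still processed inside a heredoc, so the regex \b is written as \\b:

```php
<?php
// Hypothetical whitelist of allowed tag names.
$white = ['b', 'i'];
$alternatives = implode('|', $white);

// Heredoc: quotes need no escaping and {$alternatives} is interpolated,
// but backslash escapes are still interpreted, hence "\\b".
$pattern = <<<EOD
~<(?!/?(?:{$alternatives})\\b)[^>]+>~i
EOD;

// Removes every tag that is not on the whitelist.
$clean = preg_replace($pattern, '', '<b>bold</b><script>x</script>');

var_dump($pattern, $clean);
```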
Do yourself a favor and use DOM and XPath instead of regex to parse HTML to avoid problems.
Well, this part is wrong:
(["'])(?:\\\1|[^\1])*?\1
That's supposed to match a sequence enclosed in single- or double quotes, possibly including backslash-escaped quotes. But it won't work because backreferences don't work in character classes. The \1 is treated as the number 1 in octal notation, so [^\1] matches any character except U+0001.
If it seems to work most of the time, it's because of the reluctant quantifier (*?). The first alternative in (?:\\\1|[^\1])*? correctly consumes an escaped quote, but otherwise it just matches any character, reluctantly, until it sees an unescaped quote. It works okay on well-formed text, but toss in an extra quote and it goes haywire.
The correct way to match "anything except what group #1 captured" is (?:(?!\1).)* - that is, consume one character at a time, but only after the lookahead confirms that it's not the first character of the captured text. But I think you'll be better off dealing with each kind of quote separately; this regex is complicated enough as it is.
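Both points can be seen in a small sketch (the sample sentence is invented, and backslash-escaped quotes are left out to keep it short):

```php
<?php
// Hypothetical sample text containing both kinds of quote.
$text = 'say "it\'s fine" now';

// Inside a character class, \1 is an octal escape (U+0001), so [^\1]
// happily matches a quote character.
$classMatchesQuote = preg_match('~[^\1]~', '"');

// (?!\1). consumes one character at a time, but only if that character
// is not the quote captured by group 1.
preg_match('~(["\'])((?:(?!\1).)*)\1~', $text, $m);

var_dump($classMatchesQuote, $m[1], $m[2]);
```

The apostrophe inside the double-quoted part is consumed normally, because the lookahead only rejects the quote type that opened the string.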
'~<(?!/?+(?:'.implode('|',$white).')\b)[^\s>]++(?:\s++'.
'(?:[^\'">]++|"(?:[^"\\\\]++|\\\\")*+"|\'(?:[^\'\\\\]++|\\\\\')*+\')*+)?+>~'
Notice the addition of the \b (word boundary) after the whitelist alternation. Without that, if you have (for example) <B> in your list, you'll unintentionally whitelist <BODY> and <BLOCKQUOTE> tags as well.
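To see the effect of the word boundary in isolation, here is a stripped-down sketch; the whitelist and the two test patterns are assumptions for the demo, not the full pattern above:

```php
<?php
// Hypothetical whitelist of allowed tag names.
$white = ['b', 'i'];
$alternatives = implode('|', $white);

$loose  = '~^</?(?:' . $alternatives . ')~i';   // no word boundary
$strict = '~^</?(?:' . $alternatives . ')\b~i'; // with \b after the names

foreach (['<b>', '<body>', '<blockquote>', '</i>'] as $tag) {
    printf(
        "%-13s loose: %d  strict: %d\n",
        $tag,
        preg_match($loose, $tag),
        preg_match($strict, $tag)
    );
}
```

The loose pattern accepts <body> and <blockquote> because they merely start with "b"; the strict one does not.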
I also used possessive quantifiers (*+, ++, ?+) everywhere, because the way this regex is written, I know backtracking will never be useful. If it's going to fail, I want it to fail as quickly as possible.
Now that I've told you how to get the regex to work, let me urge you not to use it. This job is too complex and too important to be done with such a poorly suited tool as regex. And if you really got that regex from a book on PHP security, I suggest you get your money back.

recursive regular expression to process nested strings enclosed by {| and |}

In a project I have a text with patterns like that:
{| text {| text |} text |}
more text
I want to get the first part with brackets. For this I use preg_match recursively. The following code works fine already:
preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);
But if I add the symbol "|", I get an empty result and I don't know why:
preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);
I can't use the first solution because in the text something like { text } can also exist. Can somebody tell me what I do wrong here? Thx
Try this:
'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'
In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character, but yours are two characters. To not-match a multi-character sequence you need something like this:
(?:(?!\{\||\|\}).)++
The dot matches any character (including newlines, thanks to the (?s)), but only after the lookahead has determined that it's not part of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.
You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem lies when it comes time to match a closing "|}" tag. The (?>[^{}]+) (or [^{}]++) sub expression will match the "|", causing the |} sub expression to fail. With no backtracking in the sub expression, there's no way to recover from the failed match.
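Here is the lookahead-based pattern from the answer above applied to the sample from the question, as a quick sketch (the second input, with plain braces inside, is an invented extra case):

```php
<?php
// Sample from the question, plus a trailing line that must not be matched.
$text = "{| text {| text |} text |}\nmore text";

$pattern = '/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/';

// $matches[0] holds the outermost {| ... |} block, nested block included.
preg_match($pattern, $text, $matches);

// Plain braces inside the block are treated as ordinary characters.
preg_match($pattern, 'x {| a { b } c |} y', $withBraces);

var_dump($matches[0], $withBraces[0]);
```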
See PHP - help with my REGEX-based recursive function
To adapt it to your use
preg_match_all('/\{\|(?:(?!\{\||\|\}).|(?R))*\|\}/s', $text, $matches);

Simple RegEx PHP

I have a string and I need to see if it contains the following "_archived".
I was using the following:
preg_match('(.*)_archived$',$string);
but get:
Warning: preg_match() [function.preg-match]: Unknown modifier '_' in /home/storrec/classes/class.main.php on line 70
I am new to Regular Expressions so this is probably very easy.
Or should I be using something a lot simpler like
strstr($string, "_archived");
Thanks in advance
strstr($string, "_archived");
Is going to be way easier for the problem you describe.
As is often quoted
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
strstr is enough in this case, but to solve your problem, you need to add delimiters to your regex. A delimiter is a special character that starts and ends the regex, like so:
preg_match('/_archived/',$string);
The delimiter can be a lot of different characters, but usual choices are /, # and !. From the PHP manual:
Any character can be used for delimiter as long as it's not alphanumeric, backslash (\), or the null byte. If the delimiter character has to be used in the expression itself, it needs to be escaped by backslash. Since PHP 4.0.4, you can also use Perl-style (), {}, [], and <> matching delimiters.
Read all about PHP regular expression syntax here.
You can see some examples of valid (and invalid) patterns in the PHP manual here.
You just need some delimiters, e.g. enclose the pattern with /
preg_match('/_archived$/',$string);
Perl regexes let you use any delimiter, which is handy if your regex uses / a lot. I often find myself using braces for example:
preg_match('{_archived$}',$string);
Also, note that you don't need the (.*) bit as you aren't capturing the bit before "_archived", you're just testing to see if the string ends with it (that $ symbol on the end matches the end of the string)
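A minimal sketch of the "ends with" check; the helper name endsWithArchived and the sample file names are made up for illustration:

```php
<?php
// Hypothetical helper name, for illustration only.
function endsWithArchived(string $s): bool
{
    // No (.*) needed: the regex may start matching anywhere in the
    // string, and $ pins the match to the end.
    return preg_match('/_archived$/', $s) === 1;
}

var_dump(endsWithArchived('report_archived'));    // bool(true)
var_dump(endsWithArchived('report_archived_v2')); // bool(false)

// The brace-delimited form behaves the same way.
var_dump(preg_match('{_archived$}', 'report_archived') === 1); // bool(true)
```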
If all you're looking for is if a string contains a string, then by all means use the simple version. But you can also simply do:
preg_match('/_archived/', $string);
Try:
preg_match('/(.*)_archived$/',$string);
If you are only checking if the string exists in $string, strstr should be enough for you though.
strstr() or strpos() would be better for finding something like this, I would think. The error you're getting is due to not having any delimiters around your regular expression. Try using
"/(.*)_archived$/"
or
"#(.*)_archived$#".
... is probably the best way to implement this. You're right that a regex is overkill.
When you have a specific string you're looking for, using regular expressions is a bit of overkill. You'll be fine using one of PHP's standard string search functions here.
The strstr function will work, but conventional PHP Wisdom (read: myth, legend, superstition, the manual) says that using the strpos function will yield better performance. i.e., something like
if(strpos($string, '_archived') !== false) {
}
You'll want to use the strict inequality operator (!==) here, as strpos returns 0 when it finds the needle at the start of its haystack, and 0 is loosely equal to false. See the manual for more information on this.
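The pitfall is easy to reproduce with a needle at offset 0 (the sample string is invented):

```php
<?php
// Hypothetical sample: the needle sits at the very start.
$string = '_archived_report';

$pos = strpos($string, '_archived'); // int(0), not false

$looseCheck  = (bool) $pos;      // false: 0 is falsy, so a bare if() fails
$strictCheck = ($pos !== false); // true: the needle was found

var_dump($pos, $looseCheck, $strictCheck);
```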
As to your problem, PHP's regular expression engine expects you to enclose your regular expression string with a set of delimiters, which can be one of a number of different characters ( "/" or "|" or "(" or "{" or "#", or ...). Because your pattern starts with "(", the engine takes the parentheses as the delimiters and thinks you want a regular expression of
.*
with a set of pattern modifiers that are
_archived$
So, in the future, when you use a regular expression, try something like
// equivalent
preg_match('/(.*)_archived$/i',$string);
preg_match('{(.*)_archived$}i',$string);
preg_match('#(.*)_archived$#i',$string);
The "/", "#", and "{}" characters are your delimiters. They are not used in the match; they're used to tell the engine "anything in between these characters is my regex". The "i" at the end is a pattern modifier that says to be case insensitive. It's not necessary, but I include it here so you can see what a pattern modifier looks like, and to help you understand why you need delimiters in the first place.
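Putting the diagnosis and the fix side by side, as a sketch: it assumes a current PHP build, where an unknown modifier makes preg_match() return false with a warning (suppressed with @ here), and the sample string is invented.

```php
<?php
// Hypothetical sample input.
$string = 'report_archived';

// "(" is taken as the opening delimiter and ")" as the closing one,
// so "_archived$" is parsed as modifiers and compilation fails.
$noDelimiters = @preg_match('(.*)_archived$', $string);

// With explicit delimiters, the same pattern works as intended.
$withDelimiters = preg_match('/(.*)_archived$/i', $string, $m);

var_dump($noDelimiters, $withDelimiters, $m[1]);
```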
You don't have to remember about delimiters if you are using T-Regx
pattern('_archived$')->matches($string)
