Explain why regular expression is too large PHP/PCRE [duplicate] - php

I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks, and question marks, but I also want to limit the input to 4000 characters. I have come up with the following regular expression to achieve this: /^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i.
However, when I attempt to use this regular expression to test a subject in PHP with preg_match(), I am given a warning: PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 and the subject fails to be tested.
I find this strange because if I use an infinite quantifier, the test passes just fine (I demonstrate this situation below).
Why is limiting the repetition to 4000 a problem, but infinite repetition not?
regex-test.php:
<?php
$infinite = "/^([a-z]|[0-9]| |,|'|\.|!|\?)*$/i"; // Allows infinite repetition
$fourk = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i"; // Limits repetition to 4000
$string = "I like apples.";
if ( preg_match($infinite, $string) ){
    echo "Passed infinite repetition. \n";
}
if ( preg_match($fourk, $string) ){
    echo "Passed maximum repetition of 4000. \n";
}
?>
echos:
Passed infinite repetition
PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 in regex-test.php on line 16

The error is caused by PCRE's LINK_SIZE: the offsets stored inside a compiled pattern are two bytes wide by default, which caps the size of the compiled pattern at about 64K. This is expected behavior, explained below. It is not caused by a limit on the repetition count, nor by how the pattern is interpreted once compiled.
In this case
As Alan Moore pointed out in his answer, all characters should be in the same character class. I'm more drastic, so allow me to say that pattern is so wrong it makes me cringe.
-No offense, most of us tried that once too. It's just an attempt to underline that in no way such constructs should be used.
There are 3 common pitfalls here in (x|y|z){1,4000}:
Capturing subpatterns should only be used when needed (to store a specific part of the matched text, in order to extract that value or to use it in a backreference). For all other use cases, stick to non-capturing groups or atomic groups. They perform better and save memory.
Capturing subpatterns should not be repeated because the last repetition overwrites the captured text.
-OK, it could be used only in very particular cases.
Alternation (with the |s) adds backtracking states. It's good practice to reduce them as much as you can. In this case, the regex /^[ !',.0-9?A-Z]{1,4000}$/i would match exactly the same input, not only avoiding the error but also providing better performance.
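A minimal sketch of the single character class in use. The helper name isValidInput is hypothetical and not part of the original question; the pattern is the one recommended above.

```php
<?php
// Hypothetical helper: validates input with one character class
// instead of a repeated capturing alternation.
function isValidInput(string $input): bool
{
    // One class plus one bounded quantifier keeps the compiled pattern tiny.
    return preg_match('/^[a-z0-9 ,\'.!?]{1,4000}$/i', $input) === 1;
}

var_dump(isValidInput('I like apples.'));       // bool(true)
var_dump(isValidInput('No <tags> allowed'));    // bool(false)
var_dump(isValidInput(str_repeat('a', 4001)));  // bool(false)
```

Comparing against 1 with === keeps a compilation failure (false) from being mistaken for "no match" (0).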
LINK_SIZE
From "Handling Very Large Patterns" in pcrebuild man page:
Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an
alternation metacharacter). By default, in the 8-bit and 16-bit
libraries, two-byte values are used for these offsets, leading to a
maximum size for a compiled pattern of around 64K.
That means the compiled pattern stores an offset value for every alternative in the alternation, for every repetition of the group, because a bounded quantifier on a group is compiled by replicating the group. In this case the replicated code grows past what a two-byte offset can address.
This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:
PCRE keeps offsets in its compiled code as 2-byte quantities (always
stored in big-endian order) by default. These are used, for example,
to link from the start of a subpattern to its alternatives and its
end. The use of 2 bytes per offset limits the size of the compiled
regex to around 64K, which is big enough for almost everybody.
Using pcretest, I get the following information:
PCRE version 8.37 2015-04-28
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
There's a Windows version you can download from RexEgg.com.
Regarding other size limitations in PCRE, you can check this post of mine.
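The pcretest figures above allow a rough back-of-the-envelope estimate. The sketch below assumes, as a simplification, that the compiled size grows linearly with the repetition count; the variable names are made up for illustration.

```php
<?php
// Figures taken from the pcretest output above.
$bytesAt574  = 65432; // code space reported for {1,574}
$repetitions = 574;

// Rough cost of one repetition, assuming linear growth (a simplification).
$bytesPerRepetition = $bytesAt574 / $repetitions;

// Extrapolated size for {1,4000}, compared with the two-byte offset ceiling.
$estimateFor4000 = (int) round($bytesPerRepetition * 4000);
$limit = 65536; // 2^16, the "around 64K" ceiling

printf("about %d bytes per repetition\n", round($bytesPerRepetition));
printf("about %d bytes estimated for {1,4000}; the ceiling is about %d\n", $estimateFor4000, $limit);
```

At roughly 114 bytes per repetition, {1,4000} would need several times the 64K that two-byte offsets can address, which is consistent with the failure starting at 575 repetitions.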
Overriding the default LINK_SIZE in PHP
If we had a true reason to use a huge pattern, and this pattern could not be simplified any further by any means, the link size could be increased. However, you can only achieve this by recompiling PHP yourself (therefore, your code won't be portable from now on). It should be the last resort, for when there's no other choice.
Also commented in pcre_internal.h:
The macros are controlled by the value of LINK_SIZE.
This defaults to 2 in the config.h file,
but can be overridden by using -D on the command line.
This is automated on Unix systems via the "configure" command.
PCRE link size can be configured to 3 or 4:
./configure -DLINK_SIZE=4
But keep in mind that longer offsets require additional data, and it will slow down all calls to preg_* functions.
In case of compiling PHP on your own, see Installation on Unix systems or Build your own PHP on Windows.

Looking at the 'regex' engine php uses, pcre here: http://pcre.sourceforge.net/pcre.txt at the limitations section it states:
The maximum length of a compiled pattern is 65539 (sic)
bytes
My guess is that some regex like this:
(123){1,3}
is compiled to something like this
(123)(123)?(123)?
Which makes it bigger than the maximum length
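To make that guess concrete, here is a small sketch that performs the same textual expansion. The helper expandBounded is hypothetical, and it only illustrates how quickly the expanded form grows; it does not reproduce PCRE's actual compiled bytecode.

```php
<?php
// Hypothetical helper illustrating the guess above: expand atom{min,max}
// into min mandatory copies followed by (max - min) optional copies.
function expandBounded(string $atom, int $min, int $max): string
{
    return str_repeat($atom, $min) . str_repeat($atom . '?', $max - $min);
}

echo expandBounded('(123)', 1, 3), "\n";            // (123)(123)?(123)?
echo strlen(expandBounded('(123)', 1, 4000)), "\n"; // 5 + 3999 * 6 = 23999
```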

While I agree that the regex compiler shouldn't behave that way, you really shouldn't have encountered this problem. Inside the parentheses, your regex matches exactly one character from a specific set--the definition of a character class. The correct way to write your regex is to list all the characters inside one set of square brackets and forego the parentheses:
/^[a-z0-9 ,'.!?]{1,4000}$/i
That works fine, as this demo shows. However, it was the parentheses that were causing the error (even non-capturing parens cause it), and that doesn't seem right to me, even if they were unnecessary.

For me the problem was an unescaped ? character.
You need to escape it with not one, but two backslashes: \\
My regexp went from (?340202) to (\\?340202)

Related

What can NOT be described by a PCRE regex?

I am using a lot of regular expressions and stumbled over the question what actually can not be described by a regex.
The first example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each run consisting only of the character X or O is longer than the previous run of the other character.
Can anybody explain what is the formal limit of a regex? I know this might be a rather academic question but I'm a curious person ;-)
Edit
As I am a php guy I am especially interested in regex described by PCRE standard as described here: http://php.net/manual/en/reference.pcre.pattern.syntax.php
I know that PCRE allows a lot of things that are not part of the original regular expressions like back references.
Matching of balanced parentheses seems to be one example that cannot be matched by regular expressions in general, but it can be matched using PCRE (see http://sandbox.onlinephpfunctions.com/code/fd12b580bb9ad7a19e226219d5146322a41c6e47 for a live example):
$data = array('()', '(())', ')(', '(((()', '(((((((((())))))))))', '()()');
$regex = '/^((?:[^()]|\((?1)\))*+)$/';
foreach($data as $d) {
    echo "$d matched by regex: " . (preg_match($regex, $d) ? 'yes' : 'no') . "\n";
}
The first example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each run consisting only of the character X or O is longer than the previous run of the other character.
Yes, that can be done.
To match a non-empty sequence of x's, followed by a greater number of o's, we can use an approach similar to that of the balanced-parentheses regex:
(x(?1)?o)o+
To match a string of x's and o's such that any sequence of x's is followed by a longer sequence of o's (except optionally at the very end), we can build on pattern #1:
^o*(?:(x(?1)?o)o+)*x*$
And of course, we'll also need a variant of pattern #2 with x's and o's flipped:
^x*(?:(o(?1)?x)x+)*o*$
To match a string of x's and o's that meet both of the above criteria, we can convert pattern #2 to a positive lookahead assertion, and renumber the capture-group in pattern #3:
^(?=o*(?:(x(?1)?o)o+)*x*$)x*(?:(o(?2)?x)x+)*o*$
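A quick way to try the combined pattern in PHP. The sample strings are made up for illustration, and the pattern uses lowercase x and o as in the patterns above.

```php
<?php
// The combined pattern from above, with ~ as the delimiter.
$pattern = '~^(?=o*(?:(x(?1)?o)o+)*x*$)x*(?:(o(?2)?x)x+)*o*$~';

// Hypothetical samples: every run must be longer than the run before it.
$samples = ['xooxxxoooo', 'xoo', 'xxo', 'xooxx'];

foreach ($samples as $s) {
    echo $s, ': ', preg_match($pattern, $s) === 1 ? 'match' : 'no match', "\n";
}
```

The first two samples satisfy the rule. The last two do not: in xxo the o run is shorter than the x run, and in xooxx the final x run is not longer than the o run before it.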
As for the main question . . . I'm confident that a PCRE can match any context-free language, since the support for (?n) outside of the nth capture-group means that you can basically create a subroutine for each of your non-terminals. For example, this context-free grammar:
S → aTb
S → ε
T → cSd
T → eTf
can be written as:
capture-group #1 (S) → (a(?2)b|)
capture-group #2 (T) → (c(?1)d|e(?2)f)
To assemble that into a single regex, we can just concatenate them all, but appending {0} after all but the start non-terminal, and then add ^ and $:
^(a(?2)b|)(c(?1)d|e(?2)f){0}$
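The assembled regex can be checked against strings derived by hand from the grammar. This is only a sketch; the test strings are my own derivations, not part of the original answer.

```php
<?php
// Group 1 is S, group 2 is T. The {0} defines T as a subroutine
// without requiring it to match at that position.
$grammar = '~^(a(?2)b|)(c(?1)d|e(?2)f){0}$~';

// Hand-derived examples: the first four are in the language, the last two are not.
$tests = ['', 'acdb', 'aecdfb', 'acacdbdb', 'ab', 'acd'];

foreach ($tests as $t) {
    $verdict = preg_match($grammar, $t) === 1 ? 'accepted' : 'rejected';
    echo '"', $t, '": ', $verdict, "\n";
}
```

For instance, acdb comes from S → aTb with T → cSd and S → ε, while ab is rejected because T can never derive the empty string.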
But as you saw from your first example, PCREs can match some non-context-free languages as well. (Another example is a^n b^n c^n, which is a classic example of a non-context-free language. You can match it with PCRE by combining a PCRE for a^n b^n c^m with a PCRE for a^m b^n c^n using a positive lookahead assertion. Although the intersection of two regular languages is necessarily regular, the intersection of two context-free languages is not necessarily context-free; but the intersection of the languages defined by two PCREs can be defined by a PCRE.)
The set of all languages that can be recognized by a regular expression is called, not surprisingly, "regular languages".
The next most complicated languages are the context-free languages. They cannot be parsed by any regular expression. The standard example is "all balanced parentheses" -- so "()()" and "(())" but not "(()".
Another good example of a context-free language is HTML.
I don't have definitive evidence that any of these are impossible with things like recursion, balancing groups, self-referencing groups, and appending text to the string being tested. I would be glad to be proven wrong on any or all of these, as I would learn something!
It's pretty bad at math.
For example, I do not believe it is possible using PCRE, to detect a sequence of numbers that is ascending: that is, to match "1 2 7 97 315 316..." but not "127 97 315 316..."
I'm not sure it's possible to even match a sequence contiguously increasing from 1, like "1 2 3...", without exhaustively listing all possibilities like /1( 2( 3(...)?)?)?/ up to the max length you wish to check.
There are hacks to make it work by adding known text to the string under test (e.g. http://www.rexegg.com/regex-trick-line-numbers.html works by adding a series of numbers to the end of the file). But as raw regex, simple math is only possible by brute-forcing.
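For contrast, the ascending check is trivial once the arithmetic is moved out of the regex. In this sketch (the function name isAscending is hypothetical) the regex only verifies the shape of the input, and ordinary PHP does the comparison.

```php
<?php
// Hypothetical helper: the regex only checks the shape of the input;
// the numeric comparison is done in plain PHP.
function isAscending(string $input): bool
{
    if (preg_match('/^\d+(?: \d+)*$/', $input) !== 1) {
        return false; // not a space-separated list of non-negative integers
    }
    $numbers = array_map('intval', explode(' ', $input));
    for ($i = 1, $n = count($numbers); $i < $n; $i++) {
        if ($numbers[$i] <= $numbers[$i - 1]) {
            return false; // found a value that does not exceed its predecessor
        }
    }
    return true;
}

var_dump(isAscending('1 2 7 97 315 316')); // bool(true)
var_dump(isAscending('127 97 315 316'));   // bool(false)
```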
Another example which I believe it will fail at is "match any sequence which sums to N".
So for N=4, it should match 4, 3 1, 1 3, 2 2, 2 1 1, 1 2 1, 1 1 2, 1 1 1 1, which looks like you could brute-force it, until you realize it also has to match 4 -12 11 0 1.
In the same manner, I don't think you could have it analyze an equation using SI units, and verify whether the units balanced on both sides of the equation. For example, "10N=2kg*5ms^-1". Never mind checking the values, just checking the units are correct.
Then there're all the classes of problems that no computer program can currently accomplish, such as "check if a string has been correctly title cased in English" which would require a context-sensitive natural language parser to correctly detect the different senses of "like" in "Time Flies like an Arrow But Fruit Flies Like a Banana".

preg_match appears to hit a limit when using two matches

I have run up against an odd problem. It appears I am reaching some sort of limit with preg_match while trying to use two matches using php-5.3.3
// works fine
$pattern_1 = '?START(.*)STOP?';
$string = 'START' . str_repeat('x',9999999) . 'STOP' ;
preg_match($pattern_1, $string , $matchedArray ) ;
$pattern_2 = '?START-ONE(.*)STOP-ONE.*START-TWO(.*)STOP-TWO.*?';
// works fine
$string = 'START-ONE this is head stuff STOP-ONE START-TWO' . str_repeat('x', 49970) . 'STOP-TWO' ;
preg_match($pattern_2, $string , $matchedArray_2 ) ;
// didnt work
$string = 'START-ONE this is head stuff STOP-ONE START-TWO' . str_repeat('x', 49971) . 'STOP-TWO' ;
preg_match($pattern_2, $string , $matchedArray_3 ) ;
The first option, with only one match, uses a very large string and has no problems.
The second option has a string length of 50,026 and works fine. The last option has a string length of 50,027 (one more) and the match no longer works. The exact threshold of 49971 may vary, so it can be changed to something larger to reproduce the problem.
Any ideas or thoughts? Is this perhaps a PHP version issue? Maybe a possible workaround is to use only one match rather than two and then run preg_match twice?
OK, PHP's not very talkative about regex errors. It just returns false for the last case, which simply tells you that an error occurred, per the PHP docs.
I've reproduced the problem using PCRE (the regex engine used by preg_match) in C# (but with a much higher character count), and the error I'm getting is PCRE_ERROR_MATCHLIMIT.
This means you're hitting the backtracking limit set in PCRE. It's just a safety measure to prevent the engine from looping indefinitely, and I think your PHP configuration sets it to a low value.
To fix the issue, you can set a higher value for the pcre.backtrack_limit PHP option which controls this limit:
ini_set("pcre.backtrack_limit", "10000000"); // Actually, this is PCRE's default
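Since preg_match() only returns false on failure, preg_last_error() is the way to find out which limit was hit. A minimal sketch, assuming a hypothetical wrapper named matchOrExplain:

```php
<?php
// Hypothetical wrapper: run preg_match() and report why it failed, if it did.
function matchOrExplain(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);
    if ($result !== false) {
        return $result === 1 ? 'matched' : 'no match';
    }
    // preg_match() returned false: ask PHP which PCRE error occurred.
    switch (preg_last_error()) {
        case PREG_BACKTRACK_LIMIT_ERROR:
            return 'backtrack limit exhausted (see pcre.backtrack_limit)';
        case PREG_RECURSION_LIMIT_ERROR:
            return 'recursion limit exhausted (see pcre.recursion_limit)';
        default:
            return 'other PCRE error, code ' . preg_last_error();
    }
}

echo matchOrExplain('~START(.*?)STOP~s', 'START x STOP'), "\n"; // matched
```

Whether a given subject actually trips the backtrack limit depends on the PHP version, the configured limit, and whether the PCRE JIT is enabled, so the wrapper reports the error rather than assuming one.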
On a side note:
You probably should replace (.*) with (.*?) to get less useless backtracking and for correctness (otherwise the regex engine will get past the STOP string and will have to backtrack to reach it)
Using ? as a pattern delimiter is a bad idea since it prevents you from using the ? metacharacter and therefore applying the above advice. Really, you should never use regex metacharacters as pattern delimiters.
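Both side notes combined might look like the following sketch. It swaps the ? delimiter for ~ and uses lazy quantifiers; the trailing .* of the original pattern is dropped because it does not affect the captures. The subject is a shortened version of the one in the question.

```php
<?php
// Same idea as $pattern_2, but with ~ as the delimiter and lazy quantifiers.
$pattern = '~START-ONE(.*?)STOP-ONE.*?START-TWO(.*?)STOP-TWO~';
$string  = 'START-ONE this is head stuff STOP-ONE START-TWO'
         . str_repeat('x', 1000) . 'STOP-TWO';

if (preg_match($pattern, $string, $m) === 1) {
    echo trim($m[1]), "\n";   // this is head stuff
    echo strlen($m[2]), "\n"; // 1000
}
```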
If you're interested in more low-level details, here's the relevant bit of the PCRE docs (emphasis mine):
The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats.
Internally, pcre_exec() uses a function called match(), which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string.
When pcre_exec() is called with a pattern that was successfully studied with a JIT option, the way that the matching is executed is entirely different. However, there is still the possibility of runaway matching that goes on for a very long time, and so the match_limit value is also used in this case (but in a different way) to limit how long the matching can continue.
The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by suppling pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
A value for the match limit may also be supplied by an item at the start of a pattern of the form
(*LIMIT_MATCH=d)
where d is a decimal number. However, such a setting is ignored unless d is less than the limit set by the caller of pcre_exec() or, if no such limit is set, less than the default.

Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)

It is a well-known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example you can parse the classical example of a context-free grammar {a^n b^n; n>0} (e.g. aaabbb) using this regex (demo):
~^(a(?1)?b)$~
My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {a^n b^n c^n; n>0} (e.g. aaabbbccc) using PCRE?
Inspired by NullUserException's answer (which he already deleted as it failed for one case) I think I have found a solution myself:
$regex = '~^
(?=(a(?-1)?b)c)
a+(b(?-1)?c)
$~x';
var_dump(preg_match($regex, 'aabbcc')); // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc')); // 0
var_dump(preg_match($regex, 'aaaccc')); // 0
var_dump(preg_match($regex, 'aabcc')); // 0
var_dump(preg_match($regex, 'abbcc')); // 0
Try it yourself: http://codepad.viper-7.com/1erq9v
Explanation
If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:
~^a+(b(?-1)?c)$~
This does nothing more than check that there's an arbitrary number of as, followed by an equal number of bs and cs.
This doesn't yet satisfy our grammar, because the number of as must be the same, too. We can ensure that by checking that the number of as equals the number of bs. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so we don't only match a part of the bs.
Conclusion
I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular"
Here is an alternative solution using balancing groups with .NET regex:
^(?'a'a)+(?'b-a'b)+(?(a)(?!))(?'c-b'c)+(?(b)(?!))$
Not PCRE, but may be of interest.
Example at: http://ideone.com/szhuE
Edit: Added the missing balancing check for the group a, and an online example.
My question is: How far can you go?
In the interests of not creating code that is an unreadable morass of punctuation, I'm going to risk the downvotes and answer a different, though very much related, question: how far should you go?
Regular expression parsers are a brilliant thing to have in your toolkit but they are not the be all and end all of programming. The ability to write parsers in a readable manner is also a brilliant thing to have in your toolkit.
Regular expressions should be used right up to the point where they start making your code hard to understand. Beyond that, their value is dubious at best, damaging at worst. For this specific case, rather than using something like the hideous:
~^(?=(a(?-1)?b)c)a+(b(?-1)?c)$~x
(with apologies to NikiC), which the vast majority of people trying to maintain it are either going to have to replace totally or spend substantial time reading up on and understanding, you may want to consider something like a non-RE, "proper-parser" solution (pseudo-code):
# Match "aa...abb...bcc...c" where:
# - same character count for each letter; and
# - character count is one or more.
def matchABC (string str):
    # Init string index and character counts.
    index = 0
    dim count['a'..'c'] = 0
    # Process each character in turn.
    for ch in 'a'..'c':
        # Count each character in the subsequence.
        while index < len(str) and str[index] == ch:
            count[ch]++
            index++
    # Failure conditions.
    if index != len(str): return false # did not finish string.
    if count['a'] < 1: return false # too few a characters.
    if count['a'] != count['b']: return false # inequality a and b count.
    if count['a'] != count['c']: return false # inequality a and c count.
    # Otherwise, it was okay.
    return true
This will be far easier to maintain in the future. I always like to suggest to people that they should assume those coming after them (who have to maintain the code they write) are psychopaths who know where you live - in my case, that may be half right, I have no idea where you live :-)
Unless you have a real need for regular expressions of this kind (and sometimes there are good reasons, such as performance in interpreted languages), you should optimise for readability first.
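For readers who want something runnable, here is one possible PHP rendering of the pseudo-code above. It is a sketch that follows the same steps, not the answerer's own code.

```php
<?php
// One possible PHP rendering of the pseudo-code above.
function matchABC(string $str): bool
{
    $index  = 0;
    $length = strlen($str);
    $count  = ['a' => 0, 'b' => 0, 'c' => 0];

    // Consume the run of each letter in turn, counting as we go.
    foreach (['a', 'b', 'c'] as $ch) {
        while ($index < $length && $str[$index] === $ch) {
            $count[$ch]++;
            $index++;
        }
    }

    if ($index !== $length) return false; // did not finish the string
    if ($count['a'] < 1) return false;    // too few a characters
    return $count['a'] === $count['b'] && $count['a'] === $count['c'];
}

var_dump(matchABC('aaabbbccc')); // bool(true)
var_dump(matchABC('aaabbbcc'));  // bool(false)
```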
Qtax Trick
A solution that wasn't mentioned:
^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$
See what matches and fails in the regex demo.
This uses self-referencing groups (an idea @Qtax used on his vertical regex).
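A short sketch for trying this pattern from PHP; the sample strings are made up for illustration.

```php
<?php
// The self-referencing pattern from above. Each iteration consumes one a and,
// inside the lookahead, grows group 1 by one b and group 2 by one c.
$qtax = '~^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$~';

// Hypothetical samples: the first two are in a^n b^n c^n, the rest are not.
$samples = ['abc', 'aabbcc', 'aabbc', 'aabcc', 'abbcc'];

foreach ($samples as $s) {
    echo $s, ': ', preg_match($qtax, $s) === 1 ? 'match' : 'no match', "\n";
}
```

After n iterations group 1 holds n b's and group 2 holds n c's, so the trailing \1\2 only succeeds when the rest of the string is exactly that.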

Compilation failed: POSIX collating elements are not supported

I've just installed a website & legacy CMS onto our server and I'm getting a POSIX compilation error. Luckily it's only appearing in the backend however the client's keen to get rid of it.
Warning: preg_match_all() [function.preg-match-all]: Compilation failed:
POSIX collating elements are not supported at offset 32 in
/home/kwecars/public_html/webEdition/we/include/we_classes/SEEM/we_SEEM.class.php
on line 621
From what I can tell it's the newer version of PHP causing the issue. Here's the code:
function getAllHrefs($code){
$trenner = "[\040|\n|\t|\r]*";
$pattern = "/<(a".$trenner."[^>]+href".$trenner."[=\"|=\'|=\\\\|=]*".$trenner.")
([^\'\">\040? \\\]*)([^\"\' \040\\\\>]*)(".$trenner."[^>]*)>/sie";
preg_match_all($pattern, $code, $allLinks); // ---- line 621
return $allLinks;
}
How can I tweak this to work on the newer version of php on this server?
Thanks in advance, my voodoo just isn't strong enough ;)
Your error message that “POSIX collating elements are not supported” deserves some explanation. After all, what in the world is a POSIX collating element anyway, and how can I avoid it?
The short answer is that you have an equals sign inside your square brackets in a place where its use is reserved for future use, assuming we ever get around to implementing it, which is anything but certain. You can tickle this in Perl on the command line this way, which gives a much better error message than PHP is providing:
% perl -le 'print "abc" =~ /[=foo=]/ || "Fail"'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[=foo=] <-- HERE / at -e line 1.
That’s the short answer; the longer answer follows.
Fancy POSIX Character Classes
Inside a square bracketed character class, POSIX admits three different nested bracketed forms, all indicated using an extra symbol inside the brackets in pairs:
Named POSIX character classes, which are basically like Unicode properties, use an extra colon flanking: [:PROPERTY:], as in [:alpha:].
Collating elements intended to be treated as equivalent to each other, use an extra equals sign flanking them: [=ELEMENTS=], as in [=eéèëê=] in English or French, and [=vw=] in Swedish.
Polygraphs (digraphs, trigraphs, tetragraphs, etc), which are multicharacter elements meant to count as a single character, have an extra dot flanking them: [.DIGRAPH.], as in [.ch.] or [.ll.] per the traditional Spanish alphabet. These are sometimes known as contractions because two or more code points count as though that sequence were a single code point.
Perl supports only the first of these, not the second and third.
They are all awkward to use, because they must be nested inside an extra set of brackets, as in [[:punct:]] to mean \pP or \p{punct}. You only need extra braces with Unicode properties when you are selecting one of many, as in [\pL\pN\pM\p{Pc}].
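The same distinction can be observed from PHP, whose preg_* functions use PCRE. This is a small sketch with made-up variable names; the @ operator is only there to silence the compilation warning.

```php
<?php
// Named POSIX class, correctly nested inside an outer pair of brackets.
$alpha = preg_match('/^[[:alpha:]]+$/', 'abc');

// A comparable check written with a Unicode property.
$property = preg_match('/^\pL+$/u', 'abc');

// [= ... =] is parsed as a collating element and rejected at compile time,
// so preg_match() returns false instead of 0 or 1.
$collating = @preg_match('/[=foo=]/', 'abc');

var_dump($alpha);     // int(1)
var_dump($property);  // int(1)
var_dump($collating); // bool(false)
```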
The Intent
The other two were an attempt to support locale-specific linguistic elements in a pre‐Unicode environment under legacy 8‑bit locales. For example, to express the traditional Spanish alphabet, which counts acute accents over vowels and diaereses over u’s as the same letter yet which counts a tilde over an n as a different letter altogether, and which furthermore has two digraphs each counting as a distinct letter, you would have to write this in POSIX:
[[=aá=]bc[.ch.]d[=eé=]fgh[=ií=]jkl[.ll.]mnñ[=oó=]pqrst[=uúü=]vwxyz]
You can and sometimes must combine these. For example, in German phonebooks where the three i‑mutated vowels can be spelt without diacritics by inserting a following e:
[a[=ä[.ae.]=]bcdefghijklmno[=ö[.oe.]=]pqrs[=ß[.ss.]=]tu[=ü[.ue.]=]vwxyz]
That way, assuming $ES and $DE are those languages’ respective alphabets, you could say something like
[$ES]{4}
and have it match words like guía, niño, llave, and choco in Spanish; or in German have
[$DE]{6}
and have it match words like tschüß or its uppercase undiacriticked equivalent, TSCHUESS.
The Unicode Way
This is awkward for various reasons, and not just those that are obvious from the two alphabets listed above. It does not admit the notion of combining characters, so you have to add those explicitly for non-normalized text, as in [=e\xE9[.e\x{301}.]=].
Unicode has taken another path in how to implement linguistic elements like this. Fortunately, Unicode regular expressions per UTS#18 do not need to support language features tailored for specific languages or locales until Level 3. This is something no one has yet implemented.
Note that having SS and ß have the same casefold is not considered a locale tailoring. It is the full casefold for that code point no matter the linguistic context. So those are the same when case is ignored. Strange but true. Given that ß is code point U+00DF, we see that these are the same no matter the locale:
$ perl5.14.0 -E 'say "SS" =~ /^\xDF$/i ? "Pass" : "Fail"'
Pass
$ perl5.14.0 -E 'say "\xDF" =~ /^SS$/i ? "Pass" : "Fail"'
Pass
Although locale tailoring for patterns is still beyond us, collation has been implemented, including with locale support, and you can access it from Perl just fine.
However, PHP does not yet support Unicode collation.
References for Unicode collation include:
ICU’s Collation Concepts document
UTS#10: Unicode Collation Algorithm
Perl’s Unicode::Collate module.
Perl’s Unicode::Collate::Locale module.
[...] are character classes, they match any character between the brackets, you don't have to add | between them. See character classes.
So [abcd] will match a or b or c or d.
If you want to match alternations of more than one character, for example red or blue or yellow, use a sub pattern:
"(red|blue|yellow)"
And you guessed, [abcd] is equivalent to (a|b|c|d).
So here is what you could do for your regex:
For
$trenner = "[\040|\n|\t|\r]*";
Write this instead:
$trenner = "[\040\n\t\r]*";
And for
"[=\"|=\'|=\\\\|=]"
You could do
"(=\"|=\'|=\\\\|=)"
Or
"=[\"'\\\\]?"
BTW you could use \s instead of $trenner (see http://www.php.net/manual/en/regexp.reference.escape.php)
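Putting these suggestions together, a corrected function might look like the sketch below. It is untested against the CMS in question and rests on a few assumptions: \s replaces $trenner, the line break inside the original pattern is treated as a wrapping artifact, and the e modifier is dropped because it only ever had meaning for preg_replace() and was removed in PHP 7.

```php
<?php
// Sketch of a repaired getAllHrefs(), keeping the original four capture groups.
function getAllHrefs(string $code): array
{
    $ws = '\s*'; // stands in for the original $trenner

    $pattern = '/<(a' . $ws . '[^>]+href' . $ws . '=["\'\\\\]?' . $ws . ')'
             . '([^\'">\s?\\\\]*)([^"\'\s\\\\>]*)(' . $ws . '[^>]*)>/si';

    preg_match_all($pattern, $code, $allLinks);
    return $allLinks;
}

$links = getAllHrefs('<a href="page.html?x=1" class="c">');
echo $links[2][0], "\n"; // page.html
echo $links[3][0], "\n"; // ?x=1
```

Group 2 receives the part of the URL before the query string and group 3 the remainder, which mirrors how the original pattern split them.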
