What can NOT be described by a PCRE regex? - php

I am using a lot of regular expressions and stumbled over the question what actually can not be described by a regex.
First example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each subpart consisting only of the character X or O is longer than the previous sequence of the other character.
Can anybody explain what is the formal limit of a regex? I know this might be a rather academic question but I'm a curious person ;-)
Edit
As I am a php guy I am especially interested in regex described by PCRE standard as described here: http://php.net/manual/en/reference.pcre.pattern.syntax.php
I know that PCRE allows a lot of things that are not part of the original regular expressions like back references.
Matching of balanced parentheses seems to be one example that cannot be matched by regular expressions in general but can be matched using PCRE (see http://sandbox.onlinephpfunctions.com/code/fd12b580bb9ad7a19e226219d5146322a41c6e47 for live example):
$data = array('()', '(())', ')(', '(((()', '(((((((((())))))))))', '()()');
$regex = '/^((?:[^()]|\((?1)\))*+)$/';
foreach($data as $d) {
    echo "$d matched by regex: " . (preg_match($regex, $d) ? 'yes' : 'no') . "\n";
}

First example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each subpart consisting only of the character X or O is longer than the previous sequence of the other character.
Yes, that can be done.
To match a non-empty sequence of x's, followed by a greater number of o's, we can use an approach similar to that of the balanced-parentheses regex:
(x(?1)?o)o+
To match a string of x's and o's such that any sequence of x's is followed by a longer sequence of o's (except optionally at the very end), we can build on pattern #1:
^o*(?:(x(?1)?o)o+)*x*$
And of course, we'll also need a variant of pattern #2 with x's and o's flipped:
^x*(?:(o(?1)?x)x+)*o*$
To match a string of x's and o's that meet both of the above criteria, we can convert pattern #2 to a positive lookahead assertion, and renumber the capture-group in pattern #3:
^(?=o*(?:(x(?1)?o)o+)*x*$)x*(?:(o(?2)?x)x+)*o*$
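A quick way to check the combined pattern is to run it over a few strings. The sketch below is not from the original answer: the sample strings are made up for illustration, and it assumes lowercase x and o as in the patterns above.

```php
<?php
// Combined pattern from above: every run must be longer than the run before it.
$regex = '/^(?=o*(?:(x(?1)?o)o+)*x*$)x*(?:(o(?2)?x)x+)*o*$/';

// Hypothetical sample inputs: the first three satisfy the rule, the last three do not.
$samples = array('xooxxxooooxxxxx', 'xoo', 'oxx', 'xxoo', 'xoox', 'ox');

$results = array();
foreach ($samples as $s) {
    $results[$s] = preg_match($regex, $s);
    echo "$s matched by regex: " . ($results[$s] ? 'yes' : 'no') . "\n";
}
```

Note that 'xoox' is rejected because the final run of x's is shorter than the run of o's before it; only the lookahead half of the pattern tolerates a trailing run of x's of any length.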
As for the main question . . . I'm confident that a PCRE can match any context-free language, since the support for (?n) outside of the nth capture-group means that you can basically create a subroutine for each of your non-terminals. For example, this context-free grammar:
S → aTb
S → ε
T → cSd
T → eTf
can be written as:
capture-group #1 (S) → (a(?2)b|)
capture-group #2 (T) → (c(?1)d|e(?2)f)
To assemble that into a single regex, we can just concatenate them all, but appending {0} after all but the start non-terminal, and then add ^ and $:
^(a(?2)b|)(c(?1)d|e(?2)f){0}$
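As a rough check of that construction, the grammar regex can be exercised from PHP. The input strings below are assumptions chosen for illustration: the first four can be derived from S, the last three cannot.

```php
<?php
// Group 1 is the start symbol S; group 2 (T) is only defined here, and {0} skips it.
$regex = '/^(a(?2)b|)(c(?1)d|e(?2)f){0}$/';

// Hypothetical inputs, not taken from the original answer.
$samples = array('', 'acdb', 'aecdfb', 'acacdbdb', 'ab', 'acd', 'aefb');

$results = array();
foreach ($samples as $s) {
    $results[$s] = preg_match($regex, $s);
    echo "'$s' matched by regex: " . ($results[$s] ? 'yes' : 'no') . "\n";
}
```

For example, 'acdb' is derived as S → aTb with T → cSd and the inner S → ε, while 'ab' fails because T has no empty production.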
But as you saw from your first example, PCREs can match some non-context-free languages as well. (Another example is a^n b^n c^n, which is a classic example of a non-context-free language. You can match it with PCRE by combining a PCRE for a^n b^n c^m with a PCRE for a^m b^n c^n using a lookahead assertion. Although the intersection of two regular languages is necessarily regular, the intersection of two context-free languages is not necessarily context-free; but the intersection of the languages defined by two PCREs can be defined by a PCRE.)

The set of all languages that can be recognized by a regular expression is called, not surprisingly, "regular languages".
The next most complicated languages are the context-free languages. They cannot be parsed by any regular expression in the formal sense. The standard example is "all balanced parentheses" -- so "()()" and "(())" but not "(()".
Another good example of a context-free language is HTML.

I don't have definitive evidence that any of these are impossible with things like recursion, balancing groups, self-referencing groups, and appending text to the string being tested. I would be glad to be proven wrong on any or all of these, as I would learn something!
It's pretty bad at math.
For example, I do not believe it is possible using PCRE, to detect a sequence of numbers that is ascending: that is, to match "1 2 7 97 315 316..." but not "127 97 315 316..."
I'm not sure it's possible to even match a sequence contiguously increasing from 1, like "1 2 3...", without exhaustively listing all possibilities like /1( 2( 3(...)?)?)?/ up to the max length you wish to check.
There are hacks to make it work by adding known text to the string under test (e.g. http://www.rexegg.com/regex-trick-line-numbers.html works by adding a series of numbers to the end of the file). But as raw regex, simple math is only possible by brute-forcing.
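For contrast, the ascending-sequence check is short once the arithmetic is done outside the regex. This is a minimal sketch, assuming whitespace-separated non-negative integers; isAscending is a hypothetical helper name, and the regex is used only to split the input.

```php
<?php
// Returns true when the whitespace-separated integers in $text are strictly ascending.
function isAscending($text) {
    $numbers = array_map('intval', preg_split('/\s+/', trim($text)));
    for ($i = 1; $i < count($numbers); $i++) {
        if ($numbers[$i] <= $numbers[$i - 1]) {
            return false;
        }
    }
    return true;
}

var_dump(isAscending('1 2 7 97 315 316')); // bool(true)
var_dump(isAscending('127 97 315 316'));   // bool(false)
```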
Another example which I believe it will fail at is "match any sequence which sums to N".
So for N=4, it should match 4, 3 1, 1 3, 2 2, 2 1 1, 1 2 1, 1 1 2, 1 1 1 1, which looks like you could brute-force it, until you realize it also has to match 4 -12 11 0 1.
In the same manner, I don't think you could have it analyze an equation using SI units, and verify whether the units balanced on both sides of the equation. For example, "10N=2kg*5ms^-1". Never mind checking the values, just checking the units are correct.
Then there're all the classes of problems that no computer program can currently accomplish, such as "check if a string has been correctly title cased in English" which would require a context-sensitive natural language parser to correctly detect the different senses of "like" in "Time Flies like an Arrow But Fruit Flies Like a Banana".

Related

Regex to replace a combination of number(digits/word) and a word

I use following code to replace a number and string to a replacement text
var rule = "(\\d+\\s((apple\\b|apples\\b|Apple\\b|Apples\\b)+))";
var search_regexp = new RegExp(rule, "ig");
return masterstring.replace(search_regexp,replacetext);
input string : 10 apples are better than 100 pears
replacement: 10 Oranges
Result: 10 Oranges are better than 100 pears
How is it possible to have a regular expression for handling 10 apples and Ten apples? Say one to identify
(a number in digits or word)+space+(a case insensitive word)
and replace this with 10 Oranges both using jQuery and php?
If you specifically only want to match valid number 'words' you would have to literally include in your regex all the numbers you want to include.
(one|two|three|four|five|six|seven|eight|nine|ten) etc.
This could be improved by combining words that start with the same letter:
(one|t(wo|hree|en)|f(our|ive)|s(ix|even)|eight|nine)
You can then include your \d+ as your first option:
(\d+|one|t(wo|hree|en)|f(our|ive)|s(ix|even)|eight|nine)
(As some said in the comments, you are using the case-insensitive modifier, so I have written everything in lower case.)
Note that if you want to go beyond ten this will become quite long and hard to make efficient. I've had a quick go and created a beast of a regex; I have not tried to optimise it too much:
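On the PHP side, a minimal sketch of the replacement could look like the following. It assumes only the number words one to ten; the added \b anchors and the \s+ separator are choices made for this example, not part of the answer above.

```php
<?php
// Digits or a number word from one to ten, whitespace, then "apple" or "apples".
// The i modifier makes both the number word and "apples" case-insensitive.
$pattern = '/\b(?:\d+|one|t(?:wo|hree|en)|f(?:our|ive)|s(?:ix|even)|eight|nine)\s+apples?\b/i';

$replaced1 = preg_replace($pattern, '10 Oranges', '10 apples are better than 100 pears');
$replaced2 = preg_replace($pattern, '10 Oranges', 'Ten Apples are better than 100 pears');

echo $replaced1 . "\n"; // 10 Oranges are better than 100 pears
echo $replaced2 . "\n"; // 10 Oranges are better than 100 pears
```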
(?:
\d+
|t(?:en|hirteen)
|eleven
|twelve
|fifteen
|(?:
(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)
(?:[ -](?:one|t(?:wo|hree)|f(?:our|ive)|s(?:ix|even)|eight|nine))?
)
|(?:one|t(?:wo|hree)|f(?:our(?:teen)?|ive)|s(?:ix|even)(?:teen)?|eight(?:een)?|nine(?:teen)?)
)[ ]apples?
(I have spread this over several lines and added the 'x' modifier in the online example, which makes it much easier to read. This works in PHP but not in JavaScript, where you would have to remove the newlines/whitespace.)
[See working example online here](https://regex101.com/r/zDYme7/1)
It's also worth mentioning that doing this in regex may not be the best way: a string tokenizer would involve a lot less CPU time, but would involve more code.
One example of a tokenizer: https://www.npmjs.com/package/tokenize-text

PHP Regexp capturing repeating group of chars, e.g. hahaha jajajaja hihihi

As title, is there a way in PHP, with preg_match_all to catch all the repetitions of chars group?
For instance catch
hahahaha
jajajaj
hihihi
It's fine to catch repetition of any char, like abababab, acacacacac.
Also, is there a way to count the number of repetition?
The idea is to catch all this "forms" of smiling on social media.
I figured out that there are also other cases, such as misspelled instances like ahahhahaah (where you have two consecutive a or h). Any ideas?
How about this:
preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);
$matches = $m[0]; //$matches will contain an array of matches
A bit complicated, but it does work. To explain, the first subpattern (((?i)[a-z])) matches any character between a and z, no matter the case. The second subpattern (((?i)[a-z])) does the same thing. The third subpattern ((\1\2)+) matches one or more repetitions of the first two letters, in the same case as they were originally put. This regular expression also assumes that there's an even number of repetitions. If you don't want that, you can add \1? at the end, meaning that (as long as it contains one or more repetitions), it can end with the first character (for instance, hahah and ikikikik would both be valid, but not asa).
To retrieve the number of repetitions for a specific match, you can do:
$numb = strlen($matches[$index])/2 - 1; //-1 because the first two letters aren't repetitions
For the shortest repetition (e.g. ha gets repeated multiple times in hahahaha):
(.+?)\1+
See demo.
For the longest repetition (e.g. haha gets repeated in hahahaha):
(.+)\1+
Counting Repetitions
The non-regex solution is to compare the lengths of Group 1 (the repeated token) and the overall match.
With pure regex, in .NET, you could simply do (.+?)(\1)+ and look at the number of captures in the Group 1 CaptureCollection object.
In PHP, that's not possible, but there are some hacks. See, for instance, this question about matching a line number—it's the same technique. This is for "study purposes" only—you wouldn't want to use that in real life.
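The length comparison described under "Counting Repetitions" can be sketched in PHP as follows. The sample string is an assumption, and [a-z] is used instead of . so that a match cannot start on a space.

```php
<?php
$str = 'hahahaha jajaja hihihi';

// Group 1 is the shortest repeated unit; the whole match is that unit repeated.
preg_match_all('/([a-z]+?)\1+/i', $str, $m, PREG_SET_ORDER);

$counts = array();
foreach ($m as $match) {
    // Number of repetitions = length of the whole match / length of the unit.
    $counts[$match[0]] = strlen($match[0]) / strlen($match[1]);
}
print_r($counts);
```

Unlike the strlen(...)/2 - 1 formula above, this counts the first occurrence of the unit as well, so hahahaha is reported as 4 repetitions of ha.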

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In English, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (this regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, and performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if later the code is modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.
Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
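The difference is easy to confirm from PHP. In this small sketch the test strings '|', 'AZZZ' and 'ACEG' are made up for illustration:

```php
<?php
// Inside a character class, | is a literal pipe.
var_dump(preg_match('/^[A|B]$/', '|')); // int(1)

// The original pattern is a top-level alternation, so one matching branch is enough.
$original  = '/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m';
$corrected = '/^[AB][CD][EF][GH]$/m';

var_dump(preg_match($original, 'AZZZ'));  // int(1)
var_dump(preg_match($corrected, 'AZZZ')); // int(0)
var_dump(preg_match($corrected, 'ACEG')); // int(1)
```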
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/ - but that's still the same solution, just in a more efficiently-processed form.)
You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
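function findMatchingLines($line) {
    static $file = null;
    // FILE_IGNORE_NEW_LINES strips the trailing "\n" so it is not counted as a character
    if( !$file) $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES);
    $search = str_split($line);
    $result = array();
    foreach($file as $l) {
        $test = str_split($l);
        // each position has its own pair of letters, so shared values are shared positions
        $matches = count(array_intersect($search,$test));
        if( $matches >= 2) // define number of matches required here - optionally make it an argument
            $result[] = $l;
    }
    // empty array if no matches
    return $result;
}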
There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
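If the alternation has to be generated for other line lengths or thresholds, the enumeration of possibilities can be automated. The following is a sketch only: buildSharedPositionsRegex is a hypothetical helper name, and it walks all 2^n position masks, which is fine for 4 or 16 characters but not for long lines.

```php
<?php
// Builds a pattern matching lines that agree with $line in at least $k positions.
function buildSharedPositionsRegex($line, $k) {
    $n = strlen($line);
    $alternatives = array();
    // Each bitmask selects the positions that must match; keep masks with exactly $k bits set.
    for ($mask = 0; $mask < (1 << $n); $mask++) {
        if (substr_count(decbin($mask), '1') !== $k) {
            continue;
        }
        $alt = '';
        for ($i = 0; $i < $n; $i++) {
            // Fixed character where the bit is set, wildcard elsewhere.
            $alt .= ($mask & (1 << $i)) ? preg_quote($line[$i], '/') : '.';
        }
        $alternatives[] = $alt;
    }
    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

$regex = buildSharedPositionsRegex('ACFG', 2);
echo $regex . "\n"; // /^(?:AC..|A.F.|.CF.|A..G|.C.G|..FG)$/m

preg_match_all($regex, "BCEG\nADFG\nBCFG\nBDFG\nBDEH", $found);
print_r($found[0]); // BCEG, ADFG, BCFG, BDFG
```

Fixing exactly $k positions and leaving the rest as wildcards is enough to express "at least $k", because the wildcards also accept matching characters.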
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
function findMatchingLines($line, $subject) {
    // The lookahead captures the 4 characters of the final (appended) line
    $regex = '/^(?=[\s\S]*\n([AB])([CD])([EF])([GH])\z)'
           . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    $matchingLines = array();
    preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matchingLines);
    return $matchingLines;
}
What this function does is append the line you want to match against to your input string, then use a lookahead to capture that final line's 4 characters and compare each earlier line against them through the back references. A lookahead is used because PCRE does not allow unbounded lookbehinds; the appended line itself is never returned, since no newline follows it.
If you also want to validate those matching lines against the "rules", just replace the . in each pattern with the appropriate character class (\1\2[EF][GH], etc.).
People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D in the first position, OR E or | or F in the first position, OR G or | or H in the first position.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer |'s alternate between everything around them. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/
as compared to:
/^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test FG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);
I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
/^[^ACFG]*+(?:[ACFG][^ACFG]*+){2,}$/m
It looks for two or more occurrences of one of the ACFG characters surrounded by any other characters. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
    // The braces are concatenated: "{$num}" inside a double-quoted string would be interpolated without them
    return "/^[^$line]*+(?:[$line][^$line]*+){" . $num . ",}$/m";
}

Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)

It is a well known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example you can parse the classical example of a context-free grammar {a^n b^n; n>0} (e.g. aaabbb) using this regex (demo):
~^(a(?1)?b)$~
My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {a^n b^n c^n; n>0} (e.g. aaabbbccc) using PCRE?
Inspired by NullUserException's answer (which he already deleted as it failed for one case) I think I have found a solution myself:
$regex = '~^
(?=(a(?-1)?b)c)
a+(b(?-1)?c)
$~x';
var_dump(preg_match($regex, 'aabbcc')); // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc')); // 0
var_dump(preg_match($regex, 'aaaccc')); // 0
var_dump(preg_match($regex, 'aabcc')); // 0
var_dump(preg_match($regex, 'abbcc')); // 0
Try it yourself: http://codepad.viper-7.com/1erq9v
Explanation
If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:
~^a+(b(?-1)?c)$~
This does nothing more than check that there's an arbitrary number of as, followed by an equal number of bs and cs.
This doesn't yet satisfy our grammar, because the number of as must be the same, too. We can ensure that by checking that the number of as equals the number of bs. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so we don't only match a part of the bs.
Conclusion
I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular"
Here is an alternative solution using balancing groups with .NET regex:
^(?'a'a)+(?'b-a'b)+(?(a)(?!))(?'c-b'c)+(?(b)(?!))$
Not PCRE, but may be of interest.
Example at: http://ideone.com/szhuE
Edit: Added the missing balancing check for the group a, and an online example.
My question is: How far can you go?
In the interests of not creating code that is an unreadable morass of punctuation, I'm going to risk the downvotes and answer a different, though very much related, question: how far should you go?
Regular expression parsers are a brilliant thing to have in your toolkit but they are not the be all and end all of programming. The ability to write parsers in a readable manner is also a brilliant thing to have in your toolkit.
Regular expressions should be used right up to the point where they start making your code hard to understand. Beyond that, their value is dubious at best, damaging at worst. For this specific case, rather than using something like the hideous:
~^(?=(a(?-1)?b)c)a+(b(?-1)?c)$~x
(with apologies to NikiC), which the vast majority of people trying to maintain it are either going to have to replace totally or spend substantial time reading up on and understanding, you may want to consider something like a non-RE, "proper-parser" solution (pseudo-code):
# Match "aa...abb...bcc...c" where:
# - same character count for each letter; and
# - character count is one or more.
def matchABC (string str):
    # Init string index and character counts.
    index = 0
    dim count['a'..'c'] = 0
    # Process each character in turn.
    for ch in 'a'..'c':
        # Count each character in the subsequence.
        while index < len(str) and str[index] == ch:
            count[ch]++
            index++
    # Failure conditions.
    if index != len(str): return false # did not finish string.
    if count['a'] < 1: return false # too few a characters.
    if count['a'] != count['b']: return false # inequality a and b count.
    if count['a'] != count['c']: return false # inequality a and c count.
    # Otherwise, it was okay.
    return true
This will be far easier to maintain in the future. I always like to suggest to people that they should assume those coming after them (who have to maintain the code they write) are psychopaths who know where you live - in my case, that may be half right, I have no idea where you live :-)
Unless you have a real need for regular expressions of this kind (and sometimes there are good reasons, such as performance in interpreted languages), you should optimise for readability first.
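Since the question is about PHP, the pseudo-code above might translate roughly as follows. This is one possible rendering, not the answerer's own code:

```php
<?php
// Returns true for strings of the form a^n b^n c^n with n >= 1.
function matchABC($str) {
    $index = 0;
    $length = strlen($str);
    $count = array('a' => 0, 'b' => 0, 'c' => 0);

    // Consume the run of each letter in turn, counting as we go.
    foreach (array('a', 'b', 'c') as $ch) {
        while ($index < $length && $str[$index] === $ch) {
            $count[$ch]++;
            $index++;
        }
    }

    return $index === $length              // whole string was consumed
        && $count['a'] >= 1                // at least one of each letter
        && $count['a'] === $count['b']
        && $count['a'] === $count['c'];
}

var_dump(matchABC('aaabbbccc')); // bool(true)
var_dump(matchABC('aaabbbcc'));  // bool(false)
var_dump(matchABC('aabcc'));     // bool(false)
```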
Qtax Trick
A solution that wasn't mentioned:
^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$
See what matches and fails in the regex demo.
This uses self-referencing groups (an idea @Qtax used on his vertical regex).
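A small harness for trying the pattern from PHP is sketched below. The sample strings are assumptions chosen for illustration; each pass through the loop grows group 1 by one b and group 2 by one c, so after n a's the groups hold b^n and c^n.

```php
<?php
// Self-referencing groups: \1?+ and \2?+ re-match the previous capture before adding one more letter.
$regex = '/^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$/';

// Hypothetical inputs: three balanced strings followed by three unbalanced ones.
$samples = array('abc', 'aabbcc', 'aaabbbccc', 'aabbc', 'aabbbcc', 'abbcc');

$results = array();
foreach ($samples as $s) {
    $results[$s] = preg_match($regex, $s);
    echo "$s matched by regex: " . ($results[$s] ? 'yes' : 'no') . "\n";
}
```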

Compilation failed: POSIX collating elements are not supported

I've just installed a website & legacy CMS onto our server and I'm getting a POSIX compilation error. Luckily it's only appearing in the backend however the client's keen to get rid of it.
Warning: preg_match_all() [function.preg-match-all]: Compilation failed:
POSIX collating elements are not supported at offset 32 in
/home/kwecars/public_html/webEdition/we/include/we_classes/SEEM/we_SEEM.class.php
on line 621
From what I can tell it's the newer version of PHP causing the issue. Here's the code:
function getAllHrefs($code){
$trenner = "[\040|\n|\t|\r]*";
$pattern = "/<(a".$trenner."[^>]+href".$trenner."[=\"|=\'|=\\\\|=]*".$trenner.")
([^\'\">\040? \\\]*)([^\"\' \040\\\\>]*)(".$trenner."[^>]*)>/sie";
preg_match_all($pattern, $code, $allLinks); // ---- line 621
return $allLinks;
}
How can I tweak this to work on the newer version of php on this server?
Thanks in advance, my voodoo just isn't strong enough ;)
Your error message that “POSIX collating elements are not supported” deserves some explanation. After all, what in the world is a POSIX collating element anyway, and how can I avoid it?
The short answer is that you have an equals sign inside your square brackets in a place where its use is reserved for future use, assuming we ever get around to implementing it, which is anything but certain. You can tickle this in Perl on the command line this way, which gives a much better error message than PHP is providing:
% perl -le 'print "abc" =~ /[=foo=]/ || "Fail"'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[=foo=] <-- HERE / at -e line 1.
That’s the short answer; the longer answer follows.
Fancy POSIX Character Classes
Inside a square bracketed character class, POSIX admits three different nested bracketed forms, all indicated using an extra symbol inside the brackets in pairs:
Named POSIX character classes, which are basically like Unicode properties, use an extra colon flanking: [:PROPERTY:], as in [:alpha:].
Collating elements intended to be treated as equivalent to each other, use an extra equals sign flanking them: [=ELEMENTS=], as in [=eéèëê=] in English or French, and [=vw=] in Swedish.
Polygraphs (digraphs, trigraphs, tetragraphs, etc), which are multicharacter elements meant to count as a single character, have an extra dot flanking them: [.DIGRAPH.], as in [.ch.] or [.ll.] per the traditional Spanish alphabet. These are sometimes known as contractions because two or more code points count as though that sequence were a single code point.
Perl supports only the first of these, not the second and third.
They are all awkward to use, because they must be nested inside an extra set of brackets, as in [[:punct:]] to mean \pP or \p{punct}. You only need extra braces with Unicode properties when you are selecting one of many, as in [\pL\pN\pM\p{Pc}].
The Intent
The other two were an attempt to support locale-specific linguistic elements in a pre‐Unicode environment under legacy 8‑bit locales. For example, to express the traditional Spanish alphabet, which counts acute accents over vowels and diaereses over u’s as the same letter yet which counts a tilde over an n as a different letter altogether, and which furthermore has two digraphs each counting as a distinct letter, you would have to write this in POSIX:
[[=aá=]bc[.ch.]d[=eé=]fgh[=ií=]jkl[.ll.]mnñ[=oó=]pqrst[=uúü=]vwxyz]
You can and sometimes must combine these. For example, in German phonebooks where the three i‑mutated vowels can be spelt without diacritics by inserting a following e:
[a[=ä[.ae.]=]bcdefghijklmno[=ö[.oe.]=]pqrs[=ß[.ss.]=]tu[=ü[.ue.]=]vwxyz]
That way, assuming $ES and $DE are those languages’ respective alphabets, you could say something like
[$ES]{4}
and have it match words like guía, niño, llave, and choco in Spanish; or in German have
[$DE]{6}
and have it match words like tschüß or its uppercase undiacriticked equivalent, TSCHUESS.
The Unicode Way
This is awkward for various reasons, and not just those that are obvious from the two alphabets listed above. It does not admit the notion of combining characters, so you have to add those explicitly for non-normalized text, as in [=e\xE9[.e\x{301}.]=].
Unicode has taken another path in how to implement linguistic elements like this. Fortunately, Unicode regular expressions per UTS#18 do not need to support language features tailored for specific languages or locales until Level 3. This is something no one has yet implemented.
Note that having SS and ß have the same casefold is not considered a locale tailoring. It is the full casefold for that code point no matter the linguistic context. So those are the same when case is ignored. Strange but true. Given that ß is code point U+00DF, we see that these are the same no matter the locale:
$ perl5.14.0 -E 'say "SS" =~ /^\xDF$/i ? "Pass" : "Fail"'
Pass
$ perl5.14.0 -E 'say "\xDF" =~ /^SS$/i ? "Pass" : "Fail"'
Pass
Although locale tailoring for patterns is still beyond us, collation has been implemented, including with locale support, and you can access it from Perl just fine.
However, PHP does not yet support Unicode collation.
References for Unicode collation include:
ICU’s Collation Concepts document
UTS#10: Unicode Collation Algorithm
Perl’s Unicode::Collate module.
Perl’s Unicode::Collate::Locale module.
[...] are character classes, they match any character between the brackets, you don't have to add | between them. See character classes.
So [abcd] will match a or b or c or d.
If you want to match alternations of more than one character, for example red or blue or yellow, use a sub pattern:
"(red|blue|yellow)"
And you guessed, [abcd] is equivalent to (a|b|c|d).
So here is what you could do for your regex:
For
$trenner = "[\040|\n|\t|\r]*";
Write this instead:
$trenner = "[\040\n\t\r]*";
And for
"[=\"|=\'|=\\\\|=]"
You could do
"(=\"|=\'|=\\\\|=)"
Or
"=[\"'\\\\]?"
BTW you could use \s instead of $trenner (see http://www.php.net/manual/en/regexp.reference.escape.php)
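Putting those suggestions together, a simplified version of the function might look like the sketch below. It is not the CMS's original pattern: it assumes only the href value is needed, and it returns just that capture group rather than the full match array.

```php
<?php
// Simplified sketch: captures the href value of each <a> tag.
function getAllHrefs($code) {
    // \s* stands in for $trenner, and ["']? replaces the bracket expression
    // that started with [= and was read as a POSIX collating element.
    $pattern = '/<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]*)/si';
    preg_match_all($pattern, $code, $allLinks);
    return $allLinks[1];
}

// Hypothetical input for illustration.
$html = '<a href="http://example.com/x">x</a> <A HREF=\'/y\'>y</A>';
$links = getAllHrefs($html);
print_r($links); // http://example.com/x, /y
```

The e modifier from the original pattern is left out because it only had meaning for preg_replace.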
