preg_match appears to hit a limit when using two matches - php

I have run up against an odd problem. It appears I am reaching some sort of limit with preg_match while trying to capture two matches, using PHP 5.3.3.
// works fine
$pattern_1 = '?START(.*)STOP?';
$string = 'START' . str_repeat('x',9999999) . 'STOP' ;
preg_match($pattern_1, $string , $matchedArray ) ;
$pattern_2 = '?START-ONE(.*)STOP-ONE.*START-TWO(.*)STOP-TWO.*?';
// works fine
$string = 'START-ONE this is head stuff STOP-ONE START-TWO' . str_repeat('x', 49970) . 'STOP-TWO' ;
preg_match($pattern_2, $string , $matchedArray_2 ) ;
// doesn't work
$string = 'START-ONE this is head stuff STOP-ONE START-TWO' . str_repeat('x', 49971) . 'STOP-TWO' ;
preg_match($pattern_2, $string , $matchedArray_3 ) ;
The first option, with only one match, uses a very large string and has no problems.
The second option has a string length of 50,026 and works fine. The last option has a string length of 50,027 (one more) and the match no longer works. Since the exact length at which the error occurs can vary, 49971 can be changed to something larger to reproduce the problem.
Any ideas or thoughts? Is this perhaps a PHP version issue? A possible workaround might be to use only one match rather than two, and then run preg_match twice.

OK, PHP's not very talkative about regex errors: it just returns false for the last case, which simply tells you that an error occurred, per the PHP docs.
I've reproduced the problem using PCRE (the regex engine used by preg_match) in C# (but with a much higher character count), and the error I'm getting is PCRE_ERROR_MATCHLIMIT.
This means you're hitting the backtracking limit set in PCRE. It's just a safety measure to prevent the engine from looping indefinitely, and I think your PHP configuration sets it to a low value.
To fix the issue, you can set a higher value for the pcre.backtrack_limit PHP option which controls this limit:
ini_set("pcre.backtrack_limit", "10000000"); // Actually, this is PCRE's default
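Since PHP only returns false, preg_last_error() is the way to confirm which limit was hit. Here is a minimal sketch of that check. The helper name matchOrFail is made up for illustration, and the tiny limit plus the disabled JIT are assumptions chosen only so the failure reproduces with a short subject:

```php
<?php
// Hypothetical helper: run preg_match and turn a silent `false` into an exception.
function matchOrFail($pattern, $subject)
{
    $matches = array();
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        $code = preg_last_error();
        $reason = ($code === PREG_BACKTRACK_LIMIT_ERROR)
            ? 'backtrack limit exceeded'
            : 'PCRE error code ' . $code;
        throw new RuntimeException($reason);
    }
    return $matches;
}

// Assumption: a deliberately tiny limit and no JIT, so the error shows up quickly.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

$pattern = '~START-ONE(.*)STOP-ONE.*START-TWO(.*)STOP-TWO.*~';
$subject = 'START-ONE head STOP-ONE START-TWO' . str_repeat('x', 50000) . 'STOP-TWO';

try {
    matchOrFail($pattern, $subject);
    $outcome = 'matched';
} catch (RuntimeException $e) {
    $outcome = $e->getMessage();
}
echo $outcome, "\n";
```

Raising pcre.backtrack_limit again (or removing the two ini_set() calls) should let the same call succeed.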
On a side note:
You probably should replace (.*) with (.*?) to get less useless backtracking and for correctness (otherwise the regex engine will get past the STOP string and will have to backtrack to reach it)
Using ? as a pattern delimiter is a bad idea since it prevents you from using the ? metacharacter and therefore applying the above advice. Really, you should never use regex metacharacters as pattern delimiters.
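To illustrate both side notes together, here is a small sketch using ~ as the delimiter and lazy quantifiers. The s modifier is my own addition (so that . also matches newlines) and the shortened subject is just for the demo. Note that lazy matching reduces the useless backtracking but does not make the limit irrelevant for very long captures:

```php
<?php
// '~' is not a regex metacharacter, so '?' keeps its usual meaning inside the pattern.
$pattern = '~START-ONE(.*?)STOP-ONE.*?START-TWO(.*?)STOP-TWO~s';
$subject = 'START-ONE this is head stuff STOP-ONE START-TWO'
         . str_repeat('x', 1000) . 'STOP-TWO';

if (preg_match($pattern, $subject, $m) === 1) {
    echo trim($m[1]), "\n";   // text between the first pair of markers
    echo strlen($m[2]), "\n"; // length of the text between the second pair
}
```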
If you're interested in more low-level details, here's the relevant bit of the PCRE docs (emphasis mine):
The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats.
Internally, pcre_exec() uses a function called match(), which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string.
When pcre_exec() is called with a pattern that was successfully studied with a JIT option, the way that the matching is executed is entirely different. However, there is still the possibility of runaway matching that goes on for a very long time, and so the match_limit value is also used in this case (but in a different way) to limit how long the matching can continue.
The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by suppling pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
A value for the match limit may also be supplied by an item at the start of a pattern of the form
(*LIMIT_MATCH=d)
where d is a decimal number. However, such a setting is ignored unless d is less than the limit set by the caller of pcre_exec() or, if no such limit is set, less than the default.

Related

Explain why regular expression is too large PHP/PCRE [duplicate]

I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks, and question marks, but I also want to limit the input to 4000 characters. I have come up with the following regular expression to achieve this: /^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i.
However, when I attempt to use this regular expression to test a subject in PHP with preg_match(), I am given a warning: PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 and the subject fails to be tested.
I find this strange because if I use an infinite quantifier, the test passes just fine (I demonstrate this situation below).
Why is limiting the repetition to 4000 a problem, but infinite repetition not?
regex-test.php:
<?php
$infinite = "/^([a-z]|[0-9]| |,|'|\.|!|\?)*$/i"; // Allows infinite repetition
$fourk = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i"; // Limits repetition to 4000
$string = "I like apples.";
if ( preg_match($infinite, $string) ){
echo "Passed infinite repetition. \n";
}
if ( preg_match($fourk, $string) ){
echo "Passed maximum repetition of 4000. \n";
}
?>
echos:
Passed infinite repetition
PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 in regex-test.php on line 16
The error is due to its LINK_SIZE, with offset values limiting the compiled pattern size to 64K. This is an expected behavior, explained below, and it's not because of a limit in repetition nor how the pattern is interpreted when compiled.
In this case
As Alan Moore pointed out in his answer, all characters should be in the same character class. I'm more drastic, so allow me to say that pattern is so wrong it makes me cringe.
-No offense, most of us tried that once too. It's just an attempt to underline that in no way such constructs should be used.
There are 3 common pitfalls here in (x|y|z){1,4000}:
Capturing subpatterns should only be used when needed (to store a specific part of the matched text, in order to extract that value or to use it in a backreference). For all other use cases, stick to non-capturing groups or atomic groups. They perform better and save memory.
Capturing subpatterns should not be repeated because the last repetition overwrites the captured text.
-OK, it could be used only in very particular cases.
Alternation (with the |s) adds backtracking states. It's a good practice to try to reduce them as much as you can. In this case, the regex /^[ !',.0-9?A-Z]{1,4000}$/i would match exactly the same, not only avoiding the error, but also providing better performance.
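As a quick sketch of the single-character-class form in use (the helper name isAllowedInput is hypothetical):

```php
<?php
// Hypothetical helper: true when $input consists of 1 to 4000 allowed characters.
function isAllowedInput($input)
{
    return preg_match("/^[ !',.0-9?A-Z]{1,4000}$/i", $input) === 1;
}

var_dump(isAllowedInput("I like apples."));      // allowed characters only
var_dump(isAllowedInput("1 < 2"));               // '<' is not in the class
var_dump(isAllowedInput(str_repeat('a', 4001))); // one character over the limit
```

Keep in mind that $ also accepts a single trailing newline; adding the D modifier is one way to rule that out if it matters for your input.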
LINK_SIZE
From "Handling Very Large Patterns" in pcrebuild man page:
Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an
alternation metacharacter). By default, in the 8-bit and 16-bit
libraries, two-byte values are used for these offsets, leading to a
maximum size for a compiled pattern of around 64K.
That means the compiled pattern stores an offset value for every subpattern in the alternation, for every repetition of the group. In this case the offsets leave no memory for the rest of the compiled pattern.
This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:
PCRE keeps offsets in its compiled code as 2-byte quantities (always
stored in big-endian order) by default. These are used, for example,
to link from the start of a subpattern to its alternatives and its
end. The use of 2 bytes per offset limits the size of the compiled
regex to around 64K, which is big enough for almost everybody.
Using pcretest, I get the following information:
PCRE version 8.37 2015-04-28
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
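A back-of-the-envelope calculation from those pcretest figures shows how far {1,4000} overshoots. This is only a rough sketch: it assumes the compiled size grows linearly with the repetition count, which the figures suggest but do not prove:

```php
<?php
// Figures taken from the pcretest run above.
$codeSpace = 65432;  // bytes used by the compiled pattern with {1,574}
$repetitions = 574;

$bytesPerRepetition = $codeSpace / $repetitions;
// Assumption: compiled size grows roughly linearly with the repetition count.
$estimateFor4000 = $bytesPerRepetition * 4000;

printf("about %.0f bytes per repetition\n", $bytesPerRepetition);
printf("about %.0f bytes for {1,4000}, against a limit of roughly %d\n",
    $estimateFor4000, 64 * 1024);
```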
There's a Windows version you can download from RexEgg.com.
Regarding other size limitations in PCRE, you can check this post of mine.
Overriding the default LINK_SIZE in PHP
If we had a true reason to use a huge pattern, and this pattern could not be simplified any further by all means, the link size could be increased. However, you can only achieve this by recompiling PHP yourself (therefore, your code won't be portable from now on). It should be the last resort, provided there's no other choice.
Also commented in pcre_internal.h:
The macros are controlled by the value of LINK_SIZE.
This defaults to 2 in the config.h file,
but can be overridden by using -D on the command line.
This is automated on Unix systems via the "configure" command.
PCRE link size can be configured to 3 or 4:
./configure -DLINK_SIZE=4
But keep in mind that longer offsets require additional data, and it will slow down all calls to preg_* functions.
In case of compiling PHP on your own, see Installation on Unix systems or Build your own PHP on Windows.
Looking at the 'regex' engine php uses, pcre here: http://pcre.sourceforge.net/pcre.txt at the limitations section it states:
The maximum length of a compiled pattern is 65539 (sic) bytes
My guess is that some regex like this:
(123){1,3}
is compiled to something like this
(123)(123)?(123)?
Which makes it bigger than the maximum length
While I agree that the regex compiler shouldn't behave that way, you really shouldn't have encountered this problem. Inside the parentheses, your regex matches exactly one character from a specific set--the definition of a character class. The correct way to write your regex is to list all the characters inside one set of square brackets and forego the parentheses:
/^[a-z0-9 ,'.!?]{1,4000}$/i
That works fine, as this demo shows. However, it was the parentheses that were causing the error (even non-capturing parens cause it), and that doesn't seem right to me, even if they were unnecessary.
For me the problem was an unescaped ? character.
You need to escape it with not one, but two backslashes (\\).
My regexp went from (?340202) to (\\?340202).

PCRE_UTF8 Modifier Extremely Slow

For some reason, simply adding the PCRE_UTF8 modifier to the regex input for preg_match() roughly decuples (x10) the execution time even if no multibyte characters are used. I can't figure out why this is the case and how to best reduce the time. The script used to test is:
$s = microtime(true);
for ($i = 0; $i < 1000; $i++) {
preg_match('/ /u', str_repeat(' ', 50000), $match);
}
$e = microtime(true);
echo "u Modifier:\t".(($e-$s)/$i)."\n";
$s = microtime(true);
for ($i = 0; $i < 1000; $i++) {
preg_match('/ /', str_repeat(' ', 50000), $match);
}
$e = microtime(true);
echo "No Modifier:\t".(($e-$s)/$i)."\n";
Try it online here.
The results were:
u Modifier: 2.5037050247192E-5
No Modifier: 2.4969577789307E-6
I tried to see if this was a known bug online, but supposedly, it is not a problem with PHP.
What causes this, and what would be the best way to execute the match [1] more quickly?
[1] "The match" refers to any match. The example used is simply a minimal example and could obviously be matched in much better ways.
PCRE checks for UTF validity before any other processing takes place.
From the PCRE docs:
When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, an negative error code is returned. The code unit offset to the offending character can be extracted from the match data block by calling pcre2_get_startchar(), which is used for this purpose after a UTF error.
...
The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.
...
In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.
(Note: These docs are quoted from PCRE2, but the PCRE behavior is the same)
Unfortunately, I don't think there's a way to set the PCRE2_NO_UTF_CHECK option from PHP.
Anyway, your benchmark should run over many more iterations to be meaningful. You should measure over several seconds' worth of computing time to get a better sense of the impact this feature has.
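Something along these lines would give steadier numbers. It is only a sketch: the helper name averageMatchTime and the iteration count are my own choices, and the subject is built once so that str_repeat() is not part of what gets timed:

```php
<?php
// Hypothetical helper: average seconds per preg_match call over $iterations runs.
function averageMatchTime($pattern, $subject, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject, $match);
    }
    return (microtime(true) - $start) / $iterations;
}

// Build the subject once, outside the timed loop.
$subject = str_repeat(' ', 50000);
$iterations = 20000; // assumption: raise this until each run takes a few seconds

printf("u modifier:  %.3e s\n", averageMatchTime('/ /u', $subject, $iterations));
printf("no modifier: %.3e s\n", averageMatchTime('/ /', $subject, $iterations));
```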

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works all right, and performance is acceptable. What bothers me about it, though, is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if the code is later modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the answers below as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read; they helped me tremendously.
Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. Because of the | between the bracketed groups, it matches any line that starts with A, B, or a pipe, or contains C, D, E, F, or a pipe anywhere, or ends with G, H, or a pipe; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/ - but that's still the same solution, just in a more efficiently-processed form.)
You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
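function findMatchingLines($line, $required = 2) {
    static $file = null;
    // Load the file once; FILE_IGNORE_NEW_LINES drops the trailing newline of each line.
    if ($file === null) $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES);
    $search = str_split($line);
    $found = array();
    foreach ($file as $l) {
        // Each letter is only valid in one position, so the intersection counts shared positions.
        $matches = count(array_intersect($search, str_split($l)));
        if ($matches >= $required) // number of matches required, 2 by default
            $found[] = $l;
    }
    return $found; // empty array when nothing matches
}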
There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
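For longer lines or a different threshold, the alternatives can be generated rather than written by hand. Here is a rough sketch (the function name buildPositionalRegex is made up): it enumerates every way of fixing exactly $required positions and leaves the rest as wildcards, which together cover "at least $required" matches. The number of alternatives grows combinatorially, so this is only practical for short lines:

```php
<?php
// Hypothetical helper: one alternative per way of choosing $required positions of $line.
function buildPositionalRegex($line, $required)
{
    $length = strlen($line);
    $alternatives = array();
    for ($mask = 0; $mask < (1 << $length); $mask++) {
        $fixed = 0;
        $alternative = '';
        for ($i = 0; $i < $length; $i++) {
            if ($mask & (1 << $i)) {
                $alternative .= preg_quote($line[$i], '/'); // this position must match $line
                $fixed++;
            } else {
                $alternative .= '.';                        // any character here
            }
        }
        if ($fixed === $required) {
            $alternatives[] = $alternative;
        }
    }
    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

$regex = buildPositionalRegex('ACFG', 2);
echo $regex, "\n";
preg_match_all($regex, "BCEG\nADFG\nBDEH\nBCFG", $m);
print_r($m[0]);
```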
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
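function findMatchingLines($line, $subject) {
    // The reference line is appended as the last line of the input; a lookahead
    // to the end of the input captures its 4 characters into groups 1 to 4.
    $regex = '/^(?=[\s\S]*\n([AB])([CD])([EF])([GH])\z)'
        . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    $matchingLines = array();
    preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matchingLines);
    return $matchingLines[0]; // the full matching lines
}
What this function does is append the line you want to match against as the last line of your input, then uses a pattern whose lookahead captures that final line's 4 characters, so that every earlier line can be compared back to them through backreferences. (PCRE does not allow an unbounded-length lookbehind, so the reference line has to come after the candidate lines rather than before them; the price is that each candidate line triggers a scan to the end of the input. Note also that PHP concatenates strings with ".", not "+", and that the pattern needs single quotes so that \1 is not read as an octal escape.)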
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).
People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: either A or | or B at the start of a line, OR C or | or D anywhere, OR E or | or F anywhere, OR G or | or H at the end of a line.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer |s alternate between everything around them, so ^ only applies to the first alternative and $ only to the last. In my English expression above, each capitalized OR stands for one of your alternating |s.
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/
as compared to:
/^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for FG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases where A is not in the first position, you would immediately go on to test the 2nd position, without testing the first position two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);
I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
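/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m
It looks for at least two occurrences of one of the ACFG characters surrounded by any other characters on the same line (the \n in the negated classes keeps a match from running across lines). The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
    $set = preg_quote($line, '/'); // escape anything special before embedding it
    // Concatenation avoids "{$num}", which PHP would interpolate without the braces.
    return '/^[^' . $set . '\n]*+(?:[' . $set . '][^' . $set . '\n]*+){' . (int) $num . ',}$/m';
}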
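A usage sketch for a pattern of this kind, run against a handful of sample lines. The function below is a hypothetical variant written for this example (the name sharedCharsRegex is made up); it asks for at least $num shared characters and keeps newlines out of the negated class so that a match stays on one line:

```php
<?php
// Hypothetical variant of the generator: at least $num shared characters per line.
function sharedCharsRegex($line, $num)
{
    $set = preg_quote($line, '/');
    $other = '[^' . $set . '\n]*+'; // possessive run of non-shared characters on the same line
    return '/^' . $other . '(?:[' . $set . ']' . $other . '){' . (int) $num . ',}$/m';
}

$subject = "BCEG\nADFG\nBDEH\nBCFG\nBDFG";
preg_match_all(sharedCharsRegex('ACFG', 2), $subject, $m);
print_r($m[0]); // every line sharing two or more characters with ACFG
```

This relies on the question's guarantee that each letter is only valid in one position; without it, a shared character in the wrong position would also count.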

Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)

It is a well-known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example you can parse the classical example of a context-free grammar {a^n b^n; n>0} (e.g. aaabbb) using this regex (demo):
~^(a(?1)?b)$~
My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {a^n b^n c^n; n>0} (e.g. aaabbbccc) using PCRE?
Inspired by NullUserException's answer (which he has since deleted because it failed for one case), I think I have found a solution myself:
$regex = '~^
(?=(a(?-1)?b)c)
a+(b(?-1)?c)
$~x';
var_dump(preg_match($regex, 'aabbcc')); // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc')); // 0
var_dump(preg_match($regex, 'aaaccc')); // 0
var_dump(preg_match($regex, 'aabcc')); // 0
var_dump(preg_match($regex, 'abbcc')); // 0
Try it yourself: http://codepad.viper-7.com/1erq9v
Explanation
If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:
~^a+(b(?-1)?c)$~
This does nothing more than check that there's an arbitrary number of as, followed by an equal number of bs and cs.
This doesn't yet satisfy our grammar, because the number of as must be the same, too. We can ensure that by checking that the number of as equals the number of bs. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so we don't only match a part of the bs.
Conclusion
I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular"
Here is an alternative solution using balancing groups with .NET regex:
^(?'a'a)+(?'b-a'b)+(?(a)(?!))(?'c-b'c)+(?(b)(?!))$
Not PCRE, but may be of interest.
Example at: http://ideone.com/szhuE
Edit: Added the missing balancing check for the group a, and an online example.
My question is: How far can you go?
In the interests of not creating code that is an unreadable morass of punctuation, I'm going to risk the downvotes and answer a different, though very much related, question: how far should you go?
Regular expression parsers are a brilliant thing to have in your toolkit but they are not the be all and end all of programming. The ability to write parsers in a readable manner is also a brilliant thing to have in your toolkit.
Regular expressions should be used right up to the point where they start making your code hard to understand. Beyond that, their value is dubious at best, damaging at worst. For this specific case, rather than using something like the hideous:
~^(?=(a(?-1)?b)c)a+(b(?-1)?c)$~x
(with apologies to NikiC), which the vast majority of people trying to maintain it are either going to have to replace totally or spend substantial time reading up on and understanding, you may want to consider something like a non-RE, "proper-parser" solution (pseudo-code):
# Match "aa...abb...bcc...c" where:
#   - same character count for each letter; and
#   - character count is one or more.
def matchABC (string str):
    # Init string index and character counts.
    index = 0
    dim count['a'..'c'] = 0
    # Process each character in turn.
    for ch in 'a'..'c':
        # Count each character in the subsequence.
        while index < len(str) and str[index] == ch:
            count[ch]++
            index++
    # Failure conditions.
    if index != len(str): return false          # did not finish string.
    if count['a'] < 1: return false             # too few a characters.
    if count['a'] != count['b']: return false   # inequality a and b count.
    if count['a'] != count['c']: return false   # inequality a and c count.
    # Otherwise, it was okay.
    return true
This will be far easier to maintain in the future. I always like to suggest to people that they should assume those coming after them (who have to maintain the code they write) are psychopaths who know where you live - in my case, that may be half right, I have no idea where you live :-)
Unless you have a real need for regular expressions of this kind (and sometimes there are good reasons, such as performance in interpreted languages), you should optimise for readability first.
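For anyone who wants to run it, the pseudo-code above translates fairly directly to PHP. This is just one possible rendering, not the only way to write it:

```php
<?php
// Returns true for "a...ab...bc...c" with the same, non-zero count of each letter.
function matchABC($str)
{
    $index = 0;
    $length = strlen($str);
    $count = array('a' => 0, 'b' => 0, 'c' => 0);

    // Consume the run of each letter in order, counting as we go.
    foreach (array('a', 'b', 'c') as $ch) {
        while ($index < $length && $str[$index] === $ch) {
            $count[$ch]++;
            $index++;
        }
    }

    return $index === $length          // nothing left over
        && $count['a'] >= 1            // at least one of each letter
        && $count['a'] === $count['b']
        && $count['a'] === $count['c'];
}

var_dump(matchABC('aaabbbccc'));
var_dump(matchABC('aaabbbcc'));
```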
Qtax Trick
A solution that wasn't mentioned:
^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$
See what matches and fails in the regex demo.
This uses self-referencing groups (an idea @Qtax used on his vertical regex).
