PHP preg_match length 3276 limit - php

It appears that PHP's preg_match has a 3276 character limit for matching repeating characters in some cases.
i.e.
^(.|\s){0,3276}$ works, but ^(.|\s){0,3277}$ does not.
It doesn't seem to always apply, as /^(.){0,3277}$/ works.
I can't find this mentioned anywhere in PHP's documentation or the bug tracker. The number 3276 seems a bit of an odd boundary; the only thing I can think of is that it's approximately 1/10th of 32767, which is the limit for a signed 16-bit integer.
preg_last_error() returns 0.
I've reproduced the issue on http://www.phpliveregex.com/ as well as my local system and the webserver.
EDIT: Looks like we're getting "Warning: preg_match(): Compilation failed: regular expression is too large at offset 16" out of the code, so it appears to be the same issue as PHP preg_match_all limit.
However, the regex itself isn't very large... Does PHP do some kind of expansion when you have repeating groups that's making it too large?

In order to handle Perl-compatible regular expressions, PHP just bundles a third-party library that takes care of the job. The behaviour you describe is actually documented:
The "*" quantifier is equivalent to {0,} , the "+" quantifier to {1,}
, and the "?" quantifier to {0,1} . n and m are limited to
non-negative integral values less than a preset limit defined when
perl is built. This is usually 32766 on the most common platforms.
So there's always a hard limit. Why do your tests suggest that PHP limit is 10 times smaller than the typical one? No idea about that :)

Try using ^(.|\s){0,3276}(.|\s){0,1}$
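One way to check reports like this on a given PHP build is a small probe that compiles each pattern and captures the warning. This is only a sketch: the helper name tryCompile is made up for illustration, the results depend on the PCRE version bundled with PHP, and the last pattern (the s modifier, which lets . match newlines so the (.|\s) group is not needed) is a suggested alternative rather than something taken from the answers above.

```php
<?php
// Hypothetical helper: returns null if the pattern compiles, otherwise the
// warning text emitted by preg_match().
function tryCompile(string $pattern): ?string
{
    error_clear_last();
    // @ hides the warning from the output; error_get_last() still records it.
    $result = @preg_match($pattern, '');
    if ($result === false) {
        $error = error_get_last();
        return $error['message'] ?? 'unknown error';
    }
    return null;
}

$patterns = [
    '/^(.|\s){0,3276}$/',
    '/^(.|\s){0,3277}$/',
    '/^(.|\s){0,3276}(.|\s){0,1}$/', // workaround suggested above
    '/^.{0,3277}$/s',                // assumption: s modifier instead of (.|\s)
];

foreach ($patterns as $pattern) {
    $message = tryCompile($pattern);
    echo $pattern, ' => ', $message === null ? 'compiles' : $message, "\n";
}
```

No expected output is listed on purpose, since whether each pattern compiles depends on how the bundled PCRE library was built.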


Extract all words between two phrases using regex [duplicate]

I'm trying to extract all the words between two phrases using the following regex:
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b(.*)\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
The documents I'm running this regex on are 10-K filings. The filings are too long to post here (see regex101 url below for example), but basically they are something like this:
ITEM 1. BUSINESS
lots of words
ITEM 2. PROPERTIES
lots of words
ITEM 3. LEGAL PROCEEDINGS
I want to extract all the words between ITEM 1 and ITEM 3. Note that the subtitles for each ITEM may be slightly different for each 10-K filing, hence I'm allowing for a few words between each word.
I keep getting a catastrophic backtracking error, and I cannot figure out why. For example, please see https://regex101.com/r/zgTiyb/1.
What am I doing wrong?
Catastrophic backtracking has essentially one main cause:
A possible match is found but can't finish.
You made too many positions available for the regex to try. This hits the backtracking limit in PCRE. A quick workaround would be to replace the only dot-star in the regex with a restrictive quantifier, i.e.
.{0,200}
See live demo here
But the better approach is re-constructing the regular expression:
\bitem\b.*?\b(?:1|one)\b(*COMMIT)\W+(?:\w+\W+){0,2}?business\b\h*\R+(?:(?!item\h+(?:3|three)\b)[\s\S])*+item\h+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b
See live demo here
Your own regex needs ~45K steps on given input string to find those two matches. In contrast, this modified regex needs ~8K steps to accomplish the task. That's a huge improvement.
The latter doesn't need s flag (and it shouldn't be enabled). I used (*COMMIT) backtracking verb to cause an early failure if a possible match is found but is likely to not finish.
@Sebastian Proske's solution matches three sub-strings but I don't think the third match is an expected match. This huge third match is the only reason for your regex to break.
Please read this answer to have a better insight into this problem.
This isn't really catastrophic backtracking, just a whole lot of text and a comparatively low backtracking limit in regex101. In this scenario the use of .* isn't optimal, as it will match the whole remainder of the text file once it is reached and then backtrack character by character to match the parts after it, which means a lot of characters to process.
Seems you can stick to \w+\W+ at that place as well and use lazy matching instead of greedy to get your result, like
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b\W+(?:\w+\W+)*?\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
Note that the pcre engine optimizes (?:\w+\W+) to (?>\w++\W++) thus working by word-no-word-chunks instead of single characters.
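As a rough illustration, the lazy word-chunk pattern above can be tried in PHP on a short stand-in for a filing. The function name extractBetweenItems, the capture group around the middle part, the i modifier and the sample text are all assumptions made for this sketch.

```php
<?php
// Short stand-in for a 10-K filing, modelled on the outline in the question.
$filing = "ITEM 1. BUSINESS\nlots of words\nITEM 2. PROPERTIES\n"
        . "lots of words\nITEM 3. LEGAL PROCEEDINGS\n";

// Hypothetical helper: returns the text between the two headings, or null.
function extractBetweenItems(string $text): ?string
{
    $start = '\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b';
    $end   = '\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?'
           . 'legal\W+(?:\w+\W+){0,3}?proceedings)\b';

    // The middle part is wrapped in a capture group (an addition for this sketch).
    $pattern = '/' . $start . '(\W+(?:\w+\W+)*?)' . $end . '/i';

    if (preg_match($pattern, $text, $m) === 1) {
        return trim($m[1]);
    }
    return null; // no match, or the engine gave up
}

echo extractBetweenItems($filing), "\n";
```

On a real filing it is worth checking preg_last_error() as well, because a null here does not distinguish "no match" from a limit error.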

How to trigger Regex Denial-of-Service in PHP?

How can I trigger a Regex-DOS using the preg_match() function using an evil regular expression (e.g. (a+)+ )?
For example, I have the following situation:
preg_match('/(a+)+/',$input);
If I have control over $input, how could I trigger a DOS attack or reach the backtrack limit of the preg_* functions in php?
How could I do this with the following expressions?
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} | for x > 10
There is no way to trigger ReDOS on (a+)+, ([a-zA-Z]+)*, (a|aa)+, (a|a?)+ , since there is nothing that can cause match failure and trigger backtracking after the problematic part of the regex.
If you modify the regex a bit, for example, adding b$ after each of the regex above, then you can trigger catastrophic backtracking with an input like aaa...aabaa...aa.
Depending on the engine's implementation and optimization, there are cases where we expect catastrophic backtracking, but the engine doesn't exhibit any sign of such behavior.
For example, given (a+)+b and the input aaa...aac, PCRE fails the match outright, since it has an optimization that checks for required character in the input string before starting the match proper.
Knowing what the engine does, we can throw off its early detection with the input aaa...aacb and get the engine to exhibit catastrophic backtracking.
As for (.*a){x}, it is possible to trigger ReDOS, since it has a failing condition of fewer than x iterations. Given the input string aaa...a (with x or more of the character a), the regex keeps trying all permutations of a's at the end of the string as it backtracks away from the end of the string. Therefore, the complexity of the regex is O(2^x). Knowing that, we can tell that the effect is more visible when x is a larger number, let's say 20. By the way, this is one rare case where a matching string has the worst case complexity.
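The b$ variant described above can be tried with a short PHP script. This is a sketch under assumptions: the helper name probe is invented, the JIT is switched off so that pcre.backtrack_limit is the limit in play, and the limit is lowered so that a failure would show up quickly.

```php
<?php
// Assumed settings for the experiment; both can differ on a real server.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '100000');

// Hypothetical helper: runs the match and reports what happened.
function probe(string $pattern, string $input): array
{
    $result = @preg_match($pattern, $input);
    return [
        'result' => $result,           // 1, 0, or false on error
        'error'  => preg_last_error(), // compare with the PREG_*_ERROR constants
    ];
}

// (a+)+ with "b$" appended, fed an input of the form aaa...aabaa...aa.
$input = str_repeat('a', 30) . 'b' . str_repeat('a', 5);
$outcome = probe('/(a+)+b$/', $input);

var_dump($outcome['result']);
var_dump($outcome['error'] === PREG_BACKTRACK_LIMIT_ERROR);
```

If the engine behaves as described above, the first value should be false rather than 0, which is the difference between "gave up" and "did not match".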

"Regular Expression is too large" error in PHP

I am working on a relatively complex, and very large regular expression. It is currently 41,127 characters, and may grow somewhat as additional cases may be added. I am starting to get this error in PHP:
preg_match_all(): Compilation failed: regular expression is too large at offset 41123
Is there a way to increase the size limit? The following settings suggested elsewhere did NOT work because these apply to size of data and NOT the regex size:
ini_set("pcre.backtrack_limit", "100000000");
ini_set("pcre.recursion_limit", "100000000");
Alternatively, is there a way to define a "sub-pattern variable" within the regex that could be repeated at various places within the regex? (I am not talking about repetition using * or +, or even repeating a match using "\1".) I am actually using PHP variables containing sub-patterns that are repeated in a few places within the regex, but this leads to expansion of the regex BEFORE it is passed on to the PCRE functions.
This is a complex regular expression, and cannot be replaced by simpler keyword-searching using strpos or similar as suggested at this link.
I would prefer to avoid splitting this into sub-expressions at | and trying to match the sub-expressions separately, because the reduction in size would be modest (there are only 2 or 3 of top-level |), and this would complicate further development.
Depending on the application, valid solutions are:
Shorten the Regular Expression by using DEFINE for any redundant sub-expressions (see below).
Increase the max limit on regex size by re-compiling PHP (see drew010's great answer). Although this may not be available in all environments or may create compatibility issues if changing servers.
Split your regular expression at | and process the resulting sub-expressions separately. If the regex is essentially numerous keywords separated by |, then converting to a strtok or a loop with strpos may be a better & faster choice.
Use other language / regex engine such as C++/Boost, although I did not verify this.
Solution to my specific problem: As per Mario's comment, using the (?(DEFINE)...) construct for some of the sub-expressions that were re-used several times reduced my regex size from 41,127 characters down to "only" 4,071, and this was an elegant solution to get rid of the error “Regular Expression is too large.”
See: (?(DEFINE)...) syntax reference at rexegg.com
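For readers who have not used the construct, here is a minimal sketch of (?(DEFINE)...) in PHP. The IPv4-style pattern and the group name octet are made up for illustration and have nothing to do with the 41,127-character regex from the question.

```php
<?php
// The "octet" sub-pattern is written once inside (?(DEFINE)...) and then
// reused through (?&octet) instead of being pasted in four times.
$pattern = '/(?(DEFINE)(?<octet>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d))'
         . '^(?&octet)(?:\.(?&octet)){3}$/';

var_dump(preg_match($pattern, '192.168.0.1')); // int(1)
var_dump(preg_match($pattern, '256.1.1.1'));   // int(0)
```

The saving grows with the length of the shared sub-pattern and the number of places it is used.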
I don't disagree with the comments that there might be a better way to do this, but I will answer the question here.
You can increase the maximum size of the regex but only by recompiling PHP yourself. Because of this, your code is not at all portable and if you are using pre-compiled binaries, you are out of luck.
That said, I would suggest finding an alternative for matching.
See pcre_internal.h for comments.
PCRE keeps offsets in its compiled code as 2-byte quantities
(always stored in big-endian order) by default. These are used, for
example, to link from the start of a subpattern to its alternatives
and its end. The use of 2 bytes per offset limits the size of the
compiled regex to around 64K, which is big enough for almost
everybody. However, I received a request for an even bigger limit.
For this reason, and also to make the code easier to maintain, the
storing and loading of offsets from the byte string is now handled
by the macros that are defined here.
The macros are
controlled by the value of LINK_SIZE. This defaults to 2 in the
config.h file, but can be overridden by using -D on the command line.
This is automated on Unix systems via the "configure" command.
So you can either edit ext/pcre/pcrelib/config.h from the PHP source distribution to increase the size limit, or specify it when compiling ./configure -DLINK_SIZE=4
EDIT: If you are trying to match/parse HTML, I would recommend using DOMDocument to parse the HTML and then walk the DOM tree or build an XPATH to find what you're looking for.
Have you tried using array_chunk to split your array and then running preg_match_all inside a foreach()? I was using the exact same code with an array of 40k+ elements. The solutions above didn't solve my "Compilation failed: regular expression is too large at offset" problem, so I split my 40k+ array into 4 arrays of 1k elements, looped over them with foreach() around my preg_match_all call, and voila! It worked.
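The chunking idea might look roughly like this when the large regex is really a long list of keywords. The function name matchKeywordsInChunks, the word-boundary wrapping and the i modifier are assumptions for the sketch, not the answerer's actual code.

```php
<?php
// Hypothetical helper: matches a long keyword list in several smaller regexes.
function matchKeywordsInChunks(array $keywords, string $text, int $chunkSize = 1000): array
{
    $found = [];
    foreach (array_chunk($keywords, $chunkSize) as $chunk) {
        // Escape each keyword so it is treated as literal text.
        $quoted = array_map(function ($keyword) {
            return preg_quote($keyword, '/');
        }, $chunk);

        // One moderately sized alternation per chunk instead of one huge regex.
        $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
        if (preg_match_all($pattern, $text, $matches)) {
            $found = array_merge($found, $matches[0]);
        }
    }
    return $found;
}

print_r(matchKeywordsInChunks(
    ['apple', 'banana', 'cherry'],
    'I like banana and Cherry pie',
    2
));
```

Note that the hits come back grouped by chunk rather than in the order they appear in the text.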

Different results for unicode/multibyte modifier and mb_ereg_replace

This regex seems to be very problematic:
(((?!a).)*)*a\{
I know the regex is terrible. That is not the question here.
The problem appears when it is tested with this string:
AAAAAAAAAAAAAA{AA
The letters A and a could be replaced with pretty much anything and result in the same problem.
This regex and test string pair is condensed. The full example can be found here.
This is the code that I used to test:
<?php
$regex = '(((?!a).)*)*a\\{';
$test_string = 'AAAAAAAAAAAAAA{AA';
echo "1:".mb_ereg_replace('/'.$regex.'/','WORKED',$test_string)."\n";
echo "2:".preg_replace('/'.$regex.'/u','WORKED',$test_string)."\n";
echo "3:".preg_replace('/'.$regex.'/','WORKED',$test_string)."\n";
The results can be viewed here:
http://3v4l.org/Yh6FU
The ideal result would be that the same test string is returned because the regex does not match.
When using preg_replace with the u modifier, it should have the same results as mb_ereg_replace according to this comment:
php multi byte strings regex
mb_ereg_replace works exactly as it should. It returns the test string because the regex does not match.
However, preg_replace for PHP versions other than 4.3.4 - 4.4.5, 4.4.9 - 5.1.6 does not seem to work.
For some PHP versions, the result is an error:
Process exited with code 139.
For some other PHP versions, the result is NULL
For the rest, mb_ereg_replace had not yet been made
Also, removing just a single letter from either the string or the regex seems to completely alter which versions of PHP have which results.
Judging from this comment:
php multi byte strings regex
ereg* should be avoided, which makes sense since it is slower and supports less than preg* does. This makes using mb_ereg_replace undesirable. However, there is not a mb_preg_replace option, so this seems to be the only option that works.
So, my question is:
Is there any alternative to mb_ereg_replace that would work correctly for the given string and regex pair?
Do you know the difference between (...) and (?:...)?
(...) ... this defines a marking group. The string found by the expression within the round brackets is internally stored in a variable for back referencing.
(?:...) ... this defines a non marking group. The string found by the expression within the parentheses is not internally stored. Such a non marking group is often used to apply an expression several times on a string.
Now let us take a look on your expression (((?!a).)*)*a\{ which on usage in a Perl regular expression find in text editor UltraEdit results in the error message The complexity of matching expression has exceeded available resources.
(?!a). ... a character should be found where the next character is not the letter 'a'. Okay. But you want to find a string with 0 or more characters up to the letter 'a'. Your solution is: ((?!a).)*
That is not a good solution, as the engine now has to look ahead for the letter 'a' on each character and, if the next character is not an 'a', match the character, store it as a string for back referencing and then continue with the next character. Actually I don't even know what happens internally when a multiplier is used on a marking group as done here. A multiplier should never be used on a marking group. So better would be (?:(?!a).)*.
Next you extend the expression to (((?!a).)*)*. One more marking group with a multiplier?
It looks like you want mark the entire string not containing letter 'a'. But in this case it would be better to use: ((?:(?!a).)*) as this defines 1 and only 1 marking group for the string found by the inner expression.
So the better expression would be ((?:(?!a).)*)a\{ as there is now only 1 marking group without a multiplier on the marking group. Now the engine knows exactly which string to store in a variable.
Much faster would be ([^a]*?)a\{ as this non greedy negative character class definition matches also a string of 0 or more characters left of a{ not containing letter 'a'. Look ahead should be avoided if not necessary as this avoids backtracking.
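A quick way to try the simplified expression in PHP is shown below. The helper name replaceOrKeep and the second test string are assumptions added for this sketch; the helper also falls back to the original string when preg_replace() reports an error by returning null.

```php
<?php
// Hypothetical helper: like preg_replace(), but keeps the subject on error.
function replaceOrKeep(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    // preg_replace() returns null when the engine gives up with an error.
    return $result === null ? $subject : $result;
}

$pattern = '/([^a]*?)a\{/';

// No lowercase 'a' in the subject, so nothing matches and it comes back unchanged.
echo replaceOrKeep($pattern, 'WORKED', 'AAAAAAAAAAAAAA{AA'), "\n"; // AAAAAAAAAAAAAA{AA

// With an actual "a{" in the subject the replacement happens.
echo replaceOrKeep($pattern, 'WORKED', 'xxa{yy'), "\n"; // WORKEDyy
```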
I don't know the source code of the PHP functions mb_ereg_replace and preg_replace which would be needed to be examined with the expression step by step to find out what exactly is the reason for the different results.
However, the expression (((?!a).)*)*a\{ results definitely in a heavy recursion as it is not defined when to stop matching data and what to store temporarily. So both functions (most likely) allocate more and more memory from stack and perhaps also from heap until either a stack overflow or a "not enough free memory" exception occurs.
Exit code 139 is a segmentation fault (memory boundary violation) caused by a not caught stack overflow, or NULL was returned on allocating more memory from heap with malloc() and the return value NULL was ignored. I suppose, returning NULL by malloc() is the reason for exit code 139.
So the difference most likely lies in the error or exception handling of the two functions. Catching a memory exception, or counting the recursive iterations and exiting on too many of them to prevent a memory exception before it really occurs, could be the reason for the different behavior on this expression.
It is hard to give a definite answer what makes the difference without knowing source code of the functions mb_ereg_replace and preg_replace, but in my point of view it does not really matter.
The expression (((?!a).)*)*a\{ always results in heavy recursion, as Sam has already reported in his first comment. More than 119000 steps (= function calls) during a replace on a string with just 17 characters is a strong sign that something is wrong with the expression. The expression can be used to make the function or the entire application (PHP interpreter) run into abnormal error handling, but not for a real replace. So this expression is good for the developers of the PHP functions for testing error handling on an endless recursion, but not for a real replace operation.
The full regular expression as used in referenced PHP sandbox:
(?<!<br>)(?<!\s)\s*(\((?:(?:(?!<br>|\(|\)).)*(?:\((?:(?!<br>|\(|\)).)*\))?)*?\))\s*(\{)
It is hard to analyze this search string in this form.
So let us look on the search string like it would be a code snippet with indentations for better understanding the conditions and loops in this expression.
(?<!<br>)(?<!\s)\s*
(
    \(
    (?:
        (?:
            (?!<br>|\(|\)).
        )*
        (?:
            \(
            (?:
                (?!<br>|\(|\)).
            )*
            \)
        )?
    )*?
    \)
)
\s*
(\{)
I hope, it is now easier to see the recursion in this search string. There is twice the same block, but not in sequence order, but in nested order, a classic recursion.
In addition, every expression before the final (\{), including the nested ones that form the recursion, can match any character and carries the multiplier * or ?, which means it may be present but need not be. The presence of { is the only real condition for the entire search string. Everything else is optional, and this is not good because of the recursion in this search string.
It is very bad for a recursive search expression if it is completely unclear where to start and where to stop selecting characters as it results in an endless recursion until abnormal exit.
Let me explain this problem with a simple expression like [A-Za-z]+([a-z]+)
1 or more letters in upper or lower case followed by 1 or more characters in lower case (and case-sensitive search is enabled). Simple, isn't it.
But the second character class defines a set of characters which is a subset of the set of characters defined by the first class definition. And this is not good.
What should be tagged by the expression in parentheses on a string like York?
ork or rk or just k or even nothing because no matching string found as the first character class can match already the entire word and therefore nothing left for second character class?
The Perl regular expression library solves this common problem by making the multipliers * and + greedy by default, unless ? is used after a multiplier, which results in the opposite matching behavior. These 2 additional rules already help with this problem.
Therefore the expression as used here marks only k and with [A-Za-z]+?([a-z]+) the string ork is marked and with [A-Za-z]+?([a-z]+?) just first o is marked.
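The three variants can be checked directly in PHP. The helper name firstGroup is made up for this sketch; it simply returns what the marking group captured.

```php
<?php
// Hypothetical helper: returns the text captured by the first marking group.
function firstGroup(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

echo firstGroup('/[A-Za-z]+([a-z]+)/', 'York'), "\n";   // k   (both greedy)
echo firstGroup('/[A-Za-z]+?([a-z]+)/', 'York'), "\n";  // ork (first lazy)
echo firstGroup('/[A-Za-z]+?([a-z]+?)/', 'York'), "\n"; // o   (both lazy)
```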
And there is one more rule: favor a positive result over a negative result. This additional rule prevents the first character class from keeping the entire word York for itself.
So main problem with partly or completely overlapping sets of characters solved.
But what happens if such an expression is put in a recursion and making it even more complex by using lookahead / lookbehind and backtracking, and backtracking is done not only by 1 character, but even by multiple characters?
Is it still clearly defined where to start and stop selecting characters for every expression part of the entire search string?
No, it is not.
With a search string where there is no clear rule which part of a search string is selected by which part of the search expression, every result is more or less valid including the unexpected ones.
And additionally it can happen easily because of the missing start/stop conditions that the functions fail completely to apply the expression on a string and exit abnormal.
An abnormal exit on applying a search string is surely always an unexpected result for the human who used the search expression.
Different versions of a search function may return different results for an expression that makes the function exit abnormally. The developers of the search functions continuously change the program code to better detect and handle search expressions resulting in an endless recursion, as this is simply a security problem. A regular expression search allocating more and more memory from the application's stack or the entire RAM is very problematic for the security, stability and availability of the entire machine on which this application is running. And PHP is used mainly on servers, which should not stop working because a recursive memory allocation occupies more and more RAM, as this would finally kill the entire server.
This is the reason why you get different results depending on the used PHP version.
I looked very long on your complete search expression and let it run several times on the example string. But honestly I could not find out what should be found and what should be ignored by the expression left of (\{).
I understand parts of the expression, but why is there a recursion in the search string at all?
What is the purpose of the negative lookbehind (?<!\s) on \s*?
\s* matches 0 or more white-spaces and therefore the purpose for the expression "previous character not being a whitespace" is not comprehensible for me. The negative lookbehind is simply useless in my point of view and just increases the complexity of the entire expression. And this is just the beginning.
I am quite sure that what you really want can be achieved with a much simpler expression that has no recursion causing abnormal function exits depending on the searched string, and with all or nearly all backtracking steps removed.

Why does this regex take so long to find email addresses in certain files?

I have a regular expression that looks for email addresses ( this was taken from another SO post that I can't find and has been tested on all kinds of email configurations ... changing this is not exactly my question ... but understand if that is the root cause ):
/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
I'm using preg_match_all() in PHP.
This works great for 99.99...% of files I'm looking in and takes around 5ms, but occasionally takes a couple minutes. These files are larger than the average webpage at around 300k, but much larger files generally process fine. The only thing I can find in the file contents that stands out is strings of thousands of consecutive "random" alphanumeric characters like this:
wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...
Here are two pages causing the problem. View source to see the long strings.
http://www.ashrae.org/members/page/607
http://www.ashrae.org/publications/page/2010ajindex
Any thoughts on what is causing this?
--FINAL SOLUTION--
I tested various regexes suggested in the answers. @FailedDev's answer helped and dropped processing time from a few minutes to a few seconds. @hakre's answer solved the problem and reduced processing time to a few hundred milliseconds. Below is the final regex I used. It's @hakre's second suggestion.
/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?
For example, you're using + to match one or more chars. Let's say you have a string of 10 000 chars. The regex must look at 10 000 combinations to find the largest match. Then you combine it with similar ones. Let's say you have a string with 20 000 chars and two + groups. How many ways could they match in the file? Probably 10 000 x 10 000 possibilities. And so on and so forth.
If you can limit the number of characters (this looks a bit like you're looking for email patterns), probably limit the email address domain name to 256 and the address itself to 256 characters. Then this would be 256 x 256 possibilities to test "only":
/[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i
That's probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:
/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
Which should speed it up again.
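A small PHP sketch of the bounded, possessive variant follows. It assumes the separator between the local part and the domain is @; the function name findEmails, the sample addresses and the synthetic blob (built from the fragment quoted in the question) are all made up for illustration.

```php
<?php
// Long run of alphanumerics with no @ in it, similar to the problem pages.
$blob = str_repeat('wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H', 100);
$text = "Contact user@example.com $blob or sales@example.co.uk";

// Hypothetical helper: returns every address-like string found in $text.
function findEmails(string $text): array
{
    // {1,256}+ is a bounded, possessive quantifier: at most 256 characters
    // are taken per attempt and none are given back on failure.
    $pattern = '/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+'
             . '\.([a-z]{2,3})(?:\.[a-z]{2})?/i';

    return preg_match_all($pattern, $text, $m) ? $m[0] : [];
}

print_r(findEmails($text));
```

Timing this against the unbounded version on a real 300k page would show whether the bound or the possessive quantifier contributes more on a given PCRE build.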
My best guess would be to try using possessive quantifiers:
[a-z0-9_\-\+]+
to
[a-z0-9_\-\+]++
This should fail the regex faster so it may improve performance in these situations.
Edit:
Maybe atomic grouping could also help :
/(?>[a-z0-9_\-+]++)@(?>[a-z0-9\-]++\.)(?>[a-z]{2,3})(?:\.[a-z]{2})?/
You should first go with option one. It would be interesting to see if there is any difference by also using option two.
