Different results for unicode/multibyte modifier and mb_ereg_replace


This regex seems to be very problematic:
(((?!a).)*)*a\{
I know the regex is terrible. That is not the question here.
It is problematic when tested with this string:
AAAAAAAAAAAAAA{AA
The letters A and a could be replaced with pretty much anything and result in the same problem.
This regex and test string pair is condensed. The full example can be found here.
This is the code that I used to test:
<?php
$regex = '(((?!a).)*)*a\\{';
$test_string = 'AAAAAAAAAAAAAA{AA';
echo "1:".mb_ereg_replace('/'.$regex.'/','WORKED',$test_string)."\n";
echo "2:".preg_replace('/'.$regex.'/u','WORKED',$test_string)."\n";
echo "3:".preg_replace('/'.$regex.'/','WORKED',$test_string)."\n";
The results can be viewed here:
http://3v4l.org/Yh6FU
The ideal result would be that the same test string is returned because the regex does not match.
When using preg_replace with the u modifier, it should have the same results as mb_ereg_replace according to this comment:
php multi byte strings regex
mb_ereg_replace works exactly as it should. It returns the test string because the regex does not match.
However, preg_replace for PHP versions other than 4.3.4 - 4.4.5, 4.4.9 - 5.1.6 does not seem to work.
For some PHP versions, the result is an error:
Process exited with code 139.
For some other PHP versions, the result is NULL.
For the rest, mb_ereg_replace did not exist yet.
Also, removing just a single letter from either the string or the regex seems to completely alter which versions of PHP have which results.
Judging from this comment:
php multi byte strings regex
ereg* should be avoided, which makes sense since it is slower and supports fewer features than preg* does. This makes using mb_ereg_replace undesirable. However, there is no mb_preg_replace function, so mb_ereg_replace seems to be the only option that works.
So, my question is:
Is there any alternative to mb_ereg_replace that would work correctly for the given string and regex pair?

Do you know the difference between (...) and (?:...)?
(...) ... this defines a marking group. The string found by the expression within the round brackets is internally stored in a variable for back referencing.
(?:...) ... this defines a non marking group. The string found by the expression within the parentheses is not internally stored. Such a non marking group is often used to apply an expression several times on a string.
Now let us take a look at your expression (((?!a).)*)*a\{, which, when used in a Perl regular expression find in the text editor UltraEdit, results in the error message: The complexity of matching expression has exceeded available resources.
(?!a). ... matches a single character that is not the letter 'a'. Okay. But you want to find a string of 0 or more characters up to the letter 'a'. Your solution is: ((?!a).)*
That is not a good solution, as the engine now has to look ahead for the letter 'a' at every character and, if the character is not an 'a', match it, store it as a string for back referencing, and then continue with the next character. Actually I don't even know what happens internally when a multiplier is used on a marking group as done here. A multiplier should never be used on a marking group. So (?:(?!a).)* would be better.
Next you extend the expression to (((?!a).)*)*. One more marking group with a multiplier?
It looks like you want to mark the entire string not containing the letter 'a'. But in this case it would be better to use ((?:(?!a).)*), as this defines 1 and only 1 marking group for the string found by the inner expression.
So the better expression would be ((?:(?!a).)*)a\{ as there is now only 1 marking group without a multiplier on the marking group. Now the engine knows exactly which string to store in a variable.
Much faster would be ([^a]*?)a\{, as this non-greedy negated character class also matches a string of 0 or more characters left of a{ that does not contain the letter 'a'. (Unlike ., [^a] also matches line breaks.) Lookahead should be avoided when it is not necessary, as this avoids backtracking.
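The two rewrites above can be compared with a minimal sketch like the following. The second subject string, xxa{yy, is a made-up example that actually contains a{; it is not from the question.

```php
<?php
// The two simplified patterns from the answer, applied to the question's
// test string and to a hypothetical string that does contain "a{".
$patterns = [
    'non-capturing lookahead' => '/((?:(?!a).)*)a\{/',
    'negated character class' => '/([^a]*?)a\{/',
];
$subjects = ['AAAAAAAAAAAAAA{AA', 'xxa{yy'];

$results = [];
foreach ($patterns as $label => $pattern) {
    foreach ($subjects as $subject) {
        $results[$label][$subject] = preg_replace($pattern, 'WORKED', $subject);
        echo $label, ' on ', $subject, ': ',
            var_export($results[$label][$subject], true), "\n";
    }
}
```

Both patterns leave the question's test string untouched, because it contains no lowercase a, and both turn xxa{yy into WORKEDyy.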
I don't know the source code of the PHP functions mb_ereg_replace and preg_replace, which would have to be stepped through with this expression to find out exactly why the results differ.
However, the expression (((?!a).)*)*a\{ definitely results in heavy recursion, as it is not defined when to stop matching data and what to store temporarily. So both functions (most likely) allocate more and more memory from the stack, and perhaps also from the heap, until either a stack overflow or a "not enough free memory" exception occurs.
Exit code 139 is a segmentation fault (memory boundary violation), caused either by an uncaught stack overflow or by malloc() returning NULL when asked for more heap memory and that return value being ignored. I suppose that malloc() returning NULL is the reason for exit code 139.
So the difference most likely lies in the error or exception handling of the two functions. Catching a memory exception, or counting the recursive iterations and exiting when there are too many of them so that a memory exception is prevented before it occurs, could be the reason for the different behavior on this expression.
It is hard to give a definite answer about what makes the difference without knowing the source code of mb_ereg_replace and preg_replace, but from my point of view it does not really matter.
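One further possibility, offered here as an assumption rather than something this thread confirms: the mb_ereg_* functions take a bare pattern without delimiters, so the surrounding '/' characters in the question's test call would be treated as literal slashes that the pattern must find in the subject. The sketch below uses made-up strings (banana, b/a/nana) to show that behavior in isolation, and it only runs when the mbstring extension is loaded.

```php
<?php
// Illustrative strings only; these are not the inputs from the question.
$delimited = $undelimited = $literal = null;

if (function_exists('mb_ereg_replace')) {
    // Searches for the three characters "/a/", which "banana" does not contain.
    $delimited = mb_ereg_replace('/a/', 'X', 'banana');
    // Searches for "a".
    $undelimited = mb_ereg_replace('a', 'X', 'banana');
    // Here the subject really contains "/a/", so the delimited form matches.
    $literal = mb_ereg_replace('/a/', 'X', 'b/a/nana');

    echo $delimited, "\n", $undelimited, "\n", $literal, "\n";
}
```

If that is what happens in the question's test, the unchanged string returned by mb_ereg_replace would say little about how its engine copes with the nested quantifiers.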
The expression (((?!a).)*)*a\{ always results in heavy recursion, as Sam already reported in his first comment. More than 119000 steps (= function calls) during a replace on a string of just 17 characters is a strong sign that something is wrong with the expression. The expression can be used to drive the function, or the entire application (the PHP interpreter), into abnormal error handling, but not for a real replace. So it is useful to the developers of the PHP functions for testing error handling on an endless recursion, but not for a real replace operation.
The full regular expression as used in referenced PHP sandbox:
(?<!<br>)(?<!\s)\s*(\((?:(?:(?!<br>|\(|\)).)*(?:\((?:(?!<br>|\(|\)).)*\))?)*?\))\s*(\{)
It is hard to analyze this search string in this form.
So let us look at the search string as if it were a code snippet, with indentation, to better understand the conditions and loops in this expression.
(?<!<br>)(?<!\s)\s*
(
  \(
  (?:
    (?:
      (?!<br>|\(|\)).
    )*
    (?:
      \(
      (?:
        (?!<br>|\(|\)).
      )*
      \)
    )?
  )*?
  \)
)
\s*
(\{)
I hope it is now easier to see the recursion in this search string. The same block appears twice, not in sequence but nested, a classic recursion.
In addition, all the expressions before the final (\{), including the nested ones that form the recursion and can match any character, carry the multipliers * or ?, which mean: may exist, but need not exist. The presence of { is the only real condition in the entire search string. Everything else is optional, and that is not good because of the recursion in this search string.
It is very bad for a recursive search expression if it is completely unclear where to start and where to stop selecting characters, as this results in an endless recursion until abnormal exit.
Let me explain this problem with a simple expression like [A-Za-z]+([a-z]+)
1 or more letters in upper or lower case followed by 1 or more characters in lower case (and case-sensitive search is enabled). Simple, isn't it?
But the second character class defines a set of characters which is a subset of the set of characters defined by the first class definition. And this is not good.
What should be tagged by the expression in parentheses on a string like York?
ork or rk or just k, or even nothing because no match is found, since the first character class can already match the entire word and therefore nothing is left for the second character class?
The Perl regular expression library solves this common problem by declaring the multipliers * and + greedy by default, unless ? is used after a multiplier, which results in the opposite matching behavior. Those 2 additional rules already help with this problem.
Therefore the expression as used here marks only k; with [A-Za-z]+?([a-z]+) the string ork is marked, and with [A-Za-z]+?([a-z]+?) just the first o is marked.
And there is one more rule: favor a positive result over a negative result. This additional rule prevents the first character class from already selecting the entire word York.
So the main problem with partly or completely overlapping sets of characters is solved.
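The three York variants can be checked directly with preg_match. This is a small sketch; the variable names are arbitrary.

```php
<?php
// Same word, three combinations of greedy and lazy multipliers.
$word = 'York';
$patterns = [
    '/[A-Za-z]+([a-z]+)/',   // greedy, greedy
    '/[A-Za-z]+?([a-z]+)/',  // lazy, greedy
    '/[A-Za-z]+?([a-z]+?)/', // lazy, lazy
];

$captured = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, $word, $m);
    $captured[$pattern] = $m[1]; // text marked by the group in parentheses
    echo $pattern, ' marks ', $m[1], "\n";
}
// Marks k, ork and o, in that order.
```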
But what happens if such an expression is put into a recursion and made even more complex by lookahead / lookbehind and backtracking, where backtracking is done not only by 1 character but by multiple characters?
Is it still clearly defined where to start and stop selecting characters for every expression part of the entire search string?
No, it is not.
With a search string for which there is no clear rule about which part of the string is selected by which part of the search expression, every result is more or less valid, including the unexpected ones.
In addition, because of the missing start/stop conditions, it can easily happen that the functions fail completely to apply the expression to a string and exit abnormally.
An abnormal exit while applying a search string is surely always an unexpected result for the person who wrote the search expression.
Different versions of the search functions may return different results for an expression that makes them run into an abnormal function exit. The developers continuously change the code of the search functions to better detect and handle search expressions that result in an endless recursion, as this is simply a security problem. A regular expression search that allocates more and more memory from the application's stack or from the entire RAM is very problematic for the security, stability and availability of the whole machine on which the application is running. And PHP is used mainly on servers, which should not stop working because a recursive memory allocation occupies more and more RAM, as this would finally kill the entire server.
This is the reason why you get different results depending on the used PHP version.
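When preg_replace returns NULL, preg_last_error() reports which safeguard stopped the match. The sketch below wraps that in a hypothetical helper, replace_with_diagnosis (not a PHP built-in), and lowers pcre.backtrack_limit and pcre.recursion_limit to arbitrary small values so that a runaway pattern is cut off early. Whether the question's pattern actually trips a limit depends on the PHP and PCRE version, so no output is claimed for it here.

```php
<?php
// Hypothetical helper: run preg_replace and, if it returns NULL, say why.
function replace_with_diagnosis($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result !== null) {
        return [$result, 'ok'];
    }
    $names = [
        PREG_INTERNAL_ERROR        => 'internal error',
        PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
        PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
        PREG_BAD_UTF8_ERROR        => 'malformed UTF-8',
    ];
    $code = preg_last_error();
    return [null, isset($names[$code]) ? $names[$code] : 'error code ' . $code];
}

// Arbitrary low limits, chosen only so that a runaway match stops quickly.
ini_set('pcre.backtrack_limit', '10000');
ini_set('pcre.recursion_limit', '10000');

list($result, $status) = replace_with_diagnosis(
    '/(((?!a).)*)*a\{/',
    'WORKED',
    'AAAAAAAAAAAAAA{AA'
);
echo var_export($result, true), ' (', $status, ")\n";
```

Checking the return value against NULL before using it avoids silently replacing the subject with an empty value when the engine gives up.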
I looked at your complete search expression for a long time and ran it several times on the example string. But honestly I could not work out what should be found and what should be ignored by the part of the expression left of (\{).
I understand parts of the expression, but why is there a recursion in the search string at all?
What is the purpose of the negative lookbehind (?<!\s) on \s*?
\s* matches 0 or more whitespace characters, and therefore the purpose of the condition "previous character is not whitespace" is not comprehensible to me. The negative lookbehind is simply useless from my point of view and just increases the complexity of the entire expression. And this is just the beginning.
I am quite sure that what you really want can be achieved with a much simpler expression that has no recursion leading to abnormal function exits depending on the searched string, and with all or nearly all backtracking steps removed.

Related

Extract all words between two phrases using regex [duplicate]

This question already has an answer here:
Simple AlphaNumeric Regex (single spacing) without Catastrophic Backtracking
(1 answer)
Closed 4 years ago.
I'm trying to extract all the words between two phrases using the following regex:
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b(.*)\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
The documents I'm running this regex on are 10-K filings. The filings are too long to post here (see regex101 url below for example), but basically they are something like this:
ITEM 1. BUSINESS
lots of words
ITEM 2. PROPERTIES
lots of words
ITEM 3. LEGAL PROCEEDINGS
I want to extract all the words between ITEM 1 and ITEM 3. Note that the subtitles for each ITEM may be slightly different for each 10-K filing, hence I'm allowing for a few words between each word.
I keep getting catastrophic backtracking error, and I cannot figure out why. For example, please see https://regex101.com/r/zgTiyb/1.
What am I doing wrong?

Catastrophic backtracking has essentially one main cause:
A possible match is found but can't finish.
You made too many positions available for the regex to try. This hits the backtracking limit in PCRE. A quick workaround would be to remove the only dot-star in the regex and replace it with a restrictive quantifier, for example:
.{0,200}
See live demo here
But the better approach is re-constructing the regular expression:
\bitem\b.*?\b(?:1|one)\b(*COMMIT)\W+(?:\w+\W+){0,2}?business\b\h*\R+(?:(?!item\h+(?:3|three)\b)[\s\S])*+item\h+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b
See live demo here
Your own regex needs ~45K steps on given input string to find those two matches. In contrast, this modified regex needs ~8K steps to accomplish the task. That's a huge improvement.
The latter doesn't need the s flag (and it shouldn't be enabled). I used the (*COMMIT) backtracking verb to cause an early failure if a possible match is found but is unlikely to finish.
@Sebastian Proske's solution matches three sub-strings but I don't think the third match is an expected match. This huge third match is the only reason for your regex to break.
Please read this answer to have a better insight into this problem.

This isn't really catastrophic backtracking, just a whole lot of text and a comparatively low backtracking limit in regex101. In this scenario the use of .* isn't optimal, as it will match the whole remainder of the text file once it is reached and then backtrack character by character to match the parts after it, which means a lot of characters to process.
Seems you can stick to \w+\W+ at that place as well and use lazy matching instead of greedy to get your result, like
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b\W+(?:\w+\W+)*?\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
Note that the pcre engine optimizes (?:\w+\W+) to (?>\w++\W++) thus working by word-no-word-chunks instead of single characters.

PHP Regex detect repeated character in a word

(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.

If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
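A quick way to see the difference between the two patterns is to run both over a few words. The word list here is made up for illustration.

```php
<?php
// First pattern: the same character three times anywhere in the word.
$anywhere = '/\b\w*?(\w)\w*?\1\w*?\1\w*/';
// Second pattern: the same character three times in a row.
$contiguous = '/\b\w*?(\w)\1{2}\w*/';

$report = [];
foreach (['banana', 'hello', 'aaab', 'book'] as $word) {
    $report[$word] = [preg_match($anywhere, $word), preg_match($contiguous, $word)];
    echo $word, ': anywhere=', $report[$word][0],
        ' contiguous=', $report[$word][1], "\n";
}
// banana matches only the first pattern, aaab matches both,
// hello and book match neither.
```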

Try this regex instead
(preg_match('/(.)\1{2,}/', $repeater))
This should match 3 or more times, see example here http://regexr.com/3fk80
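The reason the original pattern misses some words is an off-by-one: (.) already consumes the first occurrence, so \1{3} requires four identical characters in a row, while \1{2,} requires three or more. A small check with made-up words:

```php
<?php
$original = '/(.)\1{3}/';  // one captured character plus 3 repeats = 4 in a row
$fixed    = '/(.)\1{2,}/'; // one captured character plus 2 or more repeats

$checks = [];
foreach (['heey', 'heeey', 'heeeey'] as $word) {
    $checks[$word] = [preg_match($original, $word), preg_match($fixed, $word)];
    echo $word, ': original=', $checks[$word][0], ' fixed=', $checks[$word][1], "\n";
}
// Only the fixed pattern matches heeey; both match heeeey; neither matches heey.
```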

Strictly speaking, regular expressions that include backreferences such as \1, \2, ... are not regular expressions in the mathematical sense, and the scanner that handles them is not efficient: it has to adapt itself to the captured group in order to match the same string again, and on failure it has to backtrack over the length of the matched group.
The canonical way to express a true regular expression that accepts word characters repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
and the {3,} operator does not distribute over the alternation, so it cannot be factored out into a single group as you have shown in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
Again, since AAAA* matches as soon as three As are found, the following regex would also be valid:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version allows you to capture the group \1 that delimits the actual matching sequence.
This approach is longer to write but far more efficient when parsing the data string, as it does not backtrack at all and visits each character only once.
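Writing out all 52 branches by hand is tedious, so the alternation can be generated. The following is a sketch that assumes plain ASCII letters only, as in the answer; the variable names are arbitrary.

```php
<?php
// Build (A{3,}|B{3,}|...|z{3,}) from the ASCII letters.
$letters = array_merge(range('A', 'Z'), range('a', 'z'));
$branches = array_map(function ($letter) {
    return $letter . '{3,}';
}, $letters);
$pattern = '/(' . implode('|', $branches) . ')/';

$found = preg_match($pattern, 'baaad', $m);
$run = $found ? $m[1] : null;      // the run of repeated letters
$none = preg_match($pattern, 'abab');

echo var_export($run, true), ' ', $none, "\n";
```

Group 1 holds the whole run here (aaa for baaad), rather than a single character as with the backreference version.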

How to trigger Regex Denial-of-Service in PHP?

How can I trigger a Regex-DOS using the preg_match() function using an evil regular expression (e.g. (a+)+ )?
For example, I have the following situation:
preg_match('/(a+)+/',$input);
If I have control over $input, how could I trigger a DOS attack or reach the backtrack limit of the preg_* functions in php?
How could I do this with the following expressions?
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} | for x > 10

There is no way to trigger ReDOS on (a+)+, ([a-zA-Z]+)*, (a|aa)+, (a|a?)+ , since there is nothing that can cause match failure and trigger backtracking after the problematic part of the regex.
If you modify the regex a bit, for example, adding b$ after each of the regex above, then you can trigger catastrophic backtracking with an input like aaa...aabaa...aa.
Depending on the engine's implementation and optimization, there are cases where we expect catastrophic backtracking, but the engine doesn't exhibit any sign of such behavior.
For example, given (a+)+b and the input aaa...aac, PCRE fails the match outright, since it has an optimization that checks for required character in the input string before starting the match proper.
Knowing what the engine does, we can throw off its early detection with the input aaa...aacb and get the engine to exhibit catastrophic backtracking.
As for (.*a){x}, it is possible to trigger ReDOS, since it has a failing condition of fewer than x iterations. Given the input string aaa...a (with x or more of character a), the regex keeps trying all combinations of a's at the end of the string as it backtracks away from the end of the string. Therefore, the complexity of the regex is O(2^x). Knowing that, we can tell that the effect is more visible when x is a larger number, let's say 20. By the way, this is one rare case where a matching string has the worst case complexity.
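The (a+)+b case described above can be tried with a sketch like this one. The lowered pcre.backtrack_limit value and the count of 25 a's are arbitrary choices made so the run finishes quickly; the exact outcome depends on the PCRE version and on whether its JIT is enabled, so the code only reports what happened.

```php
<?php
// Arbitrary low limit so that a runaway match is cut off early.
ini_set('pcre.backtrack_limit', '10000');

// Input of the form aaa...aacb: the trailing b defeats the
// required-character check, the c makes every attempt fail.
$subject = str_repeat('a', 25) . 'cb';

$result = preg_match('/(a+)+b/', $subject);
$error = preg_last_error();

if ($result === false) {
    // preg_match gave up; PREG_BACKTRACK_LIMIT_ERROR is the usual code here.
    echo 'match aborted, preg_last_error() = ', $error, "\n";
} else {
    echo 'match finished with result ', $result, "\n";
}
```

A return value of false, as opposed to 0, is the signal that the engine stopped early and did not prove that there is no match.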

Regex atomic grouping does not seem to work in preg_match_all()

I've recently been playing with regular expressions, and one thing doesn't work as expected for me when I use preg_match_all in php.
I'm using an online regex tool at http://www.solmetra.com/scripts/regex/index.php.
The regex I'm using is /(?>x|y|z)w/. I'm making it match abyxw. I am expecting it to fail, yet it succeeds, and matches xw.
I am expecting it to fail, due to the use of atomic grouping, which, from what I have read from multiple sources, prevents backtracking. What I am expecting precisely is that the engine attempts to match the y with alternation and succeeds. Later it attempts to match w with the regex literal w and fails, because it encounters x. Then it would normally backtrack, but it shouldn't in this case, due to the atomic grouping. So from what I know it should keep trying to match y with this atomic group. Yet it does not.
I would appreciate any light shed on this situation. :)

This is a little bit tricky, but there are two things that the regex can try to do when it cannot find a match:
Advance the starting position - If the match cannot succeed at an index i, it will be attempted again starting at index i+1, and this will continue until it reaches the end of the string.
Backtracking - If repetition or alternation is used in the regex, then the regex engine can discard part of an unsuccessful match and try again by using less or more of the repetition, or a different element in the alternation.
Atomic groups prevent backtracking but they do not affect advancing the starting position.
In this case, the match will fail when the engine is trying to match with y as the first character, but then it will move on and see xw as the remainder of the string, which will match.
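Both effects can be seen in a short sketch. The first call is the question's own case; the second pair, with the pattern (?>a|ab)c, is an added example where the atomic group does change the outcome.

```php
<?php
// The question's case: the attempt starting at y fails, then the engine
// advances the starting position and finds xw at offset 3.
preg_match('/(?>x|y|z)w/', 'abyxw', $m, PREG_OFFSET_CAPTURE);
echo $m[0][0], ' at offset ', $m[0][1], "\n"; // xw at offset 3

// Where atomic grouping matters: once "a" is taken, the engine may not
// go back into the group to try "ab", so the match fails.
$atomic = preg_match('/(?>a|ab)c/', 'abc'); // 0
$plain  = preg_match('/(?:a|ab)c/', 'abc'); // 1
echo 'atomic=', $atomic, ' plain=', $plain, "\n";
```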

In RegEx, how do you find a line that contains no more than 3 unique characters?

I am looping through a large text file and I'm looking for lines that contain no more than 3 different characters (those characters, however, can be repeated indefinitely). I am assuming the best way to do this would be some sort of regular expression.
All help is appreciated.
(I am writing the script in PHP, if that helps)

Regex optimisation fun time exercise for kids! Taking gnarf's regex as a starting point:
^(.)\1*(.)?(?:\1*\2*)*(.)?(?:\1*\2*\3*)*$
I noticed that there were nested and sequential *s here, which can cause a lot of backtracking. For example in 'abcaaax' it will try to match that last string of ‘a’s as a single \1* of length 3, a \1* of length two followed by a single \1, a \1 followed by a 2-length \1*, or three single-match \1s. That problem gets much worse when you have longer strings, especially when due to the regex there is nothing stopping \1 from being the same character as \2.
^(.)\1*(.)?(?:\1|\2)*(.)?(?:\1|\2|\3)*$
This was over twice as fast as the original, testing with Python's regex matcher. (It's quicker than setting it up in PHP, sorry.)
This still has a problem in that (.)? can match nothing, and then carry on with the rest of the match. \1|\2 will still match \1 even if there is no \2 to match, resulting in potential backtracking trying to introduce the \1|\2 and \1|\2|\3 clauses earlier when they can't result in a match. This can be solved by moving the ? optionalness around the whole of the trailing clauses:
^(.)\1*(?:(.)(?:\1|\2)*(?:(.)(?:\1|\2|\3)*)?)?$
This was twice as fast again.
There is still a potential problem in that any of \1, \2 and \3 can be the same character, potentially causing more backtracking when the expression does not match. This would stop it by using a negative lookahead to not match a previous character:
^(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?$
However in Python with my random test data I did not notice a significant speedup from this. Your mileage may vary in PHP dependent on test data, but it might be good enough already. Possessive-matching (*+) might have helped if this were available here.
No regex performed better than the easier-to-read Python alternative:
len(set(s))<=3
The analogous method in PHP would probably be with count_chars:
strlen(count_chars($s, 3))<=3
I haven't tested the speed but I would very much expect this to be faster than regex, in addition to being much, much nicer to read.
So basically I just totally wasted my time fiddling with regexes. Don't waste your time, look for simple string methods first before resorting to regex!
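For reference, here is a sketch that runs the optimised regex and the count_chars check side by side in PHP, on the sample lines used elsewhere in this thread. Note that count_chars works on bytes, so this comparison assumes single-byte characters.

```php
<?php
$regex = '/^(.)\1*(?:(.)(?:\1|\2)*(?:(.)(?:\1|\2|\3)*)?)?$/';
$lines = ['aaaaa', 'abababcaaabac', 'aaadsdsdads', 'aasasasassa', 'aasdasdsadfasf'];

$byRegex = [];
$byCount = [];
foreach ($lines as $line) {
    $byRegex[$line] = preg_match($regex, $line) === 1;
    // count_chars mode 3 returns a string made of each distinct byte once.
    $byCount[$line] = strlen(count_chars($line, 3)) <= 3;
    echo $line, ': regex=', var_export($byRegex[$line], true),
        ' count_chars=', var_export($byCount[$line], true), "\n";
}
// Both checks accept the first four lines and reject the last one.
```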

At the risk of getting downvoted, I will suggest regular expressions are not meant to handle this situation.
You can match a character or a set of characters, but you can't have it remember what characters of a set have already been found to exclude those from further match.
I suggest you maintain a character set, you reset it before you begin with a new line, and you add there elements while going over the line. As soon as the count of elements in the set exceeds 3, you drop the current line and proceed to the next.
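A minimal sketch of that idea follows. The function name has_at_most_unique_chars is made up for this example, and the code assumes the lines are valid UTF-8 so that multibyte characters count as one character each.

```php
<?php
// Hypothetical helper: true if $line uses at most $limit distinct characters.
function has_at_most_unique_chars($line, $limit = 3)
{
    // Split into UTF-8 characters; preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false;
    }
    $seen = []; // the character set, reset for every line
    foreach ($chars as $char) {
        $seen[$char] = true;
        if (count($seen) > $limit) {
            return false; // drop the line as soon as a 4th character shows up
        }
    }
    return true;
}

foreach (['aaadsdsdads', 'aasdasdsadfasf', 'ééàà'] as $line) {
    echo $line, ': ', has_at_most_unique_chars($line) ? 'Pass' : 'Fail', "\n";
}
```

The early return means a long line with many different characters is rejected after its fourth distinct character, without scanning the rest.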

Perhaps this will work:
preg_match("/^(.)\\1*(.)?(?:\\1*\\2*)*(.)?(?:\\1*\\2*\\3*)*$/", $string, $matches);
// aaaaa:Pass
// abababcaaabac:Pass
// aaadsdsdads:Pass
// aasasasassa:Pass
// aasdasdsadfasf:Fail
Explanation:
/
^ #start of string
(.) #match any character in group 1
\\1* #match whatever group 1 was 0 or more times
(.)? #match any character in group 2 (optional)
(?:\\1*\\2*)* #match group 1 or 2, 0 or more times, 0 or more times
#(non-capture group)
(.)? #match any character in group 3 (optional)
(?:\\1*\\2*\\3*)* #match group 1, 2 or 3, 0 or more times, 0 or more times
#(non-capture group)
$ #end of string
/
An added benefit: $matches[1], [2], [3] will contain the three characters you want. The regular expression looks for the first character, stores it and matches it until something other than that character is found, captures that as a second character, matches either of those characters as many times as it can, captures the third character, and matches all three until the match fails or the string ends and the test passes.
EDIT
This regexp will be much faster because of the way the parsing engine and backtracking works, read bobince's answer for the explanation:
/^(.)\\1*(?:(.)(?:\\1|\\2)*(?:(.)(?:\\1|\\2|\\3)*)?)?$/

For me, as a programmer with fair regular expression knowledge, this does not sound like a problem you can solve using regexp only.
More likely you will need to build a hashmap/array data structure (key: character, value: count) and iterate over the large text file, rebuilding the map for each line. At each new character, check whether the number of distinct characters already encountered is 3; if so, skip the current line.
But I'm keen to be surprised if some mad regexp hacker comes up with a solution.
