How to simplify this regex to avoid recursion? - php

Regex:
(?|`(?>[^`\\]|\\.|``)*`|'(?>[^'\\]|\\.|'')*'|"(?>[^"\\]|\\.|"")*"|(\?{1,2})|(:{1,2})([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*))
Example input:
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
https://regex101.com/r/HnTVXx/6
It should match ?, ??, :xyz and ::xyz but not if they are inside of a backquoted-string, double-quoted string, or single-quoted string.
When I try running this in PHP with a very large input I get PREG_RECURSION_LIMIT_ERROR from preg_last_error().
How can I simplify this regex pattern so that it doesn't do so much recursion?
Here's some test code that shows the error in PHP using Niet's optimized regex: https://3v4l.org/GdtmP Error code 6 is PREG_JIT_STACKLIMIT_ERROR. The other one I've seen is 3=PREG_RECURSION_LIMIT_ERROR

The general idea of "match this thing, but not in this condition" can be achieved with this pattern:
(don't match this(*SKIP)(*FAIL)|match this)
In your case, you'd want something like...
(
(['"`]) # capture this quote character
(?:\\.|(?!\1).)*+ # any escaped character, or
# any character that isn't the captured one
\1 # the captured quote again
(*SKIP)(*FAIL) # ignore this
|
\?\?? # one or two question marks
|
::?\w+ # word characters marked with one or two colons
)x
https://regex101.com/r/HnTVXx/7
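As a quick check, the pattern above can be run against the example input from the question. This is only a minimal sketch: the variable names are illustrative, the outer group is dropped, and the s modifier is an addition (an assumption, not part of the answer) so that quoted strings spanning several lines are skipped too.

```php
<?php
// Example input taken from the question.
$sql = <<<'SQL'
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
SQL;

// A nowdoc keeps every backslash literal, so the regex reads as written.
$pattern = <<<'REGEX'
~
  (['"`])              # capture the opening quote character
  (?:\\.|(?!\1).)*+    # an escaped character, or anything but that quote
  \1                   # the same quote again
  (*SKIP)(*FAIL)       # discard the quoted string and resume after it
|
  \?\??                # one or two question marks
|
  ::?\w+               # word characters marked with one or two colons
~xs
REGEX;

preg_match_all($pattern, $sql, $matches);
print_r($matches[0]); // placeholders found outside the quoted strings
```

Here $matches[0] holds ?, ??, ::col and :least; the question marks inside "what?" and `col?` are skipped.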

The same idea is used here to skip the quoted parts (the (*SKIP)(*F) combo), together with 2 techniques that reduce the work of the regex engine:
the first character discrimination
the unrolled pattern
These 2 techniques have something in common: limiting the cost of alternations.
The first character discrimination is useful when your pattern starts with an alternation. The problem with an alternation at the beginning is that every branch must be tested before the engine can conclude that the pattern fails at a given position. Since most positions in a string are failing ones, discarding them quickly is a significant improvement.
For instance, something like: "...|'...|`...|:... can also be written like this:
(?=["'`:])(?:"...|'...|`...|:...)
or
["'`:](?:(?<=")...|(?<=')...|(?<=`)...|(?<=:)...)
This way, each position that doesn't start with one of these characters ["'`:] is immediately rejected by the first token, without testing each branch.
The unrolled pattern consists of rewriting something like: " (?:[^"\\]|\\.)* " into:
" [^"\\]* (?: \\. [^"\\]* )* "
Note that this design eliminates the alternation and drastically reduces the number of steps.
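To see that the unrolled form is a drop-in replacement, both versions can be run on the same input. This is a small sketch with a made-up subject string; it only checks that the two patterns return the same matches and does not measure the number of steps.

```php
<?php
// Both patterns match a double-quoted string with backslash escapes.
// Inside single-quoted PHP strings, \\\\ produces the regex token \\.
$basic    = '~"(?:[^"\\\\]|\\\\.)*"~s';
$unrolled = '~"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"~s';

// In a single-quoted PHP string, \" stays a backslash followed by a quote.
$subject = 'say "he said \"hi\" twice" and "bye"';

preg_match_all($basic, $subject, $a);
preg_match_all($unrolled, $subject, $b);

print_r($a[0]);
var_dump($a[0] === $b[0]); // the two forms find the same strings
```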
Using these 2 techniques, your pattern can be written like this:
~
[`'"?:]
(?:
(?<=`) [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` (*SKIP) (*F)
|
(?<=') [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' (*SKIP) (*F)
|
(?<=") [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " (*SKIP) (*F)
|
(?<=\?) \??
|
(?<=:) :? ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~x
Another way: instead of using an alternation (improved or not) at the beginning, you can build a pattern that matches the whole string with contiguous results. The general design is:
\G (all I don't want) (*SKIP) \K (what I am looking for)
\G is an anchor that matches either the position after the previous result or the start of the string. Starting a pattern with it ensures that all the matches are contiguous. In this situation (\G at the very beginning and applying to the whole pattern), you can also replace it with the A modifier.
That gives:
~
[^`'"?:]*
(?:
` [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` [^`'"?:]*
|
' [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' [^`'"?:]*
|
" [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " [^`'"?:]*
)*
\K # only the part of the match after this position is returned
(*SKIP) # if the next subpattern fails, the contiguity is broken at this position
(?:
\?{1,2}
|
:{1,2} ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~Ax
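A short run of this contiguous pattern on a one-line input (a made-up sample, shortened from the question's example) might look as follows. The pattern is the one given above with the comments removed; the variable names are illustrative.

```php
<?php
// One-line sample: placeholders outside quotes, a "?" inside quotes.
$sql = <<<'SQL'
a=? and b="what?" and `col?`='OK' and ::col=:least
SQL;

$pattern = <<<'REGEX'
~
[^`'"?:]*
(?:
    ` [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` [^`'"?:]*
  |
    ' [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' [^`'"?:]*
  |
    " [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " [^`'"?:]*
)*
\K
(*SKIP)
(?:
    \?{1,2}
  |
    :{1,2} ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~Ax
REGEX;

preg_match_all($pattern, $sql, $matches);
print_r($matches[0]); // only the text after \K is reported
```

Each match starts exactly where the previous one ended, and $matches[0] holds ?, ::col and :least.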

Related

PHP regex performance

I have to extract some data from strings. Unfortunately, the data has a really unfriendly format. I had to create about 15 regular expressions, each placed in a separate preg_replace. It's worth saying that they contain many ORs (|). My question is what I should do in the end: combine all the expressions into one and separate them using |, or leave them as they are, in individual preg_replace calls?
Is it very bad practice to create additional expressions to keep things clear? I think I could maybe combine some expressions into one, but they would become very complicated and hard to understand.
For example I have:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
Untidy:
For starters your original PHP statement:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
would be much more readable (and maintainable) if you write it in free-spacing mode with comments like so:
Tidy:
$itemFullName = preg_replace("/(?#!php re_item_tidy Rev:20180207_0700)
^ # Anchor to start of string.
\b # String must begin with a word char.
( # $1: Unnecessary group.
([a-zA-Z]{1,3})? # $2: Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
(\.|\-|X) # $3: Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
(\s|\.|\-)? # $4: Optional whitespace, dot or hyphen.
(X|x)? # $5: Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
( # $6: Optional ??? from 2 alternatives.
([0-9]{1,3})? # Either a1of2 $7: Optional 1-3 digits.
(X[0-9]{1,3}) # $8: X and 1-3 digits.
| ( # Or a2of2 $9: one ??? from 2 alternatives.
\s[0-9]\/[0-9] # Either a1of2.
| \/[0-9]{1,3} # Or a2of2.
) # End $9: one ??? from 2 alternatives.
)? # End $6: optional ??? from 2 alternatives.
( # $10: Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
\/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End $10: Optional sequence
) # End $1: Unnecessary group.
\s # End with a single whitespace char.
/x", ' ', $itemFullName, -1, $sum);
Critique:
This regex is really not bad performance-wise. It has a start of string anchor at the start which helps it fail quickly for non-matching strings. It also does not have any backtracking problems. However, there are a few minor improvements which can be made:
There are three groups of alternatives where each of the alternatives consists of only one character - each of these can be replaced with a simple character class.
There are 10 capture groups but preg_replace uses none of the captured data. These capture groups can be changed to be non-capturing.
There are several unnecessary groups which can be simply removed.
Group 2: ([a-zA-Z]{1,3})? can be written more simply as: [a-zA-Z]{0,3}. Group 7 has a similar construct.
The \b word boundary at the start is unnecessary.
With PHP, it's best to enclose regex patterns inside single quoted strings. Double quoted strings have many metacharacters that must be escaped. Single quoted strings only have two: the single quote and the backslash.
There are a few unnecessarily escaped forward slashes.
Note also that you are using the $sum variable to count the number of replacements being made by preg_replace(). Since you have a ^ start of string anchor at the beginning of the pattern, you will only ever have one replacement because you have not specified the 'm' multi-line modifier. I am assuming that you actually do want to perform multiple replacements (and count them in $sum), so I've added the 'm' modifier.
Here is an improved version incorporating these changes:
Tidier:
$itemFullName = preg_replace('%(?#!php/m re_item_tidier Rev:20180207_0700)
^ # Anchor to start of line (m modifier).
[a-zA-Z]{0,3} # Zero to three alphas.
[0-9]{1,2} # 1-2 decimal digits.
[.X-] # Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
[\s.-]? # Optional whitespace, dot or hyphen.
[Xx]? # Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
(?: # Optional ??? from 2 alternatives.
[0-9]{0,3} # Either a1of2: Zero to three digits
X[0-9]{1,3} # followed by X and 1-3 digits.
| (?: # Or a2of2: One ??? from 2 alternatives.
\s[0-9]/[0-9] # Either a1of2.
| /[0-9]{1,3} # Or a2of2.
) # End one ??? from 2 alternatives.
)? # End optional ??? from 2 alternatives.
(?: # Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End optional sequence
\s # End with a single whitespace char.
%xm', ' ', $itemFullName, -1, $sum);
Note however, that I don't think you'll see much if any performance improvements - your original regex is pretty good. Your performance issues are probably coming from some other aspect of your program.
Hope this helps.
Edit 2018-02-07: Removed extraneous double quote, added regex shebangs.
My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?
Keep the regular expressions in separate preg_replace() calls because this gives you more maintainability, readability and efficiency.
Using a lot of OR operators | in your regular expression is not performance friendly, especially for large amounts of text: at every character in the input, the regular expression engine may have to try every alternative in the | list before it can move on.
Please don't worry about "fastest" without having first done some sort of measurement that it matters. Unless your program is operating too slowly, and you've run a profiler like XDebug to determine that the regex matching is the bottleneck, then you're doing premature optimization.
Rather than worrying about fastest, think about which way is clearest.
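If you do want numbers before deciding, a rough measurement is easy to set up. The sketch below uses two hypothetical patterns and a synthetic input (none of them from the question) and simply times separate preg_replace() calls against one combined pattern with microtime(); a profiler would give a more reliable picture.

```php
<?php
// Hypothetical patterns and input, chosen only to illustrate the measurement.
$patterns = ['/\d+/', '/[aeiou]/'];
$combined = '/\d+|[aeiou]/';
$subject  = str_repeat('a1b22c333e ', 10000);

// Variant 1: one preg_replace() call per pattern.
$start = microtime(true);
$separate = $subject;
foreach ($patterns as $p) {
    $separate = preg_replace($p, '', $separate);
}
$tSeparate = microtime(true) - $start;

// Variant 2: a single pattern joined with |.
$start = microtime(true);
$single = preg_replace($combined, '', $subject);
$tCombined = microtime(true) - $start;

printf(
    "separate: %.4fs, combined: %.4fs, same output: %s\n",
    $tSeparate,
    $tCombined,
    var_export($separate === $single, true)
);
```

The timings depend on the machine and the PCRE build, so treat them as a rough indication only.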

Regular expression works in Javascript but not PHP preg_match

Regular expression:
/([^]+):([^\\r\\n]+)/
String:
f1:aaa\r\nf2:bbb\r\nf3:ccc\r\nf4:ddd
According to regexpal.com, this would give my desired sets: f1 & aaa, f2 & bbb, f3 & ccc etc.
But using http://www.functions-online.com/preg_match.html I only see [0] => "f1" and [1] => "f1"
Can anyone show how I should be doing this?
Some implementations of javascript allow [] and [^] as "no character" and "any character" respectively. But keep in mind that this is particular to the javascript regex flavour. (If you are interested in the subject, you can take a look at this post.)
In other words [^] is a shortcut for [\s\S] since javascript doesn't have a dotall or singleline mode where the dot can match newlines.
Thus, to obtain the same result in PHP you must replace [^] by . (which by default matches any character except newline) with the singleline modifier s after the end delimiter or (?s) before the . to allow newlines too. Examples: /.+/s or /(?s).+/
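The effect of the s modifier can be checked with a tiny example. The subject string below is made up and uses a plain "\n" line break for simplicity.

```php
<?php
$subject = "f1:aaa\nf2:bbb";

preg_match('/.+/', $subject, $plain);      // "." stops at the line break
preg_match('/.+/s', $subject, $dotall);    // s modifier: "." also matches "\n"
preg_match('/(?s).+/', $subject, $inline); // same effect with an inline flag

var_dump($plain[0]);                 // only the first line
var_dump($dotall[0] === $subject);   // the whole string
var_dump($inline[0] === $subject);   // the whole string
```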
But for your particular case this pattern seems to be more appropriate:
preg_match_all('~((?>[^rn\\\:]++|(?<!\\\)[rn])+):([^\\\]++)~', $subject, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
echo $match[1].' '.$match[2].'<br/>';
}
pattern explanation:
~ # pattern delimiter
( # open the first capturing group
(?> # open an atomic group
[^rn\\\:]++ # all characters that are not "r", "n", "\" or ":"
| # OR
(?<!\\\)[rn] # "r" or "n" not preceded by "\"
)+ # close the atomic group and repeat one or more times
) # close the first capturing group
:
( # open the second capturing group
[^\\\]++ # all characters except "\" one or more times
) # close the second capturing group
~
Notices:
To match a literal \ (backslash), the regex itself needs \\. Inside a PHP string surrounded by single quotes this is written \\\\; the shorter \\\ also works when the next character is neither a backslash nor a quote, as in the pattern above.
The principle of this pattern is to use negated character classes and negative assertions; in other words, it looks for what the desired substrings cannot be.
The above pattern uses atomic groups (?>...) and possessive quantifiers ++ in place of non-capturing groups (?:...) and simple quantifiers +. It is the same, except that the regex engine can't go back to try other ways when it fails with atomic groups and possessive quantifiers, since it doesn't record backtrack positions. You can gain performance with this kind of feature.
Try with:
/([a-z0-9]+):([a-z0-9]+)(?:\r\n)?/
or
/(\w+):(\w+)(?:\r\n)?/
I think you need:
/([^:]+):([^\\r\\n]+)/
//__^ note the colon

Regular expression for template engine?

I'm learning about regular expressions and want to write a templating engine in PHP.
Consider the following "template":
<!DOCTYPE html>
<html lang="{{print("{hey}")}}" dir="{{$dir}}">
<head>
<meta charset="{{$charset}}">
</head>
<body>
{{$body}}
{{}}
</body>
</html>
I managed to create a regex that will find anything except for {{}}.
Here's my regex:
{{[^}]+([^{])*}}
There's just one problem. How do I allow the literal { and } to be used within {{}} tags?
It will not find {{print("{hey}")}}.
Thanks in advance.
This is a pattern to match the content inside double curly brackets:
$pattern = <<<'LOD'
~
(?(DEFINE)
(?<quoted>
' (?: [^'\\]+ | (?:\\.)+ )++ ' |
" (?: [^"\\]+ | (?:\\.)+ )++ "
)
(?<nested>
{ (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+ }
)
)
{{
(?<content>
(?:
[^"'{}]+
| \g<quoted>
| \g<nested>
)*+
)
}}
~xs
LOD;
Compact version:
$pattern = '~{{((?>[^"\'{}]+|((["\'])(?:[^"\'\\\]+|(?:\\.)+|(?:(?!\3)["\'])+)++\3)|({(?:[^"\'{}]+|\g<2>|(?4))*+}))*+)}}~s';
The content is in the first capturing group, but you can use the named capture 'content' with the detailed version.
Although this pattern is longer, it allows everything you want inside quoted parts, including escaped quotes, and it is faster than a simple lazy quantifier in many cases.
Nested curly brackets are allowed too, you can write {{ doThat(){ doThis(){ }}}} without problems.
The subpattern for quotes can also be written like this, which avoids repeating the same thing for single and double quotes (I use it in the compact version):
(["']) # the quote type is captured (single or double)
(?: # open a group (for the various alternatives)
[^"'\\]+ # all characters that are not a quote or a backslash
| # OR
(?:\\.)+ # escaped characters (the dot matches newlines thanks to the s modifier)
| #
(?!\g{-1})["'] # a quote that is not the captured quote
)++ # repeat one or more times
\g{-1} # the captured quote (-1 refers to the last capturing group)
Notice: a backslash must be written \\ in nowdoc syntax but \\\ or \\\\ inside single quotes.
Explanations for the detailed pattern:
The pattern is divided in two parts:
the definitions, where I define named subpatterns
the whole pattern itself
The definition section is useful to avoid repeating the same subpattern several times in the main pattern, or to make the pattern clearer. You can define subpatterns that you will use later inside this construct: (?(DEFINE)....)
This section contains 2 named subpatterns:
quoted : that contains the description of quoted parts
nested : that describes nested curly brackets parts
detail of nested
(?<nested> # open the named group "nested"
{ # literal {
## what can contain curly brackets? ##
(?> # open an atomic* group
[^"'{}]+ # all characters one or more times, except "'{}
| # OR
\g<quoted> # quoted content, to avoid curly brackets inside quoted parts
# (I call the subpattern I have defined before, instead of rewrite all)
| \g<nested> # OR curly parts. This is a recursion
)*+ # repeat the atomic group zero or more times (possessive *)
} # literal }
) # close the named group
(* more information about atomic groups and possessive quantifiers)
But all of these are only definitions; the pattern itself really begins with: {{
Then I open a named capture group (content) and I describe what can be found inside, (nothing new here).
I use two modifiers, x and s. x activates the verbose mode, which allows spaces to be placed freely in the pattern (useful for indenting). s is the singleline mode. In this mode, the dot can match newlines (it can't by default). I use this mode because there is a dot in the subpattern quoted.
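For a quick test, the detailed pattern can be applied with preg_match_all() and the named group read back. The template string below is a shortened, made-up variant of the one in the question, with the nested example from above appended.

```php
<?php
// The detailed pattern from above, in nowdoc syntax.
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted>
        ' (?: [^'\\]+ | (?:\\.)+ )++ ' |
        " (?: [^"\\]+ | (?:\\.)+ )++ "
    )
    (?<nested>
        { (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+ }
    )
)
{{
    (?<content>
        (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+
    )
}}
~xs
LOD;

// Single quotes keep $dir from being interpolated by PHP.
$template = '<html lang="{{print("{hey}")}}" dir="{{$dir}}">'
          . '{{ doThat(){ doThis(){ }}}}';

preg_match_all($pattern, $template, $m);
print_r($m['content']); // what was found between each {{ and }}
```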
You can just use "." instead of the character classes. But you then have to make use of non-greedy quantifiers:
\{\{(.+?)\}\}
The quantifier "+?" means it will consume the least necessary number of characters.
Consider this example:
<table>
<tr>
<td>{{print("{first name}")}}</td><td>{{print("{last name}")}}</td>
</tr>
</table>
With a greedy quantifier (+ or *), you'd only get one result, because it sees the first {{ and then the .+ consumes as many characters as it can as long as the pattern is matched:
{{print("{first name}")}}</td><td>{{print("{last name}")}}
With a non-greedy one (+? or *?) you'll get the two as separate results:
{{print("{first name}")}}
{{print("{last name}")}}
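The difference is easy to reproduce in PHP. This is a small sketch using the table row from the example above; the variable names are illustrative.

```php
<?php
$row = '<td>{{print("{first name}")}}</td><td>{{print("{last name}")}}</td>';

preg_match_all('/\{\{(.+)\}\}/', $row, $greedy);  // greedy: runs to the last }}
preg_match_all('/\{\{(.+?)\}\}/', $row, $lazy);   // lazy: stops at the first }}

echo count($greedy[0]), "\n"; // number of greedy matches
echo count($lazy[0]), "\n";   // number of lazy matches
print_r($lazy[1]);            // contents of each tag
```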
Make your regex less greedy using {{(.*?)}}.
I figured it out. Don't ask me how.
{{[^{}]*("[^"]*"\))?(}})
This will match pretty much anything.. like for example:
{{print("{{}}}{{{}}}}{}}{}{hey}}{}}}{}7")}}

regex matches numbers, but not letters

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (ie. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This will match a single character that is none of [, i or f, regardless of the combination. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (the negative lookahead will cause
# the entire pattern to fail if its contents match)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
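To compare the two patterns side by side, both can be run on the failing input from the question. This is a minimal sketch; the variable names are illustrative.

```php
<?php
$subject = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

// Pattern from the question: [^\[if] rejects any single "[", "i" or "f".
$original = '/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s';
// Pattern with the negative lookahead in front of every consumed character.
$tempered = '~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s';

var_dump(preg_match($original, $subject)); // fails because of the "f" in "abcdef"

preg_match($tempered, $subject, $m);
print_r($m); // the innermost tag pair, its name and its contents
```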
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
Use a period to match any character.

How does this PCRE pattern detect palindromes?

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.
Examine this PCRE pattern in PHP snippet:
$palindrome = '/(?x)
^
(?:
(.) (?=
.*
(
\1
(?(2) \2 | )
)
$
)
)*
.?
\2?
$
/';
This pattern seems to detect palindromes, as seen in these test cases (see also on ideone.com):
$tests = array(
# palindromes
'',
'a',
'aa',
'aaa',
'aba',
'aaaa',
'abba',
'aaaaa',
'abcba',
'ababa',
# non-palindromes
'aab',
'abab',
'xyz',
);
foreach ($tests as $test) {
echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);
}
So how does this pattern work?
Notes
This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).
Also, note that the PCRE man page presents a recursive pattern to match some palindromes:
# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.
Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of characters in the opposite direction:
^(.)(.)(.) ... \3\2\1$
We want to rewrite this such that the ... is only followed by a finite length of patterns, so that it becomes possible to convert it into a *. This is possible with a lookahead:
^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...
but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:
^(.)(?=.*(?<record>\1\k<record>)$) # \1 = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...
which could be converted into
^(?:
(.)(?=.*(\1\2)$)
)*
Almost good, except that \2 (the recorded capture) is not set on the first iteration, and a backreference to an unset group fails to match anything. We need it to match empty if the recorded capture doesn't exist yet. This is where the conditional expression comes in.
(?(2)\2|) # matches \2 if it exist, empty otherwise.
so our expression becomes
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*
Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*\2$
We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$
This works well except in one case: when there is only 1 character. This is again because \2 is unset and cannot match. So we make it optional:
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.
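The finished pattern can be wrapped in a small helper for testing. This is only a sketch: the function name isPalindrome is hypothetical, and the pattern is the final one derived above, written on one line.

```php
<?php
// Hypothetical helper around the pattern constructed above.
function isPalindrome(string $s): bool
{
    $pattern = '/^(?:(.)(?=.*(\1(?(2)\2|))$))*.?\2?$/';
    return preg_match($pattern, $s) === 1;
}

foreach (['', 'a', 'abba', 'abcba', 'aab', 'abab'] as $test) {
    printf("%-7s %s\n", "'$test'", isPalindrome($test) ? 'palindrome' : 'not a palindrome');
}
```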
I want to bring my very own solution to the table.
This is a regex that I wrote a while ago to match palindromes using PCRE/PCRE2:
^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$
Example:
https://regex101.com/r/xvZ1H0/1
