Regex PHP. Reduce steps: limited by fixed width Lookbehind

Regex PHP. Reduce steps: limited by fixed width Lookbehind - php

I have a regex that will be used to match #users tags.
I use lokarround assertions, letting punctuation and white space characters surround the tags.
There is an added complication, there are a type of bbcodes that represent html.
I have two types of bbcodes, inline (^B bold ^b) and blocks (^C center ^c).
The inline ones have to be passed thru to reach for the previous or next character.
And the blocks are allowed to surround a tag, just like punctuation.
I made a regex that does work. What I want to do now is to lower the number of steps that it does in every character that’s not going to be a match.
At first I thought I could do a regex that would just look for #, and when found, it would start looking at the lookarrounds, that worked without the inline bbcodes, but since lookbehind cannot be quantifiable, it’s more difficult since I cannot add ((\^[BIUbiu])++)* inside, producing much more steps.
How could I do my regex more efficient with fewer steps?
Here is a simplified version of it, in the Regex101 link there is the full regex.
(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*#([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])
https://regex101.com/r/lTPUOf/4/

A rule of thumb:
Do not let engine make an attempt on matching each single one character if
there are some boundaries.
The quote originally comes from this answer. Following regular expression reduces steps in a significant manner because of the left side of the outermost alternation, from ~20000 to ~900:
(?:[^#^]++|[#^]{2,}+)(*SKIP)(*F)
|
(?<=([HUGE-CHARACTER-CLASS])|\^[cjleqrd])
(\^[34biu78])*+#([a-z\d][\w-.]{0,25}[a-z\d])(\^[34biu78])*+(?=(?1))
Actually I don't care much about the number of steps being reported by regex101 because that wouldn't be true within your own environment and it is not obvious if some steps are real or not or what steps are missed. But in this case since the logic of regex is clear and the difference is a lot it makes sense.
What is the logic?
We first try to match what probably is not desired at all, throw it away and look for parts that may match our pattern. [^#^]++ matches up to a # or ^ symbols (desired characters) and [#^]{2,}+ prevents engine to take extra steps before finding out it's going nowhere. So we make it to fail as soon as possible.
You can use i flag instead of defining uppercase forms of letters (this may have a little impact however).
See live demo here

Related

Extract all words between two phrases using regex [duplicate]

This question already has an answer here:
Simple AlphaNumeric Regex (single spacing) without Catastrophic Backtracking
(1 answer)
Closed 4 years ago.
I'm trying to extract all the words between two phrases using the following regex:
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b(.*)\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
The documents I'm running this regex on are 10-K filings. The filings are too long to post here (see regex101 url below for example), but basically they are something like this:
ITEM 1. BUSINESS
lots of words
ITEM 2. PROPERTIES
lots of words
ITEM 3. LEGAL PROCEEDINGS
I want to extract all the words between ITEM 1 and ITEM 3. Note that the subtitles for each ITEM may be slightly different for each 10-K filing, hence I'm allowing for a few words between each word.
I keep getting catastrophic backtracking error, and I cannot figure out why. For example, please see https://regex101.com/r/zgTiyb/1.
What am I doing wrong?

Catastrophic backtracking has almost one main reason:
A possible match is found but can't finish.
You made too many positions available for regex to try. This hits backtracking limit on PCRE. A quick work around would be removing the only dot-star in regex in order to replace it with a restrictive quantifier i.e.
.{0,200}
See live demo here
But the better approach is re-constructing the regular expression:
\bitem\b.*?\b(?:1|one)\b(*COMMIT)\W+(?:\w+\W+){0,2}?business\b\h*\R+(?:(?!item\h+(?:3|three)\b)[\s\S])*+item\h+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b
See live demo here
Your own regex needs ~45K steps on given input string to find those two matches. In contrast, this modified regex needs ~8K steps to accomplish the task. That's a huge improvement.
The latter doesn't need s flag (and it shouldn't be enabled). I used (*COMMIT) backtracking verb to cause an early failure if a possible match is found but is likely to not finish.
#Sebastian Proske's solution matches three sub-strings but I don't think the third match is an expected match. This huge third match is the only reason for your regex to break.
Please read this answer to have a better insight into this problem.

This isn't really catastrophic backtracking, just a whole lot of text and a comparedly low backtracking limit in regex101. In this scenario the use of .* isn't optimal, as it will match the whole remainder of the textfile once it is reached and then backtrack character after character to match the parts after it - which means a lot of characters to process.
Seems you can stick to \w+\W+ at that place as well and use lazy matching instead of greedy to get your result, like
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b\W+(?:\w+\W+)*?\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
Note that the pcre engine optimizes (?:\w+\W+) to (?>\w++\W++) thus working by word-no-word-chunks instead of single characters.

Convert regex from gskinner to PHP

I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)

[^] is a difficult construct. Basically it is an empty negated character class. But what should it match? That depends on the implementation. Some languages are interpreting it as negation of nothing, so it will match every character, that is what gskinner (means ActionScript 3) seems to be doing.
I would never use this, because it is ambiguous.
The most readable way is to use ., the meta character that matches every character (without newlines), if newlines are also wanted, just add the modifier s that enables the dotall mode, this would be exactly what you wanted to achieve with [^].
A workaround that is sometimes used is to use a character class something like this [\s\S] or [\w\W]. Those will also match every character (including newlines), because they are matching some predefined character class and their negation.

Regex with negative lookahead to ignore the word "class"

I'm getting insane over this, it's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, ie "ass".
For example, in this string:
<span class="bob">Blacklisted word was here</span>bass
I tried that regex:
((?!class)ass)
That matches the "ass" in the word "bass" bot NOT "class".
This regex flags "ass" in both occurences. I checked multiple negative lookaheads on google and none works.
NOTE: This is for a CMS, for moderators to easily find potentially bad words, I know you cannot rely on a computer to do the filtering.

If you have lookbehind available (which, IIRC, JavaScript does not and that seems likely what you're using this for) (just noticed the PHP tag; you probably have lookbehind available), this is very trivial:
(?<!cl)(ass)
Without lookbehind, you probably need to do something like this:
(?:(?!cl)..|^.?)(ass)
That's ass, with any two characters before as long as they are not cl, or ass that's zero or one characters after the beginning of the line.
Note that this is probably not the best way to implement a blacklist, though. You probably want this:
\bass\b
Which will match the word ass but not any word that includes ass in it (like association or bass or whatever else).

It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.
The trick here is to know where to use the lookbehind:
/ass(?<!class)/
In other words, the good word negative lookbehind should follow the bad word pattern, not precede it. Then it would work correctly.
You can even get some of them in a row:
/ass(?<!class)(?<!pass)(?<!bass)/
This, though, will match both passhole and pass. ) To make it even more bullet-proof, we can add checking the word boundaries:
/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
UPDATE: of course, it's more efficient to check for parts of the string, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from whitelist in the regex.
Then again, perhaps it'd be wise to prepare the whitelists accordingly (so shorter patterns will have to be checked).

Is this one is what you want ? (?<!class)(\w+ass)

Explain Regular Expression

1. (.*?)
2. (*)
3. #regex#
4. /regex/
A. What do the above symbols mean?
B. What is the different between # and /?
I have the cheat-sheet, but didn't full get it yet. What i know * gets
all characters, so what .*? is for!
The above patterns are used in PHP preg_match and preg_replace.

. matches any character (roughly).
*? is a so-called quantifier, matching the previous token at least zero times (and only as often as needed to complete a match – it's lazy, hence the ?).
(...) create a capturing group you can refer to in either the regex or the match. They also are used for limiting the reach of the | alternation to only parts of the regex (just like parentheses in math make precedence clear).
/.../ and #...# are delimiters for the entire regex, in PHP at least. Technically they're not part of the regex syntax. What delimiter you use is up to you (but I think you can't use \), and mostly changes what characters you need to escape in the regex. So / is a bad choice when you're matching URIs that might contain a lot of slashes. Compare the following two varaints for finding end-of-line comments in C++ style:
preg_match('/\/\/.*$/', $text);
preg_match('#//.*$#', $text);
The latter is easier to read as you don't have to escape slashes within the regex itself. # or # are commonly used as delimiter because they stands out and aren't that frequent in text, but you can use whatever you like.
Technically you don't need this delimiter at all. This is probably mostly a remnant of PHP's Perl heritage (in Perl regexes are delimited, but are not contained in a string). Other languages that use strings (because they have no native regex literals), such as Java, C# or PowerShell do well without the delimiter. In PHP you can add options after the closing delimiter, such as /a/i which matches a or A (case-insensitively), but the regex (?i)a does exactly the same and doesn't need delimiters.
And next time you take the time to read through Regular-Expressions.info, it's an awesome reference on regex basics and advcanced topics, explaining many things very well and thoroughly. And please also take a look at the PHP documentation in this regard.

Well, please stick to one actual question per ... question.
This is an answer to question 3+4, as the other questions have allready been answered.
Regexpes are generally delimited by /, e.g. /abc123/ or /foo|bar/i. In php, you can use whatever character for this you want. You are not limited to /, i.e. you can use e.g. # or %, #/usr/local/bin#.

In RegEx, how do you find a line that contains no more than 3 unique characters?

I am looping through a large text file and im looking for lines that contain no more than 3 different characters (those characters, however, can be repeated indefinitely). I am assuming the best way to do this would be some sort of regular expression.
All help is appreciated.
(I am writing the script in PHP, if that helps)

Regex optimisation fun time exercise for kids! Taking gnarf's regex as a starting point:
^(.)\1*(.)?(?:\1*\2*)*(.)?(?:\1*\2*\3*)*$
I noticed that there were nested and sequential *s here, which can cause a lot of backtracking. For example in 'abcaaax' it will try to match that last string of ‘a’s as a single \1* of length 3, a \1* of length two followed by a single \1, a \1 followed by a 2-length \1*, or three single-match \1s. That problem gets much worse when you have longer strings, especially when due to the regex there is nothing stopping \1 from being the same character as \2.
^(.)\1*(.)?(?:\1|\2)*(.)?(?:\1|\2|\3)*$
This was over twice as fast as the original, testing on Python's PCRE matcher. (It's quicker than setting it up in PHP, sorry.)
This still has a problem in that (.)? can match nothing, and then carry on with the rest of the match. \1|\2 will still match \1 even if there is no \2 to match, resulting in potential backtracking trying to introduce the \1|\2 and \1|\2|\3 clauses earlier when they can't result in a match. This can be solved by moving the ? optionalness around the whole of the trailing clauses:
^(.)\1*(?:(.)(?:\1|\2)*(?:(.)(?:\1|\2|\3)*)?)?$
This was twice as fast again.
There is still a potential problem in that any of \1, \2 and \3 can be the same character, potentially causing more backtracking when the expression does not match. This would stop it by using a negative lookahead to not match a previous character:
^(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?$
However in Python with my random test data I did not notice a significant speedup from this. Your mileage may vary in PHP dependent on test data, but it might be good enough already. Possessive-matching (*+) might have helped if this were available here.
No regex performed better than the easier-to-read Python alternative:
len(set(s))<=3
The analogous method in PHP would probably be with count_chars:
strlen(count_chars($s, 3))<=3
I haven't tested the speed but I would very much expect this to be faster than regex, in addition to being much, much nicer to read.
So basically I just totally wasted my time fiddling with regexes. Don't waste your time, look for simple string methods first before resorting to regex!

At the risk of getting downvoted, I will suggest regular expressions are not meant to handle this situation.
You can match a character or a set of characters, but you can't have it remember what characters of a set have already been found to exclude those from further match.
I suggest you maintain a character set, you reset it before you begin with a new line, and you add there elements while going over the line. As soon as the count of elements in the set exceeds 3, you drop the current line and proceed to the next.

Perhaps this will work:
preg_match("/^(.)\\1*(.)?(?:\\1*\\2*)*(.)?(?:\\1*\\2*\\3*)*$/", $string, $matches);
// aaaaa:Pass
// abababcaaabac:Pass
// aaadsdsdads:Pass
// aasasasassa:Pass
// aasdasdsadfasf:Fail
Explaination:
/
^ #start of string
(.) #match any character in group 1
\\1* #match whatever group 1 was 0 or more times
(.)? #match any character in group 2 (optional)
(?:\\1*\\2*)* #match group 1 or 2, 0 or more times, 0 or more times
#(non-capture group)
(.)? #match any character in group 3 (optional)
(?:\\1*\\2*\\3*)* #match group 1, 2 or 3, 0 or more times, 0 or more times
#(non-capture group)
$ #end of string
/
An added benifit, $matches[1], [2], [3] will contain the three characters you want. The regular expression looks for the first character, then stores it and matches it up until something other than that character is found, catches that as a second character, matching either of those characters as many times as it can, catches the third character, and matches all three until the match fails or the string ends and the test passes.
EDIT
This regexp will be much faster because of the way the parsing engine and backtracking works, read bobince's answer for the explanation:
/^(.)\\1*(?:(.)(?:\\1|\\2)*(?:(.)(?:\\1|\\2|\\3)*)?)?$/

for me - as a programmer with fair-enough regular expression knowledge this sounds not like a problem that you can solve using Regexp only.
more likely you will need to build a hashMap/array data structure key: character value:count and iterate the large text file, rebuilding the map for each line. at each new character check if the already-encountered character count is 2, if so, skip current line.
but im keen to be suprised if one mad regexp hacker will come up with a solution.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.