"vertical" regex matching in an ASCII "image" - php

Note: This is a question about possibilities of modern regex flavors. It's not about the best way to solve this using other methods. It's inspired by an earlier question, but that one is not restricted to regex.
The Problem
In an ASCII "image"/art/map/string like:
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
I'd like to find a simple vertical line formation of three Xs:
X
X
X
The number of lines is variable in the image, and the width of each line is variable too.
The Question(s)
With regex (PCRE/PHP, Perl, .NET or similar) is it possible to:
Determine if such a formation exists
Count the number of such formations/match the starting point of them all (4 in the above example)

Answer to question 1
To answer the first question one could use:
(?xm) # ignore comments and whitespace, ^ matches beginning of line
^ # beginning of line
(?:
. # any character except \n
(?= # lookahead
.*+\n # go to next line
( \1?+ . ) # add a character to the 1st capturing group
.*+\n # next line
( \2?+ . ) # add a character to the 2nd capturing group
)
)*? # repeat as few times as needed
X .*+\n # X on the first line and advance to next line
\1?+ # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X .*+\n # X on the 2nd line and advance to next line
\2?+ # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X # X on the 3rd line
Online demo
This expression works in Perl, PCRE, Java and should work in .NET.
The expression uses lookaheads with self referencing capturing groups to add a character for every repetition of the lookahead (this is used to "count").
\1?+ means if \1 matches (or is defined) consume it, and don't give it back (don't backtrack). In this case it's equivalent to (?(1) \1 ). Which means match \1 if \1 is defined.
polygenelubricants explains this kinds of lookaheads with backreferences very nicely in his answer for How can we match a^n b^n with Java regex?. (He has also written about other impressive tricks for Java regex involving backreferences and lookarounds.)
Answer to question 2
Plain matching
When just using matching and requiring the answer (count) in the number of matches, then the question 2 answer would be:
It can not be directly solved in regex flavors that have a limited lookbehind. While other flavors like Java and .NET could (as for example in m.buettner's .NET solution).
Thus plain regex matches in Perl and PCRE (PHP, etc) cannot directly answer this question in this case.
(Semi?)proof
Assume that no variable length lookbehinds are available.
You have to in some way count the number of characters on a line before an X.
Only way to do that is to match them, and since no variable length lookbehinds are available you have to start the match (at least) at the beginning of the line.
If you start the match at the beginning of a line you can only get at most one match per line.
Since there can be multiple occurrences per line, this would not count them all and would not give a correct answer.
Length/indirect solution
On the other hand if we accept the answer as the length of a match or substitution result, then the 2nd question can be answered in PCRE and Perl (and other flavors).
This solution is based on/inspired by m.buettner's nice "partial PCRE solution".
One could simply replace all matches of the following expression with $3, getting the answer to question two (the number of patterns of interests) as the length of the resulting string.
^
(?:
(?: # match .+? characters
.
(?= # counting the same number on the following two lines
.*+\n
( \1?+ . )
.*+\n
( \2?+ . )
)
)+?
(?<= X ) # till the above consumes an X
(?= # that matches the following conditions
.*+\n
\1?+
(?<= X )
.*+\n
\2?+
(?<= X )
)
(?= # count the number of matches
.*+\n
( \3?+ . ) # the number of matches = length of $3
)
)* # repeat as long as there are matches on this line
.*\n? # remove the rest of the line
Which in Perl could be written as:
$in =~ s/regex/$3/gmx;
$count = length $in;
Online demo
This expression is similar to the solution to question 1 above, with some modifications to include X in the characters matched in the first lookahead, wrapped with a quantifier and counting number of matches of the quantifier.
Except for direct matches this is as close as it gets (extra code wise besides regex), and could be an acceptable answer to question 2.
Test cases
Some test cases and results for the above solution. Result showing the numerical answer (length of the resulting string) and in parenthesis the resulting string after the substitution(s).
Test #0:
--------------------
X
X
X
result: 1 (X)
Test #1:
--------------------
..X....
..X....
..X....
result: 1 (.)
Test #2:
--------------------
..X.X..
..X.X..
....X..
result: 1 (.)
Test #3:
--------------------
..X....
..X....
...X...
result: 0 ()
Test #4:
--------------------
..X....
...X...
..X....
result: 0 ()
Test #5:
--------------------
....X..
.X..X..
.X.....
result: 0 ()
Test #6:
--------------------
.X..X..
.X.X...
.X.X...
result: 1 (.)
Test #7:
--------------------
.X..X..
.X..X..
.X..X..
result: 2 (.X)
Test #8:
--------------------
XXX
XXX
XXX
result: 3 (XXX)
Test #9:
--------------------
X.X.X
XXXXX
XXXXX
.X.X.
result: 5 (XXXXX)
Test #10:
--------------------
1....X.......
2..X..X...X....
3X.X...X..X.....
4X....XXXXXX.....
5X..XXX...........
6.....X..........
7.........X....X
8..X......X....X....
9..X......X....X....X...
A....X.....
B.X..X..
C.....
XXX
XXX
XXX
.
result: 8 (3458.XXX)

Edit
The following solutions have two grave problems:
They can't match multiple XXX sequences starting on the same line, as the pos advances too much.
The second solution is incorrect: it matches consecutive lines where two X are above each other. There don't neccessarily have to be three in a row.
Therefore, all upvotes (and the bounty) should go to either of m.buettner's comprehensive .NET answer or the fascinating PCRE answer from Qtax himself.
Original Answer
This is an answer using embedding of Perl code into regexes. Because a Perl regex can use code to assert arbitrary conditions inside regexes or emit partial regexes, they are not limited to matching regular languages or context-free languages, but can match some parts of languages higher up in the Chomsky hierarchy.
The language you want to match can be described in regex terms as
^ .{n} X .*\n
.{n} X .*\n
.{n} X
where n is a number. This is about as complex as matching the anbncn language which is the canonical example for a context-sensitive language.
We can match the first line easily, and use some Perl code to emit the regex for the other lines:
/^ (.*?) X
(?: .*\n (??{"." x length($1)}) X){2}
/mx
That was short! What does it do?
^ (.*?) X anchores at the start of a line, matches as few non-newline characters as possible and then the X. We remember the line up to the X as capture group $1.
We repeat a group two times which matches the rest of the line, a newline, and then injects a regex that matches a string of the same length as $1. After that, there must be an X.
This regex will now match every string that has three X on top of each other.
If we want to extract all such sequences, we'll have to be nifty. Because sequences may overlap, e.g.
.X
XX
XX
X.
the position where the next match starts must not proceed past the first X. We can do this via a lookbehind and lookahead. Perl only supports constant-length lookbehind, but has the \K escape which provides similar semantics. Thus
/^ (.*?) \K X
(?=( (?: .*\n (??{"."x length($1)}) X ){2} ))
/gmx
will match every sequence of three vertical Xes. Testing time:
$ perl -E'my$_=join"",<>; say "===\n$1X$2" while /^(.*?)\KX(?=((?:.*\n(??{"."x length($1)})X){2}))/gmx' <<'END'
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
END
===
..X..X...X....
X.X...X..X.....
X....XXXXX
===
X.X...X..X.....
X....XXXXXX.....
X
===
X....XXXXXX.....
X..XXX...........
.....X
===
..............X
..X...........X....
..X...........X
Note: this relies on experimental regex features that are available from at least Perl 5, v10 onward. The code was tested with a v16 perl.
Solution without embedded code
Let us look at the lines
...X...\n
...X..\n
We want to assert that the leading ... part of each line is of same length. We can do so by recursion with base case X.*\n:
(X.*\n|.(?-1).)X
If we anchor that at the start of a line, we can match two vertical Xes. To match more than two lines, we have to do the recursion in a lookahead and then advance the match position to the next line, and repeat. For this, we simply match .*\n.
This results in the following regex which can match a string with three vertical Xes:
/ ^
(?:
(?=( X.*\n | .(?-1). ) X)
.*\n # go to next line
){2}
/mx
But this isn't good enough, as we want to match all such sequences. To do this, we essentially put the whole regex into a lookahead. The regex engine makes sure to advance the position every time to produce a new match.
/ ^
(?=
(
(?:
(?= (X.*\n | .(?-1). ) X)
.*\n # go to next line
){2}
.* # include next line in $1
)
)
/mx
Testing time:
$ perl -E'my$_=join"",<>; say "===\n$1" while /^(?=((?:(?=(X.*\n|.(?-1).)X).*\n){2}.*))/gmx' <<'END'
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
END
===
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
===
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
===
X....XXXXXX.....
X..XXX...........
.....X..........
===
..............X
..X...........X....
..X...........X....X...
So this works as well as the solution with embedded code, that is, it matches each group of lines with vertical Xes, not each group of Xes. (Actually, this solution seems more fragile to me than embedded code)

First of all: brilliant question. I think it can be very instructive to try to take regex engines to their limits.
The basic .NET solution
You guys said in the comments that it would be easy with .NET, but since there is no answer for that yet, I thought I'd write one.
You can solve both question 1. and 2. using .NET's variable-length lookbehind and balancing groups. Most of the work is done by the balancing groups, but the variable-length lookbehind is crucial to be able to detected multiple matches starting on the same line.
Anyway, here is the pattern:
(?<= # lookbehind counts position of X into stack
^(?:(?<a>).)* # push an empty capture on the 'a' stack for each character
# in front of X
) # end of lookbehind
X # match X
(?=.*\n # lookahead checks that there are two more Xs right below
(?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an
# element onto 'b' and consume a character
(?(a)(?!)) # make sure that stack 'a' is empty
X.*\n # match X and the rest of the line
(?:(?<-b>).)* # while we can pop an element from stack 'b', and consume
# a character
(?(b)(?!)) # make sure that stack 'b' is empty
X # match a final X
) # end of lookahead
This pattern has to be used with RegexOptions.Multiline for the ^ to match the beginnings of lines (and obviously with RegexOptions.IgnorePatternWhitespace for freespacing mode to work).
Here are some additional comments:
By putting everything except the initial X into lookarounds, we have no problems with overlapping matches or even matches starting on the same line. However, the lookbehind has to be of variable-length which certainly constrains any solution of this kind to .NET.
The rest relies on a good grasp on balancing groups. I won't go into this in detail here, because it makes for quite long answers in itself. (see MSDN and this blog post for even more information)
The lookbehind just matches ^.*, so everything until the start of the line, but for every . we push an empty capture onto stack a, thereby counting the position of our X as the size of the stack.
Then after consuming the rest of the line in the lookahead, we match again just .*, but before consuming each ., we pop one element from stack a (which leads to failure, once a is empty) and push an empty capture onto b (so that we don't forget how many characters there have to be for the third line).
To make sure that we really empty the entire stack, we use (?(a)(?!)). This is a conditional pattern, that tries to match (?!) if stack a is not empty (and is simply skipped otherwise). And (?!) is an empty negative lookahead, which always fails. Hence, this simply encodes, "is a not empty? fail. otherwise, continue".
Now that know we've consumed exactly the right amount of characters in the new line, we try to match a X and the rest of the line. Then we repeat the same process again with stack b. Now there is no need to push onto any new stack, because if this works, we're done. We check that b is empty after this, and match a third X.
Finally, an optimization side note: this pattern still works if all repetitions are wrapped in atomic groups (thereby emulating possessive quantifiers, which are not supported by .NET)! This would save a lot of backtracking. Moreover, if we put at least the stack-popping quantifiers in atomic groups, we can get rid of both (?(...)(?!)) checks (as these are only needed for cases, where the preceding repetition had to backtrack).
The full .NET solution
(Only the bravest of adventurers should follow me into the really dark cave I'm about to descend into...)
As discussed in the comments, this solution has one drawback: it counts overlapping matches. E.g.
..X..
..X..
..X..
..X..
Gives two matches, one in the first and one in the second line. We'd like to avoid this, and report only one match (or two if there are 6 to 8 Xs and three if there are 9 to 11 Xs and so on). Moreover, we want to report the matches at the 1st, 4th, 7th, ... X.
We can adjust the above pattern to allow for this solution by requiring that the first X is preceded by an integer multiple of 3 other Xs that statisfy our requirements. The basic idea of checking this uses the same stack manipulation as before (except we shift things around between 3 stacks so that after finding three Xs we end up where we started). To do this we have to fiddle a bit with the lookbehind.
There is a catch though. .NET's variable-length lookbehind uses another .NET-unique feature, RightToLeftMode, in which the pattern is read (and matched) from right to left. Normally this doesn't need to bother us, but when we combine this with balancing groups, we might be in for some unpleasant surprises. In particular, when considering how our capture stacks evolve, we need to construct (and read) the expression from right to left (or bottom to top) as well.
So when you read the following expression (and my annotations), start at the end of the outermost lookbehind (you'll have to scroll a bit) - i.e. just before the only top-level X; then read all the way up to the top. And then continue after the lookbehind.
(?<=
# note that the lookbehind below does NOT affect the state of stack 'a'!
# in fact, negative lookarounds can never change any capturing state.
# this is because they have to fail for the engine to continue matching.
# and if they fail, the engine needs to backtrack out of them, in which
# case the previous capturing state will be restored.
(?<! # if we get here, there is another X on top of the last
# one in the loop, and the pattern fails
^ # make sure we reached the beginning of the line
(?(a)(?!)) # make sure that stack 'a' is empty
(?:(?<-a>).)* # while we can pop an element from stack 'a', and consume
# a character
X.*\n # consume the next line and a potential X
)
# at this point we know that there are less than 3 Xs in the same column
# above this position. but there might still be one or two more. these
# are the cases we now have to eliminate, and we use a nested negative
# lookbehind for this. the lookbehind simply checks the next row and
# asserts that there is no further X in the same column.
# this, together with the loop, below means that the X we are going to match
# is either the topmost in its column or preceded by an integer multiple of 3
# Xs - exactly what we are looking for.
(?:
# at this point we've advanced the lookbehind's "cursor" by exactly 3 Xs
# in the same column, AND we've restored the same amount of captures on
# stack 'a', so we're left in exactly the same state as before and can
# potentially match another 3 Xs upwards this way.
# the fact that stack 'a' is unaffected by a full iteration of this loop is
# also crucial for the later (lookahead) part to work regardless of the
# amount of Xs we've looked at here.
^ # make sure we reached the beginning of the line
(?(c)(?!)) # make sure that stack 'a' is empty
(?:(?<-c>)(?<a>).)* # while we can pop an element from stack 'c', push an
# element onto 'a' and consume a character
X.*\n # consume the next line and a potential X
(?(b)(?!)) # make sure that stack 'b' is empty
(?:(?<-b>)(?<c>).)* # while we can pop an element from stack 'b', push an
# element onto 'c' and consume a character
X.*\n # consume the next line and a potential X
(?(a)(?!)) # make sure that stack 'a' is empty
(?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an
# element onto 'b' and consume a character
X.*\n # consume the next line and a potential X
)* # this non-capturing group will match exactly 3 leading
# Xs in the same column. we repeat this group 0 or more
# times to match an integer-multiple of 3 occurrences.
^ # make sure we reached the beginning of the line
(?:(?<a>).)* # push an empty capture on the 'a' stack for each
# character in front of X
) # end of lookbehind (or rather beginning)
# the rest is the same as before
X # match X
(?=.*\n # lookahead checks that there are two more Xs right below
(?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an
# element onto 'b' and consume a character
(?(a)(?!)) # make sure that stack 'a' is empty
X.*\n # match X and the rest of the line
(?:(?<-b>).)* # while we can pop an element from stack 'b', and consume
# a character
(?(b)(?!)) # make sure that stack 'b' is empty
X # match a final X
) # end of lookahead
Working demo on RegexHero.net.
I interspersed all explanation right with the pattern this time. So if you read the pattern in the way I recommended above, you get the explanation right when you need it...
Now that was one hell of a beast. But it satisfies the entire specification now and shows off just how powerful .NET's regex flavor really is. And, although this seems quite horrible, I think (once you realise the right-to-left-thing) this is much more easily understandable than a comparable solution with PCRE (using recursion or otherwise).
As Kobi mentioned in a comment below, this could be shortened a good bit, if you accept that your results are found in multiple captures of a single match (e.g., if you have a column of 7 Xs you only get one match, but with 2 captures in a certain group). You can do this by repeating the main (lookahead) part 1 or more times and capturing the initial X (put everything in a lookahead though). Then the lookbehind does not need to count off triples of Xs, but only has to check that there is no leading X. This would probably cut the size of the pattern in half.
The partial PCRE solution
(If only the bravest of adventurers followed me through the last solution, I am probably only left with madmen on the next journey...)
To prove what I just said about how the above solution compares to PCRE, let's look at how we can even remotely solve the full problem in PCRE. We'll have to work a good bit harder without variable-length lookbehinds and balancing groups.
Qtax (the OP) provided a brilliant solution to his first question (checking whether the string contains any X-column) using self-referencing groups to count. This is a very elegant and compact solution. But because each match goes from the beginning of the line to the X that starts the column, and matches cannot overlap, we can't get multiple matches per line. We could try to put everything in a lookahead (so that nothing is actually matched), but two zero-width matches will also never start at the same position - so we'll still get only one match per candidate line.
However it is indeed possible to solve at least the first part of question 2 with PCRE: count the number of columns starting in each line (and hence to total amount of X columns). Since we cannot get this count in the form of individual matches (see previous paragraph), and we cannot get this count in the form of individual groups or captures (since PCRE provides only a fixed and finite number of captures, as opposed to .NET). What we can do instead is to encode the number of columns in the matches.
Here is how: for each line we check if there's a column starting. If so, we include one character in a certain capturing group. Then, before reporting a successful match, we try to find as many further columns as possible - for each one adding a character to that particular group. By doing this, we encode the number of columns starting in each line in the length of that particular capture.
Actually realizing this concept in a regex is a lot more complicated than it may first sound (and it already sounds quite complicated). Anyway, here it is:
^
(?:(?|
(?(5)(?![\s\S]*+\5))
(?!(?!)()())
(?=
(?:
.
(?=
.*+\n
( \3? . )
.*+\n
( \4? . )
)
)*?
X .*+\n
\3
X .*+\n
\4
)
()
|
(?(5)(?=[\s\S]*+\5)|(?!))
(?:
.
(?=
.*+\n
( \1? .)
.*+\n
( \2? .)
)
)+?
(?=
(?<=X).*+\n
(\1)
(?<=X).*+\n
(\2)
(?<=X)
)
(?=
([\s\S])
[\s\S]*
([\s\S] (?(6)\6))
)
){2})+
(Actually, it's a bit easier than that - see Qtax's answer for how to simplify this approach. I'll leave this approach here anyway for academic reasons, as some very advanced and interesting techniques can be learned from it - see the summary at the end.)
Yes, there are no annotations. I figured, no one would actually read them anyway, so instead I'll try to break this expression down in parts (I'll go for a top-down approach).
So let's look at the outer layer of onion from hell:
^
(?:(?|
checkForNextColumn
|
countAndAdvance
){2})+
So our matches are again anchored to the beginnings of lines. Then we have a (?:...{2})+ which means an even number of repetitions of something. And that something is an alternation of two subpatterns. These subpatterns represent the steps I mentioned above. The first one checks that there is another column starting in the current line, the second one registers a count and prepares the engine's state for another application of the first subpattern. So control is given to the second pattern - the first just checks for another column using a lookahead and is hence a zero-width pattern. This is why I cannot simply wrap everything in + but have to do the {2})+ thing - otherwise the zero-width component would be tried only once; that's a necessary optimization applied by pretty much all engines to avoid infinite loops with patterns like (a*)+.
There is one more (very important detail): I used (?|...) for the alternation. In this kind of grouping, each alternative starts with the same group number. Hence in /(?|(a)|(b))/ both a and b can be captured into group 1. This is the crucial trick that allows "communication" between subpatterns, as they can modify the same groups.
Anyway... so we have these two subpatterns. We'd like to make sure that control really alternates between them. So that each group fails if it was the last one that matched. We do this by wrapping the pattern in some grouping-and-referencing magic:
^(?:(?|
(?(5)(?![\s\S]*+\5)) # if group 5 has matched before make sure that
# it didn't match empty
checkForNextColumn # contains 4 capturing groups
() # this is group 5, match empty
|
(?(5)(?=[\s\S]*+\5)|(?!)) # make sure that group 5 is defined and that it
# matched empty
advanceEngineState # contains 4 capturing groups
(?=
([\s\S]) # this is group 5, match non-empty
[\s\S]* # advance to the end very end of the string
([\s\S] (?(6)\6)) # add a character from the end of the string to
# group 6
)
){2})+
So at the end of each alternative, we'll invalidate the condition for this alternative to even start matching. At the end of the second alternative we'll also include a character into group 6, using the technique outlined by Qtax. This is the counting step. I.e., group 6 will contain as many characters as there are columns starting in the current line.
Now checkForNextColumn will really just be Qtax's solution inside a lookahead. It needs one more modification though and to justify this we'll look into advanceEngineState first.
Let's think about how we would want to modify the state, for Qtax's solution to match a second column in a row. Say we have input
..X..X..
..X..X..
..X..X..
and we want to find the second column. This could be accomplished, by starting the match from the position just after the first X and having groups \1 and \2 already initialised to the first three characters (..X) of rows 2 and 3, respectively (instead of them being empty).
Now let's try to do this: match everything up to and including the next X that starts a column, then fill two groups with the corresponding line-prefixes for use in the checkForNextColumn pattern. This is again pretty much Qtax's pattern, except that we count the X in (instead of stopping right before it), and that we need to add the capturing into a separate group. So here is advanceEngineState:
(?:
.
(?=
.*+\n
( \1? .)
.*+\n
( \2? .)
)
)+?
(?=
(?<=X) .*+\n
(\1)
(?<=X) .*+\n
(\2)
(?<=X)
)
Note how I turned the Xs into lookbehinds, to go one character further, and how I effectively copy the final contents of \1 into \3 and those of \2 into \4.
So if we now use Qtax's solution as checkForNextColumn in a lookahead, using groups \3 and \4 instead of \1 and \2, we should be done.
But just how do we make those groups \3 and \4 instead of \1 and \2? We could start the pattern with ()(), which would always match, without affecting the engine's cursor, but increase the group count by 2. However, this is problematic: this resets groups 1 and 2 to empty strings, which if we find a second column, advanceEngineState will be in an inconsistent state (as the engine's global position has been advanced, but the counting groups are zero again). So we want to get those two groups into the pattern, but without affecting what they are currently capturing. We can do this by utilizing something I already mentioned with the .NET solutions: groups in negative lookarounds do not affect the captured contents (because the engine needs to backtrack out of the lookaround to proceed). Hence we can use (?!(?!)()()) (a negative lookahead that can never cause the pattern to fail) to include two sets of parentheses in our pattern, that are never used. This allows us to work with groups 3 and 4 in our first subpattern, while keeping groups 1 and 2 untouched for the second subpatterns next iteration. In conclusion this is checkForNextColumn:
(?!(?!)()())
(?=
(?:
.
(?=
.*+\n
( \3? . )
.*+\n
( \4? . )
)
)*?
X .*+\n
\3
X .*+\n
\4
)
Which, for the most part actually looks really familiar.
So this is it. Running this against some input will give us a group 6 which contains one capture for each line that has a column starting - and the capture's length will tell us how many columns started there.
Yes, it really works (live demo).
Note that this (like the basic .NET solution) will overcount columns that are more than 3 Xs long. I suppose it is possible to correct this count with lookaheads (in a similar way to the lookbehind of the full .NET solution), but this is left as an exercise to the reader.
It's a bit unfortunate that the base problem of this solution is already very convoluted and bloats the solution (75% of the lines are mostly just copies of Qtax's solution). Because the surrounding framework has some really interesting techniques and lessons:
We can have multiple subpatterns that accomplish specific matching/counting tasks, and have them "communicate" through mutual capturing groups, by putting them in a (?|...) alternation and looping over them.
We can force zero-width alternatives to be carried out over and over again by wrapping them in a finite quantifier like {2} before putting everything into +.
Group numbers can be skipped in one subpattern (without affecting the captured contents) by putting them into a never-failing negative lookahead like (?!(?!)()).
Control can be passed back and forth between subpatterns by capturing something or nothing in a certain group that is checked upon entering the alternation.
This allows for some very powerful computations (I've seen claims that PCRE is in fact Turing-complete) - although this is certainly the wrong approach for productive use. But still trying to understand (and come up) with such solutions can be a very challenging and somehow rewarding exercise in problem solving.

If you want to find a single "vertical" pattern, here's a solution. If you want to also match a "horizontal" pattern, try doing it with a separate match, perhaps checking for overlapping match positions. Remember that the computer has not idea what a line is. It's an arbitrary thing made up by humans. The string is just a one-dimensional sequence where we denote some character(s) to be a line ending.
#!/usr/local/perls/perl-5.18.0/bin/perl
use v5.10;
my $pattern = qr/XXX/p;
my $string =<<'HERE';
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
HERE
$transposed = transpose_string( $string );
open my $tfh, '<', \ $transposed;
while( <$tfh> ) {
while( /$pattern/g ) {
my $pos = pos() - length( ${^MATCH} );
push #found, { row => $pos, col => $. - 1 };
pos = $pos + 1; # for overlapping matches
}
}
# row and col are 0 based
print Dumper( \#found ); use Data::Dumper;
sub transpose_string {
my( $string ) = #_;
open my $sfh, '<', \ $string;
my #transposed;
while( <$sfh> ) {
state $row = 0;
chomp;
my #chars = split //;
while( my( $col, $char ) = each #chars ) {
$transposed[$col][$row] = $char;
}
$row++;
}
my #line_end_positions = ( 0 );
foreach my $col ( 0 .. $#transposed ) {
$transposed .= join '', #{ $transposed[$col] };
$transposed .= "\n";
}
close $sfh;
return $transposed;
}

Working Solution to Question 2
It's fully possible in Perl/PCRE! :)
Sorry, I'm a little extremely late to the party, but I'd just like to point out that you can in fact count the number of XXX formations found; that is, structure the expression such that there is exactly one match per formation when performing a global match. It's quite tricky, though.
Here it is:
\A(?:(?=(?(3)[\s\S]*(?=\3\z))(?|.(?=.*\n(\1?+.).*\n(\2?+.))|.*\n()())+?(?<=X)(?=.*\n\1(?<=X).*\n\2(?<=X))(?=([\s\S]*\z)))(?=[\s\S]*([\s\S](?(4)\4)))[\s\S])+[\s\S]*(?=\4\z)|\G(?!\A|[\s\S]?\z)
In action on regex101
I should probably add some comments to that! Here, for those interested:
\A(?:
(?=
(?(3)[\s\S]*(?=\3\z)) # Resume from where we ended last iteration
(?| # Branch-reset group used to reset \1
.(?=.*\n(\1?+.).*\n(\2?+.)) # and \2 to empty values when a new line
| # is reached. ".*\n" is used to skip the
.*\n()() # rest of a line that is longer than the
)+? # ones below it.
(?<=X)(?=.*\n\1(?<=X).*\n\2(?<=X)) # Find a XXX formation
(?=([\s\S]*\z)) # Save the rest of the line in \3 for
) # when we resume looking next iteration
(?=[\s\S]*([\s\S](?(4)\4))) # For every formation found, consume another
# character at the end of the subject
[\s\S] # Consume a character so we can move on
)+
[\s\S]*(?=\4\z) # When all formations around found, consume
# up to just before \4 at the subject end.
|
\G(?!\A|[\s\S]?\z) # Now we just need to force the rest of the
# matches. The length of \4 is equal to the
# number of formations. So to avoid overmatching,
# we need to exclude a couple of cases.
I'm scared; what's happening??
Basically, we're passing through the entire subject in one repeated non-capture group, moving from one XXX formation to the next. For each formation found, tack another character onto a counter at the end of the subject (stored in \4). There were a few hurdles to overcome:
If we're matching everything in one go, how can we move from one line to the next? Solution: use a branch-reset group to reset \1 and \2 when a new line is encountered.
What if we have a large ASCII grid with a small "\nX\nX\nX" at the end? If we consume the subject from one formation to the next, we will start eating into our counter. Solution: consume only one character at a time, and wrap the formation-seeking logic in a lookahead.
But then how do we know where to resume looking on the next iteration of the non-capture group if we're not consuming from one formation to the next? Solution: when a formation is found, capture the characters from that position to the very end of the subject - a fixed point that can always be referred to. This is the same trick I used to match nested brackets without recursion, which really exemplifies the power of forward references.
This was a lot of fun and I would love to see more posts like this!

You could rotate the image, and then search for XXX.

My approach to match vertical patterns using PHP.
First of all, let's rotate our input by 90°:
// assuming $input contains your string
$input = explode("\n", $input);
$rotated = array();
foreach ($input as $line)
{
$l = strlen($line);
for ($i = 0; $i < $l; $i++)
{
if (isset($rotated[$i]))
$rotated[$i] .= $line[$i];
else
$rotated[$i] = $line[$i];
}
}
$rotated = implode("\n", $rotated);
This results in
..XXX.....
..........
.XX....XX.
....X.....
X...X....X
.X.XXX....
..XX......
...X......
...X......
.XXX......
...X.....
.........
........
........
....XXX
.....
...
..
..
X
.
.
.
Now this might look strange, but actually brings us closer since we can now simply preg_match_all() over it:
if (preg_match_all('/\bXXX\b/', $rotated, $m))
var_dump($m[0]);
et voila:
array(4) {
[0] =>
string(3) "XXX"
[1] =>
string(3) "XXX"
[2] =>
string(3) "XXX"
[3] =>
string(3) "XXX"
}
If you also want to match the same horizontal pattern, simply use the same expression on the non-rotated input.

Related

Using RegEx to find a string (as variable number)

How can I find numbers inside certain strings in php?
For example, having this text inside a page, I would like to find for
|||12345|||
or
|||354|||
I'm interested in the numbers, they always change according to the page I visit (numbers being the id of the page and 3-5 characters length).
So the only thing I know for sure is those pipes surrounding the numbers.
Thanks in advance.
Using this \|\|\|\K\d{3,5}(?=\|\|\|)
gives many advantages.
https://regex101.com/r/LtbKfM/1
First, three literals without a quantifier is a simple strncmp() c
call. Also, anytime a regex starts with an assertion it is
inherently slower. Therefore, this is the fastest match for the 3
leading pipe symbols.
Second, using the \K construct excludes whatever was previously
matched from group 0. We don't want to get the 3 pipes in the
match, but we do want to match them.
edit
Note that capture group results are not stored in a special string
buffer.
Each group is really a pointer (or offset) and a length.
The pointer (or offset) is to somewhere in the source string.
When it comes time to extract a particular group string, the overload function for braces
matches[#] uses the pointer (or offset) and length to create and return a string instance.
Using the \K construct simply sets the group 0 pointer (or offset)
to the position in the string that represents the position that
matched after the \K construct.
Third, using a lookahead assertion for 3 pipe symbols does not
consume the symbols as far as the next match is concerned. This
makes these symbols available for the next match. I.e:
|||999|||888||| would get 2 matches as would
|||999|||||888|||.
The result is an array of just the numbers.
Formatted
\|\|\| # 3 pipe symbols
\K # Exclude previous items from the match (group 0)
\d{3,5} # 3-5 digits
(?= \|\|\| ) # Assertion, not consumed, 3 pipe symbols ahead
While #S.Kablar's suggestion is pretty valid, it makes use of a syntax that may be difficult for a beginner.
The more casual way to achieve your goal would be as follows:
$text = 'your input string';
if (preg_match_all('~\|{3}(\d+)\|{3}~', $text, $matches)) {
foreach($matches[1] as $number) {
var_dump($number); // prints smth like string(3) "345"
}
}
The breakdown of the regex:
~ and ~ surround the expression
\| stands for the pipe, which is a special character in regex and must be escaped with a backslash
{3} says 'the previous (the pipe) must be present exactly three times'
( and ) enclose a subpattern so that it is stored under $matches[1]
\d requires a digit
+ says 'the previous (a digit) may be repeated but must have at least one instance'

Variable sized lookahead consume

Im a trying to use a regular expression to parse a varying string using php, that string can be, for example;
"twoX // threeY"
or
"twoX /// threeY"
So there is a left-keyword, a divider consisting of 2 or 3 slashes and a right-keyword. These are also the parts I would like to consume separately.
"/((?<left>.+)?)(?=(?<divider>[\/]{2,3}))([\/]{2,3})((?<right>.+)?)/";
When I use this regular expression on the first string, everything gets parsed correctly, so;
left: twoX
divider: //
right: threeY
but when I run this expression on the second string, the left and the divider don't get parsed properly. The result I then get is;
left: twoX /
divider: //
right: threeY
I do use the {2,3} in the regular expression to either select 2 or 3 slashes for the divider. But this somehow doesn't seem to work with the match-all character .
Is there a way to get the regex to parse either 2 or 3 slashes without duplicating the entire sequence?
The (.+)? is a greedy dot matching pattern and matches as many chars as possible with 1 being the minimum. So, since the next pattern requires only 2 chars, only 2 chars will be captured into the next group, the first / will belong to Group 1.
Use a lazy pattern in the first group:
'~(?<left>.*?)(?<divider>/{2,3})(?<right>.*)~'
^^^
See the regex demo. Add ^ and $ anchors around the pattern to match the whole string if necessary.
Note you do not need to repeat the same pattern in the lookahead and the consuming pattern part, it only makes the pattern cumbersome, (?=(?<divider>[\/]{2,3}))([\/]{2,3}) = (?<divider>[\/]{2,3}).
Details
(?<left>.*?) - Group "left" that matches any 0+ chars other than line break chars as few as possible
(?<divider>/{2,3}) - 2 or 3 slashes (no need to escape since ~ is used as a regex delimiter)
(?<right>.*) - Group "right" matching any 0+ chars other than line break chars as many as possible (up to the end of line).
And a more natrual-looking splitting approach, see a PHP demo:
$s = "twoX // threeY";
print_r(preg_split('~\s*(/{2,3})\s*~', $s, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY));
// => Array ( [0] => twoX [1] => // [2] => threeY )
You lose the names, but you may add them at a later step.
The + quantifier is greedy by default, meaning that it will try and match as many characters as possible. So you want to make the first + lazy so it will not try and match the first / by adding the ? quantifier you can make the + lazy: +?.
This will result in the following regex:
((?<left>.+?)?)(?=(?<divider>[\/]{2,3}))([\/]{2,3})((?<right>.+)?)

How to match multiple substrings that occur after a specific substring?

I am trying to read out the server names from a nginx config file.
I need to use regex on a line like this:
server_name this.com www.this.com someother-example.com;
I am using PHP's preg_match_all() and I've tried different things so far:
/^(?:server_name[\s]*)(?:(.*)(?:\s*))*;$/m
// no output
/^(?:server_name[\s]*)((?:(?:.*)(?:\s*))*);$/m
// this.com www.this.com someother-example.com
But I can't find the right one to list the domains as separate values.
[
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com'
]
as Bob's your uncle wrote:
(?:server_name|\G(?!^))\s*\K[^;|\s]+
Does the trick!
The plain English requirement is to extract the space-delimited strings that immediately follow server_name then several spaces.
The dynamic duo of \G (start from the start / continue from the end of the last match) and \K (restart the fullstring match) will be the heroes of the day.
Code: (Demo)
$string = "server_name this.com www.this.com someother-example.com;";
var_export(preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $string, $out) ? $out[0] : 'no matches');
Output:
array (
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com',
)
Pattern Explanation:
(?: # start of non-capturing group (to separate piped expressions from end of the pattern)
server_name + # literally match "server_name" followed by one or more spaces
| # OR
\G(?!^) # continue searching for matches immediately after the previous match, then match a single space
) # end of the non-capturing group
\K # restart the fullstring match (aka forget any previously matched characters in "this run through")
[^; ]+ # match one or more characters that are NOT a semicolon or a space
The reason that you see \G(?!^) versus just \G (which, for the record, will work just fine on your sample input) is because \G can potentially match from two different points by its default behavior. https://www.regular-expressions.info/continue.html
If you were to use the naked \G version of my pattern AND add a single space to the front of the input string, you would not make the intended matches. \G would successfully start at the beginning of the string, then match the single space, then server_name via the negated character class [^; ].
For this reason, disabling \G's "start at the start of the string` ability makes the pattern more stable/reliable/accurate.
preg_match_all() returns an array of matches. The first element [0] is a collection of fullstring matches (what is matched regardless of capture groups). If there are any capture groups, they begin from [1] and increment with each new group.
Because you need to match server_name before targeting the substrings to extract, using capture groups would mean a bloated output array and an unusable [0] subarray of fullstring matches.
To extract the desired space-delimited substrings and omit server_name from the results, \K is used to "forget" the characters that are matched prior to finding the desired substrings. https://www.regular-expressions.info/keep.html
Without the \K to purge the unwanted leading characters, the output would be:
array (
0 => 'server_name this.com',
1 => ' www.this.com',
2 => ' someother-example.com',
)
If anyone is comparing my answer to user3776824's or HamZa's:
I am electing to be very literal with space character matching. There are 4 spaces after server_name, so I could have used an exact quantifier {4} but opted for a bit of flexibility here. \s* isn't the most ideal because when matching there will always be "one or more spaces" to match. I don't have a problem with \s, but to be clear it does match spaces, tabs, newlines, and line returns.
I am using (?!^) -- a negative lookahead -- versus (?<!^) -- a negative lookbehind because it does the same job with a less character. You will more commonly see the use of \G(?!^) from experienced regex craftsmen.
There is never a need to use "alternative" syntax (|) within a character class to separate values. user3776824's pattern will actually exclude pipes in addition to semicolons and spaces -- though I don't expect any negative impact in the outcome based on the sample data. The pipe in the pattern simply should not be written.

PHP RegEx: match a list in an unknown order

I am trying to match some CSS properties. However, I cannot predict the order they will be in.
Example:
header p {
color:#f2f3ed;
background-color:#353535;
background-image: url(../images/dots.png);
}
However, I should also expect:
header p {
background-image: url(../images/dots.png);
background-color:#353535;
color:#f2f3ed;
}
And any other combination of these three properties. The web application we are building only gives me access to the preg_match function.
Anyone know a way to match every possible combination of properties? I know exactly what the properties will be, I just do not know which order they will come in.
Saying that, I am trying to find a quicker solution than typing out every possible combination and separating it with |
The super-inexact but also stupidly-easy approach would be use a list of alternatives:
/ ( \s* color:#\w+; | \s* bbb:... | \s* ccc:... ){3} /x
The quantifier {3} will ensure that three of the alternatives are present, and the order won't matter.
It would however allow three color: properties to match. You'll have to decide if that's important enough, or if it's unlikely enough that anyone will write three consecutive color: statements in your CSS declarations.
Regex for existence of some words whose order doesn't matter
This works, thanks to the lookahead operator (?= ).
/^(?=.*(?:color|background-color|background-image)\:.*?;).*/gm
or with comments
^ # beggining of line
(?= # positive lookahead (true if pattern present after current position; in this case the beginning of current line)
.* # anything before the pattern we search for
(?:color|background-color|background-image)\: # match any of those 3 sequences followed by : char
.*?; # anything followed by ; but not looking beyond the first ; found
)
.* # match the rest of the chars from where the (?= ) was found.
/gmx # MODIFIERS: global, multi-line, extended (for comments, ignores spaces and anything after #)
You can try it here:
https://regex101.com/r/tX0hK3/1
The ^ before (?=.* is very important for efficiency!
Otherwise the engine would try the search once for each position.
The ending .* (or .+ or .++, possessive) is just for matching the line now we have find the right content.

How does this PCRE pattern detect palindromes?

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.
Examine this PCRE pattern in PHP snippet:
$palindrome = '/(?x)
^
(?:
(.) (?=
.*
(
\1
(?(2) \2 | )
)
$
)
)*
.?
\2?
$
/';
This pattern seems to detect palindromes, as seen in this test cases (see also on ideone.com):
$tests = array(
# palindromes
'',
'a',
'aa',
'aaa',
'aba',
'aaaa',
'abba',
'aaaaa',
'abcba',
'ababa',
# non-palindromes
'aab',
'abab',
'xyz',
);
foreach ($tests as $test) {
echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);
}
So how does this pattern work?
Notes
This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).
Also, note that the PCRE man page presents a recursive pattern to match some palindromes:
# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.
Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of character in the opposite direction:
^(.)(.)(.) ... \3\2\1$
we want to rewrite this such that the ... is only followed by a finite length of patterns, so that it could be possible for us to convert it into a *. This is possible with a lookahead:
^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...
but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:
^(.)(?=.*(?<record>\1\k<record>)$) # \1 = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...
which could be converted into
^(?:
(.)(?=.*(\1\2)$)
)*
Almost good, except that \2 (the recorded capture) is not empty initially. It will just fail to match anything. We need it to match empty if the recorded capture doesn't exist. This is how the conditional expression creeps in.
(?(2)\2|) # matches \2 if it exist, empty otherwise.
so our expression becomes
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*
Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*\2$
We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$
This works good except in one case — when there is only 1 character. This is again due to \2 matches nothing. So
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.
I want to bring my very own solution to the table.
This is a regex that I've written a while ago to solve matching palindromes using PCRE/PCRE2
^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$
Example:
https://regex101.com/r/xvZ1H0/1

Categories