How does this PCRE pattern detect palindromes? - php

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.
Examine this PCRE pattern in a PHP snippet:
$palindrome = '/(?x)
^
(?:
(.) (?=
.*
(
\1
(?(2) \2 | )
)
$
)
)*
.?
\2?
$
/';
This pattern seems to detect palindromes, as seen in these test cases (see also on ideone.com):
$tests = array(
# palindromes
'',
'a',
'aa',
'aaa',
'aba',
'aaaa',
'abba',
'aaaaa',
'abcba',
'ababa',
# non-palindromes
'aab',
'abab',
'xyz',
);
foreach ($tests as $test) {
echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);
}
So how does this pattern work?
Notes
This pattern uses a nested reference, a technique similar to the one used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (it does use a conditional, though).
Also, note that the PCRE man page presents a recursive pattern to match some palindromes:
# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.

Let's try to understand the regex by constructing it. First, a palindrome must start and end with the same sequence of characters, in opposite order:
^(.)(.)(.) ... \3\2\1$
We want to rewrite this so that the ... is followed by only a finite amount of pattern, which would let us convert it into a *. This is possible with a lookahead:
^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...
But the lines still have parts that are not common to all of them. What if we could "record" the previously captured groups? If that were possible, we could rewrite it as:
^(.)(?=.*(?<record>\1\k<record>)$) # \1 = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...
which could be converted into
^(?:
(.)(?=.*(\1\2)$)
)*
Almost good, except that \2 (the recorded capture) is not set initially, so the backreference just fails to match anything. We need it to match the empty string if the recorded capture doesn't exist yet. This is how the conditional expression creeps in.
(?(2)\2|) # matches \2 if it exists, empty otherwise.
so our expression becomes
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*
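To see why the conditional is needed, here is a minimal sketch. The toy patterns and variable names are illustrative assumptions, not part of the palindrome pattern. In PCRE, a backreference to a group that has not been set fails, whereas the conditional falls back to matching nothing:

```php
<?php
// Toy patterns: group 1 is only set when the first alternative ('a') is taken.
$plainBackref = '/^(?:(a)|b)\1$/';         // \1 fails when group 1 is unset
$conditional  = '/^(?:(a)|b)(?(1)\1|)$/';  // empty branch when group 1 is unset

$unsetPlain = preg_match($plainBackref, 'b');  // group 1 unset, \1 cannot match
$unsetCond  = preg_match($conditional, 'b');   // conditional takes the empty branch
$setCond    = preg_match($conditional, 'aa');  // group 1 set, \1 matches the second 'a'

var_dump($unsetPlain, $unsetCond, $setCond);   // int(0), int(1), int(1)
```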
Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*\2$
We want to take care of odd-length palindromes as well. There would be a free character between the 1st and 2nd half.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$
This works well except for strings shorter than 2 characters (a single character, or the empty string). This is again because \2 is unset and fails to match. So
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.
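The effect of that last step can be checked with a short sketch. The two patterns are the last two construction steps written on one line; the variable names and the choice of sample strings are mine:

```php
<?php
// Last two construction steps, written on one line each.
$required = '/^(?:(.)(?=.*(\1(?(2)\2|))$))*.?\2$/';   // \2 is mandatory
$optional = '/^(?:(.)(?=.*(\1(?(2)\2|))$))*.?\2?$/';  // \2 is optional

$results = [];
foreach (['', 'a', 'aba', 'abba', 'abab'] as $s) {
    // Each entry: [result with mandatory \2, result with optional \2]
    $results[$s] = [preg_match($required, $s), preg_match($optional, $s)];
}
var_export($results);
```

Only the two shortest strings should differ between the variants: with a mandatory \2 they are rejected because no loop iteration ever sets group 2.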

I want to bring my very own solution to the table.
This is a regex that I wrote a while ago to match palindromes using PCRE/PCRE2:
^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$
Example:
https://regex101.com/r/xvZ1H0/1

Related

Avoid possible statement with regex

How can I remove the part of a string before the first slash, while keeping any year numbers (e.g. 2019 or 2019-2020) that precede that slash, using a regex?
//something is wrong here
preg_replace('/^[a-z0-9\-]+(-20[0-9]{2}(-20[0-9]{2})?)?/', '$1', $input_lines);
Needed:
abc-def/something/else/ [incl. slash if there is not character before it]
abc-def-2019/something/else/
abc-def-2019-2020/something/else/
abc-def-125-2019/something/else/
My initial closure was insufficient to handle all requirements. Yes, you have a greedy quantifier problem, but there is more to handle.
Code: (Demo) (Regex101 Demo)
$tests = [
'abc-def/something/else/',
'abc-def-2019/something/else/',
'abc-def-2019-2020/something/else/',
'abc-def-125-2019/something/else/'
];
var_export(
preg_replace('~^(?:[a-z\d]+-?)*?(?:/|(?=20\d{2}-?){1,2})~', '', $tests)
);
Output:
array (
0 => 'something/else/',
1 => '2019/something/else/',
2 => '2019-2020/something/else/',
3 => '2019/something/else/',
)
My pattern matches alphanumeric sequences, each optionally followed by a hyphen -- a subpattern that may be repeated zero or more times (lazily, aka non-greedy, so as few times as possible).
Then the first non-capturing group must be followed by a slash (which is matched) or by one of your year substrings, which may also have a trailing hyphen (this is not matched, only found via a lookahead).
If this doesn't suit your real project data, you will need to provide more, and more accurate, samples that reveal the fringe cases to test against.
If the forward slash has to be present and it should stop after the first occurrence of 2019 or 2020, you might use:
^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?
In separate parts that would look like
^ Start of string
(?=[a-z\d-]*/) Assert that a / is present
[a-zA-Z013-9-]+ Match 1+ times any of the listed (Note that the 2 is not listed)
(?> Atomic group
2(?!0(?:19|20)(?!\d)) Match 2 and assert what is on the right is not 019 or 020
| Or
[a-zA-Z013-9-]+ Match 1+ times any of the listed
)* Close group and repeat 0+ times
/? Match optional /
Regex demo | Php demo
Your code might look like
preg_replace('~^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?~', '', $input_lines);
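As a quick check, the same call can be run against the four sample strings from the question. This is only a sketch; the variable names are mine:

```php
<?php
$pattern = '~^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?~';

$samples = [
    'abc-def/something/else/',
    'abc-def-2019/something/else/',
    'abc-def-2019-2020/something/else/',
    'abc-def-125-2019/something/else/',
];

// With an array subject, preg_replace() returns an array of results.
$stripped = preg_replace($pattern, '', $samples);
var_export($stripped);
```

For these inputs the prefix is removed up to the first 2019, or up to and including the slash when no year is present.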

Match multiple times a group only in single regex

Hi my question is simple:
I want to match all the possible hashtags in an article only if they are in a <figcaption> with PCRE regex. E.g:
<figcaption>blah blah #hashtag1, #hashtag2</figcaption>
I made an attempt here https://regex101.com/r/aL9vS8/1 and removing the last ? would change the capture from #hashtag1 to #hashtag2 but can't get both.
I am not even sure it is doable in one single regex in PHP.
Any idea to help me? :)
If there is no way in one single regex (really? even working with recursion (?R)?? :p), please suggest the most efficient way possible performance wise.
Thank you!
[EDIT]
If there is no way, my PHP next idea is to:
Match every figcaption with preg_replace_callback
In the callback match every instance of #hashtag.
Can I get your opinions on this? Is there a better way? my articles are not very long.
Please suggest the most efficient way possible performance wise
The most reliable way to match some text in between some delimiters with PCRE regex is by using the custom boundaries with \G operator. However, the trailing boundary is a multicharacter string, and to match any text but the </figcaption> you'd need a tempered greedy token. Since this token is very resource consuming, it must be unrolled.
Here is a fast, reliable PCRE regex for your task:
(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+
See the regex demo
Details:
(?:<figcaption|(?!^)\G) - Matches <figcaption or the end of the previous successful match
More details: (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)) that only groups and does not keep track of what was matched (i.e. no value is stored for the group). It matches 2 alternatives (| is the alternation operator): 1) the literal text <figcaption or 2) (?!^)\G - the location after the previous successful match (note that \G also matches at the start of the string, thus we must add the negative lookahead (?!^) to exclude that behavior).
[^<#]* - 0+ chars other than < and #
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
(?:<(?!\/figcaption>)|#\B) - a < not followed with /figcaption> or # not followed with a word char
[^<#]* - 0+ chars other than < and #
\K - omit the text matched so far
#\w+ - # and 1+ word chars
Even more details:
\K:
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:
foo\Kbar
matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, we have an outer non-capturing group (?:...)* to enable matching a sequence of subpatterns zero or more times (a quantifier * can only repeat a sequence of subpatterns if they are grouped), and the inner non-capturing group in (?:<(?!\/figcaption>)|#\B)[^<#]* is just a way to shorten the longer <(?!\/figcaption>)[^<#]*|#\B[^<#]* (it groups the 2 different alternatives <(?!\/figcaption>) and #\B before a common "suffix" [^<#]*).
Wrapping in a tag: just use preg_replace with the <span class="highlight">$0</span> replacement pattern:
Code:
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = "<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd";
$subst = "<span class=\"highlight\">$0</span>";
$result = preg_replace($re, $subst, $str);
echo $result;
See the PHP IDEONE demo
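If the goal is to collect the hashtags rather than wrap them, the same pattern can be passed to preg_match_all(). A minimal sketch, reusing the sample string above (the $hashtags variable is my own naming):

```php
<?php
$re  = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = '<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd';

// $matches[0] holds the full matches; thanks to \K these are only the hashtags.
preg_match_all($re, $str, $matches);
$hashtags = $matches[0];

var_export($hashtags);
```

The lone # and the #ee outside any figcaption are skipped; #ddddd is still reported because the pattern only requires an opening <figcaption, not a closing tag.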

How to match multiple substrings that occur after a specific substring?

I am trying to read out the server names from a nginx config file.
I need to use regex on a line like this:
server_name this.com www.this.com someother-example.com;
I am using PHP's preg_match_all() and I've tried different things so far:
/^(?:server_name[\s]*)(?:(.*)(?:\s*))*;$/m
// no output
/^(?:server_name[\s]*)((?:(?:.*)(?:\s*))*);$/m
// this.com www.this.com someother-example.com
But I can't find the right one to list the domains as separate values.
[
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com'
]
as Bob's your uncle wrote:
(?:server_name|\G(?!^))\s*\K[^;|\s]+
Does the trick!
The plain English requirement is to extract the space-delimited strings that immediately follow server_name then several spaces.
The dynamic duo of \G (start from the start / continue from the end of the last match) and \K (restart the fullstring match) will be the heroes of the day.
Code: (Demo)
$string = "server_name this.com www.this.com someother-example.com;";
var_export(preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $string, $out) ? $out[0] : 'no matches');
Output:
array (
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com',
)
Pattern Explanation:
(?: # start of non-capturing group (to separate piped expressions from end of the pattern)
server_name + # literally match "server_name" followed by one or more spaces
| # OR
\G(?!^) # continue searching for matches immediately after the previous match, then match a single space
) # end of the non-capturing group
\K # restart the fullstring match (aka forget any previously matched characters in "this run through")
[^; ]+ # match one or more characters that are NOT a semicolon or a space
The reason that you see \G(?!^) versus just \G (which, for the record, will work just fine on your sample input) is because \G can potentially match from two different points by its default behavior. https://www.regular-expressions.info/continue.html
If you were to use the naked \G version of my pattern AND add a single space to the front of the input string, you would not make the intended matches. \G would successfully start at the beginning of the string, then match the single space, then server_name via the negated character class [^; ].
For this reason, disabling \G's "start at the start of the string" ability makes the pattern more stable/reliable/accurate.
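The difference can be reproduced with a small sketch. The input with a leading space is a hypothetical variation of the sample, and the variable names are mine:

```php
<?php
// Hypothetical input: the sample line with one space added at the front.
$input = ' server_name this.com www.this.com;';

// Naked \G: may also anchor at the very start of the string.
preg_match_all('~(?:server_name +|\G )\K[^; ]+~', $input, $naked);

// \G(?!^): only continues from the end of a previous match.
preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $input, $guarded);

var_export([$naked[0], $guarded[0]]);
```

With the naked \G, the literal server_name shows up as the first "domain"; with the guarded version it does not.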
preg_match_all() returns an array of matches. The first element [0] is a collection of fullstring matches (what is matched regardless of capture groups). If there are any capture groups, they begin from [1] and increment with each new group.
Because you need to match server_name before targeting the substrings to extract, using capture groups would mean a bloated output array and an unusable [0] subarray of fullstring matches.
To extract the desired space-delimited substrings and omit server_name from the results, \K is used to "forget" the characters that are matched prior to finding the desired substrings. https://www.regular-expressions.info/keep.html
Without the \K to purge the unwanted leading characters, the output would be:
array (
0 => 'server_name this.com',
1 => ' www.this.com',
2 => ' someother-example.com',
)
If anyone is comparing my answer to user3776824's or HamZa's:
I am electing to be very literal with space character matching. There are 4 spaces after server_name, so I could have used an exact quantifier {4}, but opted for a bit of flexibility here. \s* isn't ideal because there will always be "one or more spaces" to match. I don't have a problem with \s, but to be clear, it matches spaces, tabs, newlines, and carriage returns.
I am using (?!^) -- a negative lookahead -- versus (?<!^) -- a negative lookbehind -- because it does the same job with one less character. You will more commonly see \G(?!^) used by experienced regex craftsmen.
There is never a need to use "alternative" syntax (|) within a character class to separate values. user3776824's pattern will actually exclude pipes in addition to semicolons and spaces -- though I don't expect any negative impact in the outcome based on the sample data. The pipe in the pattern simply should not be written.

PHP regex and adjacent capturing groups

I'm using capturing groups in regular expressions for the first time and I'm wondering what my problem is, as I assume that the regex engine looks through the string left-to-right.
I'm trying to convert an UpperCamelCase string into a hyphened-lowercase-string, so for example:
HelloWorldThisIsATest => hello-world-this-is-a-test
My precondition is an alphabetic string, so I don't need to worry about numbers or other characters. Here is what I tried:
mb_strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', "HelloWorldThisIsATest"));
The result:
hello-world-this-is-atest
This is almost what I want, except there should be a hyphen between a and test. I've already included A-Z in my first capturing group so I would assume that the engine sees AT and hyphenates that.
What am I doing wrong?
The Reason your Regex will Not Work: Overlapping Matches
Your regex matches sA in IsATest, allowing you to insert a - between the s and the A
In order to insert a - between the A and the T, the regex would have to match AT.
This is impossible because the A is already matched as part of sA. A regex cannot return overlapping matches directly: once a character is consumed by one match, the next match starts after it.
Is all hope lost? No! This is a perfect situation for lookarounds.
Do it in Two Easy Lines
Here's the easy way to do it with regex:
$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
echo strtolower(preg_replace($regex,"-","HelloWorldThisIsATest"));
See the output at the bottom of the php demo:
Output: hello-world-this-is-a-test
The regex doesn't match any characters. Rather, it targets positions in the string: the positions where a letter is immediately followed by an upper-case letter. To do so, it uses a lookbehind and a lookahead:
The (?<=[a-zA-Z]) lookbehind asserts that what precedes the current position is a letter
The (?=[A-Z]) lookahead asserts that what follows the current position is an upper-case letter.
We just replace these positions with a -, and convert the lot to lowercase.
If you look carefully on this regex101 screen, you can see lines between the words, where the regex matches.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
I've separated the two regular expressions for simplicity:
preg_replace(array('/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'), '$1-$2', $string);
It processes the string twice to find:
lowercase -> uppercase boundaries
one or more uppercase letters followed by another uppercase letter
This will have the following behaviour:
ThisIsHTMLTest -> This-Is-HTML-Test
ThisIsATest -> This-Is-A-Test
Alternatively, use a look-ahead assertion (because the lookahead does not consume the capital letter, that letter can be reused at the start of the next match):
preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $string);
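A small sketch comparing the two variants on the sample strings above (the variable names are mine); for these inputs both should produce the same result:

```php
<?php
$twoPass   = ['/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'];  // applied in order
$lookahead = '/([A-Z]+|[a-z]+)(?=[A-Z])/';

$out = [];
foreach (['ThisIsHTMLTest', 'ThisIsATest'] as $s) {
    $out[$s] = [
        preg_replace($twoPass, '$1-$2', $s),  // two-pass variant
        preg_replace($lookahead, '$1-', $s),  // look-ahead variant
    ];
}
var_export($out);
```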
To fix the interesting use case Jack mentioned in your comments (avoid splitting of abbreviations), I went with zx81's route of using lookahead and lookbehinds.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
You can split it in two for the explanation:
First part
(?<= look behind to see if there is:
[a-z] any character of: 'a' to 'z'
) end of look-behind
(?= look ahead to see if there is:
[A-Z] any character of: 'A' to 'Z'
) end of look-ahead
(TL;DR: Match between strings of the CamelCase Pattern.)
Second part
(?<= look behind to see if there is:
[A-Z] any character of: 'A' to 'Z'
) end of look-behind
(?= look ahead to see if there is:
[A-Z] any character of: 'A' to 'Z'
[a-z] any character of: 'a' to 'z'
) end of look-ahead
(TL;DR: Special case, match between abbreviation and CamelCase pattern)
So your code would then be:
mb_strtolower(preg_replace('/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/', '-', "HelloWorldThisIsATest"));
Demo of matches
Demo of code
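To illustrate the difference on an abbreviation, here is a sketch that runs zx81's pattern and the pattern above on the same input. The variable names are mine, and strtolower() is used since the input is plain ASCII:

```php
<?php
$everyCapital = '~(?<=[a-zA-Z])(?=[A-Z])~';                       // splits before every capital
$keepAcronyms = '~(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])~'; // keeps runs of capitals together

$input = 'ThisIsHTMLTest';
$split = strtolower(preg_replace($everyCapital, '-', $input));
$kept  = strtolower(preg_replace($keepAcronyms, '-', $input));

// The asker's original sample still works with the abbreviation-aware pattern.
$sample = strtolower(preg_replace($keepAcronyms, '-', 'HelloWorldThisIsATest'));

var_dump($split, $kept, $sample);
```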

regex matches numbers, but not letters

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (ie. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This will match a single character that is none of [, i, or f, regardless of the order in which they are listed. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (if its contents match, the lookahead
# fails and the repetition stops at this position)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
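A short sketch comparing the original character-class pattern with the lookahead version on the failing input from the question (the variable names are mine):

```php
<?php
$charClass = '/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s';        // pattern from the question
$tempered  = '~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s';  // lookahead version

$input = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

$classHit    = preg_match($charClass, $input);     // the 'f' in "abcdef" is excluded by the class
$temperedHit = preg_match($tempered, $input, $m);  // matches the innermost tag pair

var_dump($classHit, $temperedHit);
var_export($m);
```

The lookahead version should report the inner pair, with def in group 1 and " abcdef " (including the surrounding spaces) in group 2.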
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
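A sketch of the second pattern with preg_match_all(), to show both groups being picked up (the variable names are mine; PREG_SET_ORDER groups the results per match):

```php
<?php
$re    = '/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s';
$input = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

// Each $sets[i] is [full match, tag name, contents] for one match.
preg_match_all($re, $input, $sets, PREG_SET_ORDER);

var_export($sets);
```

Note that the captured contents keep their surrounding spaces, and the closing [/if] is never part of a match.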
Use a period to match any character.
