Regular Expression explanation: (?>[^<>]+) - php

How should this, /(?>[^<>]+)/, be interpreted please? (PHP RegExp Engine)
Thank you.

(?> # I had to look this up, but apparently this syntax prevents the regex
# parser from backtracking into whatever is matched in this group if
# the rest of the pattern fails
[^<>]+ # match ANY character except '<' or '>', 1 or more times.
) # close non-backtrackable group.
For anyone interested in the once-only pattern, check out the section Once-only subpatterns in http://www.regextester.com/pregsyntax.html

Related

PHP PCRE - Regex upgraded now failing (Catastrophic backtracking) + optimization

I first posted this question :
Regex matching nested beginning and ending tags
It was answered perfectly by Wiktor Stribiżew. Now, I wanted to upgrade my Regex expression so that my parameters supports a JSON object (or almost, because lonely '{' and '[' aren't supported).
I have two expressions: one for paired tags, one for lonely tags. I first use the paired one, when all replacements done, I execute the lonely one. The modified lonely one works fine on regex101.com (https://www.regex101.com/r/HIEQZk/9), but the paired one tells me "castatrophic backtracking" (https://www.regex101.com/r/HIEQZk/8) even though in PHP in doesn't crash.
So could anyone help me optimize/fix this fairly huge regex.
Even though there seems to be useless escaping, it is because begin/end markers and the splitter can be customized and thus have to be escaped. (The paired one is not as escaped because it is not the one generated by PHP, but the one made by Wiktor Stribiżew with the modifications I did.)
The only part that I think that shall be optimized/fixed is the "parameters" group which I just modified to support JSON objects. (Tests of these can be seen in the earlier versions of the same regex101 url. The ones here are with a real HTML to parse.)
Lonely expression
~
\{\{ #Instruction start
([^\^\{\}]+) # (Group 1) Instruction name OR variable to reach if nothing else after then
(?:
\^
(?:([^\\^\{\}]*)\^)? #(Group 2) Specific delimiter
([^\{\}]*{(?:[^{}\[\]]+|(?3))+}[^\{\}]*|[^\{\}]*\[(?:[^{}\[\]]+|(?3))+\][^\{\}]*|[^\{\}]+) # (Group 3) Parameters
)?
\}\} #Instruction end
~xg
Paired expression
~{{ # Opening tag start
(\w+) # (Group 1) Tag name
(?: # Not captured group for optional parameters
(?: # Not captured group for optional delimiter
\^ # Aux delimiter
([^^\{\}]?) # (Group 2) Specific delimiter
)?
\^ # Aux delimiter
([^\{\}]*{(?:[^{}\[\]]+|(?3))+}[^\{\}]*|[^\{\}]*\[(?:[^{}\[\]]+|(?3))+\][^\{\}]*|[^\{\}]+) # (Group 3) Parameters
)?
}} # Opening tag end
( # (Group 4)
(?>
(?R) # Repeat the whole pattern
| # or match all that is not the opening/closing tag
[^{]*(?:\{(?!{/?\1[^\{\}]*}})[^{]*)*
)* # Zero or more times
)
{{/\1}} # Closing tag
~ix
Try to replace your (?: non-capturing groups with (?> atomic groups to prevent/reduce backtracking wherever possible. Those are non capturing as well. And/or experiment with possessive quantifiers while watching the stepscounter/debugger in regex101.
Wherever you don't want the engine to go back and try different other ways.
This is your updated demo where I just changed the first (?: to (?>

Replacing all matches except if surrounded by or only if surrounded by

Given a text string (a markdown document) I need to achieve one of this two options:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax ![Blah theWord blah](url).
to replace all the matches of a particular expression ({{([^}}]+)}}\[\[[^\]\]]+\]\]) ONLY inside the markdown images, ie.: ![Blah {{theWord}}[[1234]] blah](url).
Both expressions are currently matching everything, no matter if inside the markdown image syntax or not, and I've already tried everything I could think.
Here is an example of the first option
And here is an example of the second option
Any help and/or clue will be highly appreciated.
Thanks in advance!
Well I modified first expression a little bit as I thought there are some extra capturing groups then made them by adding a lookahead trick:
-First one (Live demo):
\b(vitae)\b(?![^[]*]\s*\()
-Second one (Live demo):
{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()
Lookahead part explanations:
(?! # Starting a negative lookahead
[^[]*] # Everything that's between brackets
\s* # Any whitespace
\( # Check if it's followed by an opening parentheses
) # End of lookahead which confirms the whole expression doesn't match between brackets
(?= means a positive lookahead
You can leverage the discard technique that it really useful for this cases. It consists of having below pattern:
patternToSkip1 (*SKIP)(*FAIL)|patternToSkip2 (*SKIP)(*FAIL)| MATCH THIS PATTERN
So, according you needs:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax
You can easily achieve this in pcre through (*SKIP)(*FAIL) flags, so for you case you can use a regex like this:
\[.*?\](*SKIP)(*FAIL)|\bTheWord\b
Or using your pattern:
\[.*?\](*SKIP)(*FAIL)|(\W)(theWord)(\W)
The idea behind this regex is tell regex engine to skip the content within [...]
Working demo
The first regex is easily fixed with a SKIP-FAIL trick:
\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b
To replace with the word of your choice. It is a totally valid way in PHP (PCRE) regex to match something outside some markers.
See Demo 1
As for the second one, it is harder, but acheivable with \G that ensures we match consecutively inside some markers:
(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))
To replace with $1$2{{NEW_REPLACED_TEXT}}[[NEW_DIGITS]]
See Demo 2
PHP:
$re1 = "#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i";
$re2 = "#(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))#i";

Trying to filter log-file for urls like "/fcfc/fcf/fc.php"

In my apache-access-logs I get a lot of invalid requests comming (probably) from robots.
All of the invalid urls follow the same pattern and I would like to filter them with a regex.
Here are some samples:
/oaoa/oao/oa.php
/fcfc/fcf/fc.php
/mcmc/mcm/mc.php
/rxrx/rxr/rx.php
/wlwl/wlw/wl.php
/nini/nin/ni.php
/gigi/gig/gi.php
/jojo/joj/jo.php
/okok/oko/ok.php
I can see the pattern, but I don't know how to build a (php-)regex that matches this pattern but not things like this. :-(
/help/one/xy.php
/some/oth/er.php
I hope anyone of you guys knows a solution, if it is possible at all.
If this is your exact input, the following regex should do the trick
/\/(.)(.)\1\2\/\1\2\1\/\1\2\.php/
https://regex101.com/r/rU2sE6/2
Note: Interesting problem although you should have showed us what you've tried. Which is why I'm putting this answer as Community Wiki to not earn any reputation.
So the trick is to capture the characters in a group and then assert that it is present in the next chunk. A bit cryptic I guess but here's the regex:
^ # Assert begin of line
(?: # Non-capturing group
( # Capturing group 1
/ # Match a forward slash
[^/]+ # Match anything not a forward slash one or more times
) # End of capturing group 1
[^/] # Match anything not a forward slash one time
(?=\1) # Assert that what we've matched in group 1 is ahead of us
# (ie: a forward slash + the characters - the last character)
)+ # End of non-capturing group, repeat this one or more times
\1\.php # Match what we've matched in group 1 followed by a dot and "php"
$ # Assert end of line
Do not forget to use the m modifier and x modifier.
Online demo
For these very specific cases you listed, here is a simple regex that will match them:
/([a-z])([a-z])\1\2/\1\2\1/\1\2.php
The \1 and \2 are references to the first and second groups. The forward slashes may need to be escaped. This is essentially saying match one character, then another, followed by the first character matched, then the second character matched, with a slash, etc.

Trouble with regular expression matching in lexer

I am in the process of making a templating engine that is quite complex as it will feature typical constructs in programming languages such as if statements and loops.
Currently, I am working on the lexer, which I believe, deals with the job of converting a stream of characters into tokens. What I want to do is capture certain structures within the HTML document, which later can be worked on by the parser.
This is an example of the syntax:
<head>
<title>Template</title>
<meta charset="utf-8">
</head>
<body>
<h1>{{title}}</h1>
<p>This is also being matched.</p>
{{#myName}}
<p>My name is {{myName}}</p>
{{/}}
<p>This content too.</p>
{{^myName}}
<p>I have on name.</p>
{{/}}
<p>No matching here...</p>
</body>
I am trying to scan only for everything between the starting '{{' characters and ending '}}' characters. So, {{title}} would be one match, along with {{#myName}}, the text and content leading up to {{/}}, this should then be the second match.
I am not particularly the best at regular expressions, and I am pretty sure it is an issue with the pattern I have devised, which is this:
({{([#\^]?)([a-zA-Z0-9\._]+)}}([\w\W]+){{\/?}})
I read this as match two { characters, then either # or ^ any words containing uppercase or lowercase letters, along with any digits, dots, or underscores. Match anything that comes after the closing }} characters, until either the {{/}} characters are met, but the /}} part is optional.
The problem is visible in the link below. It is matching text that is not within the {{ and }} blocks. I am wondering it is linked to the use of the \w and \W, because if I specify specifically what characters I want to match against in the set, it seems to then work.
The regular expression test is here. I did look at the regular expression is the shared list for capturing all text that isn't HTML, and I noticed it is using lookaheads which I just cannot grasp, or understand why they would help me.
Can someone help me by pointing out the problem with the regular expression, or whether or not I am going the wrong way about it in terms of creating the lexer?
I hope I've provided enough information, and thank you for any help!
Your pattern doesn't work because [\w\W]+ take all possible characters until the last {{/}} of your string. Quantifiers (i.e. +, *, {1,3}, ?) are greedy by default. To obtain a lazy quantifier you must add a ? after it: [\w\W]+?
A pattern to deal with nested structures:
$pattern = <<<'LOD'
~
{{
(?| # branch reset group: the interest of this feature is that
# capturing group numbers are the same in all alternatives
([\w.]++)}} # self-closing tag: capturing group 1: tag name
| # OR
([#^][\w.]++)}} # opening tag: capturing group 1: tag name
( # capturing group 2: content
(?> # atomic group: three possible content type
[^{]++ # all characters except {
| # OR
{(?!{) # { not followed by another {
| # OR
(?R) # an other tag is met, attempt the whole pattern again
)* # repeat the atomic group 0 or more times
) # close the second capturing group
{{/}} # closing tag
) # close the branch reset group
~x
LOD;
preg_match_all($pattern, $html, $matches);
var_dump($matches);
To obtain all nested levels you can use this pattern:
$pattern = <<<'LOD'
~
(?=( # open a lookahead and the 1st capturing group
{{
(?|
([\w.]++)}}
|
([#^][\w.]++)}}
( # ?R was changed to ?1 because I don't want to
(?>[^{]++|{(?!{)|(?1))* # repeat the whole pattern but only the
) # subpattern in the first capturing group
{{/}}
)
) # close the 1st capturing group
) # and the lookahead
~x
LOD;
preg_match_all($pattern, $html, $matches);
var_dump($matches);
This pattern is only the first pattern enclosed in a lookahead and a capturing group. This construct allows to capture overlapping substrings.
More informations about regex features used in these two patterns:
possessive quantifiers ++atomic groups (?>..)lookahead (?=..), (?!..)branch reset group (?|..|..)recursion (?R), (?1)

PHP Template engine regex

I'm trying to write a PHP template engine.
Consider the following string:
#foreach($people as $person)
<p></p>
$end
I am able to use the following regex to find it:
#[\w]*\(.*?\).*?#end
But if I have this string:
#cake()
#cake()
#fish()
#end
#end
#end
The regex fails, this is what it finds:
#cake()
#cake()
#fish()
#end
Thanks in advance.
You can match nested functions, example:
$pattern = '~(#(?<func>\w++)\((?<param>[^)]*+)\)(?<content>(?>[^#]++|(?-4))*)#end)~';
or without named captures:
$pattern = '~(#(\w++)\(([^)]*+)\)((?>[^#]++|(?-4))*)#end)~';
Note that you can have all the content of all nested functions, if you put the whole pattern in a lookahead (?=...)
pattern details:
~ # pattern delimiter
( # open the first capturing group
#(\w++) # function name in the second capturing group
\( # literal (
([^)]*+) # param in the third capturing group
\) # literal )
( # open the fourth capturing group
(?> # open an atomic group
[^#]++ # all characters but # one or more times
| # OR
(?-4) # the first capturing group (the fourth on the left, from the current position)
)* # close the atomic group, repeat zero or more times
) # close the fourth capturing group
#end
)~ # close the first capturing group, end delimiter
You have nesting, which takes you out of the realm of a regular grammar, which means that you can't use regular expressions. Some regular expression engines (PHP's included, probably) have features that let you recognize some nested expressions, but that'll only take you so far. Look into traditional parsing tools, which should be able to handle your work load. This question goes into some of them.

Categories