If-else in recursive regex not working as expected - php

I am using a regex to parse some BBCode, so the regex has to work recursively to also match tags inside others. Most of the BBCode has an argument, and sometimes it's quoted, though not always.
A simplified equivalent of the regex I'm using (with html style tags to reduce the escaping needed) is this:
'~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
([^<]+ | (?R))* #Match the contents of the tag, including recursively
</a>~x'
However, if I have a test string that looks like this:
<"a">Content<a>Also Content</a></a>
it only matches the <a>Also Content</a> because when it tries to match from the first tag, the first matching group, \1, is set to ", and this is not overwritten when the regex is run recursively to match the inner tag, which means that because it isn't quoted, it doesn't match and that regex fails.
If instead I consistently either use or don't use quotes, it works fine, but I can't be sure that that will be the case with the content that I have to parse. Is there any way to work around this?
The full regex that I'm using, to match [spoiler]content[/spoiler], [spoiler=option]content[/spoiler] and [spoiler="option"]content[/spoiler], is
"~\[spoiler\s*+ #Match the opening tag
(?:=\s*+(\"|\')?((?(1)(?!\\1).|[^\]]){0,100})(?(1)\\1))?+\s*\] #If an option exists, match that
(?:\ *(?:\n|<br />))?+ #Get rid of an extra new line before the start of the content if necessary
((?:[^\[\n]++ #Capture all characters until the closing tag
|\n(?!\[spoiler]) Capture new line separately so backtracking doesn't run away due to above
|\[(?!/?spoiler(?:\s*=[^\]*])?) #Also match all tags that aren't spoilers
|(?R))*+) #Allow the pattern to recurse - we also want to match spoilers inside spoilers,
# without messing up nesting
\n? #Get rid of an extra new line before the closing tag if necessary
\[/spoiler] #match the closing tag
~xi"
There are a couple of other bugs with it as well though.

The simplest solution is to use alternatives instead:
<(?:a|"a")>
([^<]++ | (?R))*
</a>
But if you really don't want to repeat that a part, you can do the following:
<("?)a\1>
([^<]++ | (?R))*
</a>
Demo
I've just put the conditional ? inside the group. This time, the capturing group always matches, but the match can be empty, and the conditional isn't necessary anymore.
Side note: I've applied a possessive quantifier to [^<] to avoid catastrophic backtracking.
In your case I believe it's better to match a generic tag than a specific one. Match all tags, and then decide in your code what to do with the match.
Here's a full regex:
\[
(?<tag>\w+) \s*
(?:=\s*
(?:
(?<quote>["']) (?<arg>.{0,100}?) \k<quote>
| (?<arg>[^\]]+)
)
)?
\]
(?<content>
(?:[^[]++ | (?R) )*+
)
\[/\k<tag>\]
Demo
Note that I added the J option (PCRE_DUPNAMES) to be able to use (?<arg>...) twice.

(?(1)...) only checks if the group 1 has been defined, so the condition is true once the group is defined the first time. That is why you obtain this result (it is not related with the recursion level or whatever).
So when <a> is reached in the recursion, the regex engine try to match <a"> and fails.
If you want to use a conditional statement, you can write <("?)a(?(1)\1)> instead. In this way the group 1 is redefined each times.
Obviously you can write your pattern in a more efficient way like this:
~<(?:a|"a")>[^<]*+(?:(?R)[^<]*)*+</a>~
For your particular problem, I will use this kind of pattern to match any tags:
$pattern = <<<'EOD'
~
\[ (?<tag>\w+) \s*
(?:
= \s*
(?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) ) # branch reset feature
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;
If you want to impose a specific tag at the ground level, you can add (?(R)|(?=spoiler\b)) before the tag name.

Related

PHP PCRE - Regex upgraded now failing (Catastrophic backtracking) + optimization

I first posted this question :
Regex matching nested beginning and ending tags
It was answered perfectly by Wiktor Stribiżew. Now, I wanted to upgrade my Regex expression so that my parameters supports a JSON object (or almost, because lonely '{' and '[' aren't supported).
I have two expressions: one for paired tags, one for lonely tags. I first use the paired one, when all replacements done, I execute the lonely one. The modified lonely one works fine on regex101.com (https://www.regex101.com/r/HIEQZk/9), but the paired one tells me "castatrophic backtracking" (https://www.regex101.com/r/HIEQZk/8) even though in PHP in doesn't crash.
So could anyone help me optimize/fix this fairly huge regex.
Even though there seems to be useless escaping, it is because begin/end markers and the splitter can be customized and thus have to be escaped. (The paired one is not as escaped because it is not the one generated by PHP, but the one made by Wiktor Stribiżew with the modifications I did.)
The only part that I think that shall be optimized/fixed is the "parameters" group which I just modified to support JSON objects. (Tests of these can be seen in the earlier versions of the same regex101 url. The ones here are with a real HTML to parse.)
Lonely expression
~
\{\{ #Instruction start
([^\^\{\}]+) # (Group 1) Instruction name OR variable to reach if nothing else after then
(?:
\^
(?:([^\\^\{\}]*)\^)? #(Group 2) Specific delimiter
([^\{\}]*{(?:[^{}\[\]]+|(?3))+}[^\{\}]*|[^\{\}]*\[(?:[^{}\[\]]+|(?3))+\][^\{\}]*|[^\{\}]+) # (Group 3) Parameters
)?
\}\} #Instruction end
~xg
Paired expression
~{{ # Opening tag start
(\w+) # (Group 1) Tag name
(?: # Not captured group for optional parameters
(?: # Not captured group for optional delimiter
\^ # Aux delimiter
([^^\{\}]?) # (Group 2) Specific delimiter
)?
\^ # Aux delimiter
([^\{\}]*{(?:[^{}\[\]]+|(?3))+}[^\{\}]*|[^\{\}]*\[(?:[^{}\[\]]+|(?3))+\][^\{\}]*|[^\{\}]+) # (Group 3) Parameters
)?
}} # Opening tag end
( # (Group 4)
(?>
(?R) # Repeat the whole pattern
| # or match all that is not the opening/closing tag
[^{]*(?:\{(?!{/?\1[^\{\}]*}})[^{]*)*
)* # Zero or more times
)
{{/\1}} # Closing tag
~ix
Try to replace your (?: non-capturing groups with (?> atomic groups to prevent/reduce backtracking wherever possible. Those are non capturing as well. And/or experiment with possessive quantifiers while watching the stepscounter/debugger in regex101.
Wherever you don't want the engine to go back and try different other ways.
This is your updated demo where I just changed the first (?: to (?>

Replacing all matches except if surrounded by or only if surrounded by

Given a text string (a markdown document) I need to achieve one of this two options:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax ![Blah theWord blah](url).
to replace all the matches of a particular expression ({{([^}}]+)}}\[\[[^\]\]]+\]\]) ONLY inside the markdown images, ie.: ![Blah {{theWord}}[[1234]] blah](url).
Both expressions are currently matching everything, no matter if inside the markdown image syntax or not, and I've already tried everything I could think.
Here is an example of the first option
And here is an example of the second option
Any help and/or clue will be highly appreciated.
Thanks in advance!
Well I modified first expression a little bit as I thought there are some extra capturing groups then made them by adding a lookahead trick:
-First one (Live demo):
\b(vitae)\b(?![^[]*]\s*\()
-Second one (Live demo):
{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()
Lookahead part explanations:
(?! # Starting a negative lookahead
[^[]*] # Everything that's between brackets
\s* # Any whitespace
\( # Check if it's followed by an opening parentheses
) # End of lookahead which confirms the whole expression doesn't match between brackets
(?= means a positive lookahead
You can leverage the discard technique that it really useful for this cases. It consists of having below pattern:
patternToSkip1 (*SKIP)(*FAIL)|patternToSkip2 (*SKIP)(*FAIL)| MATCH THIS PATTERN
So, according you needs:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax
You can easily achieve this in pcre through (*SKIP)(*FAIL) flags, so for you case you can use a regex like this:
\[.*?\](*SKIP)(*FAIL)|\bTheWord\b
Or using your pattern:
\[.*?\](*SKIP)(*FAIL)|(\W)(theWord)(\W)
The idea behind this regex is tell regex engine to skip the content within [...]
Working demo
The first regex is easily fixed with a SKIP-FAIL trick:
\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b
To replace with the word of your choice. It is a totally valid way in PHP (PCRE) regex to match something outside some markers.
See Demo 1
As for the second one, it is harder, but acheivable with \G that ensures we match consecutively inside some markers:
(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))
To replace with $1$2{{NEW_REPLACED_TEXT}}[[NEW_DIGITS]]
See Demo 2
PHP:
$re1 = "#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i";
$re2 = "#(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))#i";

php - detect HTML in string and wrap with code tag

I'm in a trouble with treating HTML in text content. I'm thinking about a method that detects those tags and wrap all consecutive one inside code tags.
Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>.
//expected result
Don't wrap me<code><p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span></code>Don't wrap me <code><h1>End</h1></code>.
Is this possible?
It is hard to use DOMDocument in this specific case, since it wraps automatically text nodes with <p> tags (and add doctype, head, html). A way is to construct a pattern as a lexer using the (?(DEFINE)...) feature and named subpatterns:
$html = <<<EOD
Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>
EOD;
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<self> < [^\W_]++ [^>]* > )
(?<comment> <!-- (?>[^-]++|-(?!->))* -->)
(?<cdata> \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
(?<text> [^<]++ )
(?<tag>
< ([^\W_]++) [^>]* >
(?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
</ \g{-1} >
)
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;
$html = preg_replace($pattern, '<code>$0</code>', $html);
echo htmlspecialchars($html);
The (?(DEFINE)..) feature allows to put a definition section inside a regex pattern. This definition section and the named subpatterns inside don't match nothing, they are here to be used later in the main pattern.
(?<abcd> ...) defines a subpattern you can reuse later with \g<abcd>. In the above pattern, subpatterns defined in this way are:
self: that describes a self-closing tag
comment: for html comments
cdata: for cdata
text: for text (all that is not a tag, a comment, or cdata)
tag: for html tags that are not self-closed
self:
[^\W_] is a trick to obtain \w without the underscore. [^\W]++ represents the tag name and is used too in the tag subpattern.
[^>]* means all that is not a > zero or more times.
comment:
(?>[^-]++|-(?!->))* describes all the possible content inside an html comment:
(?> # open an atomic group
[^-]++ # all that is not a literal -, one or more times (possessive)
| # OR
- # a literal -
(?!->) # not followed by -> (negative lookahead)
)* # close and repeat the group zero or more times
cdata:
All characters between \Q..\E are seen as literal characters, special characters like [ don't need to be escaped. (This only a trick to make the pattern more readable).The content allowed in CDATA is described in the same way than the content in html comments.
text:[^<]++ all characters until an opening angle bracket or the end of the string.
tag:This is the most insteresting subpattern. Lines 1 and 3 are the opening and the closing tag. Note that, in line 1, the tag name is captured with a capturing group. In line 3, \g{-1} refers to the content matched by the last defined capturing group ("-1" means "one on the left").The line 2 describes the possible content between an opening and a closing tag. You can see that this description use not only subpatterns defined before but the current subpattern itself to allow nested tags.
Once all items have been set and the definition section closed, you can easily write the main pattern.
I'm in a trouble with treating HTML in text content.
then just escape that text:
echo htmlspecialchars($your_text_that_may_contain_html_code);
parsing html with regex is a well-known-big-NO!
This will find tags along with their closing tags, and everything in between:
<[A-Z][A-Z0-9]*\b[^>]*>.*?</\1>
You might be able to capture those tags and replace them with the tags around them. It may not work with every case, but you might find it sufficient for your needs if the html is fairly static.

regex matches numbers, but not letters

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (ie. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This will match a single character that is neither of [, i or f. Regardless of the combination. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoid the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (the negative lookahead will cause
# the entire pattern to fail if its contents match)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimeters I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
Use a period to match any character.

Regex: matching all improper tag attributes

For an example input of:
<a href="abc" something=b foo="bar" baz=cool>
I am trying to match:
something=b
baz=cool
However, everything I end up with will only match the first one (something=b), even when using preg_match_all. The regular expression I am using is:
<\w+.*?(\w+=[^"|^'|^>]).*?>
Or:
<\w+ # Word starting with <
.*? # Anything that comes in front of the matching attribute.
(
\w+ # The attribute
=
[^"|^'|^>]+? # Keep going until we find a ", ' or >
)
.*? # Anything that comes after the matching attribute.
> # Closing >
I'm probably doing something horribly wrong, pretty new to regular expressions. Please advise! :)
edit:
Revised regular expression:
<\w+.*?\w+=([^"\'\s>]+).*?>
I want it to match zzz=aaa there too ;)
Use a library like Tidy or HTMLPurifier to fix broken HTML for you.
For starters, the caret "^" symbol negates the entire character class. The character class has implied or statements, that's the point of a character class, so your class can be shortened to [^'">]
Now as to why you're only getting the "something=b" tag, I believe you're missing a + after your character class.
So your regexp with my modifications would be:
<\w+.*?(\w+=[^"'>]+?) .*?>
Note the space after the end of the group
<\w+
(?:
\s+
(?:
\w+="[^"]*"
|(\w+=[^\s>]+)
)
)+
\s*/?>
You may try this with # delimiter and x modifier. I have formatted it so it is more readable.
In your regex <\w+.*?(\w+=[^"|^'|^>]).*?>, the \w+=[^"|^'|^>] part doesn't do what you think it does - you are mixing character classes and alternation with pipe character
Writing a regex that will catch all malformed attributes inside a given XMLish tag is tricky if the attribute value can have > or = characters.
For example:
<a href="asd" title=This page proves that e=MC^2>
Your regex tries to extract all attributes from the whole string in one step - it looks for <tag and then an unquoted attribute somewhere later. This way you'll match only one attribute, the first one.
You can extract the contents of the opening and closing angle brackets in one step and then look for attributes within that. The regex <\w+\s+([^>]+?)\s*> will give you the substring of attributes. Search within that string for unquoted attributes. If the attributes are simple (as in they don't contain spaces), you can use a simple
\w+=(?=[^"'])[^ ]+
If they can contain spaces too, you're gonna need some more lookahead:
\w+=(?=[^"']).+?(?=\w+=|$)
If you know you don't have any = sign outside your tags, you can use this regex:
(?<=\=)([^"\'\s>]+)(?=[\s>])
In this example it matches all improper attributes
Edit:
(?<=\=)([^"\'\s/>]+)(?=[\s/?>])
this matches class2 in <div class=class2/> also.

Categories