Nested regex... I'm clueless!

I'm pretty clueless when it comes to PHP and regex but I'm trying to fix a broken plugin for my forum.
I'd like to replace the following:
<blockquote rel="blah">foo</blockquote>
With
<blockquote class="a"><div class="b">blah</div><div class="c"><p>foo</p></div></blockquote>
Actually, that part is easy and I've already partially fixed the plugin to do this. The following regex is being used in a call to preg_replace_callback() to do the replacement:
/(<blockquote rel="([\d\w_ ]{3,30})">)(.*)(<\/blockquote>)/u
The callback code is:
return <<<BLOCKQUOTE
<blockquote class="a"><div class="b">{$Matches[2]}</div><div class="c"><p>{$Matches[3]}</p></div></blockquote>
BLOCKQUOTE;
And that works for my above example (non-nested blockquotes). However, if the blockquotes are nested, such as in the following example:
<blockquote rel="blah">foo <blockquote rel="bloop">bar ...maybe another nest...</blockquote></blockquote>
It doesn't work. So my question is, how can I replace all nested blockquotes using a combination of regex/PHP? I know recursive patterns are possible in PHP with (?R); the following regex will extract all nested blockquotes from the string containing them:
/(<blockquote rel="([\d\w_ ]{3,30})">)(.*|(?R))(<\/blockquote>)/s
But from there on I'm not quite sure what to do in the preg_replace_callback() callback to replace each nested blockquote with the above replacement.
Any help would be appreciated.

The simple answer is that you can't do this with regex. The language of nested tags (or parens, or brackets, or anything) of an arbitrary depth is not regular and hence cannot be matched with a regular expression. I would suggest you use a DOM parser or - if absolutely necessary for some weird reason - write your own parsing scheme.
The complicated answer is that you might be able to do this with some really ugly, hacky regex and PHP code, but I wouldn't advise it to be quite honest.
See also: The Chomsky hierarchy.
Also see also:
Matching nested [quote] using regexp
Find <td> with specific class including nested tables
How using regex can I capture the outer HTML element when the same element type is nested within it?
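
For what it's worth, a DOM-based version of the blockquote rewrite could look roughly like the sketch below. It is only an illustration of the DOM-parser suggestion: the function name convertQuotesDom() and the wrapper div with id "wrap" are made up for this example, and it assumes plain ASCII input (loadHTML() may need an explicit encoding hint for anything else).

```php
<?php
// Sketch: rewrite <blockquote rel="..."> with DOMDocument instead of regex.
// convertQuotesDom() and the "wrap" id are hypothetical names for this example.
function convertQuotesDom(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate sloppy forum markup
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array first, because the loop changes the tree.
    $quotes = iterator_to_array($xpath->query('//blockquote[@rel]'));
    foreach ($quotes as $quote) {
        $author = $doc->createElement('div');
        $author->setAttribute('class', 'b');
        $author->appendChild($doc->createTextNode($quote->getAttribute('rel')));

        $body = $doc->createElement('div');
        $body->setAttribute('class', 'c');
        $p = $doc->createElement('p');
        $body->appendChild($p);
        // Move the existing children (nested quotes included) into the <p>.
        while ($quote->firstChild !== null) {
            $p->appendChild($quote->firstChild);
        }

        $quote->removeAttribute('rel');
        $quote->setAttribute('class', 'a');
        $quote->appendChild($author);
        $quote->appendChild($body);
    }

    // Serialize only what is inside the wrapper div.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo convertQuotesDom('<blockquote rel="blah">foo <blockquote rel="bloop">bar</blockquote></blockquote>'), "\n";
```

Nesting depth does not matter here, because every blockquote is already a separate node in the tree by the time the loop runs.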

There's no direct support for recursive substitutions, and preg_replace_callback() isn't particularly useful in this case. But there's nothing stopping you doing the substitution in multiple passes. The first pass takes care of the outermost tags, and subsequent passes work their way inward. The optional $count argument tells you how many replacements were performed in each pass; when it comes up zero, you're done.
$regex = '~(<BQ rel="([^"]++)">)((?:(?:(?!</?+BQ\b).)++|(?R))*+)(</BQ>)~s';
$sub = '<BQ class="a"><div class="b">$2</div><div class="c"><p>$3</p></div></BQ>';
do {
    $s = preg_replace($regex, $sub, $s, -1, $count);
} while ($count != 0);
See it in action on ideone.com
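
Applied to the markup from the question, the same multi-pass idea might look like this. It is a sketch rather than a drop-in fix for the plugin: the function name convertQuotes() is made up, BQ is simply spelled out as blockquote, and the two wrapper groups around the tags are dropped so that $1 is the rel value and $2 the body.

```php
<?php
// Sketch: multi-pass replacement, outermost quotes first.
// convertQuotes() is a hypothetical helper name for this example.
function convertQuotes(string $html): string
{
    // Group 1: rel value. Group 2: body, which may contain whole nested
    // quotes (matched via (?R)) but no stray blockquote tags.
    $regex = '~<blockquote rel="([^"]++)">((?:(?:(?!</?+blockquote\b).)++|(?R))*+)</blockquote>~s';
    $sub = '<blockquote class="a"><div class="b">$1</div><div class="c"><p>$2</p></div></blockquote>';

    do {
        // Converted quotes lose their rel attribute, so each pass only sees
        // the quotes that are still unconverted, one level further in.
        $html = preg_replace($regex, $sub, $html, -1, $count);
    } while ($count > 0);

    return $html;
}

echo convertQuotes('<blockquote rel="blah">foo <blockquote rel="bloop">bar</blockquote></blockquote>'), "\n";
```

With the two-level input above the loop runs three times: one pass per nesting level, plus a final pass that reports zero replacements.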

Related

preg_match text that isn't inside or between html tags [duplicate]

I am running preg_replace on an HTML page. My pattern is meant to wrap certain words in a surrounding tag. However, sometimes my regular expression modifies HTML tags. For example, when I try to replace this text:
yasar
So that yasar reads <span class="selected-word">yasar</span>, my regular expression also replaces yasar in the alt attribute of the anchor tag. The preg_replace() call I am currently using looks like this:
preg_replace("/(asf|gfd|oyws)/", '<span class=something>${1}</span>',$target);
How can I write a regular expression that doesn't match anything inside an HTML tag?

You can use an assertion for that, as you just have to ensure that the searched words occur somewhere after a >, or before any <. The latter test is easier to accomplish, as lookahead assertions can be variable length:
/(asf|foo|barr)(?=[^>]*(<|$))/
See also http://www.regular-expressions.info/lookaround.html for a nice explanation of that assertion syntax.
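
A quick way to see the lookahead at work is the snippet below. The anchor markup in $target is an assumption (the question's original markup is not shown), and the word list is reduced to yasar for brevity.

```php
<?php
// Assumed sample input: the word appears once inside a tag (alt attribute)
// and twice in ordinary text.
$target = '<a href="#" alt="yasar">yasar</a> and yasar again';

// The lookahead only succeeds if the next angle bracket after the word
// is "<" (or there is none before the end of the string).
$pattern = '/(yasar)(?=[^>]*(<|$))/';

$result = preg_replace($pattern, '<span class="selected-word">$1</span>', $target);
echo $result, "\n";
```

The occurrence inside alt="..." is left alone because the next angle bracket after it is a closing >, which the lookahead refuses to cross.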

Yasar, resurrecting this question because it had another solution that wasn't mentioned.
Instead of just checking that the next tag character is an opening tag, this solution skips all <full tags>.
With all the disclaimers about using regex to parse html, here is the regex:
<[^>]*>(*SKIP)(*F)|word1|word2|word3
Here is a demo. In code, it looks like this:
$target = "word1 <a skip this word2 >word2 again</a> word3";
$regex = "~<[^>]*>(*SKIP)(*F)|word1|word2|word3~";
$repl = '<span class="">\0</span>';
$new = preg_replace($regex, $repl, $target);
echo htmlentities($new);
Here is an online demo of this code.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

This might be the kind of thing that you're after: http://snipplr.com/view/3618/
In general, I'd advise against this. A better alternative is to strip out all HTML tags and instead rely on BBCode, such as:
[b]bold text[/b] [i]italic text[/i]
However I appreciate that this might not work well with what you're trying to do.
Another option may be HTML Purifier, see: http://htmlpurifier.org/

Off the top of my head, this should work:
echo preg_replace("/<(.*)>(.*)<\/(.*)>/i","<$1><span class=\"some-class\">$2</span></$3>",$target);
But, I don't know how safe this would be. I am just presenting a possibility :)

Regex - Match everything except HTML tags

I've searched for this but couldn't find a solution that worked for me.
I need a regex pattern that will match all text except HTML tags, so I can convert it to Cyrillic (converting the tags as well would obviously ruin the entire HTML =))
So, for example:
<p>text1</p>
<p>text2 <span class="theClass">text3</span></p>
I need to match text1, text2, and text3, so something like
preg_match_all("/pattern/", $text, $matches)
and then I would just iterate over the matches, or if it can be done with preg_replace, to replace text1/2/3, with textA/B/C, that would be even better.

As you probably know, regex is not a great choice for this (the general advice here will be to use a DOM parser).
However, if you need a quick regex solution, you can use this (see demo):
<[^>]*>(*SKIP)(*F)|[^<]+
How this works is that on the left the <[^>]*> matches complete <tags>, then the (*SKIP)(*F) causes the regex to fail and the engine to advance to the position in the string that follows the last character of the matched tag.
This is an application of a general technique to exclude patterns from matches (read the linked question for more details).
If you don't want to allow the matches to span several lines, add \r\n to the negated character class that does your matching, like so:
<[^>]*>(*SKIP)(*F)|[^<\r\n]+
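
In PHP this could be used along the following lines. The snippet is only a sketch: strtoupper() stands in for whatever Latin-to-Cyrillic conversion you actually need, and the variable names are made up.

```php
<?php
$html = "<p>text1</p>\n<p>text2 <span class=\"theClass\">text3</span></p>";

// Tags are matched and thrown away by (*SKIP)(*F); only text runs survive.
$pattern = '~<[^>]*>(*SKIP)(*F)|[^<\r\n]+~';

// Collect the text runs...
preg_match_all($pattern, $html, $matches);
$texts = $matches[0];

// ...or rewrite them in place. strtoupper() is just a placeholder transform.
$converted = preg_replace_callback($pattern, function (array $m): string {
    return strtoupper($m[0]);
}, $html);

print_r($texts);
echo $converted, "\n";
```

Note that the matches keep their surrounding whitespace (the second one is "text2 " with a trailing space), so trim them if that matters.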

How about this regex (in PHP, drop the g flag and use preg_match_all() instead):
/(?<=>)[\w\s]+(?=<)/g
Online Demo

Maybe this one (in Ruby):
/(?<!<)(?<!<\/)(?<![<\/\w+])([[:alpha:]])+(?!>)/
Enjoy !

Please use PHP's DOMDocument class to parse XML content:
PHP Doc
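
A rough sketch of what that might look like for this question follows. The function name transformTextNodes() and the wrapper div with id "wrap" are invented for the example, strtoupper() is only a placeholder for the real conversion, and non-ASCII input would likely need an explicit encoding hint for loadHTML().

```php
<?php
// Sketch: apply a callback to every text node and leave the tags alone.
// transformTextNodes() and the "wrap" id are hypothetical names.
function transformTextNodes(string $html, callable $fn): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // ignore warnings on loose markup
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // text() selects text nodes only, so attributes and tag names are untouched.
    foreach ($xpath->query('//div[@id="wrap"]//text()') as $textNode) {
        $textNode->nodeValue = $fn($textNode->nodeValue);
    }

    // Serialize only what is inside the wrapper div.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo transformTextNodes('<p>text1</p><p>text2 <span class="theClass">text3</span></p>', 'strtoupper'), "\n";
```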

Parsing nested structures in PHP with preg_match

Hello I want to make something like a meta language which gets parsed and cached to be more performant. So I need to be able to parse the meta code into objects or arrays.
Startidentifier: {
Endidentifier: }
You can navigate through objects with a dot(.) but you can also do arithmetic/logic/relational operations.
Here is an example of what the meta language looks like:
{mySelf.mother.job.jobName}
or nested
{mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}
or with operations
{obj.val * (otherObj.intVal + myObj.longVal) == 1200}
or more logical
{obj.condition == !myObj.otherCondition}
I think most of you already understand what I want. At the moment I can only do simple operations (without nesting and with only 2 values), but nesting for getting values with dynamic property names works fine. Text concatenation also works fine,
e.g. "Hello {myObj.name}! How are you {myObj.type}?".
A short if like (condition) ? (true-case) : (false-case) would also be nice, but I have no idea how to parse all that stuff. I am working with loops and some regex at the moment, but it would probably be faster and more maintainable if more of it were done in regex.
So could anyone give me some hints or want to help me? Maybe visit the project site to understand what I need that for: http://sourceforge.net/projects/blazeframework/
Thanks in advance!

It is non-trivial to parse an indeterminate number of matching braces using regular expressions, because in general you will match either too much or too little.
For instance, consider Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}? to use two examples from your input in the same string:
If you use the first regular expression that probably comes to mind \{.*\} to match braces, you will get one match: {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size} This is because by default, regular expressions are greedy and will match as much as possible.
From there, we can try to use a non-greedy pattern \{.*?\}, which will match as little as possible between the opening and closing brace. Using the same string, this pattern will result in two matches: {myObj.name} and {mySelf.{myObj.{keys["ObjProps"][0]}. Obviously the second is not a full expression, but a non-greedy pattern will match as little as possible, and that is the smallest match that satisfies the pattern.
PCRE does allow recursive regular expressions, but you're going to end up with a very complex pattern if you go down that route.
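
To make the three behaviours concrete, the snippet below runs the greedy, non-greedy, and a recursive pattern over the same string. The recursive pattern is just one common way to write a balanced-brace matcher with (?R), shown here as an assumption rather than a recommendation.

```php
<?php
$s = 'Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}?';

// Greedy: runs from the first "{" to the last "}".
preg_match_all('~\{.*\}~', $s, $greedy);

// Non-greedy: stops at the first "}" it meets, which cuts nested expressions short.
preg_match_all('~\{.*?\}~', $s, $lazy);

// Recursive: a brace pair containing non-brace text or further balanced pairs.
preg_match_all('~\{(?:[^{}]++|(?R))*+\}~', $s, $balanced);

print_r($greedy[0]);
print_r($lazy[0]);
print_r($balanced[0]);
```

Only the recursive version returns the two complete top-level expressions, and even that says nothing yet about the dots, operators, and subscripts inside them.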
The best solution, in my opinion, would be to construct a tokenizer (which could be powered by regex) to turn your text into an array of tokens which can then be parsed.
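
As a starting point, a regex-powered tokenizer might look something like the sketch below. The token classes (brace, ident, number, string, op) are guesses based on the examples in the question, not the real grammar of the meta language, and tokenize() is a made-up name.

```php
<?php
// Sketch of a regex-powered tokenizer. Token classes are assumptions
// derived from the question's examples.
function tokenize(string $src): array
{
    $pattern = '~\s*(?:(?<brace>[{}])|(?<ident>[A-Za-z_]\w*)|(?<number>\d+)'
             . '|(?<string>"[^"]*")|(?<op>==|!=|[.*+\-/!()\[\]?:]))~A';
    $types = ['brace', 'ident', 'number', 'string', 'op'];

    $src = trim($src);
    $tokens = [];
    $pos = 0;
    while ($pos < strlen($src)) {
        // The A modifier anchors the match at $pos, so unknown input
        // raises an error instead of being skipped silently.
        if (!preg_match($pattern, $src, $m, 0, $pos)) {
            throw new InvalidArgumentException("Unexpected input at offset $pos");
        }
        // Exactly one named group is non-empty; record its type and text.
        foreach ($types as $type) {
            if (isset($m[$type]) && $m[$type] !== '') {
                $tokens[] = [$type, $m[$type]];
                break;
            }
        }
        $pos += strlen($m[0]);
    }
    return $tokens;
}

$tokens = tokenize('{obj.val * (otherObj.intVal + myObj.longVal) == 1200}');
foreach ($tokens as [$type, $text]) {
    echo $type, ': ', $text, "\n";
}
```

A small recursive-descent parser can then consume this flat token list, opening a new nesting level on every { token and closing it on the matching }.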

Maybe have a look at the PREG_OFFSET_CAPTURE flag?

Regular expression to put <P> AND <ul>/<ol> into array

I'm searching for a function in PHP to put every paragraph-level element like <p>, <ul> and <ol> into an array, so that I can manipulate the paragraphs, e.g. display the first two and hide the others.
This function does the trick for the p element. How can I adjust the regexp to also match ul and ol? My attempt gives an error complaining that < is not an operator...
function aantalP($in){
    preg_match_all("|<p>(.*)</p>|U",
        $in,
        $out, PREG_PATTERN_ORDER);
    return $out;
}
//tryout:
function aantalPT($in){
    preg_match_all("|(<p> | <ol>)(.*)(</p>|</o>)|U",
        $in,
        $out, PREG_PATTERN_ORDER);
    return $out;
}
Can anyone help me?

You can't do this reliably with regular expressions. Paragraphs are mostly OK because they're not nested generally (although they can be). Lists however are routinely nested and that's one area where regular expressions fall down.
PHP has multiple ways of parsing HTML and retrieving selected elements. Just use one of those. It'll be far more robust.
Start with Parse HTML With PHP And DOM.
If you really want to go down the regex route, start with:
function aantalPT($in){
    preg_match_all('!<(p|ul|ol)>(.*)</\1>!Us', $in, $out);
    return $out;
}
Note: PREG_PATTERN_ORDER is not required as it is the default value.
Basically, use a backreference to find the matching tag. That will fail for many reasons such as nested lists and paragraphs nested within lists. And no, those problems are not solvable (reliably) with regular expressions.
Edit: as (correctly) pointed out, your regex is also flawed in that it uses a pipe as the delimiter while also using a pipe character inside the pattern. I generally use ! as that doesn't normally occur in the pattern (not in my patterns anyway). Some use forward slashes but they appear in this pattern too. Tilde (~) is another reasonably common choice.

First of all, you use | as delimiter to mark the beginning and end of the regular expression. But you also use | as the or sign. I suggest you replace the first and last | with #.
Secondly, you should use backreferences with capture of the start and end tag like such: <(p|ul)>(.*?)</\1>
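
Put together, that could look like the sketch below. The function name topLevelBlocks() and the sample markup are made up, ol is added to the alternation, and the pattern still only copes with tags that have no attributes and no same-name nesting.

```php
<?php
// Sketch: collect <p>, <ul> and <ol> blocks using # as the delimiter and a
// backreference (\1) so the closing tag must match the opening one.
function topLevelBlocks(string $html): array
{
    preg_match_all('#<(p|ul|ol)>(.*?)</\1>#s', $html, $out);
    return $out[0];   // full matches, tags included
}

$html = '<p>one</p><ul><li>a</li></ul><p>two</p><ol><li>b</li></ol>';
$blocks = topLevelBlocks($html);

// Show the first two blocks, keep the rest for hiding.
$visible = array_slice($blocks, 0, 2);
$hidden = array_slice($blocks, 2);

print_r($visible);
print_r($hidden);
```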
