regex, php, and the evil nested (?R) - php

UPDATE
So I am still messing with this, and have gotten as far as finding all the instances of tags, though I'd rather JUST find the deepest stacked instance, as life would be easier that way.. Anyway here is what I got..
/(({{)(?:(?=([^\/][^ ]*?))\3|(\/[\w])))([a-zA-Z0-9\$\'\"\s\#\%\^\&\!\.\_\+\=\-\\\*\(\)\ ]+?}})/
Are there any regexp gurus out there who could give me some pointers, or a regexp that does what I need? All I need is the deepest nested instance of a {{tag}} that ends like this: {{/tag}}
ORIGINAL
Ok, so I have an issue I have seen others have, but with a different approach to it.. Or so I thought.. So I am curious if anyone else can help me solve this issue further..
I have a database full of templates that I need to work with in PHP. These templates are made and used by another system, and therefore cannot be changed. With that said, these templates have hierarchy-style tags added to them. What I need to do is get these templates from the database, and then programmatically find these tags, their function name (or tag name), their inner contents, and anything following the function (tag) name within the brackets. An example of one of these tags is: {{FunctionName some (otherStuff) !Here}} Some content sits inside and it ends {{/FunctionName}}
This is where it gets more fun, the templates have another random tag, which I am guessing are the "variable" style of these tags, as they are always generally the same syntax. Which looks like this, ${RandomTag}, but also there are times that the function style one is there but without an ending tag, like so.. {{RandomLoner}}
Example Template...
{{FunctionTag (Condition?)}}
<div>This is an {{CheckOfSomeSort someTimesThese !orThese}}
example of some {{Random}} data
{{/CheckOfSomeSort}} that will be ${worked} on</div>
{{/FunctionTag}}
Ok so in no way is this a real template, but it follows all the rules that I have seen thus far.
Now I have tried different things with regex and preg_match_all to pull out the matches, and get each of these into a nice array. So far what I have got is this (used it on the example template to make sure its working still)
Array
(
[0] => Array
(
[0] => {{CheckOfSomeSort someTimesThese !orThese}}example of some data{{/CheckOfSomeSort}}
[1] => {{CheckOfSomeSort someTimesThese !orThese}}
[2] => CheckOfSomeSort
[3] => example of some data
[4] => {{/CheckOfSomeSort}}
)
)
I have tried a couple approaches, (that took me nearly 8 hours to get to)
/({{([^\/].[^ ]*)(?:.[^ ][^{{]+)}})(?:(?=([^{{]+))\3|{{(?!\2[^}}]*}}))*?({{\/\2}})/
AND, more recently...
/({{([^\/].[^ ]*)(?:.[^ ][^{{]+)}})((?:(?!\{\{|\}\}).)++|(?R)*)({{\/\2}})/
In no way am I a guru with regexp; I actually just learned it over the last day or so, trying to get this to work. I have googled for this, and I realize that regexp is not designed for nested stuff, but (?R) seems to do the trick on the simple bracket examples I've seen online. Those examples only ever take into account the stuff between { and }, ( and ), or < and >. After reading nearly the whole regex info website, and playing around, I came up with these 2 versions.
So what I NEED to do (I think) is have a regexp work from the DEEPEST tag in the hierarchy first, and work its way out (if I can do that with help from PHP, that's fine with me). I was thinking of finding the deepest layer, getting its data, and working backwards until all the contents are in one big array. I assumed that was what the (?R) was going to do for me, but it didn't.
So any help on what I am missing would be great. Also note that mine seems to have issues with {{}} tags that DON'T have an ending version. That is why my {{Random}} example was removed when I produced the array example above. I feel these tags, along with the ${} tags, can be left alone (if I knew how to do that with regexp) and just remain in the text where they are. I am mostly interested in the functions and in getting their data into a multidimensional array for me to work with further.
Sorry for the long post, I have just been banging my head against this all night. I started with the assumption that it was going to be a bit easier, until I realized the tags were nested :/
Any help is appreciated! Thanks!

Wow, what a strange templating syntax.
The method I would probably use to tackle this problem would be something like:
Use a simple regex to change all the {{tags}} to <tags>
Use another simple regex to convert the space-delimited arguments/conditions inside tags to XML-like attribute syntax (ex. {{foo bar !baz}} would become <foo arg1="bar" arg2="!baz"> or similar)
Process it as a DOMDocument.
Have fun. :-)

Warning! You are trying to write a parser with just regular expressions. That doesn't work very well. Why not? Because you need to store state as well!
So what then? Well, you write a parser of course :D
If you need any tips on how to get started I can help but I'd encourage you to try it by yourself first. How does a parser work anyway? :)
Tokenize your input. And transform it to a nested tree like so:
array(
    array("code", "FunctionTag (Condition?)", array(
        "<div>This is an ",
        array("code", "CheckOfSomeSort someTimesThese !orThese", array(
            "example of some ",
            array("code", "Random", array()),
            " data"
        )),
        " that will be ${worked} on</div>"
    ))
)
Now you just have to interpret the code parts and produce the expected output. You could also add things like line numbers and character positions which is very useful for debugging.
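To make the tokenize-then-nest idea concrete, here is a minimal sketch in PHP. The function name parse_template and the frame layout are made up for illustration. It assumes every {{ is closed by }}, that tag names contain no spaces, and that a tag which never gets a matching {{/Name}} (like {{Random}}) is a standalone tag with no body. Stray closing tags are simply ignored.

```php
<?php
// Sketch only: builds the nested array("code", header, children) structure
// described above from a template string.
function parse_template(string $src): array
{
    // Split into plain-text tokens and {{...}} tokens, keeping the tags.
    $tokens = preg_split('/(\{\{.*?\}\})/s', $src, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Frame 0 is the root; every open tag pushes a new frame.
    $stack = [['header' => null, 'children' => []]];

    // Pop the top frame and attach it to its parent. A frame that was never
    // closed becomes a body-less tag and its children become siblings.
    $pop = function (bool $closed) use (&$stack) {
        $frame  = array_pop($stack);
        $parent = &$stack[count($stack) - 1]['children'];
        if ($closed) {
            $parent[] = ['code', $frame['header'], $frame['children']];
        } else {
            $parent[] = ['code', $frame['header'], []];
            foreach ($frame['children'] as $child) {
                $parent[] = $child;
            }
        }
    };

    foreach ($tokens as $tok) {
        if (strncmp($tok, '{{/', 3) === 0) {
            // Closing tag: find the nearest open frame with the same name.
            $name = trim(substr($tok, 3, -2));
            $i = count($stack) - 1;
            while ($i > 0 && explode(' ', $stack[$i]['header'], 2)[0] !== $name) {
                $i--;
            }
            if ($i === 0) {
                continue; // stray closing tag, ignored in this sketch
            }
            while (count($stack) - 1 > $i) {
                $pop(false); // tags opened after it that never closed
            }
            $pop(true);
        } elseif (strncmp($tok, '{{', 2) === 0) {
            $stack[] = ['header' => trim(substr($tok, 2, -2)), 'children' => []];
        } else {
            $stack[count($stack) - 1]['children'][] = $tok;
        }
    }
    while (count($stack) > 1) {
        $pop(false); // anything still open at the end is a standalone tag
    }
    return $stack[0]['children'];
}
```

Calling parse_template() on the example template from the question should give a tree of the same shape as the array shown above, with ${worked} left untouched inside the text nodes.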

After a bit of time working on it, I ultimately learned more about regex, and understand it to a T now. Great thing about this, is that PHP has the (?R) and I now understand why it even looks like this. lol
In the end, the regex that I got working spawned off the php page that explained the recursive (?R). I then just worked on getting the tags regex in place of the parenthesis they were using in the example.
I know I wanted the innermost tag, but of course I can accomplish the same thing with the outermost tag, so this regex does just that. It finds and grabs the outermost {{tag (thatMightHaveDataHere)}} And has inner contents that may be more {{TAGS}} within it.{{/tag}}
Here it is,
/{{([\w]+) ?([^}]*?)(?:}}((?:[^{]*?|(?R)|{{[\w]*?}}|\${.*?})*){{\/\1}})/
0 = Matched "Outer Tag"
1 = Tag that was found, ie {{tag}}{{/\1}}
2 = Any data after the first space, within the tag, ie {{tag ThisDataIs StoredAs2}}
3 = INNER content (which can be a recursive match of this regex, a non-ended tag {{noEndTag}}, or a tag that starts with a dollar sign, ${likeThis})
Run a loop on the $match[3] with this regex, and you can cycle through finding them. Not sure where you would use this outside what I needed it for, but I am sure someone can modify it if they need it to work on some other nested style structure.
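For anyone wondering what "run a loop on $match[3]" could look like, here is a rough sketch. The function name parse_tags and the array keys are mine, not part of the original system. It re-applies the same regex to group 3 recursively, and it is only meant for small templates, since the pattern can backtrack heavily on long inputs.

```php
<?php
// Sketch: turn nested {{tag args}}...{{/tag}} blocks into a nested array
// by re-running the recursive regex on each match's inner content.
function parse_tags(string $text): array
{
    // Single quotes matter here: "\${" must reach PCRE unchanged.
    $pattern = '/{{([\w]+) ?([^}]*?)(?:}}((?:[^{]*?|(?R)|{{[\w]*?}}|\${.*?})*){{\/\1}})/';

    $nodes = [];
    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $nodes[] = [
                'tag'      => $m[1],             // tag name
                'args'     => $m[2],             // anything after the first space
                'content'  => $m[3],             // raw inner content
                'children' => parse_tags($m[3]), // nested tags, if any
            ];
        }
    }
    return $nodes;
}
```

Tags with no closing version ({{noEndTag}}) and ${vars} stay in the 'content' string and do not show up as children.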

Related

Recursive Regex in PHP with variable names

I am trying to make a bbcode-ish engine for my website. The catch is that the set of available codes is not known in advance, because the codes are made by the users. On top of that, the whole thing has to be recursive.
For example:
Hello my name is [name user-id="1"]
I [bold]really[/bold] like cheeseburgers
These are the easy ones and I managed to make them work.
Now the problem is what happens when two of those codes come one after the other:
I [bold]really[/bold] like [bold]cheeseburgers[/bold]
Or inside each other
I [bold]really like [italic]cheeseburgers[/italic][/bold]
These codes can also have attributes
I [bold strengh="600"]really like [text font-size="24px"]cheeseburgers[/text][/bold]
The following one worked quite well, but lacks in the recursive part (?R)
(?P<code>\[(?P<code_open>\w+)\s?(?P<attributes>[a-zA-Z-0-1-_=" .]*?)](?:(?P<content>.*?)\[\/(?P<code_close>\w+)\])?)
I just don't know where to put the (?R) recursion token.
Also, the system has to know that in this string here
I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]
are 2 "code-objects":
1. [bold]really like [italic]cheeseburgers[/italic][/bold]
and
2. [bold]football[/bold]
... and the content of the first one is
really like [italic]cheeseburgers[/italic]
which again has a code in it
[italic]cheeseburgers[/italic]
which content is
cheeseburgers
I have searched the web for two days now and I can't figure it out.
I thought of something like this:
Look for something like [**** attr="foo"] where the attributes are optional and store it in a capturing group
Look up whether there is a closing tag somewhere (can be optional too)
If a closing tag exists, everything between the two tags should be stored as a "content" capturing group, which then has to go through the same procedure again.
I hope there are some regex specialists who are willing to help me. :(
Thank you!
EDIT
As this might be difficult to understand, here is an input and an expected output:
Input:
[heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]
I'd like to have an array like
array[0][name] = heading
array[0][attributes][icon] = rocket
array[0][content] = I'm a cool heading
array[1][name] = textrow
array[1][content] = [text]<p>Hi!</p>[/text]
array[1][0][name] = text
array[1][0][content] = <p>Hi!</p>
Having written multiple BBCode parsing systems, I can suggest NOT using regexes only. Instead, you should actually parse the text.
How you do this is up to you, but as a general idea you would want to use something like strpos to locate the first [ in your string, then check what comes after it to see if it looks like a BBCode tag and process it if so. Then, search for [ again starting from where you ended up.
This has certain advantages, such as being able to examine each code and skip it if it's invalid, as well as enforcing proper tag closing order ([bold][italic]Nesting![/bold][/italic] should be considered invalid) and being able to provide meaningful error messages to the user if something is wrong (invalid parameter, perhaps) because the parser knows exactly what is going on, whereas a regex would output something unexpected and potentially harmful.
It might be more work (or less, depending on your skill with regex), but it's worth it.
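As a rough illustration of the scan-for-[ approach, here is a minimal sketch. The function name parse_bbcode and the node layout are invented for the example. It assumes every opening tag has a closing tag (standalone codes like [name user-id="1"] are out of scope) and that attribute values are double-quoted. It throws as soon as the closing order is wrong.

```php
<?php
// Sketch: strpos-driven BBCode parser that keeps a stack of open tags.
function parse_bbcode(string $text): array
{
    // Frame 0 is the root; its children are the top-level nodes.
    $stack = [['name' => null, 'attributes' => [], 'children' => []]];
    $pos = 0;
    $len = strlen($text);

    while ($pos < $len) {
        $open = strpos($text, '[', $pos);
        if ($open === false) {
            $open = $len;
        }
        if ($open > $pos) { // plain text before the next '['
            $stack[count($stack) - 1]['children'][] = substr($text, $pos, $open - $pos);
        }
        if ($open === $len) {
            break;
        }
        // Does the text at $open look like [tag ...] or [/tag]?
        if (!preg_match('/\G\[(\/?)(\w+)([^\]\[]*)\]/', $text, $m, 0, $open)) {
            $stack[count($stack) - 1]['children'][] = '['; // just a literal bracket
            $pos = $open + 1;
            continue;
        }
        $pos = $open + strlen($m[0]);

        if ($m[1] === '') { // opening tag: collect attributes and push
            $attrs = [];
            preg_match_all('/([\w-]+)="([^"]*)"/', $m[3], $pairs, PREG_SET_ORDER);
            foreach ($pairs as $pair) {
                $attrs[$pair[1]] = $pair[2];
            }
            $stack[] = ['name' => $m[2], 'attributes' => $attrs, 'children' => []];
        } else { // closing tag: must match the innermost open tag
            $node = array_pop($stack);
            if ($node['name'] !== $m[2]) {
                throw new InvalidArgumentException("Unexpected [/{$m[2]}] at offset $open");
            }
            $stack[count($stack) - 1]['children'][] = $node;
        }
    }
    if (count($stack) !== 1) {
        throw new InvalidArgumentException('Unclosed tag: ' . $stack[count($stack) - 1]['name']);
    }
    return $stack[0]['children'];
}
```

Because the parser knows the offset and the tag it was expecting, the exception message can be shown to the user more or less as is.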

Stuck with regexp

I'm stuck with the PHP preg_match_all function. Maybe someone will help me with a regexp. Let's assume we have some code:
[a]a[/a]
[s]a[/s]
[b]1[/b]
[b]2[/b]
...
...
[b]n[/b]
[e]a[/e]
[b]8[/b]
[b]9[/b]
...
...
[b]n[/b]
I need to match all that inside [b] tags located between [s] and [e] tags. Any ideas?
If your structure is exactly the same as above, I would personally avoid regex (not a good idea with this sort of language) and just check the second char of each line. Once you see an s, go into consume mode, and for each line until you see an e, find the first ] and read in everything between that and the next [
For simplicity use two preg_match calls.
First to retrieve the list you want to inspect /\[s](.+?)\[e]/s.
And then use that result string and match for the contained /\[b](.+?)\[\/b]/s things.
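Put together, the two-step version might look like the sketch below. The function name is hypothetical, and it only looks at the first [s]...[e] section in the input.

```php
<?php
// Sketch: first cut out the part between [s] and [e], then collect the
// contents of every [b]...[/b] inside that part.
function b_values_between_s_and_e(string $text): array
{
    if (!preg_match('/\[s](.+?)\[e]/s', $text, $section)) {
        return []; // no [s] ... [e] section found
    }
    preg_match_all('/\[b](.+?)\[\/b]/s', $section[1], $matches);
    return $matches[1];
}
```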
It looks like you are trying to pattern match something that has a treelike structure, essentially like HTML or XML. Any time you find yourself saying "find X located inside matching Y tags" you are going to have this problem.
Trying to do this sort of work with regular expressions is a Bad Idea.
Here's some info copy/pasted from a different answer of mine for a similar question:
Some references to similar SO posts which will give you an idea of the difficulty you're getting into:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
The "Right Thing" to do is to parse your input, maintaining state as you go. This can be as simple as scanning your text and keeping a stack of current tags.
Regular expressions alone aren't sufficient to parse XML, and this appears to be a simplified XML language here.

Help with Regex in PHP

Let's assume I do preg_replace as follows:
preg_replace("/<my_tag>(.*)<\/my_tag>/U", "<my_new_tag>$1</my_new_tag>", $source);
That works but I do also want to grab the attribute of the my_tag - how would I do it with this:
<my_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_tag>
You don't use regex. You use a real parser, because this stuff cannot be parsed with regular expressions. You'll never know if you've got all the corner cases quite right and then your regex has turned into a giant bloated monster and you'll wish you'd just taken fredley's advice and used a real parser.
For a humourous take, see this famous post.
preg_replace('#<my_tag\b([^>]*)>(.*?)</my_tag>#',
'<my_new_tag$1>$2</my_new_tag>', $source)
The ([^>]*) captures anything after the tag name and before the closing >. Of course, > is legal inside HTML attribute values, so watch out for that (but I've never seen it in the wild). The \b prevents matches of tag names that happen to start with my_tag, preventing bogus matches like this:
<my_tag_xyz>ooga-booga</my_tag_xyz><my_tag>tra-la-la</my_tag>
But that will still break on <my_tag> elements wrapped in other <my_tag> elements, yielding results like this:
<my_tag><my_tag>tra-la-la</my_tag>
If you know you'll never need to match tags with other tags inside them, you can replace the (.*?) with ([^<>]++).
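Here is that stricter variant wrapped in a small function so it can be tried out. The function name rename_my_tag is just for the example, and it assumes the tag bodies contain no other tags.

```php
<?php
// Sketch: rename <my_tag ...> to <my_new_tag ...>, keeping the attributes.
// [^<>]++ refuses bodies that contain other tags, so only the innermost
// <my_tag> of a nested pair is rewritten.
function rename_my_tag(string $source): string
{
    return preg_replace(
        '#<my_tag\b([^>]*)>([^<>]++)</my_tag>#',
        '<my_new_tag$1>$2</my_new_tag>',
        $source
    );
}
```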
I get pretty tired of the glib "don't do that" answers too, but as you can see, there are good reasons behind them--I could come up with this many more without having to consult any references. When you ask "How do I do this?" with no background or qualification, we have no idea how much of this you already know.
Forget regexes, use this instead:
http://simplehtmldom.sourceforge.net/

PHP - I need some help with my Regex

I've created a simple template 'engine' in PHP to substitute PHP-generated data into the HTML page. Here's how it works:
In my main template file, I have variables like so:
<title><!-- %{title}% --></title>
I then assign data into those variables for the main page load
$assign = array (
'title' => 'my website - '
);
I then have separate template blocks that get loaded for the content pages. The above really just handles the header and the footer. In one of these 'content template files', I have variables like so:
<!-- %{title=content page}% -->
Once this gets executed, the main template data is edited to include the content page variables resulting in:
<title>my website - content page</title>
It does this with the following code:
if (preg_match('/<!-- %{title=\s*(.*?)}% -->/s', $string, $matches)) {
    // Find variable names in the form of %{varName=new data to append}%
    // If found, append that new data to the existing data
    $string = preg_replace('/<!-- %{title=\s*(.*?)}% -->/s', '', $string);
    $varData[$i] .= $matches[1];
}
This basically removes the template variables and then assigns the variable data to the existing variable. Now, this all works fine. What I'm having issues with is nesting template variables. If I do something like:
<!-- %{title=content page (author: <!-- %{name}% -->) -->
The pattern, at times, messes up the opening and closing tags of each variable.
How can I fix my regular expression to prevent this?
Thank you.
The answer is you don't do this with regex. Regular expressions are a regular language. When you start nesting things it is no longer a regular language. It is, at a minimum, a context-free language ("CFL"). CFLs can only be processed (assuming they're unambiguous) with a stack.
Specifically, regular languages can be represented with a finite state machine ("FSM"). CFLs require a pushdown automaton ("PDA").
An example of the difference is nested tags in HTML:
<div>
<div>inner</div>
</div>
My advice is don't write your own template language. That's been done. Many times. Use Smarty or something in Zend, Kohana or whatever. If you do write your own, do it properly. Parse it.
Why are you rolling your own template engine? If you want this kind of complexity, there's a lot of places that have already come up with solutions for it. You should just plug in Smarty or something like that.
If you're asking what I think you're asking, it's literally impossible. If I read your question correctly, you want to match arbitrarily-nested <!-- ... --> sequences with particular things inside. Unfortunately, regular expressions can only match certain classes of strings; any regular expression can match only a regular language. One well-known language which is not regular is the language of balanced parentheses (also known as the Dyck language), which is exactly what you're trying to match. In order to match arbitrarily-nested comment strings, you need a more powerful tool. I'm fairly sure there are pre-existing PHP template engines; you might look into one of those.
To resolve your problem you should
replace preg_match() with preg_match_all();
find the pattern, and replace them from the last one to the first one;
use a more restrictive pattern like '/<!-- %{title=\s*([^}]*?)}% -->/s'.
I've done something similar in the past, and I have encountered the same nesting issue you did. In your case, what I would do is repeatedly search your text for matches (rather than searching once and looping through the matches) and extract the strings you want by searching for anything that doesn't include your closing string.
In your case, it would probably look like this:
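/(<!--((?:(?!<!--|-->).)*)-->)/s
Regexes like this are a nightmare to explain, but basically, ((?:(?!<!--|-->).)*) will find any string that contains neither your opening nor your closing tag (let's call that AAA), so the match is always an innermost tag. (A character class such as [^(-->)] would not do this, because it only excludes individual characters, not the sequence -->.) AAA sits inside a matching group that is, itself, your template tag, (<!--AAA-->).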
I'm convinced this sort of templating method is the wrong way to do things, but I've never known enough to do it better. It's always bothered me in ASP and ColdFusion that you had to nest your scripting tags inside HTML and when I started to do it myself, I considered it a personal failure.
Most Regexes I do now are in JavaScript and so I may be missing some of the awesome nuances PHP has via Perl. I'd be happy if someone can write this more cleanly.
I too have run into this problem in the past, although I didn't use regular expressions.
If instead you search from right to left for the opening tag, <!-- %{ in your syntax, using strrpos (PHP5+), then search forwards for the first occurrence of the next closing tag, and then replace that chunk first, you will end up replacing the inner-most nested variables first. This should resolve your problem.
You can also do it the other way around and find the first occurrence of a closing tag, and work backwards to find its corresponding opening tag.
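A small sketch of the right-to-left version, assuming well-formed markers (every <!-- %{ has a matching }% -->). The function name and the $resolve callback are placeholders for whatever your engine does with a variable body. The callback must not return text containing the opening marker, or the loop will never finish.

```php
<?php
// Sketch: repeatedly replace the LAST opening marker and the first closing
// marker after it, so the innermost variables are resolved first.
function resolve_innermost_first(string $text, callable $resolve): string
{
    $openTag  = '<!-- %{';
    $closeTag = '}% -->';

    while (($open = strrpos($text, $openTag)) !== false) {
        $bodyStart = $open + strlen($openTag);
        $close = strpos($text, $closeTag, $bodyStart);
        if ($close === false) {
            break; // unbalanced markers: leave the rest untouched
        }
        $body = substr($text, $bodyStart, $close - $bodyStart);
        $text = substr($text, 0, $open)
              . $resolve($body)
              . substr($text, $close + strlen($closeTag));
    }
    return $text;
}
```

By the time the outer title=... variable is handed to the callback, the nested name variable has already been replaced inside it.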

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
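For example, a host check built on parse_url could look like this sketch. The function name is_blocked_url is made up, and it assumes the block list holds full domain names such as domain1.com rather than bare words.

```php
<?php
// Sketch: true if the URL's host is one of the blocked domains or a
// subdomain of one of them.
function is_blocked_url(string $url, array $blockedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component (or an unparsable URL)
    }
    $host = strtolower($host);
    foreach ($blockedDomains as $domain) {
        $domain = strtolower($domain);
        $suffix = '.' . $domain;
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}
```

You would still need something (a regex or a tokenizer) to pull the candidate URLs out of the post before checking them.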
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow variable substitution. Also, $filterthese is an array, so it should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
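If it helps, here is a self-contained sketch along the same lines that can be run outside vBulletin. The function name filter_links is invented, the domain list is escaped with preg_quote, and whole <a>...</a> links are handled by a separate pattern before bare URLs. It is an approximation, not a complete URL matcher, and the replacement text should not contain $ or backslash sequences that preg_replace would interpret.

```php
<?php
// Sketch: replace links to blocked domains, whether they appear as an
// <a href=...>...</a> element, as plain text, or inside [CODE] tags.
function filter_links(string $message, array $domains, string $replacement): string
{
    $quoted = array_map(function ($d) {
        return preg_quote($d, '!');
    }, $domains);

    // scheme, optional subdomains, blocked domain, then the rest of the URL
    $url = 'https?://(?:[a-z0-9-]+\.)*(?:' . implode('|', $quoted) . ')\b[^\s"\'<>\[\]]*';

    $patterns = [
        '!<a\b[^>]*href=["\']?' . $url . '[^>]*>.*?</a>!is', // whole anchor element
        '!' . $url . '!i',                                   // bare URL
    ];
    return preg_replace($patterns, $replacement, $message);
}
```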
