Recursive Regex in PHP with variable names

I am trying to build a BBCode-like engine for my website. The catch is that the set of available codes is not known in advance, because the codes are defined by the users. On top of that, the whole thing has to be recursive.
For example:
Hello my name is [name user-id="1"]
I [bold]really[/bold] like cheeseburgers
These are the easy ones, and I managed to get them working.
Now the problem is what happens when two of those codes follow each other:
I [bold]really[/bold] like [bold]cheeseburgers[/bold]
Or are nested inside each other:
I [bold]really like [italic]cheeseburgers[/italic][/bold]
These codes can also have attributes:
I [bold strengh="600"]really like [text font-size="24px"]cheeseburgers[/text][/bold]
The following pattern worked quite well, but it lacks the recursive part, (?R):
(?P<code>\[(?P<code_open>\w+)\s?(?P<attributes>[a-zA-Z-0-1-_=" .]*?)](?:(?P<content>.*?)\[\/(?P<code_close>\w+)\])?)
I just don't know where to put the (?R) recursion token.
Also, the system has to know that this string
I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]
contains 2 "code objects":
1. [bold]really like [italic]cheeseburgers[/italic][/bold]
and
2. [bold]football[/bold]
... and the content of the first one is
really like [italic]cheeseburgers[/italic]
which again has a code in it
[italic]cheeseburgers[/italic]
whose content is
cheeseburgers
I have searched the web for two days now and I can't figure it out.
I thought of something like this:
Look for something like [**** attr="foo"] where the attributes are optional and store it in a capturing group
Look up whether there is a closing tag somewhere (it can be optional too)
If a closing tag exists, everything between the two tags should be stored as a "content"-capturing group - which then has to go through the same procedure again.
I hope there are some regex specialists who are willing to help me. :(
Thank you!
EDIT
As this might be difficult to understand, here is an input and an expected output:
Input:
[heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]
I'd like to have an array like
array[0][name] = heading
array[0][attributes][icon] = rocket
array[0][content] = I'm a cool heading
array[1][name] = textrow
array[1][content] = [text]<p>Hi!</p>[/text]
array[1][0][name] = text
array[1][0][content] = <p>Hi!</p>

Having written multiple BBCode parsing systems, I can suggest NOT using regexes only. Instead, you should actually parse the text.
How you do this is up to you, but as a general idea you would want to use something like strpos to locate the first [ in your string, then check what comes after it to see if it looks like a BBCode tag and process it if so. Then, search for [ again starting from where you ended up.
This has certain advantages, such as being able to examine each code and skip it if it's invalid, as well as enforcing proper tag closing order ([bold][italic]Nesting![/bold][/italic] should be considered invalid) and being able to provide meaningful error messages to the user if something is wrong (invalid parameter, perhaps) because the parser knows exactly what is going on, whereas a regex would output something unexpected and potentially harmful.
It might be more work (or less, depending on your skill with regex), but it's worth it.
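To make the idea concrete, here is a minimal sketch of such a parser. It is only an illustration under a few assumptions of mine: the function names (parseBBCode, parseNodes, parseAttributes) and the node layout are made up, attributes are expected in the form key="value", and a code is treated as a container only if a matching closing tag appears somewhere later in the text.

```php
<?php
// Illustrative sketch: names and node layout are assumptions, not an existing API.

// Turn ' icon="rocket" size="2"' into ['icon' => 'rocket', 'size' => '2'].
function parseAttributes(string $raw): array
{
    $attributes = [];
    preg_match_all('/([\w-]+)="([^"]*)"/', $raw, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }
    return $attributes;
}

// Parse $text from $pos until the closing tag named $until (or the end of input).
function parseNodes(string $text, int &$pos, ?string $until = null): array
{
    $nodes  = [];
    $length = strlen($text);

    while ($pos < $length) {
        $open = strpos($text, '[', $pos);
        if ($open === false) {            // no more brackets: the rest is plain text
            $nodes[] = substr($text, $pos);
            $pos = $length;
            break;
        }
        if ($open > $pos) {               // plain text in front of the bracket
            $nodes[] = substr($text, $pos, $open - $pos);
            $pos = $open;
        }
        // A closing tag must match the code we are currently inside.
        if (preg_match('/\G\[\/(\w+)\]/', $text, $m, 0, $pos)) {
            if ($m[1] !== $until) {
                throw new InvalidArgumentException(
                    sprintf('Unexpected closing tag [/%s] at offset %d', $m[1], $pos)
                );
            }
            $pos += strlen($m[0]);
            return $nodes;
        }
        // An opening tag, optionally with key="value" attributes.
        if (preg_match('/\G\[(\w+)((?:\s+[\w-]+="[^"]*")*)\s*\]/', $text, $m, 0, $pos)) {
            $pos += strlen($m[0]);
            $node = ['name' => $m[1], 'attributes' => parseAttributes($m[2]), 'children' => []];
            // Assumption: only descend if a closing tag for this code follows.
            if (strpos($text, '[/' . $m[1] . ']', $pos) !== false) {
                $node['children'] = parseNodes($text, $pos, $m[1]);
            }
            $nodes[] = $node;
            continue;
        }
        $nodes[] = '[';                   // a lone '[' that is not a tag
        $pos++;
    }

    if ($until !== null) {
        throw new InvalidArgumentException(sprintf('Missing closing tag [/%s]', $until));
    }
    return $nodes;
}

function parseBBCode(string $text): array
{
    $pos = 0;
    return parseNodes($text, $pos);
}

$tree = parseBBCode('[heading icon="rocket"]I\'m a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]');
print_r($tree);
```

Because the parser always knows which code it is inside, a wrongly ordered closing tag raises an exception with an offset instead of silently producing odd output.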

Related

Matching all three kinds of PHP comments with a regular expression

I need to match all three types of comments that PHP might have:
# Single line comment
// Single line comment
/* Multi-line comments */
 
/**
* And all of its possible variations
*/
Something I should mention: I am doing this in order to be able to recognize whether a PHP closing tag (?>) is inside a comment or not. If it is, ignore it; if not, count it as a closing tag. This is going to be used inside an XML document in order to improve Sublime Text's recognition of the closing tag (because it's driving me nuts!). I tried to achieve this for a couple of hours, but I wasn't able to. How can I translate it to work with XML?
So if you could also include the if-then-else logic I would really appreciate it. BTW, I really need it to be a pure regular expression, no language features or anything. :)
Like Eicon reminded me, I need all of them to be able to match at the start of the line, or at the end of a piece of code, so I also need the following with all of them:
<?php
echo 'something'; # this is a comment
?>
Parsing a programming language seems too much for regexes to do. You should probably look for a PHP parser.
But these would be the regexes you are looking for. I assume for all of them that you use the DOTALL or SINGLELINE option (although the first two would work without it as well):
~#[^\r\n]*~
~//[^\r\n]*~
~/\*.*?\*/~s
Note that any of these will cause problems if the comment-delimiting characters appear in a string, or anywhere else where they do not actually open a comment.
You can also combine all of these into one regex:
~(?:#|//)[^\r\n]*|/\*.*?\*/~s
If you use some tool or language that does not require delimiters (like Java or C#), remove those ~. In this case you will also have to apply the DOTALL option differently. But without knowing where you are going to use this, I cannot tell you how.
If you cannot/do not want to set the DOTALL option, this would be equivalent (I also left out the delimiters to give an example):
(?:#|//)[^\r\n]*|/\*[\s\S]*?\*/
See here for a working demo.
Now if you also want to capture the contents of the comments in a group, then you could do this
(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)
Regardless of the type of comment, the comment's content (without the syntax delimiters) will be found in capture group 1.
Another working demo.
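For anyone who wants to try the capturing version from PHP itself, a small sketch follows. The function name extractComments and the sample input are my own; the pattern is the branch-reset regex from above wrapped in ~ delimiters.

```php
<?php
// Sketch: collect the text of every comment using the branch-reset pattern.
// extractComments() is a hypothetical helper name.
function extractComments(string $code): array
{
    // (?| ... ) resets group numbering per alternative, so the content
    // always lands in group 1 no matter which comment style matched.
    $pattern = '~(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)~';
    preg_match_all($pattern, $code, $matches);
    return $matches[1];
}

$source = "echo 'something'; # this is a comment\n// another one\n/* multi\n   line */\n";
print_r(extractComments($source));
```

The same caveat applies as before: comment delimiters inside string literals will be picked up as well.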
Single-line comments
singleLineComment = /'[^']*'|"[^"]*"|((?:#|\/\/).*$)/gm
With this regex you have to replace (or remove) everything that was captured by ((?:#|\/\/).*$). This regex will ignore contents of strings that would look like comments (e.g. $x = "You are the #1"; or $y = "You can start comments with // or # in PHP, but I'm a code string";)
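The literal above is written in JavaScript notation. A rough PHP equivalent, assuming you want to remove the captured comments, could look like this (stripLineComments is a name I made up; the g flag is implied by preg_replace_callback, and m is kept so that $ matches at each line end):

```php
<?php
// Sketch: drop # and // comments while leaving quoted strings untouched.
function stripLineComments(string $code): string
{
    $pattern = '~\'[^\']*\'|"[^"]*"|((?:#|//).*$)~m';
    return preg_replace_callback($pattern, function (array $m): string {
        // Group 1 only exists when the comment alternative matched;
        // quoted strings are returned unchanged.
        return isset($m[1]) ? '' : $m[0];
    }, $code);
}

echo stripLineComments('$x = "You are the #1"; # real comment');
```

Matching the strings first is what protects them: once a quoted string has been consumed, the comment alternative can no longer start inside it.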
Multiline comments
multilineComment = /^\s*\/\*\*?[^!][.\s\t\S\n\r]*?\*\//gm

regex, php, and the evil nested (?R)

UPDATE
So I am still messing with this, and have gotten as far as finding all the instances of tags, though I'd rather JUST find the deepest stacked instance, as life would be easier that way.. Anyway here is what I got..
/(({{)(?:(?=([^\/][^ ]*?))\3|(\/[\w])))([a-zA-Z0-9\$\'\"\s\#\%\^\&\!\.\_\+\=\-\\\*\(\)\ ]+?}})/
Are there ANY regexp gurus out there who could give me some pointers, or a regexp that does what I need? That is, getting only the most deeply nested instance of a {{tag}} that ends like this: {{/tag}}
ORIGINAL
Ok, so I have an issue I have seen others have, but with a different approach to it.. Or so I thought.. So I am curious if anyone else can help me solve this issue further..
I have a database full of templates that I need to work with in PHP. These templates are made and used by another system, and therefore cannot be changed. With that said, these templates have hierarchy-style tags added to them. What I need to do is get these templates from the database, and then programmatically find these tags, their function name (or tag name), and their inner contents, as well as anything following the function (tag) name within the brackets. An example of one of these tags is: {{FunctionName some (otherStuff) !Here}} Some content sits inside and it ends {{/FunctionName}}
This is where it gets more fun, the templates have another random tag, which I am guessing are the "variable" style of these tags, as they are always generally the same syntax. Which looks like this, ${RandomTag}, but also there are times that the function style one is there but without an ending tag, like so.. {{RandomLoner}}
Example Template...
{{FunctionTag (Condition?)}}
<div>This is an {{CheckOfSomeSort someTimesThese !orThese}}
example of some {{Random}} data
{{/CheckOfSomeSort}} that will be ${worked} on</div>
{{/FunctionTag}}
Ok so in no way is this a real template, but it follows all the rules that I have seen thus far.
Now I have tried different things with regex and preg_match_all to pull out the matches and get each of these into a nice array. So far, this is what I have got (I ran it on the example template to make sure it is still working):
Array
(
[0] => Array
(
[0] => {{CheckOfSomeSort someTimesThese !orThese}}example of some datas{{/CheckOfSomeSort}}
[1] => {{CheckOfSomeSort someTimesThese !orThese}}
[2] => CheckOfSomeSort
[3] => example of some data
[4] => {{/CheckOfSomeSort}}
)
)
I have tried a couple approaches, (that took me nearly 8 hours to get to)
/({{([^\/].[^ ]*)(?:.[^ ][^{{]+)}})(?:(?=([^{{]+))\3|{{(?!\2[^}}]*}}))*?({{\/\2}})/
AND, more recently...
/({{([^\/].[^ ]*)(?:.[^ ][^{{]+)}})((?:(?!\{\{|\}\}).)++|(?R)*)({{\/\2}})/
In no way am I a guru with regexp; I actually just learned it over the last day or so, trying to get this to work. I have googled for this, and realize that regexp is not designed for nested structures, but (?R) seems to do the trick on the simple bracket examples I've seen on the internet. Those, however, only ever take into account what sits between { and }, ( and ), or < and >. After reading nearly the whole regex info website and experimenting, I came up with these 2 versions.
So what I NEED to do (I think) is have a regexp work from the DEEPEST hierarchy tag first and work its way out (if I can do that with help from PHP, that's fine with me). I was thinking of finding the deepest layer, getting its data, and working backwards until all the contents are in 1 fat array. I assumed that was what (?R) was going to do for me, but it didn't.
So any help on what I am missing would be great, also take into note that mine seems to have issues with {{}} that DONT have an ending version of it. So like my {{Random}} example, was removed for the sake of me parsing the array example. I feel these tags, along with the ${} tags can be left alone (if I knew how to do that with regexp), and just remain in the text where they are. I am more or less interested in the functions and getting their data into a multidimensional array for me to work with further.
Sorry for the long post, I have just been banging my head against this all night. I started with the assumption that it was going to be a bit easier... until I realized the tags were nested :/
Any help is appreciated! Thanks!
Wow, what a strange templating syntax.
The method I would probably use to tackle this problem would be something like:
Use a simple regex to change all the {{tags}} to <tags>
Use another simple regex to convert the space-delimited arguments/conditions inside tags to XML-like attribute syntax (ex. {{foo bar !baz}} would become <foo arg1="bar" arg2="!baz"> or similar)
Process it as a DOMDocument.
Have fun. :-)
Warning! You are trying to write a parser with just regular expressions. That doesn't work very well. Why not? Because you need to store state as well!
So what then? Well, you write a parser of course :D
If you need any tips on how to get started I can help but I'd encourage you to try it by yourself first. How does a parser work anyway? :)
Tokenize your input. And transform it to a nested tree like so:
array(
    array("code", "FunctionTag (Condition?)", array(
        "<div>This is an ",
        array("code", "CheckOfSomeSort someTimesThese !orThese", array(
            "example of some ",
            array("code", array("Random"), array()),
            " data"
        )),
        " that will be ${worked} on</div>"
    ))
)
Now you just have to interpret the code parts and produce the expected output. You could also add things like line numbers and character positions which is very useful for debugging.
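If you want a starting point, here is a rough sketch of the tokenize-then-nest idea. The function names (parseTemplate, buildTree) are mine, and it rests on one assumption that the question does not confirm: a {{Tag}} is treated as a block only if a {{/Tag}} with the same name exists somewhere in the template, otherwise it is a standalone tag like {{Random}}.

```php
<?php
// Sketch of a tiny tokenizer + tree builder; all names are hypothetical.

// Consume tokens starting at $i until the closing tag {{/$until}} is reached.
function buildTree(array $tokens, int &$i, array $blockNames, ?string $until = null): array
{
    $nodes = [];
    while ($i < count($tokens)) {
        $token = $tokens[$i++];
        if (!preg_match('/^\{\{(\/?)(\w+)(.*)\}\}$/s', $token, $m)) {
            $nodes[] = $token;                       // plain text, ${vars} included
            continue;
        }
        list(, $slash, $name, $rest) = $m;
        if ($slash === '/') {
            if ($name !== $until) {
                throw new RuntimeException('Unexpected {{/' . $name . '}}');
            }
            return $nodes;                           // block finished
        }
        // Assumption: only names that have a closing tag open a block.
        $children = isset($blockNames[$name])
            ? buildTree($tokens, $i, $blockNames, $name)
            : [];
        $nodes[] = ['code', trim($name . $rest), $children];
    }
    if ($until !== null) {
        throw new RuntimeException('Missing {{/' . $until . '}}');
    }
    return $nodes;
}

function parseTemplate(string $template): array
{
    // Split into text and {{...}} tokens, keeping the delimiters.
    $tokens = preg_split('/(\{\{.*?\}\})/s', $template, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    preg_match_all('/\{\{\/(\w+)\}\}/', $template, $closers);
    $i = 0;
    return buildTree($tokens, $i, array_flip($closers[1]));
}

$template = <<<'TPL'
{{FunctionTag (Condition?)}}
<div>This is an {{CheckOfSomeSort someTimesThese !orThese}}
example of some {{Random}} data
{{/CheckOfSomeSort}} that will be ${worked} on</div>
{{/FunctionTag}}
TPL;

print_r(parseTemplate($template));
```

Each node has the shape ["code", header, children], close to the array above, except that the header is always kept as a plain string.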
After a bit of time working on it, I ultimately learned more about regex, and understand it to a T now. Great thing about this, is that PHP has the (?R) and I now understand why it even looks like this. lol
In the end, the regex that I got working spawned off the php page that explained the recursive (?R). I then just worked on getting the tags regex in place of the parenthesis they were using in the example.
I know I wanted the innermost tag, but of course I can accomplish the same thing with the outermost tag, so this regex does just that. It finds and grabs the outermost {{tag (thatMightHaveDataHere)}} And has inner contents that may be more {{TAGS}} within it.{{/tag}}
Here it is,
/{{([\w]+) ?([^}]*?)(?:}}((?:[^{]*?|(?R)|{{[\w]*?}}|\${.*?})*){{\/\1}})/
0 = Matched "Outer Tag"
1 = Tag that was found, ie {{tag}}{{/\1}}
2 = Any data after the first space, within the tag, ie {{tag ThisDataIs StoredAs2}}
3 = INNER Content (which can be a recursive match of this regex, a tag without an end tag {{noEndTag}}, or a tag that starts with a dollar ${likeThis})
Run a loop on the $match[3] with this regex, and you can cycle through finding them. Not sure where you would use this outside what I needed it for, but I am sure someone can modify it if they need it to work on some other nested style structure.
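The loop over $match[3] could be sketched roughly like this. The function name extractTags and the keys of the result array are my own choice; the pattern is the one posted above, unchanged. I have only reasoned through it on small inputs, so treat it as a starting point.

```php
<?php
// Sketch: apply the regex to the text, then again to each match's inner content.
function extractTags(string $text): array
{
    $pattern = '/{{([\w]+) ?([^}]*?)(?:}}((?:[^{]*?|(?R)|{{[\w]*?}}|\${.*?})*){{\/\1}})/';
    $result = [];
    if (!preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        return $result;                 // no block tags at this level
    }
    foreach ($matches as $m) {
        $result[] = [
            'name'     => $m[1],
            'args'     => $m[2],
            'content'  => $m[3],
            'children' => extractTags($m[3]),   // descend into the inner content
        ];
    }
    return $result;
}

$sample = '{{Outer (cond)}}a {{Inner x !y}}b {{Lone}} ${v}{{/Inner}} c{{/Outer}}';
print_r(extractTags($sample));
```

Tags without an end tag and ${...} variables simply stay inside the 'content' string of their parent.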

Stuck with regexp

I'm stuck with PHP's preg_match_all function. Maybe someone will help me with the regexp. Let's assume we have some code:
[a]a[/a]
[s]a[/s]
[b]1[/b]
[b]2[/b]
...
...
[b]n[/b]
[e]a[/e]
[b]8[/b]
[b]9[/b]
...
...
[b]n[/b]
I need to match everything inside the [b] tags that are located between the [s] and [e] tags. Any ideas?
If your structure is exactly the same as above, I would personally avoid regex (not a good idea with this sort of language) and just check the second char of each line. Once you see an s, go into consume mode, and for each line until you see an e, find the first ] and read in everything between that and the next [
For simplicity use two preg_match calls.
First to retrieve the list you want to inspect /\[s](.+?)\[e]/s.
And then use that result string and match for the contained /\[b](.+?)\[\/b]/s things.
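Put together, the two steps might look like this. The helper name extractBetween is made up, and the second step uses preg_match_all so that every [b] block inside the section is returned.

```php
<?php
// Sketch: first isolate the section between [s] and [e], then pull out the [b] contents.
function extractBetween(string $text): array
{
    if (!preg_match('/\[s](.+?)\[e]/s', $text, $section)) {
        return [];                       // no [s] ... [e] section found
    }
    preg_match_all('/\[b](.+?)\[\/b]/s', $section[1], $inner);
    return $inner[1];
}

$input = "[a]a[/a]\n[s]a[/s]\n[b]1[/b]\n[b]2[/b]\n[e]a[/e]\n[b]8[/b]\n[b]9[/b]\n";
print_r(extractBetween($input));
```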
It looks like you are trying to pattern match something that has a treelike structure, essentially like HTML or XML. Any time you find yourself saying "find X located inside matching Y tags" you are going to have this problem.
Trying to do this sort of work with regular expressions is a Bad Idea.
Here's some info copy/pasted from a different answer of mine for a similar question:
Some references to similar SO posts which will give you an idea of the difficulty you're getting into:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
The "Right Thing" to do is to parse your input, maintaining state as you go. This can be as simple as scanning your text and keeping a stack of current tags.
Regular expressions alone aren't sufficient to parse XML, and this appears to be a simplified XML language here.
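As a small illustration of "keeping a stack of current tags", the following sketch only checks that tags are properly nested. It assumes every [tag] has a matching [/tag]; the function name isWellNested is hypothetical.

```php
<?php
// Sketch: scan the tags in order and keep a stack of the ones still open.
function isWellNested(string $text): bool
{
    $stack = [];
    preg_match_all('/\[(\/?)(\w+)[^\]]*\]/', $text, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        list(, $slash, $name) = $tag;
        if ($slash === '') {
            $stack[] = $name;                    // opening tag: remember it
        } elseif (array_pop($stack) !== $name) {
            return false;                        // closes something that is not on top
        }
    }
    return $stack === [];                        // everything opened was closed
}

var_dump(isWellNested('[b][i]x[/i][/b]'));
var_dump(isWellNested('[b][i]x[/b][/i]'));
```

The regex here is only used to find tokens; the nesting itself is tracked by the stack, which is the part a plain regular expression cannot do.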

extracting one or more urls from a string in php

I'm trying to extract one or more urls from a plain text string in php. Here's some examples
"mydomain.com has hit the headlines again"
extract " http://www.mydomain.com"
"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"
extract "http://www.domain.com" , "http://www.anotherdomain.co.uk" , "http://www.thirddomain.net"
There are two special cases I need to handle. I'm thinking regex, but I don't fully understand regexes.
1) all symbols like '(' or ')' and spaces (excluding hyphens) need to be removed
2) the word dot needs to be replaced with the symbol . , so dot com would be .com
P.S. I'm aware of PHP validation/regex for URLs, but I can't work out how I would use this to achieve the end goal.
Thanks
In this case it will be hard to get 100% correct results.
Depending on the input you may try to force matching just most popular first level domains (add more to it):
(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b
You may need to remove the word boundary (\b) to get different results.
You can test it here:
http://bit.ly/dlrgzQ
EDIT: about your cases
1) remove from what?
2) this could be done in php like:
$result = preg_replace('/\s+dot\s+(?=(com|org|net|biz|edu|and_ect))/', '.', $input);
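Combining both snippets, a rough sketch could look like the following. The helper name extractDomains, the shortened TLD list and the normalisation to an http://www. prefix are my assumptions, based on the output the question asks for.

```php
<?php
// Sketch: rewrite " dot com" to ".com", find domain-like strings, normalise them.
function extractDomains(string $input): array
{
    // " dot com" -> ".com" (only in front of a known top-level domain)
    $input = preg_replace('/\s+dot\s+(?=(?:com|org|net|biz|edu|uk)\b)/i', '.', $input);

    preg_match_all(
        '~(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b~',
        $input,
        $matches
    );

    return array_map(function (string $host): string {
        $host = preg_replace('~^https?://~', '', $host);   // drop an existing scheme
        $host = preg_replace('~^www\.~', '', $host);        // and an existing www.
        return 'http://www.' . $host;
    }, $matches[0]);
}

print_r(extractDomains('this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net'));
```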
But I have few important notes:
These regexes are meant as guidance, not as actual production code.
Working with such loose rules on free text is wacky, to say the least, and adding more special cases will make it even loonier. Consider this: even Stack Overflow doesn't do that. It links
http://example.org
but not!
example.org
It would be easier if you said what you are trying to achieve. If you want to process some kind of text that ends up somewhere on the WWW later, then this is a very bad idea! You should not do this on your own (as you said, you don't understand regex!), as it would just open a can of XSS worms. Better to think about some kind of markup language such as Markdown or BBCode.
Also take a look at: http://htmlpurifier.org/

PHP - I need some help with my Regex

I've created a simple template 'engine' in PHP to substitute PHP-generated data into the HTML page. Here's how it works:
In my main template file, I have variables like so:
<title><!-- %{title}% --></title>
I then assign data into those variables for the main page load
$assign = array (
'title' => 'my website - '
);
I then have separate template blocks that get loaded for the content pages. The above really just handles the header and the footer. In one of these 'content template files', I have variables like so:
<!-- %{title=content page}% -->
Once this gets executed, the main template data is edited to include the content page variables resulting in:
<title>my website - content page</title>
It does this with the following code:
if (preg_match('/<!-- %{title=\s*(.*?)}% -->/s', $string, $matches)) {
    // Find variable names in the form of %{varName=new data to append}%
    // If found, append that new data to the existing data
    $string = preg_replace('/<!-- %{title=\s*(.*?)}% -->/s', '', $string);
    $varData[$i] .= $matches[1];
}
This basically removes the template variables and then assigns the variable data to the existing variable. Now, this all works fine. What I'm having issues with is nesting template variables. If I do something like:
<!-- %{title=content page (author: <!-- %{name}% -->) -->
The pattern, at times, messes up the opening and closing tags of each variable.
How can I fix my regular expression to prevent this?
Thank you.
The answer is you don't do this with regex. Regular expressions describe regular languages. When you start nesting things, the language is no longer regular. It is, at a minimum, a context-free language ("CFL"). CFLs can only be processed (assuming they're unambiguous) with a stack.
Specifically, regular languages can be represented with a finite state machine ("FSM"). CFLs require a pushdown automaton ("PDA").
An example of the difference is nested tags in HTML:
<div>
<div>inner</div>
</div>
My advice is don't write your own template language. That's been done. Many times. Use Smarty or something in Zend, Kohana or whatever. If you do write your own, do it properly. Parse it.
Why are you rolling your own template engine? If you want this kind of complexity, there's a lot of places that have already come up with solutions for it. You should just plug in Smarty or something like that.
If you're asking what I think you're asking, it's literally impossible. If I read your question correctly, you want to match arbitrarily-nested <!-- ... --> sequences with particular things inside. Unfortunately, regular expressions can only match certain classes of strings; any regular expression can match only a regular language. One well-known language which is not regular is the language of balanced parentheses (also known as the Dyck language), which is exactly what you're trying to match. In order to match arbitrarily-nested comment strings, you need a more powerful tool. I'm fairly sure there are pre-existing PHP template engines; you might look into one of those.
To resolve your problem you should
replace preg_match() with preg_match_all();
find all occurrences of the pattern, and replace them from the last one to the first one;
use a more restrictive pattern like '/<!-- %{title=\s*([^}]*?)}% -->/s'.
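The three points above can be sketched as follows. The function name extractTitleParts and the returned [string, parts] pair are my own choices; only the pattern comes from the list.

```php
<?php
// Sketch: find every title variable, remove them back to front, keep their values.
function extractTitleParts(string $html): array
{
    $pattern = '/<!-- %{title=\s*([^}]*?)}% -->/s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $parts = [];
    // Going from the last match to the first keeps the earlier offsets valid.
    foreach (array_reverse($matches) as $m) {
        list($full, $offset) = $m[0];
        $html = substr_replace($html, '', $offset, strlen($full));
        array_unshift($parts, $m[1][0]);     // restore document order
    }
    return [$html, $parts];
}

list($clean, $parts) = extractTitleParts('<p><!-- %{title=content page}% --></p>');
echo $clean, "\n", implode(', ', $parts), "\n";
```

Because [^}] cannot cross a closing brace, each match stays within a single variable instead of swallowing the next one.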
I've done something similar in the past, and I have encountered the same nesting issue you did. In your case, what I would do is repeatedly search your text for matches (rather than searching once and looping through the matches) and extract the strings you want by searching for anything that doesn't include your closing string.
In your case, it would probably look like this:
/(<!--([^(-->)]*?)-->)/
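Regexes like this are a nightmare to explain, but basically, ([^(-->)]*?) will find a run of characters that cannot reach past your closing tag (let's call that run AAA). It will be inside a matching group that is, itself, your template tag, (<!--AAA-->). Be aware that a character class excludes single characters, not the string --> as a whole, so AAA may not contain any (, ), - or > at all.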
I'm convinced this sort of templating method is the wrong way to do things, but I've never known enough to do it better. It's always bothered me in ASP and ColdFusion that you had to nest your scripting tags inside HTML and when I started to do it myself, I considered it a personal failure.
Most Regexes I do now are in JavaScript and so I may be missing some of the awesome nuances PHP has via Perl. I'd be happy if someone can write this more cleanly.
I too have run into this problem in the past, although I didn't use regular expressions.
If instead you search from right to left for the opening tag, <!-- %{ in your syntax, using strrpos (PHP5+), then search forwards for the first occurrence of the next closing tag, and then replace that chunk first, you will end up replacing the inner-most nested variables first. This should resolve your problem.
You can also do it the other way around and find the first occurrence of a closing tag, and work backwards to find its corresponding opening tag.
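A minimal sketch of the right-to-left approach is shown below. The function name replaceInnermostFirst and the $resolve callback are assumptions of mine; what a variable expands to is up to your engine. It also assumes that replacement values never contain the opening tag themselves, otherwise the loop would not terminate.

```php
<?php
// Sketch: always replace the last opening tag first, which is the innermost one.
function replaceInnermostFirst(string $template, callable $resolve): string
{
    $open  = '<!-- %{';
    $close = '}% -->';

    while (($start = strrpos($template, $open)) !== false) {
        $end = strpos($template, $close, $start);
        if ($end === false) {
            break;                        // unterminated variable: stop here
        }
        $inner = substr($template, $start + strlen($open), $end - $start - strlen($open));
        $template = substr_replace(
            $template,
            $resolve($inner),
            $start,
            $end + strlen($close) - $start
        );
    }
    return $template;
}

$vars = ['name' => 'Bob'];
echo replaceInnermostFirst(
    'Written by <!-- %{name}% -->',
    function (string $inner) use ($vars): string {
        return $vars[$inner] ?? '';       // unknown variables become empty
    }
);
```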
