Stuck with regexp - php

I'm stuck with PHP's preg_match_all function. Maybe someone will help me with the regexp. Let's assume we have some code:
[a]a[/a]
[s]a[/s]
[b]1[/b]
[b]2[/b]
...
...
[b]n[/b]
[e]a[/e]
[b]8[/b]
[b]9[/b]
...
...
[b]n[/b]
I need to match everything inside the [b] tags located between the [s] and [e] tags. Any ideas?

If your structure is exactly the same as above, I would personally avoid regex (not a good idea with these sorts of languages) and just check the second character of each line. Once you see an s, go into consume mode, and for each line until you see an e, find the first ] and read in everything between that and the next [.
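The line-by-line approach described above might look roughly like this in PHP. It is only a minimal sketch: it assumes the input has exactly the layout shown in the question (one tag per line, with the tag letter as the second character), and the function name extractBetweenMarkers is made up for illustration.

```php
<?php
// Hypothetical helper: collect the contents of [b] lines that sit
// between an [s] line and an [e] line.
function extractBetweenMarkers(string $text): array
{
    $values    = [];
    $consuming = false;

    foreach (preg_split('/\R/', $text) as $line) {
        $tag = $line[1] ?? '';          // second character, e.g. "s" in "[s]a[/s]"

        if ($tag === 's') {             // start marker: enter consume mode
            $consuming = true;
            continue;
        }
        if ($tag === 'e') {             // end marker: leave consume mode
            $consuming = false;
            continue;
        }
        if (!$consuming || $tag !== 'b') {
            continue;
        }

        // Read everything between the first "]" and the next "[".
        $start = strpos($line, ']');
        $end   = $start === false ? false : strpos($line, '[', $start + 1);
        if ($end !== false) {
            $values[] = substr($line, $start + 1, $end - $start - 1);
        }
    }

    return $values;
}
```

Filtering on the b lines is an addition based on the question; the answer itself only says to read each line while in consume mode.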

For simplicity, use two steps.
First, use preg_match to retrieve the section you want to inspect: /\[s](.+?)\[e]/s.
Then run preg_match_all on that result string to match the contained /\[b](.+?)\[\/b]/s items.
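Put together, the two steps could be sketched as follows. The two patterns are the ones from this answer; the wrapper function matchInsideSection is a hypothetical name, and the second step uses preg_match_all so that every [b] item is collected.

```php
<?php
// Hypothetical helper wrapping the two patterns from the answer.
function matchInsideSection(string $text): array
{
    // Step 1: grab everything from the first [s] up to the next [e].
    if (!preg_match('/\[s](.+?)\[e]/s', $text, $section)) {
        return [];
    }

    // Step 2: collect every [b]...[/b] inside that captured section.
    preg_match_all('/\[b](.+?)\[\/b]/s', $section[1], $matches);

    return $matches[1];
}
```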

It looks like you are trying to pattern match something that has a treelike structure, essentially like HTML or XML. Any time you find yourself saying "find X located inside matching Y tags" you are going to have this problem.
Trying to do this sort of work with regular expressions is a Bad Idea.
Here's some info copy/pasted from a different answer of mine for a similar question:
Some references to similar SO posts which will give you an idea of the difficulty you're getting into:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
The "Right Thing" to do is to parse your input, maintaining state as you go. This can be as simple as scanning your text and keeping a stack of current tags.

Regular expressions alone aren't sufficient to parse XML, and this appears to be a simplified XML language here.

Related

Recursive Regex in PHP with variable names

I'm trying to make a BBCode-ish engine for my website. The catch is that it is not known in advance which codes are available, because the codes are made by the users. On top of that, the whole thing has to be recursive.
For example:
Hello my name is [name user-id="1"]
I [bold]really[/bold] like cheeseburgers
These are the easy ones, and I managed to make them work.
Now the problem is what happens when two of those codes follow each other:
I [bold]really[/bold] like [bold]cheeseburgers[/bold]
Or inside each other
I [bold]really like [italic]cheeseburgers[/italic][/bold]
These codes can also have attributes
I [bold strengh="600"]really like [text font-size="24px"]cheeseburgers[/text][/bold]
The following one worked quite well, but lacks the recursive part (?R):
(?P<code>\[(?P<code_open>\w+)\s?(?P<attributes>[a-zA-Z-0-1-_=" .]*?)](?:(?P<content>.*?)\[\/(?P<code_close>\w+)\])?)
I just don't know where to put the (?R) recursive token.
Also the system has to know that in this string here
I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]
are 2 "code-objects":
1. [bold]really like [italic]cheeseburgers[/italic][/bold]
and
2. [bold]football[/bold]
... and the content of the first one is
really like [italic]cheeseburgers[/italic]
which again has a code in it
[italic]cheeseburgers[/italic]
which content is
cheeseburgers
I have searched the web for two days now and I can't figure it out.
I thought of something like this:
Look for something like [**** attr="foo"], where the attributes are optional, and store it in a capturing group.
Look up whether there is a closing tag somewhere (it can be optional too).
If a closing tag exists, everything between the two tags should be stored as a "content" capturing group, which then has to go through the same procedure again.
I hope there are some regex specialists who are willing to help me. :(
Thank you!
EDIT
As this might be difficult to understand, here is an input and an expected output:
Input:
[heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]
I'd like to have an array like
array[0][name] = heading
array[0][attributes][icon] = rocket
array[0][content] = I'm a cool heading
array[1][name] = textrow
array[1][content] = [text]<p>Hi!</p>[/text]
array[1][0][name] = text
array[1][0][content] = <p>Hi!</p>
Having written multiple BBCode parsing systems, I can suggest NOT using regexes only. Instead, you should actually parse the text.
How you do this is up to you, but as a general idea you would want to use something like strpos to locate the first [ in your string, then check what comes after it to see if it looks like a BBCode tag and process it if so. Then, search for [ again starting from where you ended up.
This has certain advantages, such as being able to examine each code and skip it if it's invalid, as well as enforcing proper tag closing order ([bold][italic]Nesting![/bold][/italic] should be considered invalid) and being able to provide meaningful error messages to the user if something is wrong (invalid parameter, perhaps) because the parser knows exactly what is going on, whereas a regex would output something unexpected and potentially harmful.
It might be more work (or less, depending on your skill with regex), but it's worth it.
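As a rough illustration of the strpos-based idea above, here is a minimal sketch of such a parser. The function names, the node layout (name, attributes, children) and the rule "a tag with no later closing tag is standalone" are all assumptions made for this example and not part of the answer.

```php
<?php
// Hypothetical entry point: turn BBCode-like text into a nested array.
function parseBBCode(string $text): array
{
    $pos = 0;
    return parseNodes($text, $pos, null);
}

// Parse nodes from $pos until the closing tag for $until (or end of input).
function parseNodes(string $text, int &$pos, ?string $until): array
{
    $nodes = [];
    $len   = strlen($text);

    while ($pos < $len) {
        // Locate the next "[" and keep everything before it as plain text.
        $open = strpos($text, '[', $pos);
        if ($open === false) {
            $nodes[] = substr($text, $pos);
            $pos     = $len;
            break;
        }
        if ($open > $pos) {
            $nodes[] = substr($text, $pos, $open - $pos);
            $pos     = $open;
        }

        // Check whether what follows "[" looks like a tag; otherwise keep it as text.
        if (!preg_match('/\G\[(\/?)(\w+)((?:\s+[\w-]+="[^"]*")*)\s*\]/', $text, $m, 0, $pos)) {
            $nodes[] = '[';
            $pos++;
            continue;
        }
        $pos += strlen($m[0]);

        if ($m[1] === '/') {
            // A closing tag must match the tag we are currently inside.
            if ($m[2] !== $until) {
                throw new RuntimeException('Unexpected closing tag [/' . $m[2] . ']');
            }
            return $nodes;
        }

        preg_match_all('/([\w-]+)="([^"]*)"/', $m[3], $pairs, PREG_SET_ORDER);
        $attributes = [];
        foreach ($pairs as $pair) {
            $attributes[$pair[1]] = $pair[2];
        }

        $node = ['name' => $m[2], 'attributes' => $attributes, 'children' => []];
        // Assumption of this sketch: a tag with no later closing tag is standalone.
        if (strpos($text, '[/' . $m[2] . ']', $pos) !== false) {
            $node['children'] = parseNodes($text, $pos, $m[2]);
        }
        $nodes[] = $node;
    }

    if ($until !== null) {
        throw new RuntimeException('Missing closing tag [/' . $until . ']');
    }
    return $nodes;
}
```

Because the parser knows which tag it is inside, a mis-nested input such as [bold][italic]Nesting![/bold][/italic] ends in an exception with a readable message instead of a surprising match.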

Php regex match a string between two html tags with the tags been unknown

Ok, so here's my issue:
I have a link, say: http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV
And the link is between two tags say like this:
<br>http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV<br></p>
Using this regex with preg_replace:
'#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i'
As such:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "***",$strText);
The resulted string is :
<br***p>
Which is wrong!!
It should have been
<br>***<br></p>
How can I get the desired result? I have blasted my head out trying to solve this one out.
I would like to mention that str_replace replaces even the link within another valid link, so it's not a good method, I need an exact match between two boundaries, even if the boundary is text or another HTML tag.
Assuming you don't want to use a DOM parser for some reason, I believe doing what you intended is as simple as the following:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "$1***$3",$strText);
This uses $1 and $3 to put back the delimiting text you matched in your regular expression.
As others have pointed out, using a DOM parser is more reliable.
Does this do what you want?
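For illustration, the same idea can be wrapped in a small function. This is a sketch under a few assumptions: the helper name replaceExactLink is made up, preg_quote is swapped in for addcslashes so that every regex metacharacter in the link is escaped, and the boundary groups are simplified to "start of string or any character except /" before the link and "end of string or any character that is neither a word character nor /" after it.

```php
<?php
// Hypothetical helper: replace $link only where it is not part of a longer URL.
function replaceExactLink(string $text, string $link, string $replacement = '***'): string
{
    $pattern = '#(^|[^/])(' . preg_quote($link, '#') . ')([^\w/]|$)#i';

    // $1 and $3 put back the boundary characters consumed by the match.
    // Note: "$" or "\" inside $replacement would be interpreted by preg_replace.
    return preg_replace($pattern, '$1' . $replacement . '$3', $text) ?? $text;
}
```

With the input from the question, the link between the two tags is replaced while the surrounding `<br>` tags stay intact; a longer link that merely starts with the same text is left alone.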

Looking for a regex to get the data stored within the angle <> brackets

OK, I tried to google it but couldn't find a solution, so I am asking here. I am trying to save HTML tags into a variable in PHP. I am trying to use preg_match but cannot find the right pattern (regex). I did find one regex, '\s*(.*?)\s*>\s*'. This works fine on the functions-online site where I tried it and gives me the whole tag, i.e. <body>, but when I try to run it in my program I get
preg_match(): Delimiter must not be alphanumeric or backslash
It would be helpful if anyone could sort out this issue and even better if anyone could give the regex to get the data within the angle brackets(HTML tags)
Please also let me know if there is another method to store the html tags in php i.e.
<body>
then $var=body
RegEx match open tags except XHTML self-contained tags <-- Read the 1st answer if you are considering "parsing" HTML with regexes
You need to add so-called delimiters: '/\s*(.*?)\s*>\s*/'
OK, thanks to the link provided by killerx I did find a regex that could be used. It is not the best method, but it should work for my task:
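To show what the delimiters change in practice, here is a small sketch. The helper name firstTagName and its pattern are assumptions for this example (a simplified tag matcher, not a full HTML parser); the point is only that the pattern is wrapped in / ... / and that a / inside it then has to be escaped.

```php
<?php
// Without delimiters, PHP takes the first character of the pattern as the
// delimiter, which is why '\s*(.*?)\s*>\s*' triggers the warning quoted above.

// Hypothetical helper: return the name of the first opening tag, or null.
function firstTagName(string $html): ?string
{
    // "/" is the delimiter here, so the "/" in the lookbehind is escaped.
    // (?<!\/) skips self-closing tags such as <br/>.
    if (preg_match('/<([a-z][a-z0-9]*)[^>]*(?<!\/)>/i', $html, $m)) {
        return $m[1];
    }
    return null;
}
```

Called with '<body>', the function returns 'body', which matches the "$var=body" result asked for in the question.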
'\'<([a-z]+)[^>]*(?<!/)>\''
This should work. It will put the full tag in one element of the match array and the tag name in the other.
Thanx a ton for helping me out

regex, php, and the evil nested (?R)

UPDATE
So I am still messing with this, and have gotten as far as finding all the instances of tags, though I'd rather JUST find the deepest stacked instance, as life would be easier that way. Anyway, here is what I've got:
/(({{)(?:(?=([^\/][^ ]*?))\3|(\/[\w])))([a-zA-Z0-9\$\'\"\s\#\%\^\&\!\.\_\+\=\-\\\*\(\)\ ]+?}})/
Are there ANY regexp gurus out there who could give me some pointers, or a regexp that mimics what I need? All I need is the deepest stacked instance of a {{tag}} that ends like this: {{/tag}}
ORIGINAL
Ok, so I have an issue I have seen others have, but with a different approach to it.. Or so I thought.. So I am curious if anyone else can help me solve this issue further..
I have a database full of templates that I need to work with in PHP. These templates are made and used by another system, and therefore cannot be changed. With that said, these templates have hierarchy-style tags added to them. What I need to do is get these templates from the database, and then programmatically find these tags, their function name (or tag name), and their inner contents, as well as anything following the function (tag) name within the brackets. An example of one of these tags is: {{FunctionName some (otherStuff) !Here}} Some content sits inside and it ends {{/FunctionName}}
This is where it gets more fun, the templates have another random tag, which I am guessing are the "variable" style of these tags, as they are always generally the same syntax. Which looks like this, ${RandomTag}, but also there are times that the function style one is there but without an ending tag, like so.. {{RandomLoner}}
Example Template...
{{FunctionTag (Condition?)}}
<div>This is an {{CheckOfSomeSort someTimesThese !orThese}}
example of some {{Random}} data
{{/CheckOfSomeSort}} that will be ${worked} on</div>
{{/FunctionTag}}
Ok so in no way is this a real template, but it follows all the rules that I have seen thus far.
Now I have tried different things with regex and preg_match_all to pull out the matches, and get each of these into a nice array. So far what I have got is this (used it on the example template to make sure its working still)
Array
(
    [0] => Array
        (
            [0] => {{CheckOfSomeSort someTimesThese !orThese}}example of some datas{{/CheckOfSomeSort}}
            [1] => {{CheckOfSomeSort someTimesThese !orThese}}
            [2] => CheckOfSomeSort
            [3] => example of some data
            [4] => {{/CheckOfSomeSort}}
        )
)
I have tried a couple approaches, (that took me nearly 8 hours to get to)
/({{([^\/].[^ ]*)(?:.[^ ][^{{]+)}})(?:(?=([^{{]+))\3|{{(?!\2[^}}]*}}))*?({{\/\2}})/
AND, more recently...
/({{([^\/].[^ ]*)(?:.[^ ][^{{]+)}})((?:(?!\{\{|\}\}).)++|(?R)*)({{\/\2}})/
In no way am I a guru with regexp; I actually just learned it over the last day or so, trying to get this to work. I have googled for this, and realize that regexp is not designed for nested stuff, but (?R) seems to do the trick on the simple bracket examples I've seen online. Those examples only ever take into account the stuff between { and }, ( and ), or < and >. After reading nearly the whole regex info website, and playing around, I came up with these 2 versions.
So what I NEED to do (I think) is have a regexp work from the DEEPEST hierarchy tag first, and work its way out (if I can do that with help from PHP, that's fine with me). I was thinking of finding the deepest layer, getting its data, and working backwards until all the contents are in 1 fat array. I assumed that was what (?R) was going to do for me, but it didn't.
So any help on what I am missing would be great, also take into note that mine seems to have issues with {{}} that DONT have an ending version of it. So like my {{Random}} example, was removed for the sake of me parsing the array example. I feel these tags, along with the ${} tags can be left alone (if I knew how to do that with regexp), and just remain in the text where they are. I am more or less interested in the functions and getting their data into a multidimensional array for me to work with further.
Sorry for the long post, I have just been banging my head against this all night. I started with the assumption that it was going to be a bit easier, until I realized the tags were nested :/
Any help is appreciated! Thanks!
Wow, what a strange templating syntax.
The method I would probably use to tackle this problem would be something like:
Use a simple regex to change all the {{tags}} to <tags>
Use another simple regex to convert the space-delimited arguments/conditions inside tags to XML-like attribute syntax (ex. {{foo bar !baz}} would become <foo arg1="bar" arg2="!baz"> or similar)
Process it as a DOMDocument.
Have fun. :-)
Warning! You are trying to write a parser with just regular expressions. That doesn't work very well. Why not? Because you need to store state as well!
So what then? Well, you write a parser of course :D
If you need any tips on how to get started I can help but I'd encourage you to try it by yourself first. How does a parser work anyway? :)
Tokenize your input. And transform it to a nested tree like so:
array(
    array("code", "FunctionTag (Condition?)", array(
        "<div>This is an ",
        array("code", "CheckOfSomeSort someTimesThese !orThese", array(
            "example of some ",
            array("code", "Random", array()),
            " data"
        )),
        " that will be ${worked} on</div>"
    ))
)
Now you just have to interpret the code parts and produce the expected output. You could also add things like line numbers and character positions which is very useful for debugging.
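A tokenizer and tree builder along these lines could be sketched as follows. Everything here is an assumption made for illustration: the function names, the use of preg_split for tokenizing, and the rule that only names with a matching {{/Name}} somewhere in the template are block tags (so {{Random}} is treated as standalone and ${...} is left in the text).

```php
<?php
// Hypothetical helpers; the node shape ["code", "Name args", children]
// follows the tree shown above.
function parseTemplate(string $tpl): array
{
    // Split on {{...}} and keep the tags as tokens; ${...} stays inside text tokens.
    $tokens = preg_split('/(\{\{.*?\}\})/s', $tpl, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Assumption: only names that have a {{/Name}} somewhere are block tags.
    preg_match_all('/\{\{\/(\w+)\}\}/', $tpl, $m);
    $blockNames = array_flip($m[1]);

    $i = 0;
    return buildTree($tokens, $i, $blockNames, null);
}

function buildTree(array $tokens, int &$i, array $blockNames, ?string $until): array
{
    $nodes = [];
    while ($i < count($tokens)) {
        $token = $tokens[$i++];
        if (!preg_match('/^\{\{(\/?)(\w+)\s*(.*?)\}\}$/s', $token, $m)) {
            $nodes[] = $token;                      // plain text token
            continue;
        }
        [, $slash, $name, $args] = $m;
        if ($slash === '/') {
            // The parser keeps state: a closing tag must match the open block.
            if ($name !== $until) {
                throw new RuntimeException('Unexpected {{/' . $name . '}}');
            }
            return $nodes;                          // current block is closed
        }
        $children = isset($blockNames[$name])
            ? buildTree($tokens, $i, $blockNames, $name)
            : [];
        $nodes[] = ['code', trim($name . ' ' . $args), $children];
    }
    if ($until !== null) {
        throw new RuntimeException('Missing {{/' . $until . '}}');
    }
    return $nodes;
}
```

Templates should be passed as single-quoted strings (or read from the database), since PHP would otherwise try to interpolate ${worked} inside a double-quoted literal.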
After a bit of time working on it, I ultimately learned more about regex, and understand it to a T now. Great thing about this, is that PHP has the (?R) and I now understand why it even looks like this. lol
In the end, the regex that I got working spawned off the php page that explained the recursive (?R). I then just worked on getting the tags regex in place of the parenthesis they were using in the example.
I know I wanted the innermost tag, but of course I can accomplish the same thing with the outermost tag, so this regex does just that. It finds and grabs the outermost {{tag (thatMightHaveDataHere)}} And has inner contents that may be more {{TAGS}} within it.{{/tag}}
Here it is,
/{{([\w]+) ?([^}]*?)(?:}}((?:[^{]*?|(?R)|{{[\w]*?}}|\${.*?})*){{\/\1}})/
0 = Matched "Outer Tag"
1 = Tag that was found, ie {{tag}}{{/\1}}
2 = Any data after the first space, within the tag, ie {{tag ThisDataIs StoredAs2}}
3 = INNER Content (which can be the recursive match of this regex, a non-ended tag {{noEndTag}}, or a tag that starts with a dollar sign, ${likeThis})
Run a loop on the $match[3] with this regex, and you can cycle through finding them. Not sure where you would use this outside what I needed it for, but I am sure someone can modify it if they need it to work on some other nested style structure.
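The loop over $match[3] could be sketched as a small recursive function. The pattern below is the one from this post, unchanged; the constant name, the function name collectTags and the shape of the returned array are assumptions made for this example.

```php
<?php
// The pattern from the post; single quotes keep \$, \/ and \1 intact.
const TAG_PATTERN = '/{{([\w]+) ?([^}]*?)(?:}}((?:[^{]*?|(?R)|{{[\w]*?}}|\${.*?})*){{\/\1}})/';

// Hypothetical helper: find the outermost tags, then repeat on their inner content.
function collectTags(string $text): array
{
    $tags = [];
    preg_match_all(TAG_PATTERN, $text, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $tags[] = [
            'tag'      => $m[1],               // tag name
            'args'     => $m[2],               // data after the first space
            'content'  => $m[3],               // raw inner content
            'children' => collectTags($m[3]),  // nested tags found in the content
        ];
    }

    return $tags;
}
```

Each level of nesting costs one extra preg_match_all call on the inner content, which is the trade-off of starting from the outermost tag rather than the deepest one.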

How to match anything except a pattern between two tags

I am attempting to match a string which is composed of HTML. Basically it is an image gallery so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex, which only returns an array with two NULL indices:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see I use negative lookahead to try and see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just + to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.
Now I realize . won't match line breaks, but I've tried everything I could imagine there and it still either provides me with the NULL indices or nearly the whole string (from the very first occurrence of <dl to </dl></div>, which includes several other occurrences of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest using tidy instead. You can easily extract all the desired tags with their contents, even from broken HTML.
In general I would not recommend to write a parser using regex.
See http://www.php.net/tidy
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks"
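A variant of the same idea is sketched below, assuming the goal is still "the `<dl>` block that directly precedes `</div>` and contains no other `<dl`". Instead of [^\z], it uses the s modifier so that . also matches line breaks, and it puts the negative lookahead in front of each consumed character. The helper name lastDlBeforeDiv is made up for this example.

```php
<?php
// Hypothetical helper: return the <dl>...</dl> block that directly precedes </div>.
function lastDlBeforeDiv(string $html): ?string
{
    // (?:(?!<dl).)+? consumes one character at a time and refuses to step
    // onto a new "<dl"; the "s" modifier lets "." match line breaks too.
    if (preg_match('/<dl(?:(?!<dl).)+?<\/dl>(?=<\/div>)/s', $html, $m)) {
        return $m[0];
    }
    return null;
}
```

As the other answers point out, a DOM parser or tidy remains the more robust choice once the markup gets less regular.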
