I'd like to work on a BBCode filter for a PHP website. (I'm using CakePHP; it would be a BBCode helper.)
I have some requirements.
BBCodes can be nested, so something like this is valid:
[block]
[block]
[/block]
[block]
[block]
[/block]
[/block]
[/block]
BBCodes can have zero or more parameters.
Example:
[video: url="url", width="500", height="500"]Title[/video]
BBCodes might have multiple behaviours.
Let's say [url]text[/url] would be transformed to [url:url="text"]text[/url],
or the video BBCode would be able to choose between YouTube, Dailymotion, and so on.
I think that covers most of my needs. I have already done something with regex, but my biggest problem was matching parameters. I got nested BBCode and BBCode with zero parameters to work, but when I added a regex match for parameters, it no longer matched nested BBCode correctly.
"\[($tag)(=.*)\"\](.*)\[\/\1\]" // It wasn't .* but the non-greedy matcher
I don't have the complete regex with me right now, but I had something that looked like the above.
So, is there a way to match BBCode efficiently with regex or something else?
The only thing I can think of is to use the visitor pattern and to split my text on each possible tag. That way I would have a bit more control over the parsing, and I could probably validate the document: if the input text doesn't contain valid BBCode, I could notify the user with an error before saving anything.
I would use SableCC to create my text parser.
http://sablecc.org/
Any better ideas? Or anything that could lead to an efficient, flexible BBCode parser?
Thank you, and sorry for my bad English...
There are several existing libraries for parsing BBCode; it may be easier to look into those than to roll your own.
Here are a couple; I'm sure there are more if you look around:
PECL bbcode
PEAR HTML_BBCodeParser
Been looking into BBCode parsers myself. Most of them use regex and PHP4 and produce errors on PHP 5.2+ or don't work at all. PECL bbcode and PEAR HTML_BBCodeParser don't appear to be maintained any more (as of late 2012) and aren't easily installed on the shared hosting setup I have to work with. StringParser_BBCode works with some minor tweaks for 5.2+, but the method for adding new tags is clumsy, and it was last updated in 2008.
Buried on the 4th page of a Bing search (I was getting desperate) I found jBBCode, which appears new and requires PHP 5.3. MIT License. I have yet to try building custom tags, but so far it is the only one I've tried that works out of the box on a shared hosting account with PHP 5.3.
There's both a PECL and a PEAR BBCode parsing library. Software's hard enough without reinventing years of work on your own.
If neither of those is an option, I'd concentrate on turning the BBCode into a valid XML string, and then using your favorite XML parsing routine on that. Very, very rough idea here, but:
Run the code through htmlspecialchars to escape any entities that need escaping
Transform all [ and ] characters into < and > respectively
Don't forget to account for the colon in cases like [tagname:
If the BBCode was nested properly, you should be all set to pass this string into an XML parsing object (SimpleXML, DOMDocument, etc.)
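The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: bbcode_to_xml() and the <bbcode> root element are made-up names, the parameter syntax is the [tag: name="value", ...] form from the question, and only bracket pairs that look like tags are rewritten rather than every [ and ]. Quotes are left unescaped so that parameter values survive as XML attributes.

```php
<?php
// Hypothetical helper: converts BBCode in the question's syntax to an XML string.
function bbcode_to_xml(string $bbcode): string
{
    // Escape &, < and > but leave quotes alone so parameter values stay quoted.
    $escaped = htmlspecialchars($bbcode, ENT_NOQUOTES, 'UTF-8');

    // Rewrite bracket pairs that look like tags: [name], [name: a="b", ...], [/name].
    $xml = preg_replace_callback(
        '/\[(\/?)([a-z]+)(?::([^\]]*))?\]/i',
        function (array $m): string {
            $name = strtolower($m[2]); // XML is case-sensitive, BBCode usually is not
            if ($m[1] === '/') {
                return '</' . $name . '>';
            }
            $attrs = '';
            // Rebuild the parameters as XML attributes, dropping the commas.
            if (isset($m[3]) && preg_match_all('/([a-z]+)="([^"]*)"/i', $m[3], $pairs, PREG_SET_ORDER)) {
                foreach ($pairs as $pair) {
                    $attrs .= ' ' . $pair[1] . '="' . $pair[2] . '"';
                }
            }
            return '<' . $name . $attrs . '>';
        },
        $escaped
    );

    // XML needs a single root element; "bbcode" is an arbitrary name.
    return '<bbcode>' . $xml . '</bbcode>';
}

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$xml = bbcode_to_xml('[video: url="http://example.com/?a=1&b=2", width="500"]Title[/video]');
$doc = simplexml_load_string($xml); // false if the tags were not nested properly
```

If simplexml_load_string() returns false, the input was not well formed, which gives a cheap way to reject bad BBCode before saving it.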
Responding to: "Any better idea?" (and I'm assuming that this was an invite not just for improvement over bbcode-specific suggestions)
We recently looked at going the BBCode route and decided on using HTML Purifier instead. This decision was based in part on the (admittedly probably biased) comparisons between various methods published by the HTML Purifier group, and on their discussion of BBCode.
And for the record, I think your English was very good. I'm sure it's much better than I could do in your native language.
Use preg_split() with the PREG_SPLIT_DELIM_CAPTURE flag to split the source text into tags and non-tags. Then iterate over the tags while keeping a stack of open blocks: when you see an opening tag, add it to an array; when you see a closing tag, remove elements from the end of the array until the closing tag matches an opening tag.
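A minimal sketch of that split-and-stack idea, in case it helps. The function name bbcode_check_nesting() and the tag pattern are assumptions made for illustration; the flag's full name in PHP is PREG_SPLIT_DELIM_CAPTURE. The sketch only validates nesting and reports problems, it does not render anything.

```php
<?php
// Hypothetical helper: returns a list of nesting errors; an empty array means balanced.
function bbcode_check_nesting(string $text): array
{
    // Capturing the delimiter keeps the tags in the result alongside the plain text.
    $tokens = preg_split(
        '/(\[\/?[a-z]+[^\]]*\])/i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $stack  = [];
    $errors = [];

    foreach ($tokens as $token) {
        if (!preg_match('/^\[(\/?)([a-z]+)[^\]]*\]$/i', $token, $m)) {
            continue; // plain text between tags
        }
        $name = strtolower($m[2]);

        if ($m[1] === '') {
            $stack[] = $name; // opening tag: push
            continue;
        }

        // Closing tag with no matching opener anywhere on the stack.
        if (!in_array($name, $stack, true)) {
            $errors[] = "Unexpected closing tag [/$name]";
            continue;
        }
        // Pop until the matching opener is found, reporting what was left open.
        while (($open = array_pop($stack)) !== $name) {
            $errors[] = "Tag [$open] was not closed before [/$name]";
        }
    }

    foreach ($stack as $open) {
        $errors[] = "Tag [$open] was never closed";
    }
    return $errors;
}

print_r(bbcode_check_nesting('[block][block][/block][/block]'));
print_r(bbcode_check_nesting('[b][i]text[/b]'));
```

The same loop could build a tree of nodes instead of an error list, which would then be a natural fit for the visitor pattern mentioned in the question.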
I am aware that the general opinion is not to use RegEx for parsing HTML; however, I do not see how it would be harmful to use RegEx for what I want to achieve (similar functions built on RegEx exist in other scripting languages, such as _StringBetween( ) in AutoIt3).
I am also aware that _StringBetween( ) was not specifically written for HTML but I have been using it without any problem on HTML content for the past 8 years along with other folks.
For my HTML Extraction API I would like to present the following piece of HTML:
<div class="video" id="video-91519"><!-- The value of the identifier is dynamic-->
<a href="about:blank"><img src="silly.jpg"><!-- So is the href and src in a, img -->
</div>
The reason for the API I am trying to write is to make extraction of the video_url and thumbnail extremely easy, and therefore an HTML parser seems out of reach. I would like to be able to extract them using something along the lines of:
<div class="video" id="video-{{unknown}}">{{unknown}}<a href="{{video_url}}"><img src="{{thumbnail}}">{{unknown}}</div>
Of course, for the previous piece of HTML you could do it much more easily with something such as:
<a href="{{video_url}}"><img src="{{thumbnail}}">
but I was trying to present a perfect example to avoid confusion.
How does RegEx come into play? Well, I was going to replace {{video_url}}, {{thumbnail}} and {{unknown}} with (.*?), (.*?) and .* using the /s modifier, while making sure that {{video_url}} and {{thumbnail}} do not occur more than once in the provided input (the template, not the HTML).
So, is there any reason for me not to use RegEx, or should I still go for an HTML parser? Proofs of concept for either an acceptable RegEx or an HTML parser are welcome. I cannot personally see how to make this happen using an HTML parser.
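To make the idea concrete, here is a rough sketch of such a template compiler. template_to_regex() is a hypothetical name, the placeholders become named groups so the values can be read by name, and {{unknown}} is compiled to a lazy .*? rather than a greedy .* (an assumption, chosen so one match does not swallow several video blocks).

```php
<?php
// Hypothetical compiler for the {{placeholder}} templates described above.
function template_to_regex(string $template): string
{
    // Split into literal text and {{name}} placeholders, keeping the placeholders.
    $parts = preg_split(
        '/(\{\{\w+\}\})/',
        $template,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $regex = '';
    $seen  = [];
    foreach ($parts as $part) {
        if (!preg_match('/^\{\{(\w+)\}\}$/', $part, $m)) {
            $regex .= preg_quote($part, '~'); // literal HTML: escape regex metacharacters
        } elseif ($m[1] === 'unknown') {
            $regex .= '.*?';                  // anything, not captured
        } elseif (isset($seen[$m[1]])) {
            throw new InvalidArgumentException('Placeholder {{' . $m[1] . '}} used twice');
        } else {
            $seen[$m[1]] = true;
            $regex .= '(?P<' . $m[1] . '>.*?)'; // named capture group
        }
    }
    return '~' . $regex . '~s'; // s: let . match newlines
}

$template = '<div class="video" id="video-{{unknown}}">{{unknown}}<a href="{{video_url}}">'
          . '<img src="{{thumbnail}}">{{unknown}}</div>';
$html = "<div class=\"video\" id=\"video-91519\"><!-- dynamic -->\n"
      . "<a href=\"about:blank\"><img src=\"silly.jpg\"><!-- dynamic -->\n</div>";

if (preg_match(template_to_regex($template), $html, $m)) {
    echo $m['video_url'], ' ', $m['thumbnail'], "\n";
}
```

Any change in attribute order, quoting or whitespace in the real page will make the compiled pattern fail, which is the fragility the question is asking about.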
I think the way you have framed the problem pre-supposes the solution: if you want to be able to specify a pattern to match, then you have to use a pattern-matching language, such as regex. But if you frame the question as allowing searches for content in the document, then other options might be available, such as a path-based input that compiled to XPath expressions, or CSS selectors as used very successfully by the likes of jQuery.
What you are building here is not really an HTML extraction API as such, but a regex processing API - you are inventing a simplified pattern-matching language which can be compiled to a regex, and that regex applied to any string.
That in itself is not a bad thing, but if the users of that pattern-matching API try to use it to parse a more complex or unpredictable document, they will have all the same problems that everyone has when they try to match HTML using regexes, plus additional limitations imposed by your pre-processor. Those limitations are an inevitable consequence of simplifying the language: you are trading some of the power of the regex engine in order to make your patterns more "user friendly".
To return to the idea of reframing the question, here is an example of a simplified matching API that could compile to CSS expressions (e.g. used with SimpleHTMLDOM):
Find: div (class:video)
Must-Contain: a, img
Match: id Against video-{{video_id}}
Child: a
Extract: href Into video_url
Child: img
Extract: src Into thumbnail
Note that this language is a lot more abstracted away from the HTML; this has advantages and disadvantages. On the one hand, the simple matching pattern in your question is easy to create based on a single example. On the other hand, it is much more fragile to variations in the HTML, either due to changes in the site, or in-page variations such as adding an extra CSS class of "featured-video" to a small number of videos. The selector-based example requires the user to learn more specifics of the API, but if they don't know HTML and pattern-matching to begin with, a verbose syntax may be more helpful to them than one involving lots of cryptic punctuation.
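As a rough illustration of the selector-based approach, the example above could compile to something like the following. It uses PHP's built-in DOMDocument and DOMXPath rather than SimpleHTMLDOM, and the extra "featured-video" class, the numeric form of the id and the closing </a> in the sample HTML are assumptions for the demo.

```php
<?php
$html = '<div class="video featured-video" id="video-91519"><!-- dynamic id -->
<a href="about:blank"><img src="silly.jpg"></a>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$videos = [];

// "Find: div (class:video)": match the class as a whole word, whatever else is listed.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " video ")]';
foreach ($xpath->query($query) as $div) {
    $link  = $xpath->query('.//a[@href]', $div)->item(0);
    $image = $xpath->query('.//img[@src]', $div)->item(0);
    if ($link === null || $image === null) {
        continue; // "Must-Contain: a, img"
    }
    // "Match: id Against video-{{video_id}}" (assuming the id is numeric)
    if (!preg_match('/^video-(\d+)$/', $div->getAttribute('id'), $m)) {
        continue;
    }
    $videos[] = [
        'video_id'  => $m[1],
        'video_url' => $link->getAttribute('href'),
        'thumbnail' => $image->getAttribute('src'),
    ];
}

print_r($videos);
```

Note that the extra CSS class does not break the match here, whereas it would break a literal pattern built from a single example.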
I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.
I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex <tag\b[^>]*>(.*?)</tag> will cause trouble with <tag>Some text <tag>that is nested</tag> in tags</tag>. It will match <tag>Some text <tag>that is nested</tag>.
I'm looking for a solution to this, ideally an efficient one. I've seen solutions that involve matching start and end tags separately and keeping track of their indexes in the content to work out which tags go together, but that seems wildly inefficient to me (if it's the only possible way then c'est la vie).
The solution must be PHP only, as this is the language I have to work with. I'm parsing HTML snippets (think body sections from a WordPress blog and you're not too far off). If there is a better-than-regex solution, I'm all ears!
UPDATE:
Just to make it clear, I'm aware regexes are a poor solution but I have to do it somehow which is why the title specifically mentions better solutions.
FURTHER UPDATE:
I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document or is going to add <head> etc... when I get the html back out, it's not an acceptable solution.
As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.
Relevant questions
Robust and Mature HTML Parser for PHP
How do you parse and process HTML/XML in PHP?
Why not just use DOMDocument::loadHTML? It uses libxml under the hood which is fast and robust.
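A small sketch of how that could work for snippets, as an assumption-laden example rather than a finished solution: inner_html_of_tags() is a made-up helper, character-encoding handling is left out, and the wrapper elements that loadHTML adds never reach the output because only the children of the matched elements are serialised.

```php
<?php
// Hypothetical helper: returns the inner HTML of every <$tag> element in a snippet,
// in document order (so an outer element comes before the ones nested inside it).
function inner_html_of_tags(string $snippet, string $tag): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // snippets are not full documents; ignore warnings
    $doc->loadHTML($snippet);
    libxml_clear_errors();

    $results = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $inner = '';
        // Serialise each child node separately so the enclosing tag itself is left out.
        foreach ($node->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $results[] = $inner;
    }
    return $results;
}

print_r(inner_html_of_tags('<div>Some text <div>that is nested</div> in tags</div>', 'div'));
```

Because the parser builds a tree, the nested case from the question is handled without any index bookkeeping.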
I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text matching engine, not a fully blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder: < and > need no escaping, but the / in a closing tag does if / is also your delimiter, and tags may carry attributes).
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
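For the second step, one way to let PHP do the counting is preg_replace_callback() with a counter. This is just a sketch; replace_nth() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: replaces only the $n-th match of $pattern in $subject.
function replace_nth(string $pattern, string $replacement, string $subject, int $n): string
{
    $count = 0;
    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $replacement, $n): string {
            $count++;
            // Every match except the n-th is put back unchanged.
            return $count === $n ? $replacement : $m[0];
        },
        $subject
    );
}

echo replace_nth('/\band\b/', 'xxx', 'a and b and c and d', 2), "\n";
// prints "a and b xxx c and d"
```

Once this works on plain words, the same helper can be handed a tag pattern, with the usual caveat that a regex cannot keep track of nested tags.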
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens God will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or phpQuery and QueryPath.
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
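For what it's worth, the XPath route for the "5th div" case might look roughly like this. It is a sketch using PHP's built-in DOMDocument and DOMXPath; replace_nth_element() is a made-up name, and $tag is assumed to be a trusted, plain element name because it is pasted into the query.

```php
<?php
// Hypothetical helper: replaces the contents of the n-th <$tag> element (document order)
// with plain text and returns the whole page as HTML.
function replace_nth_element(string $html, string $tag, int $n, string $text): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // (//div)[5] selects the fifth div in the whole document, nested ones included.
    $node = $xpath->query('(//' . $tag . ')[' . $n . ']')->item(0);
    if ($node !== null) {
        // Drop everything inside the element, then insert the replacement text.
        while ($node->firstChild !== null) {
            $node->removeChild($node->firstChild);
        }
        $node->appendChild($doc->createTextNode($text));
    }
    return $doc->saveHTML();
}

echo replace_nth_element('<div>a</div><div>b<div>c</div></div><div>d</div>', 'div', 2, 'xxxxxxx');
```

Unlike a regex, this removes the nested div together with its parent, because the parser knows where each element really ends.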
OK, a few ground rules.
Don't post a question like that; preformatting the whole question will only keep people away.
Regular expressions are awesome!
If you want to consider other options, look at how to read HTML as an XML document and parse it using XPath.
@tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway.
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
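You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER);
Suppose you only need the 5th one. $matches[$i][0] is the whole string of the $i-th match, counting from zero, so the 5th match is at index 4:
if (count($matches) >= 5)
{
    // Index 4 is the fifth match because the array is zero-based.
    $myMatch = $matches[4][0];     // the whole <div>...</div>
    $matchedText = $matches[4][1]; // only the text between the tags
}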
Good luck in your efforts...
I've managed to successfully integrate BBCode, but I was wondering: if I wanted to dynamically list all the allowed/accepted BBCode, how would I be able to do that? (It can be tedious to write the list out manually, and if the BBCode ever changed I'd have to update the text.)
I currently have a BBCode() function which contains two arrays, one with the regexes and the other with the replacements (HTML), and I return a preg_replace() of the regex array with the replacement array.
Cheers and looking forward to your inputs!
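A minimal sketch of one way to get such a list, assuming the two arrays are merged into a single table that also carries a usage sample per tag. The names bbcode_tags(), bbcode() and bbcode_help(), and the three tags shown, are hypothetical.

```php
<?php
// Hypothetical tag table: pattern, HTML replacement and a usage sample kept together.
function bbcode_tags(): array
{
    return [
        'b' => [
            'pattern' => '~\[b\](.*?)\[/b\]~is',
            'html'    => '<strong>$1</strong>',
            'usage'   => '[b]bold[/b]',
        ],
        'i' => [
            'pattern' => '~\[i\](.*?)\[/i\]~is',
            'html'    => '<em>$1</em>',
            'usage'   => '[i]italic[/i]',
        ],
        'url' => [
            'pattern' => '~\[url=(https?://[^\]\s]+)\](.*?)\[/url\]~is',
            'html'    => '<a href="$1">$2</a>',
            'usage'   => '[url=http://example.com]link[/url]',
        ],
    ];
}

// Same behaviour as before: one preg_replace() over parallel arrays.
function bbcode(string $text): string
{
    $tags = bbcode_tags();
    // Escape first so user-supplied HTML cannot slip through the replacements.
    $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return preg_replace(array_column($tags, 'pattern'), array_column($tags, 'html'), $text);
}

// The list of accepted BBCode comes from the same table, so it cannot drift out of date.
function bbcode_help(): array
{
    return array_column(bbcode_tags(), 'usage');
}

echo bbcode('[b]hi[/b]'), "\n";
print_r(bbcode_help());
```

Adding a tag then means adding one table entry, and the help text updates itself.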
Consider using a different markup language like Textile or Markdown. Simply saying that you support Markdown or Textile is decent enough; they're so widely used that users could easily look up the markup for them online.
Textile's syntax hasn't been updated since 2006, so it will likely remain very solid for years to come. Markdown's syntax hasn't been updated since 2004.
Both provide excellent PHP libraries:
http://michelf.com/projects/php-markdown/
http://textile.thresholdstate.com/