I am stuck working on a function that translates HTML to bbcode.
I have written my own [spoiler] bbcode tag which translates properly into the HTML equivalent.
But when I try to turn it back into bbcode it doesn't seem to match seemingly identical strings...
After slowly rebuilding it piece by piece to see where the problem is, it turns out that it only fails when I add onclick="showSpoiler(this)"
to
#<div><input type="button" onclick="showSpoiler(this)"/><div>(.*)</div></div>#ig'
I narrowed it down further to the ( parentheses. I have tried escaping them like this: \(
The HTML code that is generated from the [spoiler] tag is:
`$1
and the pattern it is matched against is this:
'#<div><input type="button" onclick="showSpoiler(this)"/><div>(.*)</div></div>#ig'
Here are the conversion functions:
<?php
// Converts BBCode to HTML
function bbcode_to_html($text)
{
$text = nl2br(htmlentities($text, ENT_QUOTES, 'UTF-8'));
$in = array(
'#\[b\](.*)\[/b\]#Usi',
'#\[i\](.*)\[/i\]#Usi',
'#\[u\](.*)\[/u\]#Usi',
'#\[s\](.*)\[/s\]#Usi',
'#\[img\](.*)\[/img\]#Usi',
'#\[url\]((ht|f)tps?\:\/\/(.*))\[/url\]#Usi',
'#\[url=((ht|f)tps?\:\/\/(.*))\](.*)\[/url\]#Usi',
'#\[left\](.*)\[/left\]#Usi',
'#\[center\](.*)\[/center\]#Usi',
'#\[right\](.*)\[/right\]#Usi',
'#\[spoiler\](.*)\[/spoiler\]#Usi',
'#\[fuck\](.*)\[/fuck\]#Usi'
);
$out = array(
'<strong>$1</strong>',
'<em>$1</em>',
'<span style="text-decoration:underline;">$1</span>',
'<span style="text-decoration:line-through;">$1</span>',
'<img src="$1" alt="Image" />',
'$1',
'$4',
'<div style="text-align:left;">$1</div>',
'<div style="text-align:center;">$1</div>',
'<div style="text-align:right;">$1</div>',
'<div><input type="button" onclick="showSpoiler(this)" value="Show/Hide" /><div class="inner" style="display:none;">$1</div></div>',
'<div><input type="button" onclick="showSpoiler(this)"/><div>$1</div></div>'
);
$count = count($in)-1;
for($i=0;$i<=$count;$i++)
{
$text = preg_replace($in[$i],$out[$i],$text);
}
return $text;
}
// Converts HTML back to BBCode
function html_to_bbcode($text)
{
$text = str_replace('<br />','',$text);
$in = array(
'#<strong>(.*)</strong>#Usi',
'#<em>(.*)</em>#Usi',
'#<span style="text-decoration:underline;">(.*)</span>#Usi',
'#<span style="text-decoration:line-through;">(.*)</span>#Usi',
'#<img src="(.*)" alt="Image" />#Usi',
'#(.*)#Usi',
'#<div style="text-align:left;">(.*)</div>#Usi',
'#<div style="text-align:center;">(.*)</div>#Usi',
'#<div style="text-align:right;">(.*)</div>#Usi',
'#<div><input type="button" onclick="showSpoiler(this)" value="Show/Hide" /><div class="inner" style="display:none;">(.*)</div></div>#Ui',
'#<div><input type="button" onclick="showSpoiler(this)"/><div>(.*)</div></div>#ig'
);
$out = array(
'[b]$1[/b]',
'[i]$1[/i]',
'[u]$1[/u]',
'[s]$1[/s]',
'[img]$1[/img]',
'[url=$1]$2[/url]',
'[left]$1[/left]',
'[center]$1[/center]',
'[right]$1[/right]',
'[spoiler]$1[/spoiler]',
'[fuck]$1[/fuck]'
);
$count = count($in)-1;
for($i=0;$i<=$count;$i++)
{
$text = preg_replace($in[$i],$out[$i],$text);
}
return $text;
}
?>
In your regex you need to escape the parentheses like so:
showSpoiler\(this\)
Take care with regular expressions: they are a language of their own and hard to debug unless you add code that does the debugging (for example, printing what was matched).
BTW, you can run multiple search-and-replace operations by passing the arrays directly into preg_replace; you don't need to iterate over them.
So it is worth reading the manual page for preg_replace again and looking at how you can debug your patterns more easily, for example by testing them before putting them into the function.
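A minimal sketch of the escaping fix, assuming the HTML is exactly the spoiler markup quoted in the question (the variable names and the sample text are made up for illustration). preg_quote() escapes the literal parts, including the parentheses, and preg_match() is used first to see what was captured. Note also that PHP's PCRE functions have no g modifier; preg_replace already replaces every match.

```php
<?php
// Literal HTML around the captured text; preg_quote() escapes ( ) and the # delimiter.
$open  = preg_quote('<div><input type="button" onclick="showSpoiler(this)"/><div>', '#');
$close = preg_quote('</div></div>', '#');

// U makes (.*) lazy, s lets . span newlines, i ignores case.
$pattern = '#' . $open . '(.*)' . $close . '#Usi';

// Hypothetical input, shaped like the generated spoiler HTML.
$html = '<div><input type="button" onclick="showSpoiler(this)"/><div>secret</div></div>';

// Debugging aid: inspect what was captured before doing the replacement.
if (preg_match($pattern, $html, $m)) {
    echo 'captured: ', $m[1], PHP_EOL;
}

$bbcode = preg_replace($pattern, '[spoiler]$1[/spoiler]', $html);
echo $bbcode, PHP_EOL;
```

Building the pattern from quoted pieces means any later change to the generated HTML only needs to be copied in as a plain string.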
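As a rough sketch of passing the arrays directly, using two of the patterns from the question and a made-up input string: preg_replace applies each pattern in order and returns null if something goes wrong, which is worth checking while debugging.

```php
<?php
$in = array(
    '#<strong>(.*)</strong>#Usi',
    '#<em>(.*)</em>#Usi',
);
$out = array(
    '[b]$1[/b]',
    '[i]$1[/i]',
);

// Hypothetical input for illustration.
$text = '<strong>bold</strong> and <em>italic</em>';

// No loop needed: pattern $in[0] is paired with $out[0], $in[1] with $out[1], and so on.
$result = preg_replace($in, $out, $text);

if ($result === null) {
    // preg_replace() returns null on failure; preg_last_error() gives the error code.
    echo 'regex error code: ', preg_last_error(), PHP_EOL;
} else {
    echo $result, PHP_EOL;
}
```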
$content = $this->comment->getContent(true);
$bbcodes = array (
'#\[cytat=(.*?)\](.*?)\[/cytat\]#' => '<div class="cytata">\\1 napisał/a </div> <div class="cytatb">\\2</div>',
'#\[cytat\](.*?)\[/cytat\]#' => '<div class="cytata">cytat</div><div class="cytatb">\\1</div>',
);
$content = preg_replace(array_keys($bbcodes), array_values($bbcodes), $content);
That preg_replace call is not replacing every tag as it should.
For example, if there is only one tag, such as [cytat]some text[/cytat] (cytat means quote in Polish), then everything is fine and the output is
<div class="cytata">author napisał/a </div> <div class="cytatb">some text</div>
But if there is more than one quote, preg_replace only replaces one tag, for example:
<div class="cytata">o0skar napisał/a </div> <div class="cytatb">[cytat=o0skar]test nr2</div>[/cytat]
That is the output for a doubly nested quote. Any ideas? Is something wrong?
Maybe I could put preg_replace in a while loop, but I don't know whether preg_replace returns anything I could test.
For the sake of regular expression awesomeness, let's look at this one. I had to change the pattern by one character: I removed one of the lazy ? quantifiers and turned this into a preg_replace_callback call.
function pregcallbackfunc($matches){
$pattern = '#\[cytat=(.*?)\](.*)\[/cytat\]#';
if(preg_match($pattern, $matches[2])){
$matches[2] = preg_replace_callback($pattern,'pregcallbackfunc', $matches[2]);
}
if($matches[2]){
return '<div class="cytata">'.$matches[1].' napisał/a </div> <div class="cytatb">'.$matches[2].'</div>';
}
return '<div class="cytata">cytat</div><div class="cytatb">'.$matches[1].'</div>';
}
$content = '[cytat=o0skar][cytat=o0skar]test nr2[/cytat][/cytat]';
$content = preg_replace_callback('#\[cytat=(.*?)\](.*)\[/cytat\]#', 'pregcallbackfunc', $content);
Making this recursive will guarantee any level of nested quotes.
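For comparison, the while-loop idea from the question can also work. This is only a sketch under some assumptions: the pattern is changed so that it matches innermost quotes only, the fifth argument of preg_replace is used to count replacements, and the inner author name anna is made up for the example.

```php
<?php
// Matches only an innermost quote: the body may not contain another [cytat or [/cytat tag.
$pattern = '#\[cytat=([^\]]*)\]((?:(?!\[/?cytat).)*)\[/cytat\]#s';
$replacement = '<div class="cytata">$1 napisał/a </div> <div class="cytatb">$2</div>';

// Hypothetical nested input.
$content = '[cytat=o0skar][cytat=anna]test nr2[/cytat][/cytat]';

// Each pass converts the innermost quotes; stop once a pass replaces nothing.
do {
    $content = preg_replace($pattern, $replacement, $content, -1, $count);
} while ($count > 0);

echo $content, PHP_EOL;
```

The loop works from the inside out, so each outer quote is converted on a later pass once its body no longer contains any [cytat] tags.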
I'm looking for a regex that can replace all links like Link with a warning. I've been experimenting but with no success so far. I've always been bad with regex; can someone point me in the right direction? I have this so far:
Edit: To the people saying not to use regex: the HTML will be the output of a Markdown parser with all HTML tags in the Markdown stripped. I therefore know that all links in the output will be formatted as stated above, so regex should be a good tool in this particular situation. I am not allowing users to enter raw HTML. SO does something very similar: try creating a javascript link and it will be removed.
<?php
//Javascript link filter test
if(isset($_POST['jsfilter'])){
$html = " JS Link ";
$pattern = "/ href\\s*?=\\s*?[\"']\\s*?(javascript)\\s*?(:).*?([\"']) /is";
$replacement = "\"javascript: alert('Javascript links have been blocked');\"";
$html = preg_replace($pattern, $replacement, $html);
echo $html;
}
?>
<form method="post">
<input type="text" name="jsfilter" />
<button type="submit">Submit</button>
</form>
The right regex should be:
$pattern = '/href="javascript:[^"]+"/';
$replacement = 'href="javascript:alert(\'Javascript links have been blocked\')"';
Use strip_tags() and htmlspecialchars() to display user-generated content. If you want to let users use specific tags, look into BBCode.
You should handle both single and double quotes, whitespace, and so on:
$html = preg_replace( '/href\s*=\s*"javascript:[^"]+"/i' , 'href="#"' , $html );
$html = preg_replace( '/href\s*=\s*\'javascript:[^\']+\'/i' , 'href=\'#\'' , $html );
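The two calls above can probably be folded into one pattern with a backreference for the opening quote. This is only a sketch: block_js_links is a made-up helper name, and it does not cover unquoted attribute values or entity-encoded schemes.

```php
<?php
// Hypothetical helper: replaces any quoted javascript: href with href="#".
function block_js_links(string $html): string
{
    // (["\']) captures the opening quote, \1 requires the same quote to close it.
    $pattern = '/href\s*=\s*(["\'])\s*javascript:.*?\1/is';
    return preg_replace($pattern, 'href="#"', $html);
}

echo block_js_links('<a href="javascript:alert(1)">JS Link</a>'), PHP_EOL;
echo block_js_links("<a href='JavaScript:void(0)'>JS Link</a>"), PHP_EOL;
echo block_js_links('<a href="https://example.com/">Safe Link</a>'), PHP_EOL;
```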
Try this code. I think it will help.
<?php
//Javascript link filter test
if(isset($_POST['jsfilter'])){
$html = " JS Link ";
$pattern = '/a href="javascript:(.*?)"/i';
$replacement = 'a href="javascript: alert(\'Javascript links have been blocked\');"';
$html = preg_replace($pattern, $replacement, $html);
echo $html;
}
?>
Here is the line of code I have which works great:
$content = htmlspecialchars($_POST['content'], ENT_QUOTES);
But what I would like to do is allow only certain types of HTML code to pass through without getting converted. Here is the list of HTML code that I would like to have pass:
<pre> </pre>
<b> </b>
<em> </em>
<u> </u>
<ul> </ul>
<li> </li>
<ol> </ol>
And as I go, I would like to also be able to add in more HTML later as I think of it. Could someone help me modify the code above so that the specified list of HTML codes above can pass through without getting converted?
I suppose you could do it after the fact:
// $str is the result of htmlspecialchars()
$str = preg_replace('#&lt;(/?(?:pre|b|em|u|ul|li|ol))&gt;#', '<\1>', $str);
It converts the encoded forms of <xx> and </xx> (that is, &lt;xx&gt; and &lt;/xx&gt;) back into real tags, where xx is in a controlled set of allowed tags.
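A self-contained sketch of that idea, so more tags can be added later by extending an array. The function name escape_except and the sample input are made up; it assumes the input is first passed through htmlspecialchars(), so every tag arrives as &lt;...&gt;.

```php
<?php
// Hypothetical helper: escape everything, then restore bare tags from an allow list.
function escape_except(string $input, array $allowedTags): string
{
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Build pre|b|em|... from the allow list, escaping any regex metacharacters.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $allowedTags));

    // Only attribute-free opening and closing tags are restored.
    return preg_replace('#&lt;(/?(?:' . $names . '))&gt;#', '<$1>', $escaped);
}

$allowed = array('pre', 'b', 'em', 'u', 'ul', 'li', 'ol');
echo escape_except('<b>bold</b> <script>alert(1)</script>', $allowed), PHP_EOL;
// prints: <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

Because only bare tags are restored, something like a b tag carrying an onclick attribute stays escaped.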
Or you can go with old style:
$content = htmlspecialchars($_POST['content'], ENT_QUOTES);
$turned = array( '&lt;pre&gt;', '&lt;/pre&gt;', '&lt;b&gt;', '&lt;/b&gt;', '&lt;em&gt;', '&lt;/em&gt;', '&lt;u&gt;', '&lt;/u&gt;', '&lt;ul&gt;', '&lt;/ul&gt;', '&lt;li&gt;', '&lt;/li&gt;', '&lt;ol&gt;', '&lt;/ol&gt;' );
$turn_back = array( '<pre>', '</pre>', '<b>', '</b>', '<em>', '</em>', '<u>', '</u>', '<ul>', '</ul>', '<li>', '</li>', '<ol>', '</ol>' );
$content = str_replace( $turned, $turn_back, $content );
I improved the way Jack attacks this issue. I added support for <br>, <br/> and anchor tags. The code first restores href="..." so that only this attribute can be used.
$str = preg_replace(
array('#href=&quot;(.*)&quot;#', '#&lt;(/?(?:pre|a|b|br|em|u|ul|li|ol)(\shref=".*")?/?)&gt;#' ),
array( 'href="\1"', '<\1>' ),
$str
);
I made this function to sanitize all HTML special characters except for the HTML tags specified.
It first uses htmlspecialchars() to make the string safe, then it reverts the tags I want to be untouched.
The function supports attribute filtering as an option; be careful about disabling it if you care about possible XSS attacks, because attributes are then passed through.
I know regex is not efficient but for moderate string lengths it should be fine.
You can check the regex I used here
https://regex101.com/r/U6GQse/8
function sanitizeHtml($string, $safeHtmlTags = array('b','i','u','br'), $filterAttributes = true)
{
$string = htmlspecialchars($string);
if ($filterAttributes) {
$replace = "<$1$2$4>";
} else {
$replace = "<$1$2$3$4>";
}
$string = preg_replace("/&lt;\s*(\/?\s*)(".implode("|", $safeHtmlTags).")(\s?|\s+[\s\S]*?)(\/)?\s*&gt;/", $replace, $string);
return $string;
}
// Example usage to answer the OP question
$str = "MY HTML CONTENT";
echo sanitizeHtml($str, array('pre','b','em','u','ul','li','ol'));
I liked Elwin's solution, but you probably want to:
Prevent javascript: URLs in the href, or more likely, allow only http(s).
Make the regex quantifiers non-greedy in case there are multiple <a href>s in the content.
Here is the updated version:
$str = preg_replace(
array('#href=&quot;(https?://.*?)&quot;#', '#&lt;(/?(?:pre|a|b|br|em|u|ul|li|ol)(\shref=".*?")?/?)&gt;#' ),
array( 'href="\1"', '<\1>' ),
$str
);
You could use strip_tags
$exceptionString = '<pre>,</pre>,<b>,</b>,<em>,</em>,<u>,</u>,<ul>,</ul>,<li>,</li>,<ol>,</ol>';
$content = strip_tags($_POST['content'],$exceptionString );
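One caveat worth illustrating, as a small sketch with a made-up input string: strip_tags() removes the tags that are not allowed, but it keeps the text between them and it keeps any attributes on the allowed tags, so it is not a drop-in replacement for the htmlspecialchars() approach.

```php
<?php
// Hypothetical user input: an allowed tag carrying an attribute, plus a disallowed tag.
$input = '<b onclick="alert(1)">bold</b> <script>alert(2)</script>';

// Only <b> is allowed here; every other tag is removed, but not its contents.
$output = strip_tags($input, '<b>');

echo $output, PHP_EOL;
// prints: <b onclick="alert(1)">bold</b> alert(2)
```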
I made a custom function which acts like BBCode. I'm using preg_replace and regex. The only problem is that if I use more than one BBCode tag, only one of them works.
[align=center][img]myimagelink[/img][/align]
If I enter this line, the image appears, BUT so does the literal [align=center]image[/align] text. How can I avoid this problem?
$patterns[2] = '#\[align=(.*)\](.*)\[\/align\]#si';
$patterns[9] = '#\[img\](.*\.jpg)\[\/img\]#si';
$replacements[2] = '<table align=\1><tr><td align=\1>\2</td></tr></table>';//ALIGN
$replacements[9] = '<img src=\"$1\"/>';//image
Changing the .* expressions to non-greedy (.*?) will work for you.
Example:
$in = '[align=center][img]myimagelink[/img][/align]';
$patterns = array(
'~\[align=(left|right|center)\](.*?)\[/align\]~' => '<div style="text-align: $1">$2</div>',
'~\[img](.*?)\[/img\]~' => '<img src="$1" />',
);
$rep = preg_replace(array_keys($patterns), $patterns, $in);
echo htmlspecialchars($rep);
Rather than reinventing the wheel I recommend using an existing javascript library.
I believe StackOverflow uses Prettify to format user input.
As @nickb stated, your patterns are greedy: (.*) grabs everything. Try changing it to (.*?).
Treat all tags as single tokens rather than pairs (the align value still has to be matched lazily):
$patterns[2] = '#\[align=(.*?)\]#si';
$patterns[3] = '#\[\/align\]#si';
$patterns[9] = '#\[img\](.*\.jpg)\[\/img\]#si';
$replacements[2] = '<div align="$1">';//ALIGN
$replacements[3] = '</div>';//ALIGN
$replacements[9] = '<img src="$1"/>';//image
I am using a little BBCode function for a forum I have, and it works well. For example, if I put
[b]Text[/b]
it will print Text in bold.
My issue is that if I have this code:
[b]
Text[/b]
it does not work and is just printed exactly as written.
Here is an example of the function I am using:
function BBCode ($string) {
$search = array(
'#\[b\](.*?)\[/b\]#',
);
$replace = array(
'<b>\\1</b>',
);
return preg_replace($search , $replace, $string);
}
Then when echoing it:
.nl2br(stripslashes(BBCode($arr_thread_row[main_content]))).
So my question is: what is needed so that the BBCode works with everything inside it, not necessarily on the same line?
For example:
[b]
Text
[/b]
Would simply be
Text
Thank you for any help!
Alex
You need the s (dotall) modifier so that . also matches newlines, which makes your pattern something like #\[b\](.*?)\[/b\]#s
(note the trailing s; the m modifier only changes how ^ and $ behave)
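A short sketch of the difference, using a cut-down version of the function from the question (bbcode_bold is a made-up name). The s modifier is what lets the dot match across newlines.

```php
<?php
// Hypothetical single-tag version of the BBCode function from the question.
function bbcode_bold(string $string): string
{
    // s = dotall: . also matches newline characters.
    return preg_replace('#\[b\](.*?)\[/b\]#s', '<b>$1</b>', $string);
}

$input = "[b]\nText\n[/b]";

// Without s, the pattern finds no match and the input comes back unchanged.
$without = preg_replace('#\[b\](.*?)\[/b\]#', '<b>$1</b>', $input);
var_dump($without === $input);

// With s, the tag pair is converted even though it spans three lines.
echo nl2br(bbcode_bold($input)), PHP_EOL;
```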
There is actually a pecl extension that parses BBcode, which would be faster and more secure than writing it from scratch yourself.
I use this... It should work.
$bb1 = array(
"/\[url\](.*?)\[\/url\]/is",
"/\[img\](.*?)\[\/img\]/is",
"/\[img\=(.*?)\](.*?)\[\/img\]/is",
"/\[url\=(.*?)\](.*?)\[\/url\]/is",
"/\[red\](.*?)\[\/red\]/is",
"/\[b\](.*?)\[\/b\]/is",
"/\[h(.*?)\](.*?)\[\/h(.*?)\]/is",
"/\[php\](.*?)\[\/php\]/is"
);
$bb2 = array(
'\\1',
'<img alt="" src="\\1"/>',
'<img alt="" class="\\1" src="\\2"/>',
'<a rel="nofollow" target="_blank" href="\\1">\\2</a>',
'<span style="color:#ff0000;">\\1</span>',
'<span style="font-weight:bold;">\\1</span>',
'<h\\1>\\2</h\\3>',
'<pre><code class="php">\\1</code></pre>'
);
$html = preg_replace($bb1, $bb2, $html);